# Proportions with mean()

One of the most common tasks I want to do is calculate the proportion of observations (e.g., rows in a data set) that meet a particular condition. For example, what is the proportion of missing data, or people over the age of 18?

There is a surprisingly easy solution to this problem: combining boolean vectors with `mean()`.

## Step 1: creating a boolean vector

We start with a boolean vector, which is a vector that is `TRUE` whenever an observation meets our condition, or `FALSE` whenever it doesn’t. We create this boolean vector by submitting our observations to some sort of conditional statement (or a relevant function like `is.na()`). Let’s take a look at a few examples:

```
x <- letters[1:10]
x == "b" # return a boolean vector which is TRUE whenever x is "b"
#>  [1] FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
x <- 1:10
x > 5 # TRUE whenever x is greater than 5
#>  [1] FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE
x > 5 & x %% 2 == 0 # TRUE when x > 5 AND divisible by 2
#>  [1] FALSE FALSE FALSE FALSE FALSE  TRUE FALSE  TRUE FALSE  TRUE
x <- c(1, 2, NA, 4)
is.na(x) # TRUE when x is a missing value
#> [1] FALSE FALSE  TRUE FALSE
```

If you’re unsure how the above works, take a look at this page on R Programming Operators.

With this under our belt, it seems simple enough to create a boolean vector that tells us when our observations meet some condition (`TRUE`) or not (`FALSE`).
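Conditions can also be combined or negated with a few other operators. Here is a small sketch of that idea; the particular values tested are just ones I made up for illustration:

```r
x <- 1:10

x %in% c(2, 4, 11) # TRUE when x matches any of the listed values
#>  [1] FALSE  TRUE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE

x < 3 | x > 8 # TRUE when x is below 3 OR above 8
#>  [1]  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE

!(x > 5) # ! flips TRUE and FALSE, so this is TRUE when x is NOT greater than 5
#>  [1]  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE
```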

## Step 2: calculating the proportion of TRUE

From this point, all we need to do is wrap our conditional statement inside `mean()`:

```
x <- 1:10
mean(x > 5) # proportion of values in x greater than 5
#> [1] 0.5
```

How and why does this work? If you take a look at the help page with `?mean()`, you’ll read that the argument `x` can be a logical vector. But what does this mean? Well, when you use a boolean vector, `mean()` first converts it to a numeric vector. This means that every `TRUE` becomes `1`, and every `FALSE` becomes `0`:

```
x <- 1:10
as.numeric(x > 5)
#>  [1] 0 0 0 0 0 1 1 1 1 1
```

It then computes the mean of these 1’s and 0’s. At this point, you just need to think a little. How is the mean calculated? Well, it’s the sum of all the values divided by the number of values. The sum of a vector of 1’s and 0’s is the total number of 1’s, so dividing it by the length gives you the proportion of 1’s, which is the proportion of `TRUE`. As a side note, you might realise that you can use `sum()` instead of `mean()` if you want to calculate the frequency. Let’s break this right down:

```
x <- 1:10
x
#>  [1]  1  2  3  4  5  6  7  8  9 10
test <- x > 5
test
#>  [1] FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE
as.numeric(test)
#>  [1] 0 0 0 0 0 1 1 1 1 1
sum(test)
#> [1] 5
length(test)
#> [1] 10
sum(test) / length(test)
#> [1] 0.5
mean(test)
#> [1] 0.5
```
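One thing to watch for when the data contain missing values: a comparison involving `NA` returns `NA`, and `mean()` then returns `NA` unless you set `na.rm = TRUE`. Here is a minimal sketch, using an `age` vector I made up for illustration:

```r
# Hypothetical ages with one missing value (made up for this example)
age <- c(12, 25, 17, 40, NA, 18)

age > 18 # the comparison is NA for the missing age
#> [1] FALSE  TRUE FALSE  TRUE    NA FALSE

mean(age > 18) # so the proportion is NA too
#> [1] NA

mean(age > 18, na.rm = TRUE) # proportion among the non-missing ages
#> [1] 0.4

mean(is.na(age)) # proportion of ages that are missing
#> [1] 0.1666667
```

Note that `na.rm = TRUE` changes the denominator: the result is the proportion among the observed values, not among all rows, so it’s worth deciding which one you actually want.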

## Some useful examples

At this point, we can apply this to all sorts of problems. Here are some examples using the mtcars data set:

```
d <- mtcars
head(d)
#>                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
#> Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
#> Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
#> Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
#> Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
#> Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
#> Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
# Proportion of rows (cars) with cyl == 6 (6 cylinders)
mean(d$cyl == 6)
#> [1] 0.21875
# Proportion of rows (cars) with hp > 250 (horsepower over 250)
mean(d$hp > 250)
#> [1] 0.0625
# Proportion of cars with 8 cylinders that get more than 15 miles per (US) gallon
mean(d$cyl == 8 & d$mpg > 15)
#> [1] 0.25
```
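The same trick extends to groups and to whole data frames. The sketch below is one way to do it in base R, not the only one: `tapply()` takes the mean of a boolean vector within each group, and `colMeans(is.na())` gives the proportion of missing values per column. The `survey` data frame and the names `by_cyl` and `prop_missing` are made up for illustration.

```r
d <- mtcars

# Proportion of cars getting more than 20 mpg, within each cylinder group.
# tapply() applies mean() to the boolean vector separately for each value of cyl.
by_cyl <- tapply(d$mpg > 20, d$cyl, mean)
by_cyl
#>         4         6         8 
#> 1.0000000 0.4285714 0.0000000 

# Proportion of missing values in every column at once.
# is.na() on a data frame returns a logical matrix, and colMeans()
# takes the mean of each column of that matrix.
survey <- data.frame(age = c(25, NA, 31, 40), score = c(NA, NA, 7, 9))
prop_missing <- colMeans(is.na(survey))
prop_missing
#>   age score 
#>  0.25  0.50 
```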

## Sign off

Thanks for reading and I hope this was useful for you.

For updates of recent blog posts, follow @drsimonj on Twitter, or email me at drsimonjackson@gmail.com to get in touch.

If you’d like the code that produced this blog, check out my GitHub repository, blogR.