# Proportions with mean()

One of the most common tasks I face is calculating the proportion of observations (e.g., rows in a data set) that meet a particular condition. For example, what proportion of the data is missing, or what proportion of people are over the age of 18?

There is a surprisingly easy way to solve this problem: combine boolean vectors with `mean()`.

## Step 1: creating a boolean vector

We start with a boolean vector, which is a vector that is `TRUE` wherever an observation meets our condition and `FALSE` wherever it does not. We create this boolean vector by passing our observations through a conditional statement (or a relevant function like `is.na()`). Let’s take a look at a few examples:

```r
x <- letters[1:10]
x == "b"  # return a boolean vector which is TRUE whenever x is "b"
#>  [1] FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE

x <- 1:10
x > 5  # TRUE whenever x is greater than 5
#>  [1] FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE

x > 5 & x %% 2 == 0  # TRUE when x > 5 AND divisible by 2
#>  [1] FALSE FALSE FALSE FALSE FALSE  TRUE FALSE  TRUE FALSE  TRUE

x <- c(1, 2, NA, 4)
is.na(x) # TRUE when x is a missing value
#> [1] FALSE FALSE  TRUE FALSE
```

If you’re unsure how the above works, take a look at this page on R Programming Operators.

With this under our belt, it seems simple enough to create a boolean vector that tells us when our observations meet some condition (`TRUE`) or not (`FALSE`).
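A few other operators are handy for building these conditions. The following is a small sketch of my own, not an exhaustive list: it uses `%in%` for set membership, `|` for element-wise OR, and `!` for negation. The variable names (`in_set`, `low_or_high`, `not_over_5`) are just illustrative.

```r
x <- 1:10

# %in% tests membership in a set of values
in_set <- x %in% c(2, 4, 6)
in_set
#>  [1] FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE FALSE FALSE FALSE

# | is element-wise OR: TRUE when x is below 3 or above 8
low_or_high <- x < 3 | x > 8
low_or_high
#>  [1]  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE

# ! negates a boolean vector: TRUE when x is NOT greater than 5
not_over_5 <- !(x > 5)
not_over_5
#>  [1]  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE
```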

## Step 2: calculating the proportion of TRUE

From this point, all we need to do is wrap our conditional statement inside `mean()`:

```r
x <- 1:10
mean(x > 5)  # proportion of values in x greater than 5
#> [1] 0.5
```

How and why does this work? If you take a look at the help page with `?mean`, you’ll read that the argument `x` can be a logical vector. But what does this mean? Well, when you use a boolean vector, `mean()` first converts it to a numeric vector. This means that every `TRUE` becomes `1`, and every `FALSE` becomes `0`:

```r
x <- 1:10
as.numeric(x > 5)
#>  [1] 0 0 0 0 0 1 1 1 1 1
```

It then computes the mean of these 1’s and 0’s. How is the mean calculated? It’s the sum of all the values divided by the number of values. The sum of a vector of 1’s and 0’s is the total number of 1’s, so dividing it by the length gives you the proportion of 1’s, that is, the proportion of `TRUE` values. As a side note, you can use `sum()` instead of `mean()` if you want the count rather than the proportion. Let’s break this right down:

```r
x <- 1:10
x
#>  [1]  1  2  3  4  5  6  7  8  9 10

test <- x > 5
test
#>  [1] FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE

as.numeric(test)
#>  [1] 0 0 0 0 0 1 1 1 1 1

sum(test)
#> [1] 5

length(test)
#> [1] 10

sum(test) / length(test)
#> [1] 0.5

mean(test)
#> [1] 0.5
```
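One thing to watch out for is missing data. A comparison involving `NA` is itself `NA`, and by default `mean()` and `sum()` pass that `NA` through to the result. Here is a minimal sketch, using a made-up vector, of how `na.rm = TRUE` changes things. Note that this gives the proportion among the non-missing values only, which may or may not be what you want.

```r
x <- c(3, 8, NA, 10)  # made-up values with one missing observation

# A comparison with NA is itself NA, so the test vector has a missing value
test <- x > 5
test
#> [1] FALSE  TRUE    NA  TRUE

# By default the NA propagates to the result
mean(test)
#> [1] NA

# na.rm = TRUE drops it: 2 TRUE out of 3 non-missing values
mean(test, na.rm = TRUE)
#> [1] 0.6666667

# sum() gives the count of TRUE values rather than the proportion
sum(test, na.rm = TRUE)
#> [1] 2
```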

## Some useful examples

At this point, we can apply this to all sorts of problems. Here are some examples using the mtcars data set:

```r
d <- mtcars
head(d)
#>                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
#> Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
#> Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
#> Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
#> Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
#> Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
#> Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

# Proportion of rows (cars) with cyl == 6 (6 cylinders)
mean(d$cyl == 6)
#> [1] 0.21875

# Proportion of rows (cars) with hp > 250 (horsepower over 250)
mean(d$hp > 250)
#> [1] 0.0625

# Proportion of cars with 8 cylinders that also get more than 15 miles/(US) gallon
mean(d$cyl == 8 & d$mpg > 15)
#> [1] 0.25
```
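The same trick extends to a couple of other common situations. Below is a rough sketch of two of them: proportions within groups, using base R's `tapply()`, and the proportion of missing data, using `is.na()`. Since `mtcars` has no missing values, I blank out a few `mpg` values purely for illustration. The name `prop_by_cyl` is my own.

```r
d <- mtcars

# Proportion of cars that get more than 20 mpg, within each cylinder group
prop_by_cyl <- tapply(d$mpg > 20, d$cyl, mean)
round(prop_by_cyl, 3)
#>     4     6     8
#> 1.000 0.429 0.000

# Hypothetical missing data: blank out the first four mpg values (illustration only)
d$mpg[1:4] <- NA

# Proportion of missing mpg values: 4 out of 32 rows
mean(is.na(d$mpg))
#> [1] 0.125
```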

## Sign off

Thanks for reading and I hope this was useful for you.

For updates of recent blog posts, follow @drsimonj on Twitter, or email me at drsimonjackson@gmail.com to get in touch.

If you’d like the code that produced this blog, check out my GitHub repository, blogR.