Proportions with mean()
One of the most common tasks I want to do is calculate the proportion of observations (e.g., rows in a data set) that meet a particular condition. For example, what is the proportion of missing data, or people over the age of 18?
There is a surprisingly easy way to handle this problem: combine boolean vectors with mean().
Step 1: creating a boolean vector
We start with a boolean vector: a vector that is TRUE whenever an observation meets our condition, and FALSE whenever it does not. We create this boolean vector by passing our observations through a conditional statement (or a relevant function like is.na()). Let’s take a look at a few examples:
x <- letters[1:10]
x == "b" # return a boolean vector which is TRUE whenever x is "b"
#> [1] FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
x <- 1:10
x > 5 # TRUE whenever x is greater than 5
#> [1] FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE
x > 5 & x %% 2 == 0 # TRUE when x > 5 AND divisible by 2
#> [1] FALSE FALSE FALSE FALSE FALSE TRUE FALSE TRUE FALSE TRUE
x <- c(1, 2, NA, 4)
is.na(x) # TRUE when x is a missing value
#> [1] FALSE FALSE TRUE FALSE
If you’re unsure how the above works, take a look at this page on R Programming Operators.
With this under our belt, it is simple enough to create a boolean vector that tells us whether each observation meets some condition (TRUE) or not (FALSE).
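A few other operators come in handy when building boolean vectors. The sketch below is only illustrative; the vectors x, y and z and the result names are made up for the example:

```r
# %in% is TRUE for every element that matches any value in the set
x <- c("a", "b", "c", "d")
in_set <- x %in% c("a", "d")
in_set
#> [1]  TRUE FALSE FALSE  TRUE

# | is TRUE when at least one of the two conditions holds
y <- 1:6
either <- y < 2 | y > 5
either
#> [1]  TRUE FALSE FALSE FALSE FALSE  TRUE

# ! flips TRUE and FALSE, here marking the values that are NOT missing
z <- c(1, NA, 3)
not_missing <- !is.na(z)
not_missing
#> [1]  TRUE FALSE  TRUE
```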
Step 2: calculating the proportion of TRUE
From this point, all we need to do is wrap our conditional statement inside mean():
x <- 1:10
mean(x > 5) # proportion of values in x greater than 5
#> [1] 0.5
How and why does this work? If you take a look at the help page with ?mean(), you’ll read that the argument x can be a logical vector. But what does this mean? Well, when you pass a boolean vector, mean() first converts it to a numeric vector, so every TRUE becomes 1 and every FALSE becomes 0:
x <- 1:10
as.numeric(x > 5)
#> [1] 0 0 0 0 0 1 1 1 1 1
It then computes the mean of these 1’s and 0’s. How is the mean calculated? It’s the sum of all the values divided by the number of values. The sum of a vector of 1’s and 0’s is the total number of 1’s, so dividing by the length gives you the proportion of 1’s, which is the proportion of TRUE values. As a side note, you can use sum() instead of mean() if you want the frequency (the count) rather than the proportion. Let’s break this right down:
x <- 1:10
x
#> [1] 1 2 3 4 5 6 7 8 9 10
test <- x > 5
test
#> [1] FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE
as.numeric(test)
#> [1] 0 0 0 0 0 1 1 1 1 1
sum(test)
#> [1] 5
length(test)
#> [1] 10
sum(test) / length(test)
#> [1] 0.5
mean(test)
#> [1] 0.5
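One thing to watch out for is missing data. A comparison involving NA returns NA, so mean() returns NA as well unless you ask it to drop missing values with na.rm = TRUE. Here is a minimal sketch using a made-up vector (x and the result names are just for illustration):

```r
x <- c(1, 7, NA, 9, 3)

# NA > 5 is NA, so by default the proportion is NA too
p_default <- mean(x > 5)
p_default
#> [1] NA

# na.rm = TRUE drops the NA first: 2 of the 4 known values exceed 5
p_known <- mean(x > 5, na.rm = TRUE)
p_known
#> [1] 0.5

# Proportion of values that are missing: 1 of 5
p_missing <- mean(is.na(x))
p_missing
#> [1] 0.2
```

Note that with na.rm = TRUE the denominator is the number of non-missing values, not the full length of x, so decide which of the two you actually want to report.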
Some useful examples
At this point, we can apply this to all sorts of problems. Here are some examples using the mtcars data set:
d <- mtcars
head(d)
#>                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
#> Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
#> Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
#> Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
#> Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
#> Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
#> Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
# Proportion of rows (cars) with cyl == 6 (6 cylinders)
mean(d$cyl == 6)
#> [1] 0.21875
# Proportion of rows (cars) with hp > 250 (horsepower over 250)
mean(d$hp > 250)
#> [1] 0.0625
# Proportion of cars with 8 cylinders that get more than 15 miles per (US) gallon
mean(d$cyl == 8 & d$mpg > 15)
#> [1] 0.25
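The same trick extends to proportions within groups. As a rough sketch (one of several ways to do this, and the name prop_by_cyl is just for illustration), base R's tapply() can apply mean() to a boolean vector separately for each number of cylinders:

```r
d <- mtcars

# Proportion of cars getting more than 20 mpg, within each cylinder group.
# tapply() splits the boolean vector by d$cyl and applies mean() to each piece.
prop_by_cyl <- tapply(d$mpg > 20, d$cyl, mean)
prop_by_cyl
# All 11 four-cylinder cars, 3 of the 7 six-cylinder cars,
# and none of the 14 eight-cylinder cars exceed 20 mpg.
```

The result is a named vector with one proportion per group, so prop_by_cyl["6"] pulls out the value for 6-cylinder cars.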
Sign off
Thanks for reading and I hope this was useful for you.
For updates of recent blog posts, follow @drsimonj on Twitter, or email me at drsimonjackson@gmail.com to get in touch.
If you’d like the code that produced this blog, check out my GitHub repository, blogR.