Proportions with mean()  

One of the most common tasks I want to do is calculate the proportion of observations (e.g., rows in a data set) that meet a particular condition. For example, what is the proportion of missing data, or people over the age of 18?

There is a surprisingly easy solution to this problem: combining boolean vectors with mean().

Step 1: creating a boolean vector

We start with a boolean vector, which is a vector that is TRUE whenever an observation meets our condition and FALSE whenever it doesn’t. We create this boolean vector by passing our observations through a conditional statement (or a relevant function like is.na()). Let’s take a look at a few examples:

x <- letters[1:10]
x == "b"  # return a boolean vector which is TRUE whenever x is "b"
#>  [1] FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE

x <- 1:10
x > 5  # TRUE whenever x is greater than 5
#>  [1] FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE

x > 5 & x %% 2 == 0  # TRUE when x > 5 AND divisible by 2
#>  [1] FALSE FALSE FALSE FALSE FALSE  TRUE FALSE  TRUE FALSE  TRUE

x <- c(1, 2, NA, 4)
is.na(x) # TRUE when x is a missing value
#> [1] FALSE FALSE  TRUE FALSE

If you’re unsure how the above works, take a look at this page on R Programming Operators.

With this under our belt, it seems simple enough to create a boolean vector that tells us when our observations meet some condition (TRUE) or not (FALSE).

Step 2: calculating the proportion of TRUE

From this point, all we need to do is wrap our conditional statement inside mean():

x <- 1:10
mean(x > 5)  # proportion of values in x greater than 5
#> [1] 0.5

How and why does this work? If you take a look at the help page with ?mean(), you’ll read that the argument x can be a logical vector. But what does this mean? Well, when you use a boolean vector, mean() first converts it to a numeric vector. This means that every TRUE becomes 1, and every FALSE becomes 0:

x <- 1:10
as.numeric(x > 5)
#>  [1] 0 0 0 0 0 1 1 1 1 1

It then computes the mean of these 1’s and 0’s. How is the mean calculated? It’s the sum of all the values divided by how many values there are. The sum of a vector of 1’s and 0’s is the total number of 1’s, so dividing by the length gives you the proportion of 1’s, which is the proportion of TRUE. As a side note, you can use sum() instead of mean() if you want the frequency (the count) rather than the proportion. Let’s break this right down:

x <- 1:10
x
#>  [1]  1  2  3  4  5  6  7  8  9 10

test <- x > 5
test
#>  [1] FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE

as.numeric(test)
#>  [1] 0 0 0 0 0 1 1 1 1 1

sum(test)
#> [1] 5

length(test)
#> [1] 10

sum(test) / length(test)
#> [1] 0.5

mean(test)
#> [1] 0.5
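
The same idea covers the missing-data question from the start of the post, with one caveat worth knowing. Here is a minimal sketch using a small made-up vector (the names prop_missing, prop_default and prop_observed are just for illustration). Because a comparison involving NA is itself NA, mean() returns NA unless you ask it to drop missing values with na.rm = TRUE:

```r
# Hypothetical example vector with one missing value
x <- c(1, 2, NA, 4)

# Proportion of missing values: is.na() always returns TRUE or FALSE, never NA
prop_missing <- mean(is.na(x))
prop_missing
#> [1] 0.25

# x > 2 is FALSE FALSE NA TRUE, so the default mean is NA
prop_default <- mean(x > 2)
prop_default
#> [1] NA

# na.rm = TRUE drops the NA first: 1 of the 3 remaining values is greater than 2
prop_observed <- mean(x > 2, na.rm = TRUE)
prop_observed
#> [1] 0.3333333
```

Note that with na.rm = TRUE the denominator is the number of non-missing values, not the full length of x, so it's worth deciding which of the two proportions you actually want.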

Some useful examples

At this point, we can apply this to all sorts of problems. Here are some examples using the mtcars data set:

d <- mtcars
head(d)
#>                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
#> Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
#> Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
#> Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
#> Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
#> Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
#> Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

# Proportion of rows (cars) with cyl == 6 (6 cylinders)
mean(d$cyl == 6)
#> [1] 0.21875

# Proportion of rows (cars) with hp > 250 (horsepower over 250)
mean(d$hp > 250)
#> [1] 0.0625

# Proportion of cars with 8 cylinders that also get more than 15 Miles/(US) gallon
mean(d$cyl == 8 & d$mpg > 15)
#> [1] 0.25
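
If you want a proportion within each group rather than across the whole data set, one possible approach (a sketch, not the only way to do it) is to hand the boolean vector to base R's tapply() along with a grouping variable and mean as the function. The 20 mpg cutoff and the name prop_by_cyl are arbitrary choices for illustration:

```r
d <- mtcars

# For each number of cylinders, the proportion of cars getting more than 20 mpg
prop_by_cyl <- tapply(d$mpg > 20, d$cyl, mean)
prop_by_cyl
#>         4         6         8 
#> 1.0000000 0.4285714 0.0000000
```

Here every 4-cylinder car gets more than 20 mpg, 3 of the 7 six-cylinder cars do, and none of the 8-cylinder cars do.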

Sign off

Thanks for reading and I hope this was useful for you.

For updates of recent blog posts, follow @drsimonj on Twitter, or email me at drsimonjackson@gmail.com to get in touch.

If you’d like the code that produced this blog, check out my GitHub repository, blogR.

 