Proportions with mean()  

One of the most common tasks I need to do is calculate the proportion of observations (e.g., rows in a data set) that meet a particular condition. For example, what proportion of the values are missing, or what proportion of people are over the age of 18?

There is a surprisingly simple solution to this problem: combining boolean vectors and mean().
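
As a quick preview of where we’re heading (with some made-up ages), here is the proportion of people who are 18 or over:

ages <- c(21, 17, 30, 42)
mean(ages >= 18)  # proportion of ages that are 18 or over
#> [1] 0.75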

Step 1: creating a boolean vector #

We start with a boolean vector: a vector that is TRUE whenever an observation meets our condition, and FALSE whenever it doesn’t. We create this boolean vector by submitting our observations to some sort of conditional statement (or a relevant function like is.na()). Let’s take a look at a few examples:

x <- letters[1:10]
x == "b"  # returns a boolean vector that is TRUE whenever x is "b"
#>  [1] FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE

x <- 1:10
x > 5  # TRUE whenever x is greater than 5
#>  [1] FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE

x > 5 & x %% 2 == 0  # TRUE when x > 5 AND divisible by 2
#>  [1] FALSE FALSE FALSE FALSE FALSE  TRUE FALSE  TRUE FALSE  TRUE

x <- c(1, 2, NA, 4)
is.na(x) # TRUE when x is a missing value
#> [1] FALSE FALSE  TRUE FALSE
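
You can also flip a condition with ! (logical NOT). For example, to flag the values that are not missing:

!is.na(x)  # TRUE when x is NOT a missing value
#> [1]  TRUE  TRUE FALSE  TRUE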

If you’re unsure how the above works, take a look at this page on R Programming Operators.

With this under our belt, it seems simple enough to create a boolean vector that tells us when our observations meet some condition (TRUE) or not (FALSE).

Step 2: calculating the proportion of TRUE #

From this point, all we need to do is wrap our conditional statement inside mean():

x <- 1:10
mean(x > 5)  # proportion of values in x greater than 5
#> [1] 0.5

How/why does this work? If you take a look at the help page with ?mean(), you’ll read that the argument x can be a logical vector. But what does this mean? Well, when you use a boolean vector, mean() first converts it to a numeric vector. This means that every TRUE becomes 1, and every FALSE becomes 0:

x <- 1:10
as.numeric(x > 5)
#>  [1] 0 0 0 0 0 1 1 1 1 1

It then computes the mean of these 1’s and 0’s. At this point, you just need to think a little. How is the mean calculated? Well, it’s the sum of all the values divided by the number of values. So the sum of a vector of 1’s and 0’s is the total number of 1’s, which is the number of TRUEs. Dividing by the length then gives you the proportion. As a side note, you might realise that you can use sum() instead of mean() if you want the frequency (a count) rather than the proportion. Let’s break this right down:

x <- 1:10
x
#>  [1]  1  2  3  4  5  6  7  8  9 10

test <- x > 5
test
#>  [1] FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE

as.numeric(test)
#>  [1] 0 0 0 0 0 1 1 1 1 1

sum(test)
#> [1] 5

length(test)
#> [1] 10

sum(test) / length(test)
#> [1] 0.5

mean(test)
#> [1] 0.5
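
One thing to watch out for: if the vector you’re testing contains missing values, the comparison returns NA for those positions, and mean() will return NA unless you drop them with na.rm = TRUE. A quick sketch:

x <- c(1, 2, NA, 4)
mean(x > 2)  # NA, because the comparison is NA for the missing value
#> [1] NA
mean(x > 2, na.rm = TRUE)  # proportion of TRUE among the non-missing values
#> [1] 0.3333333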

Some useful examples #

At this point, we can apply this to all sorts of problems. Here are some examples using the mtcars data set:

d <- mtcars
head(d)
#>                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
#> Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
#> Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
#> Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
#> Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
#> Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
#> Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

# Proportion of rows (cars) with cyl == 6 (6 cylinders)
mean(d$cyl == 6)
#> [1] 0.21875

# Proportion of rows (cars) with hp > 250 (horsepower over 250)
mean(d$hp > 250)
#> [1] 0.0625

# Proportion of cars with 8 cylinders that get more than 15 miles per (US) gallon
mean(d$cyl == 8 & d$mpg > 15)
#> [1] 0.25
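
And circling back to the missing-data question from the start of the post, the same trick works with is.na(). Here’s a sketch using the built-in airquality data set (which has some missing Ozone readings):

# Proportion of missing values in the Ozone column of airquality
mean(is.na(airquality$Ozone))
#> [1] 0.2418301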

Sign off #

Thanks for reading and I hope this was useful for you.

For updates of recent blog posts, follow @drsimonj on Twitter, or email me at drsimonjackson@gmail.com to get in touch.

If you’d like the code that produced this blog, check out my GitHub repository, blogR.

 