Guide to tidy git analysis  

@drsimonj here to help you embark on git repo analyses!

Ever wondered who contributes to git repos? How their contributions have changed over time? What sort of conventions different authors use in their commit messages? Maybe you were inspired by Mara Averick to contribute to tidyverse packages and wonder how you fit in?

This post – intended for intermediate R users – will help you answer these sorts of questions using tidy R tools.

Install and load these packages to follow along:

# Parts 1 and 2
library(tidyverse)
library(glue)
library(stringr)
library(forcats)

# Part 3
library(tidygraph)
library(ggraph)
library(tidytext)

Part 1: Git repo to a tidy data frame #

Get a git repo #

We’ll explore the open-source ggplot2 repo by copying it to our local machine with git clone, typically run on a command-line like:

git clone <repository_url> <directory>

Find the <repository_url> for github.com projects by clicking “Clone or download”.

ggplot2_gitclone.png

<directory> is optional but useful for us to clone into a specific location.

The R code below clones the ggplot2 git repo into a temporary directory called "git_repo" (let Alexander Matrunich teach you more about temp directories and files here). system() invokes these commands from R (instead of using a command-line directly) and the glue package beautifully handles the strings for us.

# Remote repository URL
repo_url <- "https://github.com/tidyverse/ggplot2.git"

# Directory into which git repo will be cloned
clone_dir <- file.path(tempdir(), "git_repo")

# Create command
clone_cmd <- glue("git clone {repo_url} {clone_dir}")

# Invoke command
system(clone_cmd)

Check the directory contents:

list.files(clone_dir)
#>  [1] "_pkgdown.yml"      "appveyor.yml"      "codecov.yml"      
#>  [4] "CONTRIBUTING.md"   "cran-comments.md"  "data"             
#>  [7] "data-raw"          "DESCRIPTION"       "ggplot2.Rproj"    
#> [10] "icons"             "inst"              "ISSUE_TEMPLATE.md"
#> [13] "LICENSE"           "man"               "NAMESPACE"        
#> [16] "NEWS"              "NEWS.md"           "R"                
#> [19] "README.md"         "README.Rmd"        "revdep"           
#> [22] "tests"             "vignettes"

Get tidy git history #

You can now access the git history of the local repo using git log (making sure to target the right directory with -C). Examine the last few commits via:

system(glue('git -C {clone_dir} log -3'))
#> commit 3c9c504fdb2f2c6b70a98d5a02f6cdf2c8cc62c9
#> Merge: cab8a54e 449bc039
#> Author: Lionel Henry <lionel.hry@gmail.com>
#> Date:   Thu Mar 22 18:42:25 2018 +0100
#> 
#>     Merge pull request #2491 from tidyverse/tidyeval-facets
#>     
#>     Port facets to tidy eval
#> 
#> commit 449bc039a75dd06e0ff52233c5d747f81b4d7c30
#> Author: Lionel Henry <lionel.hry@gmail.com>
#> Date:   Thu Mar 22 17:54:12 2018 +0100
#> 
#>     Remove dependency on plyr::as.quoted()
#> 
#> commit dc3c9855fd18f78ba8ef14d7e485334ac47e3b16
#> Author: Lionel Henry <lionel.hry@gmail.com>
#> Date:   Thu Mar 22 16:39:12 2018 +0100
#> 
#>     Call formula interface classic instead of historical

This default output is nice but difficult to parse. Fortunately, git log has the --pretty option, which the code below uses to create a command to return nicely formatted logs (learn more about log formatting here):

log_format_options <- c(datetime = "cd", commit = "h", parents = "p", author = "an", subject = "s")
option_delim <- "\t"
log_format   <- glue("%{log_format_options}") %>% collapse(option_delim)
log_options  <- glue('--pretty=format:"{log_format}" --date=format:"%Y-%m-%d %H:%M:%S"')
log_cmd      <- glue('git -C {clone_dir} log {log_options}')
log_cmd
#> git -C /var/folders/f3/0qlt4tvx7lld3r1wx8q6z9gnrc2yg8/T//RtmpHj21yK/git_repo log --pretty=format:"%cd    %h  %p  %an %s" --date=format:"%Y-%m-%d %H:%M:%S"

This outputs each commit as a string of tab-separated values:

system(glue('{log_cmd} -3'))
#> 2018-03-22 18:42:25  3c9c504f    cab8a54e 449bc039   Lionel Henry    Merge pull request #2491 from tidyverse/tidyeval-facets
#> 2018-03-22 17:55:23  449bc039    dc3c9855    Lionel Henry    Remove dependency on plyr::as.quoted()
#> 2018-03-22 17:55:23  dc3c9855    1182b9f3    Lionel Henry    Call formula interface classic instead of historical

The R code below executes this for the entire repo, captures the output (via intern = TRUE), splits commit strings into vectors of values (thanks to stringr), and converts them to a named tibble.

history_logs <- system(log_cmd, intern = TRUE) %>% 
  str_split_fixed(option_delim, length(log_format_options)) %>% 
  as_tibble() %>% 
  setNames(names(log_format_options))

history_logs
#> # A tibble: 3,946 x 5
#>    datetime            commit   parents           author       subject    
#>    <chr>               <chr>    <chr>             <chr>        <chr>      
#>  1 2018-03-22 18:42:25 3c9c504f cab8a54e 449bc039 Lionel Henry Merge pull…
#>  2 2018-03-22 17:55:23 449bc039 dc3c9855          Lionel Henry Remove dep…
#>  3 2018-03-22 17:55:23 dc3c9855 1182b9f3          Lionel Henry Call formu…
#>  4 2018-03-22 17:55:23 1182b9f3 b2144a8e          Lionel Henry Document t…
#>  5 2018-03-22 17:55:23 b2144a8e 13db76d2          Lionel Henry Mention ti…
#>  6 2018-03-22 17:55:23 13db76d2 95253c66          Lionel Henry Accept var…
#>  7 2018-03-22 17:55:23 95253c66 2983aa0d          Lionel Henry Accept var…
#>  8 2018-03-22 17:55:23 2983aa0d 64a00c7d          Lionel Henry Rename as_…
#>  9 2018-03-22 17:55:23 64a00c7d f186615a          Lionel Henry Use proper…
#> 10 2018-03-22 17:55:23 f186615a cab8a54e          Lionel Henry Support qu…
#> # ... with 3,936 more rows

The entire git commit history is now a tidy data frame (tibble)! We’ll finish this section with two minor additions.

First, the parents column can contain space-separated strings when a commit was a merge of multiple, for example. The code below converts this to a list-column of character vectors (let Jenny Bryan teach you more about this here):

history_logs <- history_logs %>% 
  mutate(parents = str_split(parents, " "))

history_logs
#> # A tibble: 3,946 x 5
#>    datetime            commit   parents   author       subject            
#>    <chr>               <chr>    <list>    <chr>        <chr>              
#>  1 2018-03-22 18:42:25 3c9c504f <chr [2]> Lionel Henry Merge pull request…
#>  2 2018-03-22 17:55:23 449bc039 <chr [1]> Lionel Henry Remove dependency …
#>  3 2018-03-22 17:55:23 dc3c9855 <chr [1]> Lionel Henry Call formula inter…
#>  4 2018-03-22 17:55:23 1182b9f3 <chr [1]> Lionel Henry Document tidy eval…
#>  5 2018-03-22 17:55:23 b2144a8e <chr [1]> Lionel Henry Mention tidy eval …
#>  6 2018-03-22 17:55:23 13db76d2 <chr [1]> Lionel Henry Accept vars() spec…
#>  7 2018-03-22 17:55:23 95253c66 <chr [1]> Lionel Henry Accept vars() spec…
#>  8 2018-03-22 17:55:23 2983aa0d <chr [1]> Lionel Henry Rename as_facets_s…
#>  9 2018-03-22 17:55:23 64a00c7d <chr [1]> Lionel Henry Use proper quosure…
#> 10 2018-03-22 17:55:23 f186615a <chr [1]> Lionel Henry Support quosures i…
#> # ... with 3,936 more rows

Finally, be sure to assign branch numbers to commits. There’s surely a better way to do this, but here’s one (very untidy) method:

# Start with NA
history_logs <- history_logs %>% mutate(branch = NA_integer_)

# Create a boolean vector to represent free columns (1000 should be plenty!)
free_col <- rep(TRUE, 1000)

for (i in seq_len(nrow(history_logs) - 1)) { # - 1 to ignore root
  # Check current branch col and assign open col if NA
  branch <- history_logs$branch[i]

  if (is.na(branch)) {
    branch <- which.max(free_col)
    free_col[branch] <- FALSE
    history_logs$branch[i] <- branch
  }

  # Go through parents
  parents <- history_logs$parents[[i]]

  for (p in parents) {
    parent_col <- history_logs$branch[history_logs$commit == p]

    # If col is missing, assign it to same branch (if first parent) or new
    # branch (if other)
    if (is.na(parent_col)) {
      parent_col <- if_else(p == parents[1], branch, which.max(free_col))

    # If NOT missing this means a split has occurred. Assign parent the lowest
    # and re-open both cols (parent closed at the end)
    } else {
      free_col[c(branch, parent_col)] <- TRUE
      parent_col <- min(branch, parent_col)

    }

    # Close parent col and assign
    free_col[parent_col] <- FALSE
    history_logs$branch[history_logs$commit == p] <- parent_col
  }
}

We now also have branch values with 1 being the root.

history_logs
#> # A tibble: 3,946 x 6
#>    datetime            commit   parents   author       subject      branch
#>    <chr>               <chr>    <list>    <chr>        <chr>         <int>
#>  1 2018-03-22 18:42:25 3c9c504f <chr [2]> Lionel Henry Merge pull …      1
#>  2 2018-03-22 17:55:23 449bc039 <chr [1]> Lionel Henry Remove depe…      2
#>  3 2018-03-22 17:55:23 dc3c9855 <chr [1]> Lionel Henry Call formul…      2
#>  4 2018-03-22 17:55:23 1182b9f3 <chr [1]> Lionel Henry Document ti…      2
#>  5 2018-03-22 17:55:23 b2144a8e <chr [1]> Lionel Henry Mention tid…      2
#>  6 2018-03-22 17:55:23 13db76d2 <chr [1]> Lionel Henry Accept vars…      2
#>  7 2018-03-22 17:55:23 95253c66 <chr [1]> Lionel Henry Accept vars…      2
#>  8 2018-03-22 17:55:23 2983aa0d <chr [1]> Lionel Henry Rename as_f…      2
#>  9 2018-03-22 17:55:23 64a00c7d <chr [1]> Lionel Henry Use proper …      2
#> 10 2018-03-22 17:55:23 f186615a <chr [1]> Lionel Henry Support quo…      2
#> # ... with 3,936 more rows

This rounds off the section on getting a git commit history into a tidy data frame. Remove the local git repo, which is no longer needed:

unlink(clone_dir, recursive = TRUE)

Part 2: Tidy Analysis #

You’re now ready to embark on a tidy git repo analysis! For example, which authors make the most commits?

history_logs %>% 
  count(author, sort = TRUE)
#> # A tibble: 165 x 2
#>    author                         n
#>    <chr>                      <int>
#>  1 hadley                      2285
#>  2 Winston Chang                650
#>  3 Hadley Wickham               104
#>  4 Kohske Takahashi @ jurina    103
#>  5 Kohske Takahashi at Haruna    70
#>  6 hadley wickham                69
#>  7 Jean-Olivier Irisson          63
#>  8 Lionel Henry                  50
#>  9 Thomas Lin Pedersen           47
#> 10 Kirill Müller                 40
#> # ... with 155 more rows

Some authors appear under different names, which we can quickly correct based on these top cases:

history_logs <- history_logs %>% 
  mutate(author = case_when(
    str_detect(tolower(author), "hadley") ~ "Hadley Wickham",
    str_detect(tolower(author), "kohske takahashi") ~ "Kohske Takahashi",
    TRUE ~ str_to_title(author)
  ))

Now, again, authors by commit frequency:

history_logs %>% 
  count(author) %>% 
  arrange(desc(n))
#> # A tibble: 157 x 2
#>    author                   n
#>    <chr>                <int>
#>  1 Hadley Wickham        2458
#>  2 Winston Chang          650
#>  3 Kohske Takahashi       232
#>  4 Jean-Olivier Irisson    63
#>  5 Lionel Henry            50
#>  6 Thomas Lin Pedersen     47
#>  7 Kirill Müller           40
#>  8 Kara Woo                36
#>  9 Brian Diggs             27
#> 10 Jake Russ               20
#> # ... with 147 more rows

And top-ten visualized with ggplot2:

history_logs %>% 
  count(author) %>% 
  top_n(10, n) %>% 
  mutate(author = fct_reorder(author, n)) %>% 
  ggplot(aes(author, n)) +
    geom_col(aes(fill = n), show.legend = FALSE) +
    coord_flip() +
    theme_minimal() +
    ggtitle("ggplot2 authors with most commits") +
    labs(x = NULL, y = "Number of commits", caption = "Post by @drsimonj")

unnamed-chunk-18-1.png

Not surprising to see Hadley Wikham topping the charts. Otherwise, the analysis options are pretty endless from here!

Part 3: Advanced Topics #

I’d like to touch on two advanced topics before leaving you to embark on astounding git repo analyses.

Git repo as a relational graph #

A git history is a relational structure where commits are nodes and connections between them are directed edges (from parent to child).

The code below converts our tidy data frame into a tidy relational structure made up of two data frames (nodes and edges) thanks to tidygraph (learn more from the package creator, Thomas Lin Pedersen, in blog posts like this).

# Convert commit to a factor (for ordering nodes)
history_logs <- history_logs %>% 
  mutate(commit = factor(commit))

# Nodes are the commits (keeping relevant info)
nodes <- history_logs %>% 
  select(-parents) %>% 
  arrange(commit)

# Edges are connections between commits and their parents
edges <- history_logs %>% 
  select(commit, parents) %>% 
  unnest(parents) %>% 
  mutate(parents = factor(parents, levels = levels(commit))) %>% 
  transmute(from = as.integer(parents), to = as.integer(commit)) %>% 
  drop_na()

# Create tidy directed graph object
git_graph <- tbl_graph(nodes = nodes, edges = edges, directed = TRUE)
git_graph
#> # A tbl_graph: 3946 nodes and 4371 edges
#> #
#> # A directed acyclic simple graph with 1 component
#> #
#> # Node Data: 3,946 x 5 (active)
#>   datetime            commit   author           subject             branch
#>   <chr>               <fct>    <chr>            <chr>                <int>
#> 1 2011-07-19 22:40:33 002e510d Kohske Takahashi remove style param…      4
#> 2 2008-07-05 20:00:50 0036ad7f Hadley Wickham   HAcks to get tile …      1
#> 3 2010-02-05 10:25:26 0072cf4c Hadley Wickham   Fix bug with empty…      1
#> 4 2013-12-02 13:51:05 008018cc Josef Fruehwald  added stat_ellipse       9
#> 5 2010-12-23 10:22:36 00821913 Hadley Wickham   Fix doc name             1
#> 6 2011-12-23 06:24:28 008312c0 Hadley Wickham   Merge pull request…      1
#> # ... with 3,940 more rows
#> #
#> # Edge Data: 4,371 x 2
#>    from    to
#>   <int> <int>
#> 1  3070   968
#> 2  1071   968
#> 3  3370  1071
#> # ... with 4,368 more rows

Using ggraph (another of Thomas’ awesome packages with more detailed info in posts like this) a default visualization could look something like this:

git_graph %>% 
  ggraph() +
    geom_edge_link(alpha = .1) +
    geom_node_point(aes(color = factor(branch)), alpha = .3) +
    theme_graph() +
    theme(legend.position = "none")
#> Using `nicely` as default layout

unnamed-chunk-21-1.png

This looks cool but not right! A "manual" layout is needed for a linear visualisation of the git history (see the dplyr network for example).

For convenience, this is a template pipeline that will take the tidy graph object, ensure the proper layout is used, and create the basic plot:

ggraph_git <- . %>%
  # Set node x,y coordinates
  activate(nodes) %>% 
  mutate(x = datetime, y = branch) %>% 
  # Plot with correct layout
  create_layout(layout = "manual", node.positions = as_tibble(activate(., nodes))) %>% 
  {ggraph(., layout = "manual") + theme_graph() + labs(caption = "Post by @drsimonj")}

Using this pipeline:

git_graph %>% 
  ggraph_git() +
    geom_edge_link(alpha = .1) +
    geom_node_point(aes(color = factor(branch)), alpha = .3) +
    theme(legend.position = "none") +
    ggtitle("Commit history of ggplot2")

unnamed-chunk-23-1.png

Much better! You can go crazy with how you’d like to visualise the repo. For example, here we filter on a specific date range:

git_graph %>% 
  activate(nodes) %>% 
  filter(datetime > "2015-11-01", datetime < "2016-08-01") %>% 
  ggraph_git() +
    geom_edge_link(alpha = .1) +
    geom_node_point(aes(color = factor(branch)), alpha = .3) +
    theme(legend.position = "none") +
    ggtitle("Git history of ggplot2",
            subtitle = "2015-11 to 2016-08")

unnamed-chunk-24-1.png

Here commits are highlighted for the top authors:

# 10 most-common authors
top_authors <- git_graph %>% 
  activate(nodes) %>% 
  as_tibble() %>% 
  count(author, sort = TRUE) %>% 
  top_n(10, n) %>% 
  pull(author)

# Plot
git_graph %>% 
  activate(nodes) %>%
  filter(datetime > "2015-11-01", datetime < "2016-08-01") %>% 
  mutate(author = factor(author, levels = top_authors),
         author = fct_explicit_na(author, na_level = "Other")) %>% 
  ggraph_git() +
    geom_edge_link(alpha = .1) +
    geom_node_point(aes(color = author), alpha = .3) +
    theme(legend.position = "bottom") +
    ggtitle("ggplot2 commits by author",
            subtitle = "2015-11 to 2016-08")

unnamed-chunk-25-1.png

I hope this gives you enough to start having fun with these sorts of visualisations!

Text Mining Commit Messages #

Commit messages are simple but great material for tidy text-mining tools like the brilliant tidytext package, best learned from Text Mining with R: A Tidy Approach, by Julia Silge and David Robinson. Here are some examples using commit subjects to get you started.

Get commit subjects into a tidy format and remove stop words.

data(stop_words)

tidy_subjects <- history_logs %>%
  unnest_tokens(word, subject) %>% 
  anti_join(stop_words)
#> Joining, by = "word"

tidy_subjects
#> # A tibble: 16,477 x 6
#>    datetime            commit   parents   author       branch word      
#>    <chr>               <fct>    <list>    <chr>         <int> <chr>     
#>  1 2018-03-22 18:42:25 3c9c504f <chr [2]> Lionel Henry      1 merge     
#>  2 2018-03-22 18:42:25 3c9c504f <chr [2]> Lionel Henry      1 pull      
#>  3 2018-03-22 18:42:25 3c9c504f <chr [2]> Lionel Henry      1 request   
#>  4 2018-03-22 18:42:25 3c9c504f <chr [2]> Lionel Henry      1 2491      
#>  5 2018-03-22 18:42:25 3c9c504f <chr [2]> Lionel Henry      1 tidyverse 
#>  6 2018-03-22 18:42:25 3c9c504f <chr [2]> Lionel Henry      1 tidyeval  
#>  7 2018-03-22 18:42:25 3c9c504f <chr [2]> Lionel Henry      1 facets    
#>  8 2018-03-22 17:55:23 449bc039 <chr [1]> Lionel Henry      2 remove    
#>  9 2018-03-22 17:55:23 449bc039 <chr [1]> Lionel Henry      2 dependency
#> 10 2018-03-22 17:55:23 449bc039 <chr [1]> Lionel Henry      2 plyr      
#> # ... with 16,467 more rows

What are the ten most frequently used words?

tidy_subjects %>%
  count(word) %>% 
  top_n(10, n) %>% 
  mutate(word = fct_reorder(word, n)) %>% 
  ggplot(aes(word, n)) +
    geom_col(aes(fill = n), show.legend = FALSE) +
    coord_flip() +
    theme_minimal() +
    ggtitle("Most-used words in ggplot2 commit subjects") +
    labs(x = NULL, y = "Word frequency", caption = "Post by @drsimonj")

unnamed-chunk-27-1.png

Or how about the words that most-frequently follow “fix”, the most-used word:

history_logs %>% 
  select(commit, author, subject) %>% 
  unnest_tokens(bigram, subject, token = "ngrams", n = 2) %>% 
  separate(bigram, c("word1", "word2"), sep = " ") %>% 
  filter(word1 == "fix") %>%
  anti_join(stop_words, by = c("word2" = "word")) %>% 
  count(word2, sort = TRUE)
#> # A tibble: 223 x 2
#>    word2             n
#>    <chr>         <int>
#>  1 bug              57
#>  2 typo             37
#>  3 geom             11
#>  4 guides           11
#>  5 bugs              7
#>  6 doc               7
#>  7 boxplot           6
#>  8 axis              5
#>  9 documentation     5
#> 10 dumb              5
#> # ... with 213 more rows

Unsurprisingly, it seems like many of the commits involve fixing bugs and typos, as well as challenges with geoms and guides.

We’ve now covered more than enough for you to explore and analyse git repos in a tidy R framework. Don’t forget to share your findings with the world and let me know about it!

Sign off #

Thanks for reading and I hope this was useful for you.

For updates of recent blog posts, follow @drsimonj on Twitter, or email me at drsimonjackson@gmail.com to get in touch.

If you’d like the code that produced this blog, check out the blogR GitHub repository.

 
42
Kudos
 
42
Kudos

Now read this

ourworldindata: an R data package

@drsimonj here to introduce ourworldindata: a new data package for R. The ourworldindata package contains data frames that are generated by combining datasets from OurWorldInData.org: “an online publication that shows how living... Continue →