Data science opinions and tools to support them at rstudio::conf  

@drsimonj here to share my big takeaways from rstudio::conf 2017. My aim here is to share the broad data science opinions and challenges that I feel bring together the R community right now, and perhaps offer some guidance to anyone wanting to get into the R community.

DISCLAIMER: this is based on my experience, my primary interests, the talks I attended, the people I met, etc. I’m also very jet lagged after flying back to Australia! If I’ve missed something important to you (which I’m sure I have), please comment in whichever medium (Twitter, Facebook, etc.) and get the discussion going!

 My overall experience

I’ll start by saying that I had a great time. RStudio went all out and nailed everything from getting high-quality speakers, to booking a great venue and organizing a social event at Harry Potter world I won’t forget. But if I do, Hilary Parker took some great shots!

 Opinionated Analysis Development

I didn’t see it’s value immediately, but Hilary Parker‘s talk on opinionated analysis development struck me as brilliant once I started writing this post.

To crudely (and probably inaccurately) summarize, Hilary was advocating that we focus on defining opinions about data science processes (like “we should be using version control”) and then push for software that supports them. I liked this idea and will do my best to discuss the conference themes in this way.

In line with Hilary’s talk, I think the broad goals driving discussion at the conference include reducing wasted time, minimizing the chance of human error, and maximizing utility. As a human factors researcher, this couldn’t make me happier!

So, what were some of the broad opinions and challenges about data science processes that the R community discussed?

 Write code for humans

A strong opinion is that code should be written for humans. This doesn’t mean that we need to communicate how every function works. Rather, if I share my code with someone, can they easily get a sense of what various chunks are doing? This means less time trying to interpret code and minimizing the chance of making errors along the way.

The R software of choice for writing human-friendly code most definitely includes the tidyverse. It’s not just the functions, but it’s the philosophy behind the tidyverse that makes it so apt for writing readable code. This was evident from the start when Hadley Wickham gave an opening keynote on the topic.

The use of the magrittr package’s pipe operator %>% and packages like dplyr and ggplot2 was prolific. Whether you like these tools or not, understanding them is crucial for communicating with the R community right now. They’re also quickly becoming the basis for many other packages like Julia Silge and David Robinson’s tidytext.

 Use Version Control

Another big opinion surrounding the optimization of time and reduction of errors is that we should be using proper version control. Accurately tracking how our code evolves and being able to share that evolution with others (or our future selves) is critical. Without version control, simple mistakes like mislabelling a file can have devastating consequences. It’s also an essential process for working collaboratively.

The R community’s current software of choice for version control and collaboratively developing code are Git and GitHub. You’ll find that most of the top R packages are now hosted on GitHub, which enables good version control as well as an open environment that is easy for multiple people to work in at the same time. Jenny Bryan gave a notable tutorial on this topic.

 We need better modeling processes

Data science is a hot topic and everyone agrees that we should model data in a way that is valid and useful. But the process by which we carry out modeling is still poorly defined and a major challenge when you consider human biases. This was really emphasized by data journalist, Andrew Flowers.

Modeling needs standardized processes, and tools to support them, to reduce error and maximize the validity of our work. We have a strong foundation. For example, in addition to the topics described above, we have great processes and tools for optimizing code such as Dirk Eddelbuettel’s Rcpp or Winston Chang’s Profvis. But what do we do when it comes to modeling data?

The community is gearing up to start tackling this problem. Before the conference, I publicized a new development package, pipelearner, which is related to this problem. It was great to meet with people who were already testing it out (no doubt thanks to Mara Averick tweeting about it).

As an idea, pipelearner is a start, but it’s a long way away from what we’ll ultimately need. An exciting prospect is that Max Kuhn, creator of the caret package, has joined RStudio. We can only hope that he’ll be focussing on this problem and we’ll be seeing some proper opinions and tools soon.

 Present work in a visual format that is easy and enjoyable to digest

Another broadly held opinion is that we should present our work in a visual format that is easy and enjoyable to digest. This doesn’t just mean creating good plots with ggplot2, but synthesising entire reports or documents that can be easily understood. It’s not sufficient to do an analysis because you always have to present that analysis to someone! That someone could be a fellow data scientist or a manager who knows nothing about statistics or programming. To me, enjoyable is an important aspect. The details of data science work can be boring to most people. Presenting information so that people want to engage with it can make a big difference.

The software tools of choice here are R Markdown and Shiny. Shiny is the go-to for interactivity. R Markdown is fast becoming the base for almost all top R visualization/communication formats, so I’d recommend starting to learn it before Shiny. With advances in tools like R Notebooks, many people are already creating their scripts in R Markdown-based documents that can be instantly converted to visually appealing formats. There were many great talks on these topics such as:

 Be social

Being social is strongly valued by this community. This doesn’t always get directly discussed, but it is something we all seem to share implicitly. The first meaning of “social” here is to be engaged with other community members. This can be sharing your content or, if you don’t have content to share, liking, commenting, or otherwise engaging with the content that people do share. Keeping connected like this helps the R community to develop fast. When package errors are found, or challenging problems come up, these penetrate into the community and are often resolved incredibly quickly. The other meaning of “social” is to be friendly and supportive. More advanced community members are happy to help those in need, making a move into the R community a pleasant experience for newcomers (I can speak to this point directly). Expecting friendly dialect makes people more likely to share their work and problems. Overall, I think that some of the growing success of R can be attributed to the value that the R community places on being social.

In general, I feel that the best software for being a social R user is Twitter. For example, my lightning talk on corrr got considerable attention outside the room as soon as David Robinson tweeted about it.

Start to follow other R users, engage with them, and you can always look at other users’ lists. E.g., you can check out my daRk magic list that isolates some of my favorite R tweeters (though it needs some updating). Stackoverflow is another great place to get or give R help - but be sure to get a feel for it before diving in if you’re new. Of course, emails are also good.

So speaking of being social, here comes my sign off and contact details…

 Sign off

Thanks for reading and I hope this was useful for you.

For updates of recent blog posts, follow @drsimonj on Twitter or email me at drsimonjackson@gmail.com to get in touch.

If you’d like the code that produced this blog, check out the blogR GitHub repository.

 
16
Kudos
 
16
Kudos

Now read this

Quick plot of all variables

This post will explain a data pipeline for plotting all (or selected types) of the variables in a data frame in a facetted plot. The goal is to be able to glean useful information about the distributions of each variable, without having... Continue →