The Hitchhiker's Guide to the Hadleyverse

Photo Credit: Hubblesite

May I ask you something? How long have you been using R? If you are just starting out, there is a (small) chance you don’t know who Hadley Wickham is. If that is not your case, feel free to jump to the third paragraph. OK, so where were we? Of course! Let me introduce you to Mr. Wickham:

Hadley Wickham is an adjunct assistant professor at Rice University and Chief Scientist at RStudio. He is also a major contributor to R in several ways: as a developer of packages, as an author of books, as a member of the R Foundation, as a participant in forums such as Stack Overflow, Google Groups, and the R mailing lists, as a speaker at many R-related events, as an intensive Twitter user of the #rstats hashtag, and a long list of other things.

The Hadleyverse

The “Hadleyverse” is the collection of R packages developed by Hadley Wickham, including tools for data manipulation, plotting, package creation, etc. But be patient, we will see them in detail later. The “Hadleyverse” concept implies that by loading and working with those packages you not only extend the features of base R, but also change the way you code and the strategy you follow to analyse your data (for example, both the plyr and dplyr packages follow the “split-apply-combine” strategy).
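
As a rough illustration of split-apply-combine, here is a small sketch that uses the built-in mtcars data set (which is not part of this post's analysis). It computes the same per-group mean twice, first with base R and then with dplyr; the object names are only for illustration:

```r
library(dplyr)

# Base R, with the three steps spelled out
pieces <- split(mtcars$mpg, mtcars$cyl)  # split: one vector of mpg per cylinder count
means  <- sapply(pieces, mean)           # apply mean() to each piece and combine the results

# dplyr: group_by() does the split, summarise() applies and combines
by_cyl <- mtcars %>%
  group_by(cyl) %>%
  summarise(n_cars = n(), mean_mpg = mean(mpg))

by_cyl
```

Both versions give one mean per number of cylinders; the dplyr version simply keeps the result as a data frame, ready for the next step of a pipeline.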

The origin of the word is not clear, since this is the kind of word that is usually born over a beer at an R users meeting, but searching the web we find some of the first mentions here (3 Oct 2013), here (7 Apr 2014), or here (20 Aug 2014). In any case, it is now a well-known concept among R users, and it has inspired a package, a Docker image, a presentation, and of course this post.

A general map of the Hadleyverse

For this guide to the Hadleyverse we will consider all the packages published on CRAN in which Hadley Wickham appears as main author or collaborator. A good source of information about CRAN packages is the “packages.rds” file, which we will download and process to get the packages of the Hadleyverse:

library(lazyeval)
library(dplyr)
library(igraph)

# Import the RDS file with package information as a data frame. It can be obtained from CRAN.
download.file("http://cran.r-project.org/web/packages/packages.rds", "packages.rds")
rds <- readRDS(file = "packages.rds")
data <- as.data.frame(rds, stringsAsFactors = FALSE)

# Clean column names
data <- data[, !duplicated(names(data))] # Drop columns with duplicated names
names(data) <- gsub(" ", "_", names(data))
names(data) <- gsub("/", "_", names(data))
names(data) <- gsub("@", "_", names(data))

# Convert it to a tbl_df to use it with dplyr
data <- tbl_df(data)
data

## Source: local data frame [6,333 x 40]
## 
##        Package Version Priority                                   Depends
## 1           A3   0.9.2       NA            R (>= 2.15.0), xtable, pbapply
## 2          abc     2.0       NA R (>= 2.10), nnet, quantreg, MASS, locfit
## 3  ABCanalysis     1.0       NA                               R (>= 2.10)
## 4     abcdeFBA     0.4       NA    Rglpk,rgl,corrplot,lattice,R (>= 2.10)
## 5  ABCExtremes     1.0       NA                 SpatialExtremes, combinat
## 6     ABCoptim 0.13.11       NA                                        NA
## 7        ABCp2     1.1       NA                                      MASS
## 8     abctools     1.0       NA   R (>= 2.10), abc, abind, parallel, plyr
## 9          abd   0.2-7       NA   R (>= 3.0), nlme, lattice, grid, mosaic
## 10        abf2   0.7-0       NA                                        NA
## ..         ...     ...      ...                                       ...
## Variables not shown: Imports (chr), LinkingTo (chr), Suggests (chr),
##   Enhances (chr), License (chr), License_is_FOSS (chr),
##   License_restricts_use (chr), OS_type (chr), Archs (chr), MD5sum (chr),
##   NeedsCompilation (chr), Authors_R (chr), Author (chr), BugReports (chr),
##   Contact (chr), Copyright (chr), Description (chr), Encoding (chr),
##   Language (chr), Maintainer (chr), Title (chr), URL (chr),
##   SystemRequirements (chr), Type (chr), Path (chr), Classification_ACM
##   (chr), Classification_JEL (chr), Classification_MSC (chr), Published
##   (chr), VignetteBuilder (chr), Additional_repositories (chr),
##   Reverse_depends (chr), Reverse_imports (chr), Reverse_linking_to (chr),
##   Reverse_suggests (chr), Reverse_enhances (chr)

# Filter all packages authored by Hadley Wickham, and select a subset of variables
hadley <- data %>%
  filter(grepl("Hadley Wickham|Hadley\nWickham", Author)) %>%
  select(Package, Author, Depends, Imports, Suggests, LinkingTo, Enhances)

#Vector of packages
packages <- unique(hadley$Package)
length(packages)

## [1] 55

We see that the Hadleyverse is composed of 55 packages. To obtain a “map” of all of them we will analyse how they relate to each other in five different ways: Depends, Imports, Suggests, LinkingTo, and Enhances. We obtain the relationships with the following code:

# Function to extract related packages as pairs (package1, package2).
# It relies on the `hadley` data frame and the `packages` vector defined above.
relations <- function(var){
  temp <- strsplit(var, ",") # Split the string of dependencies
  package2 <- unlist(temp) # Flatten the list into a single character vector
  # Remove spaces, version constraints in parentheses, and line breaks
  package2 <- gsub(" ", "", package2)
  package2 <- gsub("\\(.*\\)", "", package2)
  package2 <- gsub("\n", "", package2)
  package1 <- rep(hadley$Package, unlist(lapply(temp, length))) # Repeat each source package once per dependency
  df <- data.frame(package1, package2, stringsAsFactors = FALSE)
  # Keep only related packages created by H.W.
  df <- df %>%
    filter(package2 %in% packages,
           package2 != package1
           )
  return(df)
}

#Apply the function to each variable and collapse the resulting list to a single data frame
hadley2 <- lapply(hadley, relations)
hadley2 <- do.call("rbind", hadley2)

#Eliminate possible duplicates
edges <- tbl_df(distinct(hadley2))  
edges

## Source: local data frame [139 x 2]
## 
##      package1   package2
## 1  clusterfly     rggobi
## 2       ggmap    ggplot2
## 3    nullabor    ggplot2
## 4    tourrGui      tourr
## 5   bigrquery       httr
## 6   bigrquery assertthat
## 7   bigrquery      dplyr
## 8       broom       plyr
## 9       broom      dplyr
## 10      broom      tidyr
## ..        ...        ...
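
To make the cleaning steps inside relations() more concrete, here is a minimal sketch applied to a single Depends string (the one listed for abctools in the table printed earlier). The names deps, parts and known are only for illustration, and known is a made-up stand-in for the packages vector:

```r
# One "Depends" entry, copied from the abctools row of the CRAN table
deps <- "R (>= 2.10), abc, abind, parallel, plyr"

parts <- unlist(strsplit(deps, ","))     # split on commas
parts <- gsub("\\(.*\\)", "", parts)     # drop version constraints such as (>= 2.10)
parts <- gsub("[[:space:]]", "", parts)  # drop spaces and line breaks
parts

# Keep only the names found in a (hypothetical) vector of Hadleyverse packages
known <- c("plyr", "dplyr", "ggplot2")
parts[parts %in% known]
```

In this sketch only plyr survives the last filter, which corresponds to an abctools-plyr pair in the relations() output when abctools itself is one of the selected packages.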

Now that we have a list of relations, we will use the igraph package to obtain our map. First we will create the igraph object, define the graphical properties of the network, and discover communities (clusters of nodes). A node will be bigger the more connected it is to other packages, and the communities will be identified with colours.

# Create the igraph object based on the "edges" data frame
g <- graph.data.frame(edges, vertices = packages, directed = FALSE)

# Edge properties
E(g)$arrow.width <- 0 # Arrowheads are not displayed, but that can change in the future
E(g)$curved <- 0.2 # Make edges curved
E(g)$color  <- "#BFBFBF"
E(g)$width  <- 10

# Vertex properties
V(g)$label.family <- "sans" # Label font family
V(g)$label.cex <- 3 # Label font size proportional to 12
V(g)$label.color <- "#333333" # Label font color
V(g)$label.font <- 2 # 1 plain, 2 bold, 3 italic, 4 bold and italic
V(g)$size <- degree(g, loops = FALSE) # Size proportional to degree (the graph is undirected)
cl <- optimal.community(g) # Find communities in the network

# Color of vertices based on communities
community_colors <- c("#E2D200", "#BFBFBF", "#46ACC8", "#E58601", rep("#BFBFBF", 6))
V(g)$color <- community_colors[cl$membership]
V(g)$frame.color <- community_colors[cl$membership]
set.seed(123)
layout <- layout.kamada.kawai(g)

There are several possible layouts to plot the network. I chose one which seems to separate the communities correctly, but you can try different configurations by checking this. Finally, we save the plot in a PNG file, which I reproduce below:

png(filename="hadleyverse.png", width=2*1920, height=2*1080) #call the png writer
plot(g, margin=-0.1, layout=layout)
dev.off()

[Figure: hadleyverse.png, the network map of the Hadleyverse]

We detected three communities or groups of packages, which I will call “systems” of the Hadleyverse, and a set of isolated dots representing the packages which are not related to any other, which I will call “comets”.
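
To see how node size, communities and “comets” come about, here is a sketch on a toy graph. The edge list and vertex names are made up, and cluster_fast_greedy() is swapped in for optimal.community() only because it is quick and does not need igraph to be built with GLPK support; on real data it may return different groups.

```r
library(igraph)

# Hypothetical toy network: two triangles joined by one edge, plus an isolated node "g"
toy_edges <- data.frame(
  from = c("a", "a", "b", "d", "d", "e", "c"),
  to   = c("b", "c", "c", "e", "f", "f", "d"),
  stringsAsFactors = FALSE
)
toy_nodes <- data.frame(name = c("a", "b", "c", "d", "e", "f", "g"),
                        stringsAsFactors = FALSE)
toy <- graph_from_data_frame(toy_edges, directed = FALSE, vertices = toy_nodes)

# Degree: the number of edges touching each node (used above as the node size)
deg <- degree(toy)
deg

# Nodes with no edges at all are the "comets"
comets <- names(deg)[deg == 0]
comets

# Community detection (a different algorithm from the one used in the post)
comm <- cluster_fast_greedy(toy)
membership(comm)
```

The membership vector plays the same role as cl$membership in the code above: it is what gets mapped to colours.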

The “1st generation of graphics and data transformation” system

Main stars: ggplot2, plyr

This is the system of the “classic” tools of the Hadleyverse, with two main groups: the graphical tools led by “ggplot2”, and the data transformation tools led by “plyr”. Both plyr and ggplot2 have several years of development behind them, and they are well documented and widely used by the community. Recently, however, they have evolved into “dplyr” and “ggvis” respectively, so less development effort goes into them, and it can be expected that they will be completely replaced by the new generation in the medium term.

Tools for graphics:

  • classifly: Visualise high-dimensional classification boundaries with GGobi.
  • clusterfly: Visualising high-dimensional clustering algorithms.
  • DescribeDisplay: Turn GGobi graphics into publication quality R graphics.
  • geozoo: Zoo of Geometric Objects.
  • GGally: Ally to ggplot2.
  • ggmap: Plotting maps in R with ggplot2.
  • ggplot2: An implementation of the Grammar of Graphics in R.
  • ggsubplot: Embed subplots in ggplot2 graphics in R.
  • gtable: Tools to make it easier to work with “tables” of grobs.
  • meifly: An R package for exploring ensembles of (generalised) linear models.
  • nullabor: Easy graphical inference for R.
  • profr: An alternative profiling package for R.
  • rggobi: Interface between R and GGobi.
  • scales: Graphical scales.
  • tourr: An implementation of tour algorithms in R.
  • tourrGui: A Tour GUI using gWidgets.
  • wesanderson: A Wes Anderson color palette for R.

Tools for data manipulation:

  • itertools: Iterator Tools.
  • plyr: Splitting, applying and combining large problems into simpler problems.
  • reshape: Flexibly rearrange, reshape and aggregate data.
  • reshape2: Flexibly reshape data: a reboot of the reshape package.

Data packages:

  • HistData: A collection of data sets that are interesting and important in the history of statistics and data visualization.

The “tools for programmers and reproducibility” system

Main stars: testthat, knitr

This is the system of the R programmers, where the big star is “testthat”, developed, in the words of Hadley Wickham, “because I discovered I was spending too much time recreating bugs that I had previously fixed. While I was writing the original code or fixing the bug, I’d perform many interactive tests to make sure the code worked, but I never had a system for retaining these tests and running them, again and again.”
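
A minimal sketch of what such a retained test looks like with testthat. The function clean_name() is a hypothetical helper written only for this example; it is not part of any package mentioned here:

```r
library(testthat)

# Hypothetical helper: strip version constraints and whitespace from a package name
clean_name <- function(x) {
  x <- gsub("\\(.*\\)", "", x)   # remove anything in parentheses
  gsub("[[:space:]]", "", x)     # remove spaces and line breaks
}

# The interactive checks, written down once so they can be re-run at any time
test_that("version constraints and whitespace are stripped", {
  expect_equal(clean_name("R (>= 2.10)"), "R")
  expect_equal(clean_name(" plyr\n"), "plyr")
})
```

If a later change to clean_name() reintroduces an old bug, re-running the test reports it, which is the point of keeping the checks around.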

In this system you will find four main groups: tools to help develop R packages, to simplify writing good code, to manipulate specific classes of data, and for reproducibility.

Tools to make your life easier while creating R packages:

  • devtools: Tools to make an R developer’s life easier.
  • rappdirs: A port of AppDirs for R.
  • Rd2roxygen: package documentation.
  • roxygen2: In-source documentation for R.
  • rstudioapi: Safely access RStudio’s API (when available).

Tools to make your life easier while coding in general:

  • evaluate: A version of eval for R that returns more information about what happened.
  • testthat: An R package to make testing fun.
  • magrittr: R package to bring forward-piping features ala F#’s |> operator. Ceci n’est pas un pipe.
  • memoise: Easy memoisation for R.
  • plumbr: Mutable dynamic data structures for R.
  • pryr: Pry open the covers of R.

Tools to manipulate specific classes of data:

  • httr: A friendly http package for R.
  • lubridate: Make working with dates in R just that little bit easier.
  • rvest: Simple web scraping for R.
  • stringr: Wrapper for R string functions to make them more consistent, simpler and easier to use.

Tools for reproducibility:

  • knitr: A general-purpose tool for dynamic report generation in R.
  • rmarkdown: Dynamic Documents for R.

The “Data manipulation” system

Main star: dplyr

This is the system that eases data processing. It includes the new-generation versions of reshape2 and plyr, called tidyr and dplyr, which let you stop suffering with messy data and bring new verbs to analyse data such as filter, select, mutate, arrange and summarise. In combination with magrittr’s “%>%” pipe, it can completely change the way you work with data frames in R.
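
As a short sketch of those verbs chained with the pipe, here is an example on the built-in mtcars data set (not used elsewhere in this post). The column name kpl, the 0.425 conversion factor and the 3,000 lb threshold are choices made only for this illustration:

```r
library(dplyr)

heavy_by_cyl <- mtcars %>%
  filter(wt > 3) %>%                        # keep cars heavier than 3,000 lb
  select(mpg, cyl, wt) %>%                  # keep only the columns we need
  mutate(kpl = mpg * 0.425) %>%             # approximate km per litre
  group_by(cyl) %>%
  summarise(mean_kpl = mean(kpl)) %>%       # one row per number of cylinders
  arrange(desc(mean_kpl))                   # most economical group first

heavy_by_cyl
```

Each step takes a data frame and returns a data frame, so the pipeline reads top to bottom in the order the operations happen.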

Tools to manipulate data:

  • assertthat: User friendly assertions for R.
  • bigrquery: An interface to Google’s BigQuery from R.
  • broom: Convert statistical analysis objects from R into tidy format.
  • dplyr: Plyr specialised for data frames: faster & with remote datastores.
  • lazyeval: A strategy for doing non-standard evaluation (NSE) in R.
  • RMySQL: An R interface for MySQL.
  • RSQLite: R interface for SQLite.
  • tidyr: Easily tidy data with spread and gather functions.

Data packages:

  • fueleconomy: EPA fuel economy data in an R package.
  • nycflights13: All out-bound flights from NYC in 2013 + useful metadata.

Comets of the Hadleyverse

These are packages which don’t belong to any of the other systems; three of them are data packages:

  • babynames: All baby names data from the SSA.
  • fda: Functional Data Analysis.
  • hflights: Flights departing Houston in 2011.
  • namespace: Provide namespace management functions not (yet) present in base R.
  • nasaweather: Data from the 2006 ASA data expo.
  • plotrix: Plotrix library for R.

Work in progress

Besides the packages on CRAN, a look at the GitHub repositories of Hadley Wickham and RStudio shows which packages he is currently working on and that we may soon see on CRAN. Some promising examples are:

  • fastread: Faster ways to read data.
  • haven: Read SPSS, Stata and SAS files from R.
  • lineprof: Visualise line profiling results in R.
  • purrr: A FP package for R in the spirit of underscore.js.
  • rv2: Simple rv package to practice developing packages with.
  • tanglekit: R bindings for Bret Victor’s tangle.js.
  • xml2: Bindings to libxml2.

Conclusions

I hope you have enjoyed this short guide to the Hadleyverse, and that you can now travel through it with more confidence. You will surely find a package which suits your needs or makes your life easier, so just give them a chance! And in case you are wondering how it is possible for a single person to develop so many packages, maybe it is because he uses this, or maybe because he understands the power of frustration, or maybe because he just wants to impact the world by being useful. But maybe it is better to let him answer that question himself here.

Thank you for reading and thanks Hadley Wickham for the Hadleyverse!

42.

Edit (February 23rd)

I have received some feedback from Hadley Wickham about how he would classify his packages: ingest, data manipulation, visualization, data, packages, programming. This is a more accurate way to group them, but in this post I also wanted to show an automatic classification based on network analysis. (And in doing so, I get to include some R code for fun!)

I also didn’t separate the packages where H.W. is the main author from those where he is a contributor. One way to make that distinction could be to use the [aut] (Author) and [ctb] (Contributor) tags in the “Author” variable. Nevertheless, not all packages use this notation.
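
A minimal sketch of how that tag-based separation could look. The three Author strings and the package names pkgA, pkgB and pkgC are invented for the example, and the regular expression assumes the role tag comes immediately after the name:

```r
# Hypothetical Author strings: two with role tags, one without
authors <- c(
  pkgA = "Hadley Wickham [aut, cre], RStudio [cph]",
  pkgB = "Jane Doe [aut, cre], Hadley Wickham [ctb]",
  pkgC = "Hadley Wickham, John Smith"
)

# Capture the tag that directly follows "Hadley Wickham", if there is one
pattern <- ".*Hadley[[:space:]]+Wickham[[:space:]]*\\[([^]]*)\\].*"
roles <- sub(pattern, "\\1", authors)
roles[roles == authors] <- NA   # sub() returns the input unchanged when nothing matches

# Classify each package by the role found
role <- ifelse(is.na(roles), "unknown",
        ifelse(grepl("aut", roles), "author",
        ifelse(grepl("ctb", roles), "contributor", "other")))
role
```

Packages without tags end up as “unknown”, which is the limitation mentioned above: the notation is not used everywhere.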

Full comments from Hadley Wickham below, thank you!