The free, open, and proprietary flavors of R

Photo Credit: lonelyradio

R: A free software project.

R was announced to the world on August 4th, 1993, when Ross Ihaka sent the following email to the “S-news” email list:

About a year ago Robert Gentleman and I considered the problem of obtaining decent statistical software for our undergraduate Macintosh lab. After considering the options, we decided that the most satisfactory alternative was to write our own. We started by writing a small lisp interpreter. Next we expanded its data structures with atomic vector types and altered its evaluation semantics to include lazy evaluation of closure arguments and argument binding by tag as well as order. Finally we added some syntactic sugar to make it look somewhat like S. We call the result “R”.

As Prof. Ihaka recounts in the “R: Past and future” article, R only became Free Software in June 1995, when the source code was released under the Free Software Foundation's GNU General Public License (GPL). According to the FSF definition, “a program is free software, for you, a particular user, if:

  • You have the freedom to run the program as you wish, for any purpose.
  • You have the freedom to modify the program to suit your needs. (To make this freedom effective in practice, you must have access to the source code, since making changes in a program without having the source code is exceedingly difficult.)
  • You have the freedom to redistribute copies, either gratis or for a fee.
  • You have the freedom to distribute modified versions of the program, so that the community can benefit from your improvements.”

Free and non-free extensions (a.k.a Packages)

Under the conditions of the GPL, it is perfectly possible to create a commercial (i.e. non-free) package, as long as you are not modifying or including code from another package that is licensed under the GPL. The R Foundation clarifies its position regarding non-free packages in this statement.

Nevertheless, the use of FOSS (free and open source software) is encouraged on CRAN, and common options for developers include the GPL, MIT, and BSD licenses. Bioconductor and GitHub follow similar guidelines.

Let’s write some code to analyze the different licenses used by the packages submitted to CRAN:

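library(rvest)
library(dplyr)
library(ggvis)
# CRAN's index page of packages sorted by publication date
url <- "http://cran.r-project.org/web/packages/available_packages_by_date.html"
page <- read_html(url)  # html() was deprecated in favour of read_html() in later rvest versions
packages <- html_table(page)
packages <- as_tibble(packages[[1]])  # keep the first table found on the page
packages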

## Source: local data frame [6,050 x 3]
## 
##          Date    Package
## 1  2014-11-10         BH
## 2  2014-11-10  DepthProc
## 3  2014-11-10   EBglmnet
## 4  2014-11-10       GSIF
## 5  2014-11-10   ionflows
## 6  2014-11-10        ips
## 7  2014-11-10       ISBF
## 8  2014-11-10 lpSolveAPI
## 9  2014-11-10      mixlm
## 10 2014-11-10  ModelGood
## ..        ...        ...
## Variables not shown: Title (chr)

Now we have a data frame with the information of the available packages on CRAN. Next we add a “Licenses” column to store the license of each package:

obtainLicenses <- function(x){
  output <- rep("license", length(x))
  for (i in seq_along(x)){
    # Read the package's DESCRIPTION file, one connection per package
    Licenses <- readLines(paste0("http://cran.r-project.org/web/packages/", x[i], "/DESCRIPTION"))
    # Keep only the "License:" line and strip the field name
    Licenses <- Licenses[grep("^License:", Licenses)]
    Licenses <- gsub("License: ", "", Licenses)
    output[i] <- Licenses
  }
  return(output)
}
packages <- mutate(packages, Licenses=obtainLicenses(Package)) #Warning: Very slow!
packages %>%
  select(Package, Licenses)

## Source: local data frame [6,044 x 2]
## 
##              Package                  Licenses
## 1  EntropyEstimation                GPL (>= 3)
## 2         extraTrees        Apache License 2.0
## 3        MetaLandSim                GPL (>= 2)
## 4               nCal                GPL (>= 2)
## 5            rugarch                     GPL-3
## 6              spaMM                  CeCILL-2
## 7              dummy                GPL (>= 2)
## 8           geometry GPL (>= 3) + file LICENSE
## 9             LinCal                     GPL-2
## 10          miceadds              GPL (>= 2.0)
## ..               ...                       ...

I am not particularly proud of that code, since it hides a for loop and is quite slow. However, the readLines function is not vectorized, so I need to open one connection (text file) at a time. If somebody knows a better way to do it, keep it to yourself… or better, share it in the comments :).
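One possible alternative to matching lines by hand is base R's read.dcf, which understands the “Field: value” layout of DESCRIPTION files. The sketch below is only an illustration: parse_license and example_description are names made up for this example, and the DESCRIPTION text is a toy one rather than a real package's.

```r
# Hypothetical helper: extract the License field from the lines of a
# DESCRIPTION file using read.dcf() instead of grep() + gsub().
parse_license <- function(description_lines) {
  con <- textConnection(description_lines)
  on.exit(close(con))
  fields <- read.dcf(con, fields = "License")
  unname(fields[1, "License"])
}

# Toy DESCRIPTION content (not a real package)
example_description <- c(
  "Package: examplepkg",
  "Version: 0.1.0",
  "License: GPL (>= 2)"
)
parse_license(example_description)

# With network access, the CRAN package index already carries this field
# for every package, so no loop is needed (not run here):
# cran <- available.packages(repos = "https://cran.r-project.org")
# head(cran[, c("Package", "License")])
```

The commented-out lines rely on available.packages(), which returns a character matrix that includes a “License” column; whether it matches the scraped values exactly has not been checked here.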

We can find out how many different licenses are used in CRAN:

  length(unique(packages$Licenses))

## [1] 121

Or check the most common ones…

  packages %>%
  group_by(Licenses) %>%
  summarise(n=n()) %>%
  arrange(-n)

## Source: local data frame [121 x 2]
## 
##                       Licenses    n
## 1                   GPL (>= 2) 2468
## 2                        GPL-2 1308
## 3                        GPL-3  562
## 4                          GPL  377
## 5                   GPL (>= 3)  282
## 6           MIT + file LICENSE  157
## 7                GPL-2 | GPL-3  132
## 8                 GPL (>= 2.0)   71
## 9                       LGPL-3   66
## 10 BSD_3_clause + file LICENSE   49
## ..                         ...  ...
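For readers without dplyr at hand, the same two questions (how many distinct licenses, and which are the most common) can be sketched in base R. The licenses vector below is a small made-up sample, not the CRAN data.

```r
# Made-up sample of license strings, for illustration only
licenses <- c("GPL (>= 2)", "GPL-2", "GPL (>= 2)",
              "MIT + file LICENSE", "GPL-2", "GPL (>= 2)")

# Number of distinct licenses
n_distinct_licenses <- length(unique(licenses))
n_distinct_licenses

# Frequency of each license, most common first
counts <- sort(table(licenses), decreasing = TRUE)
counts
```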

We can group some licenses into a reduced set. To do so, we consider just the first word of each License:

group_lic <- function(x){
  gsub("(^[a-zA-Z]+).*","\\1",x)
}
packages <- packages %>%
  mutate(LicenseGroup=group_lic(Licenses))
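To see what the regular expression does, here is a small check of the same function on a few license strings taken from the tables above. The pattern keeps the leading run of letters and drops everything after it, so digits, underscores, hyphens, and spaces all end the group name.

```r
# Same function as in the post: keep only the leading run of letters
group_lic <- function(x){
  gsub("(^[a-zA-Z]+).*", "\\1", x)
}

samples <- c("GPL (>= 2)", "GPL-2 | GPL-3", "LGPL-3",
             "MIT + file LICENSE", "BSD_3_clause + file LICENSE",
             "CC BY-NC-SA 4.0", "Apache License 2.0")
groups <- group_lic(samples)
data.frame(License = samples, LicenseGroup = groups)
```

Note that “Apache License 2.0” is reduced to “Apache” and “BSD_3_clause” to “BSD”, which is why some manual corrections are still needed afterwards.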

We make some corrections with acronyms…

  packages <- packages %>%
  mutate(LicenseGroup=ifelse(LicenseGroup=="GNU", "GPL", LicenseGroup),
         LicenseGroup=ifelse(LicenseGroup=="Mozilla", "MPL", LicenseGroup),
         LicenseGroup=ifelse(LicenseGroup=="Common", "CPL", LicenseGroup),
         LicenseGroup=ifelse(LicenseGroup=="FreeBSD", "BSD", LicenseGroup)
         )
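The chain of ifelse calls can also be written as a lookup in a named vector, which is easier to extend when more corrections are needed. This is just a sketch in base R: acronyms and recode_group are names invented for this example, and the mapping simply mirrors the four corrections above.

```r
# Lookup table: first word found in the license string -> acronym to use
acronyms <- c(GNU = "GPL", Mozilla = "MPL", Common = "CPL", FreeBSD = "BSD")

# Hypothetical helper: replace the values found in the lookup table,
# leave all other values untouched
recode_group <- function(g) {
  hit <- g %in% names(acronyms)
  g[hit] <- unname(acronyms[g[hit]])
  g
}

recoded <- recode_group(c("GNU", "GPL", "Mozilla", "MIT", "FreeBSD"))
recoded
```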

And we plot.

  all_values <- function(x) {
  if(is.null(x)) return(NULL)
  paste0(c("License", "Packages"), ": ", format(x)[c(1,3)], collapse = "<br/>")
}

packages %>%
  group_by(LicenseGroup) %>%
  summarise(n=n()) %>%
  mutate(LG2=factor(LicenseGroup),
         LG2=reorder(LG2, n, FUN=function(x) -x)
         ) %>%
  ggvis(~LG2, ~n) %>%
  layer_bars() %>% add_tooltip(all_values, "hover")

## Warning: Can't output dynamic/interactive ggvis plots in a knitr document.
## Generating a static (non-dynamic, non-interactive) version of the plot.

Figure (ggvis_plot): bar chart of the number of packages per license group.

The vast majority of packages chose the GPL, LGPL, or other free software licenses. There are some with restrictions, though, such as those that use the NC (NonCommercial) option of the Creative Commons licenses:

  packages %>%
  filter(LicenseGroup=="CC",
         grepl("NC",Licenses)) %>%
  select(Package, Licenses)

## Source: local data frame [15 x 2]
## 
##                    Package                       Licenses
## 1                sdcTarget                   CC BY-NC 4.0
## 2                 nettools                CC BY-NC-SA 4.0
## 3                RTriangle                CC BY-NC-SA 4.0
## 4       gettingtothebottom                CC BY-NC-SA 4.0
## 5                cvxclustr                CC BY-NC-SA 4.0
## 6                      FGN                CC BY-NC-SA 3.0
## 7             spikeSlabGAM                CC BY-NC-SA 3.0
## 8                     isa2                CC BY-NC-SA 3.0
## 9            spatialkernel CC BY-NC-SA 3.0 + file LICENSE
## 10            DATforDCEMRI                CC BY-NC-SA 3.0
## 11                nutshell             CC BY-NC-ND 3.0 US
## 12 nutshell.audioscrobbler                CC BY-NC-SA 3.0
## 13           nutshell.bbdb             CC BY-NC-ND 3.0 US
## 14                    tnet    CC BY-NC 3.0 + file LICENSE
## 15                   r2stl                CC BY-NC-SA 3.0
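The filter combines two conditions: the license must belong to the CC group and its text must contain “NC”. A minimal base R sketch of the same logic, on a handful of sample strings (two of them taken from the table above, the others added only for contrast):

```r
# Sample license strings, for illustration only
licenses <- c("CC BY-NC-SA 4.0", "CC BY 4.0", "GPL-2",
              "CC BY-NC 3.0 + file LICENSE")

is_cc <- grepl("^CC", licenses)               # Creative Commons family
is_nc <- grepl("NC", licenses, fixed = TRUE)  # NonCommercial clause

nc_licenses <- licenses[is_cc & is_nc]
nc_licenses
```

Checking the group as well as the “NC” substring avoids picking up unrelated licenses whose name merely happens to contain those two letters.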

Other special cases are those given by a “file LICENSE”, where the license is specified in a separate file (Example), or the “Unlimited” license (Example), which is actually not unlimited but subject to national laws (under which, in most cases, “All rights reserved” is implied). For those packages it is better to contact the authors before modifying the code or using it for commercial purposes.

The other open and commercial flavors of R

Two of the problems that companies may find when using R for business are its lack of commercial support and the fact that it is not ready for big data (at least not directly). For this reason, a few firms offer modified, enterprise-oriented versions of R, generally under commercial licenses.

Some of these versions are:

To finish

The fact that R is free software has been crucial to its development and adoption in the data analysis community. When the code is open, anybody can verify that a certain algorithm is well implemented, contribute improvements, or fix bugs quickly. This spirit of collaboration, promoted by open source/free software licences, is inherited by R users, who have helped to improve the software by developing packages, so that almost any theoretical development in statistics or data analysis becomes available to the world almost immediately.

Nevertheless, those who require technical support or seamless integration with big data systems can find high-quality solutions from reputable companies such as Revolution Analytics (Revolution R) or Oracle. The adoption of R in the business world is now a reality and, besides the examples mentioned, many other big IT/analytics firms recommend integrating R with their systems. But we will talk about that integration in a future post…

Thank you for reading, and if you like the post please let your best friends know!

Edit 2014-11-10

  • Thanks to Hadley Wickham for mentioning that an .rds file with information on the CRAN packages is available.

  • Thanks to Noam Ross for mentioning some R versions not included in the first version of this post. A comparison of some R versions can be found here