class: center, middle, inverse, title-slide

# Efficient R programming:
## tips, tricks, and some interesting packages
### Adolfo Alvarez, Analyx
### 2019-05-23

---
class: center, middle

# Why R?

---
class: center, middle

# Who am I?

---
class: center, middle

# Who are we?

---
background-image: url("img/AX2018.jpg")
background-size: cover

---
background-image: url("img/jobs.png")
background-size: cover

---
class: center, middle

# 12.10.2019
### https://www.facebook.com/analyx.company/

---
class: inverse, center, middle

# I. The art of reusable code

![](img/reuse.gif)

---
# Code that you can reuse is code you don't need to write again!

Reusable code is:

- Easy to understand
- Easy to share
- Easy to adapt

---
# Easy to understand

- Use proper names
- Comment your code!
- Use a code style

> Does making code efficient conflict with wanting to make code easy to understand? Making your code easy to understand often leads to code that is well architected and easy to test.

---
# The Fundamental Theorem of Readability

> Code should be written to minimize the time it would take for someone else to understand it.
>
> --- The Art of Readable Code

--

That someone else can be you after a while!

--

.center[
<img src="img/theart.png" width="40%" />
]

---
class: inverse, center, middle

# II. Code style

![](img/style.gif)

---
# Naming objects

```r
function1 <- function(v){
  retval <- 0
  for (i in 1:length(v)){
    retval = retval + v[i]*v[i]
  }
  return(sqrt(retval))
}
```

What does this function do?

--

The names `function1` and `retval` don't carry much information. Instead, use names that describe what the function does and what the variable holds!

---
# Naming objects

```r
euclidean_norm <- function(v){
  sum_squares <- 0
  for (i in 1:length(v)){
    sum_squares = sum_squares + v[i]*v[i]
  }
  return(sqrt(sum_squares))
}
```

--

## We are not worrying about efficiency yet!

---
# Some recommendations for naming stuff in R

- Variable and function names should use only lowercase letters, numbers, and `_`.
- Use underscores (`_`) (so-called snake case) to separate words within a name.

```r
# Good
day_one
day_1

# Bad
DayOne
dayone
```

---
# Some recommendations for naming stuff in R

Generally, variable names should be nouns and function names should be verbs. Strive for names that are concise and meaningful (this is not easy!).

```r
# Good
day_one

# Bad
first_day_of_the_month
djm1
```

---
# Some recommendations for naming stuff in R

Where possible, avoid re-using names of common functions and variables. This will cause confusion for the readers of your code.

```r
# Bad
T <- FALSE
c <- 10
mean <- function(x) sum(x)
```

---
# Some recommendations for naming stuff in R

- R has strict rules about what constitutes a valid name.
- A syntactic name must consist of letters, digits, `.` and `_` but can’t begin with `_` or a digit.
- Additionally, you can’t use any of the reserved words like `TRUE`, `NULL`, `if`, and `function` (see the complete list in `?Reserved`).
- A name that doesn’t follow these rules is a non-syntactic name; if you try to use one, you’ll get an error:

```r
_abc <- 1
if <- 10
```

It’s possible to override these rules and use any name, i.e., any sequence of characters, by surrounding it with backticks:

```r
`_abc` <- 1
`if` <- 10
```

---
# Two tricks for changing names

- In your code: the "Rename in Scope" function of RStudio

```r
function1 <- function(v){
  sum_of_squares <- 0
  for (i in 1:length(v)){
    sum_of_squares = sum_of_squares + v[i]*v[i]
  }
  return(sqrt(sum_of_squares))
}
```

---
# Two tricks for changing names

- In your data: the janitor package functions clean_names() and make_clean_names()

```r
library(janitor)
ugly_names <- c("var 1", "var 20%", "var...3", "vár 4")
make_clean_names(ugly_names)
```

```
## [1] "var_1"          "var_20_percent" "var_3"          "var_4"
```

---
# A harder example: what does this function do?
```r
function1<-function(p1=5,p2=1,p3=2){output<-matrix(ncol=p1,nrow=p1,0)
for(row in 1:(p1-1)){output[row,row-1] <-p3
output[row,row+1] <-p3}
output[row-1, row] <-p3
output[row+1, row] <-p3
diag(output)<-p2
return(output)
}
```

--

```r
function1(5,4,3)
```

```
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    4    3    0    0    0
## [2,]    3    4    3    0    0
## [3,]    0    3    4    3    0
## [4,]    0    0    3    4    3
## [5,]    0    0    0    3    4
```

---
# Good names are not enough

```r
create_tridiagonal<-function(n=5,d1=1,d2=2){output_matrix<-matrix(ncol=n,nrow=n,0)
for(row in 1:(n-1)){output_matrix[row,row-1] <-d2
output_matrix[row,row+1] <-d2}
output_matrix[row-1, row] <-d2
output_matrix[row+1, row] <-d2
diag(output_matrix)<-d1
return(output_matrix)
}
```

--

```r
create_tridiagonal(5,4,3)
```

```
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    4    3    0    0    0
## [2,]    3    4    3    0    0
## [3,]    0    3    4    3    0
## [4,]    0    0    3    4    3
## [5,]    0    0    0    3    4
```

---
class: center, middle

# Coding style

>Good coding style is like correct punctuation: you can manage without it, butitsuremakesthingseasiertoread.

.footnote[
[1] https://style.tidyverse.org/
]

---
# Coding style

- Names
- Spacing (e.g. `x[, 1]` rather than `x[,1]`)
- Indenting (e.g. tabs, 2/4 spaces)
- Organization (e.g. load all libraries at the start rather than loading them sequentially)
- Documentation (e.g. use comments to record important findings and analysis decisions rather than leaving commented-out code)

https://beautifyrstats.netlify.com

---
# More info about coding style

https://style.tidyverse.org/

Two R packages support this style guide:

- styler allows you to interactively restyle selected text, files, or entire projects. It includes an RStudio add-in, the easiest way to re-style existing code.
- lintr performs automated checks to confirm that you conform to the style guide.
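
A minimal sketch of restyling code from a script instead of the RStudio add-in, assuming the styler package is installed. `messy_code` is a made-up example string, and `style_text()` is the styler function that restyles code passed as text:

```r
# Hypothetical, badly formatted one-liner used only for illustration
messy_code <- "euclidean_norm<-function(v){sqrt(sum(v*v))}"

# Restyle only when styler is available (assumption: it may not be installed)
if (requireNamespace("styler", quietly = TRUE)) {
  styled_code <- styler::style_text(messy_code)
  print(styled_code)
}
```

Styling should change only the layout, so the restyled function is expected to behave exactly like the original.
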
---
# Demo of the styler package

```r
create_tridiagonal<-function(n=5,d1=1,d2=2){output_matrix<-matrix(ncol=n,nrow=n,0)
for(row in 1:(n-1)){output_matrix[row,row-1] <-d2
output_matrix[row,row+1] <-d2}
output_matrix[row-1, row] <-d2
output_matrix[row+1, row] <-d2
diag(output_matrix)<-d1
return(output_matrix)
}
```

```r
create_tridiagonal <- function(n = 5, d1 = 1, d2 = 2) {
  output_matrix <- matrix(ncol = n, nrow = n, 0)
  for (row in 1:(n - 1)) {
    output_matrix[row, row - 1] <- d2
    output_matrix[row, row + 1] <- d2
  }
  output_matrix[row - 1, row] <- d2
  output_matrix[row + 1, row] <- d2
  diag(output_matrix) <- d1
  return(output_matrix)
}
```

---
# Demo of the lintr package

```r
lintr::lint("lynter_example.R")
```

---
class: inverse, center, middle

# III. Measuring the efficiency of your code

![](img/speed.gif)

---
# Measuring efficiency

- Readability is a subjective measure, and we have already covered it.
- Most of the time you will easily know if your code is working or not.
- Test your code! (shinytest, testthat...)

But today we will focus on:

- Memory
- Time

---
# Memory

By default, R keeps all of your data in RAM, so your analysis is limited by the available memory. What can you do if you run short of memory?

.pull-left[
- Buy more RAM
- Ask your company/university/city for permission to use more powerful servers (e.g. http://www.man.poznan.pl/online/pl/)
- Use cloud providers
- Use big data packages (bigmemory, Matrix)
]

--

.pull-right[
OR

- Optimize your code so that some memory-intensive calculations are not needed
]

---
# How to measure memory in R

```r
a <- rnorm(1000000)
print(object.size(a), units = "Mb")
```

```
## 7.6 Mb
```

```r
library(pryr)
```

```
## Registered S3 method overwritten by 'pryr':
##   method      from
##   print.bytes Rcpp
```

```r
object_size(a)
```

```
## 8 MB
```

```r
b <- a
object_size(b)
```

```
## 8 MB
```

What will be the result of `object_size(a, b)`?
---
# How to measure memory in R

```r
object_size(a,b)
```

```
## 8 MB
```

```r
b[1] <- 0
object_size(a,b)
```

```
## 16 MB
```

```r
mem_used()
```

```
## 57.3 MB
```

```r
mem_change(x <- 1:1e6)
```

```
## 784 B
```

```r
mem_change(rm(x))
```

```
## 384 B
```

---
# Garbage collector

Do any of you use the gc() function?

--

![](img/excelent.gif)

---
# How fast is your computer?

```r
install.packages("benchmarkme")
library("benchmarkme")
```

The benchmarkme package has a few useful functions for extracting system specs:

- RAM: get_ram()
- CPUs: get_cpu()
- BLAS library: get_linear_algebra()
- Is byte compiling enabled: get_byte_compiler()
- General platform info: get_platform_info()
- R version: get_r_version()

Try yours!

---
# Perform tests and compare with other machines

```r
res = benchmark_std(runs = 3)
rank_results(res)
plot(res)
```

---
# Measuring time

- As usual, there are several ways to measure the execution time of your code.
- Base R provides Sys.time() and system.time()

```r
start_time <- Sys.time()
Sys.sleep(3)
end_time <- Sys.time()
end_time - start_time
```

```
## Time difference of 3.007137 secs
```

```r
system.time({ Sys.sleep(3)})
```

```
##    user  system elapsed
##   0.000   0.000   3.003
```

- “User CPU time” gives the CPU time spent by the current process and “system CPU time” gives the CPU time spent by the kernel (the operating system) on behalf of the current process. (Post by William Dunlap)

---
# Other alternatives for measuring time

The tictoc package may be familiar to those coming from the MATLAB world

```r
library(tictoc)
tic("sleeping")
Sys.sleep(3)
toc()
```

```
## sleeping: 3.007 sec elapsed
```

---
# Measuring the time: Benchmark packages

- The bench package

```r
x <- runif(100)
(lb <- bench::mark(
  sqrt(x),
  x ^ 0.5
))
```

```
## # A tibble: 2 x 6
##   expression      min   median `itr/sec` mem_alloc `gc/sec`
##   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
## 1 sqrt(x)    435.98ns 467.87ns  2019635.      848B        0
## 2 x^0.5        8.39µs   9.31µs   107508.      848B        0
```

---
# Measuring the time: Benchmark packages

- The microbenchmark package

```r
library(microbenchmark)
x <- runif(100)
microbenchmark(
  sqrt(x),
  x ^ 0.5
)
```

```
## Unit: nanoseconds
##     expr  min     lq    mean median     uq   max neval
##  sqrt(x)  459  623.0  730.57  700.5  777.0  2887   100
##    x^0.5 8835 9012.5 9322.02 9079.5 9244.5 29433   100
```

---
# Measuring the time: Benchmark packages

- And the rbenchmark package

```r
library(rbenchmark)
x <- runif(100)
benchmark(
  "method1" = sqrt(x),
  "method2" = x ^ 0.5
)
```

```
##      test replications elapsed relative user.self sys.self user.child
## 1 method1          100   0.001        1     0.001        0          0
## 2 method2          100   0.002        2     0.001        0          0
##   sys.child
## 1         0
## 2         0
```

---
class: inverse, center, middle

# IV. How to improve?

![](img/turtle.gif)

---
# Profiling

- If we want to find out which part of our code is the most time-consuming, we can use profiling tools.
- Again, there are many options available, but I recommend the profvis package.
- You can profile a piece of code by using

```r
profvis({
  #The code to be evaluated
})
```

## Two tips

- If the code runs too fast, the profiler has nothing to sample and profvis will not produce a profile!
- If you want to profile a single function, save that function in an .R file and profile it with:

```r
source("my_function.R")
profvis({
  myfunction(x)
})
```

---
# What are the two main time-consuming parts of our "create_tridiagonal" function?

```r
create_tridiagonal <- function(n = 5, d1 = 1, d2 = 2) {
  output_matrix <- matrix(ncol = n, nrow = n, 0)
  for (row in 1:(n - 1)) {
    output_matrix[row, row - 1] <- d2
    output_matrix[row, row + 1] <- d2
  }
  output_matrix[row - 1, row] <- d2
  output_matrix[row + 1, row] <- d2
  diag(output_matrix) <- d1
  return(output_matrix)
}
```

```r
source("create_tridiagonal.R")
profvis::profvis(create_tridiagonal(5000,1,-1))
```

---
# What are the two main time-consuming parts of our "create_tridiagonal" function?
![](img/profile_example.png)

---
# Three things to improve

- Create the matrix() and assign the diag() in one step
- Inside the for() loop, fill the upper and lower diagonals in one step
- Use `i` instead of `row` as the loop counter, because `row()` is a base R function

```r
create_tridiagonal2 <- function(n = 5, d1 = 1, d2 = 2) {
  # We create the matrix and assign the diagonal in one step
  output_matrix <- diag(d1, n)
  # "row" is a function in base R, so I prefer "i" as the counter
  # We assign the d2 value in one step instead of two
  for (i in 1:(n - 1)) {
    output_matrix[i, c(i - 1, i + 1)] <- d2
  }
  output_matrix[c(i - 1, i + 1), i] <- d2
  return(output_matrix)
}
```

---
# Now let's compare

```r
library(rbenchmark)
benchmark(m1 <- create_tridiagonal(5000,1,-1),
          m2 <- create_tridiagonal2(5000,1,-1),
          replications = 3)
```

```
##                                     test replications elapsed relative
## 1  m1 <- create_tridiagonal(5000, 1, -1)            3   0.672    2.375
## 2 m2 <- create_tridiagonal2(5000, 1, -1)            3   0.283    1.000
##   user.self sys.self user.child sys.child
## 1     0.274    0.399          0         0
## 2     0.132    0.152          0         0
```

---
class: inverse, middle, center

# Conclusions

![](img/code.png)

---
class: center

# The slides are available at
## [http://adolfoalvarez.cl/efficient_r/Pazur.html](http://adolfoalvarez.cl/efficient_r/Pazur.html)

<iframe width="800" height="450" src="http://adolfoalvarez.cl/efficient_r/Pazur.html" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>

---
class: inverse, middle, center

# Thank you!
## Keep in touch:
## www.analyx.com / facebook.com/analyx.company
## adolfo.alvarez@analyx.com
## Twitter: @adolfoalvarez