Revolution Analytics recently released Revolution R Open (RRO), a downstream distribution of R built against Intel’s Math Kernel Library (MKL). The announcement mentions that comparable improvements are observed on Mac OS X, where the ATLAS BLAS library is used. A reader in the comments section also expressed hesitation about the results for lack of a comparison with ATLAS and OpenBLAS. Swapping in a different BLAS implementation is documented in the R Installation and Administration manual, and has been benchmarked in the past here and here. Now, as an avid R user, I *should* be using a faster build of R if one exists and is easy to obtain (install/compile), especially if the improvements are up to 40% as reported by Domino Data Lab. I decided to follow the framework set out by this post to compare timings for the different versions of R on a `t2.micro` instance on Amazon EC2 running Ubuntu 14.04.

First, install R along with the various BLAS and LAPACK libraries, and download the benchmark script:

```
sudo apt-get install libblas3gf libopenblas-base libatlas3gf-base liblapack3gf libopenblas-dev liblapack-dev libatlas-dev R-base R-base-dev
wget http://r.research.att.com/benchmarks/R-benchmark-25.R
echo "install.packages('SuppDists', dep=TRUE, repo='http://cran.stat.ucla.edu')" | sudo R --vanilla  ## needed for R-benchmark-25.R
```

You can switch which BLAS and LAPACK libraries are used via the following commands:

```
sudo update-alternatives --config libblas.so.3    ## select from 3 versions of BLAS: blas, atlas, openblas
sudo update-alternatives --config liblapack.so.3  ## select from 2 versions of LAPACK: lapack and atlas-lapack
```

Run `R`, issue `Ctrl-z` to send the process to the background, and verify that the selected BLAS and LAPACK libraries are the ones R has actually loaded:

```
ps aux | grep R                                      ## find the process id for R
lsof -p PROCESS_ID_JUST_FOUND | grep 'blas\|lapack'
```

Now run the benchmarks on different versions:

```
# selection: libblas + lapack
cat R-benchmark-25.R | time R --slave
...
171.71user 1.22system 2:53.01elapsed 99%CPU (0avgtext+0avgdata 425068maxresident)k
4960inputs+0outputs (32major+164552minor)pagefaults 0swaps
173.01

# selection: atlas + lapack
cat R-benchmark-25.R | time R --slave
...
69.05user 1.16system 1:10.27elapsed 99%CPU (0avgtext+0avgdata 432620maxresident)k
2824inputs+0outputs (15major+130664minor)pagefaults 0swaps
70.27

# selection: openblas + lapack
cat R-benchmark-25.R | time R --slave
...
70.69user 1.19system 1:11.93elapsed 99%CPU (0avgtext+0avgdata 429136maxresident)k
1592inputs+0outputs (6major+131181minor)pagefaults 0swaps
71.93

# selection: atlas + atlas-lapack
cat R-benchmark-25.R | time R --slave
...
68.02user 1.14system 1:09.21elapsed 99%CPU (0avgtext+0avgdata 432240maxresident)k
2904inputs+0outputs (12major+124761minor)pagefaults 0swaps
69.93
```

As can be seen, there’s about a **60% improvement** using OpenBLAS or ATLAS over the standard libblas+lapack. What about MKL? Let’s test RRO:

```
sudo apt-get remove R-base R-base-dev
wget http://mran.revolutionanalytics.com/install/RRO-8.0-Beta-Ubuntu-14.04.x86_64.tar.gz
tar -xzf RRO-8.0-Beta-Ubuntu-14.04.x86_64.tar.gz
./install.sh
# check that it is using a different version of blas and lapack using lsof again
cat R-benchmark-25.R | time R --slave
...
51.19user 0.98system 0:52.24elapsed 99%CPU (0avgtext+0avgdata 417840maxresident)k
2208inputs+0outputs (11major+131128minor)pagefaults 0swaps
52.24
```

This is a **70% improvement** over the standard libblas+lapack version, and a **25% improvement** over the ATLAS/OpenBLAS version. This is quite a substantial improvement!
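The percentages quoted above can be reproduced from the elapsed times (a quick sanity check in Python; the dictionary keys are just labels for the runs above):

```python
# Elapsed times (seconds) from the benchmark runs above.
timings = {
    'blas + lapack': 173.01,
    'atlas + lapack': 70.27,
    'openblas + lapack': 71.93,
    'atlas + atlas-lapack': 69.93,
    'MKL (RRO)': 52.24,
}

def improvement(baseline, new):
    """Percent reduction in elapsed time relative to a baseline."""
    return round((baseline - new) / baseline * 100)

# ~60% improvement of ATLAS over the reference BLAS:
print(improvement(timings['blas + lapack'], timings['atlas + lapack']))        # 59
# ~70% improvement of MKL over the reference BLAS:
print(improvement(timings['blas + lapack'], timings['MKL (RRO)']))             # 70
# ~25% improvement of MKL over ATLAS:
print(improvement(timings['atlas + atlas-lapack'], timings['MKL (RRO)']))      # 25
```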

## Python

Although I don’t use Python much for data analysis (I use it as a general-purpose language for everything else), I wanted to repeat similar benchmarks for numpy and scipy, as improvements have been documented. To do so, install numpy and scipy (pip compiles them from source) and download some benchmark scripts:

```
sudo pip install numpy
less /usr/local/lib/python2.7/dist-packages/numpy/__config__.py  ## openblas?
sudo pip install scipy

# test different blas
python
ps aux | grep python
lsof -p 20812 | grep 'blas\|lapack'  ## change to the process id just found

wget https://gist.github.com/osdf/3842524/raw/df01f7fa9d849bec353d6ab03eae0c1ee68f1538/test_numpy.py
wget https://gist.github.com/osdf/3842524/raw/22e21f5d57a9526cbcd9981385504acdc7bdc788/test_scipy.py
```
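As an alternative to `lsof`, the same information can be read from within the process itself by scanning `/proc/<pid>/maps` on Linux. This helper is my own sketch, not from the original post:

```python
import re

def loaded_libraries(pattern, pid='self'):
    """Return mapped files matching `pattern` for a process.

    Reads /proc/<pid>/maps (Linux only) -- the same file-mapping
    information that lsof reports.
    """
    libs = set()
    with open('/proc/%s/maps' % pid) as f:
        for line in f:
            fields = line.split()
            # The last field is the mapped file's path, when one exists.
            if len(fields) >= 6 and re.search(pattern, fields[-1]):
                libs.add(fields[-1])
    return sorted(libs)

# After `import numpy`, this should list the BLAS/LAPACK actually linked:
# print(loaded_libraries(r'blas|lapack'))
```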

Switch BLAS and LAPACK with `update-alternatives` as before. Results are as follows:

```
# selection: blas + lapack
time python test_numpy.py
FAST BLAS
version: 1.9.1
maxint: 9223372036854775807
dot: 0.214728403091 sec
real 0m1.253s  user 0m1.119s  sys 0m0.036s

time python test_scipy.py
cholesky: 0.166237211227 sec
svd: 3.56523122787 sec
real 0m19.183s  user 0m19.105s  sys 0m0.064s

# selection: atlas + lapack
time python test_numpy.py
FAST BLAS
version: 1.9.1
maxint: 9223372036854775807
dot: 0.211034584045 sec
real 0m1.132s  user 0m1.121s  sys 0m0.008s

time python test_scipy.py
cholesky: 0.0454761981964 sec
svd: 1.33822960854 sec
real 0m7.442s  user 0m7.346s  sys 0m0.084s

# selection: openblas + lapack
time python test_numpy.py
FAST BLAS
version: 1.9.1
maxint: 9223372036854775807
dot: 0.212402009964 sec
real 0m1.139s  user 0m1.130s  sys 0m0.004s

time python test_scipy.py
cholesky: 0.0431131839752 sec
svd: 1.09770617485 sec
real 0m6.227s  user 0m6.143s  sys 0m0.076s

# selection: atlas + atlas-lapack
time python test_numpy.py
FAST BLAS
version: 1.9.1
maxint: 9223372036854775807
dot: 0.217267608643 sec
real 0m1.162s  user 0m1.143s  sys 0m0.016s

time python test_scipy.py
cholesky: 0.0429849624634 sec
svd: 1.31666741371 sec
real 0m7.318s  user 0m7.213s  sys 0m0.092s
```

Here, focusing only on the SVD results, OpenBLAS yields about a **69% improvement** and ATLAS about a **62% improvement** over the reference BLAS. What about MKL? Well, a readily available version costs money, so I wasn’t able to test it.
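The SVD speedups can be checked the same way as the R timings:

```python
# SVD timings (seconds) from test_scipy.py above.
svd = {
    'blas + lapack': 3.56523122787,
    'atlas + lapack': 1.33822960854,
    'openblas + lapack': 1.09770617485,
}

def improvement(baseline, new):
    """Percent reduction in run time relative to a baseline."""
    return round((baseline - new) / baseline * 100)

print(improvement(svd['blas + lapack'], svd['openblas + lapack']))  # 69
print(improvement(svd['blas + lapack'], svd['atlas + lapack']))     # 62
```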

## Conclusion

Here are my take-aways:

- Using different BLAS/LAPACK libraries is *extremely* easy on Ubuntu; there is no need to compile anything, since you can install the libraries and select which version to use with `update-alternatives`.
- Install and use RRO (MKL) when possible, as it is the fastest.
- When that isn’t possible, use ATLAS or OpenBLAS. For example, we have AIX at work; getting R installed there is already a difficult task, so optimizing R is a low priority. However, if it’s possible to use OpenBLAS or ATLAS, use it. (Note: MKL is irrelevant here, as AIX runs on POWER CPUs.)
- For Python, use OpenBLAS or ATLAS.

For those who want to compile R with MKL themselves, check this. For those who want to do the same for Python, check this.

Finally, some visualizations to summarize the findings:

```
library(ggplot2)

# R results
timings <- c(173.01, 70.27, 71.93, 69.93, 52.24)
versions <- c('blas + lapack', 'atlas + lapack', 'openblas + lapack', 'atlas + atlas-lapack', 'MKL')
versions <- factor(versions, levels=versions)
d1 <- data.frame(timings, versions)
ggplot(data=d1, aes(x=versions, y=timings / max(timings))) +
  geom_bar(stat='identity') +
  geom_text(aes(x=versions, y=timings / max(timings),
                label=sprintf('%.f%%', timings / max(timings) * 100)), vjust=-.8) +
  labs(title='R - R-benchmark-25.R')
ggsave('R_blas+atlas+openblas+mkl.png')

# Python results
timings <- c(3.57, 1.34, 1.10, 1.32)
versions <- c('blas + lapack', 'atlas + lapack', 'openblas + lapack', 'atlas + atlas-lapack')
versions <- factor(versions, levels=versions)
d1 <- data.frame(timings, versions)
ggplot(data=d1, aes(x=versions, y=timings / max(timings))) +
  geom_bar(stat='identity') +
  geom_text(aes(x=versions, y=timings / max(timings),
                label=sprintf('%.f%%', timings / max(timings) * 100)), vjust=-.8) +
  labs(title='Python - test_scipy.py (SVD)')
ggsave('Python_blas+atlas+openblas.png')
```

## Comments

How did you verify that the t2.micro instance wasn’t throttled?

Not sure what that means, so I did not verify. I did run the code multiple times and got similar results, though.

You could replace lapack with openblas too, right? Instead of running openblas + lapack.

Which variant of OpenBLAS did you use (file names for openSUSE)?

- Serial library (libopenblas0)
- With OpenMP support (libopenblaso0)
- With threading support (libopenblasp0)

Could you compare OpenMP vs threading variant? That would be cool.

Ah sorry, no I looked again at your post and saw that you can only select “lapack and atlas-lapack” for lapack. On OpenSuse 13.2 I can also select all openblas variants there.

See http://aws.amazon.com/blogs/aws/low-cost-burstable-ec2-instances/. t2.micro instances get throttled down to 10% of a single vCPU’s capacity if you use 100% CPU for too long. You earn 6 “CPU credits” per hour, which accumulate for up to 24 hours; 1 credit = 1 minute of 100% CPU usage.
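A quick back-of-the-envelope check (my own arithmetic, based on the credit rules described in the comment above) suggests throttling was unlikely to affect these runs:

```python
# t2.micro credit model: 6 credits/hour, accumulating for up to 24 hours;
# 1 credit buys 1 minute of 100% CPU usage.
credits_per_hour = 6
max_accrual_hours = 24
max_credits = credits_per_hour * max_accrual_hours  # 144 minutes of full CPU

# Total elapsed time of the five R benchmark runs above, in minutes:
r_runs_seconds = 173.01 + 70.27 + 71.93 + 69.93 + 52.24
r_runs_minutes = r_runs_seconds / 60.0

print(max_credits)               # 144
print(round(r_runs_minutes, 1))  # 7.3

# A freshly started (or long-idle) instance has far more credit than needed:
assert r_runs_minutes < max_credits
```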
