Optimized R and Python: standard BLAS vs. ATLAS vs. OpenBLAS vs. MKL

Linux

Python

Author

Vinh Nguyen

Published

November 11, 2014

Revolution Analytics recently released Revolution Open R, a downstream version of R built using Intel's Math Kernel Library (MKL). The post mentions that comparable improvements are observed on Mac OS X where the ATLAS blas library is used. A reader also expressed his hesitation in the Comments section for a lack of a comparison with ATLAS and OpenBLAS. This concept of using a different version of BLAS is documented in the R Administration manual, and has been compared in the past here and here. Now, as an avid R user, I should be using a more optimal version of R if it exists and is easy to obtain (install/compile), especially if the improvements are up to 40% as reported by the Domino Data Lab. I decided to follow the framework set out by this post to compare timings for the different versions of R on a t2.micro instance on Amazon EC2 running Ubuntu 14.04.

First, I install R and the various versions of BLAS and lapack and download the benchmark script:

sudo apt-get install libblas3gf libopenblas-base libatlas3gf-base liblapack3gf libopenblas-dev liblapack-dev libatlas-dev R-base R-base-dev
wget http://r.research.att.com/benchmarks/R-benchmark-25.R
echo "install.packages('SuppDists', dep=TRUE, repo='http://cran.stat.ucla.edu')" | sudo R --vanilla ## needed for R-benchmarks-25.R

One could switch which blas and lapack library are used via the following commands:

sudo update-alternatives --config libblas.so.3 ## select from 3 versions of blas: blas, atlas, openblas
sudo update-alternatives --config liblapack.so.3 ## select from 2 versions of lapack: lapack and atlas-lapack

Run R, issue Ctrl-z to send the process to the background, and see that the selected BLAS and lapack libraries are used by R by:

ps aux | grep R ## find the process id for R
lsof -p PROCESS_ID_JUST_FOUND | grep 'blas\|lapack'

Now run the benchmarks on different versions:

# selection: libblas + lapack
cat R-benchmark-25.R | time R --slave
...
171.71user 1.22system 2:53.01elapsed 99%CPU (0avgtext+0avgdata 425068maxresident)k
4960inputs+0outputs (32major+164552minor)pagefaults 0swaps
173.01
# selection: atlas + lapack
cat R-benchmark-25.R | time R --slave
...
69.05user 1.16system 1:10.27elapsed 99%CPU (0avgtext+0avgdata 432620maxresident)k
2824inputs+0outputs (15major+130664minor)pagefaults 0swaps
70.27
# selection: openblas + lapack
cat R-benchmark-25.R | time R --slave
...
70.69user 1.19system 1:11.93elapsed 99%CPU (0avgtext+0avgdata 429136maxresident)k
1592inputs+0outputs (6major+131181minor)pagefaults 0swaps
71.93
# selection: atlas + atlas-lapack
cat R-benchmark-25.R | time R --slave
...
68.02user 1.14system 1:09.21elapsed 99%CPU (0avgtext+0avgdata 432240maxresident)k
2904inputs+0outputs (12major+124761minor)pagefaults 0swaps
69.93

As can be seen, there's about a 60% improvement using OpenBLAS or ATLAS over the standard libblas+lapack. What about MKL? Let's test RRO:

sudo apt-get remove R-base R-base-dev
wget http://mran.revolutionanalytics.com/install/RRO-8.0-Beta-Ubuntu-14.04.x86_64.tar.gz
tar -xzf RRO-8.0-Beta-Ubuntu-14.04.x86_64.tar.gz
./install.sh
# check that it is using a different version of blas and lapack using lsof again
cat R-benchmark-25.R | time R --slave
...
51.19user 0.98system 0:52.24elapsed 99%CPU (0avgtext+0avgdata 417840maxresident)k
2208inputs+0outputs (11major+131128minor)pagefaults 0swaps
52.24

This is a 70% improvement over the standard libblas+lapack version, and a 25% improvement over the ATLAS/OpenBLAS version. This is quite a substantial improvement!

Python

Although I don't use Python much for data analysis (I use it as a general language for everything else), I wanted to repeat similar benchmarks for numpy and scipy as improvements have been documented. To do so, compile numpy and scipy from source and download some benchmark scripts.

sudo pip install numpy
less /usr/local/lib/python2.7/dist-packages/numpy/__config__.py ## openblas?
sudo pip install scipy
# test different blas
python
ps aux | grep python
lsof -p 20812 | grep 'blas\|lapack' ## change psid
wget https://gist.github.com/osdf/3842524/raw/df01f7fa9d849bec353d6ab03eae0c1ee68f1538/test_numpy.py
wget https://gist.github.com/osdf/3842524/raw/22e21f5d57a9526cbcd9981385504acdc7bdc788/test_scipy.py

One could switch blas and lapack like before. Results are as follows:

# selection: blas + lapack
time python test_numpy.py
FAST BLAS
version: 1.9.1
maxint: 9223372036854775807

dot: 0.214728403091 sec

real    0m1.253s
user    0m1.119s
sys     0m0.036s

time python test_scipy.py
cholesky: 0.166237211227 sec
svd: 3.56523122787 sec

real    0m19.183s
user    0m19.105s
sys     0m0.064s

# selection: atlas + lapack
time python test_numpy.py
FAST BLAS
version: 1.9.1
maxint: 9223372036854775807

dot: 0.211034584045 sec

real    0m1.132s
user    0m1.121s
sys     0m0.008s

time python test_scipy.py
cholesky: 0.0454761981964 sec
svd: 1.33822960854 sec

real    0m7.442s
user    0m7.346s
sys     0m0.084s

# selection: openblas + lapack
time python test_numpy.py
FAST BLAS
version: 1.9.1
maxint: 9223372036854775807

dot: 0.212402009964 sec

real    0m1.139s
user    0m1.130s
sys     0m0.004s

time python test_scipy.py
cholesky: 0.0431131839752 sec
svd: 1.09770617485 sec

real    0m6.227s
user    0m6.143s
sys     0m0.076s

# selection: atlas + atlas-lapack
time python test_numpy.py
FAST BLAS
version: 1.9.1
maxint: 9223372036854775807

dot: 0.217267608643 sec

real    0m1.162s
user    0m1.143s
sys     0m0.016s

time python test_scipy.py
cholesky: 0.0429849624634 sec
svd: 1.31666741371 sec

real    0m7.318s
user    0m7.213s
sys     0m0.092s

Here, if I only focus on the svd results, then OpenBLAS yields a 70% improvement and ATLAS yields a 63% improvement. What about MKL? Well, a readily available version costs money, so I wasn't able to test.

Conclusion

Here are my take-aways:

Using different BLAS/LAPACK libraries is extremely easy on Ubuntu; no need to compile as you could install the libraries and select which version to use.
Install and use RRO (MKL) when possible as it is the fastest.
When the previous isn't possible, use ATLAS or OpenBLAS. For example, we have AIX at work. Getting R installed on there is already a difficult task, so optimizing R is a low priority. However, if it's possible to use OpenBLAS or ATLAS, use it (Note: MKL is irrelevant here as AIX uses POWER cpu).
For Python, use OpenBLAS or ATLAS.

For those that want to compile R using MKL yourself, check this. For those that wants to do so for Python, check this.

Finally, some visualizations to summarize the findings:

```{r}
# R results
timings <- c(173.01, 70.27, 71.93, 69.93, 52.24)
versions <- c('blas + lapack', 'atlas + lapack', 'openblas + lapack', 'atlas + atlas-lapack', 'MKL')
versions <- factor(versions, levels=versions)
d1 <- data.frame(timings, versions)
ggplot(data=d1, aes(x=versions, y=timings / max(timings))) + 
  geom_bar(stat='identity') + 
  geom_text(aes(x=versions, y=timings / max(timings), label=sprintf('%.f%%', timings / max(timings) * 100)), vjust=-.8) +
  labs(title='R - R-benchmark-25.R')
ggsave('R_blas+atlas+openblas+mkl.png')

# Python results
timings <- c(3.57, 1.34, 1.10, 1.32)
versions <- c('blas + lapack', 'atlas + lapack', 'openblas + lapack', 'atlas + atlas-lapack')
versions <- factor(versions, levels=versions)
d1 <- data.frame(timings, versions)
ggplot(data=d1, aes(x=versions, y=timings / max(timings))) + 
  geom_bar(stat='identity') + 
  geom_text(aes(x=versions, y=timings / max(timings), label=sprintf('%.f%%', timings / max(timings) * 100)), vjust=-.8) +
  labs(title='Python - test_scipy.py (SVD)')
ggsave('Python_blas+atlas+openblas.png')
```

#+../../files/2014-11-10-R_blas+atlas+openblas+mkl.png <../../files/wpid-2014-11-10-R_blas-atlas-openblas-mkl2.png>

#+../../files/2014-11-10-Python_blas+atlas+openblas.png <../../files/wpid-2014-11-10-Python_blas-atlas-openblas2.png>