Compile R 3.2.2 on AIX 6.1

Here are my notes on compiling 64-bit R 3.2.2 on AIX 6.1. As a prerequisite, read the AIX notes from R-admin. Following those notes, I had our admin install GCC from here, along with many other prerequisites, prior to compiling R. Note that you could grab newer versions of each package from http://www.oss4aix.org/download/RPMS/ (needed for R-devel).

## list of packages

http://www.oss4aix.org/download/RPMS/info/info-5.2-1.aix5.1.ppc.rpm
http://www.oss4aix.org/download/RPMS/less/less-458-1.aix5.1.ppc.rpm
http://www.oss4aix.org/download/RPMS/screen/screen-4.0.3-1.aix5.1.ppc.rpm
http://www.oss4aix.org/download/RPMS/parallel/parallel-20140122-1.aix5.1.ppc.rpm
http://www.bullfreeware.com/download/bin/1464/readline-6.2-3.aix6.1.ppc.rpm
http://www.bullfreeware.com/download/bin/1465/readline-devel-6.2-3.aix6.1.ppc.rpm
http://www.oss4aix.org/download/RPMS/gmp/gmp-5.1.3-1.aix5.1.ppc.rpm
http://www.oss4aix.org/download/RPMS/gmp/gmp-devel-5.1.3-1.aix5.1.ppc.rpm
http://www.oss4aix.org/download/RPMS/mpfr/mpfr-3.1.2-1.aix5.1.ppc.rpm
http://www.oss4aix.org/download/RPMS/mpfr/mpfr-devel-3.1.2-1.aix5.1.ppc.rpm
http://www.oss4aix.org/download/RPMS/libmpc/libmpc-1.0.2-1.aix5.1.ppc.rpm
http://www.oss4aix.org/download/RPMS/libmpc/libmpc-devel-1.0.2-1.aix5.1.ppc.rpm
http://www.oss4aix.org/download/RPMS/gcc/gcc-4.8.2-1.aix6.1.ppc.rpm
http://www.oss4aix.org/download/RPMS/gcc/gcc-c++-4.8.2-1.aix6.1.ppc.rpm
http://www.oss4aix.org/download/RPMS/gcc/gcc-cpp-4.8.2-1.aix6.1.ppc.rpm
http://www.oss4aix.org/download/RPMS/gcc/gcc-gfortran-4.8.2-1.aix6.1.ppc.rpm
http://www.oss4aix.org/download/RPMS/gcc/libgcc-4.8.2-1.aix6.1.ppc.rpm
http://www.oss4aix.org/download/RPMS/gcc/libgomp-4.8.2-1.aix6.1.ppc.rpm
http://www.oss4aix.org/download/RPMS/gcc/libstdc++-4.8.2-1.aix6.1.ppc.rpm
http://www.oss4aix.org/download/RPMS/gcc/libstdc++-devel-4.8.2-1.aix6.1.ppc.rpm

http://www.oss4aix.org/download/RPMS/libiconv/libiconv-1.14-2.aix5.1.ppc.rpm ## for converting unicode, ascii, etc.; the AIX version is not compatible with R
http://download.icu-project.org/files/icu4c/54.1/icu4c-54_1-AIX7_1-VA2.tgz ## dependency for unicode support; just needs to be extracted to root /
http://www.oss4aix.org/download/RPMS/make/make-4.1-1.aix5.3.ppc.rpm ## need the GNU version of make
## libm:
https://www.ibm.com/developerworks/community/forums/html/topic?id=72e62875-0603-4d93-a9bf-9d80c6cdc6ea
http://www-01.ibm.com/support/docview.wss?uid=isg1fileset-1318926131
https://www.ibm.com/developerworks/java/jdk/aix/service.html ## jre

http://www.oss4aix.org/download/RPMS/libpng/libpng-1.6.12-1.aix5.1.ppc.rpm
http://www.oss4aix.org/download/RPMS/libpng/libpng-devel-1.6.12-1.aix5.1.ppc.rpm
http://www.oss4aix.org/download/RPMS/libjpeg/libjpeg-9a-1.aix5.1.ppc.rpm
http://www.oss4aix.org/download/RPMS/libjpeg/libjpeg-devel-9a-1.aix5.1.ppc.rpm
http://www.oss4aix.org/download/RPMS/icu/icu-gcc-4.8.1.1-1.aix5.1.ppc.rpm
http://www.oss4aix.org/download/RPMS/icu/libicu-gcc-4.8.1.1-1.aix5.1.ppc.rpm
http://www.oss4aix.org/download/RPMS/icu/libicu-gcc-devel-4.8.1.1-1.aix5.1.ppc.rpm
http://www.oss4aix.org/download/RPMS/icu/libicu-gcc-doc-4.8.1.1-1.aix5.1.ppc.rpm
http://www.oss4aix.org/download/RPMS/gdb/gdb-7.8.1-1.aix5.1.ppc.rpm
http://www.oss4aix.org/download/RPMS/expat/expat-2.1.0-1.aix5.1.ppc.rpm
http://www.oss4aix.org/download/RPMS/curl/curl-7.27.0-1.aix5.1.ppc.rpm
http://www.oss4aix.org/download/RPMS/curl/curl-devel-7.27.0-1.aix5.1.ppc.rpm
http://www.oss4aix.org/download/RPMS/libxml2/libxml2-2.9.1-1.aix5.1.ppc.rpm
http://www.oss4aix.org/download/RPMS/libxml2/libxml2-devel-2.9.1-1.aix5.1.ppc.rpm
http://www.oss4aix.org/download/RPMS/ncurses/ncurses-5.9-1.aix5.1.ppc.rpm
http://www.oss4aix.org/download/RPMS/ncurses/ncurses-devel-5.9-1.aix5.1.ppc.rpm
http://www.oss4aix.org/download/RPMS/gdbm/gdbm-1.11-1.aix5.1.ppc.rpm
http://www.oss4aix.org/download/RPMS/gdbm/gdbm-devel-1.11-1.aix5.1.ppc.rpm
http://www.oss4aix.org/download/RPMS/libtool/libtool-1.5.26-2.aix5.1.ppc.rpm
http://www.oss4aix.org/download/RPMS/libtool/libtool-ltdl-1.5.26-2.aix5.1.ppc.rpm
http://www.oss4aix.org/download/RPMS/libtool/libtool-ltdl-devel-1.5.26-2.aix5.1.ppc.rpm
http://www.oss4aix.org/download/RPMS/zlib/zlib-1.2.8-1.aix5.1.ppc.rpm
http://www.oss4aix.org/download/RPMS/zlib/zlib-devel-1.2.8-1.aix5.1.ppc.rpm
http://www.oss4aix.org/download/RPMS/bzip2/bzip2-1.0.6-1.aix5.1.ppc.rpm
http://www.oss4aix.org/download/RPMS/bzip2/bzip2-devel-1.0.6-1.aix5.1.ppc.rpm
http://www.oss4aix.org/download/RPMS/pcre/pcre-8.37-1.aix5.1.ppc.rpm
http://www.oss4aix.org/download/RPMS/pcre/pcre-devel-8.37-1.aix5.1.ppc.rpm

#### python

http://www.oss4aix.org/download/RPMS/python-setuptools/python-setuptools-0.6.24-1.aix5.1.ppc.rpm
http://www.oss4aix.org/download/RPMS/python/python-2.7.5-1.aix6.1.ppc.rpm
http://www.oss4aix.org/download/RPMS/python/python-devel-2.7.5-1.aix6.1.ppc.rpm
http://www.oss4aix.org/download/RPMS/python/python-libs-2.7.5-1.aix6.1.ppc.rpm
http://www.oss4aix.org/download/RPMS/python/python-test-2.7.5-1.aix6.1.ppc.rpm
http://www.oss4aix.org/download/RPMS/python/python-tools-2.7.5-1.aix6.1.ppc.rpm
http://www.oss4aix.org/download/RPMS/python/tkinter-2.7.5-1.aix6.1.ppc.rpm
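With the list in hand, the RPMs can be fetched and installed in one pass. A rough sketch (the file name is illustrative; installing RPMs requires root, which is why our admin handled this step):

## rpm-list.txt holds the URLs above, one per line
wget -i rpm-list.txt
rpm -Uvh *.rpm ## as root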

Add /opt/freeware/bin to PATH.
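For example, in your shell profile:

export PATH=/opt/freeware/bin:$PATH ## so the GNU tools (make, etc.) are found first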

Now, download the R source tarball, extract it, and cd into the source directory.
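For example (the URL is CRAN's standard location for the 3.2.2 sources; note that AIX's native tar has no -z flag):

wget https://cran.r-project.org/src/base/R-3/R-3.2.2.tar.gz
gunzip -c R-3.2.2.tar.gz | tar -xf -
cd R-3.2.2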

Next, update src/main/dcf.c with the version from R-devel, as it fixes a bug that causes readDCF to segfault when using install.packages; there is no need to do this for future versions of R. Then apply the patch from here to fix the following error:

gcc -maix64 -pthread -std=gnu99 -I../../../../include -DNDEBUG
-I../../../include -I../../../../src/include -DHAVE_CONFIG_H
-I../../../../src/main -I/usr/local/include  -mminimal-toc    -O2 -g
-mcpu=power6  -c gramRd.c -o gramRd.o
gcc -maix64 -pthread -std=gnu99 -shared -Wl,-brtl -Wl,-G -Wl,-bexpall
-Wl,-bnoentry -lc -L/usr/local/lib -o tools.so text.o init.o Rmd5.o
md5.o signals.o install.o getfmts.o http.o gramLatex.o gramRd.o -lm
-lintl
make[6]: Entering directory '/sas/data04/vinh/R-3.2.2/src/library/tools/src'
mkdir -p -- ../../../../library/tools/libs
make[6]: Leaving directory '/sas/data04/vinh/R-3.2.2/src/library/tools/src'
make[5]: Leaving directory '/sas/data04/vinh/R-3.2.2/src/library/tools/src'
make[4]: Leaving directory '/sas/data04/vinh/R-3.2.2/src/library/tools'
make[4]: Entering directory '/sas/data04/vinh/R-3.2.2/src/library/tools'
installing 'sysdata.rda'
Error: Line starting 'Package: tools ...' is malformed!
Execution halted
../../../share/make/basepkg.mk:150: recipe for target 'sysdata' failed
make[4]: *** [sysdata] Error 1
make[4]: Leaving directory '/sas/data04/vinh/R-3.2.2/src/library/tools'
Makefile:30: recipe for target 'all' failed
make[3]: *** [all] Error 2
make[3]: Leaving directory '/sas/data04/vinh/R-3.2.2/src/library/tools'
Makefile:36: recipe for target 'R' failed
make[2]: *** [R] Error 1
make[2]: Leaving directory '/sas/data04/vinh/R-3.2.2/src/library'
Makefile:28: recipe for target 'R' failed
make[1]: *** [R] Error 1
make[1]: Leaving directory '/sas/data04/vinh/R-3.2.2/src'
Makefile:59: recipe for target 'R' failed
make: *** [R] Error 1

Hopefully, this patch will make it into R-devel so that it is no longer needed for future versions of R.

export OBJECT_MODE=64
export CC="gcc -maix64 -pthread"
export CXX="g++ -maix64 -pthread"
export FC="gfortran -maix64 -pthread"
export F77="gfortran -maix64 -pthread"
export CFLAGS="-O2 -g -mcpu=power6"
export FFLAGS="-O2 -g -mcpu=power6"
export FCFLAGS="-O2 -g -mcpu=power6"
./configure --prefix=/path/to/opt ## custom location so I don't need root
make -j 16
make install
## add /path/to/opt/bin to PATH

The last step may complain that NEWS.pdf is not found in a directory and that a certain directory is not found in the destination. For the former, just touch NEWS.pdf where it's supposed to be; for the latter, create the directory yourself.
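For example (paths are illustrative; use whatever paths the errors mention):

touch doc/NEWS.pdf ## if NEWS.pdf is reported missing
mkdir -p /path/to/opt/missing/dir ## if a destination directory is reported missing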

Calculate the weighted Gini coefficient or AUC in R

This post on Kaggle provides R code for calculating the Gini coefficient for assessing a prediction rule, and this post provides R code for the weighted version (think exposure for frequency and claim count for severity in non-life insurance modeling). Note that the weighted version is not well-defined when there are ties in the predictions and the corresponding weights vary, because a different Lorenz curve (gains chart) could be drawn for each ordering of the observations; see this post for an explanation and some examples.

Now, to explain the code. The calculations of the x values (variable random, the cumulative proportion of observations or weights) and the y values (variable Lorentz, the cumulative proportion of the response, i.e., the 1's or positive values) are straightforward. To calculate the area between the Lorenz curve and the diagonal line, one could use the trapezoidal rule to calculate the area between the Lorenz curve and the x-axis and then subtract the area of the lower triangle (1/2):

\begin{align}
Gini &= \sum_{i=1}^{n} (x_{i} - x_{i-1}) \left[\frac{L(x_{i}) + L(x_{i-1})}{2}\right] - \frac{1}{2} \\
&= \frac{1}{2} \sum_{i=1}^{n} \left[ L(x_{i})x_{i} + L(x_{i-1})x_{i} - L(x_{i})x_{i-1} - L(x_{i-1})x_{i-1} \right] - \frac{1}{2} \\
&= \frac{1}{2} \sum_{i=1}^{n} \left[ L(x_{i})x_{i} - L(x_{i-1})x_{i-1} \right] + \frac{1}{2} \sum_{i=1}^{n} \left[ L(x_{i-1})x_{i} - L(x_{i})x_{i-1} \right] - \frac{1}{2} \\
&= \frac{1}{2} L(x_{n})x_{n} + \frac{1}{2} \sum_{i=1}^{n} \left[ L(x_{i-1})x_{i} - L(x_{i}) x_{i-1} \right] - \frac{1}{2} \\
&= \frac{1}{2} \sum_{i=1}^{n} \left[ L(x_{i-1})x_{i} - L(x_{i}) x_{i-1} \right]
\end{align}

where the last equality comes from the fact that \(L(x_{n}) = x_{n} = 1\) for the Lorenz curve/gains chart. The remaining summation thus corresponds to sum(df$Lorentz[-1]*df$random[-n]) - sum(df$Lorentz[-n]*df$random[-1]) inside the WeightedGini function, since the \(i=1\) term in the summation is 0 (\(x_{0}=0\) and \(L(x_{0})=0\) for the Lorenz curve), yielding \(n-1\) terms in the code.
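For reference, here is a sketch of the weighted Gini function along the lines of the Kaggle code referenced above (solution is the response, weights the exposure, submission the predictions):

WeightedGini <- function(solution, weights, submission) {
    df <- data.frame(solution, weights, submission)
    df <- df[order(df$submission, decreasing=TRUE), ] ## sort by predictions
    df$random <- cumsum(df$weights / sum(df$weights)) ## x: cumulative proportion of weights
    df$Lorentz <- cumsum(df$solution * df$weights) / sum(df$solution * df$weights) ## y: cumulative proportion of response
    n <- nrow(df)
    sum(df$Lorentz[-1] * df$random[-n]) - sum(df$Lorentz[-n] * df$random[-1])
}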

For the unweighted case, applying the trapezoidal rule to the area between the Lorenz curve and the diagonal line yields:

\begin{align}
Gini &= \sum_{i=1}^{n} \frac{1}{n} \frac{\left[ L(x_{i}) - x_{i} \right] + \left[ L(x_{i-1}) - x_{i-1} \right]}{2} \\
&= \frac{1}{2n} \sum_{i=1}^{n} \left[ L(x_{i}) - x_{i} \right] + \frac{1}{2n} \sum_{i=1}^{n} \left[ L(x_{i-1}) - x_{i-1} \right] \\
&= \frac{1}{2n} \sum_{i=1}^{n} \left[ L(x_{i}) - x_{i} \right] + \frac{1}{2n} \left[ L(x_{0}) - x_{0} \right] + \frac{1}{2n} \sum_{i=1}^{n-1} \left[ L(x_{i}) - x_{i} \right] \\
&= \frac{1}{2n} \sum_{i=1}^{n} \left[ L(x_{i}) - x_{i} \right] + \frac{1}{2n} \left[ L(x_{0}) - x_{0} \right] + \frac{1}{2n} \sum_{i=1}^{n-1} \left[ L(x_{i}) - x_{i} \right] + \frac{1}{2n} \left[ L(x_{n}) - x_{n} \right] \\
&= \frac{1}{2n} \sum_{i=1}^{n} \left[ L(x_{i}) - x_{i} \right] + \frac{1}{2n} \left[ L(x_{0}) - x_{0} \right] + \frac{1}{2n} \sum_{i=1}^{n} \left[ L(x_{i}) - x_{i} \right] \\
&= \frac{1}{n} \sum_{i=1}^{n} \left[ L(x_{i}) - x_{i} \right]
\end{align}

where we repeatedly used the fact that \(L(x_{0}) = x_{0} = 0\) and \(L(x_{n}) = x_{n} = 1\) for a Lorenz curve, and that \(1/n\) is the width between points (the change in the empirical cdf of the observations). The summation is what is returned by SumModelGini.
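Similarly, a sketch of the unweighted version (SumModelGini from the Kaggle post, up to cosmetic differences):

SumModelGini <- function(solution, submission) {
    df <- data.frame(solution, submission)
    df <- df[order(df$submission, decreasing=TRUE), ]
    df$random <- (1:nrow(df)) / nrow(df) ## x: cumulative proportion of observations
    df$Lorentz <- cumsum(df$solution) / sum(df$solution) ## y: cumulative proportion of the response
    sum(df$Lorentz - df$random) ## the summation derived above
}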

Note that neither \(1/2\) nor \(1/n\) is multiplied into the sums in the weighted and unweighted functions, since most people will use the normalized versions, in which case these factors simply cancel.

Optimized R and Python: standard BLAS vs. ATLAS vs. OpenBLAS vs. MKL

wpid-2014-11-10-R_blas-atlas-openblas-mkl.png

Revolution Analytics recently released Revolution R Open (RRO), a downstream version of R built using Intel's Math Kernel Library (MKL). The post mentions that comparable improvements are observed on Mac OS X, where the ATLAS BLAS library is used. A reader also expressed hesitation in the comments section over the lack of a comparison with ATLAS and OpenBLAS. Using a different version of BLAS is documented in the R Administration manual, and has been benchmarked in the past here and here. Now, as an avid R user, I should be using a more optimal version of R if it exists and is easy to obtain (install/compile), especially if the improvements are up to 40% as reported by Domino Data Lab. I decided to follow the framework set out by this post to compare timings for the different versions of R on a t2.micro instance on Amazon EC2 running Ubuntu 14.04.

First, I install R and the various versions of BLAS and LAPACK, and download the benchmark script:

sudo apt-get install libblas3gf libopenblas-base libatlas3gf-base liblapack3gf libopenblas-dev liblapack-dev libatlas-dev r-base r-base-dev
wget http://r.research.att.com/benchmarks/R-benchmark-25.R
echo "install.packages('SuppDists', dep=TRUE, repos='http://cran.stat.ucla.edu')" | sudo R --vanilla ## needed for R-benchmark-25.R

One could switch which BLAS and LAPACK libraries are used via the following commands:

sudo update-alternatives --config libblas.so.3 ## select from 3 versions of blas: blas, atlas, openblas
sudo update-alternatives --config liblapack.so.3 ## select from 2 versions of lapack: lapack and atlas-lapack

Run R, issue Ctrl-z to send the process to the background, and check that the selected BLAS and LAPACK libraries are indeed the ones loaded by R:

ps aux | grep R ## find the process id for R
lsof -p PROCESS_ID_JUST_FOUND | grep 'blas\|lapack'

Now run the benchmarks on different versions:

# selection: libblas + lapack
cat R-benchmark-25.R | time R --slave
...
171.71user 1.22system 2:53.01elapsed 99%CPU (0avgtext+0avgdata 425068maxresident)k
4960inputs+0outputs (32major+164552minor)pagefaults 0swaps
173.01
# selection: atlas + lapack
cat R-benchmark-25.R | time R --slave
...
69.05user 1.16system 1:10.27elapsed 99%CPU (0avgtext+0avgdata 432620maxresident)k
2824inputs+0outputs (15major+130664minor)pagefaults 0swaps
70.27
# selection: openblas + lapack
cat R-benchmark-25.R | time R --slave
...
70.69user 1.19system 1:11.93elapsed 99%CPU (0avgtext+0avgdata 429136maxresident)k
1592inputs+0outputs (6major+131181minor)pagefaults 0swaps
71.93
# selection: atlas + atlas-lapack
cat R-benchmark-25.R | time R --slave
...
68.02user 1.14system 1:09.21elapsed 99%CPU (0avgtext+0avgdata 432240maxresident)k
2904inputs+0outputs (12major+124761minor)pagefaults 0swaps
69.93

As can be seen, there’s about a 60% improvement using OpenBLAS or ATLAS over the standard libblas+lapack. What about MKL? Let’s test RRO:

sudo apt-get remove r-base r-base-dev
wget http://mran.revolutionanalytics.com/install/RRO-8.0-Beta-Ubuntu-14.04.x86_64.tar.gz
tar -xzf RRO-8.0-Beta-Ubuntu-14.04.x86_64.tar.gz
./install.sh
# check that it is using a different version of blas and lapack using lsof again
cat R-benchmark-25.R | time R --slave
...
51.19user 0.98system 0:52.24elapsed 99%CPU (0avgtext+0avgdata 417840maxresident)k
2208inputs+0outputs (11major+131128minor)pagefaults 0swaps
52.24

This is a 70% improvement over the standard libblas+lapack version, and a 25% improvement over the ATLAS/OpenBLAS version. This is quite a substantial improvement!

Python

Although I don't use Python much for data analysis (I use it as a general language for everything else), I wanted to repeat similar benchmarks for numpy and scipy, as improvements have been documented. To do so, install numpy and scipy from source (pip builds them) and download some benchmark scripts:

sudo pip install numpy
less /usr/local/lib/python2.7/dist-packages/numpy/__config__.py ## check which BLAS numpy linked against (openblas?)
sudo pip install scipy
# test different blas
python
ps aux | grep python ## find the process id for the python session
lsof -p 20812 | grep 'blas\|lapack' ## replace 20812 with the PID just found
wget https://gist.github.com/osdf/3842524/raw/df01f7fa9d849bec353d6ab03eae0c1ee68f1538/test_numpy.py
wget https://gist.github.com/osdf/3842524/raw/22e21f5d57a9526cbcd9981385504acdc7bdc788/test_scipy.py

One could switch BLAS and LAPACK as before. Results are as follows:

# selection: blas + lapack
time python test_numpy.py
FAST BLAS
version: 1.9.1
maxint: 9223372036854775807

dot: 0.214728403091 sec

real    0m1.253s
user    0m1.119s
sys     0m0.036s

time python test_scipy.py
cholesky: 0.166237211227 sec
svd: 3.56523122787 sec

real    0m19.183s
user    0m19.105s
sys     0m0.064s

# selection: atlas + lapack
time python test_numpy.py
FAST BLAS
version: 1.9.1
maxint: 9223372036854775807

dot: 0.211034584045 sec

real    0m1.132s
user    0m1.121s
sys     0m0.008s

time python test_scipy.py
cholesky: 0.0454761981964 sec
svd: 1.33822960854 sec

real    0m7.442s
user    0m7.346s
sys     0m0.084s

# selection: openblas + lapack
time python test_numpy.py
FAST BLAS
version: 1.9.1
maxint: 9223372036854775807

dot: 0.212402009964 sec

real    0m1.139s
user    0m1.130s
sys     0m0.004s

time python test_scipy.py
cholesky: 0.0431131839752 sec
svd: 1.09770617485 sec

real    0m6.227s
user    0m6.143s
sys     0m0.076s

# selection: atlas + atlas-lapack
time python test_numpy.py
FAST BLAS
version: 1.9.1
maxint: 9223372036854775807

dot: 0.217267608643 sec

real    0m1.162s
user    0m1.143s
sys     0m0.016s

time python test_scipy.py
cholesky: 0.0429849624634 sec
svd: 1.31666741371 sec

real    0m7.318s
user    0m7.213s
sys     0m0.092s

Here, if I only focus on the svd results, then OpenBLAS yields a 70% improvement and ATLAS yields a 63% improvement. What about MKL? Well, a readily available version costs money, so I wasn’t able to test.

Conclusion

Here are my take-aways:

  • Using different BLAS/LAPACK libraries is extremely easy on Ubuntu; there is no need to compile, as you can install the libraries and select which version to use.
  • Install and use RRO (MKL) when possible, as it is the fastest.
  • When the previous isn't possible, use ATLAS or OpenBLAS. For example, we have AIX at work. Getting R installed there is already a difficult task, so optimizing R is a low priority. However, if it's possible to use OpenBLAS or ATLAS, use it (note: MKL is irrelevant here as AIX runs on POWER CPUs).
  • For Python, use OpenBLAS or ATLAS.

For those who want to compile R with MKL themselves, check this. For those who want to do the same for Python, check this.

Finally, some visualizations to summarize the findings (generated by the code below):

2014-11-10-R_blas+atlas+openblas+mkl.png

2014-11-10-Python_blas+atlas+openblas.png

# R results
library(ggplot2)
timings <- c(173.01, 70.27, 71.93, 69.93, 52.24)
versions <- c('blas + lapack', 'atlas + lapack', 'openblas + lapack', 'atlas + atlas-lapack', 'MKL')
versions <- factor(versions, levels=versions)
d1 <- data.frame(timings, versions)
ggplot(data=d1, aes(x=versions, y=timings / max(timings))) + 
  geom_bar(stat='identity') + 
  geom_text(aes(x=versions, y=timings / max(timings), label=sprintf('%.f%%', timings / max(timings) * 100)), vjust=-.8) +
  labs(title='R - R-benchmark-25.R')
ggsave('R_blas+atlas+openblas+mkl.png')

# Python results
timings <- c(3.57, 1.34, 1.10, 1.32)
versions <- c('blas + lapack', 'atlas + lapack', 'openblas + lapack', 'atlas + atlas-lapack')
versions <- factor(versions, levels=versions)
d1 <- data.frame(timings, versions)
ggplot(data=d1, aes(x=versions, y=timings / max(timings))) + 
  geom_bar(stat='identity') + 
  geom_text(aes(x=versions, y=timings / max(timings), label=sprintf('%.f%%', timings / max(timings) * 100)), vjust=-.8) +
  labs(title='Python - test_scipy.py (SVD)')
ggsave('Python_blas+atlas+openblas.png')

optparse R package for creating command line scripts with arguments

Just discovered the optparse package in R, which allows me to write a command line R script that can take arguments (similar to Python's argparse). Here's an example:

#! /usr/bin/env Rscript

# http://cran.r-project.org/web/packages/optparse/vignettes/optparse.pdf
suppressPackageStartupMessages(library("optparse"))

option_list <- list(
    make_option(c('-d', '--date'), action='store', dest='date', type='character', default=Sys.Date(), metavar='"YYYY-MM-DD"', help='As of date to extract data from.  Defaults to today.')
)

opt <- parse_args(OptionParser(option_list=option_list))

# print(opt$date)
cat(sprintf('select * where contract_date > "%s"\n', opt$date))

Save this as my_script.R, and do chmod +x my_script.R. Now check out ./my_script.R --help.
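For example, the expected output from the script above:

$ ./my_script.R --date 2015-01-01
select * where contract_date > "2015-01-01"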

Package management in R and Python at work without root and behind firewall

My current job has strict security measures (no root access on the Linux servers and no access outside the company's network), so it can be difficult to get the tools necessary for my work, namely R packages from CRAN and Python packages via pip.

On my Windows workstation, I was able to install R by downloading the installer online and Python via Cygwin. However, R and Python are unable to connect to the internet to download and install additional packages because of the company’s firewall. To get around this for R, I could:

  • add the flag --internet2 to the execution path in R’s shortcut,
  • call setInternet2(TRUE) in the R console, or
  • set the environment variable http_proxy=http://username:password@proxy_server:port/.

The first two tell R to use the proxy defined in Internet Explorer. I was able to access CRAN via my web browser, so this works (see the sketch below). If CRAN is blocked in the browser, find out what proxy server is available at work and use that to access the outside world. If CRAN is also blocked on the proxy, put in a request to add it to the white list.
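A minimal sketch in the R console on Windows (the CRAN mirror is just an example; note that setInternet2 was removed from later versions of R):

setInternet2(TRUE) ## use the proxy settings defined in Internet Explorer
install.packages('XLConnect', repos='http://cran.stat.ucla.edu')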

As for Python, install pip and use a proxy to download and install packages:

wget https://bootstrap.pypa.io/get-pip.py
python get-pip.py --proxy="username:password@proxy_server:port"
pip install --proxy="username:password@proxy_server:port" argparse numpy pandas ## etc

NOTE: pip 1.3.1 has issues with proxy servers, so use the latest version.

On a Linux/Unix server, the added complexity is the lack of root access. Typically, Python is available by default on any modern distro. If not, have the admin team install R and Python via the distribution's package manager; if they can't, then compile the two from source and install them locally. Once installed, use the same method as before for Python's pip, but with the --user flag in order to install the packages locally in ~/.local/ (the pip command is at ~/.local/bin/pip). For R, set the environment variables

export http_proxy="http://proxy_server:port/"
export http_proxy_user="username:password"

and install the libraries to ~/Rlib (add this to the library path via .libPaths() in ~/.Rprofile).
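For example, a minimal sketch of the ~/.Rprofile entry and a local install (the mirror is just an example):

## in ~/.Rprofile (~/Rlib must already exist)
.libPaths(c('~/Rlib', .libPaths()))

## then, in R, packages install to and load from ~/Rlib
install.packages('ggplot2', repos='http://cran.stat.ucla.edu')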

Output data to Excel for reproducible post-hoc analysis or visualization

As much as I like to analyze and visualize data in R, I sometimes need to export results/data to Excel for my business partners or myself to consume in Excel or PowerPoint (e.g., to create custom/editable bar charts with various graphical overlays in a PowerPoint slide). As I've been using the XLConnect package to read xls/xlsx files, I'm also using it to write data to a sheet in an Excel workbook. I write the necessary data out to the Data sheet:

library(XLConnect)
## read data from files/DB and manipulate to get to the final data set
## now, write data
writeWorksheetToFile(file='foo.xlsx', data=iris, sheet='Data', clearSheets=TRUE) ## in case number of rows is less than before

I ask my collaborators not to edit the Data sheet except to add filters or sorting when they are inspecting/eye-balling the data, and to do all analysis in separate sheets. The reason for this is to keep the work reproducible in case the data needs to be refreshed (error in the data, repeating the analysis on new data, etc.). That is, when I need to refresh the data, I ask for the modified Excel workbook and write the refreshed data out to the same sheet (hence the clearSheets=TRUE option, in case the number of rows is smaller than before). That way, calculations or plots referencing columns in the Data sheet are automatically refreshed in the workbook with the refreshed data.

This is just another way to prevent inefficiency in the work flow and to allow for reproducibility even when collaborating with Excel workbooks.

This work flow should in theory also work with SAS:

proc export data=mydata outfile='foo.xlsx'
    dbms=xlsx replace ; /* replace, so an existing workbook can be refreshed */
    sheet='Data' ;
run ;

Skeleton to create fast automatic tree diagrams using R and Graphviz

wpid-2014-01-17-tree_diagram.png

I’ve had to create tree diagrams (dendrograms, decision trees) many times in the past to illustrate the flow of data or decisions (e.g., data flow for a study). This is usually a manual task done in MS Powerpoint or Visio. I’ve also made some diagrams in the past using Graphviz based on the DOT language to make creation more reproducible. However, that still felt pretty manual.

I decided to come up with a skeleton framework to generate these diagrams using R since I could connect to various data sources, do calculations, and mash up outputs fairly fast with it.

Here is my framework illustrated by an example:

digraph <- '# dot -Tpng diagram.gv > diagram.png
digraph g {
graph [rankdir="LR"]
node [shape="rectangle" style=filled color=blue fontcolor=white] ;
n [label="%s"] ;
n_a [label="%s" color=red] ;
n_b [label="%s"] ;
n_aa [label="%s" color=red] ;
n_ab [label="%s" color=red] ;
n_ba [label="%s"] ;
n_bb [label="%s"] ;
n -> n_a [label="%s"];
n -> n_b [label="%s"];
n_a -> n_aa [label="%s"];
n_a -> n_ab [label="%s"];
n_b -> n_ba [label="%s"] ;
n_b -> n_bb [label="%s"] ;
} 
'

string_list <- list(
    digraph
    , 'All calls\\nn=100,000'
    , 'Night\\nn=20,000'
    , 'Day\\nn=80,000'
    , 'Closed\\nn=5,000'
    , 'Open\\nn=15,000'
    , 'Closed\\nn=10,000'
    , 'Open\\nn=7,000'
    , 'Night'
    , 'Day'
    , 'Closed'
    , 'Open'
    , 'Closed'
    , 'Open'
    )

# http://stackoverflow.com/questions/10341114/alternative-function-to-paste
dot_file <- do.call(sprintf, string_list)
sink('diagram.gv', split=TRUE)
cat(dot_file)
sink()
# run: dot -Tpng diagram.gv > diagram.png

This will write a file called diagram.gv:

# dot -Tpng diagram.gv > diagram.png
digraph g {
graph [rankdir="LR"]
node [shape="rectangle" style=filled color=blue fontcolor=white] ;
n [label="All calls\nn=100,000"] ;
n_a [label="Night\nn=20,000" color=red] ;
n_b [label="Day\nn=80,000"] ;
n_aa [label="Closed\nn=5,000" color=red] ;
n_ab [label="Open\nn=15,000" color=red] ;
n_ba [label="Closed\nn=10,000"] ;
n_bb [label="Open\nn=7,000"] ;
n -> n_a [label="Night"];
n -> n_b [label="Day"];
n_a -> n_aa [label="Closed"];
n_a -> n_ab [label="Open"];
n_b -> n_ba [label="Closed"] ;
n_b -> n_bb [label="Open"] ;
}

Executing dot -Tpng diagram.gv > diagram.png, I will get the following output:

2014-01-17-tree_diagram.png

To make tree diagrams quickly, edit the structure of the diagram stored in the variable digraph. Dynamic text should be inserted via sprintf through the list string_list: do all the calculations first and then format the results into string_list (see the sketch below). The diagram can then be generated very quickly and is reproducible. Changing a data source or a calculation no longer means manually re-creating the diagram!
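For instance, a sketch of computing the labels from the data rather than hard-coding them (calls is a hypothetical data frame with logical columns night and closed):

n_total <- nrow(calls)
n_night <- sum(calls$night)
string_list <- list(
    digraph
    , sprintf('All calls\\nn=%s', format(n_total, big.mark=','))
    , sprintf('Night\\nn=%s', format(n_night, big.mark=','))
    ## ... and so on for the remaining node and edge labels
    )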

Delimited file where delimiter clashes with data values

A comma-separated values (CSV) file is a typical way to store tabular/rectangular data. If a data cell contains a comma, the cell is typically wrapped with quotes. However, what if a data cell contains both a comma and a quotation mark? To avoid such scenarios, it is typically wise to use a delimiter that has a low chance of showing up in your data, such as the pipe ("|") or caret ("^") character. However, there are cases when the data is a long string with all sorts of characters, including the pipe and caret. What then should the delimiter be in order to avoid a delimiter collision? As the Wikipedia article suggests, special ASCII characters such as the unit/field separator (hex: 1F) could help, as they probably won't be in your data (there is no keyboard key that corresponds to them!).

Currently, my rule of thumb is to use the pipe as the default delimiter. If the data contains complicated strings, then I'll default to the field separator character. In Python, one can refer to the field separator as '\x1f'. In R, one can refer to it as '\x1F'. In SAS, it can be specified as '1F'x. In bash, the character can be specified on the command line (e.g., with the cut command, the csvlook command, etc.) by passing $'\x1f' as the delimiter character.
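For example, in R (a minimal sketch):

x <- data.frame(id=1:2, note=c('contains | and ^', 'plain'))
write.table(x, 'data.usv', sep='\x1F', row.names=FALSE, quote=FALSE)
read.table('data.usv', sep='\x1F', header=TRUE)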

If the data contains the newline character (\n) inside a data cell, then the record separator character (hex: 1E) could be used to delimit records.

Guide to accessing MS SQL Server and MySQL server on Mac OS X

Native GUI client access to MS-SQL and MySQL

We can use Oracle SQL Developer with the jTDS driver to access Microsoft SQL Server. Note: jTDS version 1.3.0 did not work for me; I had to use version 1.2.6. Detailed instructions can be found here.

We can use MySQL Workbench to access MySQL server. Setup is intuitively obvious.

Overview of ODBC on Mac OS X

Mac OS X has iODBC installed as its default ODBC manager. Most other Linux/UNIX systems use unixODBC to manage the ODBC drivers. This is the main reason why there's so much confusion about getting ODBC to work on Mac OS X.

ODBC is kind of like an API for any software to access any DBMS easily, regardless of what DBMS it is and what OS it’s running on. Different software (e.g., R or Python) can utilize ODBC to access different DBMS through the following logic: Software -> ODBC Manager -> ODBC Driver for the DBMS -> DBMS Server (Software: R, Python, etc.; DBMS: MySQL, MS-SQL, etc.).

It doesn't matter whether you use iODBC or unixODBC. Whichever one you use, just make sure the DBMS driver and the software you are using are configured/compiled against the same ODBC manager (usually set through configure flags). For example, the R package RODBC and the Python package pyodbc are compiled by default to use iODBC on Mac OS X, so the DBMS drivers used must be compiled for use with iODBC. For iODBC, one can add data source names (DSNs) in ~/Library/ODBC/odbc.ini; for unixODBC, one can add DSNs in ~/.odbc.ini.

My current setup utilizes iODBC. I will outline the instructions for setting up MySQL and freeTDS (MS-SQL) drivers for use with RODBC and pyodbc through iODBC.

MySQL and FreeTDS with iODBC on Mac OS X

Install the MySQL Connector/ODBC driver. The driver should be at /usr/local/lib/libmyodbc5.so or /usr/local/lib/libmyodbc5w.so. Note: I was unable to compile the driver from source on Mac OS X.

FreeTDS is an open source ODBC driver to access MS SQL Server. Install via Homebrew:

## install homebrew
ruby -e "$(curl -fsSL https://raw.github.com/mxcl/homebrew/go)"

## install freetds
brew install freetds

The driver should be at /usr/local/lib/libtdsodbc.so (symbolically linked).

Create ~/Library/ODBC/odbc.ini:

[sqlserver01]
Driver=/usr/local/lib/libtdsodbc.so
TDS_Version=4.2
Server=ip.address
Port = 1433
Trace = Yes
Description=my description
# Database=
# can't specify username and password for freetds

[mysql01]
Driver=/usr/local/lib/libmyodbc5.so
Server=hostname
Port=3306
charset=UTF8 
User=username
Password=password
# Database=
## can specify an actual database to each DSN

Install pyodbc via sudo pip install pyodbc. Test connections in python:

import pyodbc as p

con1 = p.connect("DSN=sqlserver01;UID=username;PWD=password")
con1.execute("select name from master..sysdatabases").fetchall()

con2 = p.connect("DSN=mysql01;UID=username;PWD=password")
con2.execute("show databases;").fetchall()

Install R using the installer. Install RODBC in the R interpreter via install.packages("RODBC"). Test connections in R:

library(RODBC)

ch1 <- odbcConnect(dsn="sqlserver01", uid="username", pwd="password")
odbcQuery(ch1, "select name from master..sysdatabases")
odbcFetchRows(ch1)

ch2 <- odbcConnect(dsn="mysql01", uid="username", pwd="password")
odbcQuery(ch2, "show databases;")
odbcFetchRows(ch2)

More on unixODBC on Mac OS X

If one wants to use unixODBC on Mac OS X instead, note the following:

  • First install unixODBC via Homebrew with brew install unixodbc.
  • Compile R from source to have it work with unixODBC (R binaries from the installer use iODBC by default).
  • Choose --with-odbc-manager=odbc when compiling RODBC.
  • When compiling FreeTDS, include the --with-unixodbc argument (pass it to Homebrew or when compiling manually).
  • I'm unable to compile the MySQL Connector driver from source on Mac OS X (Homebrew or manually), so it won't work with unixODBC. I believe I tried unixODBC and MySQL Connector from MacPorts, and those work.
  • pyodbc only works with iODBC on Mac OS X (inspect the setup file). Currently I can't get pyodbc to work with unixODBC on Mac OS X.

More differences between unixODBC and iODBC

unixODBC comes with the isql command to access different DBMSs from the command line. iODBC comes with the iodbctest and iodbctestw commands. The isql command works for me on Mac OS X when I set FreeTDS up to work with unixODBC (e.g., accessing MS SQL Server). I couldn't access MySQL server because the MySQL Connector driver was compiled for use with iODBC.

If I use iODBC, I get the following when trying to access a MySQL server:

$ iodbctestw "DSN=sqlserver01;UID=username;PWD=password"
iODBC Unicode Demonstration program
This program shows an interactive SQL processor
Driver Manager: 03.52.0607.1008
1: SQLDriverConnectW = [MySQL][ODBC 5.1 Driver]Prompting is not supported on this platform. Please provide all required connect information. (0) SQLSTATE=HY000
1: ODBC_Connect = [MySQL][ODBC 5.1 Driver]Prompting is not supported on this platform. Please provide all required connect information. (0) SQLSTATE=HY000

When I try to access SQL Server, I get

$ iodbctestw "DSN=sqlserver01;UID=username;PWD=password"
iODBC Unicode Demonstration program
This program shows an interactive SQL processor
Driver Manager: 03.52.0607.1008
1: SQLDriverConnectW = [FreeTDS][SQL Server]Login failed for user 'username'. (18456) SQLSTATE=42000
2: SQLDriverConnectW = [FreeTDS][SQL Server]Unable to connect to data source (0) SQLSTATE=08001
1: ODBC_Connect = [FreeTDS][SQL Server]Login failed for user 'username'. (18456) SQLSTATE=42000
2: ODBC_Connect = [FreeTDS][SQL Server]Unable to connect to data source (0) SQLSTATE=08001

I don't know why that is. I guess it's not too important to use an interactive interpreter; what matters is that the drivers work with R and Python. Perhaps I should consider sqsh or do more searching.

Better decision tree graphics for rpart via party and partykit

I've been using Graphviz to create better decision tree graphics "by hand" for rpart objects created in R (final tree). I stumbled on this post that shows how one could convert an rpart object to a party object via the as.party function in partykit in order to utilize the plot functions in party. It looks quite nice; a minimal sketch follows.
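This assumes rpart and partykit are installed; kyphosis is the example data set that ships with rpart:

library(rpart)
library(partykit)
fit <- rpart(Kyphosis ~ Age + Number + Start, data=kyphosis)
plot(as.party(fit)) ## much nicer than plot(fit); text(fit)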

I might have to do additional hacking as I like to display the node size and percentage of success in every node. For example, in rpart, I do something like

## rpartObj created from rpart; the signature below follows the text function
## that rpart stores in rpartObj$functions$text
textRpartCustom <- function(yval, dev, wt, ylevel, digits, n, use.n)
{
    nclass <- (ncol(yval) - 1L)/2
    group <- yval[, 1L]
    counts <- yval[, 1L + (1L:nclass)]
    if (!is.null(ylevel)) 
        group <- ylevel[group]
    temp1 <- rpart:::formatg(counts, digits)
    if (nclass > 1) {
        ## temp1 <- apply(matrix(temp1, ncol = nclass), 1, paste, 
        ##     collapse = "/")
      temp1 <- matrix(as.numeric(temp1), ncol=nclass)
      ## temp1 <- paste("p=", round(temp1[, 2] / apply(temp1, 1, sum) * 100, 1), "%", "; N=", apply(temp1, 1, sum), sep="")
      temp1 <- paste("", round(temp1[, 2] / apply(temp1, 1, sum) * 100, 1), "%", "; ", apply(temp1, 1, sum), sep="")
    }
    if (use.n) {
        out <- paste(format(group, justify = "left"), "\n", temp1, 
            sep = "")
    }
    else {
        out <- format(group, justify = "left")
    }
    return(out)
}

rpartObj$functions$text <- textRpartCustom
plot(rpartObj)
text(rpartObj, use.n=TRUE) ## use.n=TRUE so the custom counts/percentages are displayed

to get this information to display for a classification fit.