Compile R 3.2.2 on AIX 6.1

Here are my notes on compiling 64-bit R 3.2.2 on AIX 6.1. As a prerequisite, read the AIX notes in R-admin (the R Installation and Administration manual). As in those notes, our admin installed GCC from here, along with many other prerequisites, before I compiled R. Note that you can grab newer versions of each package from http://www.oss4aix.org/download/RPMS/ (needed for R-devel).

## list of packages

http://www.oss4aix.org/download/RPMS/info/info-5.2-1.aix5.1.ppc.rpm


http://www.oss4aix.org/download/RPMS/less/less-458-1.aix5.1.ppc.rpm


http://www.oss4aix.org/download/RPMS/screen/screen-4.0.3-1.aix5.1.ppc.rpm


http://www.oss4aix.org/download/RPMS/parallel/parallel-20140122-1.aix5.1.ppc.rpm

http://www.bullfreeware.com/download/bin/1464/readline-6.2-3.aix6.1.ppc.rpm


http://www.bullfreeware.com/download/bin/1465/readline-devel-6.2-3.aix6.1.ppc.rpm

http://www.oss4aix.org/download/RPMS/gmp/gmp-5.1.3-1.aix5.1.ppc.rpm


http://www.oss4aix.org/download/RPMS/gmp/gmp-devel-5.1.3-1.aix5.1.ppc.rpm


http://www.oss4aix.org/download/RPMS/mpfr/mpfr-3.1.2-1.aix5.1.ppc.rpm


http://www.oss4aix.org/download/RPMS/mpfr/mpfr-devel-3.1.2-1.aix5.1.ppc.rpm


http://www.oss4aix.org/download/RPMS/libmpc/libmpc-1.0.2-1.aix5.1.ppc.rpm


http://www.oss4aix.org/download/RPMS/libmpc/libmpc-devel-1.0.2-1.aix5.1.ppc.rpm


http://www.oss4aix.org/download/RPMS/gcc/gcc-4.8.2-1.aix6.1.ppc.rpm


http://www.oss4aix.org/download/RPMS/gcc/gcc-c++-4.8.2-1.aix6.1.ppc.rpm


http://www.oss4aix.org/download/RPMS/gcc/gcc-cpp-4.8.2-1.aix6.1.ppc.rpm


http://www.oss4aix.org/download/RPMS/gcc/gcc-gfortran-4.8.2-1.aix6.1.ppc.rpm


http://www.oss4aix.org/download/RPMS/gcc/libgcc-4.8.2-1.aix6.1.ppc.rpm


http://www.oss4aix.org/download/RPMS/gcc/libgomp-4.8.2-1.aix6.1.ppc.rpm


http://www.oss4aix.org/download/RPMS/gcc/libstdc++-4.8.2-1.aix6.1.ppc.rpm


http://www.oss4aix.org/download/RPMS/gcc/libstdc++-devel-4.8.2-1.aix6.1.ppc.rpm

http://www.oss4aix.org/download/RPMS/libiconv/libiconv-1.14-2.aix5.1.ppc.rpm

## converting unicode, ascii, etc.; the AIX version is not compatible with R

http://download.icu-project.org/files/icu4c/54.1/icu4c-54_1-AIX7_1-VA2.tgz

## dependency for unicode support; just need to extract to root /
http://www.oss4aix.org/download/RPMS/make/make-4.1-1.aix5.3.ppc.rpm ## need the GNU version of make

## libm
https://www.ibm.com/developerworks/community/forums/html/topic?id=72e62875-0603-4d93-a9bf-9d80c6cdc6ea
http://www-01.ibm.com/support/docview.wss?uid=isg1fileset-1318926131

https://www.ibm.com/developerworks/java/jdk/aix/service.html ## jre

http://www.oss4aix.org/download/RPMS/libpng/libpng-1.6.12-1.aix5.1.ppc.rpm


http://www.oss4aix.org/download/RPMS/libpng/libpng-devel-1.6.12-1.aix5.1.ppc.rpm


http://www.oss4aix.org/download/RPMS/libjpeg/libjpeg-9a-1.aix5.1.ppc.rpm


http://www.oss4aix.org/download/RPMS/libjpeg/libjpeg-devel-9a-1.aix5.1.ppc.rpm


http://www.oss4aix.org/download/RPMS/icu/icu-gcc-4.8.1.1-1.aix5.1.ppc.rpm


http://www.oss4aix.org/download/RPMS/icu/libicu-gcc-4.8.1.1-1.aix5.1.ppc.rpm


http://www.oss4aix.org/download/RPMS/icu/libicu-gcc-devel-4.8.1.1-1.aix5.1.ppc.rpm


http://www.oss4aix.org/download/RPMS/icu/libicu-gcc-doc-4.8.1.1-1.aix5.1.ppc.rpm


http://www.oss4aix.org/download/RPMS/gdb/gdb-7.8.1-1.aix5.1.ppc.rpm


http://www.oss4aix.org/download/RPMS/expat/expat-2.1.0-1.aix5.1.ppc.rpm


http://www.oss4aix.org/download/RPMS/curl/curl-7.27.0-1.aix5.1.ppc.rpm


http://www.oss4aix.org/download/RPMS/curl/curl-devel-7.27.0-1.aix5.1.ppc.rpm


http://www.oss4aix.org/download/RPMS/libxml2/libxml2-2.9.1-1.aix5.1.ppc.rpm


http://www.oss4aix.org/download/RPMS/libxml2/libxml2-devel-2.9.1-1.aix5.1.ppc.rpm


http://www.oss4aix.org/download/RPMS/ncurses/ncurses-5.9-1.aix5.1.ppc.rpm


http://www.oss4aix.org/download/RPMS/ncurses/ncurses-devel-5.9-1.aix5.1.ppc.rpm


http://www.oss4aix.org/download/RPMS/gdbm/gdbm-1.11-1.aix5.1.ppc.rpm


http://www.oss4aix.org/download/RPMS/gdbm/gdbm-devel-1.11-1.aix5.1.ppc.rpm


http://www.oss4aix.org/download/RPMS/libtool/libtool-1.5.26-2.aix5.1.ppc.rpm


http://www.oss4aix.org/download/RPMS/libtool/libtool-ltdl-1.5.26-2.aix5.1.ppc.rpm


http://www.oss4aix.org/download/RPMS/libtool/libtool-ltdl-devel-1.5.26-2.aix5.1.ppc.rpm


http://www.oss4aix.org/download/RPMS/zlib/zlib-1.2.8-1.aix5.1.ppc.rpm


http://www.oss4aix.org/download/RPMS/zlib/zlib-devel-1.2.8-1.aix5.1.ppc.rpm


http://www.oss4aix.org/download/RPMS/bzip2/bzip2-1.0.6-1.aix5.1.ppc.rpm


http://www.oss4aix.org/download/RPMS/bzip2/bzip2-devel-1.0.6-1.aix5.1.ppc.rpm


http://www.oss4aix.org/download/RPMS/pcre/pcre-8.37-1.aix5.1.ppc.rpm


http://www.oss4aix.org/download/RPMS/pcre/pcre-devel-8.37-1.aix5.1.ppc.rpm

#### python

http://www.oss4aix.org/download/RPMS/python-setuptools/python-setuptools-0.6.24-1.aix5.1.ppc.rpm


http://www.oss4aix.org/download/RPMS/python/python-2.7.5-1.aix6.1.ppc.rpm


http://www.oss4aix.org/download/RPMS/python/python-devel-2.7.5-1.aix6.1.ppc.rpm


http://www.oss4aix.org/download/RPMS/python/python-libs-2.7.5-1.aix6.1.ppc.rpm


http://www.oss4aix.org/download/RPMS/python/python-test-2.7.5-1.aix6.1.ppc.rpm


http://www.oss4aix.org/download/RPMS/python/python-tools-2.7.5-1.aix6.1.ppc.rpm


http://www.oss4aix.org/download/RPMS/python/tkinter-2.7.5-1.aix6.1.ppc.rpm

Add /opt/freeware/bin to PATH.
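For completeness, a minimal sketch of how these RPMs can be fetched and installed, assuming wget (or another transfer method) is available and you or your admin have root:

wget http://www.oss4aix.org/download/RPMS/make/make-4.1-1.aix5.3.ppc.rpm ## repeat for each package in the list above
rpm -Uvh make-4.1-1.aix5.3.ppc.rpm ## the oss4aix/bull packages install under /opt/freeware
export PATH=/opt/freeware/bin:$PATH ## make the GNU tools visible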

Now, download the R source tarball, extract, and cd.
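A sketch of that step, assuming wget and the standard CRAN source URL (any CRAN mirror works):

wget https://cran.r-project.org/src/base/R-3/R-3.2.2.tar.gz
gunzip -c R-3.2.2.tar.gz | tar -xf - ## or tar -xzf R-3.2.2.tar.gz with GNU tar
cd R-3.2.2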

Next, update src/main/dcf.c with the version from R-devel, since a bug causes readDCF to segfault when using install.packages (no need to do this for future versions of R). Then apply this patch from here to fix the following error:

gcc -maix64 -pthread -std=gnu99 -I../../../../include -DNDEBUG -I../../../include -I../../../../src/include -DHAVE_CONFIG_H -I../../../../src/main -I/usr/local/include -mminimal-toc -O2 -g -mcpu=power6 -c gramRd.c -o gramRd.o
gcc -maix64 -pthread -std=gnu99 -shared -Wl,-brtl -Wl,-G -Wl,-bexpall -Wl,-bnoentry -lc -L/usr/local/lib -o tools.so text.o init.o Rmd5.o md5.o signals.o install.o getfmts.o http.o gramLatex.o gramRd.o -lm -lintl
make[6]: Entering directory '/sas/data04/vinh/R-3.2.2/src/library/tools/src'
mkdir -p -- ../../../../library/tools/libs
make[6]: Leaving directory '/sas/data04/vinh/R-3.2.2/src/library/tools/src'
make[5]: Leaving directory '/sas/data04/vinh/R-3.2.2/src/library/tools/src'
make[4]: Leaving directory '/sas/data04/vinh/R-3.2.2/src/library/tools'
make[4]: Entering directory '/sas/data04/vinh/R-3.2.2/src/library/tools'
installing 'sysdata.rda'
Error: Line starting 'Package: tools ...' is malformed!
Execution halted
../../../share/make/basepkg.mk:150: recipe for target 'sysdata' failed
make[4]: *** [sysdata] Error 1
make[4]: Leaving directory '/sas/data04/vinh/R-3.2.2/src/library/tools'
Makefile:30: recipe for target 'all' failed
make[3]: *** [all] Error 2
make[3]: Leaving directory '/sas/data04/vinh/R-3.2.2/src/library/tools'
Makefile:36: recipe for target 'R' failed
make[2]: *** [R] Error 1
make[2]: Leaving directory '/sas/data04/vinh/R-3.2.2/src/library'
Makefile:28: recipe for target 'R' failed
make[1]: *** [R] Error 1
make[1]: Leaving directory '/sas/data04/vinh/R-3.2.2/src'
Makefile:59: recipe for target 'R' failed
make: *** [R] Error 1

Hopefully, this patch will make it to R-dev so that it is no longer needed for future versions of R.

export OBJECT_MODE=64
export CC="gcc -maix64 -pthread"
export CXX="g++ -maix64 -pthread"
export FC="gfortran -maix64 -pthread"
export F77="gfortran -maix64 -pthread"
export CFLAGS="-O2 -g -mcpu=power6"
export FFLAGS="-O2 -g -mcpu=power6"
export FCFLAGS="-O2 -g -mcpu=power6"
./configure --prefix=/path/to/opt ## custom location so I don't need root
make -j 16
make install
## add /path/to/opt/bin to PATH

The last step may complain that NEWS.pdf is not found and that a certain directory does not exist in the destination. For the former, touch NEWS.pdf in the location where it is expected; for the latter, create the directory yourself.
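A sketch of those workarounds; the exact paths come from the error messages you see, so the ones below are hypothetical:

touch doc/NEWS.pdf ## empty placeholder where the error says NEWS.pdf should be
mkdir -p /path/to/opt/lib64/R/doc ## create whichever destination directory is reported missing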

Repair line breaks within a field of a delimited file

Sometimes people generate delimited files with line break characters (carriage return and/or line feed) inside a field, without quoting. I previously wrote about the case when the problematic fields are quoted. I also wrote about using non-ASCII characters as field and record separators to avoid clashes.

The following script reads stdin and writes repaired lines to stdout. It ensures every output line has at least as many delimiters (|) as the first/header line (call this the target number of delimiters) by concatenating lines (removing the line breaks) until concatenating the next line would yield more delimiters than the target. The script is a bit more involved than you might expect in order to handle two cases: a field containing more than one line break (so keep concatenating, not just once), and a line that already has more delimiters than the target (which could cause an infinite loop if we required the delimiter count to exactly equal the target).

#! /usr/bin/env python

dlm='|'

import sys
from signal import signal, SIGPIPE, SIG_DFL # http://stackoverflow.com/questions/14207708/ioerror-errno-32-broken-pipe-python
signal(SIGPIPE,SIG_DFL) ## no error when exiting a pipe like less

line = sys.stdin.readline()
n_dlm = line.count(dlm)

line0 = line
line_next = 'a'
while line:
    if line.count(dlm) > n_dlm or line_next=='':
        sys.stdout.write(line0)
        line = line_next
        # line = sys.stdin.readline()
        if line.count(dlm) > n_dlm: ## line with more delimiters than target?
            line0 = line_next
            line_next = sys.stdin.readline()
            line = line.replace('\r', ' ').replace('\n', ' ') + line_next
    else:
        line0 = line
        line_next = sys.stdin.readline()
        line = line.replace('\r', ' ').replace('\n', ' ') + line_next
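Assuming the script above is saved as repair_lines.py (a name I am making up here) and made executable, usage looks like:

chmod +x repair_lines.py
cat broken_file.psv | ./repair_lines.py > fixed_file.psv
head -5 fixed_file.psv ## eyeball the repaired lines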

Optimized R and Python: standard BLAS vs. ATLAS vs. OpenBLAS vs. MKL


Revolution Analytics recently released Revolution R Open (RRO), a downstream version of R built with Intel's Math Kernel Library (MKL). The post mentions that comparable improvements are observed on Mac OS X, where the ATLAS BLAS library is used. A reader in the comments section also expressed hesitation over the lack of a comparison with ATLAS and OpenBLAS. Using a different BLAS is documented in the R Installation and Administration manual, and has been benchmarked before here and here. As an avid R user, I should be using a more optimal build of R if one exists and is easy to obtain (install/compile), especially if the improvements are up to 40% as reported by Domino Data Lab. I decided to follow the framework set out by this post and compare timings for the different versions of R on a t2.micro instance on Amazon EC2 running Ubuntu 14.04.

First, I install R and the various versions of BLAS and lapack and download the benchmark script:

sudo apt-get install libblas3gf libopenblas-base libatlas3gf-base liblapack3gf libopenblas-dev liblapack-dev libatlas-dev r-base r-base-dev
wget http://r.research.att.com/benchmarks/R-benchmark-25.R
echo "install.packages('SuppDists', dep=TRUE, repo='http://cran.stat.ucla.edu')" | sudo R --vanilla ## needed for R-benchmark-25.R

One can switch which BLAS and LAPACK libraries are used via the following commands:

sudo update-alternatives --config libblas.so.3 ## select from 3 versions of blas: blas, atlas, openblas
sudo update-alternatives --config liblapack.so.3 ## select from 2 versions of lapack: lapack and atlas-lapack

Run R, issue Ctrl-z to send the process to the background, and verify which BLAS and LAPACK libraries R has loaded:

ps aux | grep R ## find the process id for R
lsof -p PROCESS_ID_JUST_FOUND | grep 'blas\|lapack'

Now run the benchmarks on different versions:

# selection: libblas + lapack
cat R-benchmark-25.R | time R --slave
...
171.71user 1.22system 2:53.01elapsed 99%CPU (0avgtext+0avgdata 425068maxresident)k
4960inputs+0outputs (32major+164552minor)pagefaults 0swaps
173.01
# selection: atlas + lapack
cat R-benchmark-25.R | time R --slave
...
69.05user 1.16system 1:10.27elapsed 99%CPU (0avgtext+0avgdata 432620maxresident)k
2824inputs+0outputs (15major+130664minor)pagefaults 0swaps
70.27
# selection: openblas + lapack
cat R-benchmark-25.R | time R --slave
...
70.69user 1.19system 1:11.93elapsed 99%CPU (0avgtext+0avgdata 429136maxresident)k
1592inputs+0outputs (6major+131181minor)pagefaults 0swaps
71.93
# selection: atlas + atlas-lapack
cat R-benchmark-25.R | time R --slave
...
68.02user 1.14system 1:09.21elapsed 99%CPU (0avgtext+0avgdata 432240maxresident)k
2904inputs+0outputs (12major+124761minor)pagefaults 0swaps
69.93

As can be seen, there’s about a 60% improvement using OpenBLAS or ATLAS over the standard libblas+lapack. What about MKL? Let’s test RRO:

sudo apt-get remove r-base r-base-dev
wget http://mran.revolutionanalytics.com/install/RRO-8.0-Beta-Ubuntu-14.04.x86_64.tar.gz
tar -xzf RRO-8.0-Beta-Ubuntu-14.04.x86_64.tar.gz
./install.sh
# check that it is using a different version of blas and lapack using lsof again
cat R-benchmark-25.R | time R --slave
...
51.19user 0.98system 0:52.24elapsed 99%CPU (0avgtext+0avgdata 417840maxresident)k
2208inputs+0outputs (11major+131128minor)pagefaults 0swaps
52.24

This is a 70% improvement over the standard libblas+lapack version, and a 25% improvement over the ATLAS/OpenBLAS version. This is quite a substantial improvement!

Python

Although I don't use Python much for data analysis (I use it as a general language for everything else), I wanted to repeat similar benchmarks for numpy and scipy, as improvements have been documented. To do so, I compiled numpy and scipy from source (via pip, which builds them from source here) and downloaded some benchmark scripts:

sudo pip install numpy
less /usr/local/lib/python2.7/dist-packages/numpy/__config__.py ## openblas?
sudo pip install scipy
# test different blas
python
ps aux | grep python
lsof -p 20812 | grep 'blas\|lapack' ## replace 20812 with the process id found above
wget https://gist.github.com/osdf/3842524/raw/df01f7fa9d849bec353d6ab03eae0c1ee68f1538/test_numpy.py
wget https://gist.github.com/osdf/3842524/raw/22e21f5d57a9526cbcd9981385504acdc7bdc788/test_scipy.py

One could switch blas and lapack like before. Results are as follows:

# selection: blas + lapack
time python test_numpy.py
FAST BLAS
version: 1.9.1
maxint: 9223372036854775807

dot: 0.214728403091 sec

real    0m1.253s
user    0m1.119s
sys     0m0.036s

time python test_scipy.py
cholesky: 0.166237211227 sec
svd: 3.56523122787 sec

real    0m19.183s
user    0m19.105s
sys     0m0.064s

# selection: atlas + lapack
time python test_numpy.py
FAST BLAS
version: 1.9.1
maxint: 9223372036854775807

dot: 0.211034584045 sec

real    0m1.132s
user    0m1.121s
sys     0m0.008s

time python test_scipy.py
cholesky: 0.0454761981964 sec
svd: 1.33822960854 sec

real    0m7.442s
user    0m7.346s
sys     0m0.084s

# selection: openblas + lapack
time python test_numpy.py
FAST BLAS
version: 1.9.1
maxint: 9223372036854775807

dot: 0.212402009964 sec

real    0m1.139s
user    0m1.130s
sys     0m0.004s

time python test_scipy.py
cholesky: 0.0431131839752 sec
svd: 1.09770617485 sec

real    0m6.227s
user    0m6.143s
sys     0m0.076s

# selection: atlas + atlas-lapack
time python test_numpy.py
FAST BLAS
version: 1.9.1
maxint: 9223372036854775807

dot: 0.217267608643 sec

real    0m1.162s
user    0m1.143s
sys     0m0.016s

time python test_scipy.py
cholesky: 0.0429849624634 sec
svd: 1.31666741371 sec

real    0m7.318s
user    0m7.213s
sys     0m0.092s

Here, if I only focus on the svd results, then OpenBLAS yields a 70% improvement and ATLAS yields a 63% improvement. What about MKL? Well, a readily available version costs money, so I wasn’t able to test.

Conclusion

Here are my take-aways:

  • Switching BLAS/LAPACK libraries is extremely easy on Ubuntu; there is no need to compile anything, since you can install the libraries and select which version to use via update-alternatives.
  • Install and use RRO (MKL) when possible, as it is the fastest.
  • When that isn't possible, use ATLAS or OpenBLAS. For example, we have AIX at work. Getting R installed there is already a difficult task, so optimizing R is a low priority. However, if it's possible to use OpenBLAS or ATLAS, use them (note: MKL is irrelevant here since AIX runs on POWER CPUs).
  • For Python, use OpenBLAS or ATLAS.

For those who want to compile R with MKL themselves, check this. For those who want to do the same for Python, check this.
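For reference, here is a rough sketch of the MKL route based on the R Installation and Administration manual; MKL_LIB_PATH and the library list depend on your MKL version and install location, so treat this as an assumption rather than a recipe:

MKL_LIB_PATH=/opt/intel/mkl/lib/intel64 ## hypothetical path to your MKL libraries
export LD_LIBRARY_PATH=$MKL_LIB_PATH:$LD_LIBRARY_PATH
MKL="-L${MKL_LIB_PATH} -lmkl_gf_lp64 -lmkl_core -lmkl_sequential -lpthread -lm" ## sequential link line; adjust for your MKL version
./configure --with-blas="$MKL" --with-lapack
make && sudo make install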

Finally, some visualizations to summarize the findings (2014-11-10-R_blas+atlas+openblas+mkl.png and 2014-11-10-Python_blas+atlas+openblas.png), generated with the code below:

# R results
library(ggplot2)
timings <- c(173.01, 70.27, 71.93, 69.93, 52.24)
versions <- c('blas + lapack', 'atlas + lapack', 'openblas + lapack', 'atlas + atlas-lapack', 'MKL')
versions <- factor(versions, levels=versions)
d1 <- data.frame(timings, versions)
ggplot(data=d1, aes(x=versions, y=timings / max(timings))) + 
  geom_bar(stat='identity') + 
  geom_text(aes(x=versions, y=timings / max(timings), label=sprintf('%.f%%', timings / max(timings) * 100)), vjust=-.8) +
  labs(title='R - R-benchmark-25.R')
ggsave('R_blas+atlas+openblas+mkl.png')

# Python results
timings <- c(3.57, 1.34, 1.10, 1.32)
versions <- c('blas + lapack', 'atlas + lapack', 'openblas + lapack', 'atlas + atlas-lapack')
versions <- factor(versions, levels=versions)
d1 <- data.frame(timings, versions)
ggplot(data=d1, aes(x=versions, y=timings / max(timings))) + 
  geom_bar(stat='identity') + 
  geom_text(aes(x=versions, y=timings / max(timings), label=sprintf('%.f%%', timings / max(timings) * 100)), vjust=-.8) +
  labs(title='Python - test_scipy.py (SVD)')
ggsave('Python_blas+atlas+openblas.png')

Change delimiter in a csv file and remove line breaks in fields

I wrote a script to convert delimiters in CSV files, e.g., commas to pipes. I prefer pipe-delimited files because the pipe delimiter (|) will not clash with data in the fields 99.999% of the time. I also added the option to convert newline (\n) and carriage return (\r) characters in the data fields to spaces. This comes in handy when I use PROC IMPORT in SAS, as line breaks cause it to choke.

Here’s my csvconvert.py script:

#! /usr/bin/env python

#### Command line arguments
import argparse
parser = argparse.ArgumentParser(description="Convert delimited file from one delimiter to another; defaults to converting CSV to pipe-delimited.")
parser.add_argument("--dlm-input", action="store", dest="dlm_in", default=",", required=False, help="delimiter of the input file; defaults to comma (,)", nargs='?', metavar="','")
parser.add_argument("--dlm-output", action="store", dest="dlm_out", default="|", required=False, help="delimiter of the output file; defaults to pipe (|)", nargs='?', metavar="'|'")
parser.add_argument("--remove-line-char", action="store_true", dest="remove_line_char", default=False, help="remove \\n and \\r characters in fields and replace with spaces")
parser.add_argument("--quote-char", action="store", dest="quote_char", default='"', required=False, help="quote character; defaults to double quote (\")", nargs='?', metavar="\"")
parser.add_argument("-i", "--input", action="store", dest="input", required=False, help="input file; if not specified, take from standard input.", nargs='?', metavar="file.csv")
parser.add_argument("-o", "--output", action="store", dest="output", required=False, help="output file; if not specified, write to standard output", nargs='?', metavar="file.pipe")
parser.add_argument("-v", "--verbose", action="store_true", dest="verbose", default=False, help="increase verbosity")
args  =  parser.parse_args()
# print args

# http://snipplr.com/view/45759/convert-csv-file-to-pipe-delineated-file/
import csv
import sys
from signal import signal, SIGPIPE, SIG_DFL # http://stackoverflow.com/questions/14207708/ioerror-errno-32-broken-pipe-python
signal(SIGPIPE,SIG_DFL) ## no error when exiting a pipe like less

if args.input:
    csv_reader = csv.reader(open(args.input, 'rb'), delimiter=args.dlm_in, quotechar=args.quote_char)
else:
    csv_reader = csv.reader(sys.stdin, delimiter=args.dlm_in, quotechar=args.quote_char)

if args.output:
    h_outfile = open(args.output, 'wb')
else:
    h_outfile = sys.stdout

for row in csv_reader:
    row = args.dlm_out.join(row)
    if args.remove_line_char:
        row  =  row.replace('\n', ' ').replace('\r', ' ')
    h_outfile.write("%s\n" % (row))
    h_outfile.flush()
    # print row

Help description:

usage: csvconvert.py [-h] [--dlm-input [',']] [--dlm-output ['|']]
                     [--remove-line-char] [--quote-char ["]] [-i [file.csv]]
                     [-o [file.pipe]] [-v]

Convert delimited file from one delimiter to another; defaults to converting CSV to pipe-delimited.

optional arguments:
  -h, --help            show this help message and exit
  --dlm-input [',']     delimiter of the input file; defaults to comma (,)
  --dlm-output ['|']    delimiter of the output file; defaults to pipe (|)
  --remove-line-char    remove \n and \r characters in fields and replace with spaces
  --quote-char ["]      quote character; defaults to double quote (")
  -i [file.csv], --input [file.csv]
                        input file; if not specified, take from standard input.
  -o [file.pipe], --output [file.pipe]
                        output file; if not specified, write to standard output
  -v, --verbose         increase verbosity

Usage:

cat myfile.csv | csvconvert.py --remove-line-char > myfile.pipe
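The script can also read and write files directly via the flags defined above, for example:

./csvconvert.py -i myfile.csv -o myfile.pipe --remove-line-char
./csvconvert.py --dlm-input ';' --dlm-output '|' -i myfile.semi -o myfile.pipe ## semicolon-delimited to pipe-delimited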

Issues with https proxy in Python via suds and urllib2

I recently needed to access a SOAP API to obtain some data. SOAP works by posting an XML file to a site URL in a format defined by the API's schema; the API then returns data, also in the form of an XML file. Based on this post, I figured suds was the easiest way to access the API from Python so that I could query data repeatedly in an automated (and hence parallelizable) fashion. suds did turn out to be relatively easy to use:

from suds.client import Client
url = 'http://www.ripedev.com/webservices/localtime.asmx?WSDL'
client = Client(url)
print client
client.service.LocalTimeByZipCode('90210')

This worked on my home network. At work, I had to use a proxy to access the outside world; otherwise, I'd get a connection refused message: urllib2.URLError: <urlopen error [Errno 111] Connection refused>. The modification to use a proxy was straightforward:

from suds.client import Client
proxy = {'http': 'proxy_username:proxy_password@proxy_server.com:port'}
url = 'http://www.ripedev.com/webservices/localtime.asmx?WSDL'
# client = Client(url)
client = Client(url, proxy=proxy)
print client
client.service.LocalTimeByZipCode('90210')

The previous examples were from a public SOAP API I found online. Now, the site I wanted to actually hit uses ssl for encryption (i.e., https site) and requires authentication. I thought the fix would be as simple as:

from suds.client import Client
proxy = {'https': 'proxy_username:proxy_password@proxy_server.com:port'}
url = 'https://some_server.com/path/to/soap_api?wsdl'
un = 'site_username'
pw = 'site_password'
# client = Client(url)
client = Client(url, proxy=proxy, username=un, password=pw)
print client
client.service.someFunction(args)

However, I got the error message Exception: (404, u'/path/to/soap_api'). Very weird to me. Is it an authentication issue? Is it a proxy issue? If a proxy issue, how so, given that my previous toy example worked? I tried the same site on my home network, where there is no firewall, and things worked:

from suds.client import Client
url = 'https://some_server.com/path/to/soap_api?wsdl'
un = 'site_username'
pw = 'site_password'
# client = Client(url)
client = Client(url, username=un, password=pw)
print client
client.service.someFunction(args)

Conclusion? Must be a proxy issue with https. I used the following prior to calling suds to help with debugging:

import logging
logging.basicConfig(level=logging.INFO)
logging.getLogger('suds.client').setLevel(logging.DEBUG)
logging.getLogger('suds.transport').setLevel(logging.DEBUG)
logging.getLogger('suds.xsd.schema').setLevel(logging.DEBUG)
logging.getLogger('suds.wsdl').setLevel(logging.DEBUG)

My initial thoughts after some debugging: there must be something wrong with the proxy as the log shows python sending the request to the target url, but I get back a response that shows the path (minus the domain name) not found. What happened to the domain name? I notified the firewall team to look into this, as it appears the proxy is modifying something (url is not complete?). The firewall team investigated, and found that the proxy is returning a message that warns the ClientHello message is too large. This is one clue. The log also shows that the user was never authenticated and that the ssl handshake was never completed. My thought: still a proxy issue, as the python code works at home. However, the proxy team was able to access the https SOAP API through the proxy using the SOA Client plugin for Firefox. Now that convinced me that something else may be the culprit.

Googled for help, and thought this recipe would be helpful:

import urllib2
import urllib
import httplib
import socket

class ProxyHTTPConnection(httplib.HTTPConnection):
    _ports = {'http' : 80, 'https' : 443}
    def request(self, method, url, body=None, headers={}):
        #request is called before connect, so can interpret url and get
        #real host/port to be used to make CONNECT request to proxy
        proto, rest = urllib.splittype(url)
        if proto is None:
            raise ValueError, "unknown URL type: %s" % url
        #get host
        host, rest = urllib.splithost(rest)
        #try to get port
        host, port = urllib.splitport(host)
        #if port is not defined try to get from proto
        if port is None:
            try:
                port = self._ports[proto]
            except KeyError:
                raise ValueError, "unknown protocol for: %s" % url
        self._real_host = host
        self._real_port = port
        httplib.HTTPConnection.request(self, method, url, body, headers)
    def connect(self):
        httplib.HTTPConnection.connect(self)
        #send proxy CONNECT request
        self.send("CONNECT %s:%d HTTP/1.0\r\n\r\n" % (self._real_host, self._real_port))
        #expect a HTTP/1.0 200 Connection established
        response = self.response_class(self.sock, strict=self.strict, method=self._method)
        (version, code, message) = response._read_status()
        #probably here we can handle auth requests...
        if code != 200:
            #proxy returned an error; abort the connection and raise an exception
            self.close()
            raise socket.error, "Proxy connection failed: %d %s" % (code, message.strip())
        #eat up header block from proxy....
        while True:
            #should probably not use fp directly
            line = response.fp.readline()
            if line == '\r\n': break

class ProxyHTTPSConnection(ProxyHTTPConnection):
    default_port = 443
    def __init__(self, host, port = None, key_file = None, cert_file = None, strict = None, timeout=0): # vinh added timeout
        ProxyHTTPConnection.__init__(self, host, port)
        self.key_file = key_file
        self.cert_file = cert_file
    def connect(self):
        ProxyHTTPConnection.connect(self)
        #make the sock ssl-aware
        ssl = socket.ssl(self.sock, self.key_file, self.cert_file)
        self.sock = httplib.FakeSocket(self.sock, ssl)

class ConnectHTTPHandler(urllib2.HTTPHandler):
    def do_open(self, http_class, req):
        return urllib2.HTTPHandler.do_open(self, ProxyHTTPConnection, req)

class ConnectHTTPSHandler(urllib2.HTTPSHandler):
    def do_open(self, http_class, req):
        return urllib2.HTTPSHandler.do_open(self, ProxyHTTPSConnection, req)

from suds.client import Client
# from httpsproxy import ConnectHTTPSHandler, ConnectHTTPHandler ## i.e., the classes defined above
import urllib2, urllib
from suds.transport.http import HttpTransport
opener = urllib2.build_opener(ConnectHTTPHandler, ConnectHTTPSHandler)
urllib2.install_opener(opener)
t = HttpTransport()
t.urlopener = opener
url = 'https://some_server.com/path/to/soap_api?wsdl'
proxy = {'https': 'proxy_username:proxy_password@proxy_server.com:port'}
un = 'site_username'
pw = 'site_password'
client = Client(url=url, transport=t, proxy=proxy, username=un, password=pw)
client = Client(url=url, transport=t, proxy=proxy, username=un, password=pw, location='https://some_server.com/path/to/soap_api?wsdl') ## some site suggests specifying location

This too did not work. I continued to google and found that lots of people have issues with https and proxies. I knew suds depended on urllib2, so I googled about that as well; people have issues with urllib2, https, and proxies too. I then decided to investigate using urllib2 directly to contact the https url through a proxy:

## http://stackoverflow.com/questions/5227333/xml-soap-post-error-what-am-i-doing-wrong
## http://stackoverflow.com/questions/34079/how-to-specify-an-authenticated-proxy-for-a-python-http-connect
### at home this works
import urllib2
url = 'https://some_server.com/path/to/soap_api?wsdl'
password_mgr = urllib2.HTTPPasswordMgrWithDefaultRealm()
password_mgr.add_password(None,
                          uri=url,
                          user='site_username',
                          passwd='site_password')
auth_handler = urllib2.HTTPBasicAuthHandler(password_mgr)
opener = urllib2.build_opener(auth_handler)
urllib2.install_opener(opener)
page = urllib2.urlopen(url)
page.read()

### on the work network, this does not work:
url = 'https://some_server.com/path/to/soap_api?wsdl'
proxy = urllib2.ProxyHandler({'https':'proxy_username:proxy_password@proxy_server.com:port', 'http':'proxy_username:proxy_password@proxy_server.com:port'})
password_mgr = urllib2.HTTPPasswordMgrWithDefaultRealm()
password_mgr.add_password(None,
                          uri=url,
                          user='site_username',
                          passwd='site_password')
auth_handler = urllib2.HTTPBasicAuthHandler(password_mgr)
opener = urllib2.build_opener(proxy, auth_handler, urllib2.HTTPSHandler)
urllib2.install_opener(opener)
page = urllib2.urlopen(url)
### also tried re-doing above, but with the custom handler as defined in the previous code chunk (http://code.activestate.com/recipes/456195/) running first (run the list of classes)

No luck. I re-read this post that I had run into before, and really agreed that urllib2 is severely flawed, especially when using an https proxy. At the end of the page, the author suggested the requests package. I tried it out and was able to connect through the https proxy:

import requests
import xmltodict
p1 = 'http://proxy_username:proxy_password@proxy_server.com:port'
p2 = 'https://proxy_username:proxy_password@proxy_server.com:port'
proxy = {'http': p1, 'https':p2}

site = 'https://some_server.com/path/to/soap_api?wsdl'
r = requests.get(site, proxies=proxy, auth=('site_username', 'site_password'))
r.text ## works
soap_xml_in = """<?xml version="1.0" encoding="UTF-8"?>
...
"""
headers = {'SOAPAction': u'""', 'Content-Type': 'text/xml; charset=utf-8', 'Content-type': 'text/xml; charset=utf-8', 'Soapaction': u'""'}
soap_xml_out = requests.post(site, data=soap_xml_in, headers=headers, proxies=proxy, auth=('site_username', 'site_password')).text

My learnings?

  • suds is great for accessing SOAP, just not when you have to access an https site through a firewall.
  • urllib2 is severely flawed. Things only work in very standard situations.
  • The requests package is very powerful and just works. Even though I have to deal with raw xml files as opposed to leveraging suds' pythonic structures, the xmltodict package translates the xml into dictionaries, which adds only marginal effort to extracting the relevant data.

NOTE: I had to install libuuid-devel in cygwin64 because I was getting an installation error.

Upgrading Ubuntu 12.04 to 14.04 breaks encrypted LVM

My laptop runs Ubuntu and is fully encrypted (since version 10.04). The upgrade from 10.04 to 12.04 was smooth in the sense that my system booted fine, asking for the passphrase to unlock the LVM. However, when I upgraded from 12.04 to 14.04, things broke and my laptop no longer booted properly, as the LVM never got decrypted. I had to do the following to get my laptop working again (after many rounds of trial and error):

  • Booted a live USB Ubuntu session, decrypted the LVM, and chroot'ed into the original OS (see the sketch after this list)
  • Finished the upgrade via apt-get update && apt-get upgrade
  • It appears Ubuntu 14.04 installed some new package (did not write name down) that manages LVM or disks somehow (based on googling the error message). I removed this package.
  • Saw lvm issues, so installed the package lvm2
  • I made sure both dm-crypt and lvm2 were installed, and were accessible in initramfs, as cryptsetup was removed from initramfs since version 13.10. Had to do something with the following CRYPTSETUP issue.
  • Based on this post, I modified various files, but things still did not boot properly. I believe what finally fixed it was explicitly pointing to the LVM by /dev/sda5 in the GRUB_CMDLINE_LINUX line in /etc/default/grub.
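For anyone hitting the same problem, here is a rough sketch of the live-session recovery steps. The device names follow the config files below (/dev/sda5, sdb5_crypt, vg01); the logical volume name and the /boot partition are assumptions, so adjust to your layout:

## from the live USB session: unlock the encrypted partition and activate LVM
sudo cryptsetup luksOpen /dev/sda5 sdb5_crypt
sudo vgchange -ay vg01
## mount the root logical volume and /boot (names are hypothetical)
sudo mount /dev/vg01/root /mnt
sudo mount /dev/sda1 /mnt/boot
## bind-mount the virtual filesystems and chroot into the installed system
for d in dev dev/pts proc sys run; do sudo mount --bind /$d /mnt/$d; done
sudo chroot /mnt /bin/bash
## finish the upgrade and rebuild initramfs/grub from inside the chroot
apt-get update && apt-get upgrade
update-initramfs -k all -c
update-grub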

The following is a summary of these files on my system. /etc/crypttab:

# <target name> <source device>         <key file>      <options>
# sdb5_crypt UUID=731a44c4-4655-4f2b-ae1a-2e3e6a14f2ef none luks
sdb5_crypt UUID=731a44c4-4655-4f2b-ae1a-2e3e6a14f2ef none luks,retry=1,lvm=vg01

/etc/initramfs-tools/conf.d/cryptroot:

## vinh created http://www.joh.fi/posts/2014/03/18/install-ubuntu-1310-on-top-of-encrypted-lvm/
# CRYPTROOT=target=sdb5_crypt,source=/dev/disk/by-uuid/f1ba5a54-ac7e-419d-8762-43da3274aba4
CRYPTOPTS=target=sdb5_crypt,source=UUID=f1ba5a54-ac7e-419d-8762-43da3274aba4,lvm=vg01

Then run update-initramfs -k all -c in order to update the initramfs images.

Have this line in /etc/default/grub:

#GRUB_CMDLINE_LINUX="cryptopts=target=sdb5_crypt,source=/dev/disk/by-uuid/f1ba5a54-ac7e-419d-8762-43da3274aba4,lvm=vg01"
#GRUB_CMDLINE_LINUX="cryptopts=target=sdb5_crypt,source=UUID=f1ba5a54-ac7e-419d-8762-43da3274aba4,lvm=vg01"
GRUB_CMDLINE_LINUX="cryptopts=target=sdb5_crypt,source=/dev/sda5,lvm=vg01"

Run update-grub.

Again, I think the key is the source definition in the previous line. I kept trying to refer to it by uuid but that did not work.

optparse R package for creating command line scripts with arguments

I just discovered the optparse package in R, which lets me write command line R scripts that accept arguments (similar to Python's argparse). Here's an example:

#! /usr/bin/env Rscript

# http://cran.r-project.org/web/packages/optparse/vignettes/optparse.pdf
suppressPackageStartupMessages(library("optparse"))

option_list <- list(
    make_option(c('-d', '--date'), action='store', dest='date', type='character', default=Sys.Date(), metavar='"YYYY-MM-DD"', help='As of date to extract data from.  Defaults to today.')
)

opt <- parse_args(OptionParser(option_list=option_list))

# print(opt$date)
cat(sprintf('select * where contract_date > "%s"\n', opt$date))

Save this as my_script.R, and do chmod +x my_script.R. Now check out ./my_script.R --help.
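For example (the query string is just the one printed by the script above):

chmod +x my_script.R
./my_script.R --help ## show the generated help
./my_script.R --date 2015-01-01 ## prints: select * where contract_date > "2015-01-01"
./my_script.R ## no argument: defaults to today's date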

Package management in R and Python at work without root and behind firewall

My current job has strict security measures (no root access on the Linux servers and no direct access outside the company's network), so it can be difficult to get the tools necessary for my work, namely R packages from CRAN and Python packages via pip.

On my Windows workstation, I was able to install R by downloading the installer online and Python via Cygwin. However, R and Python are unable to connect to the internet to download and install additional packages because of the company’s firewall. To get around this for R, I could:

  • add the flag --internet2 to the execution path in R’s shortcut,
  • call setInternet2(TRUE) in the R console, or
  • set the environment variable http_proxy=http://username:password@proxy_server:port/.

The first two tell R to use the proxy defined in Internet Explorer. I was able to access CRAN via my web browser, so this works. If CRAN is blocked in the browser, find out what proxy server is available at work and use that to access the outside world. If CRAN is also blocked on the proxy, put in a request to add it to the whitelist.

As for Python, install pip and use a proxy to download and install packages:

wget https://bootstrap.pypa.io/get-pip.py
python get-pip.py --proxy="username:password@proxy_server:port"
pip install --proxy="username:password@proxy_server:port" argparse numpy pandas ## etc

NOTE: pip 1.3.1 has issues with proxy servers, so use the latest version.

On a Linux/Unix server, the added complexity is that of a lack of root access. Typically, Python is available by default on any modern distro. If not, have the admin team install R and Python via the distribution’s package manager, and if they can’t, then compile the two from source and install them locally. Once installed, use the same method as before for Python pip, but with the --user flag in order to install the packages locally in ~/.local/ (pip command is at ~/.local/bin/pip). For R, set the environment variables

export http_proxy="http://proxy_server:port/"
export http_proxy_user="username:password"

and install the libraries to ~/Rlib (add this to the library path via .libPaths() in ~/.Rprofile).
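A sketch of this no-root setup on the Linux server; the package names, CRAN mirror, and proxy values are placeholders to adjust for your environment:

## Python: install pip and packages into ~/.local without root
python get-pip.py --user --proxy="username:password@proxy_server:port"
~/.local/bin/pip install --user --proxy="username:password@proxy_server:port" numpy pandas

## R: create a local library and point .libPaths() at it via ~/.Rprofile
mkdir -p ~/Rlib
echo '.libPaths(c("~/Rlib", .libPaths()))' >> ~/.Rprofile

## install packages there through the proxy set above
export http_proxy="http://proxy_server:port/"
export http_proxy_user="username:password"
echo 'install.packages("data.table", lib="~/Rlib", repos="http://cran.stat.ucla.edu")' | R --vanilla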

Bulk resize images and keeping original files

Suppose you have a directory (with subdirectories) full of images, and you want to resize them all while keeping the original images. To do so, first create a copy of the directory tree without the image files. Then, using a for loop, find each image file and apply the convert command to it. The following example resizes jpg files to 40% of their original dimensions.

mkdir -p /path/to/mirror_dir
cd /path/to/image_dir
find . -type d -exec mkdir -p /path/to/mirror_dir/{} \; ## recreate the directory tree under the mirror
for i in $(find . -iname "*.jpg"); do echo $i; convert -resize 40% $i /path/to/mirror_dir/$i; done