Compile R 3.2.2 on AIX 6.1

Here are my notes compiling 64-bit R 3.2.2 on AIX 6.1. As a pre-requisite, read the AIX notes from R-admin. Like the notes, I had GCC installed from here by our admin, along with many other pre-requisites. These were installed prior to compiling R. Note that you could grab newer versions of each package by going to (needed for R-dev).

## list of packages

## conversting unicode, ascii, etc; aix version is not compatible with

## dependency for unicode support; just need to extract to root / ##
need the gnu version of make

## libm ## jre

#### python

Add /opt/freeware/bin to PATH.

Now, download the R source tarball, extract, and cd.

Next, update src/main/dcf.c from the R-dev as a bug that causes readDCF to segfault when using install.packages; no need to do this for future versions of R. Then apply this patch from here to fix the following error:

gcc -maix64 -pthread -std=gnu99 -I../../../../include -DNDEBUG
-I../../../include -I../../../../src/include -DHAVE_CONFIG_H
-I../../../../src/main -I/usr/local/include  -mminimal-toc    -O2 -g
-mcpu=power6  -c gramRd.c -o gramRd.o

gcc -maix64 -pthread -std=gnu99 -shared -Wl,-brtl -Wl,-G -Wl,-bexpall
-Wl,-bnoentry -lc -L/usr/local/lib -o text.o init.o Rmd5.o
md5.o signals.o install.o getfmts.o http.o gramLatex.o gramRd.o -lm

make[6]: Entering directory '/sas/data04/vinh/R-3.2.2/src/library/tools/src'

mkdir -p -- ../../../../library/tools/libs

make[6]: Leaving directory '/sas/data04/vinh/R-3.2.2/src/library/tools/src'

make[5]: Leaving directory '/sas/data04/vinh/R-3.2.2/src/library/tools/src'

make[4]: Leaving directory '/sas/data04/vinh/R-3.2.2/src/library/tools'

make[4]: Entering directory '/sas/data04/vinh/R-3.2.2/src/library/tools'

installing 'sysdata.rda'

Error: Line starting 'Package: tools ...' is malformed!

Execution halted

../../../share/make/ recipe for target 'sysdata' failed

make[4]: *** [sysdata] Error 1

make[4]: Leaving directory '/sas/data04/vinh/R-3.2.2/src/library/tools'

Makefile:30: recipe for target 'all' failed

make[3]: *** [all] Error 2

make[3]: Leaving directory '/sas/data04/vinh/R-3.2.2/src/library/tools'

Makefile:36: recipe for target 'R' failed

make[2]: *** [R] Error 1

make[2]: Leaving directory '/sas/data04/vinh/R-3.2.2/src/library'

Makefile:28: recipe for target 'R' failed

make[1]: *** [R] Error 1

make[1]: Leaving directory '/sas/data04/vinh/R-3.2.2/src'

Makefile:59: recipe for target 'R' failed

make: *** [R] Error 1

Hopefully, this patch will make it to R-dev so that it is no longer needed for future versions of R.

export OBJECT_MODE=64
export CC="gcc -maix64 -pthread"
export CXX="g++ -maix64 -pthread"
export FC="gfortran -maix64 -pthread"
export F77="gfortran -maix64 -pthread"
export CFLAGS="-O2 -g -mcpu=power6"
export FFLAGS="-O2 -g -mcpu=power6"
export FCFLAGS="-O2 -g -mcpu=power6"
./configure --prefix=/path/to/opt ## custom location so I don't need root
make -j 16
make install
## add /path/to/opt/bin to PATH

The last step may complain about NEWS.pdf not found in a directory and a certain directory is not found in the destination. For the former, just do touch NEWS.pdf to where it’s supposed to be; for the latter, create the directory yourself.

Automatically specify line break options with termstr as CRLF or LF in SAS when importing data

It could be annoying when dealing with data from multiple platforms: Windows uses the carriage return (CR) and line feed (LF) to indicate a new line, UNIX uses LF, and Mac uses CR. Most companies have SAS running on a UNIX/Linux server, and it’s hard to tell which characters indicate a new line without going to a terminal to inspect the file.

Here’s a sas macro that creates a filename handle that could be used in PROC IMPORT or a DATA step. It will automatically detect CRLF and if not, default to LF. This assumes SAS is running on a UNIX server with access to the head and awk commands.

%macro handle_crlf(file, handle_name, other_filename_options=) ;
/* if there is a carriage return at the end, then return 1 (stored in macro variable SYSRC) */
%sysexec head -n 1 "&file" | awk '/\r$/ { exit(1) }' ;
%if &SYSRC=1 %then %let termstr=crlf ;
%else %let termstr=lf ;
filename &handle_name "&file" termstr=&termstr &other_filename_options ;
%mend ;
%handle_crlf(file=/path/to/file.txt, handle_name=fh) ;
proc import data=fh dbms=dlm replace outdata=d1 ;
   delimiter='|' ;
run ;

Repair line breaks within a field of a delimited file

Sometimes some people generate delimited files with line break characters (carriage return and/or line feed) inside a field without quoting. I previously wrote about the case when the problematic fields are quoted. I also wrote about using non-ascii characters as field and new record indicators to avoid clashes.

The following script reads in stdin and writes to stdout repaired lines by ensuring every output line has at least the number of delimiters (|) as the first/header line (call this the target number of delimiters) by continually concatenating lines (remove line breaks) until it reaches the point when concatenating the next line would yield more delimiters than the target number of delimiters. The script appears more complicated than it should be in order to address the case when there are more than one line breaks in a field (so don’t just concatenate one line but keep doing so) and the case when a line has more delimiters than the target number of delimiter (this could lead to an infinite loop if we restrict the number of delimiters to equal the target).

#! /usr/bin/env python


import sys
from signal import signal, SIGPIPE, SIG_DFL #
signal(SIGPIPE,SIG_DFL) ## no error when exiting a pipe like less

line = sys.stdin.readline()
n_dlm = line.count(dlm)

line0 = line
line_next = 'a'
while line:
    if line.count(dlm) > n_dlm or line_next=='':
        line = line_next
        # line = sys.stdin.readline()
        if line.count(dlm) > n_dlm: ## line with more delimiters than target?
            line0 = line_next
            line_next = sys.stdin.readline()
            line = line.replace('\r', ' ').replace('\n', ' ') + line_next
        line0 = line
        line_next = sys.stdin.readline()
        line = line.replace('\r', ' ').replace('\n', ' ') + line_next

Calculate the weighted Gini coefficient or AUC in R

This post on Kaggle provides R code for calculating the Gini for assessing a prediction rule, and this post provides R code for the weighted version (think exposure for frequency and claim count for severity in non-life insurance modeling). Note that the weighted version is not well-defined when there are ties in the predictions and where the corresponding weights vary because different Lorentz curve (gains chart) could be drawn for different orderings of the observations; see this post for an explanation and some examples.

Now, to explain the code. The calculation of the x values (variable random, the cumulative proportion of observations or weights) and y values (variable Lorentz, the cumulative proportion of the response, the good’s/1’s or positive values) are straightforward. To calculate the area between the Lorentz curve and the diagonal line, one could use the trapezoidal rule to calculate the area between the Lorentz curve and x-axis and then subtract the area of the lower triangle (1/2):

\begin{align} Gini &= \sum_{i=1}^{n} (x_{i} – x_{i-1}) \left[\frac{L(x_{i}) + L(x_{i-1})}{2}\right] – \frac{1}{2} \ &= \frac{1}{2} \sum_{i=1}^{n} \left[ L(x_{i})x_{i} + L(x_{i-1})x_{i} – L(x_{i})x_{i-1} – L(x_{i-1})x_{i-1} \right] – \frac{1}{2} \ &= \frac{1}{2} \sum_{i=1}^{n} \left[ L(x_{i})x_{i} – L(x_{i-1})x_{i-1} \right] + \frac{1}{2} \sum_{i=1}^{n} \left[ L(x_{i-1})x_{i} – L(x_{i})x_{i-1} \right] – \frac{1}{2} \ &= \frac{1}{2} L(x_{n})x_{n} + \frac{1}{2} \sum_{i=1}^{n} \left[ L(x_{i-1})x_{i} – L(x_{i}) x_{i-1} \right] – \frac{1}{2} \ &= \frac{1}{2} \sum_{i=1}^{n} \left[ L(x_{i-1})x_{i} – L(x_{i}) x_{i-1} \right] \end{align}

where the last equality comes from the fact that \(L(x_{n}) = x_{n} = 1\) for the Lorentz curve/gains chart. The remaining summation thus corresponds to sum(df$Lorentz[-1]*df$random[-n]) - sum(df$Lorentz[-n]*df$random[-1]) inside the WeightedGini function since the \(i=1\) term in the summation is 0 (\(x_i=0\) and \(L(x_{0})=0\) for the Lorentz curve), yielding \(n-1\) terms in the code.

For the unweighted case, applying the trapezoidal rule on the area between the Lorentz curve and the diagonal line yields:

\begin{align} Gini &= \sum_{i=1}^{n} \frac{1}{n} \frac{\left[ L(x_{i}) – x_{i} \right] – \left[ L(x_{i-1}) – x_{i-1} \right] }{2} \ &= \frac{1}{2n} \sum_{i=1}^{n} \left[ L(x_{i}) – x_{i} \right] + \frac{1}{2n} \sum_{i=1}^{n} \left[ L(x_{i-1}) – x_{i-1} \right] \ &= \frac{1}{2n} \sum_{i=1}^{n} \left[ L(x_{i}) – x_{i} \right] + \frac{1}{2n} [L(x_{0}) – x_{0}] + \frac{1}{2n} \sum_{i=1}^{n-1} \left[ L(x_{i}) – x_{i} \right] \ &= \frac{1}{2n} \sum_{i=1}^{n} \left[ L(x_{i}) – x_{i} \right] + \frac{1}{2n} [L(x_{0}) – x_{0}] + \frac{1}{2n} \sum_{i=1}^{n-1} \left[ L(x_{i}) – x_{i} \right] + \frac{1}{2n} [L(x_{n}) – x_{n}] \ &= \frac{1}{2n} \sum_{i=1}^{n} \left[ L(x_{i}) – x_{i} \right] + \frac{1}{2n} [L(x_{0}) – x_{0}] + \frac{1}{2n} \sum_{i=1}^{n} \left[ L(x_{i}) – x_{i} \right] \ &= \frac{1}{n} \sum_{i=1}^{n} \left[ L(x_{i}) – x_{i} \right] \end{align}

where we repeatedly used the fact that \(L(x_{0}) = x_{0} = 0\) and \(L(x_{n}) = x_{n} = 1\) for a Lorentz curve and that \(1/n\) is the width between points (change in cdf of the observations). The summation is what is returned by SumModelGini.

Note that both \(1/2\) and \(1/n\) are not multiplied to the sums in the weighted and unweighted functions since most people will use the normalized versions, in which case these factors just cancel.

Make a printer wireless using a router with USB running OpenWRT

Many recent printers have wifi capability built in for wireless printing. Older printers or even some recent printers do not have this feature, but one could purchase a wireless adapter to turn the printer wireless. The adapters aren’t cheap, and a search for a cheap adapter led me to configuring the TP-Link WR-703N with OpenWRT as an affordable alternative (plug printer into router with a usb cable and print to router via a usb print server).

First, flash the router to OpenWRT by logging into the router at using the username/password admin/admin; follow this guide for pictures in navigating the default Chinese interface. Once flashed, the router will have wifi disabled and the ethernet port could be used to log onto the LAN network. Log into the router using a web browser at the destination Set up wifi and turn the ethernet port to WAN by following these instructions; I changed my default router IP address to to avoid clashing with my default “home” network when I plug it into my home network for internet access. Plug the current router into another router with internet access via ethernet. Then ssh into the TP-Link router on its network: root@ Install the usb printer:

opkg update
opkg install p910nd kmod-usb-printer

Start the print server:

/etc/init.d/p910nd start
/etc/init.d/p910nd enable

Now, on a laptop or computer, connect to the same wifi network as this mini router and add a printer at the router’s ip with port 9100 after plugging a printer into the usb port. Install the necessary printer driver on the laptop or computer.

This setup creates a separate network for usb wireless printing. If we want to have the printer join an existing wifi network, then just set up the router as the first post I referenced.

Optimized R and Python: standard BLAS vs. ATLAS vs. OpenBLAS vs. MKL


Revolution Analytics recently released Revolution Open R, a downstream version of R built using Intel’s Math Kernel Library (MKL). The post mentions that comparable improvements are observed on Mac OS X where the ATLAS blas library is used. A reader also expressed his hesitation in the Comments section for a lack of a comparison with ATLAS and OpenBLAS. This concept of using a different version of BLAS is documented in the R Administration manual, and has been compared in the past here and here. Now, as an avid R user, I should be using a more optimal version of R if it exists and is easy to obtain (install/compile), especially if the improvements are up to 40% as reported by the Domino Data Lab. I decided to follow the framework set out by this post to compare timings for the different versions of R on a t2.micro instance on Amazon EC2 running Ubuntu 14.04.

First, I install R and the various versions of BLAS and lapack and download the benchmark script:

sudo apt-get install libblas3gf libopenblas-base libatlas3gf-base liblapack3gf libopenblas-dev liblapack-dev libatlas-dev R-base R-base-dev
echo "install.packages('SuppDists', dep=TRUE, repo='')" | sudo R --vanilla ## needed for R-benchmarks-25.R

One could switch which blas and lapack library are used via the following commands:

sudo update-alternatives --config ## select from 3 versions of blas: blas, atlas, openblas
sudo update-alternatives --config ## select from 2 versions of lapack: lapack and atlas-lapack

Run R, issue Ctrl-z to send the process to the background, and see that the selected BLAS and lapack libraries are used by R by:

ps aux | grep R ## find the process id for R
lsof -p PROCESS_ID_JUST_FOUND | grep 'blas\|lapack'

Now run the benchmarks on different versions:

# selection: libblas + lapack
cat R-benchmark-25.R | time R --slave
171.71user 1.22system 2:53.01elapsed 99%CPU (0avgtext+0avgdata 425068maxresident)k
4960inputs+0outputs (32major+164552minor)pagefaults 0swaps
# selection: atlas + lapack
cat R-benchmark-25.R | time R --slave
69.05user 1.16system 1:10.27elapsed 99%CPU (0avgtext+0avgdata 432620maxresident)k
2824inputs+0outputs (15major+130664minor)pagefaults 0swaps
# selection: openblas + lapack
cat R-benchmark-25.R | time R --slave
70.69user 1.19system 1:11.93elapsed 99%CPU (0avgtext+0avgdata 429136maxresident)k
1592inputs+0outputs (6major+131181minor)pagefaults 0swaps
# selection: atlas + atlas-lapack
cat R-benchmark-25.R | time R --slave
68.02user 1.14system 1:09.21elapsed 99%CPU (0avgtext+0avgdata 432240maxresident)k
2904inputs+0outputs (12major+124761minor)pagefaults 0swaps

As can be seen, there’s about a 60% improvement using OpenBLAS or ATLAS over the standard libblas+lapack. What about MKL? Let’s test RRO:

sudo apt-get remove R-base R-base-dev
tar -xzf RRO-8.0-Beta-Ubuntu-14.04.x86_64.tar.gz
# check that it is using a different version of blas and lapack using lsof again
cat R-benchmark-25.R | time R --slave
51.19user 0.98system 0:52.24elapsed 99%CPU (0avgtext+0avgdata 417840maxresident)k
2208inputs+0outputs (11major+131128minor)pagefaults 0swaps

This is a 70% improvement over the standard libblas+lapack version, and a 25% improvement over the ATLAS/OpenBLAS version. This is quite a substantial improvement!


Although I don’t use Python much for data analysis (I use it as a general language for everything else), I wanted to repeat similar benchmarks for numpy and scipy as improvements have been documented. To do so, compile numpy and scipy from source and download some benchmark scripts.

sudo pip install numpy
less /usr/local/lib/python2.7/dist-packages/numpy/ ## openblas?
sudo pip install scipy
# test different blas
ps aux | grep python
lsof -p 20812 | grep 'blas\|lapack' ## change psid

One could switch blas and lapack like before. Results are as follows:

# selection: blas + lapack
time python
version: 1.9.1
maxint: 9223372036854775807

dot: 0.214728403091 sec

real    0m1.253s
user    0m1.119s
sys     0m0.036s

time python
cholesky: 0.166237211227 sec
svd: 3.56523122787 sec

real    0m19.183s
user    0m19.105s
sys     0m0.064s

# selection: atlas + lapack
time python
version: 1.9.1
maxint: 9223372036854775807

dot: 0.211034584045 sec

real    0m1.132s
user    0m1.121s
sys     0m0.008s

time python
cholesky: 0.0454761981964 sec
svd: 1.33822960854 sec

real    0m7.442s
user    0m7.346s
sys     0m0.084s

# selection: openblas + lapack
time python
version: 1.9.1
maxint: 9223372036854775807

dot: 0.212402009964 sec

real    0m1.139s
user    0m1.130s
sys     0m0.004s

time python
cholesky: 0.0431131839752 sec
svd: 1.09770617485 sec

real    0m6.227s
user    0m6.143s
sys     0m0.076s

# selection: atlas + atlas-lapack
time python
version: 1.9.1
maxint: 9223372036854775807

dot: 0.217267608643 sec

real    0m1.162s
user    0m1.143s
sys     0m0.016s

time python
cholesky: 0.0429849624634 sec
svd: 1.31666741371 sec

real    0m7.318s
user    0m7.213s
sys     0m0.092s

Here, if I only focus on the svd results, then OpenBLAS yields a 70% improvement and ATLAS yields a 63% improvement. What about MKL? Well, a readily available version costs money, so I wasn’t able to test.


Here are my take-aways:

  • Using different BLAS/LAPACK libraries is extremely easy on Ubuntu; no need to compile as you could install the libraries and select which version to use.
  • Install and use RRO (MKL) when possible as it is the fastest.
  • When the previous isn’t possible, use ATLAS or OpenBLAS. For example, we have AIX at work. Getting R installed on there is already a difficult task, so optimizing R is a low priority. However, if it’s possible to use OpenBLAS or ATLAS, use it (Note: MKL is irrelevant here as AIX uses POWER cpu).
  • For Python, use OpenBLAS or ATLAS.

For those that want to compile R using MKL yourself, check this. For those that wants to do so for Python, check this.

Finally, some visualizations to summarize the findings: 2014-11-10-R_blas+atlas+openblas+mkl.png 2014-11-10-Python_blas+atlas+openblas.png

# R results
timings <- c(173.01, 70.27, 71.93, 69.93, 52.24)
versions <- c('blas + lapack', 'atlas + lapack', 'openblas + lapack', 'atlas + atlas-lapack', 'MKL')
versions <- factor(versions, levels=versions)
d1 <- data.frame(timings, versions)
ggplot(data=d1, aes(x=versions, y=timings / max(timings))) + 
  geom_bar(stat='identity') + 
  geom_text(aes(x=versions, y=timings / max(timings), label=sprintf('%.f%%', timings / max(timings) * 100)), vjust=-.8) +
  labs(title='R - R-benchmark-25.R')

# Python results
timings <- c(3.57, 1.34, 1.10, 1.32)
versions <- c('blas + lapack', 'atlas + lapack', 'openblas + lapack', 'atlas + atlas-lapack')
versions <- factor(versions, levels=versions)
d1 <- data.frame(timings, versions)
ggplot(data=d1, aes(x=versions, y=timings / max(timings))) + 
  geom_bar(stat='identity') + 
  geom_text(aes(x=versions, y=timings / max(timings), label=sprintf('%.f%%', timings / max(timings) * 100)), vjust=-.8) +
  labs(title='Python - (SVD)')

Change delimiter in a csv file and remove line breaks in fields

I wrote a script to convert delimiters in CSV files, eg, commas to pipes. I prefer pipe-delimited files because the the pipe-delimiter (|) will not clash data in the different fields 99.999% of the time. I also added the option to convert newline () and carriage return () characters in the data fields to spaces. This comes in handy when I use PROC IMPORT in SAS as line breaks cause it to choke.

Here’s my script:

#! /usr/bin/env python

#### Command line arguments
import argparse
parser = argparse.ArgumentParser(description="Convert delimited file from one delimiter to another; defaults to converting CSV to pipe-delimited.")
parser.add_argument("--dlm-input", action="store", dest="dlm_in", default=",", required=False, help="delimiter of the input file; defaults to comma (,)", nargs='?', metavar="','")
parser.add_argument("--dlm-output", action="store", dest="dlm_out", default="|", required=False, help="delimiter of the output file; defaults to pipe (|)", nargs='?', metavar="'|'")
parser.add_argument("--remove-line-char", action="store_true", dest="remove_line_char", default=False, help="remove \\n and \\r characters in fields and replace with spaces")
parser.add_argument("--quote-char", action="store", dest="quote_char", default='"', required=False, help="quote character; defaults to double quote (\")", nargs='?', metavar="\"")
parser.add_argument("-i", "--input", action="store", dest="input", required=False, help="input file; if not specified, take from standard input.", nargs='?', metavar="file.csv")
parser.add_argument("-o", "--output", action="store", dest="output", required=False, help="output file; if not specified, write to standard output", nargs='?', metavar="file.pipe")
parser.add_argument("-v", "--verbose", action="store_true", dest="verbose", default=False, help="increase verbosity")
args  =  parser.parse_args()
# print args

import argparse
import csv
import sys
from signal import signal, SIGPIPE, SIG_DFL #
signal(SIGPIPE,SIG_DFL) ## no error when exiting a pipe like less

if args.input:
    csv_reader = csv.reader(open(args.input, 'rb'), delimiter=args.dlm_in, quotechar=args.quote_char)
    csv_reader = csv.reader(sys.stdin, delimiter=args.dlm_in, quotechar=args.quote_char)

if args.output:
    h_outfile = open(args.output, 'wb')
    h_outfile = sys.stdout

for row in csv_reader:
    row = args.dlm_out.join(row)
    if args.remove_line_char:
        row  =  row.replace('\n', ' ').replace('\r', ' ')
    h_outfile.write("%s\n" % (row))
    # print row

Help description:

usage: [-h] [--dlm-input [',']] [--dlm-output ['|']]
                     [--remove-line-char] [--quote-char ["]] [-i [file.csv]]
                     [-o [file.pipe]] [-v]

Convert delimited file from one delimiter to another; defaults to converting CSV to pipe-delimited.

optional arguments: -h, --help show this help message and exit --dlm-input [','] delimiter of the input file; defaults to comma (,) --dlm-output ['|'] delimiter of the output file; defaults to pipe (|) --remove-line-char remove \n and \r characters in fields and replace with spaces --quote-char ["] quote character; defaults to double quote (") -i [file.csv], --input [file.csv] input file; if not specified, take from standard input. -o [file.pipe], --output [file.pipe] output file; if not specified, write to standard output -v, --verbose increase verbosity


cat myfile.csv | --remove-line-char > myfile.pipe

Issues with https proxy in Python via suds and urllib2

I recently had the need to access a SOAP API to obtain some data. SOAP works by posting an xml file to a site url in a format defined by the API’s schema. The API then returns data, also in a form of an xml file. Based on this post, I figured suds was the easiest way to utilize Python to access the API so I could sequentially (and hence, parallelize) query data repeatedly. suds did turn out to be relatively easy to use:

from suds.client import Client
url = ''
client = Client(url)
print client

This worked on my home network. At work, I had to utilize a proxy in order to access the outside world. Otherwise, I’d get a connection refuse message: urllib2.URLError: <urlopen error [Errno 111] Connection refused>. The modification to use a proxy was straightforward:

from suds.client import Client
proxy = {'http': ''}
url = ''
# client = Client(url)
client = Client(url, proxy=proxy)
print client

The previous examples were from a public SOAP API I found online. Now, the site I wanted to actually hit uses ssl for encryption (i.e., https site) and requires authentication. I thought the fix would be as simple as:

from suds.client import Client
proxy = {'https': ''}
url = ''
un = 'site_username'
pw = 'site_password'
# client = Client(url)
client = Client(url, proxy=proxy, username=un, password=pw)
print client

However, I got the error message: Exception: (404, u'/path/to/soap_api'). Very weird to me. Is it an authentication issue? Is it a proxy issue? If a proxy issue, how so, as my previous toy example worked. Tried the same site on my home network where there is no firewall, and things worked:

from suds.client import Client
url = ''
un = 'site_username'
pw = 'site_password'
# client = Client(url)
client = Client(url, username=un, password=pw)
print client

Conclusion? Must be a proxy issue with https. I used the following prior to calling suds to help with debugging:

import logging

My initial thoughts after some debugging: there must be something wrong with the proxy as the log shows python sending the request to the target url, but I get back a response that shows the path (minus the domain name) not found. What happened to the domain name? I notified the firewall team to look into this, as it appears the proxy is modifying something (url is not complete?). The firewall team investigated, and found that the proxy is returning a message that warns the ClientHello message is too large. This is one clue. The log also shows that the user was never authenticated and that the ssl handshake was never completed. My thought: still a proxy issue, as the python code works at home. However, the proxy team was able to access the https SOAP API through the proxy using the SOA Client plugin for Firefox. Now that convinced me that something else may be the culprit.

Googled for help, and thought this would be helpful.

import urllib2
import urllib
import httplib
import socket

class ProxyHTTPConnection(httplib.HTTPConnection):
    _ports = {'http' : 80, 'https' : 443}
    def request(self, method, url, body=None, headers={}):
        #request is called before connect, so can interpret url and get
        #real host/port to be used to make CONNECT request to proxy
        proto, rest = urllib.splittype(url)
        if proto is None:
            raise ValueError, "unknown URL type: %s" % url
        #get host
        host, rest = urllib.splithost(rest)
        #try to get port
        host, port = urllib.splitport(host)
        #if port is not defined try to get from proto
        if port is None:
                port = self._ports[proto]
            except KeyError:
                raise ValueError, "unknown protocol for: %s" % url
        self._real_host = host
        self._real_port = port
        httplib.HTTPConnection.request(self, method, url, body, headers)
    def connect(self):
        #send proxy CONNECT request
        self.send("CONNECT %s:%d HTTP/1.0\r\n\r\n" % (self._real_host, self._real_port))
        #expect a HTTP/1.0 200 Connection established
        response = self.response_class(self.sock, strict=self.strict, method=self._method)
        (version, code, message) = response._read_status()
        #probably here we can handle auth requests...
        if code != 200:
            #proxy returned and error, abort connection, and raise exception
            raise socket.error, "Proxy connection failed: %d %s" % (code, message.strip())
        #eat up header block from proxy....
        while True:
            #should not use directly fp probablu
            line = response.fp.readline()
            if line == '\r\n': break

class ProxyHTTPSConnection(ProxyHTTPConnection):
    default_port = 443
    def __init__(self, host, port = None, key_file = None, cert_file = None, strict = None, timeout=0): # vinh added timeout
        ProxyHTTPConnection.__init__(self, host, port)
        self.key_file = key_file
        self.cert_file = cert_file
    def connect(self):
        #make the sock ssl-aware
        ssl = socket.ssl(self.sock, self.key_file, self.cert_file)
        self.sock = httplib.FakeSocket(self.sock, ssl)

class ConnectHTTPHandler(urllib2.HTTPHandler):
    def do_open(self, http_class, req):
        return urllib2.HTTPHandler.do_open(self, ProxyHTTPConnection, req)

class ConnectHTTPSHandler(urllib2.HTTPSHandler):
    def do_open(self, http_class, req):
        return urllib2.HTTPSHandler.do_open(self, ProxyHTTPSConnection, req)

from suds.client import Client
# from httpsproxy import ConnectHTTPSHandler, ConnectHTTPHandler ## these are code from above classes
import urllib2, urllib
from suds.transport.http import HttpTransport
opener = urllib2.build_opener(ConnectHTTPHandler, ConnectHTTPSHandler)
t = HttpTransport()
t.urlopener = opener
url = ''
proxy = {'https': ''}
un = 'site_username'
pw = 'site_password'
client = Client(url=url, transport=t, proxy=proxy, username=un, password=pw)
client = Client(url=url, transport=t, proxy=proxy, username=un, password=pw, location='') ## some site suggests specifying location

This too did not work. Continued to google, and found that lot’s of people are having issues with https and proxy. I knew suds depended on urllib2, so googled about that as well, and people too had issues with urllib2 in terms of https and proxy. I then decided to investigate using urllib2 to contact the https url through a proxy:

### at home this works
import urllib2
url = ''
password_mgr = urllib2.HTTPPasswordMgrWithDefaultRealm()
auth_handler = urllib2.HTTPBasicAuthHandler(password_mgr)
opener = urllib2.build_opener(auth_handler)
page = urllib2.urlopen(url)

### work network, does not work:
url = ''
proxy = urllib2.ProxyHandler({'https':'', 'http':''})
password_mgr = urllib2.HTTPPasswordMgrWithDefaultRealm()
auth_handler = urllib2.HTTPBasicAuthHandler(password_mgr)
opener = urllib2.build_opener(proxy, auth_handler, urllib2.HTTPSHandler)
page = urllib2.urlopen(site)
### also tried re-doing above, but with the custom handler as defined in the previous code chunk ( running first (run the list of classes)

No luck. I re-read this post that I ran into before, and really agreed that urllib2 is severely flawed, especially when using https proxy. At the end of the page, the author suggested using the requests package. Tried it out, and I was able to connect using the https proxy:

import requests
import xmltodict
p1 = ''
p2 = ''
proxy = {'http': p1, 'https':p2}

site = ''
r = requests.get(site, proxies=proxy, auth=('site_username', 'site_password'))
r.text ## works
soap_xml_in = """<?xml version="1.0" encoding="UTF-8"?>
headers = {'SOAPAction': u'""', 'Content-Type': 'text/xml; charset=utf-8', 'Content-type': 'text/xml; charset=utf-8', 'Soapaction': u'""'}
soap_xml_out =, data=soap_xml_in, headers=headers, proxies=proxy, auth=('site_username', 'site_password')).text

My learnings?

  • suds is great for accessing SOAP, just not when you have to access an https site through a firewall.
  • urllib2 is severely flawed. Things only work in very standard situations.
  • requests package is very powerful and just works. Even though I have to deal with actual xml files as opposed to leveraging suds‘ pythonic structures, the xmltodict package helps to translate the xml file into dictionaries that only adds marginal effort to extract out relevant data.

NOTE: I had to install libuuid-devel in cygwin64 because I was getting an installation error.

Upgrading Ubuntu 12.04 to 14.04 breaks encrypted LVM

My laptop runs Ubuntu and is fully encrypted (since version 10.04). Upgrade from 10.04 to 12.04 was smooth in the sense that my system booted fine, asking for the passphrase to unlock the LVM. However, when I upgraded from 12.04 to 14.04, things broke and my laptop no longer booted properly as the LVM never got encrypted. I had to do the following to get my laptop working again (after many rounds of trial and error):

  • Boot a live usb Ubuntu session, de-crypted the LVM, and chroot’ed to run as the original OS
  • Finish the upgrade session via apt-get update && apt-get upgrade
  • It appears Ubuntu 14.04 installed some new package (did not write name down) that manages LVM or disks somehow (based on googling the error message). I removed this package.
  • Saw lvm issues, so installed the package lvm2
  • I made sure both dm-crypt and lvm2 were installed, and were accessible in initramfs, as cryptsetup was removed from initramfs since version 13.10. Had to do something with the following CRYPTSETUP issue.
  • Based on this post, I modified various files, but things still did not boot properly. I believe what finally fixed it was explicitly pointing to the LVM by /dev/sda5 in the GRUB_CMDLINE_LINUX line in /etc/default/grub.

The following is summary of these files for me. /etc/crypttab:

# <target name> <source device>         <key file>      <options>
# sdb5_crypt UUID=731a44c4-4655-4f2b-ae1a-2e3e6a14f2ef none luks
sdb5_crypt UUID=731a44c4-4655-4f2b-ae1a-2e3e6a14f2ef none luks,retry=1,lvm=vg01


## vinh created
# CRYPTROOT=target=sdb5_crypt,source=/dev/disk/by-uuid/f1ba5a54-ac7e-419d-8762-43da3274aba4

Then run update-initramfs -k all -c in order to update the initramfs images.

Have this line in /etc/default/grub:


Run update-grub.

Again, I think the key is the source definition in the previous line. I kept trying to refer to it by uuid but that did not work.