Optimized R and Python: standard BLAS vs. ATLAS vs. OpenBLAS vs. MKL


Revolution Analytics recently released Revolution Open R, a downstream version of R built using Intel’s Math Kernel Library (MKL). The post mentions that comparable improvements are observed on Mac OS X where the ATLAS blas library is used. A reader also expressed his hesitation in the Comments section for a lack of a comparison with ATLAS and OpenBLAS. This concept of using a different version of BLAS is documented in the R Administration manual, and has been compared in the past here and here. Now, as an avid R user, I should be using a more optimal version of R if it exists and is easy to obtain (install/compile), especially if the improvements are up to 40% as reported by the Domino Data Lab. I decided to follow the framework set out by this post to compare timings for the different versions of R on a t2.micro instance on Amazon EC2 running Ubuntu 14.04.

First, I install R and the various versions of BLAS and lapack and download the benchmark script:

sudo apt-get install libblas3gf libopenblas-base libatlas3gf-base liblapack3gf libopenblas-dev liblapack-dev libatlas-dev R-base R-base-dev
wget http://r.research.att.com/benchmarks/R-benchmark-25.R
echo "install.packages('SuppDists', dep=TRUE, repo='http://cran.stat.ucla.edu')" | sudo R --vanilla ## needed for R-benchmarks-25.R

One could switch which blas and lapack library are used via the following commands:

sudo update-alternatives --config libblas.so.3 ## select from 3 versions of blas: blas, atlas, openblas
sudo update-alternatives --config liblapack.so.3 ## select from 2 versions of lapack: lapack and atlas-lapack

Run R, issue Ctrl-z to send the process to the background, and see that the selected BLAS and lapack libraries are used by R by:

ps aux | grep R ## find the process id for R
lsof -p PROCESS_ID_JUST_FOUND | grep 'blas\|lapack'

Now run the benchmarks on different versions:

# selection: libblas + lapack
cat R-benchmark-25.R | time R --slave
171.71user 1.22system 2:53.01elapsed 99%CPU (0avgtext+0avgdata 425068maxresident)k
4960inputs+0outputs (32major+164552minor)pagefaults 0swaps
# selection: atlas + lapack
cat R-benchmark-25.R | time R --slave
69.05user 1.16system 1:10.27elapsed 99%CPU (0avgtext+0avgdata 432620maxresident)k
2824inputs+0outputs (15major+130664minor)pagefaults 0swaps
# selection: openblas + lapack
cat R-benchmark-25.R | time R --slave
70.69user 1.19system 1:11.93elapsed 99%CPU (0avgtext+0avgdata 429136maxresident)k
1592inputs+0outputs (6major+131181minor)pagefaults 0swaps
# selection: atlas + atlas-lapack
cat R-benchmark-25.R | time R --slave
68.02user 1.14system 1:09.21elapsed 99%CPU (0avgtext+0avgdata 432240maxresident)k
2904inputs+0outputs (12major+124761minor)pagefaults 0swaps

As can be seen, there’s about a 60% improvement using OpenBLAS or ATLAS over the standard libblas+lapack. What about MKL? Let’s test RRO:

sudo apt-get remove R-base R-base-dev
wget http://mran.revolutionanalytics.com/install/RRO-8.0-Beta-Ubuntu-14.04.x86_64.tar.gz
tar -xzf RRO-8.0-Beta-Ubuntu-14.04.x86_64.tar.gz
# check that it is using a different version of blas and lapack using lsof again
cat R-benchmark-25.R | time R --slave
51.19user 0.98system 0:52.24elapsed 99%CPU (0avgtext+0avgdata 417840maxresident)k
2208inputs+0outputs (11major+131128minor)pagefaults 0swaps

This is a 70% improvement over the standard libblas+lapack version, and a 25% improvement over the ATLAS/OpenBLAS version. This is quite a substantial improvement!


Although I don’t use Python much for data analysis (I use it as a general language for everything else), I wanted to repeat similar benchmarks for numpy and scipy as improvements have been documented. To do so, compile numpy and scipy from source and download some benchmark scripts.

sudo pip install numpy
less /usr/local/lib/python2.7/dist-packages/numpy/__config__.py ## openblas?
sudo pip install scipy
# test different blas
ps aux | grep python
lsof -p 20812 | grep 'blas\|lapack' ## change psid
wget https://gist.github.com/osdf/3842524/raw/df01f7fa9d849bec353d6ab03eae0c1ee68f1538/test_numpy.py
wget https://gist.github.com/osdf/3842524/raw/22e21f5d57a9526cbcd9981385504acdc7bdc788/test_scipy.py

One could switch blas and lapack like before. Results are as follows:

# selection: blas + lapack
time python test_numpy.py
version: 1.9.1
maxint: 9223372036854775807

dot: 0.214728403091 sec

real    0m1.253s
user    0m1.119s
sys     0m0.036s

time python test_scipy.py
cholesky: 0.166237211227 sec
svd: 3.56523122787 sec

real    0m19.183s
user    0m19.105s
sys     0m0.064s

# selection: atlas + lapack
time python test_numpy.py
version: 1.9.1
maxint: 9223372036854775807

dot: 0.211034584045 sec

real    0m1.132s
user    0m1.121s
sys     0m0.008s

time python test_scipy.py
cholesky: 0.0454761981964 sec
svd: 1.33822960854 sec

real    0m7.442s
user    0m7.346s
sys     0m0.084s

# selection: openblas + lapack
time python test_numpy.py
version: 1.9.1
maxint: 9223372036854775807

dot: 0.212402009964 sec

real    0m1.139s
user    0m1.130s
sys     0m0.004s

time python test_scipy.py
cholesky: 0.0431131839752 sec
svd: 1.09770617485 sec

real    0m6.227s
user    0m6.143s
sys     0m0.076s

# selection: atlas + atlas-lapack
time python test_numpy.py
version: 1.9.1
maxint: 9223372036854775807

dot: 0.217267608643 sec

real    0m1.162s
user    0m1.143s
sys     0m0.016s

time python test_scipy.py
cholesky: 0.0429849624634 sec
svd: 1.31666741371 sec

real    0m7.318s
user    0m7.213s
sys     0m0.092s

Here, if I only focus on the svd results, then OpenBLAS yields a 70% improvement and ATLAS yields a 63% improvement. What about MKL? Well, a readily available version costs money, so I wasn’t able to test.


Here are my take-aways:

  • Using different BLAS/LAPACK libraries is extremely easy on Ubuntu; no need to compile as you could install the libraries and select which version to use.
  • Install and use RRO (MKL) when possible as it is the fastest.
  • When the previous isn’t possible, use ATLAS or OpenBLAS. For example, we have AIX at work. Getting R installed on there is already a difficult task, so optimizing R is a low priority. However, if it’s possible to use OpenBLAS or ATLAS, use it (Note: MKL is irrelevant here as AIX uses POWER cpu).
  • For Python, use OpenBLAS or ATLAS.

For those that want to compile R using MKL yourself, check this. For those that wants to do so for Python, check this.

Finally, some visualizations to summarize the findings: 2014-11-10-R_blas+atlas+openblas+mkl.png 2014-11-10-Python_blas+atlas+openblas.png

# R results
timings <- c(173.01, 70.27, 71.93, 69.93, 52.24)
versions <- c('blas + lapack', 'atlas + lapack', 'openblas + lapack', 'atlas + atlas-lapack', 'MKL')
versions <- factor(versions, levels=versions)
d1 <- data.frame(timings, versions)
ggplot(data=d1, aes(x=versions, y=timings / max(timings))) + 
  geom_bar(stat='identity') + 
  geom_text(aes(x=versions, y=timings / max(timings), label=sprintf('%.f%%', timings / max(timings) * 100)), vjust=-.8) +
  labs(title='R - R-benchmark-25.R')

# Python results
timings <- c(3.57, 1.34, 1.10, 1.32)
versions <- c('blas + lapack', 'atlas + lapack', 'openblas + lapack', 'atlas + atlas-lapack')
versions <- factor(versions, levels=versions)
d1 <- data.frame(timings, versions)
ggplot(data=d1, aes(x=versions, y=timings / max(timings))) + 
  geom_bar(stat='identity') + 
  geom_text(aes(x=versions, y=timings / max(timings), label=sprintf('%.f%%', timings / max(timings) * 100)), vjust=-.8) +
  labs(title='Python - test_scipy.py (SVD)')

Change delimiter in a csv file and remove line breaks in fields

I wrote a script to convert delimiters in CSV files, eg, commas to pipes. I prefer pipe-delimited files because the the pipe-delimiter (|) will not clash data in the different fields 99.999% of the time. I also added the option to convert newline () and carriage return () characters in the data fields to spaces. This comes in handy when I use PROC IMPORT in SAS as line breaks cause it to choke.

Here’s my csvconvert.py script:

#! /usr/bin/env python

#### Command line arguments
import argparse
parser = argparse.ArgumentParser(description="Convert delimited file from one delimiter to another; defaults to converting CSV to pipe-delimited.")
parser.add_argument("--dlm-input", action="store", dest="dlm_in", default=",", required=False, help="delimiter of the input file; defaults to comma (,)", nargs='?', metavar="','")
parser.add_argument("--dlm-output", action="store", dest="dlm_out", default="|", required=False, help="delimiter of the output file; defaults to pipe (|)", nargs='?', metavar="'|'")
parser.add_argument("--remove-line-char", action="store_true", dest="remove_line_char", default=False, help="remove \\n and \\r characters in fields and replace with spaces")
parser.add_argument("--quote-char", action="store", dest="quote_char", default='"', required=False, help="quote character; defaults to double quote (\")", nargs='?', metavar="\"")
parser.add_argument("-i", "--input", action="store", dest="input", required=False, help="input file; if not specified, take from standard input.", nargs='?', metavar="file.csv")
parser.add_argument("-o", "--output", action="store", dest="output", required=False, help="output file; if not specified, write to standard output", nargs='?', metavar="file.pipe")
parser.add_argument("-v", "--verbose", action="store_true", dest="verbose", default=False, help="increase verbosity")
args  =  parser.parse_args()
# print args

# http://snipplr.com/view/45759/convert-csv-file-to-pipe-delineated-file/
import argparse
import csv
import sys
from signal import signal, SIGPIPE, SIG_DFL # http://stackoverflow.com/questions/14207708/ioerror-errno-32-broken-pipe-python
signal(SIGPIPE,SIG_DFL) ## no error when exiting a pipe like less

if args.input:
    csv_reader = csv.reader(open(args.input, 'rb'), delimiter=args.dlm_in, quotechar=args.quote_char)
    csv_reader = csv.reader(sys.stdin, delimiter=args.dlm_in, quotechar=args.quote_char)

if args.output:
    h_outfile = open(args.output, 'wb')
    h_outfile = sys.stdout

for row in csv_reader:
    row = args.dlm_out.join(row)
    if args.remove_line_char:
        row  =  row.replace('\n', ' ').replace('\r', ' ')
    h_outfile.write("%s\n" % (row))
    # print row

Help description:

usage: csvconvert.py [-h] [--dlm-input [',']] [--dlm-output ['|']]
                     [--remove-line-char] [--quote-char ["]] [-i [file.csv]]
                     [-o [file.pipe]] [-v]

Convert delimited file from one delimiter to another; defaults to converting CSV to pipe-delimited.

optional arguments: -h, --help show this help message and exit --dlm-input [','] delimiter of the input file; defaults to comma (,) --dlm-output ['|'] delimiter of the output file; defaults to pipe (|) --remove-line-char remove \n and \r characters in fields and replace with spaces --quote-char ["] quote character; defaults to double quote (") -i [file.csv], --input [file.csv] input file; if not specified, take from standard input. -o [file.pipe], --output [file.pipe] output file; if not specified, write to standard output -v, --verbose increase verbosity


cat myfile.csv | csvconvert.py --remove-line-char > myfile.pipe

Issues with https proxy in Python via suds and urllib2

I recently had the need to access a SOAP API to obtain some data. SOAP works by posting an xml file to a site url in a format defined by the API’s schema. The API then returns data, also in a form of an xml file. Based on this post, I figured suds was the easiest way to utilize Python to access the API so I could sequentially (and hence, parallelize) query data repeatedly. suds did turn out to be relatively easy to use:

from suds.client import Client
url = 'http://www.ripedev.com/webservices/localtime.asmx?WSDL'
client = Client(url)
print client

This worked on my home network. At work, I had to utilize a proxy in order to access the outside world. Otherwise, I’d get a connection refuse message: urllib2.URLError: <urlopen error [Errno 111] Connection refused>. The modification to use a proxy was straightforward:

from suds.client import Client
proxy = {'http': 'proxy_username:proxy_password@proxy_server.com:port'}
url = 'http://www.ripedev.com/webservices/localtime.asmx?WSDL'
# client = Client(url)
client = Client(url, proxy=proxy)
print client

The previous examples were from a public SOAP API I found online. Now, the site I wanted to actually hit uses ssl for encryption (i.e., https site) and requires authentication. I thought the fix would be as simple as:

from suds.client import Client
proxy = {'https': 'proxy_username:proxy_password@proxy_server.com:port'}
url = 'https://some_server.com/path/to/soap_api?wsdl'
un = 'site_username'
pw = 'site_password'
# client = Client(url)
client = Client(url, proxy=proxy, username=un, password=pw)
print client

However, I got the error message: Exception: (404, u'/path/to/soap_api'). Very weird to me. Is it an authentication issue? Is it a proxy issue? If a proxy issue, how so, as my previous toy example worked. Tried the same site on my home network where there is no firewall, and things worked:

from suds.client import Client
url = 'https://some_server.com/path/to/soap_api?wsdl'
un = 'site_username'
pw = 'site_password'
# client = Client(url)
client = Client(url, username=un, password=pw)
print client

Conclusion? Must be a proxy issue with https. I used the following prior to calling suds to help with debugging:

import logging

My initial thoughts after some debugging: there must be something wrong with the proxy as the log shows python sending the request to the target url, but I get back a response that shows the path (minus the domain name) not found. What happened to the domain name? I notified the firewall team to look into this, as it appears the proxy is modifying something (url is not complete?). The firewall team investigated, and found that the proxy is returning a message that warns the ClientHello message is too large. This is one clue. The log also shows that the user was never authenticated and that the ssl handshake was never completed. My thought: still a proxy issue, as the python code works at home. However, the proxy team was able to access the https SOAP API through the proxy using the SOA Client plugin for Firefox. Now that convinced me that something else may be the culprit.

Googled for help, and thought this would be helpful.

import urllib2
import urllib
import httplib
import socket

class ProxyHTTPConnection(httplib.HTTPConnection):
    _ports = {'http' : 80, 'https' : 443}
    def request(self, method, url, body=None, headers={}):
        #request is called before connect, so can interpret url and get
        #real host/port to be used to make CONNECT request to proxy
        proto, rest = urllib.splittype(url)
        if proto is None:
            raise ValueError, "unknown URL type: %s" % url
        #get host
        host, rest = urllib.splithost(rest)
        #try to get port
        host, port = urllib.splitport(host)
        #if port is not defined try to get from proto
        if port is None:
                port = self._ports[proto]
            except KeyError:
                raise ValueError, "unknown protocol for: %s" % url
        self._real_host = host
        self._real_port = port
        httplib.HTTPConnection.request(self, method, url, body, headers)
    def connect(self):
        #send proxy CONNECT request
        self.send("CONNECT %s:%d HTTP/1.0\r\n\r\n" % (self._real_host, self._real_port))
        #expect a HTTP/1.0 200 Connection established
        response = self.response_class(self.sock, strict=self.strict, method=self._method)
        (version, code, message) = response._read_status()
        #probably here we can handle auth requests...
        if code != 200:
            #proxy returned and error, abort connection, and raise exception
            raise socket.error, "Proxy connection failed: %d %s" % (code, message.strip())
        #eat up header block from proxy....
        while True:
            #should not use directly fp probablu
            line = response.fp.readline()
            if line == '\r\n': break

class ProxyHTTPSConnection(ProxyHTTPConnection):
    default_port = 443
    def __init__(self, host, port = None, key_file = None, cert_file = None, strict = None, timeout=0): # vinh added timeout
        ProxyHTTPConnection.__init__(self, host, port)
        self.key_file = key_file
        self.cert_file = cert_file
    def connect(self):
        #make the sock ssl-aware
        ssl = socket.ssl(self.sock, self.key_file, self.cert_file)
        self.sock = httplib.FakeSocket(self.sock, ssl)

class ConnectHTTPHandler(urllib2.HTTPHandler):
    def do_open(self, http_class, req):
        return urllib2.HTTPHandler.do_open(self, ProxyHTTPConnection, req)

class ConnectHTTPSHandler(urllib2.HTTPSHandler):
    def do_open(self, http_class, req):
        return urllib2.HTTPSHandler.do_open(self, ProxyHTTPSConnection, req)

from suds.client import Client
# from httpsproxy import ConnectHTTPSHandler, ConnectHTTPHandler ## these are code from above classes
import urllib2, urllib
from suds.transport.http import HttpTransport
opener = urllib2.build_opener(ConnectHTTPHandler, ConnectHTTPSHandler)
t = HttpTransport()
t.urlopener = opener
url = 'https://some_server.com/path/to/soap_api?wsdl'
proxy = {'https': 'proxy_username:proxy_password@proxy_server.com:port'}
un = 'site_username'
pw = 'site_password'
client = Client(url=url, transport=t, proxy=proxy, username=un, password=pw)
client = Client(url=url, transport=t, proxy=proxy, username=un, password=pw, location='https://some_server.com/path/to/soap_api?wsdl') ## some site suggests specifying location

This too did not work. Continued to google, and found that lot’s of people are having issues with https and proxy. I knew suds depended on urllib2, so googled about that as well, and people too had issues with urllib2 in terms of https and proxy. I then decided to investigate using urllib2 to contact the https url through a proxy:

## http://stackoverflow.com/questions/5227333/xml-soap-post-error-what-am-i-doing-wrong
## http://stackoverflow.com/questions/34079/how-to-specify-an-authenticated-proxy-for-a-python-http-connect
### at home this works
import urllib2
url = 'https://some_server.com/path/to/soap_api?wsdl'
password_mgr = urllib2.HTTPPasswordMgrWithDefaultRealm()
auth_handler = urllib2.HTTPBasicAuthHandler(password_mgr)
opener = urllib2.build_opener(auth_handler)
page = urllib2.urlopen(url)

### work network, does not work:
url = 'https://some_server.com/path/to/soap_api?wsdl'
proxy = urllib2.ProxyHandler({'https':'proxy_username:proxy_password@proxy_server.com:port', 'http':'proxy_username:proxy_password@proxy_server.com:port'})
password_mgr = urllib2.HTTPPasswordMgrWithDefaultRealm()
auth_handler = urllib2.HTTPBasicAuthHandler(password_mgr)
opener = urllib2.build_opener(proxy, auth_handler, urllib2.HTTPSHandler)
page = urllib2.urlopen(site)
### also tried re-doing above, but with the custom handler as defined in the previous code chunk (http://code.activestate.com/recipes/456195/) running first (run the list of classes)

No luck. I re-read this post that I ran into before, and really agreed that urllib2 is severely flawed, especially when using https proxy. At the end of the page, the author suggested using the requests package. Tried it out, and I was able to connect using the https proxy:

import requests
import xmltodict
p1 = 'http://proxy_username:proxy_password@proxy_server.com:port'
p2 = 'https://proxy_username:proxy_password@proxy_server.com:port'
proxy = {'http': p1, 'https':p2}

site = 'https://some_server.com/path/to/soap_api?wsdl'
r = requests.get(site, proxies=proxy, auth=('site_username', 'site_password'))
r.text ## works
soap_xml_in = """<?xml version="1.0" encoding="UTF-8"?>
headers = {'SOAPAction': u'""', 'Content-Type': 'text/xml; charset=utf-8', 'Content-type': 'text/xml; charset=utf-8', 'Soapaction': u'""'}
soap_xml_out = requests.post(site, data=soap_xml_in, headers=headers, proxies=proxy, auth=('site_username', 'site_password')).text

My learnings?

  • suds is great for accessing SOAP, just not when you have to access an https site through a firewall.
  • urllib2 is severely flawed. Things only work in very standard situations.
  • requests package is very powerful and just works. Even though I have to deal with actual xml files as opposed to leveraging suds‘ pythonic structures, the xmltodict package helps to translate the xml file into dictionaries that only adds marginal effort to extract out relevant data.

NOTE: I had to install libuuid-devel in cygwin64 because I was getting an installation error.

Package management in R and Python at work without root and behind firewall

My current job has strict security measures (referring to root access on a Linux server and the inability to access outside the company’s network), so it can be difficult in getting the tools necessary for my work, namely R packages on CRAN and Python packages via pip.

On my Windows workstation, I was able to install R by downloading the installer online and Python via Cygwin. However, R and Python are unable to connect to the internet to download and install additional packages because of the company’s firewall. To get around this for R, I could:

  • add the flag --internet2 to the execution path in R’s shortcut,
  • call setInternet2(TRUE) in the R console, or
  • set the environment variable http_proxy=http://username:password@proxy_server:port/.n

The first two tells R to use the proxy defined in Internet Explorer. I was able to access CRAN via my web browser, so this works. If CRAN is blocked on the browser, find out what proxy server is available at work and use that to access the outside world. If CRAN is also blocked on the proxy, put in a request to add it to the white list.

As for Python, install pip and use a proxy to download and install packages:

wget https://bootstrap.pypa.io/get-pip.py
python get-pip.py --proxy="username:password@proxy_server:port"
pip install --proxy="username:password@proxy_server:port" argparse numpy pandas ## etc

NOTE: pip 1.3.1 has issues with proxy servers, so use the latest version.

On a Linux/Unix server, the added complexity is that of a lack of root access. Typically, Python is available by default on any modern distro. If not, have the admin team install R and Python via the distribution’s package manager, and if they can’t, then compile the two from source and install them locally. Once installed, use the same method as before for Python pip, but with the --user flag in order to install the packages locally in ~/.local/ (pip command is at ~/.local/bin/pip). For R, set the environment variables

export http_proxy="http://proxy_server:port/"
export http_proxy_user="username:password"

and install the libraries to ~/Rlib (add this to the library path via .libPaths() in ~/.Rprofile).

Delimited file where delimiter clashes with data values

A comma-separated values (CSV) file is a typical way to store tabular/rectangular data. If a data cell contain a comma, then the cell with the commas is typically wrapped with quotes. However, what if a data cell contains a comma and a quotation mark? To avoid such scenarios, it is typically wise to use a delimiter that has a low chance of showing up in your data, such as the pipe (“|”) or caret (“^”) character. However, there are cases when the data is a long string with all sorts of data characters, including the pipe and caret characters. What then should the delimiter be in order to avoid a delimiter collision? As the Wikipedia article suggests, using special ASCII characters such as the unit/field separator (hex: 1F) could help as they probably won’t be in your data (no keyboard key that corresponds to it!).

Currently, my rule of thumb is to use pipe as the default delimiter. If the data contains complicated strings, then I’ll default to the field separator character. In Python, one could refer to the field separator as ‘\ x1f’. In R, one could refer to it as ‘\ x1F’. In SAS, it could be specified as ‘1F’x. In bash, the character could be specified on the command line (e.g., using the cut command, csvlook command, etc) by specifying $’1f’ as the delimiter character.

If the file contains the newline character in a data cell (\n), then the record separator character (hex: 1E) could be used for determining new lines.

Guide to accessing MS SQL Server and MySQL server on Mac OS X

Native GUI client access to MS-SQL and MySQL

We can use Oracle SQL Developer with the jTDS driver to access Microsoft SQL Server. Note: jTDS version 1.3.0 did not work for me; I had to use version 1.2.6. Detailed instructions can be found here.

We can use MySQL Workbench to access MySQL server. Setup is intuitively obvious.

Overview of ODBC on Mac OS X

Mac OS X has iODBC installed as it’s default ODBC manager. Most other Linux/UNIX system uses unixODBC to manage the ODBC drivers. This is the main reason why there’s so much confusion on getting ODBC to work on Mac OS X.

ODBC is kind of like an API for any software to access any DBMS easily, regardless of what DBMS it is and what OS it’s running on. Different software (e.g., R or Python) can utilize ODBC to access different DBMS through the following logic: Software -> ODBC Manager -> ODBC Driver for the DBMS -> DBMS Server (Software: R, Python, etc.; DBMS: MySQL, MS-SQL, etc.).

It doesn’t matter whether you use iODBC or unixODBC. Whichever one you use, just make sure the DBMS Driver and software you are using are configured/compiled to use with the same ODBC manager (usually set through the configure flags). For example, the R package RODBC and Python package pyodbc are compiled by default to use iODBC on Mac OS X. The DBMS drivers used must be compiled for use with iODBC. For iODBC, one could add data source names (DSN’s) at ~/Library/ODBC/odbc.ini. For unixODBC, one could add DSN’s at ~/.odbc.ini.

My current setup utilizes iODBC. I will outline the instructions for setting up MySQL and freeTDS (MS-SQL) drivers for use with RODBC and pyodbc through iODBC.

MySQL and FreeTDS with iODBC on Mac OS X

Install the MySQL Connector/ODBC driver. Driver should be at /usr/local/lib/libmyodbc5.so or /usr/local/lib/libmyodbc5w.so. Note: I’m unable to compile the driver from source on Mac OS X.

FreeTDS is an open source ODBC driver to access MS SQL Server. Install via Home Brew:

## install homebrew
ruby -e "$(curl -fsSL https://raw.github.com/mxcl/homebrew/go)"

## install freetds
brew install freetds

Driver should be at /usr/local/lib/libtdsodbc.so (symbolic linked).

Create ~/Library/ODBC/odbc.ini:

Port = 1433
Trace = Yes
Description=my description
# Database=
# can't specify username and password for freetds

# Database=
## can specify an actual database to each DSN

Install pyodbc via sudo pip install pyodbc. Test connections in python:

import pyodbc as p

con1 = p.connect("DSN=sqlserver01;UID=username;PWD=password")
con1.execute("select name from master..sysdatabases").fetchall()

con2 = p.connect("DSN=mysql01;UID=username;PWD=password")
con2.execute("show databases;").fetchall()

Install R using the installer. Install RODBC in the R interpreter via install.packages("RODBC"). Test connections in R:


ch1 <- odbcConnect(dsn="sqlserver01", uid="username", pwd="password")
odbcQuery(ch1, "select name from master..sysdatabases")

ch2 <- odbcConnect(dsn="mysql01", uid="username", pwd="password")
odbcQuery(ch2, "show databases;")

More on unixODBC on Mac OS X

If one wants to use unixODBC on Mac OS X instead, note the following:

  • First install unixODBC via Homebrew with brew install unixodbc.
  • Compile R from source to have it work with unixODBC (R binaries from the installer uses iODBC by default).
  • Can choose --with-odbc-manager=odbc when compiling RODBC.
  • When compiling freeTDS, include the argument with-unixodbc (pass to Homebrew or when compiling manually).
  • I’m unable to compile the MySQL Connector driver on Mac OS X from source (Homebrew or manually). Thus, it won’t work with unixODBC. I believe I tried unixODBC and MySQL Connector from macports, and those work.
  • pyodbc only works with iODBC on Mac OS X (inspect setup file). Currently I can’t get pyodbc to work with unixODBC on Mac OS X.

More differences between unixODBC and iODBC

unixODBC comes with the isql command to access different DBMS from the command line interpreter. iODBC comes with the iodbctest and iodbctestw commands. The command isql works for me on Mac OS X when I set freeTDS up to work with unixODBC (e.g., accessing MS SQL Server). I couldn’t access MySQL server because the MySQL Connector driver was compiled for use with iODBC.

If I use iODBC, I get the following for trying to access a MySQL server:

$ iodbctestw "DSN=sqlserver01;UID=username;PWD=password"
iODBC Unicode Demonstration program
This program shows an interactive SQL processor
Driver Manager: 03.52.0607.1008
1: SQLDriverConnectW = [MySQL][ODBC 5.1 Driver]Prompting is not supported on this platform. Please provide all required connect information. (0) SQLSTATE=HY000
1: ODBC_Connect = [MySQL][ODBC 5.1 Driver]Prompting is not supported on this platform. Please provide all required connect information. (0) SQLSTATE=HY000

When I try to access SQL Server, I get

$ iodbctestw "DSN=sqlserver01;UID=username;PWD=password"
iODBC Unicode Demonstration program
This program shows an interactive SQL processor
Driver Manager: 03.52.0607.1008
1: SQLDriverConnectW = [FreeTDS][SQL Server]Login failed for user 'username'. (18456) SQLSTATE=42000
2: SQLDriverConnectW = [FreeTDS][SQL Server]Unable to connect to data source (0) SQLSTATE=08001
1: ODBC_Connect = [FreeTDS][SQL Server]Login failed for user 'username'. (18456) SQLSTATE=42000
2: ODBC_Connect = [FreeTDS][SQL Server]Unable to connect to data source (0) SQLSTATE=08001

Don’t know why that is so. I guess it’s not too important to use an interactive interpreter. What matter is that the driver works with R and Python. Perhaps I should consider sqsh or do more searching.

My own programming style convention for most languages

I write code mainly in R, and from times to times, in C, C++, SAS, bash, python, and perl. There are style guides out there that help make your code more consistent and readable to yourself and others. Here is a style guide for C++, and here is Google’s style guide for R and here is Hadley Wickam’s guide for R. For R, I agree more with Google’s style guide than Hadley Wickam’s because I absolutely hate typing the underscore (personal preference) and because Google’s style guide seems more related to that of the C++’s guide. Style guides differ by languages because the languages are different (restrictions on names, etc.).

My brain goes crazy if I have to remember and apply multiple styles, so I want to use a convention that I can use consistently for all languages. This boils down to refraining from using special characters such as “-“, “.”, and “_” in names as these characters can have special meaning in different languages. Here it goes:

  • constants: the letter k or n followed by a description. For example, kConstant or nSim.
  • variable names: description words where the first word is lower case and every subsequent words are lower case but with an upper case first letter. For example, varBeta1 for variance of beta1.
  • function names: a verb followed by description words, with each word capitalized except. For example, savePlot and computeGRhoGamma. This is the same as the variable names. I originally was going to follow the Google R style guide instead of the C++ style guide, but opted for the latter because the reasoning made sense: the distinction is obvious based on syntax when a function is called (has parentheses and possibly arguments).
  • function arguments: depending on the type of argument, use the variable names or function names style for each. To help distinguish an argument that takes in a function, the argument should begin with f if it takes in a function. For example, drop=TRUE and fSummarize=summarizeDefault.

When breaking these conventions lead to a better understanding of the code (easier on the brain), I will not hesitate to break them. For example, using i, j, k as iterator variables, using na.rm as functional argument in R, or rKM for a function that draw random numbers from a Kaplan-Meier survival curve.

Now, if only I can just magically transform all of my own code into this convention. I’m going to really hate going back to old code that doesn’t follow my own style, especially when they refer to code in packages that I will update according to my new convention.

Backup my google calendars via shell and python scripts and scheduling them via cron

So recently my calendar got messed up a little (dupes) due my mobile phone (HTC Touch Pro running Windows Mobile 6.1) resetting [syncing again messed it up]. I realize (now???) I need to keep backups of my calendars and contacts.

I took the following script online (which requires python and the gdata module):

<pre class="src src-sh"><span style="color: #ff4500;">#</span><span style="color: #ff4500;">! /usr/bin/</span><span style="color: #00ffff;">python</span>

## This script backs up my google contacts ## http://www.joet3ch.com/2008/09/25/backup-google-contacts/

import gdata.contacts.service gd_client = gdata.contacts.service.ContactsService() gd_client.ClientLogin(‘uname@domain.com’, ‘password’) query = gdata.contacts.service.ContactsQuery() query.max_results = 2000 # change for max contacts returned feed = gd_client.GetContactsFeed(query.ToUri()) print feed

and adapted the following script to backup my calendars and contacts:

 <pre class="src src-sh"><span style="color: #ff4500;">#</span><span style="color: #ff4500;">! /bin/</span><span style="color: #00ffff;">bash</span>

## This script backs up my calendars — will use with Cron to backup daily. ## following is adapted from http://permanentinkdesign.com/articles/backing-up-a-google-calendar/ INCREMENT=date +%Y%m%d%H%M DIR=“$HOME/Documents/Backup_Google” USERNAME=“uname@domain.com” PASSWORD=“mypassword” ## Vinh’s calendar curl -s url-to-private-ics -o “$DIR/Vinh_$INCREMENT.ics” ## TNTTSP’s calendar’ curl -s url-to-private-ics -o “$DIR/TNTTSP_$INCREMENT.ics” ## Birthdays curl -s url-to-private-ics -o “$DIR/Birthdays_$INCREMENT.ics”

## Now backup Google Contacts ## http://www.joet3ch.com/2008/09/25/backup-google-contacts/ $HOME/Documents/bin/Backup_Google_Contacts.py > ${DIR}/VinhContacts_$INCREMENT.xml

## Now backup Google Reader subscriptions ## http://www.clausconrad.com/blog/backup-google-reader-subscription-list-to-opml-automatically $HOME/Documents/bin/gr-opml-backup.py $USERNAME $PASSWORD > Vinh_GoogleReaderSubs_$INCREMENT.opml

find $DIR -mtime +14 -exec rm -f {} \;

Every time the script runs, it backs up my 3 calendars and contacts. The very last line says delete if files are 14 days or older.

Now, to set up my crontab to run this everyday at 10pm, I type “crontab -e” in the shell. I put

<pre class="src src-sh">00 22 * * * $<span style="color: #eedd82;">HOME</span>/Documents/bin/Backup_Google.sh

when vi (or your default visual editor) opens up.

More info about cron on the Mac is here.

Sometimes, you might not want to schedule a recurring script. To run a script once, we can rely on the at (“at”) command. More information here and here. However, the at command didn’t work on my Macbook when I tried it a while back.

UPDATE: The script above (Backup_Google.sh) is updated to include the google reader subscription, thanks to this site.