Change delimiter in a csv file and remove line breaks in fields

I wrote a script to convert delimiters in CSV files, eg, commas to pipes. I prefer pipe-delimited files because the the pipe-delimiter (|) will not clash data in the different fields 99.999% of the time. I also added the option to convert newline () and carriage return () characters in the data fields to spaces. This comes in handy when I use PROC IMPORT in SAS as line breaks cause it to choke.

Here’s my csvconvert.py script:

#! /usr/bin/env python

#### Command line arguments
import argparse
parser = argparse.ArgumentParser(description="Convert delimited file from one delimiter to another; defaults to converting CSV to pipe-delimited.")
parser.add_argument("--dlm-input", action="store", dest="dlm_in", default=",", required=False, help="delimiter of the input file; defaults to comma (,)", nargs='?', metavar="','")
parser.add_argument("--dlm-output", action="store", dest="dlm_out", default="|", required=False, help="delimiter of the output file; defaults to pipe (|)", nargs='?', metavar="'|'")
parser.add_argument("--remove-line-char", action="store_true", dest="remove_line_char", default=False, help="remove \\n and \\r characters in fields and replace with spaces")
parser.add_argument("--quote-char", action="store", dest="quote_char", default='"', required=False, help="quote character; defaults to double quote (\")", nargs='?', metavar="\"")
parser.add_argument("-i", "--input", action="store", dest="input", required=False, help="input file; if not specified, take from standard input.", nargs='?', metavar="file.csv")
parser.add_argument("-o", "--output", action="store", dest="output", required=False, help="output file; if not specified, write to standard output", nargs='?', metavar="file.pipe")
parser.add_argument("-v", "--verbose", action="store_true", dest="verbose", default=False, help="increase verbosity")
args  =  parser.parse_args()
# print args

# http://snipplr.com/view/45759/convert-csv-file-to-pipe-delineated-file/
import argparse
import csv
import sys
from signal import signal, SIGPIPE, SIG_DFL # http://stackoverflow.com/questions/14207708/ioerror-errno-32-broken-pipe-python
signal(SIGPIPE,SIG_DFL) ## no error when exiting a pipe like less

if args.input:
    csv_reader = csv.reader(open(args.input, 'rb'), delimiter=args.dlm_in, quotechar=args.quote_char)
else:
    csv_reader = csv.reader(sys.stdin, delimiter=args.dlm_in, quotechar=args.quote_char)

if args.output:
    h_outfile = open(args.output, 'wb')
else:
    h_outfile = sys.stdout

for row in csv_reader:
    row = args.dlm_out.join(row)
    if args.remove_line_char:
        row  =  row.replace('\n', ' ').replace('\r', ' ')
    h_outfile.write("%s\n" % (row))
    h_outfile.flush()
    # print row

Help description:

usage: csvconvert.py [-h] [--dlm-input [',']] [--dlm-output ['|']]
                     [--remove-line-char] [--quote-char ["]] [-i [file.csv]]
                     [-o [file.pipe]] [-v]

Convert delimited file from one delimiter to another; defaults to converting
CSV to pipe-delimited.

optional arguments:
  -h, --help            show this help message and exit
  --dlm-input [',']     delimiter of the input file; defaults to comma (,)
  --dlm-output ['|']    delimiter of the output file; defaults to pipe (|)
  --remove-line-char    remove \n and \r characters in fields and replace with
                        spaces
  --quote-char ["]      quote character; defaults to double quote (")
  -i [file.csv], --input [file.csv]
                        input file; if not specified, take from standard
                        input.
  -o [file.pipe], --output [file.pipe]
                        output file; if not specified, write to standard
                        output
  -v, --verbose         increase verbosity

Usage:

cat myfile.csv | csvconvert.py --remove-line-char > myfile.pipe

Issues with https proxy in Python via suds and urllib2

I recently had the need to access a SOAP API to obtain some data. SOAP works by posting an xml file to a site url in a format defined by the API’s schema. The API then returns data, also in a form of an xml file. Based on this post, I figured suds was the easiest way to utilize Python to access the API so I could sequentially (and hence, parallelize) query data repeatedly. suds did turn out to be relatively easy to use:

from suds.client import Client
url = 'http://www.ripedev.com/webservices/localtime.asmx?WSDL'
client = Client(url)
print client
client.service.LocalTimeByZipCode('90210')

This worked on my home network. At work, I had to utilize a proxy in order to access the outside world. Otherwise, I’d get a connection refuse message: urllib2.URLError: <urlopen error [Errno 111] Connection refused>. The modification to use a proxy was straightforward:

from suds.client import Client
proxy = {'http': 'proxy_username:proxy_password@proxy_server.com:port'}
url = 'http://www.ripedev.com/webservices/localtime.asmx?WSDL'
# client = Client(url)
client = Client(url, proxy=proxy)
print client
client.service.LocalTimeByZipCode('90210')

The previous examples were from a public SOAP API I found online. Now, the site I wanted to actually hit uses ssl for encryption (i.e., https site) and requires authentication. I thought the fix would be as simple as:

from suds.client import Client
proxy = {'https': 'proxy_username:proxy_password@proxy_server.com:port'}
url = 'https://some_server.com/path/to/soap_api?wsdl'
un = 'site_username'
pw = 'site_password'
# client = Client(url)
client = Client(url, proxy=proxy, username=un, password=pw)
print client
client.service.someFunction(args)

However, I got the error message: Exception: (404, u'/path/to/soap_api'). Very weird to me. Is it an authentication issue? Is it a proxy issue? If a proxy issue, how so, as my previous toy example worked. Tried the same site on my home network where there is no firewall, and things worked:

from suds.client import Client
url = 'https://some_server.com/path/to/soap_api?wsdl'
un = 'site_username'
pw = 'site_password'
# client = Client(url)
client = Client(url, username=un, password=pw)
print client
client.service.someFunction(args)

Conclusion? Must be a proxy issue with https. I used the following prior to calling suds to help with debugging:

import logging
logging.basicConfig(level=logging.INFO)
logging.getLogger('suds.client').setLevel(logging.DEBUG)
logging.getLogger('suds.transport').setLevel(logging.DEBUG)
logging.getLogger('suds.xsd.schema').setLevel(logging.DEBUG)
logging.getLogger('suds.wsdl').setLevel(logging.DEBUG)

My initial thoughts after some debugging: there must be something wrong with the proxy as the log shows python sending the request to the target url, but I get back a response that shows the path (minus the domain name) not found. What happened to the domain name? I notified the firewall team to look into this, as it appears the proxy is modifying something (url is not complete?). The firewall team investigated, and found that the proxy is returning a message that warns the ClientHello message is too large. This is one clue. The log also shows that the user was never authenticated and that the ssl handshake was never completed. My thought: still a proxy issue, as the python code works at home. However, the proxy team was able to access the https SOAP API through the proxy using the SOA Client plugin for Firefox. Now that convinced me that something else may be the culprit.

Googled for help, and thought this would be helpful.

import urllib2
import urllib
import httplib
import socket

class ProxyHTTPConnection(httplib.HTTPConnection):
    _ports = {'http' : 80, 'https' : 443}
    def request(self, method, url, body=None, headers={}):
        #request is called before connect, so can interpret url and get
        #real host/port to be used to make CONNECT request to proxy
        proto, rest = urllib.splittype(url)
        if proto is None:
            raise ValueError, "unknown URL type: %s" % url
        #get host
        host, rest = urllib.splithost(rest)
        #try to get port
        host, port = urllib.splitport(host)
        #if port is not defined try to get from proto
        if port is None:
            try:
                port = self._ports[proto]
            except KeyError:
                raise ValueError, "unknown protocol for: %s" % url
        self._real_host = host
        self._real_port = port
        httplib.HTTPConnection.request(self, method, url, body, headers)
    def connect(self):
        httplib.HTTPConnection.connect(self)
        #send proxy CONNECT request
        self.send("CONNECT %s:%d HTTP/1.0\r\n\r\n" % (self._real_host, self._real_port))
        #expect a HTTP/1.0 200 Connection established
        response = self.response_class(self.sock, strict=self.strict, method=self._method)
        (version, code, message) = response._read_status()
        #probably here we can handle auth requests...
        if code != 200:
            #proxy returned and error, abort connection, and raise exception
            self.close()
            raise socket.error, "Proxy connection failed: %d %s" % (code, message.strip())
        #eat up header block from proxy....
        while True:
            #should not use directly fp probablu
            line = response.fp.readline()
            if line == '\r\n': break

class ProxyHTTPSConnection(ProxyHTTPConnection):
    default_port = 443
    def __init__(self, host, port = None, key_file = None, cert_file = None, strict = None, timeout=0): # vinh added timeout
        ProxyHTTPConnection.__init__(self, host, port)
        self.key_file = key_file
        self.cert_file = cert_file
    def connect(self):
        ProxyHTTPConnection.connect(self)
        #make the sock ssl-aware
        ssl = socket.ssl(self.sock, self.key_file, self.cert_file)
        self.sock = httplib.FakeSocket(self.sock, ssl)

class ConnectHTTPHandler(urllib2.HTTPHandler):
    def do_open(self, http_class, req):
        return urllib2.HTTPHandler.do_open(self, ProxyHTTPConnection, req)

class ConnectHTTPSHandler(urllib2.HTTPSHandler):
    def do_open(self, http_class, req):
        return urllib2.HTTPSHandler.do_open(self, ProxyHTTPSConnection, req)

from suds.client import Client
# from httpsproxy import ConnectHTTPSHandler, ConnectHTTPHandler ## these are code from above classes
import urllib2, urllib
from suds.transport.http import HttpTransport
opener = urllib2.build_opener(ConnectHTTPHandler, ConnectHTTPSHandler)
urllib2.install_opener(opener)
t = HttpTransport()
t.urlopener = opener
url = 'https://some_server.com/path/to/soap_api?wsdl'
proxy = {'https': 'proxy_username:proxy_password@proxy_server.com:port'}
un = 'site_username'
pw = 'site_password'
client = Client(url=url, transport=t, proxy=proxy, username=un, password=pw)
client = Client(url=url, transport=t, proxy=proxy, username=un, password=pw, location='https://some_server.com/path/to/soap_api?wsdl') ## some site suggests specifying location

This too did not work. Continued to google, and found that lot’s of people are having issues with https and proxy. I knew suds depended on urllib2, so googled about that as well, and people too had issues with urllib2 in terms of https and proxy. I then decided to investigate using urllib2 to contact the https url through a proxy:

## http://stackoverflow.com/questions/5227333/xml-soap-post-error-what-am-i-doing-wrong
## http://stackoverflow.com/questions/34079/how-to-specify-an-authenticated-proxy-for-a-python-http-connect
### at home this works
import urllib2
url = 'https://some_server.com/path/to/soap_api?wsdl'
password_mgr = urllib2.HTTPPasswordMgrWithDefaultRealm()
password_mgr.add_password(None,
                          uri=url,
                          user='site_username',
                          passwd='site_password')
auth_handler = urllib2.HTTPBasicAuthHandler(password_mgr)
opener = urllib2.build_opener(auth_handler)
urllib2.install_opener(opener)
page = urllib2.urlopen(url)
page.read()

### work network, does not work:
url = 'https://some_server.com/path/to/soap_api?wsdl'
proxy = urllib2.ProxyHandler({'https':'proxy_username:proxy_password@proxy_server.com:port', 'http':'proxy_username:proxy_password@proxy_server.com:port'})
password_mgr = urllib2.HTTPPasswordMgrWithDefaultRealm()
password_mgr.add_password(None,
                          uri=url,
                          user='site_username',
                          passwd='site_password')
auth_handler = urllib2.HTTPBasicAuthHandler(password_mgr)
opener = urllib2.build_opener(proxy, auth_handler, urllib2.HTTPSHandler)
urllib2.install_opener(opener)
page = urllib2.urlopen(site)
### also tried re-doing above, but with the custom handler as defined in the previous code chunk (http://code.activestate.com/recipes/456195/) running first (run the list of classes)

No luck. I re-read this post that I ran into before, and really agreed that urllib2 is severely flawed, especially when using https proxy. At the end of the page, the author suggested using the requests package. Tried it out, and I was able to connect using the https proxy:

import requests
import xmltodict
p1 = 'http://proxy_username:proxy_password@proxy_server.com:port'
p2 = 'https://proxy_username:proxy_password@proxy_server.com:port'
proxy = {'http': p1, 'https':p2}

site = 'https://some_server.com/path/to/soap_api?wsdl'
r = requests.get(site, proxies=proxy, auth=('site_username', 'site_password'))
r.text ## works
soap_xml_in = """<?xml version="1.0" encoding="UTF-8"?>
...
"""
headers = {'SOAPAction': u'""', 'Content-Type': 'text/xml; charset=utf-8', 'Content-type': 'text/xml; charset=utf-8', 'Soapaction': u'""'}
soap_xml_out = requests.post(site, data=soap_xml_in, headers=headers, proxies=proxy, auth=('site_username', 'site_password')).text

My learnings?

  • suds is great for accessing SOAP, just not when you have to access an https site through a firewall.
  • urllib2 is severely flawed. Things only work in very standard situations.
  • requests package is very powerful and just works. Even though I have to deal with actual xml files as opposed to leveraging suds‘ pythonic structures, the xmltodict package helps to translate the xml file into dictionaries that only adds marginal effort to extract out relevant data.

NOTE: I had to install libuuid-devel in cygwin64 because I was getting an installation error.

Upgrading Ubuntu 12.04 to 14.04 breaks encrypted LVM

My laptop runs Ubuntu and is fully encrypted (since version 10.04). Upgrade from 10.04 to 12.04 was smooth in the sense that my system booted fine, asking for the passphrase to unlock the LVM. However, when I upgraded from 12.04 to 14.04, things broke and my laptop no longer booted properly as the LVM never got encrypted. I had to do the following to get my laptop working again (after many rounds of trial and error):

  • Boot a live usb Ubuntu session, de-crypted the LVM, and chroot’ed to run as the original OS
  • Finish the upgrade session via apt-get update && apt-get upgrade
  • It appears Ubuntu 14.04 installed some new package (did not write name down) that manages LVM or disks somehow (based on googling the error message). I removed this package.
  • Saw lvm issues, so installed the package lvm2
  • I made sure both dm-crypt and lvm2 were installed, and were accessible in initramfs, as cryptsetup was removed from initramfs since version 13.10. Had to do something with the following CRYPTSETUP issue.
  • Based on this post, I modified various files, but things still did not boot properly. I believe what finally fixed it was explicitly pointing to the LVM by /dev/sda5 in the GRUB_CMDLINE_LINUX line in /etc/default/grub.

The following is summary of these files for me. /etc/crypttab:

# <target name> <source device>         <key file>      <options>
# sdb5_crypt UUID=731a44c4-4655-4f2b-ae1a-2e3e6a14f2ef none luks
sdb5_crypt UUID=731a44c4-4655-4f2b-ae1a-2e3e6a14f2ef none luks,retry=1,lvm=vg01

/etc/initramfs-tools/conf.d/cryptroot:

## vinh created http://www.joh.fi/posts/2014/03/18/install-ubuntu-1310-on-top-of-encrypted-lvm/
# CRYPTROOT=target=sdb5_crypt,source=/dev/disk/by-uuid/f1ba5a54-ac7e-419d-8762-43da3274aba4
CRYPTOPTS=target=sdb5_crypt,source=UUID=f1ba5a54-ac7e-419d-8762-43da3274aba4,lvm=vg01

Then run update-initramfs -k all -c in order to update the initramfs images.

Have this line in /etc/default/grub:

#GRUB_CMDLINE_LINUX="cryptopts=target=sdb5_crypt,source=/dev/disk/by-uuid/f1ba5a54-ac7e-419d-8762-43da3274aba4,lvm=vg01"
#GRUB_CMDLINE_LINUX="cryptopts=target=sdb5_crypt,source=UUID=f1ba5a54-ac7e-419d-8762-43da3274aba4,lvm=vg01"
GRUB_CMDLINE_LINUX="cryptopts=target=sdb5_crypt,source=/dev/sda5,lvm=vg01"

Run update-grub.

Again, I think the key is the source definition in the previous line. I kept trying to refer to it by uuid but that did not work.

optparse R package for creating command line scripts with arguments

Just discovered the optparse package in R that allows me to write a command line R script that could incorporate arguments (similar to Python’s argparse). Here’s an example:

#! /usr/bin/env Rscript

# http://cran.r-project.org/web/packages/optparse/vignettes/optparse.pdf
suppressPackageStartupMessages(library("optparse"))

option_list <- list(
    make_option(c('-d', '--date'), action='store', dest='date', type='character', default=Sys.Date(), metavar='"YYYY-MM-DD"', help='As of date to extract data from.  Defaults to today.')
)

opt <- parse_args(OptionParser(option_list=option_list))

# print(opt$date)
cat(sprintf('select * where contract_date > "%s"\n', opt$date)

Save this as my_scrypt.R, and do chmod +x my_script.R. Now check out ./my_script.R --help.

Package management in R and Python at work without root and behind firewall

My current job has strict security measures (referring to root access on a Linux server and the inability to access outside the company’s network), so it can be difficult in getting the tools necessary for my work, namely R packages on CRAN and Python packages via pip.

On my Windows workstation, I was able to install R by downloading the installer online and Python via Cygwin. However, R and Python are unable to connect to the internet to download and install additional packages because of the company’s firewall. To get around this for R, I could:

  • add the flag --internet2 to the execution path in R’s shortcut,
  • call setInternet2(TRUE) in the R console, or
  • set the environment variable http_proxy=http://username:password@proxy_server:port/.n

The first two tells R to use the proxy defined in Internet Explorer. I was able to access CRAN via my web browser, so this works. If CRAN is blocked on the browser, find out what proxy server is available at work and use that to access the outside world. If CRAN is also blocked on the proxy, put in a request to add it to the white list.

As for Python, install pip and use a proxy to download and install packages:

wget https://bootstrap.pypa.io/get-pip.py
python get-pip.py --proxy="username:password@proxy_server:port"
pip install --proxy="username:password@proxy_server:port" argparse numpy pandas ## etc

NOTE: pip 1.3.1 has issues with proxy servers, so use the latest version.

On a Linux/Unix server, the added complexity is that of a lack of root access. Typically, Python is available by default on any modern distro. If not, have the admin team install R and Python via the distribution’s package manager, and if they can’t, then compile the two from source and install them locally. Once installed, use the same method as before for Python pip, but with the --user flag in order to install the packages locally in ~/.local/ (pip command is at ~/.local/bin/pip). For R, set the environment variables

export http_proxy="http://proxy_server:port/"
export http_proxy_user="username:password"

and install the libraries to ~/Rlib (add this to the library path via .libPaths() in ~/.Rprofile).

Parental control on home network

I recently looked into ways to block content on the home network. To protect the entire network, it seems like the filter should be placed on the router. This article on Lifehacker lists a few popular methods. I think using OpenDNS to filter is easy enough to get started. However, it’s quite easy to configure your connected computer to use a different DNS provider. However, one could set a static DNS on their tomato router.

Output data to Excel for reproducible post-hoc analysis or visualization

As much as I like to analyze and visualize data in R, I sometimes have the need to export results/data into Excel for my business partners or myself to consume in Excel or Powerpoint (eg, create custom/edit-able bar charts with various graphical overlays in a powerpoint slide). As I’ve been using the XLConnect package to read xls/xlsx files, I’m also using it to write data to a sheet in an Excel workbook. I write the necessary data out to the Data sheet:

library(XLConnect)
## read data from files/DB and manipulate to get to the final data set
## now, write data
writeWorksheetToFile(file='foo.xlsx', data=iris, sheet='Data', clearSheets=TRUE) ## in case number of rows is less than before

I ask my collaborators to not edit the Data sheet except adding filters or sorting when they are inspecting/eye-balling the data. I ask them to do all analysis in separate sheets. The reason for this is to keep the work reproducible in case the data needs to be refreshed (error in data, repeat the analysis on new data, etc). That is, when I need to refresh the data, I ask for the modified Excel workbook and write out refreshed data to the same sheet (hence the clearSheets=TRUE option in case the number of rows is less than before). That way, calculations or plots referencing columns in the Data sheet would automatically be refreshed in the workbook with the refreshed data.

This is just another way to prevent inefficiency in the work flow and allows for reproducibility even when collaborating with Excel workbooks.

This work flow should in theory also work with SAS:

proc export data=mydata outfile='foo.xlsx'
    dbms=xlsx ;
    sheet=Data ;
run ;

First hackintosh with Windows dual boot using Intel NUC DC3217BY

wpid-2014-02-07-multibeast.png

My friend recently introduced me to the Intel NUC (DC3217BY). It’s basically a micro form factor barebone system that comes with Intel’s ULV i3 processor (powerful and low power consumpton). I decided to get one, slapped on 8 GB of ram, a 256 GB mSATA SSD, and a Broadcom based half-sized e-PCI network card, and Hackintosh’d it since the processor is similar to what’s in an Apple Macbook Air. Basic instructions for this particular hardware could be found here and here. A generic guide could be found here. For dual booting with Windows, this article and this post helped. This is what I recalled doing to set up:

  • Update the BIOS to the latest version
  • Create a bootable Windows 7 usb drive on Ubuntu using unetbootin (must be version 494) (drive must be formatted to NTFS)
  • Get Mac OS X 10.9.1 (Mavericks) from the Apple App store
  • Download Unibeast and Multibeast at tonymacx86
  • Download Chameleon Wizard
  • Download Kext Installer
  • On an existing machine with Mac OS X, /Applications/Utilities/Disk Utility and a (> 8 GB) usb flash drive to Mac OS Extended (Master Boot Record enabled).
  • Run Unibeast to load the Mac OS X installer on it
  • Copy Multibeast, Chameleon Wizard, and Kext Installer into this flash drive. Download DSDT.aml and the patched AppleIntelFramebufferCapri.kext here and place them on the flash drive as well (these are to get HDMI audio to work).
  • Boot up the flash drive, and boot the installer with the flags -x PCIRootUID=1 GraphicsEnabler=Yes per this post relevant to Mavericks
  • Once the Mac installer is booted, go to the Utilities Menu and launch Disk Utility. Format the hard drive into two partitions. The first should be called “Macintosh HD” and formatted to Mac OS Extended (Journaled) and the second should be called “BOOTCAMP” and formatted to MS-DOS (FAT).
  • Shutdown, insert in the Windows 7 usb, and install Windows on the second partition
  • Boot the Mac usb again, and install Mac OS X on the first partition
  • Boot the Mac usb again, and select to boot into the Mac OS X partition
  • Run Multibeast to do some post-configurations so that the hardware just works (options in image below)
  • Edit the org.chameleon.Boot via Chameleon Wizard per the image below
  • Copy DSDT.aml to /Extra/DSDT.aml
  • Install Kext Installer. Use it to install the patched AppleIntelFramebufferCapri.kext, then use it to rebuild permissions and kext cache. Restart the computer to have HDMI audio working.

Multibeast options: 2014-02-07-multibeast.png

Chameleon Wizard options for org.chameleon.Boot: 2014-02-07-chameleon-wizard.png

What would I do differently now? Consider getting a network card with bluetooth like the Dell DW1702 per this. I’m not sure if my monitor has speakers, so this would enable me to use wireless speakers. Update (8/22/2014): I ordered this wifi + bluetooth card (BCM943225 HMB/AzureWave AW-NB290) as it was cheap and that this guide shows it works well on a Mac after installing toledaARPT.kext from the repo. Now I have both wifi and bluetooth. The Windows driver for this card can be found here (direct link here).

Now, time to mount the small NUC to the back of my 27″ monitor.

Bulk resize images and keeping original files

Suppose you have a directory (with subdirectories) full of images, and you want to resize them all while keeping the original images. To do so, first create a copy of the directory tree without the image files. Then, using a for loop, find each image file and apply the convert command to it. The following is an example to resize jpg files to 40% of the original quality.

mkdir /path/to/mirror_dir
find /path/to/image_dir -type d -exec mkdir -p /path/to/mirror_dir/{} \;
cd /path/to/image_dir
for i in $(find -iname "*.jpg"); do echo $i; convert -resize 40% $i /path/to/mirror_dir/$i; done