Change delimiter in a csv file and remove line breaks in fields

I wrote a script to convert delimiters in CSV files, eg, commas to pipes. I prefer pipe-delimited files because the the pipe-delimiter (|) will not clash data in the different fields 99.999% of the time. I also added the option to convert newline () and carriage return () characters in the data fields to spaces. This comes in handy when I use PROC IMPORT in SAS as line breaks cause it to choke.

Here’s my csvconvert.py script:

#! /usr/bin/env python

#### Command line arguments
import argparse
parser = argparse.ArgumentParser(description="Convert delimited file from one delimiter to another; defaults to converting CSV to pipe-delimited.")
parser.add_argument("--dlm-input", action="store", dest="dlm_in", default=",", required=False, help="delimiter of the input file; defaults to comma (,)", nargs='?', metavar="','")
parser.add_argument("--dlm-output", action="store", dest="dlm_out", default="|", required=False, help="delimiter of the output file; defaults to pipe (|)", nargs='?', metavar="'|'")
parser.add_argument("--remove-line-char", action="store_true", dest="remove_line_char", default=False, help="remove \\n and \\r characters in fields and replace with spaces")
parser.add_argument("--quote-char", action="store", dest="quote_char", default='"', required=False, help="quote character; defaults to double quote (\")", nargs='?', metavar="\"")
parser.add_argument("-i", "--input", action="store", dest="input", required=False, help="input file; if not specified, take from standard input.", nargs='?', metavar="file.csv")
parser.add_argument("-o", "--output", action="store", dest="output", required=False, help="output file; if not specified, write to standard output", nargs='?', metavar="file.pipe")
parser.add_argument("-v", "--verbose", action="store_true", dest="verbose", default=False, help="increase verbosity")
args  =  parser.parse_args()
# print args

# http://snipplr.com/view/45759/convert-csv-file-to-pipe-delineated-file/
import argparse
import csv
import sys
from signal import signal, SIGPIPE, SIG_DFL # http://stackoverflow.com/questions/14207708/ioerror-errno-32-broken-pipe-python
signal(SIGPIPE,SIG_DFL) ## no error when exiting a pipe like less

if args.input:
    csv_reader = csv.reader(open(args.input, 'rb'), delimiter=args.dlm_in, quotechar=args.quote_char)
else:
    csv_reader = csv.reader(sys.stdin, delimiter=args.dlm_in, quotechar=args.quote_char)

if args.output:
    h_outfile = open(args.output, 'wb')
else:
    h_outfile = sys.stdout

for row in csv_reader:
    row = args.dlm_out.join(row)
    if args.remove_line_char:
        row  =  row.replace('\n', ' ').replace('\r', ' ')
    h_outfile.write("%s\n" % (row))
    h_outfile.flush()
    # print row

Help description:

usage: csvconvert.py [-h] [--dlm-input [',']] [--dlm-output ['|']]
                     [--remove-line-char] [--quote-char ["]] [-i [file.csv]]
                     [-o [file.pipe]] [-v]

Convert delimited file from one delimiter to another; defaults to converting CSV to pipe-delimited.

optional arguments: -h, --help show this help message and exit --dlm-input [','] delimiter of the input file; defaults to comma (,) --dlm-output ['|'] delimiter of the output file; defaults to pipe (|) --remove-line-char remove \n and \r characters in fields and replace with spaces --quote-char ["] quote character; defaults to double quote (") -i [file.csv], --input [file.csv] input file; if not specified, take from standard input. -o [file.pipe], --output [file.pipe] output file; if not specified, write to standard output -v, --verbose increase verbosity

Usage:

cat myfile.csv | csvconvert.py --remove-line-char > myfile.pipe

Issues with https proxy in Python via suds and urllib2

I recently had the need to access a SOAP API to obtain some data. SOAP works by posting an xml file to a site url in a format defined by the API’s schema. The API then returns data, also in a form of an xml file. Based on this post, I figured suds was the easiest way to utilize Python to access the API so I could sequentially (and hence, parallelize) query data repeatedly. suds did turn out to be relatively easy to use:

from suds.client import Client
url = 'http://www.ripedev.com/webservices/localtime.asmx?WSDL'
client = Client(url)
print client
client.service.LocalTimeByZipCode('90210')

This worked on my home network. At work, I had to utilize a proxy in order to access the outside world. Otherwise, I’d get a connection refuse message: urllib2.URLError: <urlopen error [Errno 111] Connection refused>. The modification to use a proxy was straightforward:

from suds.client import Client
proxy = {'http': 'proxy_username:proxy_password@proxy_server.com:port'}
url = 'http://www.ripedev.com/webservices/localtime.asmx?WSDL'
# client = Client(url)
client = Client(url, proxy=proxy)
print client
client.service.LocalTimeByZipCode('90210')

The previous examples were from a public SOAP API I found online. Now, the site I wanted to actually hit uses ssl for encryption (i.e., https site) and requires authentication. I thought the fix would be as simple as:

from suds.client import Client
proxy = {'https': 'proxy_username:proxy_password@proxy_server.com:port'}
url = 'https://some_server.com/path/to/soap_api?wsdl'
un = 'site_username'
pw = 'site_password'
# client = Client(url)
client = Client(url, proxy=proxy, username=un, password=pw)
print client
client.service.someFunction(args)

However, I got the error message: Exception: (404, u'/path/to/soap_api'). Very weird to me. Is it an authentication issue? Is it a proxy issue? If a proxy issue, how so, as my previous toy example worked. Tried the same site on my home network where there is no firewall, and things worked:

from suds.client import Client
url = 'https://some_server.com/path/to/soap_api?wsdl'
un = 'site_username'
pw = 'site_password'
# client = Client(url)
client = Client(url, username=un, password=pw)
print client
client.service.someFunction(args)

Conclusion? Must be a proxy issue with https. I used the following prior to calling suds to help with debugging:

import logging
logging.basicConfig(level=logging.INFO)
logging.getLogger('suds.client').setLevel(logging.DEBUG)
logging.getLogger('suds.transport').setLevel(logging.DEBUG)
logging.getLogger('suds.xsd.schema').setLevel(logging.DEBUG)
logging.getLogger('suds.wsdl').setLevel(logging.DEBUG)

My initial thoughts after some debugging: there must be something wrong with the proxy as the log shows python sending the request to the target url, but I get back a response that shows the path (minus the domain name) not found. What happened to the domain name? I notified the firewall team to look into this, as it appears the proxy is modifying something (url is not complete?). The firewall team investigated, and found that the proxy is returning a message that warns the ClientHello message is too large. This is one clue. The log also shows that the user was never authenticated and that the ssl handshake was never completed. My thought: still a proxy issue, as the python code works at home. However, the proxy team was able to access the https SOAP API through the proxy using the SOA Client plugin for Firefox. Now that convinced me that something else may be the culprit.

Googled for help, and thought this would be helpful.

import urllib2
import urllib
import httplib
import socket

class ProxyHTTPConnection(httplib.HTTPConnection):
    _ports = {'http' : 80, 'https' : 443}
    def request(self, method, url, body=None, headers={}):
        #request is called before connect, so can interpret url and get
        #real host/port to be used to make CONNECT request to proxy
        proto, rest = urllib.splittype(url)
        if proto is None:
            raise ValueError, "unknown URL type: %s" % url
        #get host
        host, rest = urllib.splithost(rest)
        #try to get port
        host, port = urllib.splitport(host)
        #if port is not defined try to get from proto
        if port is None:
            try:
                port = self._ports[proto]
            except KeyError:
                raise ValueError, "unknown protocol for: %s" % url
        self._real_host = host
        self._real_port = port
        httplib.HTTPConnection.request(self, method, url, body, headers)
    def connect(self):
        httplib.HTTPConnection.connect(self)
        #send proxy CONNECT request
        self.send("CONNECT %s:%d HTTP/1.0\r\n\r\n" % (self._real_host, self._real_port))
        #expect a HTTP/1.0 200 Connection established
        response = self.response_class(self.sock, strict=self.strict, method=self._method)
        (version, code, message) = response._read_status()
        #probably here we can handle auth requests...
        if code != 200:
            #proxy returned and error, abort connection, and raise exception
            self.close()
            raise socket.error, "Proxy connection failed: %d %s" % (code, message.strip())
        #eat up header block from proxy....
        while True:
            #should not use directly fp probablu
            line = response.fp.readline()
            if line == '\r\n': break

class ProxyHTTPSConnection(ProxyHTTPConnection):
    default_port = 443
    def __init__(self, host, port = None, key_file = None, cert_file = None, strict = None, timeout=0): # vinh added timeout
        ProxyHTTPConnection.__init__(self, host, port)
        self.key_file = key_file
        self.cert_file = cert_file
    def connect(self):
        ProxyHTTPConnection.connect(self)
        #make the sock ssl-aware
        ssl = socket.ssl(self.sock, self.key_file, self.cert_file)
        self.sock = httplib.FakeSocket(self.sock, ssl)

class ConnectHTTPHandler(urllib2.HTTPHandler):
    def do_open(self, http_class, req):
        return urllib2.HTTPHandler.do_open(self, ProxyHTTPConnection, req)

class ConnectHTTPSHandler(urllib2.HTTPSHandler):
    def do_open(self, http_class, req):
        return urllib2.HTTPSHandler.do_open(self, ProxyHTTPSConnection, req)

from suds.client import Client
# from httpsproxy import ConnectHTTPSHandler, ConnectHTTPHandler ## these are code from above classes
import urllib2, urllib
from suds.transport.http import HttpTransport
opener = urllib2.build_opener(ConnectHTTPHandler, ConnectHTTPSHandler)
urllib2.install_opener(opener)
t = HttpTransport()
t.urlopener = opener
url = 'https://some_server.com/path/to/soap_api?wsdl'
proxy = {'https': 'proxy_username:proxy_password@proxy_server.com:port'}
un = 'site_username'
pw = 'site_password'
client = Client(url=url, transport=t, proxy=proxy, username=un, password=pw)
client = Client(url=url, transport=t, proxy=proxy, username=un, password=pw, location='https://some_server.com/path/to/soap_api?wsdl') ## some site suggests specifying location

This too did not work. Continued to google, and found that lot’s of people are having issues with https and proxy. I knew suds depended on urllib2, so googled about that as well, and people too had issues with urllib2 in terms of https and proxy. I then decided to investigate using urllib2 to contact the https url through a proxy:

## http://stackoverflow.com/questions/5227333/xml-soap-post-error-what-am-i-doing-wrong
## http://stackoverflow.com/questions/34079/how-to-specify-an-authenticated-proxy-for-a-python-http-connect
### at home this works
import urllib2
url = 'https://some_server.com/path/to/soap_api?wsdl'
password_mgr = urllib2.HTTPPasswordMgrWithDefaultRealm()
password_mgr.add_password(None,
                          uri=url,
                          user='site_username',
                          passwd='site_password')
auth_handler = urllib2.HTTPBasicAuthHandler(password_mgr)
opener = urllib2.build_opener(auth_handler)
urllib2.install_opener(opener)
page = urllib2.urlopen(url)
page.read()

### work network, does not work:
url = 'https://some_server.com/path/to/soap_api?wsdl'
proxy = urllib2.ProxyHandler({'https':'proxy_username:proxy_password@proxy_server.com:port', 'http':'proxy_username:proxy_password@proxy_server.com:port'})
password_mgr = urllib2.HTTPPasswordMgrWithDefaultRealm()
password_mgr.add_password(None,
                          uri=url,
                          user='site_username',
                          passwd='site_password')
auth_handler = urllib2.HTTPBasicAuthHandler(password_mgr)
opener = urllib2.build_opener(proxy, auth_handler, urllib2.HTTPSHandler)
urllib2.install_opener(opener)
page = urllib2.urlopen(site)
### also tried re-doing above, but with the custom handler as defined in the previous code chunk (http://code.activestate.com/recipes/456195/) running first (run the list of classes)

No luck. I re-read this post that I ran into before, and really agreed that urllib2 is severely flawed, especially when using https proxy. At the end of the page, the author suggested using the requests package. Tried it out, and I was able to connect using the https proxy:

import requests
import xmltodict
p1 = 'http://proxy_username:proxy_password@proxy_server.com:port'
p2 = 'https://proxy_username:proxy_password@proxy_server.com:port'
proxy = {'http': p1, 'https':p2}

site = 'https://some_server.com/path/to/soap_api?wsdl'
r = requests.get(site, proxies=proxy, auth=('site_username', 'site_password'))
r.text ## works
soap_xml_in = """<?xml version="1.0" encoding="UTF-8"?>
...
"""
headers = {'SOAPAction': u'""', 'Content-Type': 'text/xml; charset=utf-8', 'Content-type': 'text/xml; charset=utf-8', 'Soapaction': u'""'}
soap_xml_out = requests.post(site, data=soap_xml_in, headers=headers, proxies=proxy, auth=('site_username', 'site_password')).text

My learnings?

  • suds is great for accessing SOAP, just not when you have to access an https site through a firewall.
  • urllib2 is severely flawed. Things only work in very standard situations.
  • requests package is very powerful and just works. Even though I have to deal with actual xml files as opposed to leveraging suds‘ pythonic structures, the xmltodict package helps to translate the xml file into dictionaries that only adds marginal effort to extract out relevant data.

NOTE: I had to install libuuid-devel in cygwin64 because I was getting an installation error.

Package management in R and Python at work without root and behind firewall

My current job has strict security measures (referring to root access on a Linux server and the inability to access outside the company’s network), so it can be difficult in getting the tools necessary for my work, namely R packages on CRAN and Python packages via pip.

On my Windows workstation, I was able to install R by downloading the installer online and Python via Cygwin. However, R and Python are unable to connect to the internet to download and install additional packages because of the company’s firewall. To get around this for R, I could:

  • add the flag --internet2 to the execution path in R’s shortcut,
  • call setInternet2(TRUE) in the R console, or
  • set the environment variable http_proxy=http://username:password@proxy_server:port/.n

The first two tells R to use the proxy defined in Internet Explorer. I was able to access CRAN via my web browser, so this works. If CRAN is blocked on the browser, find out what proxy server is available at work and use that to access the outside world. If CRAN is also blocked on the proxy, put in a request to add it to the white list.

As for Python, install pip and use a proxy to download and install packages:

wget https://bootstrap.pypa.io/get-pip.py
python get-pip.py --proxy="username:password@proxy_server:port"
pip install --proxy="username:password@proxy_server:port" argparse numpy pandas ## etc

NOTE: pip 1.3.1 has issues with proxy servers, so use the latest version.

On a Linux/Unix server, the added complexity is that of a lack of root access. Typically, Python is available by default on any modern distro. If not, have the admin team install R and Python via the distribution’s package manager, and if they can’t, then compile the two from source and install them locally. Once installed, use the same method as before for Python pip, but with the --user flag in order to install the packages locally in ~/.local/ (pip command is at ~/.local/bin/pip). For R, set the environment variables

export http_proxy="http://proxy_server:port/"
export http_proxy_user="username:password"

and install the libraries to ~/Rlib (add this to the library path via .libPaths() in ~/.Rprofile).

Delimited file where delimiter clashes with data values

A comma-separated values (CSV) file is a typical way to store tabular/rectangular data. If a data cell contain a comma, then the cell with the commas is typically wrapped with quotes. However, what if a data cell contains a comma and a quotation mark? To avoid such scenarios, it is typically wise to use a delimiter that has a low chance of showing up in your data, such as the pipe (“|”) or caret (“^”) character. However, there are cases when the data is a long string with all sorts of data characters, including the pipe and caret characters. What then should the delimiter be in order to avoid a delimiter collision? As the Wikipedia article suggests, using special ASCII characters such as the unit/field separator (hex: 1F) could help as they probably won’t be in your data (no keyboard key that corresponds to it!).

Currently, my rule of thumb is to use pipe as the default delimiter. If the data contains complicated strings, then I’ll default to the field separator character. In Python, one could refer to the field separator as ‘\ x1f’. In R, one could refer to it as ‘\ x1F’. In SAS, it could be specified as ‘1F’x. In bash, the character could be specified on the command line (e.g., using the cut command, csvlook command, etc) by specifying $’1f’ as the delimiter character.

If the file contains the newline character in a data cell (\n), then the record separator character (hex: 1E) could be used for determining new lines.

Best practices for importing a delimited file in SAS using the DATA step

The easiest way to import a delimited file (e.g., CSV) in SAS is to use PROC IMPORT:

proc import datafile="/path/to/my_file.txt"
    out=work.my_data
    dbms=dlm
    replace
    ;
    delimiter="|" ;
    guessingrows=32000 ;
run ;

PROC IMPORT isn’t a viable option when the fileref used in the datafile argument is not of the DISK type. For example, the fileref my_pipe would not work in the following example,

filename my_pipe pipe "gunzip -c my_file.txt.gz" ;

because SAS needs “random access” to the fileref (i.e., to determine the variable type). PROC IMPORT also isn’t suitable when you have a very large data set where one of the columns might contain an element that has a very long length (and this might occur after the number of rows specified by guessingrows). Based on my experience, one should use the truncover, missover (don’t go to next line if line ends early), dsd (allow empty field) and lrecl (make this big for long lines; defaults to 256, which means your lines will be truncated if they are longer than 256 characters long) options in the infile statement to avoid unnecessary errors.

Since the infile is delimited, it is easy to import the fields using the list input method. However, one should use the length statement to declare the maximum length for each character variable, and use the informat statement for numeric variables that have special formats (date, dollar amount, etc.). I usually forget and just declare the informats following the variables in the input statement, which only works when we are inputting using the input pointer method (e.g., @27 my_var date9.). Here is an example:

filename my_pipe pipe "gunzip -c my_file.txt.gz" ;
data my_data ;
    infile my_file dlm="|" dsd truncover missover lrecl=50000 ;
    length
        x2 $50
        x3 $25
        ;
    informat
        x4 date9.
        ;
    format
        x4 date9.
        ;
    input
        x1
        x2 $
        x3 $
        x4
        ;
run ;

SAS Proc SQL Group By returns multiple rows per group

Just wanted to note that for traditional SQL implementations (e.g., MySQL, MS-SQL), the Group By statement used to aggregate a variable by certain variable(s) returns 1 row for each group. When a column that is not unique within a group is also selected, then the row that’s returned is determined somehow by the DB software.

In contrast, SAS’s Proc SQL will return multiple rows for each group (the number of original rows), with the aggregated variable repeated for each row within a group. Here’s an example:

<pre class="src src-sas"><span style="color: #7fffd4;">data</span> foo ;
<span style="color: #00ffff;">infile</span> datalines dlm=<span style="color: #ffa07a;">" "</span> ;
<span style="color: #00ffff;">input</span> name $ week $ sales ;
datalines ;

bob 1 20000 bob 2 30000 jane 1 40000 jane 2 50000 mike 1 60000 mike 2 70000 kevin 1 80000 kevin 2 90000 ; run ;

proc sql ; create table foo_agg as select a.name , a.week , sum(a.sales) as total_sales from foo as a group by name ; quit ; run ;

proc export data=foo_agg outfile=“foo_agg.csv” DBMS=csv REPLACE ; run ;

The content of foo_agg.csv looks like

<pre class="example">name,week,total_sales

bob,2,50000 bob,1,50000 jane,1,90000 jane,2,90000 kevin,1,170000 kevin,2,170000 mike,2,130000 mike,1,130000

An analogous return from the SQL code in MySQL or MS-SQL might look something like

name,week,total_sales
bob,2,50000
jane,1,90000
kevin,1,170000
mike,2,130000

In SAS’s Proc SQL, one would need to use the Select Distinct statement in order to remove the duplicate rows.

Note that when combining the Group By statement with a Join, these multiple records per group still hold.

SAS’s implementation is not necessarily bad as it gives the user’s more flexibility in returning an aggregated variable with every row without re-joining the aggregated table with the original table. The user just has to remember this behavior ;).

Execute shell commands with an asterisk in SAS

I wanted to use %sysexec to execute a shell command with an asterisk (shell wildcard for globbing) in a SAS program:

<pre class="src src-sas"><span style="color: #7fffd4;">%sysexec</span> cp /tmp/foo<span style="color: #ff4500;">/*</span><span style="color: #ff4500;">.txt /tmp/bar ;</span>

However, it wasn’t giving me the desired results, probably due to the /* characters as they begin a commented section in a SAS program. Also tried escaping the asterisk with \* and surrounding the shell command with quotes but I didn’t get any luck. Emailed the SAS-L community for help and discovered the x and call system statements in SAS. The following works:

<pre class="src src-sas">x <span style="color: #ffa07a;">"cp /tmp/foo/*.txt /tmp/bar"</span> ;

/ or / data null ; call system(“cp /tmp/foo/*.txt /tmp/bar”) ; run ;

More information on executing shell commands in a SAS program can be found here.