Best practices for importing a delimited file in SAS using the DATA step

The easiest way to import a delimited file (e.g., CSV) in SAS is to use PROC IMPORT:

proc import datafile="/path/to/my_file.txt"
    out=work.my_data
    dbms=dlm
    replace
    ;
    delimiter="|" ;
    guessingrows=32000 ;
run ;

PROC IMPORT isn’t a viable option when the fileref used in the datafile argument is not of the DISK type. For example, the fileref my_pipe would not work in the following example,

filename my_pipe pipe "gunzip -c my_file.txt.gz" ;

because SAS needs “random access” to the fileref (i.e., to determine the variable type). PROC IMPORT also isn’t suitable when you have a very large data set where one of the columns might contain an element that has a very long length (and this might occur after the number of rows specified by guessingrows). Based on my experience, one should use the truncover, missover (don’t go to next line if line ends early), dsd (allow empty field) and lrecl (make this big for long lines; defaults to 256, which means your lines will be truncated if they are longer than 256 characters long) options in the infile statement to avoid unnecessary errors.

Since the infile is delimited, it is easy to import the fields using the list input method. However, one should use the length statement to declare the maximum length for each character variable, and use the informat statement for numeric variables that have special formats (date, dollar amount, etc.). I usually forget and just declare the informats following the variables in the input statement, which only works when we are inputting using the input pointer method (e.g., @27 my_var date9.). Here is an example:

filename my_pipe pipe "gunzip -c my_file.txt.gz" ;
data my_data ;
    infile my_file dlm="|" dsd truncover missover lrecl=50000 ;
    length
        x2 $50
        x3 $25
        ;
    informat
        x4 date9.
        ;
    format
        x4 date9.
        ;
    input
        x1
        x2 $
        x3 $
        x4
        ;
run ;

Make Windows like Linux

I’m back to a job that only allows Windows on our laptops and desktops. Here’s how I configured my workstation to be more Linux-like in order to increase my productivity:

  • Install Google Chrome
  • Map my caps lock key to control
  • Download Cygwin and install the following: xinit (X server), python, gcc-*, openssh, screen, rsync, python-setuptools (easy_install)), git, subversion, xwinclipboard, procps (top command + others), aspell (I use flyspell in emacs), aspell-en, make, zip, unzip, patch, wget, perl, perl-dbi, libcrypt-devel (for perl DBD:ODBC), automake, autoconf, email (modify /etc/email/email.conf and enter correct server and credential information; sendmail is not needed for sending outbound emails, only needed to send internal emails)
  • Install the emacs binaries to C:\Documents and Settings\my_username\bin\emacs-ver_num (Windows XP) or C:\Users\my_username\bin\emacs-ver-num (Windows 7) and copy relevant image library dll files into the emacs binary directory in order for doc-view to work properly (eg, need libpng14-14.dll for emacs 24.3
    • See Vincent Goulet’s emacs distro to see what dll files are needed)
    • One could also use emacs w32 provided by cygwin, but ESS doesn’t seem to work because that version of emacs does not have the function w32-short-file-name compiled with it as needed by ESS; tentative solution can be found here, but it’s probably better to use the compiled emacs binaries available on GNU
  • Set the HOME environment variable to C:\Documents and Settings\my_username or C:\Users\my_username; set LC_ALL and LANG to en_US.utf-8 (for perl dbi error); set CYGWIN=nodosfilewarning
  • Add the following to the PATH environment variable: C:\Documents and Settings\my_username\My Documents\bin;C:\Documents and Settings\my_username\My Documents\bin\emacs-24.3\bin;C:\cygwin\bin
  • Add the following to ~/.bash_profile: . ~/.bashrc
  • Fix the carriage return issue
  • Run touch ~/.startxwinrc to prevent xterm from launching whenever X server is started.
  • Install R using the Windows installer (works with Emacs ESS); install R studio
  • Get sshfs to work on Windows to mount my servers
  • Install Dropbox and symlink my .bashrc and .screenrc files
  • Get tramp in emacs to work properly in order to visit remote servers easily in emacs by first getting the latest copy of tramp and then following these instructions.
  • Python packages: numpy, pandas, csvkit (easy_install works but pip install does not), jedi, epc, pyodbc, mysqldb, pymssql, psycopg.
  • R packages: RODBC, RMySQL, RPostgreSQL, ggplot2, plyr, glmnet
  • Perl packages (for edbi in emacs): cpan RPC::EPC::Service, cpan YAML, cpan -i DBD::ODBC (gcc4 error; edit Makefile and change “CC=gcc4″ to “gcc”; “LD=g++”)
  • Set up ssh server via Cygwin and open up port 22 in Windows Firewall; freeSSHd is also an alternative
  • Use Autopatcher to install download all necessary updates and install them all at once
  • Have the following shortcut in the startup folder in order to have a a terminal open up at startup with screen initiated: C:\cygwin\bin\mintty.exe -e screen
  • Install UniKey for typing in Vietnamese (place in ~/Documents/bin), 7-Zip for handling archive files, and Virtual CloneDrive for handling disk image files
  • Install CutePDF Writer (also download and install Ghostscript from CutePDF) for printing to PDF files
  • Bash shell in emacs via shell: add (setq shell-file-name "bash")

and (setq explicit-shell-file-name shell-file-name) to the emacs init file, and add the following to ~/.bashrc:

# http://stackoverflow.com/questions/9670209/cygwin-bash-does-not-display-correctly-in-emacs-shell
if [ "$EMACS" == "t" -a "$OSTYPE" == "cygwin" ]; then
    PS1='\u@\h \w\n$ '
fi

If Dropbox cannot be installed then symlink my ~/.emacs.d directory (need to use mklink in order for symlink to work properly).

If the computer is a dual-boot with Linux installed first, then one can change the order of the bootloader to Windows by following these instructions.

This is a good post to review.

Find text or string in files of a certain type

One can use grep "mystring" myfile.ext to find the lines in myfile.ext containing mystring. One could also use grep "mystring" *.ext to find mystring in all files with extension ext. Similarly, one could use grep "mystring" /directory to search for mystring in all files in the directory. What if one wants to search for mystring in all *.ext files in a certain path /directory? Most posts online would suggest something along the line of

find /directory -type -f -name "*.ext" | xargs grep "mystring"

However, the comments of this post shows how one could do it with grep:

grep -r --include=*.ext "mystring" /directory

Guide to accessing MS SQL Server and MySQL server on Mac OS X

Native GUI client access to MS-SQL and MySQL

We can use Oracle SQL Developer with the jTDS driver to access Microsoft SQL Server. Note: jTDS version 1.3.0 did not work for me; I had to use version 1.2.6. Detailed instructions can be found here.

We can use MySQL Workbench to access MySQL server. Setup is intuitively obvious.

Overview of ODBC on Mac OS X

Mac OS X has iODBC installed as it’s default ODBC manager. Most other Linux/UNIX system uses unixODBC to manage the ODBC drivers. This is the main reason why there’s so much confusion on getting ODBC to work on Mac OS X.

ODBC is kind of like an API for any software to access any DBMS easily, regardless of what DBMS it is and what OS it’s running on. Different software (e.g., R or Python) can utilize ODBC to access different DBMS through the following logic: Software -> ODBC Manager -> ODBC Driver for the DBMS -> DBMS Server (Software: R, Python, etc.; DBMS: MySQL, MS-SQL, etc.).

It doesn’t matter whether you use iODBC or unixODBC. Whichever one you use, just make sure the DBMS Driver and software you are using are configured/compiled to use with the same ODBC manager (usually set through the configure flags). For example, the R package RODBC and Python package pyodbc are compiled by default to use iODBC on Mac OS X. The DBMS drivers used must be compiled for use with iODBC. For iODBC, one could add data source names (DSN’s) at ~/Library/ODBC/odbc.ini. For unixODBC, one could add DSN’s at ~/.odbc.ini.

My current setup utilizes iODBC. I will outline the instructions for setting up MySQL and freeTDS (MS-SQL) drivers for use with RODBC and pyodbc through iODBC.

MySQL and FreeTDS with iODBC on Mac OS X

Install the MySQL Connector/ODBC driver. Driver should be at /usr/local/lib/libmyodbc5.so or /usr/local/lib/libmyodbc5w.so. Note: I’m unable to compile the driver from source on Mac OS X.

FreeTDS is an open source ODBC driver to access MS SQL Server. Install via Home Brew:

## install homebrew
ruby -e "$(curl -fsSL https://raw.github.com/mxcl/homebrew/go)"

## install freetds
brew install freetds

Driver should be at /usr/local/lib/libtdsodbc.so (symbolic linked).

Create ~/Library/ODBC/odbc.ini:

[sqlserver01]
Driver=/usr/local/lib/libtdsodbc.so
TDS_Version=4.2
Server=ip.address
Port = 1433
Trace = Yes
Description=my description
# Database=
# can't specify username and password for freetds

[mysql01]
Driver=/usr/local/lib/libmyodbc5.so
Server=hostname
Port=3306
charset=UTF8 
User=username
Password=password
# Database=
## can specify an actual database to each DSN

Install pyodbc via sudo pip install pyodbc. Test connections in python:

import pyodbc as p

con1 = p.connect("DSN=sqlserver01;UID=username;PWD=password")
con1.execute("select name from master..sysdatabases").fetchall()

con2 = p.connect("DSN=mysql01;UID=username;PWD=password")
con2.execute("show databases;").fetchall()

Install R using the installer. Install RODBC in the R interpreter via install.packages("RODBC"). Test connections in R:

library(RODBC)

ch1 <- odbcConnect(dsn="sqlserver01", uid="username", pwd="password")
odbcQuery(ch1, "select name from master..sysdatabases")
odbcFetchRows(ch1)

ch2 <- odbcConnect(dsn="mysql01", uid="username", pwd="password")
odbcQuery(ch2, "show databases;")
odbcFetchRows(ch2)

More on unixODBC on Mac OS X

If one wants to use unixODBC on Mac OS X instead, note the following:

  • First install unixODBC via Homebrew with brew install unixodbc.
  • Compile R from source to have it work with unixODBC (R binaries from the installer uses iODBC by default).
  • Can choose --with-odbc-manager=odbc when compiling RODBC.
  • When compiling freeTDS, include the argument with-unixodbc (pass to Homebrew or when compiling manually).
  • I’m unable to compile the MySQL Connector driver on Mac OS X from source (Homebrew or manually). Thus, it won’t work with unixODBC. I believe I tried unixODBC and MySQL Connector from macports, and those work.
  • pyodbc only works with iODBC on Mac OS X (inspect setup file). Currently I can’t get pyodbc to work with unixODBC on Mac OS X.

More differences between unixODBC and iODBC

unixODBC comes with the isql command to access different DBMS from the command line interpreter. iODBC comes with the iodbctest and iodbctestw commands. The command isql works for me on Mac OS X when I set freeTDS up to work with unixODBC (e.g., accessing MS SQL Server). I couldn’t access MySQL server because the MySQL Connector driver was compiled for use with iODBC.

If I use iODBC, I get the following for trying to access a MySQL server:

$ iodbctestw "DSN=sqlserver01;UID=username;PWD=password"
iODBC Unicode Demonstration program
This program shows an interactive SQL processor
Driver Manager: 03.52.0607.1008
1: SQLDriverConnectW = [MySQL][ODBC 5.1 Driver]Prompting is not supported on this platform. Please provide all required connect information. (0) SQLSTATE=HY000
1: ODBC_Connect = [MySQL][ODBC 5.1 Driver]Prompting is not supported on this platform. Please provide all required connect information. (0) SQLSTATE=HY000

When I try to access SQL Server, I get

$ iodbctestw "DSN=sqlserver01;UID=username;PWD=password"
iODBC Unicode Demonstration program
This program shows an interactive SQL processor
Driver Manager: 03.52.0607.1008
1: SQLDriverConnectW = [FreeTDS][SQL Server]Login failed for user 'username'. (18456) SQLSTATE=42000
2: SQLDriverConnectW = [FreeTDS][SQL Server]Unable to connect to data source (0) SQLSTATE=08001
1: ODBC_Connect = [FreeTDS][SQL Server]Login failed for user 'username'. (18456) SQLSTATE=42000
2: ODBC_Connect = [FreeTDS][SQL Server]Unable to connect to data source (0) SQLSTATE=08001

Don’t know why that is so. I guess it’s not too important to use an interactive interpreter. What matter is that the driver works with R and Python. Perhaps I should consider sqsh or do more searching.

Avoid data truncation in SAS when importing CSV files

SAS’s Proc Import is great for importing a CSV or other delimited files:things just “work” most of the time. We don’t need to specify variable names, variable type, etc. However, data truncation or mis-matched variable type can happen as the procedure determines the data type and length of the variables based on the first few rows of the delimited file.

As this post suggests, one could use the guessingrows=32767; statement in Proc Import so SAS uses the first 32k rows to determine data type and length.

Alternatively, the safer solution would be to import the delimited file by using the Data step and explicitly use the length statement with a long length option to ensure that no truncation occurs (e.g., length my_var $100). One would also need to specify the data type with the input statement here as well. Note: Do not specify the variable length using the input statement here because SAS might read in characters from other fields as it starts reading from the last delimiter all the way to the character length.

SAS Proc SQL Group By returns multiple rows per group

Just wanted to note that for traditional SQL implementations (e.g., MySQL, MS-SQL), the Group By statement used to aggregate a variable by certain variable(s) returns 1 row for each group. When a column that is not unique within a group is also selected, then the row that’s returned is determined somehow by the DB software.

In contrast, SAS’s Proc SQL will return multiple rows for each group (the number of original rows), with the aggregated variable repeated for each row within a group. Here’s an example:

data foo ;
    infile datalines dlm=" " ;
    input name $ week $ sales ;
    datalines ;
bob 1 20000
bob 2 30000
jane 1 40000
jane 2 50000
mike 1 60000
mike 2 70000
kevin 1 80000
kevin 2 90000
    ;
run ;

proc sql ;
    create table foo_agg as
        select
        a.name
        , a.week
        , sum(a.sales) as total_sales
        from foo as a
        group by name
        ;
    quit ;
run ;

proc export data=foo_agg
    outfile="foo_agg.csv"
    DBMS=csv
    REPLACE
    ;
run ;

The content of foo_agg.csv looks like

name,week,total_sales
bob,2,50000
bob,1,50000
jane,1,90000
jane,2,90000
kevin,1,170000
kevin,2,170000
mike,2,130000
mike,1,130000

An analogous return from the SQL code in MySQL or MS-SQL might look something like

name,week,total_sales
bob,2,50000
jane,1,90000
kevin,1,170000
mike,2,130000

In SAS’s Proc SQL, one would need to use the Select Distinct statement in order to remove the duplicate rows.

Note that when combining the Group By statement with a Join, these multiple records per group still hold.

SAS’s implementation is not necessarily bad as it gives the user’s more flexibility in returning an aggregated variable with every row without re-joining the aggregated table with the original table. The user just has to remember this behavior ;).

Execute shell commands with an asterisk in SAS

I wanted to use %sysexec to execute a shell command with an asterisk (shell wildcard for globbing) in a SAS program:

%sysexec cp /tmp/foo/*.txt /tmp/bar ;

However, it wasn’t giving me the desired results, probably due to the /* characters as they begin a commented section in a SAS program. Also tried escaping the asterisk with \* and surrounding the shell command with quotes but I didn’t get any luck. Emailed the SAS-L community for help and discovered the x and call system statements in SAS. The following works:

x "cp /tmp/foo/*.txt /tmp/bar" ;
/* or */
data _null_ ;
  call system("cp /tmp/foo/*.txt /tmp/bar") ;
run ;

More information on executing shell commands in a SAS program can be found here.