## Delimited file where delimiter clashes with data values

A comma-separated values (CSV) file is a typical way to store tabular/rectangular data. If a data cell contains a comma, then the cell is typically wrapped in quotes. However, what if a data cell contains both a comma and a quotation mark? To avoid such scenarios, it is typically wise to use a delimiter that has a low chance of showing up in your data, such as the pipe (“|”) or caret (“^”) character. However, there are cases when the data is a long string with all sorts of characters, including the pipe and caret. What then should the delimiter be in order to avoid a delimiter collision? As the Wikipedia article suggests, using special ASCII characters such as the unit/field separator (hex: 1F) could help, as they probably won’t be in your data (there is no keyboard key that corresponds to them!).

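For example, here is a minimal Python sketch (the file name is made up) that round-trips data containing commas, quotes, and pipes by using the unit separator as the delimiter:

```python
import csv

rows = [["id", "note"], ["1", 'contains , " and | happily']]

# write with the unit separator (hex 1F) as the delimiter
with open("data.usv", "w", newline="") as f:
    csv.writer(f, delimiter="\x1f").writerows(rows)

# read it back; the messy field survives intact
with open("data.usv", newline="") as f:
    parsed = list(csv.reader(f, delimiter="\x1f"))

print(parsed == rows)  # True
```

The csv module accepts any single character as the delimiter, so nothing special is needed beyond naming it.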
Currently, my rule of thumb is to use the pipe as the default delimiter. If the data contains complicated strings, then I’ll default to the field separator character. In Python, one could refer to the field separator as ‘\x1f’. In R, one could refer to it as ‘\x1f’. In SAS, it could be specified as ‘1F’x. In bash, the character could be specified on the command line (e.g., for the cut command, csvlook command, etc.) by passing $’\x1f’ as the delimiter character. If the file contains the newline character inside a data cell (\n), then the record separator character (hex: 1E) could be used to delimit records.

## Automatically capitalize or uppercase or expand keywords in Emacs using Abbrev Mode

I like that SQL mode in Emacs comes with an interactive mode where I could send a query from a buffer to a client buffer, similar to how I could execute R code using ESS. However, I don’t think SQL mode is that great at formatting SQL code (eg, indenting). I guess I could live with manual indenting and listing selected columns on multiple lines (preceded by a comma). I typically write code in lower case, but I think the SQL convention is to use upper case for keywords like SELECT, FROM, WHERE, etc. This can be done using Abbrev Mode in Emacs. Add the following to your init file:

;; stop asking whether to save newly added abbrevs when quitting emacs
(setq save-abbrevs nil)
;; turn on abbrev mode globally
(setq-default abbrev-mode t)

Now, open a SQL file (/tmp/test.sql). Type SELECT, then C-x a l, and type select. This saves the abbreviation for the current major mode (SQL mode). Now, when you type select followed by <space>, the keyword will be uppercased. Continue doing the same for other keywords. Then use the write-abbrev-file command to save the abbreviations to ~/.emacs.d/abbrev_defs so they can be reused in future Emacs sessions. To define many keywords all at once, edit abbrev_defs directly.
For example, I used this list of SQL keywords and relied on Emacs macros to add them to my abbrev_defs file. My abbreviation table for SQL mode is as follows:

(define-abbrev-table 'sql-mode-abbrev-table
  (mapcar #'(lambda (v) (list v (upcase v) nil 1))
          '("absolute" "action" "add" "after" "all" "allocate" "alter" "and" "any" "are"
            "array" "as" "asc" "asensitive" "assertion" "asymmetric" "at" "atomic"
            "authorization" "avg" "before" "begin" "between" "bigint" "binary" "bit"
            "bitlength" "blob" "boolean" "both" "breadth" "by" "call" "called" "cascade"
            "cascaded" "case" "cast" "catalog" "char" "char_length" "character"
            "character_length" "check" "clob" "close" "coalesce" "collate" "collation"
            "column" "commit" "condition" "connect" "connection" "constraint"
            "constraints" "constructor" "contains" "continue" "convert" "corresponding"
            "count" "create" "cross" "cube" "current" "current_date"
            "current_default_transform_group" "current_path" "current_role"
            "current_time" "current_timestamp" "current_transform_group_for_type"
            "current_user" "cursor" "cycle" "data" "date" "day" "deallocate" "dec"
            "decimal" "declare" "default" "deferrable" "deferred" "delete" "depth"
            "deref" "desc" "describe" "descriptor" "deterministic" "diagnostics"
            "disconnect" "distinct" "do" "domain" "double" "drop" "dynamic" "each"
            "element" "else" "elseif" "end" "equals" "escape" "except" "exception"
            "exec" "execute" "exists" "exit" "external" "extract" "false" "fetch"
            "filter" "first" "float" "for" "foreign" "found" "free" "from" "full"
            "function" "general" "get" "global" "go" "goto" "grant" "group" "grouping"
            "handler" "having" "hold" "hour" "identity" "if" "immediate" "in"
            "indicator" "initially" "inner" "inout" "input" "insensitive" "insert"
            "int" "integer" "intersect" "interval" "into" "is" "isolation" "iterate"
            "join" "key" "language" "large" "last" "lateral" "leading" "leave" "left"
            "level" "like" "local" "localtime" "localtimestamp" "locator" "loop"
            "lower" "map" "match" "member" "merge" "method" "min" "minute" "modifies"
            "module" "month" "multiset" "names" "national" "natural" "nchar" "nclob"
            "new" "next" "no" "none" "not" "null" "nullif" "numeric" "object"
            "octet_length" "of" "old" "on" "only" "open" "option" "or" "order"
            "ordinality" "out" "outer" "output" "over" "overlaps" "pad" "parameter"
            "partial" "partition" "path" "position" "precision" "prepare" "preserve"
            "primary" "prior" "privileges" "procedure" "public" "range" "read" "reads"
            "real" "recursive" "ref" "references" "referencing" "relative" "release"
            "repeat" "resignal" "restrict" "result" "return" "returns" "revoke"
            "right" "role" "rollback" "rollup" "routine" "row" "rows" "savepoint"
            "schema" "scope" "scroll" "search" "second" "section" "select" "sensitive"
            "session" "session_user" "set" "sets" "signal" "similar" "size" "smallint"
            "some" "space" "specific" "specifictype" "sql" "sqlcode" "sqlerror"
            "sqlexception" "sqlstate" "sqlwarning" "start" "state" "static"
            "submultiset" "substring" "sum" "symmetric" "system" "system_user" "table"
            "tablesample" "temporary" "then" "time" "timestamp" "timezone_hour"
            "timezone_minute" "to" "trailing" "transaction" "translate" "translation"
            "treat" "trigger" "trim" "true" "under" "undo" "union" "unique" "unknown"
            "unnest" "until" "update" "upper" "usage" "user" "using" "value" "values"
            "varchar" "varying" "view" "when" "whenever" "where" "while" "window"
            "with" "within" "without" "work" "write" "year" "zone")))

## Raspberry Pi Emulation Game Station

This is a cool use of a Raspberry Pi!

## Best practices for importing a delimited file in SAS using the DATA step

The easiest way to import a delimited file (e.g., CSV) in SAS is to use PROC IMPORT:

proc import datafile="/path/to/my_file.txt"
  out=work.my_data
  dbms=dlm
  replace ;
  delimiter="|" ;
  guessingrows=32000 ;
run ;

PROC IMPORT isn’t a viable option when the fileref used in the datafile argument is not of the DISK type.
For example, the fileref my_pipe would not work in the following example,

filename my_pipe pipe "gunzip -c my_file.txt.gz" ;

because SAS needs “random access” to the fileref (i.e., to determine the variable types). PROC IMPORT also isn’t suitable when you have a very large data set where one of the columns might contain an element with a very long length (and this might occur after the number of rows specified by guessingrows). Based on my experience, one should use the truncover and missover (don’t go to the next line if the current line ends early), dsd (allow empty fields), and lrecl (make this big for long lines; defaults to 256, which means your lines will be truncated if they are longer than 256 characters) options in the infile statement to avoid unnecessary errors. Since the infile is delimited, it is easy to import the fields using the list input method. However, one should use the length statement to declare the maximum length for each character variable, and use the informat statement for numeric variables that have special formats (date, dollar amount, etc.). I usually forget and just declare the informats following the variables in the input statement, which only works when we are inputting using the input pointer method (e.g., @27 my_var date9.). Here is an example:

filename my_pipe pipe "gunzip -c my_file.txt.gz" ;
data my_data ;
  infile my_pipe dlm="|" dsd truncover missover lrecl=50000 ;
  length x2 $50 x3 $25 ;
  informat x4 date9. ;
  format x4 date9. ;
  input x1 x2 $ x3 $ x4 ;
run ;

## Make Windows like Linux

I’m back to a job that only allows Windows on our laptops and desktops. Here’s how I configured my workstation to be more Linux-like in order to increase my productivity:

• Install Google Chrome and Firefox
• Install an antivirus or security suite (Norton or McAfee?); a free one is Avast
• Map my Caps Lock key to Control; if admin access is not available, then use AutoHotkey by creating caps_to_control.ahk with Capslock::Control and creating a Startup shortcut
• Download Cygwin and install the following: xinit (X server), python (2 and 3), gcc-*, openssh, screen, rsync, python-setuptools (easy_install), git, subversion, xwinclipboard, procps (top command + others), aspell (I use flyspell in emacs), aspell-en, make, zip, unzip, patch, wget, perl, perl-dbi, libcrypt-devel (for perl DBD::ODBC), automake, autoconf, email (modify /etc/email/email.conf and enter the correct server and credential information; sendmail is not needed for sending outbound emails, only for internal emails)
• Install the emacs binaries to C:\Documents and Settings\my_username\bin\emacs-ver_num (Windows XP) or C:\Users\my_username\bin\emacs-ver_num (Windows 7) and copy the relevant image library dll files into the emacs binary directory in order for doc-view to work properly (eg, emacs 24.3 needs libpng14-14.dll; see Vincent Goulet’s emacs distro to see which dll files are needed)
• One could also use the emacs w32 package provided by Cygwin, but ESS doesn’t seem to work because that version of emacs is not compiled with the function w32-short-file-name that ESS needs; a tentative solution can be found here, but it’s probably better to use the compiled emacs binaries available on GNU
• Set the HOME environment variable to C:\Documents and Settings\my_username or C:\Users\my_username; set LC_ALL and LANG to en_US.utf-8 (for a perl dbi error); set CYGWIN=nodosfilewarning
• Edit environment variables by running the following in the command prompt:
rundll32 sysdm.cpl,EditEnvironmentVariables. Add the following to the PATH environment variable: path_to_R;path_to_JRE;C:\Documents and Settings\my_username\My Documents\bin;C:\Documents and Settings\my_username\My Documents\bin\emacs-24.3\bin;C:\cygwin\bin. If we cannot edit System variables, then edit the user’s variables (eg, PATH to be path1;path2;%PATH%). If the settings aren’t saved for future sessions (eg, in Citrix), then create a symbolic link from /home/user_id to the desired home (eg, C:/Users/user_id), and add the following to ~/.bashrc: export PATH=/cygdrive/f/R/R-3.1.1/bin/x64:/cygdrive/f/bin/emacs-24.3/bin:/cygdrive/f/bin:/cygdrive/f/bin/cygwin64/bin:$PATH and export JAVA_HOME=F:/bin/jre7.
• Add the following to ~/.bash_profile: . ~/.bashrc
• Fix the carriage return issue (only for Windows XP)
• Run touch ~/.startxwinrc to prevent xterm from launching whenever X server is started.
• Install R using the Windows installer (works with Emacs ESS); install R studio
• Get sshfs to work on Windows to mount my servers
• Install Dropbox and symlink my .bashrc and .screenrc files
• Get tramp in emacs to work properly in order to visit remote servers easily in emacs: first get the latest copy of tramp, then configure and byte-compile the code (make) per the installation instructions. Add plink, pageant, and all putty-related binaries into the PATH (~/bin). After creating an ssh key, use puttygen to convert id_rsa to id_rsa.ppk. Create a shortcut at startup that launches pageant /path/to/id_rsa.ppk. Then in emacs, one could access remote files using tramp via plink: /plink:username@host:/path/to/dir.
• Copy X Windows shortcut to the Startup folder to automatically start it
• Add export DISPLAY=:0.0 to ~/.bashrc
• Python packages: numpy, pandas, csvkit (easy_install works but pip install does not), jedi, epc, pyodbc, mysqldb, pymssql, psycopg; ddlgenerator (Python 3).
• R packages: RODBC, RMySQL, RPostgreSQL, ggplot2, plyr, glmnet
• Perl packages (for edbi in emacs): cpan RPC::EPC::Service, cpan YAML, cpan -i DBD::ODBC (for the gcc4 error, edit the Makefile and change “CC=gcc4” to “CC=gcc” and set “LD=g++”)
• Set up ssh server via Cygwin and open up port 22 in Windows Firewall; freeSSHd is also an alternative
• On Windows 8, the user might not be able to change group permissions (eg, can’t ssh using keys because the key is “too open”); fix by changing files/directories group to ‘User’
• Use Autopatcher to download all necessary updates and install them all at once
• Have the following shortcut in the Startup folder in order to have a terminal open up at startup with screen initiated: C:\cygwin\bin\mintty.exe -e screen -s bash; in the shortcut, specify the home directory as the ‘Start In’ path
• Install UniKey for typing in Vietnamese (place in ~/Documents/bin), 7-Zip for handling archive files, and Virtual CloneDrive for handling disk image files
• Install CutePDF Writer (also download and install Ghostscript from CutePDF) for printing to PDF files
• Install Java Runtime Environment (JRE); if admin privileges aren’t available, then extract the files manually into ~/Documents/bin/jre/
• Other tools per Lifehacker: VLC, PDF-XChange, Foxit PDF Viewer
• Configure sshd using openssh, make sure it starts at startup (Start > Run > Services; look for CYGWIN sshd), and allow /usr/sbin/sshd to pass through the Firewall; one could alternatively use freeSSHd
• Bash shell in emacs via shell: add (setq shell-file-name "bash") and (setq explicit-shell-file-name shell-file-name) to the emacs init file, and add the following to ~/.bashrc:

# http://stackoverflow.com/questions/9670209/cygwin-bash-does-not-display-correctly-in-emacs-shell
if [ "$EMACS" == "t" -a "$OSTYPE" == "cygwin" ]; then
PS1='\u@\h \w\n$'
fi

If Dropbox cannot be installed, then symlink my ~/.emacs.d directory (need to use mklink in order for the symlink to work properly). If the computer is a dual-boot with Linux installed first, then one can change the order of the bootloader to Windows by following these instructions. This is a good post to review.

## Find text or string in files of a certain type

One can use grep "mystring" myfile.ext to find the lines in myfile.ext containing mystring. One could also use grep "mystring" *.ext to find mystring in all files with extension ext. Similarly, one could use grep -r "mystring" /directory to search for mystring in all files in the directory. What if one wants to search for mystring in all *.ext files in a certain path /directory? Most posts online would suggest something along the lines of

find /directory -type f -name "*.ext" | xargs grep "mystring"

However, the comments of this post show how one could do it with grep alone:

grep -r --include=*.ext "mystring" /directory

## Change default or primary group for a user in Linux or Unix

Do the following:

## see which groups you belong to
groups
## safely change the primary group while keeping the supplementary groups
usermod -g NewPrimaryGroup -G oldgroup1,oldgroup2,oldgroup3 my.username

## Guide to accessing MS SQL Server and MySQL server on Mac OS X

## Native GUI client access to MS-SQL and MySQL

We can use Oracle SQL Developer with the jTDS driver to access Microsoft SQL Server. Note: jTDS version 1.3.0 did not work for me; I had to use version 1.2.6. Detailed instructions can be found here. We can use MySQL Workbench to access MySQL server. Setup is intuitively obvious.

## Overview of ODBC on Mac OS X

Mac OS X has iODBC installed as its default ODBC manager. Most other Linux/UNIX systems use unixODBC to manage the ODBC drivers.
This is the main reason why there’s so much confusion about getting ODBC to work on Mac OS X. ODBC is like an API that lets any software access any DBMS easily, regardless of what DBMS it is and what OS it’s running on. Different software (e.g., R or Python) can utilize ODBC to access different DBMS’s through the following chain: Software -> ODBC Manager -> ODBC Driver for the DBMS -> DBMS Server (Software: R, Python, etc.; DBMS: MySQL, MS-SQL, etc.). It doesn’t matter whether you use iODBC or unixODBC; whichever one you use, just make sure the DBMS driver and the software are configured/compiled against the same ODBC manager (usually set through configure flags). For example, the R package RODBC and the Python package pyodbc are compiled by default to use iODBC on Mac OS X, so the DBMS drivers used must also be compiled for use with iODBC. For iODBC, one could add data source names (DSN’s) in ~/Library/ODBC/odbc.ini; for unixODBC, in ~/.odbc.ini. My current setup utilizes iODBC. I will outline the instructions for setting up the MySQL and FreeTDS (MS-SQL) drivers for use with RODBC and pyodbc through iODBC.

## MySQL and FreeTDS with iODBC on Mac OS X

Install the MySQL Connector/ODBC driver. The driver should be at /usr/local/lib/libmyodbc5.so or /usr/local/lib/libmyodbc5w.so. Note: I’m unable to compile the driver from source on Mac OS X. FreeTDS is an open source ODBC driver for accessing MS SQL Server. Install it via Homebrew:

## install homebrew
ruby -e "$(curl -fsSL https://raw.github.com/mxcl/homebrew/go)"

## install freetds
brew install freetds


The driver should be at /usr/local/lib/libtdsodbc.so (symlinked).

Create ~/Library/ODBC/odbc.ini:

[sqlserver01]
Driver=/usr/local/lib/libtdsodbc.so
TDS_Version=4.2
Port = 1433
Trace = Yes
Description=my description
# Database=

[mysql01]
Driver=/usr/local/lib/libmyodbc5.so
Server=hostname
Port=3306
charset=UTF8
# Database=
## can specify an actual database to each DSN


Install pyodbc via sudo pip install pyodbc. Test connections in python:

import pyodbc as p

con1 = p.connect("DSN=sqlserver01;UID=username;PWD=password")
con1.execute("select name from master..sysdatabases").fetchall()

con2 = p.connect("DSN=mysql01;UID=username;PWD=password")
con2.execute("show databases;").fetchall()


Install R using the installer. Install RODBC in the R interpreter via install.packages("RODBC"). Test connections in R:

library(RODBC)

ch1 <- odbcConnect("sqlserver01", uid="username", pwd="password")
odbcQuery(ch1, "select name from master..sysdatabases")
odbcFetchRows(ch1)

ch2 <- odbcConnect("mysql01", uid="username", pwd="password")
odbcQuery(ch2, "show databases;")
odbcFetchRows(ch2)


## More on unixODBC on Mac OS X

If one wants to use unixODBC on Mac OS X instead, note the following:

• First install unixODBC via Homebrew with brew install unixodbc.
• Compile R from source to have it work with unixODBC (the R binaries from the installer use iODBC by default).
• Can choose --with-odbc-manager=odbc when compiling RODBC.
• When compiling freeTDS, include the argument --with-unixodbc (pass it to Homebrew or when compiling manually).
• I’m unable to compile the MySQL Connector driver on Mac OS X from source (Homebrew or manually). Thus, it won’t work with unixODBC. I believe I tried unixODBC and MySQL Connector from macports, and those work.
• pyodbc only works with iODBC on Mac OS X (inspect setup file). Currently I can’t get pyodbc to work with unixODBC on Mac OS X.

## More differences between unixODBC and iODBC

unixODBC comes with the isql command to access different DBMS from the command line interpreter. iODBC comes with the iodbctest and iodbctestw commands. The command isql works for me on Mac OS X when I set freeTDS up to work with unixODBC (e.g., accessing MS SQL Server). I couldn’t access MySQL server because the MySQL Connector driver was compiled for use with iODBC.

If I use iODBC, I get the following for trying to access a MySQL server:

$ iodbctestw "DSN=mysql01;UID=username;PWD=password"
iODBC Unicode Demonstration program
This program shows an interactive SQL processor
Driver Manager: 03.52.0607.1008
1: SQLDriverConnectW = [MySQL][ODBC 5.1 Driver]Prompting is not supported on this platform. Please provide all required connect information. (0) SQLSTATE=HY000
1: ODBC_Connect = [MySQL][ODBC 5.1 Driver]Prompting is not supported on this platform. Please provide all required connect information. (0) SQLSTATE=HY000

When I try to access SQL Server, I get

$ iodbctestw "DSN=sqlserver01;UID=username;PWD=password"
iODBC Unicode Demonstration program
This program shows an interactive SQL processor
Driver Manager: 03.52.0607.1008
1: SQLDriverConnectW = [FreeTDS][SQL Server]Login failed for user 'username'. (18456) SQLSTATE=42000
2: SQLDriverConnectW = [FreeTDS][SQL Server]Unable to connect to data source (0) SQLSTATE=08001
1: ODBC_Connect = [FreeTDS][SQL Server]Login failed for user 'username'. (18456) SQLSTATE=42000
2: ODBC_Connect = [FreeTDS][SQL Server]Unable to connect to data source (0) SQLSTATE=08001


I don’t know why that is. I guess it’s not too important to use an interactive interpreter; what matters is that the driver works with R and Python. Perhaps I should consider sqsh or do more searching.

## Avoid data truncation in SAS when importing CSV files

SAS’s Proc Import is great for importing a CSV or other delimited file: things just “work” most of the time. We don’t need to specify variable names, variable types, etc. However, data truncation or mismatched variable types can happen because the procedure determines the data type and length of the variables based on the first few rows of the delimited file.

As this post suggests, one could use the guessingrows=32767; statement in Proc Import so SAS uses the first 32k rows to determine data type and length.
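As an aside, the same pitfall exists outside SAS. A quick pandas sketch (assuming pandas is available; the column names are made up) that sidesteps type guessing entirely by reading every column as a string:

```python
import io

import pandas as pd

# sample delimited data where type inference would strip the leading zeros
raw = io.StringIO("id|code\n1|00042\n2|A13\n")

# dtype=str reads every column as text: no inference, no lost characters
df = pd.read_csv(raw, sep="|", dtype=str)

print(df["code"].tolist())  # ['00042', 'A13']
```

One can then convert individual columns afterward, once the intended types are known.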

Alternatively, the safer solution would be to import the delimited file using the Data step and explicitly use the length statement with a generous length to ensure that no truncation occurs (e.g., length my_var $100). One would also need to specify the data type with the input statement here as well. Note: do not specify the variable length in the input statement here, because SAS might read in characters from other fields as it starts reading from the last delimiter all the way to the specified character length.

## SAS Proc SQL Group By returns multiple rows per group

Just wanted to note that for traditional SQL implementations (e.g., MySQL, MS-SQL), the Group By statement used to aggregate a variable by certain variable(s) returns 1 row for each group. When a column that is not unique within a group is also selected, the row that’s returned is determined somehow by the DB software. In contrast, SAS’s Proc SQL will return multiple rows for each group (the number of original rows), with the aggregated variable repeated for each row within a group. Here’s an example:

data foo ;
  infile datalines dlm=" " ;
  input name $ week $ sales ;
datalines ;
bob 1 20000
bob 2 30000
jane 1 40000
jane 2 50000
mike 1 60000
mike 2 70000
kevin 1 80000
kevin 2 90000
;
run ;

proc sql ;
  create table foo_agg as
  select a.name
    , a.week
    , sum(a.sales) as total_sales
  from foo as a
  group by name
  ;
quit ;
run ;

proc export data=foo_agg outfile="foo_agg.csv" dbms=csv replace ; run ;

The content of foo_agg.csv looks like

name,week,total_sales
bob,2,50000
bob,1,50000
jane,1,90000
jane,2,90000
kevin,1,170000
kevin,2,170000
mike,2,130000
mike,1,130000

An analogous return from the SQL code in MySQL or MS-SQL might look something like

name,week,total_sales
bob,2,50000
jane,1,90000
kevin,1,170000
mike,2,130000


In SAS’s Proc SQL, one would need to use the Select Distinct statement in order to remove the duplicate rows.

Note that when combining the Group By statement with a Join, these multiple records per group still hold.

SAS’s implementation is not necessarily bad, as it gives the user more flexibility in returning an aggregated variable with every row without re-joining the aggregated table to the original table. The user just has to remember this behavior ;).
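For what it’s worth, SAS’s behavior here resembles pandas’ groupby transform rather than a plain aggregation; a sketch (assuming pandas) with a subset of the same toy data:

```python
import pandas as pd

foo = pd.DataFrame({
    "name": ["bob", "bob", "jane", "jane"],
    "week": [1, 2, 1, 2],
    "sales": [20000, 30000, 40000, 50000],
})

# like SAS Proc SQL's group by: one row per original row,
# with the group total repeated within each group
foo["total_sales"] = foo.groupby("name")["sales"].transform("sum")

# like traditional SQL's group by: one row per group
agg = foo.groupby("name", as_index=False)["sales"].sum()

print(foo["total_sales"].tolist())  # [50000, 50000, 90000, 90000]
```

Here transform broadcasts the aggregate back to every row, which is exactly the "aggregated variable with every row" convenience described above.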