Automatically capitalize or uppercase or expand keywords in Emacs using Abbrev Mode

I like that SQL Mode in Emacs comes with an interactive mode that I could execute a query in a buffer to a client buffer similar to how I could execute R code using ESS. However, I don’t think SQL mode is that great at formatting SQL code (eg, indenting). I guess I could live with manual indenting and selecting in multiple lines (preceded by a comma).

I typically write code in lower cases, but I think the SQL convention is to use upper cases for keywords like SELECT, FROM, WHERE, etc. This can be done using Abbrev Mode in Emacs. Add the following to your init file:

;; stop asking whether to save newly added abbrev when quitting emacs
(setq save-abbrevs nil)
;; turn on abbrev mode globally
(setq-default abbrev-mode t)

Now, open a SQL file (/tmp/test.sql). Type SELECT, then C-x a l and type select. This saves the abbreviation for the current major mode (SQL mode). Now, when you type select then <space>, the keyword will be capitalized. Continue doing the same for other keywords. Now, use the write-abbrev-file command to save the abbreviations to ~/.emacs.d/abbrev_defs so it can be saved and usable in future Emacs sessions.

To define many keywords all at once, edit the abbrev_defs directly. For example, I used this list of SQL keywords and relied on Emacs macros to add them to my abbrev_defs file. My abbreviation table for SQL mode is as follows:

(define-abbrev-table 'sql-mode-abbrev-table
(mapcar #'(lambda (v) (list v (upcase v) nil 1))
'("absolute" "action" "add" "after" "all" "allocate" "alter" "and" "any" "are" "array" "as" "asc" "asensitive" "assertion" "asymmetric" "at" "atomic" "authorization" "avg" "before" "begin" "between" "bigint" "binary" "bit" "bitlength" "blob" "boolean" "both" "breadth" "by" "call" "called" "cascade" "cascaded" "case" "cast" "catalog" "char" "char_length" "character" "character_length" "check" "clob" "close" "coalesce" "collate" "collation" "column" "commit" "condition" "connect" "connection" "constraint" "constraints" "constructor" "contains" "continue" "convert" "corresponding" "count" "create" "cross" "cube" "current" "current_date" "current_default_transform_group" "current_path" "current_role" "current_time" "current_timestamp" "current_transform_group_for_type" "current_user" "cursor" "cycle" "data" "date" "day" "deallocate" "dec" "decimal" "declare" "default" "deferrable" "deferred" "delete" "depth" "deref" "desc" "describe" "descriptor" "deterministic" "diagnostics" "disconnect" "distinct" "do" "domain" "double" "drop" "dynamic" "each" "element" "else" "elseif" "end" "equals" "escape" "except" "exception" "exec" "execute" "exists" "exit" "external" "extract" "false" "fetch" "filter" "first" "float" "for" "foreign" "found" "free" "from" "full" "function" "general" "get" "global" "go" "goto" "grant" "group" "grouping" "handler" "having" "hold" "hour" "identity" "if" "immediate" "in" "indicator" "initially" "inner" "inout" "input" "insensitive" "insert" "int" "integer" "intersect" "interval" "into" "is" "isolation" "iterate" "join" "key" "language" "large" "last" "lateral" "leading" "leave" "left" "level" "like" "local" "localtime" "localtimestamp" "locator" "loop" "lower" "map" "match" "map" "member" "merge" "method" "min" "minute" "modifies" "module" "month" "multiset" "names" "national" "natural" "nchar" "nclob" "new" "next" "no" "none" "not" "null" "nullif" "numeric" "object" "octet_length" "of" "old" "on" "only" "open" "option" "or" "order" "ordinality" "out" "outer" "output" "over" "overlaps" "pad" "parameter" "partial" "partition" "path" "position" "precision" "prepare" "preserve" "primary" "prior" "privileges" "procedure" "public" "range" "read" "reads" "real" "recursive" "ref" "references" "referencing" "relative" "release" "repeat" "resignal" "restrict" "result" "return" "returns" "revoke" "right" "role" "rollback" "rollup" "routine" "row" "rows" "savepoint" "schema" "scope" "scroll" "search" "second" "section" "select" "sensitive" "session" "session_user" "set" "sets" "signal" "similar" "size" "smallint" "some" "space" "specific" "specifictype" "sql" "sqlcode" "sqlerror" "sqlexception" "sqlstate" "sqlwarning" "start" "state" "static" "submultiset" "substring" "sum" "symmetric" "system" "system_user" "table" "tablesample" "temporary" "then" "time" "timestamp" "timezone_hour" "timezone_minute" "to" "trailing" "transaction" "translate" "translation" "treat" "trigger" "trim" "true" "under" "undo" "union" "unique" "unknown" "unnest" "until" "update" "upper" "usage" "user" "using" "value" "values" "varchar" "varying" "view" "when" "whenever" "where" "while" "window" "with" "within" "without" "work" "write" "year" "zone")

Best practices for importing a delimited file in SAS using the DATA step

The easiest way to import a delimited file (e.g., CSV) in SAS is to use PROC IMPORT:

proc import datafile="/path/to/my_file.txt"
    delimiter="|" ;
    guessingrows=32000 ;
run ;

PROC IMPORT isn’t a viable option when the fileref used in the datafile argument is not of the DISK type. For example, the fileref my_pipe would not work in the following example,

filename my_pipe pipe "gunzip -c my_file.txt.gz" ;

because SAS needs “random access” to the fileref (i.e., to determine the variable type). PROC IMPORT also isn’t suitable when you have a very large data set where one of the columns might contain an element that has a very long length (and this might occur after the number of rows specified by guessingrows). Based on my experience, one should use the truncover, missover (don’t go to next line if line ends early), dsd (allow empty field) and lrecl (make this big for long lines; defaults to 256, which means your lines will be truncated if they are longer than 256 characters long) options in the infile statement to avoid unnecessary errors.

Since the infile is delimited, it is easy to import the fields using the list input method. However, one should use the length statement to declare the maximum length for each character variable, and use the informat statement for numeric variables that have special formats (date, dollar amount, etc.). I usually forget and just declare the informats following the variables in the input statement, which only works when we are inputting using the input pointer method (e.g., @27 my_var date9.). Here is an example:

filename my_pipe pipe "gunzip -c my_file.txt.gz" ;
data my_data ;
    infile my_file dlm="|" dsd truncover missover lrecl=50000 ;
        x2 $50
        x3 $25
        x4 date9.
        x4 date9.
        x2 $
        x3 $
run ;

Make Windows like Linux

I’m back to a job that only allows Windows on our laptops and desktops. Here’s how I configured my workstation to be more Linux-like in order to increase my productivity:

  • Install Google Chrome and Firefox
  • Install an antivirus or security suite (Norton or Mcafee?); a free one is Avast
  • Map my caps lock key to control; if Admin access is not available, then use AutoHotKey by creating caps_to_control.ahk with Capslock::Control, compiling it as an executable, and creating a shortcut in the Startup folder
  • Download Cygwin and install the following: xinit (X server), python (2 and 3), gcc-*, openssh, screen, rsync, python-setuptools (easy_install)), git, subversion, xwinclipboard, procps (top command + others), aspell (I use flyspell in emacs), aspell-en, make, zip, unzip, patch, wget, perl, perl-dbi, libcrypt-devel (for perl DBD:ODBC), automake, autoconf, ghostscript, email (modify /etc/email/email.conf and enter correct server and credential information; sendmail is not needed for sending outbound emails, only needed to send internal emails)
  • Install the emacs binaries to C:\Documents and Settings\my_username\bin\emacs-ver_num (Windows XP) or C:\Users\my_username\bin\emacs-ver-num (Windows 7+) and copy relevant image library dll files into the emacs bin directory in order for doc-view to work properly (eg, need libpng14-14.dll for emacs 24.3; Ghostscript (gs) is needed so install that via Cygwin)
    • See Vincent Goulet’s emacs distro to see the list of dll files are needed (different versions of emacs uses different versions of the libraries, so use the latest version of emacs and match up the files listed on Vincent’s site)
    • One could also use emacs w32 provided by cygwin, but ESS doesn’t seem to work because that version of emacs does not have the function w32-short-file-name compiled with it as needed by ESS; tentative solution can be found here, but it’s probably better to use the compiled emacs binaries available on GNU
  • Set the HOME environment variable to C:\Documents and Settings\my_username (Windows XP) or C:\Users\my_username (Windows 7+); set LC_ALL and LANG to en_US.utf-8 (for perl dbi error); set CYGWIN=nodosfilewarning
    • Within a Cygwin Terminal, create a soft link from $HOME/.ssh path to /home/username/.ssh (ssh in Cygwin uses the keys from this location)
  • Edit environment variables by running the following in the command prompt: rundll32 sysdm.cpl,EditEnvironmentVariables. Add the following to the PATH environment variable: path_to_R;path_to_JRE;C:\Documents and Settings\my_username\My Documents\bin;C:\Documents and Settings\my_username\My Documents\bin\emacs-24.3\bin;C:\cygwin\bin
    • If Admin rights are not available, then modify the user’s environment variables instead (eg, PATH to be path1;path2;%PATH%)
    • If the settings aren’t saved for future sessions (eg, in Citrix), then create a symbolic link from /home/user_id to the desired home (eg, C:/Users/user_id), and add the following to ~/.bashrc: export PATH=/cygdrive/f/R/R-3.1.1/bin/x64:/cygdrive/f/bin/emacs-24.3/bin:/cygdrive/f/bin:/cygdrive/f/bin/cygwin64/bin:$PATH and export JAVA_HOME=F:/bin/jre7.
  • Add the following to ~/.bash_profile: . ~/.bashrc
  • Add “export DISPLAY=:0.0″ to ~/.bashrc
  • On Windows XP, fix the carriage return issue by adding the following to the beginning of ~/.bashrc and ~/.bash_profile: (set -o igncr) 2>/dev/null && set -o igncr; # this comment is needed
  • Run touch ~/.startxwinrc to prevent xterm from launching whenever X server is started.
    • For recent version of Cygwin, run echo "exec sleep infinity" > ~/.startxwinrc and chmod +x ~/.startxwinrc
  • Copy XWin Server shortcut from the Cygwin folder in the Start Menu to the Startup folder to start the X Windows Server at startup
  • Install CutePDF Writer (also download and install Ghostscript from CutePDF) for printing to PDF files
  • Install R using the Windows installer (works with Emacs ESS)
  • Install Anacondas (Python that is optimized; includes my usual data/analytics stack)
    • Add the 3 Anacondas binary locations to PATH; need this for emacs’ elpy to work
    • emacs elpy dependency: pip install rope jedi flake8 importmagic autopep8 yapf
  • Get tramp in emacs to work properly in order to visit remote servers easily in emacs by first getting the latest copy of tramp, then configure and byte-compile the code (make) per the proper installation.
    • Add plink, pageant, and all putty-related binaries into the PATH (~/bin).
    • After creating an ssh key, use the putty kegen to convert id_rsa to id_rsa.ppk. Create a shortcut at startup that launches pageant /path/to/id_rsa.ppk. Then in emacs, one could access remote files using tramp via plink: /plink:username@host:/path/to/dir.
  • Get sshfs to work on Windows to mount my servers
  • Install Dropbox and symlink my .bashrc and .screenrc files
  • Python packages: numpy, pandas, csvkit (easy_install works but pip install does not), jedi, epc, pyodbc, mysqldb, pymssql, psycopg; ddlgenerator (Python 3).
  • R packages: RODBC, RMySQL, RPostgreSQL, ggplot2, plyr, glmnet
  • Perl packages (for edbi in emacs): cpan RPC::EPC::Service, cpan YAML, cpan -i DBD::ODBC (gcc4 error; edit Makefile and change “CC=gcc4″ to “gcc”; “LD=g++”)
  • Set up ssh server via Cygwin and open up port 22 in Windows Firewall; freeSSHd is also an alternative
  • On Windows 8, the user might not be able to change group permissions (eg, can’t ssh using keys because the key is “too open”); fix by changing files/directories group to ‘User’
  • Use Autopatcher to install download all necessary updates and install them all at once
  • Have the following shortcut in the startup folder in order to have a a terminal open up at startup with screen initiated: C:\cygwin\bin\mintty.exe -e screen -s bash; in the shortcut, specify the home directory as the ‘Start In’ path
  • Install UniKey for typing in Vietnamese (place in ~/Documents/bin), 7-Zip for handling archive files, and Virtual CloneDrive for handling disk image files
  • Install Java Runtime Environment (JRE); if admin privileges aren’t available, then extract the files manually into ~/Documents/bin/jre/
  • Other tools per Lifehacker: VLC, PDF-XChange, Foxit PDF Viewer
  • Configure sshd using openssh, make sure it starts at startup (Start > Run > Services; look for CYGWIN sshd), and allow /usr/sbin/sshd to pass through the Firewall; one could alternatively use freeSSHd
  • Bash shell in emacs via shell: add (setq shell-file-name "bash")

and (setq explicit-shell-file-name shell-file-name) to the emacs init file, and add the following to ~/.bashrc:

if [ "$EMACS" == "t" -a "$OSTYPE" == "cygwin" ]; then
    PS1='\u@\h \w\n$ '

If Dropbox cannot be installed then symlink my ~/.emacs.d directory (need to use mklink in order for symlink to work properly).

If the computer is a dual-boot with Linux installed first, then one can change the order of the bootloader to Windows by following these instructions.

This is a good post to review.

Find text or string in files of a certain type

One can use grep "mystring" myfile.ext to find the lines in myfile.ext containing mystring. One could also use grep "mystring" *.ext to find mystring in all files with extension ext. Similarly, one could use grep "mystring" /directory to search for mystring in all files in the directory. What if one wants to search for mystring in all *.ext files in a certain path /directory? Most posts online would suggest something along the line of

<pre class="src src-sh">find /directory -type -f -name <span style="color: #ffa07a;">"*.ext"</span> | xargs grep <span style="color: #ffa07a;">"mystring"</span>

However, the comments of this post shows how one could do it with grep:

<pre class="src src-sh">grep -r --include=*.ext <span style="color: #ffa07a;">"mystring"</span> /directory

Guide to accessing MS SQL Server and MySQL server on Mac OS X

Native GUI client access to MS-SQL and MySQL

We can use Oracle SQL Developer with the jTDS driver to access Microsoft SQL Server. Note: jTDS version 1.3.0 did not work for me; I had to use version 1.2.6. Detailed instructions can be found here.

We can use MySQL Workbench to access MySQL server. Setup is intuitively obvious.

Overview of ODBC on Mac OS X

Mac OS X has iODBC installed as it’s default ODBC manager. Most other Linux/UNIX system uses unixODBC to manage the ODBC drivers. This is the main reason why there’s so much confusion on getting ODBC to work on Mac OS X.

ODBC is kind of like an API for any software to access any DBMS easily, regardless of what DBMS it is and what OS it’s running on. Different software (e.g., R or Python) can utilize ODBC to access different DBMS through the following logic: Software -> ODBC Manager -> ODBC Driver for the DBMS -> DBMS Server (Software: R, Python, etc.; DBMS: MySQL, MS-SQL, etc.).

It doesn’t matter whether you use iODBC or unixODBC. Whichever one you use, just make sure the DBMS Driver and software you are using are configured/compiled to use with the same ODBC manager (usually set through the configure flags). For example, the R package RODBC and Python package pyodbc are compiled by default to use iODBC on Mac OS X. The DBMS drivers used must be compiled for use with iODBC. For iODBC, one could add data source names (DSN’s) at ~/Library/ODBC/odbc.ini. For unixODBC, one could add DSN’s at ~/.odbc.ini.

My current setup utilizes iODBC. I will outline the instructions for setting up MySQL and freeTDS (MS-SQL) drivers for use with RODBC and pyodbc through iODBC.

MySQL and FreeTDS with iODBC on Mac OS X

Install the MySQL Connector/ODBC driver. Driver should be at /usr/local/lib/ or /usr/local/lib/ Note: I’m unable to compile the driver from source on Mac OS X.

FreeTDS is an open source ODBC driver to access MS SQL Server. Install via Home Brew:

## install homebrew
ruby -e "$(curl -fsSL"

## install freetds
brew install freetds

Driver should be at /usr/local/lib/ (symbolic linked).

Create ~/Library/ODBC/odbc.ini:

Port = 1433
Trace = Yes
Description=my description
# Database=
# can't specify username and password for freetds

# Database=
## can specify an actual database to each DSN

Install pyodbc via sudo pip install pyodbc. Test connections in python:

import pyodbc as p

con1 = p.connect("DSN=sqlserver01;UID=username;PWD=password")
con1.execute("select name from master..sysdatabases").fetchall()

con2 = p.connect("DSN=mysql01;UID=username;PWD=password")
con2.execute("show databases;").fetchall()

Install R using the installer. Install RODBC in the R interpreter via install.packages("RODBC"). Test connections in R:


ch1 <- odbcConnect(dsn="sqlserver01", uid="username", pwd="password")
odbcQuery(ch1, "select name from master..sysdatabases")

ch2 <- odbcConnect(dsn="mysql01", uid="username", pwd="password")
odbcQuery(ch2, "show databases;")

More on unixODBC on Mac OS X

If one wants to use unixODBC on Mac OS X instead, note the following:

  • First install unixODBC via Homebrew with brew install unixodbc.
  • Compile R from source to have it work with unixODBC (R binaries from the installer uses iODBC by default).
  • Can choose --with-odbc-manager=odbc when compiling RODBC.
  • When compiling freeTDS, include the argument with-unixodbc (pass to Homebrew or when compiling manually).
  • I’m unable to compile the MySQL Connector driver on Mac OS X from source (Homebrew or manually). Thus, it won’t work with unixODBC. I believe I tried unixODBC and MySQL Connector from macports, and those work.
  • pyodbc only works with iODBC on Mac OS X (inspect setup file). Currently I can’t get pyodbc to work with unixODBC on Mac OS X.

More differences between unixODBC and iODBC

unixODBC comes with the isql command to access different DBMS from the command line interpreter. iODBC comes with the iodbctest and iodbctestw commands. The command isql works for me on Mac OS X when I set freeTDS up to work with unixODBC (e.g., accessing MS SQL Server). I couldn’t access MySQL server because the MySQL Connector driver was compiled for use with iODBC.

If I use iODBC, I get the following for trying to access a MySQL server:

$ iodbctestw "DSN=sqlserver01;UID=username;PWD=password"
iODBC Unicode Demonstration program
This program shows an interactive SQL processor
Driver Manager: 03.52.0607.1008
1: SQLDriverConnectW = [MySQL][ODBC 5.1 Driver]Prompting is not supported on this platform. Please provide all required connect information. (0) SQLSTATE=HY000
1: ODBC_Connect = [MySQL][ODBC 5.1 Driver]Prompting is not supported on this platform. Please provide all required connect information. (0) SQLSTATE=HY000

When I try to access SQL Server, I get

$ iodbctestw "DSN=sqlserver01;UID=username;PWD=password"
iODBC Unicode Demonstration program
This program shows an interactive SQL processor
Driver Manager: 03.52.0607.1008
1: SQLDriverConnectW = [FreeTDS][SQL Server]Login failed for user 'username'. (18456) SQLSTATE=42000
2: SQLDriverConnectW = [FreeTDS][SQL Server]Unable to connect to data source (0) SQLSTATE=08001
1: ODBC_Connect = [FreeTDS][SQL Server]Login failed for user 'username'. (18456) SQLSTATE=42000
2: ODBC_Connect = [FreeTDS][SQL Server]Unable to connect to data source (0) SQLSTATE=08001

Don’t know why that is so. I guess it’s not too important to use an interactive interpreter. What matter is that the driver works with R and Python. Perhaps I should consider sqsh or do more searching.

Avoid data truncation in SAS when importing CSV files

SAS’s Proc Import is great for importing a CSV or other delimited files:things just “work” most of the time. We don’t need to specify variable names, variable type, etc. However, data truncation or mis-matched variable type can happen as the procedure determines the data type and length of the variables based on the first few rows of the delimited file.

As this post suggests, one could use the guessingrows=32767; statement in Proc Import so SAS uses the first 32k rows to determine data type and length.

Alternatively, the safer solution would be to import the delimited file by using the Data step and explicitly use the length statement with a long length option to ensure that no truncation occurs (e.g., length my_var $100). One would also need to specify the data type with the input statement here as well. Note: Do not specify the variable length using the input statement here because SAS might read in characters from other fields as it starts reading from the last delimiter all the way to the character length.

SAS Proc SQL Group By returns multiple rows per group

Just wanted to note that for traditional SQL implementations (e.g., MySQL, MS-SQL), the Group By statement used to aggregate a variable by certain variable(s) returns 1 row for each group. When a column that is not unique within a group is also selected, then the row that’s returned is determined somehow by the DB software.

In contrast, SAS’s Proc SQL will return multiple rows for each group (the number of original rows), with the aggregated variable repeated for each row within a group. Here’s an example:

<pre class="src src-sas"><span style="color: #7fffd4;">data</span> foo ;
<span style="color: #00ffff;">infile</span> datalines dlm=<span style="color: #ffa07a;">" "</span> ;
<span style="color: #00ffff;">input</span> name $ week $ sales ;
datalines ;

bob 1 20000 bob 2 30000 jane 1 40000 jane 2 50000 mike 1 60000 mike 2 70000 kevin 1 80000 kevin 2 90000 ; run ;

proc sql ; create table foo_agg as select , a.week , sum(a.sales) as total_sales from foo as a group by name ; quit ; run ;

proc export data=foo_agg outfile=“foo_agg.csv” DBMS=csv REPLACE ; run ;

The content of foo_agg.csv looks like

<pre class="example">name,week,total_sales

bob,2,50000 bob,1,50000 jane,1,90000 jane,2,90000 kevin,1,170000 kevin,2,170000 mike,2,130000 mike,1,130000

An analogous return from the SQL code in MySQL or MS-SQL might look something like


In SAS’s Proc SQL, one would need to use the Select Distinct statement in order to remove the duplicate rows.

Note that when combining the Group By statement with a Join, these multiple records per group still hold.

SAS’s implementation is not necessarily bad as it gives the user’s more flexibility in returning an aggregated variable with every row without re-joining the aggregated table with the original table. The user just has to remember this behavior ;).

Execute shell commands with an asterisk in SAS

I wanted to use %sysexec to execute a shell command with an asterisk (shell wildcard for globbing) in a SAS program:

<pre class="src src-sas"><span style="color: #7fffd4;">%sysexec</span> cp /tmp/foo<span style="color: #ff4500;">/*</span><span style="color: #ff4500;">.txt /tmp/bar ;</span>

However, it wasn’t giving me the desired results, probably due to the /* characters as they begin a commented section in a SAS program. Also tried escaping the asterisk with \* and surrounding the shell command with quotes but I didn’t get any luck. Emailed the SAS-L community for help and discovered the x and call system statements in SAS. The following works:

<pre class="src src-sas">x <span style="color: #ffa07a;">"cp /tmp/foo/*.txt /tmp/bar"</span> ;

/ or / data null ; call system(“cp /tmp/foo/*.txt /tmp/bar”) ; run ;

More information on executing shell commands in a SAS program can be found here.