Compile R 3.2.2 on AIX 6.1

Here are my notes compiling 64-bit R 3.2.2 on AIX 6.1. As a pre-requisite, read the AIX notes from R-admin. Like the notes, I had GCC installed from here by our admin, along with many other pre-requisites. These were installed prior to compiling R. Note that you could grab newer versions of each package by going to (needed for R-dev).

## list of packages

## conversting unicode, ascii, etc; aix version is not compatible with

## dependency for unicode support; just need to extract to root / ##
need the gnu version of make

## libm ## jre

#### python

Add /opt/freeware/bin to PATH.

Now, download the R source tarball, extract, and cd.

Next, update src/main/dcf.c from the R-dev as a bug that causes readDCF to segfault when using install.packages; no need to do this for future versions of R. Then apply this patch from here to fix the following error:

gcc -maix64 -pthread -std=gnu99 -I../../../../include -DNDEBUG
-I../../../include -I../../../../src/include -DHAVE_CONFIG_H
-I../../../../src/main -I/usr/local/include  -mminimal-toc    -O2 -g
-mcpu=power6  -c gramRd.c -o gramRd.o

gcc -maix64 -pthread -std=gnu99 -shared -Wl,-brtl -Wl,-G -Wl,-bexpall
-Wl,-bnoentry -lc -L/usr/local/lib -o text.o init.o Rmd5.o
md5.o signals.o install.o getfmts.o http.o gramLatex.o gramRd.o -lm

make[6]: Entering directory '/sas/data04/vinh/R-3.2.2/src/library/tools/src'

mkdir -p -- ../../../../library/tools/libs

make[6]: Leaving directory '/sas/data04/vinh/R-3.2.2/src/library/tools/src'

make[5]: Leaving directory '/sas/data04/vinh/R-3.2.2/src/library/tools/src'

make[4]: Leaving directory '/sas/data04/vinh/R-3.2.2/src/library/tools'

make[4]: Entering directory '/sas/data04/vinh/R-3.2.2/src/library/tools'

installing 'sysdata.rda'

Error: Line starting 'Package: tools ...' is malformed!

Execution halted

../../../share/make/ recipe for target 'sysdata' failed

make[4]: *** [sysdata] Error 1

make[4]: Leaving directory '/sas/data04/vinh/R-3.2.2/src/library/tools'

Makefile:30: recipe for target 'all' failed

make[3]: *** [all] Error 2

make[3]: Leaving directory '/sas/data04/vinh/R-3.2.2/src/library/tools'

Makefile:36: recipe for target 'R' failed

make[2]: *** [R] Error 1

make[2]: Leaving directory '/sas/data04/vinh/R-3.2.2/src/library'

Makefile:28: recipe for target 'R' failed

make[1]: *** [R] Error 1

make[1]: Leaving directory '/sas/data04/vinh/R-3.2.2/src'

Makefile:59: recipe for target 'R' failed

make: *** [R] Error 1

Hopefully, this patch will make it to R-dev so that it is no longer needed for future versions of R.

export OBJECT_MODE=64
export CC="gcc -maix64 -pthread"
export CXX="g++ -maix64 -pthread"
export FC="gfortran -maix64 -pthread"
export F77="gfortran -maix64 -pthread"
export CFLAGS="-O2 -g -mcpu=power6"
export FFLAGS="-O2 -g -mcpu=power6"
export FCFLAGS="-O2 -g -mcpu=power6"
./configure --prefix=/path/to/opt ## custom location so I don't need root
make -j 16
make install
## add /path/to/opt/bin to PATH

The last step may complain about NEWS.pdf not found in a directory and a certain directory is not found in the destination. For the former, just do touch NEWS.pdf to where it’s supposed to be; for the latter, create the directory yourself.

Automatically specify line break options with termstr as CRLF or LF in SAS when importing data

It could be annoying when dealing with data from multiple platforms: Windows uses the carriage return (CR) and line feed (LF) to indicate a new line, UNIX uses LF, and Mac uses CR. Most companies have SAS running on a UNIX/Linux server, and it’s hard to tell which characters indicate a new line without going to a terminal to inspect the file.

Here’s a sas macro that creates a filename handle that could be used in PROC IMPORT or a DATA step. It will automatically detect CRLF and if not, default to LF. This assumes SAS is running on a UNIX server with access to the head and awk commands.

%macro handle_crlf(file, handle_name, other_filename_options=) ;
/* if there is a carriage return at the end, then return 1 (stored in macro variable SYSRC) */
%sysexec head -n 1 "&file" | awk '/\r$/ { exit(1) }' ;
%if &SYSRC=1 %then %let termstr=crlf ;
%else %let termstr=lf ;
filename &handle_name "&file" termstr=&termstr &other_filename_options ;
%mend ;
%handle_crlf(file=/path/to/file.txt, handle_name=fh) ;
proc import data=fh dbms=dlm replace outdata=d1 ;
   delimiter='|' ;
run ;

Repair line breaks within a field of a delimited file

Sometimes some people generate delimited files with line break characters (carriage return and/or line feed) inside a field without quoting. I previously wrote about the case when the problematic fields are quoted. I also wrote about using non-ascii characters as field and new record indicators to avoid clashes.

The following script reads in stdin and writes to stdout repaired lines by ensuring every output line has at least the number of delimiters (|) as the first/header line (call this the target number of delimiters) by continually concatenating lines (remove line breaks) until it reaches the point when concatenating the next line would yield more delimiters than the target number of delimiters. The script appears more complicated than it should be in order to address the case when there are more than one line breaks in a field (so don’t just concatenate one line but keep doing so) and the case when a line has more delimiters than the target number of delimiter (this could lead to an infinite loop if we restrict the number of delimiters to equal the target).

#! /usr/bin/env python


import sys
from signal import signal, SIGPIPE, SIG_DFL #
signal(SIGPIPE,SIG_DFL) ## no error when exiting a pipe like less

line = sys.stdin.readline()
n_dlm = line.count(dlm)

line0 = line
line_next = 'a'
while line:
    if line.count(dlm) > n_dlm or line_next=='':
        line = line_next
        # line = sys.stdin.readline()
        if line.count(dlm) > n_dlm: ## line with more delimiters than target?
            line0 = line_next
            line_next = sys.stdin.readline()
            line = line.replace('\r', ' ').replace('\n', ' ') + line_next
        line0 = line
        line_next = sys.stdin.readline()
        line = line.replace('\r', ' ').replace('\n', ' ') + line_next

Calculate the weighted Gini coefficient or AUC in R

This post on Kaggle provides R code for calculating the Gini for assessing a prediction rule, and this post provides R code for the weighted version (think exposure for frequency and claim count for severity in non-life insurance modeling). Note that the weighted version is not well-defined when there are ties in the predictions and where the corresponding weights vary because different Lorentz curve (gains chart) could be drawn for different orderings of the observations; see this post for an explanation and some examples.

Now, to explain the code. The calculation of the x values (variable random, the cumulative proportion of observations or weights) and y values (variable Lorentz, the cumulative proportion of the response, the good’s/1’s or positive values) are straightforward. To calculate the area between the Lorentz curve and the diagonal line, one could use the trapezoidal rule to calculate the area between the Lorentz curve and x-axis and then subtract the area of the lower triangle (1/2):

\begin{align} Gini &= \sum_{i=1}^{n} (x_{i} – x_{i-1}) \left[\frac{L(x_{i}) + L(x_{i-1})}{2}\right] – \frac{1}{2} \ &= \frac{1}{2} \sum_{i=1}^{n} \left[ L(x_{i})x_{i} + L(x_{i-1})x_{i} – L(x_{i})x_{i-1} – L(x_{i-1})x_{i-1} \right] – \frac{1}{2} \ &= \frac{1}{2} \sum_{i=1}^{n} \left[ L(x_{i})x_{i} – L(x_{i-1})x_{i-1} \right] + \frac{1}{2} \sum_{i=1}^{n} \left[ L(x_{i-1})x_{i} – L(x_{i})x_{i-1} \right] – \frac{1}{2} \ &= \frac{1}{2} L(x_{n})x_{n} + \frac{1}{2} \sum_{i=1}^{n} \left[ L(x_{i-1})x_{i} – L(x_{i}) x_{i-1} \right] – \frac{1}{2} \ &= \frac{1}{2} \sum_{i=1}^{n} \left[ L(x_{i-1})x_{i} – L(x_{i}) x_{i-1} \right] \end{align}

where the last equality comes from the fact that \(L(x_{n}) = x_{n} = 1\) for the Lorentz curve/gains chart. The remaining summation thus corresponds to sum(df$Lorentz[-1]*df$random[-n]) - sum(df$Lorentz[-n]*df$random[-1]) inside the WeightedGini function since the \(i=1\) term in the summation is 0 (\(x_i=0\) and \(L(x_{0})=0\) for the Lorentz curve), yielding \(n-1\) terms in the code.

For the unweighted case, applying the trapezoidal rule on the area between the Lorentz curve and the diagonal line yields:

\begin{align} Gini &= \sum_{i=1}^{n} \frac{1}{n} \frac{\left[ L(x_{i}) – x_{i} \right] – \left[ L(x_{i-1}) – x_{i-1} \right] }{2} \ &= \frac{1}{2n} \sum_{i=1}^{n} \left[ L(x_{i}) – x_{i} \right] + \frac{1}{2n} \sum_{i=1}^{n} \left[ L(x_{i-1}) – x_{i-1} \right] \ &= \frac{1}{2n} \sum_{i=1}^{n} \left[ L(x_{i}) – x_{i} \right] + \frac{1}{2n} [L(x_{0}) – x_{0}] + \frac{1}{2n} \sum_{i=1}^{n-1} \left[ L(x_{i}) – x_{i} \right] \ &= \frac{1}{2n} \sum_{i=1}^{n} \left[ L(x_{i}) – x_{i} \right] + \frac{1}{2n} [L(x_{0}) – x_{0}] + \frac{1}{2n} \sum_{i=1}^{n-1} \left[ L(x_{i}) – x_{i} \right] + \frac{1}{2n} [L(x_{n}) – x_{n}] \ &= \frac{1}{2n} \sum_{i=1}^{n} \left[ L(x_{i}) – x_{i} \right] + \frac{1}{2n} [L(x_{0}) – x_{0}] + \frac{1}{2n} \sum_{i=1}^{n} \left[ L(x_{i}) – x_{i} \right] \ &= \frac{1}{n} \sum_{i=1}^{n} \left[ L(x_{i}) – x_{i} \right] \end{align}

where we repeatedly used the fact that \(L(x_{0}) = x_{0} = 0\) and \(L(x_{n}) = x_{n} = 1\) for a Lorentz curve and that \(1/n\) is the width between points (change in cdf of the observations). The summation is what is returned by SumModelGini.

Note that both \(1/2\) and \(1/n\) are not multiplied to the sums in the weighted and unweighted functions since most people will use the normalized versions, in which case these factors just cancel.

Skeleton to create fast automatic tree diagrams using R and Graphviz


I’ve had to create tree diagrams (dendrograms, decision trees) many times in the past to illustrate the flow of data or decisions (e.g., data flow for a study). This is usually a manual task done in MS Powerpoint or Visio. I’ve also made some diagrams in the past using Graphviz based on the DOT language to make creation more reproducible. However, that still felt pretty manual.

I decided to come up with a skeleton framework to generate these diagrams using R since I could connect to various data sources, do calculations, and mash up outputs fairly fast with it.

Here is my framework illustrated by an example:

digraph <- '# dot -Tpng diagram.gv > diagram.png
digraph g {
graph [rankdir="LR"]
node [shape="rectangle" style=filled color=blue fontcolor=white] ;
n [label="%s"] ;
n_a [label="%s" color=red] ;
n_b [label="%s"] ;
n_aa [label="%s" color=red] ;
n_ab [label="%s" color=red] ;
n_ba [label="%s"] ;
n_bb [label="%s"] ;
n -> n_a [label="%s"];
n -> n_b [label="%s"];
n_a -> n_aa [label="%s"];
n_a -> n_ab [label="%s"];
n_b -> n_ba [label="%s"] ;
n_b -> n_bb [label="%s"] ;

string_list <- list(
    , 'All calls\\nn=100,000'
    , 'Night\\nn=20,000'
    , 'Day\\nn=80,000'
    , 'Closed\\nn=5,000'
    , 'Open\\nn=15,000'
    , 'Closed\\nn=10,000'
    , 'Open\\nn=7,000'
    , 'Night'
    , 'Day'
    , 'Closed'
    , 'Open'
    , 'Closed'
    , 'Open'

dot_file <-, string_list)
sink('diagram.gv', split=TRUE)
# run: dot -Tpng diagram.gv > diagram.png

This will write a file called diagram.gv:

dot -Tpng diagram.gv > diagram.png

digraph g { graph [rankdir="LR"] node [shape="rectangle" style=filled color=blue fontcolor=white] ; n [label="All calls\nn=100,000"] ; n_a [label="Night\nn=20,000" color=red] ; n_b [label="Day\nn=80,000"] ; n_aa [label="Closed\nn=5,000" color=red] ; n_ab [label="Open\nn=15,000" color=red] ; n_ba [label="Closed\nn=10,000"] ; n_bb [label="Open\nn=7,000"] ; n -> n_a [label="Night"]; n -> n_b [label="Day"]; n_a -> n_aa [label="Closed"]; n_a -> n_ab [label="Open"]; n_b -> n_ba [label="Closed"] ; n_b -> n_bb [label="Open"] ; }

Executing dot -Tpng diagram.gv > diagram.png, I will get the following output:


To make tree diagrams quickly, edit the structure of the diagram stored in the variable digraph. Dynamic texts should be inserted using sprintf via the list string_list. For example, do all calculations and then format the results in string_list. Then the diagram could be generated very quickly and will be reproducible. Changing any data source or any calculations would not result in manually re-creating the diagram!

Best practices for importing a delimited file in SAS using the DATA step

The easiest way to import a delimited file (e.g., CSV) in SAS is to use PROC IMPORT:

proc import datafile="/path/to/my_file.txt"
    delimiter="|" ;
    guessingrows=32000 ;
run ;

PROC IMPORT isn’t a viable option when the fileref used in the datafile argument is not of the DISK type. For example, the fileref my_pipe would not work in the following example,

filename my_pipe pipe "gunzip -c my_file.txt.gz" ;

because SAS needs “random access” to the fileref (i.e., to determine the variable type). PROC IMPORT also isn’t suitable when you have a very large data set where one of the columns might contain an element that has a very long length (and this might occur after the number of rows specified by guessingrows). Based on my experience, one should use the truncover, missover (don’t go to next line if line ends early), dsd (allow empty field) and lrecl (make this big for long lines; defaults to 256, which means your lines will be truncated if they are longer than 256 characters long) options in the infile statement to avoid unnecessary errors.

Since the infile is delimited, it is easy to import the fields using the list input method. However, one should use the length statement to declare the maximum length for each character variable, and use the informat statement for numeric variables that have special formats (date, dollar amount, etc.). I usually forget and just declare the informats following the variables in the input statement, which only works when we are inputting using the input pointer method (e.g., @27 my_var date9.). Here is an example:

filename my_pipe pipe "gunzip -c my_file.txt.gz" ;
data my_data ;
    infile my_file dlm="|" dsd truncover missover lrecl=50000 ;
        x2 $50
        x3 $25
        x4 date9.
        x4 date9.
        x2 $
        x3 $
run ;

Better decision tree graphics for rpart via party and partykit

I’ve been using Graphviz to create better decision tree graphics “by hand” for rpart objects created in R (final tree). I stumbled on this post that shows how one could convert an rpart object to a party project via the function in partykit to utilize the plot functions in party. It looks quite nice.

I might have to do additional hacking as I like to display the node size and percentage of success in every node. For example, in rpart, I do something like

## rpartObj created from rpart
textRpartCustom <- 
    nclass <- (ncol(yval) - 1L)/2
    group <- yval[, 1L]
    counts <- yval[, 1L + (1L:nclass)]
    if (!is.null(ylevel)) 
        group <- ylevel[group]
    temp1 <- rpart:::formatg(counts, digits)
    if (nclass > 1) {
        ## temp1 <- apply(matrix(temp1, ncol = nclass), 1, paste, 
        ##     collapse = "/")
      temp1 <- matrix(as.numeric(temp1), ncol=nclass)
      ##temp1 <- paste("p=", round(temp1[, 2] / apply(temp1, 1, sum)100, 1), "%", "; N=", apply(temp1, 1, sum), sep="")
      temp1 <- paste("", round(temp1[, 2] / apply(temp1, 1, sum)100, 1), "%", "; ", apply(temp1, 1, sum), sep="") 
    if (use.n) {
        out <- paste(format(group, justify = "left"), "\n", temp1, 
            sep = "")
    else {
        out <- format(group, justify = "left")

rpartObj$functions$text <- textRpartCustom plot(rpartObj) text(rpartObj)

to get these information to display for a classification fit.

Flipping the classroom: creating screencast lectures in Linux

I’m debating the idea (hype) of flipping the classroom for one of my classes next Fall where students watch lecture videos at home (or elsewhere) so I could spend class time doing more hands-on activities like discussing the art of data analysis and how to solve problems with statistics. I think Khan Academy, Udacity, and Coursera are doing a great service for humanity by offering high quality courses taught by excellent teachers online that are accessible to anyone with an internet connection.

I don’t claim to be a great teacher, but I think my own students might benefit from this pedagogical method. My main concern with this approach is that not all students will watch the lectures, just as how not all students read the assigned readings (guilty as a student). I guess I can give students short quizzes during lecture to push them to watch the videos. Also, I’ll give my usual challenging homework so that only students that study the material well could excel. By flipping the classroom, more material could be covered, students have access to the recordings in addition to my slides, and I could make sure everything I want to be said are recorded (as opposed to a live session where I could forget a few points). Lecture times can then be more interactive as opposed to me lecturing them for an hour.

I think most of the online education sites use Camtasia with a Wacom Cintiq to produce their videos. I use Linux and cannot afford such an expensive device. I plan on using a screencast software like recordMyDesktop or Istanbul to record the desktop screen and audio. For recordMyDesktop, I had issues with the encode on the fly option, which means recording very long videos could be an issue (1 minute of raw video takes up about 210MB, and 1 minute encoded video takes up about 8MB). Istanbul records on the fly without problem (I think). I haven’t tried recording for an hour and 20 minutes yet.

My plan is to create my lecture slides with LaTeX Beamer and use Xournal to annotate the slides as I’m lecturing; hopefully my Asus T101MT netbook is strong enough to do the recording as I utilize it’s touchscreen capabilities. I can just switch over to Emacs to illustrate data analysis in R when needed. My main concern now is where I could host these (large) videos…

Update 4/27/2012: Screencast with ffmpeg

After some testing, I think the best screencast software on Linux would have to be ffmpeg. First, remove ffmpeg and compile it from source based on the latest version per this post. Then, create

#! /bin/bash
DATE=`date +%Y%m%d`
TIME=`date +%Hh%M`
ffmpeg -y -f alsa -ac 2 -i pulse -f x11grab -r 24 -s $(xwininfo -root | grep 'geometry' | awk '{print $2;}') -i :0.0 -c:v libx264 -preset veryfast -crf 22 -c:a libmp3lame -ar 44100 -ab 24k -threads 0 /tmp/screencast_$DATE-$TIME.mp4

For more libx264 options, see this page.