## Compile R 3.2.2 on AIX 6.1

Here are my notes on compiling 64-bit R 3.2.2 on AIX 6.1. As a prerequisite, read the AIX notes in R-admin. As in those notes, our admin installed GCC from here, along with many other prerequisites, prior to compiling R. Note that you can grab newer versions of each package at http://www.oss4aix.org/download/RPMS/ (needed for R-devel).

## list of packages

## converting unicode, ascii, etc; the AIX version is not compatible with R

## dependency for unicode support; just need to extract to root /
## need the GNU version of make

## libm
https://www.ibm.com/developerworks/community/forums/html/topic?id=72e62875-0603-4d93-a9bf-9d80c6cdc6ea
http://www-01.ibm.com/support/docview.wss?uid=isg1fileset-1318926131

## jre
https://www.ibm.com/developerworks/java/jdk/aix/service.html

## python



Add /opt/freeware/bin to PATH.
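For example, in a POSIX shell (adjust for your login shell):

```shell
# make the GNU toolchain from the oss4aix RPMs take precedence
export PATH=/opt/freeware/bin:$PATH
```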

Next, update src/main/dcf.c from R-devel to fix a bug that causes readDCF to segfault when using install.packages; this won't be needed for future versions of R. Then apply this patch from here to fix the following error:

```
gcc -maix64 -pthread -std=gnu99 -I../../../../include -DNDEBUG
-I../../../include -I../../../../src/include -DHAVE_CONFIG_H
-I../../../../src/main -I/usr/local/include  -mminimal-toc    -O2 -g
-mcpu=power6  -c gramRd.c -o gramRd.o
gcc -maix64 -pthread -std=gnu99 -shared -Wl,-brtl -Wl,-G -Wl,-bexpall
-Wl,-bnoentry -lc -L/usr/local/lib -o tools.so text.o init.o Rmd5.o
md5.o signals.o install.o getfmts.o http.o gramLatex.o gramRd.o -lm
-lintl
make[6]: Entering directory '/sas/data04/vinh/R-3.2.2/src/library/tools/src'
mkdir -p -- ../../../../library/tools/libs
make[6]: Leaving directory '/sas/data04/vinh/R-3.2.2/src/library/tools/src'
make[5]: Leaving directory '/sas/data04/vinh/R-3.2.2/src/library/tools/src'
make[4]: Leaving directory '/sas/data04/vinh/R-3.2.2/src/library/tools'
make[4]: Entering directory '/sas/data04/vinh/R-3.2.2/src/library/tools'
installing 'sysdata.rda'
Error: Line starting 'Package: tools ...' is malformed!
Execution halted
../../../share/make/basepkg.mk:150: recipe for target 'sysdata' failed
make[4]: *** [sysdata] Error 1
make[4]: Leaving directory '/sas/data04/vinh/R-3.2.2/src/library/tools'
Makefile:30: recipe for target 'all' failed
make[3]: *** [all] Error 2
make[3]: Leaving directory '/sas/data04/vinh/R-3.2.2/src/library/tools'
Makefile:36: recipe for target 'R' failed
make[2]: *** [R] Error 1
make[2]: Leaving directory '/sas/data04/vinh/R-3.2.2/src/library'
Makefile:28: recipe for target 'R' failed
make[1]: *** [R] Error 1
make[1]: Leaving directory '/sas/data04/vinh/R-3.2.2/src'
Makefile:59: recipe for target 'R' failed
make: *** [R] Error 1
```


Hopefully, this patch will make it into R-devel so that it is no longer needed for future versions of R.
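For reference, applying a patch file generally follows this shape. The file and diff below are made up purely for illustration; the actual dcf.c patch is the one linked above:

```shell
# create a throwaway source file and a minimal unified diff, then apply it
mkdir -p /tmp/patch_demo && cd /tmp/patch_demo
printf 'hello\n' > demo.c
printf -- '--- demo.c\n+++ demo.c\n@@ -1 +1 @@\n-hello\n+goodbye\n' > fix.patch
patch -p0 < fix.patch
cat demo.c        # the file now reads: goodbye
```

The `-p0` flag tells patch to use the file paths in the diff as-is; for patches generated against a source tree root you may need `-p1` instead.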

```
export OBJECT_MODE=64
export CFLAGS="-O2 -g -mcpu=power6"
export FFLAGS="-O2 -g -mcpu=power6"
export FCFLAGS="-O2 -g -mcpu=power6"
./configure --prefix=/path/to/opt ## custom location so I don't need root
make -j 16
make install
```


The last step may complain that NEWS.pdf is not found in a directory and that a certain directory does not exist in the destination. For the former, just touch NEWS.pdf where it’s supposed to be; for the latter, create the directory yourself.
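A sketch of those two workarounds; the paths below are placeholders, since the exact locations come from the error messages on your system:

```shell
R_BUILD=/tmp/R-3.2.2                  # placeholder for the actual build tree
mkdir -p "$R_BUILD/doc"
touch "$R_BUILD/doc/NEWS.pdf"         # satisfy the missing-NEWS.pdf complaint
mkdir -p "$R_BUILD/missing_dest_dir"  # create whatever directory install says is absent
```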

## Automatically specify line break options with termstr as CRLF or LF in SAS when importing data

Dealing with data from multiple platforms can be annoying: Windows uses a carriage return (CR) and line feed (LF) to indicate a new line, UNIX uses LF alone, and classic Mac OS uses CR alone. Most companies have SAS running on a UNIX/Linux server, and it’s hard to tell which characters indicate a new line without going to a terminal to inspect the file.

Here’s a SAS macro that creates a filename handle that can be used in PROC IMPORT or a DATA step. It automatically detects CRLF and, if not found, defaults to LF. This assumes SAS is running on a UNIX server with access to the head and awk commands.
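The detection step the macro delegates to %sysexec can be sanity-checked at a shell prompt first; the sample file below is fabricated for illustration:

```shell
# create a file with CRLF line endings, then test its first line for \r
printf 'a|b\r\n1|2\r\n' > /tmp/crlf_example.txt
if head -n 1 /tmp/crlf_example.txt | awk '/\r/ { exit(1) }' ; then
  echo "termstr=lf"
else
  echo "termstr=crlf"
fi
```

A nonzero awk exit status (a carriage return was found on the first line) maps to termstr=crlf, matching the &SYSRC=1 branch of the macro.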

```
%macro handle_crlf(file, handle_name, other_filename_options=) ;
/* if there is a carriage return at the end, then return 1 (stored in macro variable SYSRC) */
%sysexec head -n 1 "&file" | awk '/\r/ { exit(1) }' ;
%if &SYSRC=1 %then %let termstr=crlf ;
%else %let termstr=lf ;
filename &handle_name "&file" termstr=&termstr &other_filename_options ;
%mend ;

/*
%handle_crlf(file=/path/to/file.txt, handle_name=fh) ;
proc import datafile=fh dbms=dlm replace out=d1 ;
  delimiter='|' ;
run ;
*/
```

## Repair line breaks within a field of a delimited file

Sometimes people generate delimited files with line break characters (carriage return and/or line feed) inside a field without quoting. I previously wrote about the case when the problematic fields are quoted. I also wrote about using non-ASCII characters as field and new-record indicators to avoid clashes.

The following script reads stdin and writes repaired lines to stdout, ensuring every output line has at least the number of delimiters (|) in the first/header line (call this the target number of delimiters) by continually concatenating lines (removing line breaks) until concatenating the next line would yield more delimiters than the target. The script looks more complicated than it should be in order to handle the case where a field contains more than one line break (so don’t just concatenate one line but keep doing so) and the case where a line has more delimiters than the target (this could lead to an infinite loop if we required the number of delimiters to equal the target exactly).

```python
#! /usr/bin/env python
import sys
from signal import signal, SIGPIPE, SIG_DFL
# http://stackoverflow.com/questions/14207708/ioerror-errno-32-broken-pipe-python
signal(SIGPIPE, SIG_DFL)  ## no error when exiting a pipe like less

dlm = '|'
line = sys.stdin.readline()
n_dlm = line.count(dlm)  ## target number of delimiters, from the header line
line0 = line
line_next = 'a'
while line:
    if line.count(dlm) > n_dlm or line_next == '':
        sys.stdout.write(line0)
        line = line_next
    if line.count(dlm) > n_dlm:  ## line with more delimiters than target?
        line0 = line_next
        line_next = sys.stdin.readline()
        line = line.replace('\r', ' ').replace('\n', ' ') + line_next
    else:
        line0 = line
        line_next = sys.stdin.readline()
        line = line.replace('\r', ' ').replace('\n', ' ') + line_next
```

## Calculate the weighted Gini coefficient or AUC in R

This post on Kaggle provides R code for calculating the Gini coefficient when assessing a prediction rule, and this post provides R code for the weighted version (think exposure for frequency and claim count for severity in non-life insurance modeling). Note that the weighted version is not well-defined when there are ties in the predictions and the corresponding weights vary, because different Lorentz curves (gains charts) can be drawn for different orderings of the observations; see this post for an explanation and some examples.

Now, to explain the code. The calculation of the x values (variable random, the cumulative proportion of observations or weights) and the y values (variable Lorentz, the cumulative proportion of the response, the good’s/1’s or positive values) is straightforward.
To calculate the area between the Lorentz curve and the diagonal line, one can use the trapezoidal rule to calculate the area between the Lorentz curve and the x-axis and then subtract the area of the lower triangle (1/2):

\begin{align}
Gini &= \sum_{i=1}^{n} (x_{i} - x_{i-1}) \left[\frac{L(x_{i}) + L(x_{i-1})}{2}\right] - \frac{1}{2} \\
&= \frac{1}{2} \sum_{i=1}^{n} \left[ L(x_{i})x_{i} + L(x_{i-1})x_{i} - L(x_{i})x_{i-1} - L(x_{i-1})x_{i-1} \right] - \frac{1}{2} \\
&= \frac{1}{2} \sum_{i=1}^{n} \left[ L(x_{i})x_{i} - L(x_{i-1})x_{i-1} \right] + \frac{1}{2} \sum_{i=1}^{n} \left[ L(x_{i-1})x_{i} - L(x_{i})x_{i-1} \right] - \frac{1}{2} \\
&= \frac{1}{2} L(x_{n})x_{n} + \frac{1}{2} \sum_{i=1}^{n} \left[ L(x_{i-1})x_{i} - L(x_{i}) x_{i-1} \right] - \frac{1}{2} \\
&= \frac{1}{2} \sum_{i=1}^{n} \left[ L(x_{i-1})x_{i} - L(x_{i}) x_{i-1} \right]
\end{align}

where the last equality comes from the fact that $$L(x_{n}) = x_{n} = 1$$ for the Lorentz curve/gains chart. The remaining summation thus corresponds to sum(df$Lorentz[-1]*df$random[-n]) - sum(df$Lorentz[-n]*df$random[-1]) inside the WeightedGini function, since the $$i=1$$ term in the summation is 0 ($$x_{0}=0$$ and $$L(x_{0})=0$$ for the Lorentz curve), yielding $$n-1$$ terms in the code.
For the unweighted case, applying the trapezoidal rule to the area between the Lorentz curve and the diagonal line yields:

\begin{align}
Gini &= \sum_{i=1}^{n} \frac{1}{n} \frac{\left[ L(x_{i}) - x_{i} \right] + \left[ L(x_{i-1}) - x_{i-1} \right]}{2} \\
&= \frac{1}{2n} \sum_{i=1}^{n} \left[ L(x_{i}) - x_{i} \right] + \frac{1}{2n} \sum_{i=1}^{n} \left[ L(x_{i-1}) - x_{i-1} \right] \\
&= \frac{1}{2n} \sum_{i=1}^{n} \left[ L(x_{i}) - x_{i} \right] + \frac{1}{2n} [L(x_{0}) - x_{0}] + \frac{1}{2n} \sum_{i=1}^{n-1} \left[ L(x_{i}) - x_{i} \right] \\
&= \frac{1}{2n} \sum_{i=1}^{n} \left[ L(x_{i}) - x_{i} \right] + \frac{1}{2n} [L(x_{0}) - x_{0}] + \frac{1}{2n} \sum_{i=1}^{n-1} \left[ L(x_{i}) - x_{i} \right] + \frac{1}{2n} [L(x_{n}) - x_{n}] \\
&= \frac{1}{2n} \sum_{i=1}^{n} \left[ L(x_{i}) - x_{i} \right] + \frac{1}{2n} [L(x_{0}) - x_{0}] + \frac{1}{2n} \sum_{i=1}^{n} \left[ L(x_{i}) - x_{i} \right] \\
&= \frac{1}{n} \sum_{i=1}^{n} \left[ L(x_{i}) - x_{i} \right]
\end{align}

where we repeatedly used the fact that $$L(x_{0}) = x_{0} = 0$$ and $$L(x_{n}) = x_{n} = 1$$ for a Lorentz curve and that $$1/n$$ is the width between points (the change in the cdf of the observations). The summation is what is returned by SumModelGini. Note that neither $$1/2$$ nor $$1/n$$ is multiplied into the sums in the weighted and unweighted functions, since most people use the normalized versions, in which case these factors just cancel.

## Skeleton to create fast automatic tree diagrams using R and Graphviz

I’ve had to create tree diagrams (dendrograms, decision trees) many times in the past to illustrate the flow of data or decisions (e.g., data flow for a study). This is usually a manual task done in MS PowerPoint or Visio. I’ve also made some diagrams in the past using Graphviz, based on the DOT language, to make creation more reproducible. However, that still felt pretty manual.
I decided to come up with a skeleton framework to generate these diagrams using R, since I can connect to various data sources, do calculations, and mash up outputs fairly fast with it. Here is my framework, illustrated by an example:

```r
digraph <- '# dot -Tpng diagram.gv > diagram.png
digraph g {
graph [rankdir="LR"]
node [shape="rectangle" style=filled color=blue fontcolor=white] ;
n [label="%s"] ;
n_a [label="%s" color=red] ;
n_b [label="%s"] ;
n_aa [label="%s" color=red] ;
n_ab [label="%s" color=red] ;
n_ba [label="%s"] ;
n_bb [label="%s"] ;
n -> n_a [label="%s"];
n -> n_b [label="%s"];
n_a -> n_aa [label="%s"];
n_a -> n_ab [label="%s"];
n_b -> n_ba [label="%s"] ;
n_b -> n_bb [label="%s"] ;
}
'
string_list <- list(
  digraph
  , 'All calls\\nn=100,000'
  , 'Night\\nn=20,000'
  , 'Day\\nn=80,000'
  , 'Closed\\nn=5,000'
  , 'Open\\nn=15,000'
  , 'Closed\\nn=10,000'
  , 'Open\\nn=7,000'
  , 'Night'
  , 'Day'
  , 'Closed'
  , 'Open'
  , 'Closed'
  , 'Open'
)
# http://stackoverflow.com/questions/10341114/alternative-function-to-paste
dot_file <- do.call(sprintf, string_list)
sink('diagram.gv', split=TRUE)
cat(dot_file)
sink()
# run: dot -Tpng diagram.gv > diagram.png
```

This will write a file called diagram.gv:

```
# dot -Tpng diagram.gv > diagram.png
digraph g {
graph [rankdir="LR"]
node [shape="rectangle" style=filled color=blue fontcolor=white] ;
n [label="All calls\nn=100,000"] ;
n_a [label="Night\nn=20,000" color=red] ;
n_b [label="Day\nn=80,000"] ;
n_aa [label="Closed\nn=5,000" color=red] ;
n_ab [label="Open\nn=15,000" color=red] ;
n_ba [label="Closed\nn=10,000"] ;
n_bb [label="Open\nn=7,000"] ;
n -> n_a [label="Night"];
n -> n_b [label="Day"];
n_a -> n_aa [label="Closed"];
n_a -> n_ab [label="Open"];
n_b -> n_ba [label="Closed"] ;
n_b -> n_bb [label="Open"] ;
}
```

Executing dot -Tpng diagram.gv > diagram.png, I will get the following output:

To make tree diagrams quickly, edit the structure of the diagram stored in the variable digraph.
Dynamic text is inserted via sprintf using the list string_list. For example, do all the calculations first, then format the results into string_list. The diagram can then be generated very quickly and reproducibly. Changing a data source or a calculation no longer means manually re-creating the diagram!

## Best practices for importing a delimited file in SAS using the DATA step

The easiest way to import a delimited file (e.g., CSV) in SAS is to use PROC IMPORT:

```
proc import datafile="/path/to/my_file.txt" out=work.my_data dbms=dlm replace ;
  delimiter="|" ;
  guessingrows=32000 ;
run ;
```

PROC IMPORT isn’t a viable option when the fileref used in the datafile argument is not of the DISK type. For example, the fileref my_pipe would not work in the following example,

```
filename my_pipe pipe "gunzip -c my_file.txt.gz" ;
```

because SAS needs “random access” to the fileref (i.e., to determine the variable types). PROC IMPORT also isn’t suitable when you have a very large data set where one of the columns might contain an element with a very long length (and this might first occur after the number of rows specified by guessingrows).

Based on my experience, one should use the truncover and missover (don’t go to the next line if the current line ends early), dsd (allow empty fields), and lrecl (make this big for long lines; it defaults to 256, which means lines longer than 256 characters will be truncated) options in the infile statement to avoid unnecessary errors. Since the infile is delimited, it is easy to import the fields using the list input method. However, one should use the length statement to declare the maximum length for each character variable, and use the informat statement for numeric variables that have special formats (date, dollar amount, etc.). I usually forget this and just declare the informats after the variables in the input statement, which only works when inputting via the input pointer method (e.g., @27 my_var date9.).
Here is an example:

```
filename my_pipe pipe "gunzip -c my_file.txt.gz" ;
data my_data ;
  infile my_pipe dlm="|" dsd truncover missover lrecl=50000 ;
  length x2 $50 x3 $25 ;
  informat x4 date9. ;
  format x4 date9. ;
  input x1 x2 $ x3 $ x4 ;
run ;
```
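What the my_pipe fileref streams into SAS can be previewed at the shell; the file and fields below are fabricated for illustration:

```shell
# build a small gzipped delimited file, then stream it the way the fileref does
printf 'x1|x2|x3|x4\n1|abc|def|01JAN2020\n' > /tmp/my_file.txt
gzip -f /tmp/my_file.txt
gunzip -c /tmp/my_file.txt.gz | head -n 1   # prints the header line: x1|x2|x3|x4
```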