Note on using T-SQL via R and ODBC

Reminder to self: when querying data into R through ODBC using an existing SQL script that uses T-SQL to access data in Microsoft SQL Server, add SET NOCOUNT ON; to the beginning of the script, then add SET NOCOUNT OFF; before the last SELECT statement (the query returning the data of interest). If this is not done, an empty result set is returned instead.
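A minimal sketch of such a wrapped script (the table and column names are hypothetical):

```sql
SET NOCOUNT ON;   -- suppress "N rows affected" messages from intermediate statements

-- intermediate work that would otherwise emit row counts, eg:
SELECT id, outcome INTO #results FROM dbo.some_table WHERE outcome IS NOT NULL;

SET NOCOUNT OFF;  -- per the note above, re-enable before the final SELECT
SELECT id, outcome FROM #results;
```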

Estimating standard errors using the Delta Method in R

I just discovered the msm package in R, which offers the deltamethod function for calculating the standard error of a transformation of estimated regression coefficients. A tutorial can be found here. I recently used it to calculate the standard error for a difference in proportions when a logistic regression model was used to estimate the log-odds of an event:

library(msm) # provides deltamethod
fit <- glm(Response ~ treat, data=my_data, family=binomial(link='logit')) # treat is binary
diff <- 1/(1 + exp(-coef(fit)[1] - coef(fit)[2])) - 1/(1 + exp(-coef(fit)[1])) # difference in proportions
se.diff <- deltamethod(~ 1/(1 + exp(-x1 - x2)) - 1/(1 + exp(-x1)), coef(fit), vcov(fit)) # returns standard error
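For intuition, the delta-method calculation can be sketched in pure Python (this is not the msm implementation): approximate the gradient of the transformation by central finite differences and compute \(\sqrt{g'^{T} \Sigma\, g'}\). The coefficient and covariance values below are made-up stand-ins for coef(fit) and vcov(fit):

```python
import math

def delta_method_se(g, coef, vcov, eps=1e-6):
    """Delta-method SE of g(coef): sqrt(grad' * vcov * grad),
    with the gradient approximated by central finite differences."""
    k = len(coef)
    grad = []
    for j in range(k):
        up, dn = list(coef), list(coef)
        up[j] += eps
        dn[j] -= eps
        grad.append((g(up) - g(dn)) / (2 * eps))
    var = sum(grad[i] * vcov[i][j] * grad[j]
              for i in range(k) for j in range(k))
    return math.sqrt(var)

# the same transformation as the R code: difference in proportions
g = lambda b: 1/(1 + math.exp(-b[0] - b[1])) - 1/(1 + math.exp(-b[0]))

# hypothetical stand-ins for coef(fit) and vcov(fit)
coef = [-0.5, 0.8]
vcov = [[0.04, -0.01],
        [-0.01, 0.09]]
print(delta_method_se(g, coef, vcov))
```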

Home Security: record CCTV cameras with ZoneMinder as DVR


I started working on this project and documenting it in late October 2014. I’ve only recently had the time to finish writing it all up.

I was interested in a home surveillance security system for several reasons:

  1. a friend’s house and car were burglarized three times in the last few years, and
  2. another friend was sued by his neighbor, claiming his dogs were excessively loud during the day.

A surveillance camera system would have helped identify the culprits in the first case. In the second case, my friend brought video footage from his home CCTV surveillance system (bought from Costco) to court and disproved his neighbor’s claim by showing that, to the contrary, the dogs were not barking during the stated times.

In implementing my own CCTV surveillance security system on my premises, I had several options:

  1. purchase a system from a home monitoring security company and make monthly payments for their monitoring service,
  2. purchase a CCTV surveillance system that Costco, Sam’s Club, and many online retailers sell, or
  3. build my own DIY system.

I decided on the DIY route because

  1. it saves me money in the long run (ie, no recurring fees),
  2. I have the freedom to configure it the way I want it,
  3. I am able to reuse the hardware and service if I ever decide to move, and
  4. the stars were aligned regarding the critical components:
    • the cost of an energy efficient server with a small footprint is relatively inexpensive
    • a fully mature open source software exists on Linux for monitoring and recording
    • commodity wireless cameras are relatively inexpensive (namely, Foscam FI8910W)
    • mobile phone apps exist for easy remote viewing (Android: tinyCam Monitor and ZmView)

Note: be mindful of the laws regulating surveillance and privacy in your local jurisdiction (for me, it’s California).

The main ingredient of my home surveillance system is ZoneMinder, a Linux CCTV recording platform that doesn’t require much processing power to run. More information about its use can be found here and here.

Regarding cameras, I opted for wireless IP cameras over wired analogue cameras because wiring can make installations time-consuming and is something I’m just not very good at (this was the primary reason why I decided not to install a surveillance system from Costco when I first moved into my current house). In particular, I chose the Foscam FI8910W for inside the house and Foscam FI8904W for outside the house based on a friend’s recommendation.

With the server’s hardware, I wanted to maximize computing power while constraining power consumption (< 30W), cost ($200), and footprint (mini-ITX). It took a couple of days of research, but I found the Zotac ZBOX CI320 nano to be more than sufficient. Based on my testing, it consumes about 9W of power when recording one camera. This isn’t too surprising as the system is based on a quad-core Intel Celeron processor. I normally refrain from Celerons, but this one has plenty of juice for the server’s only job: recording videos using ZoneMinder. I slapped on 8GB of RAM and a 500GB 2.5″ hard drive. Ideally, I would have liked to install a 4TB drive if I had a 3.5″ slot, or maybe a 2TB drive in the 2.5″ slot. However, the drive was given to me for free, and it more than suffices for keeping a few weeks’ worth of video footage readily available. Update: this hard drive died after running for 1.5 years; I upgraded to a WD Red Mobile 2.5″ 1TB server-grade hard drive.

Installing ZoneMinder

When configuring the server, first go into the BIOS (hold DEL during boot) and change the UEFI settings to Windows 7 (Windows 8 by default). Otherwise, the device will not boot without a monitor plugged in, and this will render the device useless as a headless server.

I originally planned to run CentOS 6.5 on the Zotac for stability, but was unable to install it because of a kernel issue when installing from the LiveCD. I therefore went with Ubuntu Server 14.04 (LTS), which I’m even more familiar with. During installation, I also installed LAMP and OpenSSH in order to run ZoneMinder and manage the server remotely. Installation was relatively straightforward if you install from the repo (as of 3/24/2016, I used version 1.29.0):

sudo apt-get update && sudo apt-get upgrade
sudo a2enmod cgi
sudo service apache2 restart
sudo apt-get install python-software-properties
sudo add-apt-repository ppa:iconnor/zoneminder
sudo apt-get update
sudo apt-get install zoneminder php5-gd libvlc-dev libvlccore-dev vlc
mysql -uroot -p < /usr/share/zoneminder/db/zm_create.sql
mysql -uroot -p -e "grant all on zm.* to 'zmuser'@localhost identified by 'zmpass';"
mysqladmin -uroot -p reload
sudo chmod 740 /etc/zm/zm.conf
sudo chown root:www-data /etc/zm/zm.conf
sudo adduser www-data video

Add the sleep line in /etc/init.d/zoneminder:

start() {
      sleep 15
      echo -n "Starting $prog: "

Finish configuring with the following:

sudo a2enmod cgi
sudo a2enconf zoneminder
sudo a2enmod rewrite
sudo service zoneminder start

Edit date.time_zone in /etc/php5/apache2/php.ini:

; Defines the default timezone used by the date functions
date.timezone = America/Los_Angeles

Restart the computer.

Configure SSL by doing:

sudo make-ssl-cert generate-default-snakeoil --force-overwrite
sudo a2enmod ssl
sudo a2ensite default-ssl
sudo service apache2 restart

The encrypted url https://server.ip/zm should now work; the unencrypted url is http://server.ip/zm.

Configuring ZoneMinder

  • Go to http://server.ip/zm in the web browser
  • Go to Options > Paths and set PATH_ZMS to /zm/cgi-bin/nph-zms (Caution: make sure you enter only /zm/cgi-bin/nph-zms with no space at the end or your video will not work!)
  • Go to Options > Users: change the default admin username and password and add a new user for viewing only (no admin rights)
  • Go to Options > System: check OPT_USE_AUTH to require user login and OPT_CONTROL to enable camera control
  • Go to Filters, select PurgeWhenFull from the Use Filter dropdown. Set Disk Percent to 90 so events (recorded videos) are deleted when the disk reaches 90%. Click Save.
    • The length of saved videos will thus depend on the hard drive’s capacity and number of cameras.
  • On the main page, go to Running > Restart > Apply
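To get a rough feel for that retention window, here is a back-of-the-envelope sketch; the per-camera daily footage volume is a made-up assumption (Mocord MJPEG storage varies widely with resolution, frame rate, and motion):

```python
def retention_days(disk_gb, n_cameras, gb_per_camera_per_day, purge_percent=90):
    """Estimate how many days of footage fit before PurgeWhenFull kicks in."""
    usable_gb = disk_gb * purge_percent / 100.0
    return usable_gb / (n_cameras * gb_per_camera_per_day)

# eg, a 500GB drive, 4 cameras, and an assumed ~10GB/day per camera
print(round(retention_days(500, 4, 10), 1))
```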

Adding Cameras to ZoneMinder

I have a few Foscam FI8910W cameras, so I followed these instructions.

  • Check if /usr/share/perl5/ZoneMinder/Control/ exists on the server; if not, download from
  • Click Add New Monitor
    • In the General tab, give the camera a Name, set Source Type to Remote, and set Function to Mocord (Record and Motion Detection)
    • In the Source tab,
      • Enter in the camera’s IP address in Remote Host Name
      • In the Remote Path field, enter /videostream.cgi?user=visitor_user&pwd=visitor_pw, where visitor_user and visitor_pw are the camera’s login credentials with at least visitor status
      • Enter 640 and 480 for width and height, and select Preserve Aspect Ratio
    • In the Control tab,
      • Set Controllable
      • Select Foscam FI8918W for Control Type
      • Set Control Address to the camera’s IP address
      • Set Control Device to operator_user:operator_pw where operator_user and operator_pw are the camera’s login info with at least operator status (this format is camera firmware specific)

Now, on the main ZoneMinder page, we can view the stream by clicking on each of the cameras. From there, we can also view previously recorded events.

On the ZoneMinder server, the recorded events (videos) are saved at /var/cache/zoneminder/events (many jpeg files since we are using the jpeg protocol of the Foscam camera).

Remote viewing

In order to view ZoneMinder outside the home network, set port forwarding on the local router. For example, one could use the ZmView app on an Android device to view the cameras remotely at https://server.ip:port/zm.

For remote viewing, it might be better to view directly from the cameras (so long as they are set up securely), since viewing the videos through ZoneMinder will increase the CPU load. Remember, the primary purpose of ZoneMinder is to record the video footage in case it is needed a month later.

Secure access to home IP cameras over the web with SSL

IP cameras are cheap these days. I have a few Foscam FI8910W cameras at home as part of my home security system. A friend recently gave me a Foscam C1 to be used as a baby monitor. I try to avoid using the manufacturer’s mobile device apps for viewing the video feeds because 1) I don’t want my private information (video, IP address, credentials, etc) exchanged or leaked to the manufacturer and 2) I don’t want to have to use multiple apps if I buy multiple cameras from different manufacturers. I currently use tinyCam Monitor PRO on my Android device to view the video feed both on the home network and while I’m away. It works very well, is customizable, and supports many different manufacturers.

I took precautions when setting up the cameras on my home network since I know these cameras could be easily hacked with a default setup:

  • Secure home network with a router running Tomato Firmware
  • Use the latest firmware on the cameras
  • Change the default username and password on the cameras
  • Avoid using any of the manufacturer’s services on the camera that makes it easy to view the video feed on the web (eg, DDNS)
  • Use SSL (https) to encrypt the internet traffic/video feed

Regarding the use of SSL, newer cameras like the Foscam C1 should support it by default, so making them world-accessible is as easy as setting up a port forward on the router to the IP camera’s port 443 (or whatever the https port is). However, for older cameras like the Foscam FI8910W that don’t support SSL, one can set up an encrypted reverse proxy using a web server like nginx. Luckily, nginx is available on my router running Tomato Firmware, so I don’t have to use a separate Linux server. Here’s how I set this up on my router running Tomato:

  • On the Administration page of the router running Tomato, enable both http and https access to the router’s administration. Check the “Save to NVRAM” box. The point of this is to create a certificate and key on the router that we could use with nginx so we don’t have to generate these files elsewhere. I ssh’d into the router and found the following files: /tmp/etc/cert.pem and /tmp/etc/key.pem.
  • Go to the Web Server page of the Tomato router, and follow this guide step by step to incrementally set up a reverse proxy on nginx. Paste the final server code inside the “http” section of the web server page on Tomato (assuming port 1234 is the external port for the camera and 88 is the camera’s internal web port on its LAN IP address):

server {
   listen 1234; # doesn't have to be port 443 - could be any port (say 8080) if you
                # connect via https://your.external.ip:8080. But on port 443
                # you can just use https://your.external.ip
   ssl on;
   ssl_certificate  /tmp/etc/cert.pem;
   ssl_certificate_key  /tmp/etc/key.pem;
   # certificate and private key where you just placed them

   ssl_session_timeout 5m;
   ssl_protocols TLSv1 TLSv1.1 TLSv1.2;
   ssl_prefer_server_ciphers on;
   # reasonable SSL configuration, disable some known weak ciphers
   ssl_ciphers "EECDH+ECDSA+AESGCM EECDH+aRSA+AESGCM EECDH+ECDSA+SHA384 EECDH+ECDSA+SHA256 EECDH+aRSA+SHA384 EECDH+aRSA+SHA256 EECDH+aRSA+RC4 EECDH EDH+aRSA RC4 !aNULL !eNULL !LOW !3DES !MD5 !EXP !PSK !SRP !DSS";

   location / {
      # CAMERA_LAN_IP is a placeholder for the camera's internal IP address
      proxy_pass http://CAMERA_LAN_IP:88;
      proxy_redirect http://CAMERA_LAN_IP:88/ $scheme://$host:$server_port/;
      # If your webcam (or other proxied application) ever gives URLs in the HTTP headers
      # (e.g., in a 302 or 304 HTTP redirect), the proxy_redirect line changes the URL
      # in the HTTP header field from the internal address to the external one
   }
}

  • Every camera should have a separate server chunk like the previous.
  • Set up port forwarding on the router: external port 1234 to port 1234 on the router’s own IP address (where nginx is listening).
  • To make the router respond to the ports (eg, 1234) when requested from the outside world, we need to add a rule to iptables by pasting the following into the “Firewall” section of the Administration > Scripts page (to have it run immediately without a reboot, also ssh into the router and execute it):
iptables -t filter -A INPUT -p tcp --dport 1234 -j ACCEPT
  • Do the previous step for each camera or port. If this is not done, then the external port will not work when we access it outside the home network (the point of all this!).
  • Turn on the nginx web server on the Web Server page.
  • When away, go to a web browser or mobile app and point to https://external_ip:1234 (enable SSL if using a mobile app) and the video feed should be encrypted and available.

Using Vagrant to scale data science projects with Google Cloud Compute

When I was in graduate school, I made heavy use of the school’s computing infrastructure for my research, scheduling many simulation jobs that utilized multiple (if not all) compute nodes in parallel using Grid Engine. In my professional life, I’ve always pushed to have a computing environment dedicated to research and analysis. This was typically a Linux or UNIX server with many CPU cores (as many as 64) and as much RAM as I could get. The difficulty of getting approval for such an environment depended on the company’s culture, so YMMV. The beauty of the work setup over the graduate school setup is that a job scheduler was never needed, as the number of concurrent users vying for compute cycles is drastically lower at work. When building a computing environment, I always try to build the beefiest server possible (translation: that I could get approval for) because I never want to run into a project that the server couldn’t handle (eg, loading a very large data set into memory with R). However, it’s hard to future-proof all projects completely, so the line had to be drawn somewhere, and thus the number of CPU cores and the amount of memory had a limit.

Now, with the ubiquity of computing environments offered by different cloud providers (eg, Amazon EC2, Google Compute Engine, Microsoft Azure, HP Public Cloud, Sense, and Domino Data Labs), spinning up an on-demand compute server or a cluster of nodes to scale data analysis projects is pretty simple, straightforward, and relatively cheap (no need to invest thousands of dollars to build a beefy server that reaches full capacity in <1% of the time). One could leverage these cloud services both in the work environment (if one could get it approved) and for personal use (eg, a Kaggle competition).

Sense and Domino Data Labs have servers pre-configured with many standard open source tools for data analysis. They are good options for quickly jumping into analysis. With the more generic cloud providers, one typically spins up a server with a standard Linux OS and then proceeds to install and configure the necessary tools. To streamline the scaling process, Vagrant allows one to spin up, provision, manage, and destroy servers from the various providers in a simple and consistent manner. I’ll illustrate how one might use Vagrant to spin up a compute server on the Google Compute Engine (GCE) to analyze data. I’m choosing GCE at the time of this writing only because it appears to be the cheapest and because it charges by the minute (10-minute minimum), unlike Amazon EC2, which charges by the hour.

gcloud init # authenticate; choose a default zone, I chose "us-central1-b", which will show up in subsequent sections
  • Install the Vagrant-Google plugin. On Windows, Vagrant-Google version 1.8.1+ is required, which hasn’t been released at the time of this writing. A work-around is to spin up an Ubuntu server using Vagrant and Virtualbox on Windows, install Vagrant in that Ubuntu server, and go from there. Run the following to install the Vagrant-Google plugin:
vagrant plugin install vagrant-google
  • Follow the instructions here to create a Service Account within the GCE console (website), download the json private key (used by the Google Cloud SDK, call it google_json_key.json), copy the api service account email address and project ID, and add your ssh public key (eg, ~/.ssh/ to the Compute Engine’s metadata.
  • Add the GCE box (template) to Vagrant:
vagrant box add gce
  • Create a project folder (eg, vagrant_google) and cd into it. Everything in this folder will get synced to /vagrant/ for any provisioned server.
  • Create the file Vagrantfile:
# -*- mode: ruby -*-
# vi: set ft=ruby :
Vagrant.configure("2") do |config| = "gce"

  config.vm.provider :google do |google, override|
    google.google_project_id = "MY_PROJECT_ID"
    google.google_client_email = "MY_SERVICE_ACCOUNT_EMAIL"
    google.google_json_key_location = "/absolute/path/to/google_json_key.json" = "us-central1-b"
    google.machine_type = "n1-standard-1"
    # google.machine_type = "n1-highmem-2"
    # google.machine_type = "n1-standard-16"
    # google.machine_type = "n1-highmem-8" = "gce-instance"
    google.image = "ubuntu-1404-trusty-v20160222" # may change, use "gcloud compute images list" to check
    # google.image = "image-ds"
    google.disk_name = "disk-ds" # ds for data science
    google.disk_size = "10"
    override.ssh.username = "MY_USERNAME"
    override.ssh.private_key_path = "/path/to/.ssh/id_rsa"

  config.vm.provision :shell, path: "" # provisioning script (name assumed)
  • Create the provisioning shell script (this gets run whenever a server is provisioned via vagrant up):
#! /usr/bin/env bash

touch /tmp/foo # whoami?

# configure mail so I could communicate status updates
cat > /tmp/nail.rc <<EOF
set smtp-use-starttls
set smtp-auth=login
set smtp=smtp://
set smtp-auth-user=MY_GMAIL_ADDRESS
set smtp-auth-password="MY_GMAIL_PW"
# the provisioning script runs as root, so append directly
cat /tmp/nail.rc >> /etc/nail.rc
rm /tmp/nail.rc

# configure screen
cat > ~/.screenrc <<EOF
escape ^lL
bind c screen 1
bind 0 select 10
screen 1
select 1
autodetach on
startup_message off
  • Run vagrant up and Vagrant will provision the server.
  • Run vagrant ssh to log into the server via ssh.
  • Install the necessary tools:
# Software installation
sudo su
echo "deb trusty/" > /etc/apt/sources.list.d/r.list # CRAN Ubuntu repo
apt-key adv --keyserver --recv-keys E084DAB9
apt-get update 
apt-get -y install r-base-dev libcurl4-openssl-dev libssl-dev git
apt-get -y install heirloom-mailx # setup mail
exit # sudo

# R
sudo dpkg -i MRO-3.2.3-Ubuntu-14.4.x86_64.deb
tar -xzf RevoMath-3.2.3.tar.gz
cd RevoMath
sudo ./ # choose option 1, then agree
cd ..
rm -rf RevoMath
sudo R --vanilla <<EOF
install.packages(c("data.table","readr","randomForest","gbm","glmnet","ROCR","devtools"), repos="")
# options(unzip = 'internal') # vinh
# devtools::install_github("dmlc/xgboost", subdir = "R-package")
install.packages("drat", repos="")
install.packages("xgboost", repos="", type="source")

# Python
# scroll and yes

# VW
sudo apt-get -y install libtool libboost1.55-*

# Java
sudo add-apt-repository ppa:webupd8team/java
sudo apt-get update
sudo apt-get -y install oracle-java7-installer

# H2O
sudo R --vanilla <<EOF
# The following two commands remove any previously installed H2O packages for R.
if ("package:h2o" %in% search()) { detach("package:h2o", unload=TRUE) }
if ("h2o" %in% rownames(installed.packages())) { remove.packages("h2o") }

# Next, we download packages that H2O depends on.
pkgs <- c("methods","statmod","stats","graphics","RCurl","jsonlite","tools","utils")
for (pkg in pkgs) {
    if (! (pkg %in% rownames(installed.packages()))) { install.packages(pkg, repos='') }

# Now we download, install and initialize the H2O package for R.
install.packages("h2o", type="source", repos=(c("")))
  • Exit out of the ssh session. The server is still running. Let’s set the root disk to not delete when the server is destroyed. Then save the root disk as an image. With this saved private image, one can provision multiple servers using the same image without having to re-install all the previously installed software on gce-instance. Once we have a saved image, restart the instance with the previous root disk, set the disk to automatically delete, and then delete the instance again (this time, the disk will also be destroyed). What we have left is an image that could be re-used. Image storage is not that costly if one wants to keep it around for quickly provisioning their computing environment.
gcloud compute instances set-disk-auto-delete gce-instance --no-auto-delete --disk disk-ds # don't delete root disk
vagrant destroy --force
gcloud compute images create image-ds --source-disk disk-ds --source-disk-zone us-central1-b
gcloud compute images list
vagrant up
gcloud compute instances set-disk-auto-delete gce-instance --auto-delete --disk disk-ds
vagrant destroy # now disk is also destroyed, only image is left.
  • With the saved private image, we could modify Vagrantfile to use this image whenever we spin up a server:
# -*- mode: ruby -*-
# vi: set ft=ruby :
Vagrant.configure("2") do |config| = "gce"

  config.vm.provider :google do |google, override|
    google.google_project_id = "MY_PROJECT_ID"
    google.google_client_email = "MY_SERVICE_ACCOUNT_EMAIL"
    google.google_json_key_location = "/absolute/path/to/google_json_key.json" = "us-central1-b"
    google.machine_type = "n1-standard-1"
    # google.machine_type = "n1-highmem-2"
    # google.machine_type = "n1-standard-16"
    # google.machine_type = "n1-highmem-8" = "gce-instance"
    # google.image = "ubuntu-1404-trusty-v20160222"
    google.image = "image-ds" # the saved private image
    google.disk_name = "disk-ds" # ds for data science
    google.disk_size = "10"
    override.ssh.username = "MY_USERNAME"
    override.ssh.private_key_path = "/path/to/.ssh/id_rsa"

  config.vm.provision :shell, path: "" # provisioning script (name assumed)

When one needs more CPUs, memory, or disk space, just modify google.machine_type and google.disk_size per the pricing sheet.

With this setup, I can put relevant files in vagrant_google/my_project. My current work flow is something like the following:

vagrant up # provision server
vagrant ssh # log in
screen # start a persistent terminal session
# paste the following; the paths and email address are assumed values
projdir=/vagrant/my_project
logdir=$projdir/log
mailaddr=MY_EMAIL_ADDRESS
mkdir -p $logdir

# paste in code that I want to run, eg
R --no-save < $projdir/myfile.R > $logdir/myfile.Rlog
python $projdir/ > $logdir/myfile.pylog
echo '' | mailx -s 'Job Done' $mailaddr
# detach screen session

When the job is done, I will receive a notification email. While the job is running, I can use vagrant ssh to log into the server and re-attach the running screen session to inspect what is going on or make changes. When I’m done, I can just destroy the server (the content of /vagrant/ will get re-synced to vagrant_google). I could also create batch jobs that start the server, run the code, notify me, and destroy the server in one shot if I didn’t want to run the job interactively.

Conclusion: With Vagrant, open source tools, and the cloud providers, I can spin up as much compute resources as I need in order to scale out my data analysis project. If I had the need to build stand-alone applications, I could also incorporate Docker.

Google Voice using SIP via Simonics

Google Voice ended third-party support for XMPP clients in May 2014. My Asterisk auto-attendant remained functional into 2015. However, it finally broke for me: whether or not people call in, some digits get pressed randomly in Asterisk (probably because of some negotiation between Asterisk and GV via XMPP), and calls are forwarded sporadically throughout the day (what I call phantom calls). Is there a way to still leverage Google Voice as my auto-attendant? I found this post about Simon Telephonics offering a GV gateway using SIP. There is a $5 setup fee. I paid it and am now able to continue using Google Voice with Asterisk as my auto-attendant. It also works with any SIP phone or SIP app (eg, CSipSimple). I like this solution because I still get free calls through Google Voice without having to pay another third party for my calls.

To set up on Asterisk, empty out /etc/asterisk/motif.conf and /etc/asterisk/xmpp.conf. Then edit /etc/asterisk/sip.conf:

; simonics

type=friend
secret=simonics_given_pw
context=tnttsp ; context in extensions.conf

1234567890 is the assumed Google Voice telephone number.

In my previous dial plan (/etc/asterisk/extensions.conf), I had to comment out the third line:

exten => s,1,Answer()
exten => s,n,Wait(1)
;; exten => s,n,SendDTMF(1) ;; needed for jabber/xmpp/motif, not sip via simonics

SendDTMF(1) was previously required. However, Simon Telephonics sets everything up with Google Voice so this isn’t required.

In the same file, the outbound Dial line (eg, to call 1987654321) also had to change from the Motif/XMPP channel to the SIP trunk.
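The original before/after lines did not survive in this note; as a purely hypothetical sketch (the exact extension pattern and channel syntax depend on your dialplan), the switch looks something like:

```
; hypothetical before/after for the outbound dial:
; exten => _1NXXNXXXXXX,1,Dial(Motif/google_voice/${EXTEN}   ; old XMPP route
exten => _1NXXNXXXXXX,1,Dial(SIP/GV1234567890/${EXTEN})                  ; SIP via Simonics
```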
All is good!

Compile R 3.2.2 on AIX 6.1

Here are my notes on compiling 64-bit R 3.2.2 on AIX 6.1. As a pre-requisite, read the AIX notes in R-admin. As in those notes, I had GCC installed from here by our admin, along with many other pre-requisites; these were installed prior to compiling R. Note that you can grab newer versions of each package online (needed for R-devel).

## list of packages

## converting unicode, ascii, etc; the AIX version is not compatible

## dependency for unicode support; just need to extract to root /

## need the gnu version of make

## libm

## jre

#### python

Add /opt/freeware/bin to PATH.

Now, download the R source tarball, extract, and cd.

Next, update src/main/dcf.c from R-devel, as a bug causes readDCF to segfault when using install.packages; no need to do this for future versions of R. Then apply this patch from here to fix the following error:

gcc -maix64 -pthread -std=gnu99 -I../../../../include -DNDEBUG
-I../../../include -I../../../../src/include -DHAVE_CONFIG_H
-I../../../../src/main -I/usr/local/include  -mminimal-toc    -O2 -g
-mcpu=power6  -c gramRd.c -o gramRd.o
gcc -maix64 -pthread -std=gnu99 -shared -Wl,-brtl -Wl,-G -Wl,-bexpall
-Wl,-bnoentry -lc -L/usr/local/lib -o text.o init.o Rmd5.o
md5.o signals.o install.o getfmts.o http.o gramLatex.o gramRd.o -lm
make[6]: Entering directory '/sas/data04/vinh/R-3.2.2/src/library/tools/src'
mkdir -p -- ../../../../library/tools/libs
make[6]: Leaving directory '/sas/data04/vinh/R-3.2.2/src/library/tools/src'
make[5]: Leaving directory '/sas/data04/vinh/R-3.2.2/src/library/tools/src'
make[4]: Leaving directory '/sas/data04/vinh/R-3.2.2/src/library/tools'
make[4]: Entering directory '/sas/data04/vinh/R-3.2.2/src/library/tools'
installing 'sysdata.rda'
Error: Line starting 'Package: tools ...' is malformed!
Execution halted
../../../share/make/ recipe for target 'sysdata' failed
make[4]: *** [sysdata] Error 1
make[4]: Leaving directory '/sas/data04/vinh/R-3.2.2/src/library/tools'
Makefile:30: recipe for target 'all' failed
make[3]: *** [all] Error 2
make[3]: Leaving directory '/sas/data04/vinh/R-3.2.2/src/library/tools'
Makefile:36: recipe for target 'R' failed
make[2]: *** [R] Error 1
make[2]: Leaving directory '/sas/data04/vinh/R-3.2.2/src/library'
Makefile:28: recipe for target 'R' failed
make[1]: *** [R] Error 1
make[1]: Leaving directory '/sas/data04/vinh/R-3.2.2/src'
Makefile:59: recipe for target 'R' failed
make: *** [R] Error 1

Hopefully, this patch will make it to R-dev so that it is no longer needed for future versions of R.

export OBJECT_MODE=64
export CC="gcc -maix64 -pthread"
export CXX="g++ -maix64 -pthread"
export FC="gfortran -maix64 -pthread"
export F77="gfortran -maix64 -pthread"
export CFLAGS="-O2 -g -mcpu=power6"
export FFLAGS="-O2 -g -mcpu=power6"
export FCFLAGS="-O2 -g -mcpu=power6"
./configure --prefix=/path/to/opt ## custom location so I don't need root
make -j 16
make install
## add /path/to/opt/bin to PATH

The last step may complain that NEWS.pdf is not found in a directory and that a certain directory is not found in the destination. For the former, just touch NEWS.pdf where it’s supposed to be; for the latter, create the directory yourself.

Automatically specify line break options with termstr as CRLF or LF in SAS when importing data

It could be annoying when dealing with data from multiple platforms: Windows uses the carriage return (CR) and line feed (LF) to indicate a new line, UNIX uses LF, and Mac uses CR. Most companies have SAS running on a UNIX/Linux server, and it’s hard to tell which characters indicate a new line without going to a terminal to inspect the file.

Here’s a sas macro that creates a filename handle that could be used in PROC IMPORT or a DATA step. It will automatically detect CRLF and if not, default to LF. This assumes SAS is running on a UNIX server with access to the head and awk commands.

%macro handle_crlf(file, handle_name, other_filename_options=) ;
/* if there is a carriage return at the end, then return 1 (stored in macro variable SYSRC) */
%sysexec head -n 1 "&file" | awk '/\r$/ { exit(1) }' ;
%if &SYSRC=1 %then %let termstr=crlf ;
%else %let termstr=lf ;
filename &handle_name "&file" termstr=&termstr &other_filename_options ;
%mend ;
%handle_crlf(file=/path/to/file.txt, handle_name=fh) ;
proc import datafile=fh dbms=dlm out=d1 replace ;
   delimiter='|' ;
run ;
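For comparison, the same CRLF-vs-LF check can be done outside SAS; this is a sketch in Python mirroring what the macro passes to termstr=:

```python
def detect_termstr(path):
    """Return 'crlf' if the file's first line ends in CR+LF, else 'lf'."""
    with open(path, 'rb') as f:
        first = f.readline()  # readline in binary mode keeps the terminator
    return 'crlf' if first.endswith(b'\r\n') else 'lf'
```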

Repair line breaks within a field of a delimited file

Sometimes some people generate delimited files with line break characters (carriage return and/or line feed) inside a field without quoting. I previously wrote about the case when the problematic fields are quoted. I also wrote about using non-ascii characters as field and new record indicators to avoid clashes.

The following script reads stdin and writes repaired lines to stdout, ensuring every output line has at least as many delimiters (|) as the first/header line (call this the target number of delimiters). It does so by continually concatenating lines (removing line breaks) until the target number of delimiters is reached. The script is slightly more involved than expected in order to handle two cases: a field containing more than one line break (so keep concatenating rather than joining just one line), and a line with more delimiters than the target (which could lead to an infinite loop if we required the count to exactly equal the target).

#! /usr/bin/env python

import sys
from signal import signal, SIGPIPE, SIG_DFL
signal(SIGPIPE, SIG_DFL) ## no error when exiting a pipe like less

dlm = '|' ## field delimiter

## target number of delimiters, taken from the first/header line
line = sys.stdin.readline()
n_dlm = line.count(dlm)

buf = line
line_next = sys.stdin.readline()
while buf:
    ## keep concatenating (removing line breaks) until the accumulated line
    ## has at least the target number of delimiters; lines that already have
    ## more than the target are emitted as-is (avoids an infinite loop)
    while line_next and buf.count(dlm) < n_dlm:
        buf = buf.replace('\r', ' ').replace('\n', ' ') + line_next
        line_next = sys.stdin.readline()
    sys.stdout.write(buf)
    buf = line_next
    line_next = sys.stdin.readline()

Calculate the weighted Gini coefficient or AUC in R

This post on Kaggle provides R code for calculating the Gini coefficient for assessing a prediction rule, and this post provides R code for the weighted version (think exposure for frequency and claim count for severity in non-life insurance modeling). Note that the weighted version is not well-defined when there are ties in the predictions and the corresponding weights vary, because different Lorentz curves (gains charts) could be drawn for different orderings of the observations; see this post for an explanation and some examples.

Now, to explain the code. The calculation of the x values (variable random, the cumulative proportion of observations or weights) and the y values (variable Lorentz, the cumulative proportion of the response, ie, the 1’s or positive values) is straightforward. To calculate the area between the Lorentz curve and the diagonal line, one can use the trapezoidal rule to calculate the area between the Lorentz curve and the x-axis and then subtract the area of the lower triangle (1/2):

\begin{align}
Gini &= \sum_{i=1}^{n} (x_{i} - x_{i-1}) \left[\frac{L(x_{i}) + L(x_{i-1})}{2}\right] - \frac{1}{2} \\
&= \frac{1}{2} \sum_{i=1}^{n} \left[ L(x_{i})x_{i} + L(x_{i-1})x_{i} - L(x_{i})x_{i-1} - L(x_{i-1})x_{i-1} \right] - \frac{1}{2} \\
&= \frac{1}{2} \sum_{i=1}^{n} \left[ L(x_{i})x_{i} - L(x_{i-1})x_{i-1} \right] + \frac{1}{2} \sum_{i=1}^{n} \left[ L(x_{i-1})x_{i} - L(x_{i})x_{i-1} \right] - \frac{1}{2} \\
&= \frac{1}{2} L(x_{n})x_{n} + \frac{1}{2} \sum_{i=1}^{n} \left[ L(x_{i-1})x_{i} - L(x_{i}) x_{i-1} \right] - \frac{1}{2} \\
&= \frac{1}{2} \sum_{i=1}^{n} \left[ L(x_{i-1})x_{i} - L(x_{i}) x_{i-1} \right]
\end{align}

where the last equality comes from the fact that \(L(x_{n}) = x_{n} = 1\) for the Lorentz curve/gains chart. The remaining summation thus corresponds to sum(df$Lorentz[-1]*df$random[-n]) - sum(df$Lorentz[-n]*df$random[-1]) inside the WeightedGini function, since the \(i=1\) term in the summation is 0 (\(x_{0}=0\) and \(L(x_{0})=0\) for the Lorentz curve), yielding \(n-1\) terms in the code.

For the unweighted case, applying the trapezoidal rule on the area between the Lorentz curve and the diagonal line yields:

\begin{align}
Gini &= \sum_{i=1}^{n} \frac{1}{n} \frac{\left[ L(x_{i}) - x_{i} \right] + \left[ L(x_{i-1}) - x_{i-1} \right]}{2} \\
&= \frac{1}{2n} \sum_{i=1}^{n} \left[ L(x_{i}) - x_{i} \right] + \frac{1}{2n} \sum_{i=1}^{n} \left[ L(x_{i-1}) - x_{i-1} \right] \\
&= \frac{1}{2n} \sum_{i=1}^{n} \left[ L(x_{i}) - x_{i} \right] + \frac{1}{2n} \left[ L(x_{0}) - x_{0} \right] + \frac{1}{2n} \sum_{i=1}^{n-1} \left[ L(x_{i}) - x_{i} \right] \\
&= \frac{1}{2n} \sum_{i=1}^{n} \left[ L(x_{i}) - x_{i} \right] + \frac{1}{2n} \left[ L(x_{0}) - x_{0} \right] + \frac{1}{2n} \sum_{i=1}^{n-1} \left[ L(x_{i}) - x_{i} \right] + \frac{1}{2n} \left[ L(x_{n}) - x_{n} \right] \\
&= \frac{1}{2n} \sum_{i=1}^{n} \left[ L(x_{i}) - x_{i} \right] + \frac{1}{2n} \left[ L(x_{0}) - x_{0} \right] + \frac{1}{2n} \sum_{i=1}^{n} \left[ L(x_{i}) - x_{i} \right] \\
&= \frac{1}{n} \sum_{i=1}^{n} \left[ L(x_{i}) - x_{i} \right]
\end{align}

where we repeatedly used the fact that \(L(x_{0}) = x_{0} = 0\) and \(L(x_{n}) = x_{n} = 1\) for a Lorentz curve and that \(1/n\) is the width between points (change in cdf of the observations). The summation is what is returned by SumModelGini.

Note that neither \(1/2\) nor \(1/n\) is multiplied into the sums in the weighted and unweighted functions since most people will use the normalized versions, in which case these factors just cancel.
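To make the computation concrete, here is a pure-Python sketch mirroring the R code described above (plain lists instead of a data frame; tie handling is left to the sort order, per the caveat earlier):

```python
def weighted_gini(actual, weights, predicted):
    """Unnormalized weighted Gini, mirroring the R WeightedGini function:
    sort by prediction (descending), build the cumulative weight shares (x)
    and cumulative weighted-response shares (Lorentz), then compute
    sum(Lorentz[-1]*random[-n]) - sum(Lorentz[-n]*random[-1])."""
    order = sorted(range(len(actual)), key=lambda i: -predicted[i])
    a = [actual[i] for i in order]
    w = [weights[i] for i in order]
    total_w = float(sum(w))
    total_pos = float(sum(ai * wi for ai, wi in zip(a, w)))
    x, lorentz = [], []
    cw = cp = 0.0
    for ai, wi in zip(a, w):
        cw += wi
        cp += ai * wi
        x.append(cw / total_w)
        lorentz.append(cp / total_pos)
    n = len(x)
    return (sum(lorentz[i + 1] * x[i] for i in range(n - 1))
            - sum(lorentz[i] * x[i + 1] for i in range(n - 1)))

def normalized_weighted_gini(actual, weights, predicted):
    # normalizing cancels the dropped 1/2 factor (and the sign convention)
    return weighted_gini(actual, weights, predicted) / weighted_gini(actual, weights, actual)
```

A perfect ordering of the predictions gives a normalized value of 1, and a perfectly reversed ordering gives -1.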