optparse R package for creating command line scripts with arguments

Just discovered the optparse package in R, which lets me write a command-line R script that accepts arguments (similar to Python’s argparse). Here’s an example:

#! /usr/bin/env Rscript

# http://cran.r-project.org/web/packages/optparse/vignettes/optparse.pdf
library(optparse)

# the default is converted to a string so it matches type='character'
option_list <- list(
    make_option(c('-d', '--date'), action='store', dest='date', type='character', default=as.character(Sys.Date()), metavar='"YYYY-MM-DD"', help='As of date to extract data from.  Defaults to today.')
)

opt <- parse_args(OptionParser(option_list=option_list))

# print(opt$date)
cat(sprintf('select * where contract_date > "%s"\n', opt$date))

Save this as my_script.R, and do chmod +x my_script.R. Now check out ./my_script.R --help.

Package management in R and Python at work without root and behind firewall

My current job has strict security measures (no root access on the Linux server and no access to anything outside the company’s network), so it can be difficult to get the tools necessary for my work, namely R packages on CRAN and Python packages via pip.

On my Windows workstation, I was able to install R by downloading the installer online and Python via Cygwin. However, R and Python are unable to connect to the internet to download and install additional packages because of the company’s firewall. To get around this for R, I could:

  • add the flag --internet2 to the execution path in R’s shortcut,
  • call setInternet2(TRUE) in the R console, or
  • set the environment variable http_proxy=http://username:password@proxy_server:port/.

The first two tell R to use the proxy defined in Internet Explorer. I was able to access CRAN via my web browser, so this works. If CRAN is blocked in the browser, find out what proxy server is available at work and use that to access the outside world. If CRAN is also blocked on the proxy, put in a request to add it to the whitelist.

As for Python, install pip and use a proxy to download and install packages:

wget https://bootstrap.pypa.io/get-pip.py
python get-pip.py --proxy="username:password@proxy_server:port"
pip install --proxy="username:password@proxy_server:port" argparse numpy pandas ## etc

NOTE: pip 1.3.1 has issues with proxy servers, so use the latest version.

On a Linux/Unix server, the added complication is the lack of root access. Typically, Python is available by default on any modern distro. If not, have the admin team install R and Python via the distribution’s package manager, and if they can’t, then compile the two from source and install them locally. Once installed, use the same method as before for Python pip, but with the --user flag so that packages are installed locally in ~/.local/ (the pip command is at ~/.local/bin/pip). For R, set the environment variables

export http_proxy="http://proxy_server:port/"
export http_proxy_user="username:password"

and install the libraries to ~/Rlib (add this to the library path via .libPaths() in ~/.Rprofile).
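The per-user setup above can be sketched in shell. This is only a rough sketch under some assumptions: setup_user_libs and prepend_path are hypothetical helper names (not part of R or pip), and the home directory is passed as an argument so the helper can be tried on a scratch directory first.

```shell
# Sketch only: setup_user_libs and prepend_path are hypothetical helpers.
setup_user_libs() {
    home_dir=$1
    rlib="$home_dir/Rlib"
    rprofile="$home_dir/.Rprofile"
    line='.libPaths(c("~/Rlib", .libPaths()))'

    # create the per-user R library directory
    mkdir -p "$rlib" || return 1

    # append the .libPaths() call once; repeated runs leave the file unchanged
    if ! grep -qF "$line" "$rprofile" 2>/dev/null; then
        printf '%s\n' "$line" >> "$rprofile"
    fi
}

# Put a directory (e.g. ~/.local/bin, where pip install --user places commands)
# at the front of PATH unless it is already there.
prepend_path() {
    case ":$PATH:" in
        *":$1:"*) ;;
        *) PATH="$1:$PATH" ;;
    esac
}

# usage:
#   setup_user_libs "$HOME"
#   prepend_path "$HOME/.local/bin"
```

The prepend_path call would normally go in a shell startup file such as ~/.profile so that the locally installed pip is found in every session.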

Parental control on home network

I recently looked into ways to block content on the home network. To protect the entire network, it seems like the filter should be placed on the router. This article on Lifehacker lists a few popular methods. I think filtering with OpenDNS is easy enough to get started with. The catch is that it’s quite easy to configure a connected computer to use a different DNS provider. That said, one could set a static DNS on a Tomato router.

Output data to Excel for reproducible post-hoc analysis or visualization

As much as I like to analyze and visualize data in R, I sometimes need to export results/data into Excel for my business partners or myself to consume in Excel or PowerPoint (e.g., to create custom, editable bar charts with various graphical overlays in a PowerPoint slide). As I’ve been using the XLConnect package to read xls/xlsx files, I’m also using it to write data to a sheet in an Excel workbook. I write the necessary data out to the Data sheet:

library(XLConnect)
## read data from files/DB and manipulate to get to the final data set
## now, write data
writeWorksheetToFile(file='foo.xlsx', data=iris, sheet='Data', clearSheets=TRUE) ## in case number of rows is less than before

I ask my collaborators not to edit the Data sheet, other than adding filters or sorting when they are inspecting/eyeballing the data, and to do all analysis in separate sheets. The reason for this is to keep the work reproducible in case the data needs to be refreshed (error in data, repeat the analysis on new data, etc). That is, when I need to refresh the data, I ask for the modified Excel workbook and write out refreshed data to the same sheet (hence the clearSheets=TRUE option in case there are fewer rows than before). That way, calculations or plots referencing columns in the Data sheet are automatically refreshed in the workbook with the refreshed data.

This is just another way to prevent inefficiency in the workflow, and it allows for reproducibility even when collaborating with Excel workbooks.

This workflow should in theory also work with SAS:

proc export data=mydata outfile='foo.xlsx'
    dbms=xlsx ;
    sheet=Data ;
run ;

First hackintosh with Windows dual boot using Intel NUC DC3217BY


My friend recently introduced me to the Intel NUC (DC3217BY). It’s basically a micro form factor barebone system that comes with Intel’s ULV i3 processor (powerful, with low power consumption). I decided to get one, slapped on 8 GB of RAM, a 256 GB mSATA SSD, and a Broadcom-based half-size mini PCIe network card, and Hackintosh’d it since the processor is similar to what’s in an Apple MacBook Air. Basic instructions for this particular hardware can be found here and here. A generic guide can be found here. For dual booting with Windows, this article and this post helped. This is what I recall doing to set it up:

  • Update the BIOS to the latest version
  • Create a bootable Windows 7 USB drive on Ubuntu using unetbootin (must be version 494; the drive must be formatted to NTFS)
  • Get Mac OS X 10.9.1 (Mavericks) from the Apple App store
  • Download Unibeast and Multibeast at tonymacx86
  • Download Chameleon Wizard
  • Download Kext Installer
  • On an existing machine with Mac OS X, use /Applications/Utilities/Disk Utility to format a (> 8 GB) USB flash drive to Mac OS Extended (Master Boot Record enabled).
  • Run Unibeast to load the Mac OS X installer on it
  • Copy Multibeast, Chameleon Wizard, and Kext Installer into this flash drive. Download DSDT.aml and the patched AppleIntelFramebufferCapri.kext here and place them on the flash drive as well (these are to get HDMI audio to work).
  • Boot up the flash drive, and boot the installer with the flags -x PCIRootUID=1 GraphicsEnabler=Yes per this post relevant to Mavericks
  • Once the Mac installer is booted, go to the Utilities Menu and launch Disk Utility. Format the hard drive into two partitions. The first should be called “Macintosh HD” and formatted to Mac OS Extended (Journaled) and the second should be called “BOOTCAMP” and formatted to MS-DOS (FAT).
  • Shut down, insert the Windows 7 USB drive, and install Windows on the second partition
  • Boot the Mac USB drive again, and install Mac OS X on the first partition
  • Boot the Mac USB drive again, and select to boot into the Mac OS X partition
  • Run Multibeast to do some post-configurations so that the hardware just works (options in image below)
  • Edit the org.chameleon.Boot via Chameleon Wizard per the image below
  • Copy DSDT.aml to /Extra/DSDT.aml
  • Install Kext Installer. Use it to install the patched AppleIntelFramebufferCapri.kext, then use it to rebuild permissions and kext cache. Restart the computer to have HDMI audio working.

Multibeast options: 2014-02-07-multibeast.png

Chameleon Wizard options for org.chameleon.Boot: 2014-02-07-chameleon-wizard.png

What would I do differently now? Consider getting a network card with Bluetooth like the Dell DW1702 per this. I’m not sure if my monitor has speakers, so this would enable me to use wireless speakers. Update (8/22/2014): I ordered this wifi + Bluetooth card (BCM943225 HMB/AzureWave AW-NB290) as it was cheap and this guide shows it works well on a Mac after installing toledaARPT.kext from the repo. Now I have both wifi and Bluetooth. The Windows driver for this card can be found here (direct link here).

Now, time to mount the small NUC to the back of my 27″ monitor.

Bulk resize images while keeping original files

Suppose you have a directory (with subdirectories) full of images, and you want to resize them all while keeping the original images. To do so, first create a copy of the directory tree without the image files. Then, using a for loop, find each image file and apply the convert command to it, writing the result into the mirrored tree. The following is an example that resizes jpg files to 40% of their original size.

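mkdir -p /path/to/mirror_dir
cd /path/to/image_dir
# mirror the directory tree; running find on "." keeps {} a relative path
find . -type d -exec mkdir -p /path/to/mirror_dir/{} \;
# note: this loop assumes file names contain no whitespace
for i in $(find . -iname "*.jpg"); do echo "$i"; convert "$i" -resize 40% "/path/to/mirror_dir/$i"; done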
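A for loop over the output of find splits file names on whitespace, so here is a rough sketch of a variant that copes with spaces in names (though not with newlines). The function name resize_tree and its three arguments are hypothetical, and it assumes ImageMagick’s convert is on the PATH.

```shell
# Sketch: mirror SRC's directory tree under DST, then write resized copies there.
# resize_tree and its arguments are hypothetical names, not from any tool.
resize_tree() {
    src=$1; dst=$2; pct=$3
    mkdir -p "$dst" || return 1
    # make DST absolute so it stays valid after the cd below
    dst=$(cd "$dst" && pwd) || return 1
    (
        cd "$src" || exit 1
        # recreate the directory structure without any files
        find . -type d | while IFS= read -r d; do
            mkdir -p "$dst/$d"
        done
        # read one path per line so spaces in names survive
        find . -type f -iname '*.jpg' | while IFS= read -r f; do
            convert "$f" -resize "$pct" "$dst/$f"
        done
    )
}

# usage: resize_tree /path/to/image_dir /path/to/mirror_dir 40%
```

The originals are only ever read, so the source directory is left untouched.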

Skeleton to create fast automatic tree diagrams using R and Graphviz


I’ve had to create tree diagrams (dendrograms, decision trees) many times in the past to illustrate the flow of data or decisions (e.g., data flow for a study). This is usually a manual task done in MS PowerPoint or Visio. I’ve also made some diagrams using Graphviz, based on the DOT language, to make creation more reproducible. However, that still felt pretty manual.

I decided to come up with a skeleton framework to generate these diagrams using R since I could connect to various data sources, do calculations, and mash up outputs fairly fast with it.

Here is my framework illustrated by an example:

# template of the diagram; each %s is filled in from string_list below
digraph <- '# dot -Tpng diagram.gv > diagram.png
digraph g {
graph [rankdir="LR"]
node [shape="rectangle" style=filled color=blue fontcolor=white] ;
n [label="%s"] ;
n_a [label="%s" color=red] ;
n_b [label="%s"] ;
n_aa [label="%s" color=red] ;
n_ab [label="%s" color=red] ;
n_ba [label="%s"] ;
n_bb [label="%s"] ;
n -> n_a [label="%s"];
n -> n_b [label="%s"];
n_a -> n_aa [label="%s"];
n_a -> n_ab [label="%s"];
n_b -> n_ba [label="%s"] ;
n_b -> n_bb [label="%s"] ;
}'

# first element is the sprintf format; the rest fill the %s slots in order
string_list <- list(
    digraph
    , 'All calls\\nn=100,000'
    , 'Night\\nn=20,000'
    , 'Day\\nn=80,000'
    , 'Closed\\nn=5,000'
    , 'Open\\nn=15,000'
    , 'Closed\\nn=10,000'
    , 'Open\\nn=7,000'
    , 'Night'
    , 'Day'
    , 'Closed'
    , 'Open'
    , 'Closed'
    , 'Open'
)

# http://stackoverflow.com/questions/10341114/alternative-function-to-paste
dot_file <- do.call(sprintf, string_list)
# write the filled-in template to diagram.gv (and echo it to the console)
sink('diagram.gv', split=TRUE)
cat(dot_file)
sink()
# run: dot -Tpng diagram.gv > diagram.png

This will write a file called diagram.gv:

# dot -Tpng diagram.gv > diagram.png
digraph g {
graph [rankdir="LR"]
node [shape="rectangle" style=filled color=blue fontcolor=white] ;
n [label="All calls\nn=100,000"] ;
n_a [label="Night\nn=20,000" color=red] ;
n_b [label="Day\nn=80,000"] ;
n_aa [label="Closed\nn=5,000" color=red] ;
n_ab [label="Open\nn=15,000" color=red] ;
n_ba [label="Closed\nn=10,000"] ;
n_bb [label="Open\nn=7,000"] ;
n -> n_a [label="Night"];
n -> n_b [label="Day"];
n_a -> n_aa [label="Closed"];
n_a -> n_ab [label="Open"];
n_b -> n_ba [label="Closed"] ;
n_b -> n_bb [label="Open"] ;
}

Executing dot -Tpng diagram.gv > diagram.png, I will get the following output:


To make tree diagrams quickly, edit the structure of the diagram stored in the variable digraph. Dynamic text should be inserted using sprintf via the list string_list. For example, do all calculations and then format the results in string_list. Then the diagram can be generated very quickly and will be reproducible. Changing any data source or any calculation no longer requires manually re-creating the diagram!
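The “calculate first, then format into the template” idea is not specific to R. As a rough illustration only, here is the same pattern in plain shell with printf, trimmed to the top three nodes. The function name make_dot and the variable names are hypothetical, and the counts simply mirror the example above.

```shell
# Sketch: compute the counts first, then fill the DOT template with printf.
# make_dot and the variable names are hypothetical.
total=100000
night=20000
day=$((total - night))   # derived, so it updates when the inputs change

make_dot() {
    printf 'digraph g {\n'
    printf 'graph [rankdir="LR"]\n'
    printf 'node [shape="rectangle" style=filled color=blue fontcolor=white] ;\n'
    # \\n in the format emits a literal \n, which DOT renders as a line break
    printf 'n [label="All calls\\nn=%s"] ;\n' "$1"
    printf 'n_a [label="Night\\nn=%s" color=red] ;\n' "$2"
    printf 'n_b [label="Day\\nn=%s"] ;\n' "$3"
    printf 'n -> n_a [label="Night"];\n'
    printf 'n -> n_b [label="Day"];\n'
    printf '}\n'
}

# usage: make_dot "$total" "$night" "$day" > diagram.gv && dot -Tpng diagram.gv > diagram.png
```

Because day is derived from total and night, changing either input and re-running regenerates a consistent diagram without touching the template.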

Amp and USB Chargers

This is a good article that explains how USB charging works. Basically, avoid cheap chargers. For any reasonably good charger, the amperage rating doesn’t really matter so long as it meets or exceeds what the device requires; that is, use a charger rated for at least 0.5 A if the device requires 0.5 A (what the original charger supplies). Thus, it’s OK to use my 2 A chargers on most of my mobile devices so things charge faster!