Build 32 bit R on 64 bit Ubuntu using chroot

In the past, I’ve described how one could build multiarch (64 bit and 32 bit) versions of R on a 64 bit Ubuntu machine. The method based on this thread no longer works as of R 2.13 or 2.14, I believe. A few months ago, someone on #R on freenode (I forget who) suggested the chroot route (see this also). I recently tried it and wanted to document the procedure. Although this solution isn’t as nice as the previous multiarch route, it will suffice for now. With the chroot method, first compile the 64 bit version of R the usual way. For the 32 bit version of R, do:

<pre class="src src-sh"><span style="color: #ff4500;">#### </span><span style="color: #ff4500;">change my.username to your username, or modify path per your taste</span>

### create chroot jail sudo apt-get install dchroot debootstrap sudo mkdir ~/chroot-R32 sudo emacs -q -nw /etc/schroot/schroot.conf ## paste the following in the file: (no quotes) [natty] description=Ubuntu Natty location=/home/my.username/chroot-R32 priority=3 users=my.username groups=sbuild root-groups=root

## build a basic Ubuntu system in the chroot jail sudo debootstrap –variant=buildd –arch i386 natty /home/my.username/chroot-R32 http://ubuntu.cs.utah.edu/ubuntu/ ## pick a mirror from https://launchpad.net/ubuntu/+archivemirrors

## copy my source locations for apt sudo cp /etc/apt/sources.list /var/chroot/etc/apt/sources.list ## edit this new file if to reflect only the needed source

### do following steps whenever you need to access 32 bit R ## access to proc and dns sudo mount -o bind /proc /home/my.username/chroot-R32/proc sudo cp /etc/resolv.conf /home/my.username/chroot-R32/etc/resolv.conf ## go into jail; do this whenever you want sudo chroot /home/my.username/chroot-R32 dpkg-architecture ## make sure system is i386 ### now the root / location should reflect the jail

### following happens in jail ## tools needed to build R apt-get install gcc g++ gfortran libreadline-dev libx11-dev xorg-dev ## get svn to get latest r source code apt-get install git-core subversion

## compile 32 bit R cd home/ mkdir R32 cd R32 svn checkout https://svn.r-project.org/R/trunk/ r-devel cd r-devel/ apt-get install rsync ./tools/rsync-recommended ./configure make make install R
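Since we created a [natty] entry in /etc/schroot/schroot.conf above, entering the jail via schroot may be an alternative to the manual mount/chroot steps; I set up the entry but used chroot directly, so treat this as an untested sketch:

<pre class="src src-sh">## possibly simpler entry into the jail via the schroot entry above (untested)
schroot -c natty
</pre>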

How big is my /home/my.username/chroot-R32 folder? It is at 791 MB after the above steps. Let me know if you have suggestions for having both 32 bit and 64 bit R concurrently on Linux. I believe the Windows and Mac installers ship and compile both 32 bit and 64 bit versions of R. I’m surprised this isn’t the case for Linux.

Build multiarch R (32 bit and 64 bit) on Debian/Ubuntu

I have the 64 bit version of R compiled from source on my Ubuntu laptop. I recently had a need for a 32 bit build of R since a package I needed to compile and use only works in 32 bit. I thought it was readily available on Ubuntu since both 32 bit and 64 bit versions of R ship with the Windows and Mac OS X installers. I tried figuring out how to do so using the manual (R 2.13.1), but could not figure it out on my own. I sought help on R-devel and received some helpful responses.

Here is a quick reminder for myself:

<pre class="src src-sh"><span style="color: #ff4500;">## </span><span style="color: #ff4500;">download R source from tar ball or svn</span>

## working directory has R/trunk/ ## run in trunk ./tools/rsync-recommended sudo apt-get install ia32-libs lib32readline6-dev lib32ncurses5-dev lib32icu-dev gcc-multilib gfortran-multilib ## ubuntu does not have ia32-libs-dev # cd ../.. # mkdir R32 # cd R32 # ../R/trunk/configure r_arch=i386 CC=’gcc -std=gnu99 -m32′ CXX=’g++ -m32′ FC=’gfortran -m32′ F77=’gfortran -m32′ # make -j24 && sudo make install # cd .. # mkdir R64 # ../R/trunk/configure r_arch=amd64 # make -j24 && sudo make install ./configure r_arch=i386 CC=‘gcc -std=gnu99 -m32′ CXX=‘g++ -m32′ FC=‘gfortran -m32′ F77=‘gfortran -m32′ make -j24 && sudo make install make clean ./configure r_arch=amd64 make -j24 && sudo make install

You can build directly in R/trunk, but make sure you execute make clean first to clear out any previous builds. Not doing so gave me errors like:

 <pre class="src src-sh">/usr/bin/install: cannot create regular file

`../../include/i386/Rconfig.h': No such file or directory

Now, executing R will give me the 64 bit version of R (whatever was the last build). If I want to specify the arch, I can just issue the commands R --arch i386 or R --arch amd64. When launching R in emacs, do C-u M-x R and type in the --arch argument.
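To make the arch selection concrete (assuming both builds were installed to the same prefix as above):

<pre class="src src-sh">R              ## whatever arch was built and installed last
R --arch i386  ## 32 bit
R --arch amd64 ## 64 bit
</pre>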

Do a make uninstall to remove R prior to installing a new version of R to the same location.

R from source

The following are notes for myself.

I like to use the bleeding edge version of R:

<pre class="example">svn checkout https://svn.r-project.org/R/trunk/ r-devel

cd r-devel ./tools/rsync-recommended

Use svn update (in r-devel) to update the sources later.

Pre-reqs:

<pre class="example">sudo apt-get build-dep r-base
sudo apt-get install gcc g++ gfortran libreadline-dev libx11-dev xorg-dev
sudo apt-get install texlive texinfo
</pre>

Then build and install:

<pre class="example">./configure
make
sudo make install
</pre>

My own programming style convention for most languages

I write code mainly in R, and from time to time, in C, C++, SAS, bash, python, and perl. There are style guides out there that help make your code more consistent and readable to yourself and others. Here is a style guide for C++, here is Google’s style guide for R, and here is Hadley Wickham’s guide for R. For R, I agree more with Google’s style guide than Hadley Wickham’s because I absolutely hate typing the underscore (personal preference) and because Google’s style guide seems more related to the C++ guide. Style guides differ by language because the languages are different (restrictions on names, etc.).

My brain goes crazy if I have to remember and apply multiple styles, so I want a convention that I can use consistently across all languages. This boils down to refraining from using special characters such as "-", ".", and "_" in names, as these characters can have special meaning in different languages (a short R sketch follows the list). Here it goes:

  • constants: the letter k or n followed by a description. For example, kConstant or nSim.
  • variable names: description words in camelCase, where the first word is lower case and each subsequent word starts with an upper case letter. For example, varBeta1 for the variance of beta1.
  • function names: a verb followed by description words, with each word after the first capitalized. For example, savePlot and computeGRhoGamma. This is the same style as the variable names. I originally was going to follow the Google R style guide instead of the C++ style guide, but opted for the latter because the reasoning made sense: when a function is called, the distinction from a variable is obvious from the syntax (parentheses and possibly arguments).
  • function arguments: use the variable-name or function-name style depending on the type of argument. To help distinguish an argument that takes in a function, prefix it with f. For example, drop=TRUE and fSummarize=summarizeDefault.
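To make this concrete, here is a minimal R sketch of the conventions above; all names are made up for illustration:

<pre class="src src-sh">## constants: k or n prefix
kTolerance <- 1e-8
nSim <- 1000
## variable names: camelCase description
varBeta1 <- 0.25
## function names: verb followed by description words
summarizeDefault <- function(x) c(mean=mean(x), var=var(x))
## function arguments: f prefix marks an argument that takes a function
runSim <- function(nSim, fSummarize=summarizeDefault) {
  fSummarize(rnorm(nSim))
}
runSim(nSim)
</pre>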

When breaking these conventions leads to a better understanding of the code (easier on the brain), I will not hesitate to break them. For example, using i, j, k as iterator variables, using na.rm as a function argument in R, or rKM for a function that draws random numbers from a Kaplan-Meier survival curve.

Now, if only I could magically transform all of my own code into this convention. I’m going to really hate going back to old code that doesn’t follow my own style, especially when it refers to code in packages that I will update according to my new convention.

serialize or turn a large parallel R job into smaller chunks for use with SGE

I use the snow package in R with OpenMPI and SGE quite often for my simulation studies; I’ve outlined how this can be done in the past.

The ease of these methods makes it tempting to just specify the maximum number of cores available all the time. However, unless you own your own dedicated cluster, you are most likely sharing the resources with many others at your institution, and the resources are managed by a scheduler (SGE). The scheduler is nice in that your job automatically starts whenever the resources are available, and it tries to be fair with everyone. Other people’s smaller jobs will most likely be running, and unless no one is using the cluster, my job usually waits in limbo when I specify a lot of cores (say, 100). I’ve outlined how to determine the number of free nodes in the cluster to help in the core specification. However, what if after your job starts, other jobs finish and more cores become available? Are you stuck waiting for your job that’s barely using 5 CPUs?

My system admin always advised me to break my job into smaller chunks and request a small number of cores for each to be more efficient (getting CPU time through the scheduler whenever it is available). I finally got around to thinking about how to implement this, and it’s quite easy. For random number generation, I just specify a different seed for each chunk as described here; this isn’t the ideal solution for random number generation, but it suffices for now. I always write a wrapper function that gets called repeatedly when I use snow anyway, so adapting it is easy. Here is a quick toy example.

sim.R, with my simulation function:

<pre class="src src-sh"><span style="color: #ff4500;">#</span><span style="color: #ff4500;">! /bin/</span><span style="color: #00ffff;">env</span><span style="color: #ff4500;"> Rscript</span>

## get arguments: “seed <- 100″ “iter.start <- 1″ “iter.end <- 100″ arguments <- commandArgs(trailingOnly=TRUE) for(i in 1:length(arguments)) { eval(parse(text=arguments[i])) }

set.seed(seed) n <- 100 beta0 <- 1 beta1 <- 1

simulate <- function(iter=1) { x <- rnorm(n) epsilon <- rnorm(n) y <- beta0 + beta1*x + epsilon fit <- lm(y ~ x) return(list(iter=iter, beta=coef(fit), varcov=vcov(fit))) }

result <- lapply(iter.start:iter.end, simulate) dput(result, paste(“SimResult”, iter.start, “.Robj”, sep=“”))

submit.R, submitting many smaller chunks to SGE:

<pre class="src src-sh"><span style="color: #ff4500;">#</span><span style="color: #ff4500;">! /bin/</span><span style="color: #00ffff;">env</span><span style="color: #ff4500;"> Rscript</span>

job.name <- “MySim” sim.script <- “sim.R” Q <- “12hour_cluster.q” set.seed(111) n.chunk <- 2 n.chunk.each <- 10 my.seeds <- runif(n.chunk, max=100000000) dput(my.seeds, “my.seeds.Robj”) for(current.chunk in 1:n.chunk) { seed <- my.seeds[current.chunk] iter.start <- current.chunk * n.chunk.each – n.chunk.each + 1 iter.end <- current.chunk * n.chunk.each current.job <- paste(job.name, current.chunk, sep=“”) current.job.files <- paste(current.job, c(“”, “.stdout”, “.stderr”), sep=“”) submit <- paste(“qsub -q “, Q, ” -S /usr/bin/Rscript -N “, current.job.files[1], ” -o “, current.job.files[2], ” -e “, current.job.files[3], ” -M email@institution -m beas “, sim.script, ” ‘seed=”, seed, “‘ ‘iter.start=”, iter.start, “‘ ‘iter.end=”, iter.end, “‘”,sep=“”) ## read sim.R directly ## qsub -q 12hour_cluster.q -S /usr/bin/Rscript -N MySim1 -o MySim1.stdout -e MySim1.stderr -M email@institution -m beas sim.R ‘seed=123′ ‘iter.start=1′ ‘iter.end=50′ system(submit)

#### OR USE FOLLOWING METHOD ## submit <- paste(“qsub -q “, Q, ” -S /bin/bash -N “, current.job.files[1], ” -o “, current.job.files[2], ” -e “, current.job.files[3], ” -M email@institution -m beas “, sep=””) ## command <- paste(“Rscript “, sim.script, ” ‘seed=”, seed, “‘ ‘iter.start=”, iter.start, “‘ ‘iter.end=”, iter.end, “‘”, sep=””) ## job.script <- paste(job.name, current.chunk, “.sh”, sep=””)

## sink(job.script) ## cat(“#! /bin/env bashn”) ## cat(“module load R/2.12.1n”) ## cat(command, “n”) ## sink() ## system(paste(submit, job.script)) ## ## qsub -q 12hour_cluster.q -S /bin/bash -N MySim1 -o MySim1.stdout -e MySim1.stderr -M email@institution -m beas MySim1.sh ## ## MySim1.sh: Rscript sim.R ‘seed=123′ ‘iter.start=1′ ‘iter.end=50′ }


Now, I can just specify some parameters in submit.R and multiple smaller jobs will be submitted to SGE. I just have to write another script (aggregate.R) to put the results together and compute the information I need. The nice thing is that OpenMPI or other third-party software isn’t even required.
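For example, aggregate.R might look something like this sketch; the chunk parameters and file names are assumed to match sim.R and submit.R above:

<pre class="src src-sh">#! /bin/env Rscript
## aggregate.R (sketch): combine the chunked results
n.chunk <- 2
n.chunk.each <- 10
result <- list()
for(current.chunk in 1:n.chunk) {
  iter.start <- current.chunk * n.chunk.each - n.chunk.each + 1
  result <- c(result, dget(paste("SimResult", iter.start, ".Robj", sep="")))
}
## e.g., the Monte Carlo mean of the slope estimates
beta1.hat <- sapply(result, function(r) r$beta["x"])
mean(beta1.hat)
</pre>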

Determining number of nodes or cores available in an SGE Queue

To determine the status of a queue in SGE, one can issue the command qstat -g c to get information such as the number of CPUs available and the current CPU and memory load. However, this information can be misleading when nodes are cross-listed in multiple Q’s. A Q can say X nodes are unused when, in reality, they are in use in a different Q. Consequently, a submitted parallel job asking for X cores can wait in limbo for quite some time depending on the cluster’s load. The following R script, sgeQload.R, uses some commands explained in the cheat sheet to output the number of cores really available:

 <pre class="example">#! /bin/env Rscript

This script shows me the number of cores available for each Q.

Since many Q’s on BDUC contain overlapping nodes, information from “qstat -g c” could be misleading and lead to submitted jobs that are waiting…

This script utilizes R, qconf

References

http://moo.nac.uci.edu/~hjm/bduc/sge-quick-reference_v3_cheatsheet.pdf

http://www.troubleshooters.com/codecorn/littperl/perlreg.htm

qstatgc <- system(“qstat -g c”, intern=TRUE) qstatgc.list <- strsplit(qstatgc, split=”\s+”, perl=TRUE) ## remove — line and all.q qstatgc.list[[1]] <- qstatgc.list[[1]][-1] ## CLUSTER QUEUE is one thing -> QUEUE qstat <- t(sapply(qstatgc.list[-1], function(x) as.numeric(x[-1]))) colnames(qstat) <- qstatgc.list[[1]][-1] rownames(qstat) <- sapply(qstatgc.list[-1], function(x) x[1]) qstat <- cbind(qstat, NCPU=NA, LOAD=NA, AVAILABLE=NA)

for(Q in rownames(qstat)){ host.list <- strsplit(grep(“hostlist”, system(paste(“qconf -sq”, Q), intern=TRUE), value=TRUE), split=”\s+”, perl=TRUE)[[1]][-1] host.vec <- NULL for(host in host.list){ host.vec <- c(host.vec, strsplit(strsplit(gsub(“\”, “”, paste(system(paste(“qconf -shgrp”, host, sep=” “), intern=TRUE), collapse=” “), fixed=TRUE), “hostlist”, fixed=TRUE)[[1]][2], “\s+”, perl=TRUE)[[1]]) } host.vec <- unique(host.vec) host.vec <- host.vec[host.vec != “”] host.vec <- gsub(“.bduc”, “”, host.vec, fixed=TRUE)

qhost <- system(“qhost”, intern=TRUE) qhost.matrix <- do.call(rbind, strsplit(qhost[-1], “\s+”, perl=TRUE)) colnames(qhost.matrix) <- strsplit(qhost[1], “\s+”, perl=TRUE)[[1]] NCPU <- sum(as.numeric(qhost.matrix[qhost.matrix[, “HOSTNAME”] %in% host.vec, “NCPU”])) LOAD <- sum(as.numeric(qhost.matrix[qhost.matrix[, “HOSTNAME”] %in% host.vec, “LOAD”])) qstat[Q, “NCPU”] <- NCPU qstat[Q, “LOAD”] <- LOAD qstat[Q, “AVAILABLE”] <- NCPU-LOAD }

qstat

Note that this script is specific to the cluster I use; it does not work out of the box on another cluster I have access to and should be modified accordingly.

Compile R on Mac OS X

I did this before on my Macbook, but never documented it. I wanted to document my attempt on the Mac virtual machine.

  1. Install Xcode (gcc/g++) and gfortran per this site.
  2. Download the R source file (2.12 in this example), and extract it.
  3. ./configure did not work (Fortran issue). Per this, this, and this, I used: ./configure --with-blas='-framework vecLib' --with-lapack --with-aqua --enable-R-framework F77="gfortran -arch x86_64" (the F77 is needed!).
  4. make then sudo make install.
  5. Add /Library/Frameworks/R.framework/Versions/Current/Resources/bin to PATH in ~/.bashrc.
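Putting the steps together in one sketch (the tarball name assumes R 2.12.0; adjust to your version):

<pre class="example">tar xzf R-2.12.0.tar.gz
cd R-2.12.0
./configure --with-blas='-framework vecLib' --with-lapack --with-aqua --enable-R-framework F77="gfortran -arch x86_64"
make
sudo make install
export PATH=/Library/Frameworks/R.framework/Versions/Current/Resources/bin:$PATH ## or add this to ~/.bashrc
</pre>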

Creating even NICER, publishable, embeddable plots using tikzDevice in R for use with LaTeX

It’s true. I like to do my work in R and write using LaTeX (well, I prefer to use org-mode for less formal writing and/or if I don’t have to typeset a lot of math). I haven’t done a lot of LaTeX’ing or Sweaving in the last year since 1) I’ve been collaborating with scientists (stuck using Word) and 2) my simulations in R have been a little overwhelming to keep in one file a la literate programming. I have a feeling I’ll be going back to LaTeX soon since I have to write up my dissertation (and lectures if I end up at an academic institution, **crosses fingers**).

Subconsciously, I’ve always wanted a tighter integration between R and LaTeX. Sweave did a fantastic job bringing R to LaTeX, greatly improving my workflow and jogging my memory when I revisit projects (just look at the one file consisting of documentation/writing and code). Despite R’s outstanding capabilities in creating publishable plots, I always felt it needed work in the realm of typesetting math. Sure, it supports mathematical expressions. I used them a few times, but whenever I included the generated plot in a LaTeX document, the figure appeared out of place. I’ve explored Paul Murrell’s solution of embedding Computer Modern fonts into the R-generated plot; UPDATE 10/23/2010: I also explored psfrag in this post. The required effort probably outweighs the benefit in most situations, in my opinion (I haven’t done it in a real-life scenario). I also tried creating a simplistic plot and overlaying LaTeX code afterwards; again, I haven’t done much with this, although I expect it will come in useful when I have to write over a pdf file whose source code I don't have access to.

I’ve also explored how to draw in LaTeX using the Picture package and XY package. I didn’t do much with it after the exploration because I didn’t know the syntax well and because drawing in R, Google Docs, or OpenOffice suffices 99.9% of the time. I prefer to draw in R or LaTeX to have reproducible code.

I was recently introduced to tikzDevice by this post via the R blogosphere. What it does is translate an R plot into TikZ code. That is, instead of creating the plot device via pdf(), you do it with tikz(). This creates a .tex file via three modes (well, four, but I don’t think I’ll use the bare-bones mode):

  1. Just TikZ code, so you can pull it into a LaTeX document with \input{} (default; a small sketch follows this list).
  2. Tikz code surrounded by the document skeleton so the tex file can be compiled (standAlone=TRUE).
  3. Console output mode, where the code is sent to stdout for use with Sweave (console=TRUE). UPDATE 10/23/2010: use this with pgfSweave; builds on cacheSweave and Sweave.
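Here is what the default mode (1) might look like; the file name fig1.tex is made up for illustration:

<pre class="src src-sh">library(tikzDevice)
tikz(file="fig1.tex") ## standAlone defaults to FALSE: emits just the tikzpicture code
plot(rnorm(10))
dev.off()
## then, in the larger LaTeX document (which needs \usepackage{tikz}):
##   \input{fig1.tex}
</pre>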

Read the vignette; it’s fairly complete. The authors claim that the software is still in Beta stage (they’re still testing certain interface features), but my initial testing shows that it is ready for prime time, at least for my usage.

If you want the results in jpeg/png for use with the internet or Word documents, you can always convert the pdf to another format via the convert command.

Here is my example for the standalone (2) case:

 <pre class="src src-sh"><span style="color: #ff4500;">## </span><span style="color: #ff4500;">look at vignette for examples and how to's</span>

library(tikzDevice) f1 <- “tikzDevice_Ex1.tex” tikz(file=, standAlone=TRUE) set.seed(100) n <- 100 x <- rnorm(100) y <- 2*x + rnorm(n) fit <- lm(y ~ x) plot(x, y, xlab=“x”, ylab=“y”, main=“$E[Y] = \beta_0 + \beta_1 \times x$”) dev.off() system(paste(“rubber –pdf”, f1)) system(“convert tikzDevice_Ex1.pdf tikzDevice_Ex1.png”) system(“gnome-open tikzDevice_Ex1.png”)

Note I make use of the rubber command. Feel free to replace it with pdflatex.

UPDATE 10/23/2010: Make use of pgfSweave with this!

S4 classes in R: printing function definition and getting help

I’m not very familiar with S4 classes and methods, but I assume it’s the recommended way to write new packages since it is newer than S3; this, of course, is open to debate. I’ll outline my experience programming with S4 classes and methods in a later post, but in the meantime, I want to write down some notes on how to get help (via ? in R) and how to print function definitions from S4 methods.

For S3 classes and methods, suppose I want to learn more about a certain method, say print for some class. Let’s use class lm as an example. I could type ?print.lm to get documentation on the function, and type print.lm in the R console to get the function definition printed. This allows me to learn more about the method and learn from it (perks of open source!). To recap:

 <pre class="src src-sh"><span style="color: #ff4500;">## </span><span style="color: #ff4500;">S3</span>

?generic.class ## get help to the generic function “generic” of a particular class “class” generic.class ## print the function in console

However, with S4, this is not the case. I’ve used a few packages written with S4 and, using the techniques above, could not open documentation within R or print function definitions to learn about the underlying code. As I learn to write and document an R package based on S4, I read this section of the R manual for writing packages. I misinterpreted the reading and thought that to get help on a method I had to type something like methods?generic,signature_list-method. However, I received an error due to the - symbol (it’s an operator in R). I believe the stated convention is just for the documentation piece of S4 methods in the .Rd files. After some more searching, I came across this link (examples section) that showed me how to get help. Let’s illustrate with the show method (S4’s equivalent of the print method) for the mer class in the lme4 package.

 <pre class="src src-sh"><span style="color: #ff4500;">## </span><span style="color: #ff4500;">S4</span>

showMethods(“show”) ## show all methods for show ?show ## shows the generic documentation of show method?show(“mer”) ## method?generic(“signature 1″, “signature 2″, …) — get help for the generic function for a particular signature list, usually a single class getMethod(“show”, signature=“mer”) ## function definition lme4:::printMer ## printMer is what the show method for mer calls

For our particular example, the show method for the mer class calls a printMer function in the lme4 namespace. Thus, we need to call lme4:::printMer to see the definition.

Hope this helps others out there.