## Better decision tree graphics for rpart via party and partykit

I’ve been using Graphviz to create better decision tree graphics “by hand” for rpart objects created in R (final tree). I stumbled on this post that shows how one could convert an rpart object to a party project via the as.party function in partykit to utilize the plot functions in party. It looks quite nice.

I might have to do additional hacking as I like to display the node size and percentage of success in every node. For example, in rpart, I do something like

## rpartObj created from rpart
textRpartCustom <-
{
nclass <- (ncol(yval) - 1L)/2
group <- yval[, 1L]
counts <- yval[, 1L + (1L:nclass)]
if (!is.null(ylevel))
group <- ylevel[group]
temp1 <- rpart:::formatg(counts, digits)
if (nclass > 1) {
## temp1 <- apply(matrix(temp1, ncol = nclass), 1, paste,
##     collapse = "/")
temp1 <- matrix(as.numeric(temp1), ncol=nclass)
##temp1 <- paste("p=", round(temp1[, 2] / apply(temp1, 1, sum)100, 1), "%", "; N=", apply(temp1, 1, sum), sep="")
temp1 <- paste("", round(temp1[, 2] / apply(temp1, 1, sum)100, 1), "%", "; ", apply(temp1, 1, sum), sep="")
}
if (use.n) {
out <- paste(format(group, justify = "left"), "\n", temp1,
sep = "")
}
else {
out <- format(group, justify = "left")
}
return(out)
}

rpartObj$functions$text <- textRpartCustom
plot(rpartObj)
text(rpartObj)


to get these information to display for a classification fit.

## Mahalanobis distance explained

This post does a fantastic job of the Mahalanobis distance. Basically, one can think of it as a multivariate generalization of the z-score. That is, the standardized distance of a vector from the origin (mean vector).

## Math Software

This post has a good list of available math software.

## Testing a hypothesis and hypothesis generation

I like this podcast. It discusses some recent findings about genes possibly relating to Alzheimer’s Disease. In it, the guest speaks of not having a hypothesis going into the study. Then the host (or another guest) raised the question of conducting a study without a pre-defined hypothesis. The keyword was “hypothesis generation.” I’m happy this was brought up on NPR (although this wasn’t the point of the show). I might use it as an example in class some day.

## Managing a statistical analysis project – guidelines and best practices

Had to share this link today as I better read all the content it refers to and incorporate a lot of the recommended practices into my work flow. Thanks Tal Galili for compiling all those information.

## Cool articles in the New York Time’s: Statistics + R

so these articles are ‘old news,’ but here i am to blog it down before i forget. First article is entitled “For Today’s Graduate, Just One Word – Statistics,” and the second article is entitled “Data Analysts Captivated by R’s Power.”

It really does feel re-enforcing and motivating when the NY Times write about your profession AND the tool you use!

## Sage (again): everything math

I blogged about Sage in the past and stated that I won’t be using it much since R is my language/environment of choice. This is still true, but I wanted to write a few more comments about Sage after toying with it a bit more.

Sage is based on Python (good!) and its mission statement is

Mission: Creating a viable free open source alternative to Magma, Maple, Mathematica and Matlab.

I like it. If I need to solve an equation, factor, do partial fraction decomposition, do Taylor expansions, find derivatives, integrate, and all else math, Sage is there for me. It’s both free (open-source) and easy to use. The learning curve is pretty low if you want to do basic things like create examples for teaching Calculus. Plotting is also great but R is superior in my opinion. Sage graphics outperforms R graphics in one respect: it can include and display LaTeX equations natively (uses matlibplot, which is based on GNU-plot, I think). Sage also displays the vertical and horizontal axes in the center of a plot, similar to the graphs in textbooks I grew up with. Sage graphics seems more geared towards teaching whereas R is geared towards professional publishing.

Personally, I’ll use Sage when I teach stuff like Calculus where I need plots with axes and all other math features that R isn’t built for.

There is a sage mode for emacs, however, is in alpha mode as of now, so the features aren’t comparable to ESS is for R.

Another great thing about Sage is it has a notebook GUI that allows it to be run inside a web browser. Therefore, you can run a Sage server that allows users run Sage interactively! See this for example. You can run notebook() on your own computer or create an account on the previous site to test it out

## Linear Regression plot with normal curves for error (sideways)

Linear Regression: Normal Errors

I remember my first encounter to the classical linear regression model with normally distributed errors by Dan Gillen in Stat 120c in spring 2005 at UCI. I mean, I’ve done linear regression in high school science classes: you get a best fit line out of your data. In the stat class, I remember seeing data drawn on the board and a line going through the data. To illustrate the normally distributed errors, he drew sideway normal curves.

I liked this presentation of the classical model. How do convey this in R? I couldn’t find a package for it (searched for an hour I think). I always search for packages because I always think others could do a better job at it than me and because I don’t like to re-invent things, even though it requires less time for me to do it myself in many cases. Well, here it is, written by me in R:

<pre class="src src-R"><span style="color: #87cefa;">sideways.dnorm</span> <span style="color: #7fffd4;">&lt;-</span> <span style="color: #00ffff;">function</span>(where.x, where.y, values=seq(-4,4,.1), magnify=4, ...){


###… consists of mean, sd, and log passed to dnorm function dens <- dnorm(x=values, …) x <- where.x + dens * magnify y <- where.y + values return(cbind(x,y)) }

temp <- sideways.dnorm(where.x=3, where.y=3, values=seq(-2,2,.1)) plot(temp)

##jpeg(“LinearRegressionNormalErrors.jpeg”) pdf(“LinearRegressionNormalErros.pdf”) set.seed(123) n <- 200 x <- runif(n, -8,8) y <- 1 + .5*x + rnorm(n, sd=1) fit <- lm(y~x) plot(x,y, xlim=c(-10,10), ylim=c(-8, 10) , main=“Classical Linear Regression\nwith Gaussian errors” ##, main=expression(paste(“Classical Linear Regression\nwith N(0,”, sigma\^2, “) errors”, sep=”)) , col=“red” ) abline(fit, lwd=3, col=“blue”) where.normal.x <- c(-4,0,4) for(i in 1:length(where.normal.x)){ where.x <- where.normal.x[i] where.y <- predict(fit, newdata=data.frame(x=where.x)) xy <- sideways.dnorm(where.x=where.x, where.y=where.y, magnify=4) lines(xy) abline(h=where.y, lty=2) abline(v=where.x, lty=2) }

dev.off()

## Inclusion of movies / animations / interactive plots in latex docs / beamer slides

Check out the movie15 package for including multimedia inside in Beamer slides. Some references are here, here, and here.

Note that movies are currently playable in only Adobe Reader/Acrobat. Mac’s Preview and other pdf readers currently do not support it. However, I know hope support will be given in the near future.

## Animation / Interactive plots in R for inclusion in html websites or latex documents (beamer)

Been wanting to do an interactive/animated plot like the one here for a while (go to the Taylor example). Sage looks like a good program to generate animated graphics, but R is my primary language of choice as a statistician. I don’t want to learn another language just for a single feature (although the other features look good as well). For mathy stuff that require the likes of Mathematica (cannot be done in R), I’ll go to Sage. But for now, I’ll stick with R.

Check out the animation package in R and check out the examples here. It’s quite easy to create animations in R. Just create a series of plot (in png or jpeg format) and make them interactive via the convert command from ImageMagick to pack them into a gif file. We can also use mencoder to turn a series of plot into an avi file or convert to turn them into an mpeg file for inclusion in LaTeX documents or slides.

Some code:

<pre class="src src-R"><span style="color: #7fffd4;">library</span>(animation)


setwd(“~/Documents/UCI/RWorkAndExamples/”) storagefolder <- ‘Animations-CI’ if( !(storagefolder %in% dir()) ) dir.create(storagefolder) outfolder <- file.path(getwd(), storagefolder) oopt <- ani.options(interval=.05, nmax=100, outdir=outfolder, ani.type=“jpeg”, ani.dev=jpeg, filename=‘CI-Animations.html’) ##store old options first ani.start() conf.int(0.95) ani.stop()

##addprefix script:
#! /bin/bash
for file in $@ do mv$file $addtofile.$file;
done
echo "Now run rename to change prefix."

##run in shell
##vinh's script
rename prefix. 00 prefix.?.jpeg
rename prefix. 0 prefix.??.jpeg
rename prefix. 00 prefix.?.jpeg
rename prefix. "" prefix.???.jpeg


Check out following this and this to convert image files into movies (for inclusion into pdf via multimedia or movie package):

<pre class="src src-sh">mencoder <span style="color: #ffa07a;">"mf://*.jpeg"</span> -mf <span style="color: #eedd82;">fps</span>=5 -o CI-animations.avi -ovc lavc -lavcopts <span style="color: #eedd82;">vcodec</span>=msmpeg4v2:<span style="color: #eedd82;">vbitrate</span>=800


Check out this to convert image files to gif animation for use on websites:

<pre class="src src-sh">convert -delay 50 -loop 50 ??0.jpeg CI-animations.gif


##check following
saveMovie(expr={for(i in 1:10) plot(rnorm(100))}, outdir=outfolder)

ani.options(nmax=100)
saveMovie(expr=clt.ani(obs=300), outdir=outfolder)
flip.coin()

###Taylor example
ifelse1 <- function(test, yes, no) if(test) yes else no

sin.taylor <- function(x) x - x\^3/factorial(3) + x\^5/factorial(5) - x\^7/factorial(7)
sin.taylor2 <- function(x, n=NULL){
if(is.null(n)) sin(x) else{
x +ifelse1(n>=3, - x\^3/factorial(3), 0) + ifelse1(n>=5, x\^5/factorial(5), 0) + ifelse1(n>=7, - x\^7/factorial(7), 0) + ifelse1(n>=9, x\^9/factorial(9), 0) + ifelse1(n>=11, - x\^11/factorial(11), 0)
}
}
curve(sin, xlim=c(-2pi,2pi), ylim=c(-2,2))
x <- seq(-2pi, 2pi, .1)
lines(x, f.taylor(x), col="blue", lwd=2)
lines(x, f.taylor2(x, n=7), col="blue", lwd=2)

jpeg("sinTaylor%03d.jpeg")
for(i in c(1,3,5,7,9,11)){
curve(sin, xlim=c(-2pi,2pi), ylim=c(-2,2), main=paste("Taylor Expansion of f(x)=sin(x)\nOrder = ",i))
lines(x, sin.taylor2(x, n=i), col="blue", lwd=2)
legend(4,2, legend=c("f", expression(hat(f))), lwd=c(1,2), col=c("black", "blue"), bty="n")
}
dev.off()
##shell:
convert -delay 100 -loop 50 sinTaylor00?.jpeg sinTaylor.gif
mencoder "mf://sinTaylor00?.jpeg" -mf fps=1 -o sinTaylor.avi -ovc lavc -lavcopts vcodec=msmpeg4v2:vbitrate=800

##consistency
n <- 200
x <- runif(n, 0, 4)
y <- .2*exp(x) + rnorm(n,2)
jpeg("Regression%03d.jpeg")
plot(x,y, col="red")
fit <- lm(y~x)
abline(fit, lwd=2, col="blue")
dev.off()

set.seed(123)
jpeg("Consistency%03d.jpeg", quality=50)
n.seq <- c(10,20,50,200,1000)
nsim <- 4
for(i in 1:length(n.seq)){
n <- n.seq[i]
for(j in 1:4){
x <- runif(n, 0, 4)
y <- .2exp(x) + rnorm(n,2)
plot(x,y, col="red", xlim=c(0,4), ylim=c(-1,15), main=paste("n =", n), cex.main=2)
fit <- lm(y~x)
abline(fit, lwd=4, col="blue")
legend(0,15, legend=paste("Slope =", round(fit\$coef[2],2)), bty="n",cex=2)
}
}
dev.off()
##shell
convert -delay 100 -loop 50 Consistency0.jpeg Consistency-Ani.gif
mencoder "mf://Consistency0*.jpeg" -mf fps=1 -o Consistency-Ani.avi -ovc lavc -lavcopts vcodec=msmpeg4v2:vbitrate=800