This post does a fantastic job of explaining the Mahalanobis distance. Basically, one can think of it as a multivariate generalization of the z-score: the standardized distance of a vector from the mean vector, which plays the role of the origin.
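To make the z-score analogy concrete, here is a minimal Python sketch for the two-dimensional case. The function name `mahalanobis_2d` and the example numbers are my own assumptions for illustration, not something taken from the linked post.

```python
import math

def mahalanobis_2d(x, mean, cov):
    """Mahalanobis distance of a 2-D point x from mean, given a 2x2 covariance."""
    dx = x[0] - mean[0]
    dy = x[1] - mean[1]
    (a, b), (c, d) = cov
    det = a * d - b * c
    if det == 0:
        raise ValueError("covariance matrix is singular")
    # Quadratic form v' S^{-1} v, using the closed-form inverse of a 2x2 matrix.
    q = (d * dx * dx - (b + c) * dx * dy + a * dy * dy) / det
    return math.sqrt(q)

# Uncorrelated case (hypothetical numbers): variances 4 and 9,
# so the standard deviations are 2 and 3.
cov = ((4.0, 0.0), (0.0, 9.0))
dist = mahalanobis_2d((2.0, 3.0), (0.0, 0.0), cov)

# Each coordinate sits one standard deviation from the mean, so the distance
# should match the Euclidean length of the vector of z-scores.
z_scores = (2.0 / 2.0, 3.0 / 3.0)
print(dist, math.hypot(*z_scores))
```

With uncorrelated variables the distance is just the length of the vector of per-coordinate z-scores; a non-zero covariance changes the result through the off-diagonal terms of the inverse.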

# Category: Statistics

## Math Software

This post has a good list of available math software.

## Public data sets

The Washington Post keeps a list here. There is also the UCI Machine Learning Repository.

UPDATE 2012-03-29: Genome data of 1000+ individuals per this post.

## Keep slides short and tell a story

I enjoyed this post. What did I learn?

- Don’t immediately jump to making slides when you have to give a presentation.
- If you have to use slides to help facilitate your presentation, start out by writing prose. That is, tell your story. Then make your slides. For a 20 minute presentation, try to stick to 3 slides. This way, only the most needed content (main ideas and graphics) will be on the slides, and everything else should be spoken. I like this because it forces me to know the content of my presentation cold without having to rely on the slides to know what I need to say next.

My current way of doing things? I start off by writing slides immediately. I start with an outline and fill in the gaps. This leads to **many** slides. However, I do target my presentation to no more than 1 slide per minute. I have to admit that I always rely on my slides to remind me of the content I am to present. Many times, I even read the slides verbatim. If the slides were not available, I would not be able to deliver my presentation.

I really need to improve my presentation skills. I think the key to it all is to know the content of your presentation like the back of your hand. This, no doubt, will lead to a higher level of confidence when delivering the talk. Having a limited number of slides will definitely help with knowing the content cold.

## Working with SAS macro variables

Believe it or not, I recently had to use SAS for a simulation study because it was the only system available that could maximize the exact partial likelihood of a Cox proportional hazards model for tied data (`ties=exact`) in a reasonable amount of time for a data set of 2,000 observations. This was the first time I ran a simulation in SAS. I made use of the macro capabilities of SAS to write generic functions that could be used to repeat steps; I might blog about my overall experience some day. Today, I’ll just write about one source of frustration that I had.

In writing my simulation study, I made use of macro variables (the `%let` statement) for the different scenarios I was investigating. After getting some unexpected results, I started debugging and then realized that whatever comes after the equal sign and before the semicolon of a `%let` statement is what the macro variable stores, as plain text. For example, if I specify `%let x = 2 + log(2) / λ ;`, what gets stored in x is `2 + log(2) / λ`. When I use `&x`, `2 + log(2) / λ` gets substituted into the code (I incorrectly assumed this expression was evaluated), which can lead to unexpected results when it is used in mathematical expressions. For instance, if in a data step I write `y = z / &x`, then I get something I wasn’t expecting: `z / 2 + log(2) / λ` instead of `z / (2 + log(2) / λ)`. I then thought I needed to use %EVAL or %SYSEVALF, but these functions only handle arithmetic and logical operators, not mathematical functions (e.g., `log`). I also considered using CALL SYMPUTN but it could only be used in a `DATA` step. My solution for getting the right math is to always surround expressions with parentheses: `%let x = (log(2) / λ)`. This way, it’s safe to use `&x` as an entity.
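The substitution problem is not specific to SAS; any pure text replacement behaves this way. The sketch below imitates it in Python rather than SAS. The helpers `resolve` and `evaluate`, the name `lam` (standing in for λ), and the numbers are all hypothetical, and the real macro processor does far more than a string replace.

```python
import math

def resolve(template, macro_vars):
    """Mimic macro resolution: paste each stored text in place of &name."""
    for name, text in macro_vars.items():
        template = template.replace("&" + name, text)
    return template

def evaluate(expression, z, lam):
    # Evaluate the pasted text as ordinary arithmetic; only z, lam and log are visible.
    return eval(expression, {"__builtins__": {}}, {"z": z, "lam": lam, "log": math.log})

statement = "z / &x"
unsafe = resolve(statement, {"x": "2 + log(2) / lam"})    # no parentheses
safe = resolve(statement, {"x": "(2 + log(2) / lam)"})    # parenthesized

print(unsafe)  # the text "z / 2 + log(2) / lam"
print(safe)    # the text "z / (2 + log(2) / lam)"
print(evaluate(unsafe, z=10.0, lam=0.5), evaluate(safe, z=10.0, lam=0.5))
```

The two resolved strings differ only by the parentheses, yet operator precedence makes them compute different quantities, which is exactly the surprise described above.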

To help with debugging and to see exactly how your code is running, I suggest turning on the `symbolgen` option when macros are used:

```
options symbolgen ; /*show macro variable resolution*/
```

## Install SAS on Linux

When installing SAS on an Ubuntu machine, I ran into issues like `/bin/sh` being linked to `/bin/dash` instead of `/bin/bash` and the deployment wizard not starting. This post helped me resolve my issues.

Basically:

```
sudo rm /bin/sh
sudo ln -s /bin/bash /bin/sh
sudo apt-get install xauth x11-apps libstdc++5 ia32-libs libxp6 ## deployment wizard not starting
sas -work /tmp
```

The default work directory for SAS is `/usr/tmp`. This isn’t available on Ubuntu. You can always use the `-work` argument on the command line or change the default location in `!SASROOT/sasv9.cfg` (`/usr/local/SAS/SASFoundation/9.2/sasv9.cfg`). Other default options such as memsize could also be changed in that config file.

## Testing a hypothesis and hypothesis generation

I like this podcast. It discusses some recent findings about genes possibly related to Alzheimer’s disease. In it, the guest speaks of not having a hypothesis going into the study. Then the host (or another guest) raises the question of conducting a study without a pre-defined hypothesis. The keyword was “hypothesis generation.” I’m happy this was brought up on NPR (although this wasn’t the point of the show). I might use it as an example in class some day.

## Managing a statistical analysis project – guidelines and best practices

Had to share this link today, as I had better read all the content it refers to and incorporate a lot of the recommended practices into my workflow. Thanks to Tal Galili for compiling all that information.

## Cool articles in the New York Times: Statistics + R

So these articles are ‘old news,’ but here I am to blog about them before I forget. The first article is entitled “For Today’s Graduate, Just One Word – Statistics,” and the second article is entitled “Data Analysts Captivated by R’s Power.”

It really does feel reinforcing and motivating when the NY Times writes about your profession AND the tool you use!

## Sage (again): everything math

I blogged about Sage in the past and stated that I won’t be using it much since `R` is my language/environment of choice. This is still true, but I wanted to write a few more comments about Sage after toying with it a bit more.

Sage is based on Python (good!) and its mission statement is

Mission: Creating a viable free open source alternative to Magma, Maple, Mathematica and Matlab.

I like it. If I need to solve an equation, factor, do partial fraction decomposition, do Taylor expansions, find derivatives, integrate, and all else math, Sage is there for me. It’s both free (open-source) and easy to use. The learning curve is pretty low if you want to do basic things like create examples for teaching Calculus. Plotting is also great, but R is superior in my opinion. Sage graphics outperforms R graphics in one respect: it can include and display LaTeX equations natively (it uses `matplotlib`, a Python plotting library). Sage also displays the vertical and horizontal axes in the center of a plot, similar to the graphs in textbooks I grew up with. Sage graphics seems more geared towards teaching whereas R is geared towards professional publishing.

Personally, I’ll use Sage when I teach stuff like Calculus where I need plots with axes and all other math features that R isn’t built for.

There is a Sage mode for Emacs; however, it is in alpha as of now, so its features aren’t comparable to what ESS offers for R.

Another great thing about Sage is that it has a notebook GUI that allows it to be run inside a web browser. Therefore, you can run a Sage server that allows users to run Sage interactively! See this for example. You can run `notebook()` on your own computer or create an account on the previous site to test it out.