Skeleton to create fast automatic tree diagrams using R and Graphviz


I’ve had to create tree diagrams (dendrograms, decision trees) many times in the past to illustrate the flow of data or decisions (e.g., data flow for a study). This is usually a manual task done in MS PowerPoint or Visio. I’ve also made some diagrams with Graphviz and its DOT language to make them more reproducible. However, that still felt pretty manual.

I decided to come up with a skeleton framework to generate these diagrams using R since I could connect to various data sources, do calculations, and mash up outputs fairly fast with it.

Here is my framework illustrated by an example:

digraph <- '# dot -Tpng diagram.gv > diagram.png
digraph g {
graph [rankdir="LR"]
node [shape="rectangle" style=filled color=blue fontcolor=white] ;
n [label="%s"] ;
n_a [label="%s" color=red] ;
n_b [label="%s"] ;
n_aa [label="%s" color=red] ;
n_ab [label="%s" color=red] ;
n_ba [label="%s"] ;
n_bb [label="%s"] ;
n -> n_a [label="%s"];
n -> n_b [label="%s"];
n_a -> n_aa [label="%s"];
n_a -> n_ab [label="%s"];
n_b -> n_ba [label="%s"] ;
n_b -> n_bb [label="%s"] ;
} 
'

string_list <- list(
    digraph
    , 'All calls\\nn=100,000'
    , 'Night\\nn=20,000'
    , 'Day\\nn=80,000'
    , 'Closed\\nn=5,000'
    , 'Open\\nn=15,000'
    , 'Closed\\nn=10,000'
    , 'Open\\nn=7,000'
    , 'Night'
    , 'Day'
    , 'Closed'
    , 'Open'
    , 'Closed'
    , 'Open'
    )

# http://stackoverflow.com/questions/10341114/alternative-function-to-paste
dot_file <- do.call(sprintf, string_list)
sink('diagram.gv', split=TRUE)
cat(dot_file)
sink()
# run: dot -Tpng diagram.gv > diagram.png

This will write a file called diagram.gv:

# dot -Tpng diagram.gv > diagram.png
digraph g {
graph [rankdir="LR"]
node [shape="rectangle" style=filled color=blue fontcolor=white] ;
n [label="All calls\nn=100,000"] ;
n_a [label="Night\nn=20,000" color=red] ;
n_b [label="Day\nn=80,000"] ;
n_aa [label="Closed\nn=5,000" color=red] ;
n_ab [label="Open\nn=15,000" color=red] ;
n_ba [label="Closed\nn=10,000"] ;
n_bb [label="Open\nn=7,000"] ;
n -> n_a [label="Night"];
n -> n_b [label="Day"];
n_a -> n_aa [label="Closed"];
n_a -> n_ab [label="Open"];
n_b -> n_ba [label="Closed"] ;
n_b -> n_bb [label="Open"] ;
}

Executing dot -Tpng diagram.gv > diagram.png, I will get the following output:

2014-01-17-tree_diagram.png

To make tree diagrams quickly, edit the structure of the diagram stored in the variable digraph. Insert dynamic text with sprintf via the list string_list: do all the calculations first, then format the results in string_list. The diagram can then be generated very quickly and reproducibly. Changing a data source or a calculation no longer means re-creating the diagram by hand!
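
For readers who prefer Python, the same template-plus-values idea could be sketched as follows. This is only an illustration, not the code used for the post: the names DOT_TEMPLATE, node_label, and build_dot are made up here, the tree is cut down to three nodes, and the counts are passed in by hand where a real script would compute them from a data source.

```python
# Minimal sketch: a DOT template with %s placeholders, filled from computed values.
# All names below are hypothetical and exist only for this illustration.

DOT_TEMPLATE = """digraph g {
graph [rankdir="LR"]
node [shape="rectangle" style=filled color=blue fontcolor=white] ;
n [label="%s"] ;
n_a [label="%s" color=red] ;
n_b [label="%s"] ;
n -> n_a [label="%s"];
n -> n_b [label="%s"];
}
"""


def node_label(name, count):
    # "\\n" is a literal backslash followed by n, which DOT renders as a line break.
    return "%s\\nn=%s" % (name, format(count, ","))


def build_dot(total, night, day):
    # The order of the values must match the order of the %s placeholders.
    values = (
        node_label("All calls", total),
        node_label("Night", night),
        node_label("Day", day),
        "Night",  # edge label for n -> n_a
        "Day",    # edge label for n -> n_b
    )
    return DOT_TEMPLATE % values
```

Writing the string returned by build_dot(100000, 20000, 80000) to diagram.gv and running dot -Tpng diagram.gv > diagram.png should then render it in the same way as the R version.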

Amp and USB Chargers

This is a good article that explains how USB charging works. Basically, avoid cheap chargers. For any reasonably good charger, the amperage of the charger doesn’t really matter so long as it meets or exceeds what the device requires; that is, use a charger rated for at least 0.5 Amp if the device requires 0.5 Amp (what the original charger supplies). Thus, it’s OK to use my 2 Amp chargers on most of my mobile devices so things charge faster!

Hosting multiple WordPress sites with WordPress Multisite and Domain Mapping

I use WordPress to run my blog. I recently needed to run another site and wanted to use WordPress as its CMS too. However, I don’t want to run another installation of WordPress if I don’t have to. I followed the directions here to enable WordPress Multisite, and the directions here to use a different domain name for my second site. Basically,

  • back up my WordPress database and WordPress directory first
  • enable WordPress Multisite via wp-config.php
  • disable all WordPress plugins
  • follow the install options on my WordPress site
  • re-enable my WordPress plugins
  • install the WordPress MU Domain Mapping plugin
  • continue with instructions here

Some things specific to my setup:

  • sub-domain was enabled by default from my previous WordPress installation; I had to add the CNAME *.blog.mydomain.com to my host’s DNS configuration, where blog.mydomain.com is my current WordPress site. I also had to add the line ServerAlias blog.mydomain.com *.blog.mydomain.com to the VirtualHost setup in my Apache web server configuration.
  • mapped my WordPress site 2 from site2.blog.mydomain.com to site2.mydomain.com, where site2 is the name of my new WordPress site.

Now I am able to host multiple WordPress sites from the same WordPress installation. What is the downside? I have to manage the plugins and themes that are available for each of the sites…

Remotely access files on your home Windows computer

Most of my home computers/servers run Linux, so accessing them remotely is quite easy via the ssh protocol. Even the Windows machines I own have an ssh server installed via Cygwin.

Now, for those not familiar with Linux, one could

  • Install freeSSHd and set up an ssh server on Windows.
  • Configure an account and have freeSSHd launch at startup.
  • Forward port 22 on the home router to port 22 on the machine running freeSSHd (assign it a static IP from the router); or, use a different external port for safety reasons (e.g., 1022 -> 22).
  • Use the portable executable of WinSCP to access your files on any other Windows computer with internet access.

Delimited file where delimiter clashes with data values

A comma-separated values (CSV) file is a typical way to store tabular/rectangular data. If a data cell contains a comma, then that cell is typically wrapped in quotes. However, what if a data cell contains a comma and a quotation mark? To avoid such scenarios, it is typically wise to use a delimiter that has a low chance of showing up in your data, such as the pipe (“|”) or caret (“^”) character. However, there are cases when the data is a long string with all sorts of characters, including the pipe and caret characters. What then should the delimiter be in order to avoid a delimiter collision? As the Wikipedia article suggests, using special ASCII characters such as the unit/field separator (hex: 1F) could help as they probably won’t be in your data (no keyboard key corresponds to it!).
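
To make the comma-plus-quote case concrete, here is a small sketch using Python’s standard csv module (the helper name to_csv_line and the sample row are made up for illustration). The module’s default dialect wraps such a cell in quotes and doubles the embedded quotation marks. That works only if every tool that later reads the file applies the same rule; a naive split on commas does not.

```python
import csv
import io


def to_csv_line(row):
    """Serialize one row with the csv module's default (Excel-style) dialect."""
    buffer = io.StringIO()
    csv.writer(buffer).writerow(row)
    return buffer.getvalue()


# Hypothetical sample row: the second cell holds both a comma and quotation marks.
row = ["1", 'said "hi", then left']
line = to_csv_line(row)

# A reader that knows the quoting rules recovers the original two cells.
parsed = next(csv.reader(io.StringIO(line)))

# A naive split on the delimiter cuts the quoted cell in two.
naive = line.split(",")
```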

Currently, my rule of thumb is to use pipe as the default delimiter. If the data contains complicated strings, then I’ll default to the field separator character. In Python, one could refer to the field separator as '\x1f'. In R, one could refer to it as '\x1F'. In SAS, it could be specified as '1F'x. In bash, the character could be specified on the command line (e.g., using the cut command, csvlook command, etc.) by specifying $'\x1f' as the delimiter character.

If the file contains the newline character in a data cell (\n), then the record separator character (hex: 1E) could be used for determining new lines.
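
As a rough sketch of the two separators working together, the Python snippet below writes and reads records by hand rather than through a CSV library. The function names dump_rows and load_rows are made up for illustration, and the check that rejects cells containing a separator is an added assumption, not part of the rule of thumb above.

```python
FIELD_SEP = "\x1f"   # ASCII unit/field separator, placed between cells
RECORD_SEP = "\x1e"  # ASCII record separator, placed between rows


def dump_rows(rows):
    """Join rows of strings into one string; refuse cells that contain a separator."""
    for row in rows:
        for cell in row:
            if FIELD_SEP in cell or RECORD_SEP in cell:
                raise ValueError("cell contains a separator character: %r" % cell)
    return RECORD_SEP.join(FIELD_SEP.join(row) for row in rows)


def load_rows(text):
    """Split a string produced by dump_rows back into rows of strings."""
    return [record.split(FIELD_SEP) for record in text.split(RECORD_SEP)]


# Hypothetical data with commas, pipes, carets, quotes, and an embedded newline.
rows = [
    ["id", "note"],
    ["1", 'a,b|c^d "quoted"\nsecond line'],
]
round_trip = load_rows(dump_rows(rows))
```

Because neither separator is a printable character, no quoting or escaping layer is needed as long as the data itself never contains 1F or 1E.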

Automatically capitalize or uppercase or expand keywords in Emacs using Abbrev Mode

I like that SQL Mode in Emacs comes with an interactive mode in which I can send a query from a buffer to a client buffer, similar to how I execute R code using ESS. However, I don’t think SQL mode is that great at formatting SQL code (e.g., indenting). I guess I could live with manually indenting and putting the selected columns on multiple lines (each preceded by a comma).

I typically write code in lowercase, but I think the SQL convention is to use uppercase for keywords like SELECT, FROM, WHERE, etc. This can be done using Abbrev Mode in Emacs. Add the following to your init file:

;; stop asking whether to save newly added abbrev when quitting emacs
(setq save-abbrevs nil)
;; turn on abbrev mode globally
(setq-default abbrev-mode t)

Now, open a SQL file (/tmp/test.sql). Type SELECT, then press C-x a l and enter select at the prompt. This saves the abbreviation for the current major mode (SQL mode). Now, when you type select followed by <space>, the keyword will be replaced by its uppercase form. Continue doing the same for other keywords. Finally, use the write-abbrev-file command to save the abbreviations to ~/.emacs.d/abbrev_defs so they are available in future Emacs sessions.

To define many keywords all at once, edit the abbrev_defs file directly. For example, I used this list of SQL keywords and relied on Emacs macros to add them to my abbrev_defs file. My abbreviation table for SQL mode is as follows:

(define-abbrev-table 'sql-mode-abbrev-table
(mapcar #'(lambda (v) (list v (upcase v) nil 1))
'("absolute" "action" "add" "after" "all" "allocate" "alter" "and" "any" "are" "array" "as" "asc" "asensitive" "assertion" "asymmetric" "at" "atomic" "authorization" "avg" "before" "begin" "between" "bigint" "binary" "bit" "bitlength" "blob" "boolean" "both" "breadth" "by" "call" "called" "cascade" "cascaded" "case" "cast" "catalog" "char" "char_length" "character" "character_length" "check" "clob" "close" "coalesce" "collate" "collation" "column" "commit" "condition" "connect" "connection" "constraint" "constraints" "constructor" "contains" "continue" "convert" "corresponding" "count" "create" "cross" "cube" "current" "current_date" "current_default_transform_group" "current_path" "current_role" "current_time" "current_timestamp" "current_transform_group_for_type" "current_user" "cursor" "cycle" "data" "date" "day" "deallocate" "dec" "decimal" "declare" "default" "deferrable" "deferred" "delete" "depth" "deref" "desc" "describe" "descriptor" "deterministic" "diagnostics" "disconnect" "distinct" "do" "domain" "double" "drop" "dynamic" "each" "element" "else" "elseif" "end" "equals" "escape" "except" "exception" "exec" "execute" "exists" "exit" "external" "extract" "false" "fetch" "filter" "first" "float" "for" "foreign" "found" "free" "from" "full" "function" "general" "get" "global" "go" "goto" "grant" "group" "grouping" "handler" "having" "hold" "hour" "identity" "if" "immediate" "in" "indicator" "initially" "inner" "inout" "input" "insensitive" "insert" "int" "integer" "intersect" "interval" "into" "is" "isolation" "iterate" "join" "key" "language" "large" "last" "lateral" "leading" "leave" "left" "level" "like" "local" "localtime" "localtimestamp" "locator" "loop" "lower" "map" "match" "map" "member" "merge" "method" "min" "minute" "modifies" "module" "month" "multiset" "names" "national" "natural" "nchar" "nclob" "new" "next" "no" "none" "not" "null" "nullif" "numeric" "object" "octet_length" "of" "old" "on" "only" "open" "option" "or" "order" "ordinality" 
"out" "outer" "output" "over" "overlaps" "pad" "parameter" "partial" "partition" "path" "position" "precision" "prepare" "preserve" "primary" "prior" "privileges" "procedure" "public" "range" "read" "reads" "real" "recursive" "ref" "references" "referencing" "relative" "release" "repeat" "resignal" "restrict" "result" "return" "returns" "revoke" "right" "role" "rollback" "rollup" "routine" "row" "rows" "savepoint" "schema" "scope" "scroll" "search" "second" "section" "select" "sensitive" "session" "session_user" "set" "sets" "signal" "similar" "size" "smallint" "some" "space" "specific" "specifictype" "sql" "sqlcode" "sqlerror" "sqlexception" "sqlstate" "sqlwarning" "start" "state" "static" "submultiset" "substring" "sum" "symmetric" "system" "system_user" "table" "tablesample" "temporary" "then" "time" "timestamp" "timezone_hour" "timezone_minute" "to" "trailing" "transaction" "translate" "translation" "treat" "trigger" "trim" "true" "under" "undo" "union" "unique" "unknown" "unnest" "until" "update" "upper" "usage" "user" "using" "value" "values" "varchar" "varying" "view" "when" "whenever" "where" "while" "window" "with" "within" "without" "work" "write" "year" "zone")
))
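
As an alternative to Emacs macros, a short script could generate the entries. The following Python sketch (the function name abbrev_table is made up for illustration) emits one literal entry per keyword in the same (abbrev expansion hook count) shape that the mapcar form above produces. It assumes the keywords are plain lowercase words with no characters that need escaping in a Lisp string.

```python
def abbrev_table(table_name, keywords):
    """Return a define-abbrev-table form mapping each keyword to its uppercase form."""
    entries = [
        '    ("%s" "%s" nil 1)' % (keyword, keyword.upper())
        for keyword in sorted(set(keywords))  # sort and drop duplicates
    ]
    return "(define-abbrev-table '%s\n  '(\n%s\n   ))\n" % (
        table_name,
        "\n".join(entries),
    )


# Hypothetical short keyword list; a real one would be read from a file.
table = abbrev_table("sql-mode-abbrev-table", ["select", "from", "where", "select"])
```

The resulting text could be pasted into abbrev_defs in place of a hand-built table.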