Literature

Emacs
Researching
Author: Vinh Nguyen

Published: July 24, 2011

Despite the availability of software such as Mendeley, Zotero, and JabRef, I like to store my papers (pdf files) and citation information (bib files) in a directory structure, enter notes into a text file (org-mode), view notes and bibliographic information in a single file, cite references in LaTeX from a single bib file, and manage papers in Emacs using dired. I manage my references in the bib format because it is the de facto standard in academia (at least in my field) and hence is easily exported from the publisher's website. In addition, I write all my papers in LaTeX, where I use bibtex whenever a reference is made; if I were to write papers in another program that uses a different bibliography format, I could easily convert the bib files using bibutils.

This post was originally motivated by this post which outlines a method to manage papers in emacs and org-mode. However, that workflow did not fit me particularly well, so I came up with a setup and workflow of my own. I've been using this workflow for the last half year and never got around to outlining it until the recent org-bibtex discussion reminded me to do so. I will first describe my setup and then its usage. I then end with some tips on starting your academic paper library and other thoughts. It is assumed that papers are available in pdf format. To be safe, none of the file names used throughout should contain spaces.

Setup

The base directory for the research papers I download and their corresponding bib files is ~/Documents/Literature. In this base directory, I have a folder scripts for storing scripts that help me manage the files. I also have the files books.bib and software.bib for storing citation information for books and software. The main files, bibliography.bib and literature.org, are generated and updated by scripts and hence are set to be read-only, to avoid manual edits that would be lost at the next update.

I split my papers into subjects or categories, one directory per subject in the base directory, such as ./EstimatingEquations/, ./Survival/, etc. In each subject directory, I have a category.org file that assists with structuring the literature.org file. An example of the category.org file:

* Estimating Equations :EstimatingEquations:
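
The layout described above can be sketched with a few shell commands. This is only an illustration: the helper name make_category and the throwaway demo directory are assumptions, not part of the scripts in this post, and the real base directory would be ~/Documents/Literature.

```shell
#!/usr/bin/env bash
# Sketch: create a category directory with its category.org heading.
# make_category is a hypothetical helper, not one of the scripts in this post.
make_category() {
    local base="$1" dir="$2" heading="$3"
    mkdir -p "$base/$dir" "$base/scripts"
    # First-level org heading, tagged with the directory name
    printf '* %s :%s:\n' "$heading" "$dir" > "$base/$dir/category.org"
}

# Demo in a throwaway directory instead of ~/Documents/Literature
demo=$(mktemp -d)
make_category "$demo" EstimatingEquations "Estimating Equations"
make_category "$demo" Survival "Survival"
cat "$demo/EstimatingEquations/category.org"
```

Using the directory name as the tag keeps the heading in category.org in line with the example above.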

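Place the following scripts in your scripts directory, ./scripts/, make them executable, and add that directory to your PATH: LitCreateDir.sh and the Emacs commands further down call the scripts by bare name.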

LitNotes.sh:

#! /bin/bash

## template notes.org for a folder with bib.bib file

if [ -f bib.bib ]
then
    if [ -s notes.org ] ## TRUE only if the file exists and its size is greater than 0; FALSE if it does not exist
    then
        echo "./notes.org has content (file size greater than zero.)!" && exit 1
    else
        ## http://stackoverflow.com/questions/5162808/help-with-regex-extracting-text
        author=$(sed -n '/^[[:blank:]]*author[[:blank:]]*=[[:blank:]]*{/ {s///; s/}[^}]*$//p}' bib.bib)
        title=$(sed -n '/^[[:blank:]]*title[[:blank:]]*=[[:blank:]]*{/ {s///; s/}[^}]*$//p}' bib.bib)
        year=$(sed -n '/^[[:blank:]]*year[[:blank:]]*=[[:blank:]]*{/ {s///; s/}[^}]*$//p}' bib.bib)
        echo "** $author ($year) $title 
*** Notes
" > notes.org
    fi
else
    echo "No bib.bib in current directory." && exit 1
fi

# ## following used on inside each category folder, eg, ./Survival; remember to edit tags in this script
# for directory in `ls -p | grep "/"`
# do
#     cd "$directory"
#     ##pwd
#     ##[ -f notes.org ] && echo "yes"
#     if [ -f notes.org ];
#     then
#         ## http://stackoverflow.com/questions/5162808/help-with-regex-extracting-text
#         author=$(sed -n '/^[[:blank:]]*author[[:blank:]]*=[[:blank:]]*{/ {s///; s/}[^}]*$//p}' bib.bib)
#         title=$(sed -n '/^[[:blank:]]*title[[:blank:]]*=[[:blank:]]*{/ {s///; s/}[^}]*$//p}' bib.bib)
#         year=$(sed -n '/^[[:blank:]]*year[[:blank:]]*=[[:blank:]]*{/ {s///; s/}[^}]*$//p}' bib.bib)
#         ##author=$(sed -n '/author *=/{s/^[^{]*{\([^,]*\),.*$/\1/;s/} *$//p}' ./bib.bib) ## does not work well because there could be commas in author
#         echo "** $author ($year) $title :Survival:
# *** Notes
# " ##> notes.org
#     fi
#     cd ..
# done
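
To see what the sed expressions in LitNotes.sh extract, here is a minimal sketch that wraps the same pattern in a function and runs it on a sample entry. The function name bib_field and the sample file are assumptions for illustration. Like the script, it assumes each tag sits on a single line in the form tag = {value},.

```shell
#!/usr/bin/env bash
# bib_field FIELD FILE: print the value of a one-line "field = {value}," tag.
# Hypothetical helper; it reuses the sed pattern from LitNotes.sh.
bib_field() {
    sed -n "/^[[:blank:]]*$1[[:blank:]]*=[[:blank:]]*{/ {s///; s/}[^}]*\$//p;}" "$2"
}

# A sample bib file in a temporary location
sample=$(mktemp)
cat > "$sample" <<'EOF'
@article{White1982,
author = {White, Halbert},
title = {Maximum Likelihood Estimation of Misspecified Models},
year = {1982},
}

EOF

# Build the same heading that LitNotes.sh writes to notes.org
echo "** $(bib_field author "$sample") ($(bib_field year "$sample")) $(bib_field title "$sample")"
```

The empty pattern in s/// reuses the address regex, so the first substitution strips everything up to the opening brace, and the second strips the last closing brace and whatever follows it.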

LitCreateDir.sh:

#! /bin/bash

## arguments are pdf files

for file in "$@"
do
    bn=$(basename "$file")
    NameNoExt=${bn%.*} ## no extension
    Ext=${bn/*./} ## extension http://www.linuxforums.org/forum/programming-scripting/128625-how-get-file-extension-without-dot.html
    ## quote the tr classes and the result so the shell cannot glob or word-split them
    if [ "$(echo "$Ext" | tr '[:upper:]' '[:lower:]')" = "pdf" ] ## only do pdf files
    then
        mkdir "$NameNoExt"
        mv "$file" "$NameNoExt/"
        touch "$NameNoExt/bib.bib"
        touch "$NameNoExt/notes.org"
        if [ -f "$NameNoExt.bib" ] ## file exists?
        then
            mv -f "$NameNoExt.bib" "$NameNoExt/bib.bib"
            cd "$NameNoExt"
            LitNotes.sh
            cd ..
        fi
    fi
done
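
The parameter expansions that LitCreateDir.sh relies on can be tried in isolation. A minimal sketch follows; the function name split_pdf_name is hypothetical, and it uses ${bn##*.} for the extension, which yields the same result as the script's ${bn/*./}.

```shell
#!/usr/bin/env bash
# split_pdf_name FILE: print "name ext" with ext lowercased.
# Hypothetical helper illustrating the expansions used in LitCreateDir.sh.
split_pdf_name() {
    local bn name ext
    bn=$(basename "$1")
    name=${bn%.*}       # strip the shortest trailing ".something"
    ext=${bn##*.}       # keep only what follows the last dot
    ext=$(printf '%s' "$ext" | tr '[:upper:]' '[:lower:]')
    printf '%s %s\n' "$name" "$ext"
}

split_pdf_name /tmp/White1982MLEMisspecifiedModels.PDF
split_pdf_name Cox1993UnbiasedEstEqDerived.pdf
```

Lowercasing the extension is what lets the script accept both .pdf and .PDF files.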

LitUpdate.sh:

#! /bin/bash

basedir="$HOME/Documents/Literature"
bibfile="$basedir/bibliography.bib"
litfile="$basedir/literature.org"

## delete old files
cd "$basedir" || exit 1 ## stop here rather than run rm and find in the wrong directory
rm -f "$bibfile" "$litfile"

## create $bibfile
find ./ -iname "*.bib" -print0 | xargs -0 cat > /tmp/bibliography.bib ## -0, some folders have "'" in name; *.bib and not just bib.bib to get books.bib as well
mv /tmp/bibliography.bib ./
# find ./ -iname "*.bib" -print0 > bibfiles.txt
# xargs -0 cat < bibfiles.txt > $bibfile
## books
##cat books.bib >> $bibfile ## above should already pick up books


## create $litfile
echo "#+title: Literature
#+author: YOUR NAME HERE
#+email: YOUR EMAIL HERE
" >> $litfile
for directory in `ls -p | grep "/"` ## directories in Literature
do
    cd "$directory"
    if [ -f category.org ]
    then
        cat category.org >> $litfile
        ##find ./ -iname "notes.org" -print0 | xargs -0 cat >> $litfile ## need to add in links to paper, location, bib, notes
        for notes in `find ./ -iname "notes.org"`
        do
            fullpath=`readlink -f "$notes"`
            ##fullpath=`realpath "$notes"` ## http://www.commandlinefu.com/commands/view/7999/get-the-absolute-path-of-a-file?utm_source=feedburner&utm_medium=feed&utm_campaign=Feed%3A+Command-line-fu+%28Command-Line-Fu%29&utm_content=Google+Reader
            paperdir=`dirname "$fullpath"` ## directory name of directory with notes.org
            pdf=`find "$paperdir" -iname "*.pdf"`
            # cat "$notes" >> $litfile
            # echo "*** [[file:$pdf][paper]] [[file:$paperdir/][location]] [[file:$paperdir/bib.bib][bib]] [[file:$paperdir/notes.org][notes]]" >> $litfile
            head -n1 "$notes" >> $litfile
            echo "*** [[file:$pdf][paper]] [[file:$paperdir/][location]] [[file:$paperdir/bib.bib][bib]] [[file:$paperdir/notes.org][notes]]" >> $litfile
            sed '1 d' "$notes" >> $litfile
        done
    fi
    cd ..
done

## make file read-only, so I have to manually go to the individual files to edit
chmod ugo=r $litfile $bibfile
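
The link line that LitUpdate.sh inserts under each paper heading can be sketched on its own. The helper name org_link_line is hypothetical. Unlike the script, it limits the search to the paper directory itself and keeps only the first pdf found, as a guard for directories holding more than one pdf.

```shell
#!/usr/bin/env bash
# org_link_line PAPERDIR: print the "*** [[file:...]]" line for one paper directory.
# Hypothetical helper mirroring the echo in LitUpdate.sh.
org_link_line() {
    local paperdir="${1%/}" pdf
    # Assumption: one pdf per paper directory; if there are more, take the first
    pdf=$(find "$paperdir" -maxdepth 1 -iname '*.pdf' | head -n 1)
    printf '*** [[file:%s][paper]] [[file:%s/][location]] [[file:%s/bib.bib][bib]] [[file:%s/notes.org][notes]]\n' \
        "$pdf" "$paperdir" "$paperdir" "$paperdir"
}

# Demo with an empty stand-in pdf in a throwaway directory
demo=$(mktemp -d)
mkdir "$demo/Cox1993UnbiasedEstEqDerived"
touch "$demo/Cox1993UnbiasedEstEqDerived/Cox1993UnbiasedEstEqDerived.pdf"
org_link_line "$demo/Cox1993UnbiasedEstEqDerived"
```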

Place the following in your emacs init file:

;; Literature
(require 'dired) ;; make sure dired-mode-map is defined before binding a key in it
(defun dired-literature-create-directory-from-pdf ()
  "Run LitCreateDir.sh on the marked pdf files (or the file at point)."
  (interactive)
  (save-window-excursion
    (dired-do-async-shell-command
     "$HOME/Documents/Literature/scripts/LitCreateDir.sh" current-prefix-arg
     (dired-get-marked-files t current-prefix-arg))))
(define-key dired-mode-map (kbd "s-l d") 'dired-literature-create-directory-from-pdf)

(defun literature-update ()
  "Regenerate bibliography.bib and literature.org with LitUpdate.sh."
  (interactive)
  (shell-command "LitUpdate.sh"))
(global-set-key (kbd "s-l u") 'literature-update)
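
Both LitCreateDir.sh (which calls LitNotes.sh) and the literature-update command (which calls LitUpdate.sh) invoke scripts by bare name, so the scripts must be executable and their directory must be on the PATH that Emacs sees. A minimal sketch of that setup, assuming a hypothetical helper named use_lit_scripts; where the PATH change should live permanently (a shell startup file, for example) depends on how you start Emacs.

```shell
#!/usr/bin/env bash
# use_lit_scripts DIR: mark the scripts in DIR executable and put DIR on PATH.
# Hypothetical helper; DIR would normally be ~/Documents/Literature/scripts.
use_lit_scripts() {
    local dir="$1"
    chmod u+x "$dir"/*.sh
    case ":$PATH:" in
        *":$dir:"*) ;;                        # already on PATH, nothing to do
        *) PATH="$dir:$PATH"; export PATH ;;
    esac
}

# Demo with a stand-in script in a throwaway directory
demo=$(mktemp -d)
printf '#!/bin/bash\necho "update would run here"\n' > "$demo/LitUpdate.sh"
use_lit_scripts "$demo"
command -v LitUpdate.sh    # prints the full path if the script can be found
```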

Usage

I recommend naming the pdf file AuthorYearTitle.pdf to be consistent across all papers. The Title part should of course be short and descriptive. Whenever I download a pdf paper, I make it MANDATORY that I also download the corresponding bib file (too many times I have had to cite a paper I thought I'd never cite). Nearly all publishers can export to bib, including JSTOR. If the publisher's site doesn't have this feature, I recommend finding the article in Google Scholar and exporting the bib file from the search results via "Export to BibTeX" (this needs to be turned on in the Google Scholar settings); if this isn't available either, I write the bib file manually. Name this file AuthorYearTitle.bib, identical to the pdf file name except for the extension (this is crucial for the scripts to work). I usually visit the downloaded bib file and edit it to my preference. For example, I don't like leading spaces on each line, and I want the value of each tag (e.g., journal) to be enclosed in curly braces and to end with a comma, even if it is the last tag. Also, make sure that there is at least one empty line at the end of the bib file; more on this in the "Other thoughts" section.
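
Since the scripts depend on each pdf having a bib file of the same name, a quick check before creating the directories can help. A minimal sketch; the function name missing_bibs is hypothetical, and it only looks at files ending in lowercase .pdf.

```shell
#!/usr/bin/env bash
# missing_bibs DIR: list pdf files in DIR that have no same-named .bib file.
# Hypothetical helper, not part of the scripts in this post.
missing_bibs() {
    local dir="$1" pdf
    for pdf in "$dir"/*.pdf; do
        [ -e "$pdf" ] || continue                 # the glob matched nothing
        [ -f "${pdf%.pdf}.bib" ] || basename "$pdf"
    done
}

# Demo with empty stand-in files in a throwaway directory
demo=$(mktemp -d)
touch "$demo/White1982MLEMisspecifiedModels.pdf" "$demo/White1982MLEMisspecifiedModels.bib"
touch "$demo/Cox1993UnbiasedEstEqDerived.pdf"     # bib deliberately missing
missing_bibs "$demo"
```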

Suppose I downloaded these files in /tmp. In emacs, I will use dired to cut and paste these two files into the subject directory they belong to; for example, ./Survival. In dired, I will move the cursor to the pdf file and hit s-l d (d for directory); for multiple papers downloaded and moved at the same time, just mark the pdf files first before running s-l d. This will create a directory AuthorYearTitle in ./Survival for each marked paper, move the corresponding pdf and bib files into the newly created directory, and create a templated notes.org whose content is generated from the tags in the bib file:

** Author (Year) Title
*** Notes

Things I learned and notes on the content of the paper are meant to be typed into the notes.org file so that I can review my thoughts later (and not necessarily have to re-read a paper to see what I learned).

Why do I store each paper in its own directory? I want the paper, bib file, and my notes to be self-contained in a single entity, and a directory is the best way to achieve this goal. That way, I can move (to a new category?) or copy (send to a colleague?) the paper with all the information intact.

After downloading new papers or moving papers into a different (new?) category, running s-l u updates my bibliography.bib and literature.org files. All bib files (papers, books, and software) are concatenated into a single bibliography.bib file so that a single \bibliography{$HOME/Documents/Literature/bibliography.bib} can be inserted in all my LaTeX files that make use of references. All category.org and notes.org files are concatenated into a single literature.org file to create a single "library" where I can view all the available information; the paper, location, bib, and notes files can also be opened using C-c C-o (org-mode's binding for opening a link). Since it is a text file, all the power of emacs (and other tools) can be used on this file: ordinary searches, regex searches, etc. This way, I can search my thoughts (via notes) to find or trace ideas back to a paper. Here is a snippet of what a generated literature.org file looks like:

#+title: Literature
#+author: MY NAME
#+email: MY EMAIL

* Estimating Equations :EstimatingEquations:
** White, Halbert (1982) Maximum Likelihood Estimation of Misspecified Models :EstimatingEquations:
*** [[file:/home/vinh/Documents/Literature/EstimatingEquations/White1982MLEMisspecifiedModels/White1982MLEMisspecifiedModels.pdf][paper]] [[file:/home/vinh/Documents/Literature/EstimatingEquations/White1982MLEMisspecifiedModels/][location]] [[file:/home/vinh/Documents/Literature/EstimatingEquations/White1982MLEMisspecifiedModels/bib.bib][bib]] [[file:/home/vinh/Documents/Literature/EstimatingEquations/White1982MLEMisspecifiedModels/notes.org][notes]]
*** Notes

** Whitney K. Newey and Daniel McFadden (1994) Chapter 36 Large sample estimation and hypothesis testing
*** [[file:/home/vinh/Documents/Literature/EstimatingEquations/NeweyMcFadden1994LargeSampleEstimationTesting/NeweyMcFadden1994LargeSampleEstimationTesting.pdf][paper]] [[file:/home/vinh/Documents/Literature/EstimatingEquations/NeweyMcFadden1994LargeSampleEstimationTesting/][location]] [[file:/home/vinh/Documents/Literature/EstimatingEquations/NeweyMcFadden1994LargeSampleEstimationTesting/bib.bib][bib]] [[file:/home/vinh/Documents/Literature/EstimatingEquations/NeweyMcFadden1994LargeSampleEstimationTesting/notes.org][notes]]
*** Notes

** Cox, D. R. (1993) Unbiased Estimating Equations Derived from Statistics that are Functions of a Parameter
*** [[file:/home/vinh/Documents/Literature/EstimatingEquations/Cox1993UnbiasedEstEqDerived/Cox1993UnbiasedEstEqDerived.pdf][paper]] [[file:/home/vinh/Documents/Literature/EstimatingEquations/Cox1993UnbiasedEstEqDerived/][location]] [[file:/home/vinh/Documents/Literature/EstimatingEquations/Cox1993UnbiasedEstEqDerived/bib.bib][bib]] [[file:/home/vinh/Documents/Literature/EstimatingEquations/Cox1993UnbiasedEstEqDerived/notes.org][notes]]
*** Notes

...

* Genetics :Genetics:
** French, Benjamin and Lumley, Thomas and Monks, Stephanie A. and Rice, Kenneth M. and Hindorff, Lucia A. and Reiner, Alexander P. and Psaty, Bruce M. (2006) Simple estimates of haplotype relative risks in case-control data
*** [[file:/home/vinh/Documents/Literature/Genetics/FrenchLumley+Others2006HaplotypesRelativeRisk/FrenchLumley+Others2006HaplotypesRelativeRisk.pdf][paper]] [[file:/home/vinh/Documents/Literature/Genetics/FrenchLumley+Others2006HaplotypesRelativeRisk/][location]] [[file:/home/vinh/Documents/Literature/Genetics/FrenchLumley+Others2006HaplotypesRelativeRisk/bib.bib][bib]] [[file:/home/vinh/Documents/Literature/Genetics/FrenchLumley+Others2006HaplotypesRelativeRisk/notes.org][notes]]
*** Notes

** Follmann, Dean and Proschan, Michael and Leifer, Eric (2003) Multiple Outputation: Inference for Complex Clustered Data by Averaging Analyses from Independent Data
*** [[file:/home/vinh/Documents/Literature/Genetics/FollmanProschanLeifer2003MultipleOutputation/FollmanProschanLeifer2003MultipleOutputation.pdf][paper]] [[file:/home/vinh/Documents/Literature/Genetics/FollmanProschanLeifer2003MultipleOutputation/][location]] [[file:/home/vinh/Documents/Literature/Genetics/FollmanProschanLeifer2003MultipleOutputation/bib.bib][bib]] [[file:/home/vinh/Documents/Literature/Genetics/FollmanProschanLeifer2003MultipleOutputation/notes.org][notes]]
*** Notes

** Lin, D.Y. and Zeng, D. and Millikan, R. (2005) Maximum likelihood estimation of haplotype effects and haplotype-environment interactions in association studies
*** [[file:/home/vinh/Documents/Literature/Genetics/LinZengMillikan2005MLEHaplotype/LinZengMillikan2005MLEHaplotype.pdf][paper]] [[file:/home/vinh/Documents/Literature/Genetics/LinZengMillikan2005MLEHaplotype/][location]] [[file:/home/vinh/Documents/Literature/Genetics/LinZengMillikan2005MLEHaplotype/bib.bib][bib]] [[file:/home/vinh/Documents/Literature/Genetics/LinZengMillikan2005MLEHaplotype/notes.org][notes]]
*** Notes

** Lin, DY and Zeng, D. (2006) Likelihood-based inference on haplotype effects in genetic association studies
*** [[file:/home/vinh/Documents/Literature/Genetics/LinZeng2006LikelihoodInferenceHaplotype/LinZeng2006LikelihoodInferenceHaplotype.pdf][paper]] [[file:/home/vinh/Documents/Literature/Genetics/LinZeng2006LikelihoodInferenceHaplotype/][location]] [[file:/home/vinh/Documents/Literature/Genetics/LinZeng2006LikelihoodInferenceHaplotype/bib.bib][bib]] [[file:/home/vinh/Documents/Literature/Genetics/LinZeng2006LikelihoodInferenceHaplotype/notes.org][notes]]
*** Notes

...
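
Because literature.org is plain text, tracing a note back to its paper can also be done outside Emacs. The sketch below prints the heading of every paper whose entry contains a match. The function name papers_matching is hypothetical, and the demo entries and notes are made-up placeholders rather than real library content.

```shell
#!/usr/bin/env bash
# papers_matching PATTERN FILE: print the "** " heading of every paper entry
# in FILE that contains a line matching PATTERN (case-insensitive regex).
# Hypothetical helper, not part of the scripts in this post.
papers_matching() {
    awk -v pat="$1" '
        /^\*\* / { heading = $0; printed = 0 }    # start of a new paper entry
        /^\* /   { heading = "" }                 # category line, not a paper
        heading != "" && !printed && tolower($0) ~ tolower(pat) {
            print heading; printed = 1
        }
    ' "$2"
}

# Placeholder library with made-up entries and notes
lib=$(mktemp)
cat > "$lib" <<'EOF'
* Category One :CategoryOne:
** Author A (2001) First Paper
*** Notes
An idea about sandwich variance.
** Author B (2002) Second Paper
*** Notes
Nothing relevant here.
EOF

papers_matching "sandwich" "$lib"
```

Each heading is printed at most once, even if several lines of its notes match.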

Note that the update process can take seconds or more because of the concatenation. My library is ever-growing, so updates will only take longer and longer. I'm not sure if I can ever speed up this process (let me know if you have ideas).
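
One possible idea, offered only as a sketch and not as part of the workflow above, is to skip rebuilding bibliography.bib when no bib file has changed since the last update. The function name bib_is_stale is hypothetical. It compares modification times only, so it does not notice deleted or moved papers, and literature.org would still need its own rebuild.

```shell
#!/usr/bin/env bash
# bib_is_stale TARGET DIR: succeed if TARGET is missing, or if any .bib file
# under DIR has been modified more recently than TARGET.
# Hypothetical helper, not part of the scripts in this post.
bib_is_stale() {
    local target="$1" dir="$2"
    [ -f "$target" ] || return 0
    [ -n "$(find "$dir" -iname '*.bib' -newer "$target" | head -n 1)" ]
}

# Demo: an old per-paper bib file, then the same file after an edit
demo=$(mktemp -d)
mkdir "$demo/Paper"
touch -t 202001010000 "$demo/bibliography.bib"
touch -t 201901010000 "$demo/Paper/bib.bib"
if bib_is_stale "$demo/bibliography.bib" "$demo"; then echo "rebuild"; else echo "up to date"; fi
touch "$demo/Paper/bib.bib"    # simulate editing the bib file now
if bib_is_stale "$demo/bibliography.bib" "$demo"; then echo "rebuild"; else echo "up to date"; fi
```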

After running the update process, reftex-reset-mode should be run in an open LaTeX buffer if you want RefTeX to use the most current information.

I make use of emacs's bookmark capabilities (C-x r b) to visit these files easily in emacs.

Getting started

Most researchers keep their downloaded pdf files in one location or in several locations (categories). To get started, I recommend spending time renaming the files to AuthorYearTitle.pdf and downloading or creating a separate bib file for each paper. Then place the files into their category directories (create a category.org in each as well). In emacs dired, mark all the pdf files in each category and run s-l d to generate a self-contained directory for each paper. After this is done for all the categories, run s-l u to update. The file bibliography.bib can now be used in your LaTeX documents, and the file literature.org can now be used as an all-in-one library.

Other thoughts

Each bib file should end with at least one empty line, or things may go wrong in the bibliography.bib file: this file is concatenated from multiple files, and the last line of one file could be joined with the first line of the next if the newline character isn't present at the end of the file.
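
A small guard can repair bib files that lack the final newline before they are concatenated. A minimal sketch, assuming a hypothetical helper named ensure_final_newline; it only guarantees a trailing newline, which is enough to keep entries from running together.

```shell
#!/usr/bin/env bash
# ensure_final_newline FILE: append a newline if FILE is non-empty and lacks one.
# Hypothetical helper, not part of the scripts in this post.
ensure_final_newline() {
    # $(...) strips trailing newlines, so the result is empty
    # exactly when the last byte of the file is a newline
    if [ -s "$1" ] && [ -n "$(tail -c 1 "$1")" ]; then
        printf '\n' >> "$1"
    fi
}

demo=$(mktemp -d)
printf '@article{a,\n}' > "$demo/a.bib"      # no newline at the end
printf '@article{b,\n}\n' > "$demo/b.bib"    # already ends with a newline
ensure_final_newline "$demo/a.bib"
ensure_final_newline "$demo/b.bib"           # left unchanged
cat "$demo/a.bib" "$demo/b.bib"
```

Without the guard, the closing brace of the first entry and the @article of the second would end up on the same line.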

I also have a bibliography.tex file to generate a pdf file with all my papers:

\documentclass{article}
\usepackage{fullpage}
\usepackage{natbib}

\begin{document}

\nocite{*} % list every entry in bibliography.bib without citing it in the text
\bibliographystyle{apa}
\bibliography{bibliography}

\end{document}

One thing I would like to be able to do is edit the literature.org file (notes portion) directly and have the changes reflected in the individual notes.org files. I haven't thought of a process to do this well. It would be nice to view multiple files in a single buffer as if the buffer is showing a single file so that when I edit a portion of the buffer, it actually is a different file depending on its location.

I'm not sure if anyone will find my workflow useful, but I just wanted to document it for the masses.