LaTeX reference management

By John Lenz. June 14, 2012.

In this post, I describe how I manage all the papers, books, and other references I have accumulated over the years. As far as I can tell, my approach is unique. Several people have asked me how I manage my references, and being able to point them here is part of the reason I created this blog. Since this got quite long, I have split it into pieces. This first post describes at a high level the way I manage references; part 2, part 3, and part 4 describe some tools and code I use to work with my references.

Citing references from LaTeX using BibTeX works by creating a file containing a long list of references and information about them (author, title, journal, etc.). In the LaTeX document, you cite a reference using the \cite command, and BibTeX and LaTeX produce the correct reference number and a correctly formatted citation (using the author, title, etc.) in the output PDF. The hard part is managing the file containing the long list of references.

Existing Reference Management Tools

There are lots of programs (and some websites) for managing a list of references and, as far as I can tell, all of them are built around a database model. You add references to a personal database with columns for author, title, journal, etc. The database might also store the paper itself or a link to the paper online, and perhaps some notes attached to the paper. The program or website then has some method of generating the file expected by BibTeX and LaTeX. So you add references to your personal database and manage them there, and when it comes time to compile LaTeX you manually export the list.

For me, this approach has several downsides.

The export process is manual. Misspell an author name and fix it in the database but forget to export? Your documents don't have the fix. Even if you remember, it is a pain to have to export this all the time.
The references are stored in one huge list, even though my references span lots of different topics. With a database, it is hard to keep similar papers together. Sure, most of these programs have tags and notes, so you can tag certain papers as being about the same topic and add notes like "This other paper solves one of the open problems of this paper." But tags are hard to manage when the number of topics and papers gets large (I have 241 papers over 41 topics): you forget exactly which tag goes with which topic, and you sometimes change the spelling of tags (did that tag have an "s" at the end or not?). Also, sometimes you want to add notes about a group of papers, for example a note saying "These three papers all combine to prove..."
Some of my colleagues keep several databases for different topics to mitigate the above problem and keep the number of papers in each one smaller and more manageable. Some even keep one database per paper written, and then manually copy references between databases when starting a new paper! This does keep the number of references down and makes it easier to keep the same topics together, but any notes and fixes for a paper appearing in more than one database are lost unless they are manually replicated.
These databases are locked up in some hard-to-access format, so you can't write tools to work with the references, and it is harder to store the references in Mercurial.

Of course, all these disadvantages can be worked around; some of these management programs provide scripting or other access, more free-form notes, etc. But, for example, none of the programs I checked out rendered LaTeX math in notes (maybe some newer ones do; I haven't looked recently). Even if these issues can be overcome, I still think managing references around a database is the wrong approach.

Comments next to code

As mentioned, I think a personal database of references is a fundamentally flawed approach. Programmers have this idea called "comments next to code." Roughly speaking, the idea is that descriptions of code and APIs should live next to the code itself, and you should then use a tool called a documentation generator to generate documentation from the code. The reason for this is twofold.

Just from glancing at the code, you can easily see what is documented and what isn't.
The closer the comments are to the code, the more likely they are to be updated when the code changes. Of course, to the frustration of the next programmer to come along, sometimes comments aren't updated even when they are right next to the code! But since managing references is a solitary task, I have no one but myself to blame, and once I get in the habit I can keep the comments (in this case, notes about what the paper contains) updated.

What does this have to do with references? For similar reasons, my reference management is not designed around a database of references which have individual tags or notes attached. Instead, I mark up the "code" (in this case, the reference data that BibTeX and LaTeX expect) with "comments", which can be notes about the main results of the paper, a sketch of a clever proof idea in the paper, relations of this paper to others, and so on. Also, since I manage the list of BibTeX references directly, I can put them in an order that makes sense by topic and include "comments" about groups of papers. For example, I can add a comment like "The next three papers combine to show this remarkable result that...." I can also add comments which link to related papers.

The key advantage of this method is that the comments and references all appear in a single document of related papers and can be read top to bottom like a survey. The comments are interspersed with the actual reference data. I then have several documents/code files each on a separate topic. At the moment I have 41 documents containing 241 references, and each document is a mini survey of comments and references.

I manage these documents as follows.

Each document is a text document with the comments written in Markdown (a wiki-like syntax).
Documents are stored in Mercurial for versioning.
I wrote a short tool to strip out the comments since BibTeX just wants the raw code.
I use pandoc to convert the marked up documents to HTML to be able to read them in a web browser. This gets me all sorts of features, including links and LaTeX equation rendering using MathML or MathJax (similar to this post).
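
To make the comment-stripping step concrete, here is a minimal Python sketch. It is not the tool I describe in the later parts of this series; it simply assumes that every BibTeX entry lives in a pandoc fenced block opened with `~~~ {.bib}` and closed with `~~~`, as in the example further down. The function name `extract_bibtex` is made up for illustration.

```python
import re

# Assumption: BibTeX lives only in fenced blocks that open with a line
# "~~~ {.bib}" and close with a line "~~~". Everything else is a comment.
BIB_BLOCK = re.compile(
    r"^~~~[ \t]*\{\.bib\}[ \t]*\n(.*?)^~~~[ \t]*$",
    re.MULTILINE | re.DOTALL,
)


def extract_bibtex(text: str) -> str:
    """Return only the BibTeX entries from a marked-up reference document."""
    blocks = [m.group(1).strip() for m in BIB_BLOCK.finditer(text)]
    # Separate entries with a blank line; an input without blocks gives "".
    return "\n\n".join(blocks) + ("\n" if blocks else "")
```

Running something like this over each topic document and concatenating the results gives a plain .bib file that BibTeX can read, so there is never a separate export step to forget.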

An Example

Here is an example from some of the references used in this paper and this one. If you are interested in a more detailed write-up, Sections 3 and 4 of this paper have an overview (and are where these two references are cited).

Take a k-dimensional unit sphere.
Partition the sphere into n domains D~1~,...,D~n~ of equal measure and diameter <
$0.5 \epsilon/\sqrt{k}$.  Choose a point in each set, call the set of all points $P$.
Consider the graph with vertex set $V_1 \cup V_2$ where $V_i$ is isomorphic to $P$.

1. Join an edge $x \in V_1$ to $y \in V_2$ if $d(x,y) < \sqrt{2} - \epsilon/\sqrt{k}$.
2. Join $x,y \in V_i$ if $d(x,y) > 2 - \epsilon/\sqrt{k}$.

This graph has small independence number by properties 3 and the theorem about the
diameter, has  a large number of edges by property 1, and has no $K_4$ by the BE
Rombus theorem.

~~~ {.bib}
@article {beg-bollobas76,
    AUTHOR = {Bollob{\'a}s, B{\'e}la and Erd\"{o}s, Paul},
     TITLE = {On a {R}amsey-{T}ur\'an type problem},
   JOURNAL = {J. Combinatorial Theory Ser. B},
  FJOURNAL = {Journal of Combinatorial Theory. Series B},
    VOLUME = {21},
      YEAR = {1976},
    NUMBER = {2},
     PAGES = {166--168},
   MRCLASS = {05C99},
  MRNUMBER = {MR0424613 (54 \#12572)},
MRREVIEWER = {R. L. Graham},
       URL = {http://www.renyi.hu/~p_erdos/1976-20.pdf}
}
~~~

Rodl extended this construction to produce a graph with independence number o(n)
which does not contain either $K_4$  or $K_{3,3,3}$.

Let G be the Bollobas-Erdos graph described above, and let H be the spanning subgraph
consisting of all edges inside a part.  There exists a blowup H' of H where each
vertex is blown up into an independent set of size t and H' satisfies the following
properties:

1. For all $xy \in E(H)$ and $X' \subseteq B_x$, $Y' \subseteq B_y$ with
$|X'| > \mu t$ and $|Y'| > \mu t$, then there exists at least one edge of H' joining
X' to Y'.
2. H' does not contain cycles of lengths 3,...,k

[Our paper](Ramsey-Turan#rt-balogh11) extends this type of construction to hypergraphs.

~~~ {.bib}
@article {beg-rodl85,
    AUTHOR = {R{\"o}dl, Vojt{\v{e}}ch},
     TITLE = {Note on a {R}amsey-{T}ur\'an type problem},
   JOURNAL = {Graphs Combin.},
  FJOURNAL = {Graphs and Combinatorics},
    VOLUME = {1},
      YEAR = {1985},
    NUMBER = {3},
     PAGES = {291--293},
      ISSN = {0911-0119},
     CODEN = {GRCOE5},
   MRCLASS = {05C35 (05C55)},
  MRNUMBER = {MR951018 (89h:05034)},
MRREVIEWER = {Yair Caro},
       DOI = {10.1007/BF02582954},
       URL = {http://dx.doi.org/10.1007/BF02582954},
}
~~~

You can notice several things from this example. The BibTeX has been embedded right in the comments; while that takes up a bunch of extra space on this page, in Vim I have folding so the BibTeX sections are collapsed to one line unless I use "zo" or "zO" to open the folds. Also, in the markup I use TeX equations and numbered lists, and I link to another page. Essentially I have all of pandoc's markup available.

Also, notice that both of the BibTeX blocks above contain a "URL" and an "MRNUMBER". As part of converting these pages to HTML using pandoc (see this post), the URLs turn into links which I can then click on to open up the paper, and the MRNUMBER is massaged into a link to MathSciNet. For those without access (you need to be an AMS member or log in through a university), MathSciNet is a giant database of all math papers with links to the paper, links to the papers it references, and links to papers which reference it. It is very helpful when discovering information about references.
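
As a rough illustration of the MRNUMBER massaging, the Python sketch below pulls the leading number out of a field such as `MR951018 (89h:05034)` and appends it to a MathSciNet lookup address. The base URL and the helper name `mathscinet_url` are assumptions made for this sketch, not the code used to build these pages, so check the address against MathSciNet before relying on it.

```python
import re
from typing import Optional

# Assumption: MathSciNet resolves items at this address by MR number.
MATHSCINET_ITEM = "https://mathscinet.ams.org/mathscinet-getitem?mr="


def mathscinet_url(mrnumber: str) -> Optional[str]:
    """Build a lookup link from a BibTeX MRNUMBER field, or None if no number."""
    # The field starts with an optional "MR" prefix and then the digits;
    # anything after that (such as the review reference) is ignored.
    m = re.match(r"\s*(?:MR)?(\d+)", mrnumber)
    if m is None:
        return None
    return MATHSCINET_ITEM + m.group(1)
```

For the two entries above, this would be called with the strings `MR0424613 (54 \#12572)` and `MR951018 (89h:05034)`.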

The BibTeX entries above were not written manually; MathSciNet already has the BibTeX fragments. So when I want to add a reference, I find it on MathSciNet and copy the BibTeX into the page (at some point I will perhaps write a Vim plugin to pull it in automatically). After that, I sometimes have to add the URL.