BibTeX and Pandoc

By John Lenz. June 15, 2012.

In the previous post, I described at a high level how I manage my LaTeX references. This post contains some Haskell source code for working with these commented BibTeX files. I use two tools: one extracts the references into a raw file usable by BibTeX, and the other renders the markdown files to HTML with links and TeX formulas. This post describes the first tool, which extracts the references. The next post describes the tool that converts the files to HTML.

The rule for my BibTeX files is that they are written in pandoc markdown format. The actual references appear in code blocks with a class of "bib". This is accomplished by a code block like the following.

~~~ {.bib}
bibtex goes here
~~~

See the first post for a fragment of an example. I keep the references spread out over several files and all files are stored in Mercurial.

Extracting BibTeX

To extract the references, I use the pandoc API as follows.

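-- Extract BibTeX entries from pandoc markdown files.
module MakeBib where
import Text.Pandoc
import System.Environment (getArgs)

-- Concatenate the contents of every top-level code block that carries
-- the "bib" class. Blocks nested inside lists or quotes are not visited.
extractBib :: Pandoc -> String
extractBib (Pandoc _ bl) = concatMap f bl
  where f (CodeBlock (_,classes,_) s) | "bib" `elem` classes = s ++ "\n"
        f _ = []

-- Parse one markdown document and pull out its BibTeX.
processFile :: String -> String
processFile = extractBib . readMarkdown defaultParserState

-- Read each file named on the command line and print its references.
main :: IO ()
main = getArgs >>= mapM readFile >>= mapM_ (putStrLn . processFile)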

The extractBib function goes through the list of Blocks. For CodeBlocks, it checks if there is a "bib" class. If the code block has the bib class, the content of the block is returned with a newline appended. All other blocks are ignored by extractBib. processFile and main take care of the boilerplate: main reads a list of filenames from the arguments, reads each file from disk, and processes the contents of each file with processFile. processFile calls readMarkdown from pandoc and passes the result to extractBib.
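To see the filtering in isolation, here is a minimal sketch that runs without pandoc installed. The Block and Attr types below are simplified stand-ins I made up for illustration, not pandoc's real definitions, and the sample document is hypothetical. Only the pattern match mirrors extractBib.

```haskell
module Main where

-- Stand-in types (NOT pandoc's real definitions): just enough structure
-- to exercise the same pattern match without installing pandoc.
type Attr = (String, [String], [(String, String)])

data Block
  = Para String
  | CodeBlock Attr String
  deriving (Show)

-- Same logic as extractBib, applied to a plain list of blocks.
extractBibBlocks :: [Block] -> String
extractBibBlocks = concatMap f
  where
    f (CodeBlock (_, classes, _) s) | "bib" `elem` classes = s ++ "\n"
    f _ = []

-- Hypothetical sample: a paragraph, a bib block, and an unrelated code block.
sample :: [Block]
sample =
  [ Para "Notes on a paper."
  , CodeBlock ("", ["bib"], []) "@article{key1,\n  title = {A Title}\n}"
  , CodeBlock ("", ["haskell"], []) "main = return ()"
  ]

-- Print only the contents of the bib block.
main :: IO ()
main = putStr (extractBibBlocks sample)
```

Only the contents of the block tagged "bib" are printed. The paragraph and the block tagged "haskell" fall through to the catch-all clause and contribute nothing.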

Running mkbib.hs

How to use this? First, make sure Haskell and pandoc are installed. On a Debian-based system, running "sudo apt-get install haskell-platform libghc-pandoc-dev" will install everything you need. Alternatively, install the Haskell Platform and then run "cabal install pandoc".

If you save the file as mkbib.hs somewhere, then at a shell you can run the following.

 $ runhaskell mkbib.hs SomeFileName AnotherFileName > refs.bib

"runhaskell" takes care of compiling and then executing the Haskell source. You could also compile it with "ghc --make -main-is MakeBib mkbib.hs"; the -main-is flag is needed because the module is named MakeBib rather than Main. Running mkbib.hs manually like this works, but it is a pain, and it is better to generate the file automatically. To do this, I use the following fragment in a makefile. I have a separate makefile for each paper, so each paper includes only the references from topics related to it.

PAGES= Quasirandom \
       Hypergraph?Quasirandom \
       Szemeredi?Theorem \
       Probabilistic?Methods \

ifneq "$(PAGES)" ''
PWITHPATH=$(addsuffix .page,$(addprefix ~/academic/references/,$(PAGES)))
$(REFNAME): $(PWITHPATH)
	runhaskell ~/academic/references/mkbib.hs $(subst ?,\ ,$(PWITHPATH)) > $(REFNAME)
endif

The bottom lines actually live in a generic makefile that I include from my project-specific makefiles, which is why they are so general. They add ".page" as a suffix and the full path to my references as a prefix to every page, and then define a rule that rebuilds "refs.bib" whenever any of the pages is updated.

What is all this stuff with the question mark? Make does not support file names containing spaces, but it does support file names containing wildcards, and "?" is a wildcard standing for any single character. So the pages are listed with a "?" wherever the file name has a space, and make works correctly. When running mkbib.hs, however, each "?" must be replaced by an escaped space, since the actual file names contain spaces. The end result is that it all works fine but is slightly convoluted. Perhaps I should just not have used spaces in the file names.
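As a rough illustration of what the make functions compute, the sketch below redoes the same string manipulation in Haskell. The function names withPath and escapeForShell are my own, chosen for this example; make itself does this work through addprefix, addsuffix, and subst.

```haskell
module Main where

-- Mirror of $(addprefix ...) and $(addsuffix ...): build the full page path.
withPath :: String -> String
withPath page = "~/academic/references/" ++ page ++ ".page"

-- Mirror of $(subst ?,\ ,...): turn each '?' into a backslash-escaped
-- space so that the shell sees a single file name.
escapeForShell :: String -> String
escapeForShell = concatMap g
  where
    g '?' = "\\ "
    g c   = [c]

-- Print the shell-ready path for two of the pages listed in PAGES.
main :: IO ()
main = mapM_ (putStrLn . escapeForShell . withPath)
  [ "Quasirandom", "Hypergraph?Quasirandom" ]
```

For the page "Hypergraph?Quasirandom", make sees the wildcard form as the prerequisite, while the shell command receives the path with a backslash followed by a space where the "?" was.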