BibTeX and Pandoc Part 2

By John Lenz. June 19, 2012.

This post is part three in my series on managing LaTeX references using pandoc. In the first post, I described at a high level how I manage my LaTeX references. In the second post, I gave some code to extract the BibTeX entries from the marked up files so that the references can be used by BibTeX. In this post, I will give some code to convert the markdown reference files to HTML for easy browsing.

Recall from the previous posts that the references are spread out over many files (currently I have 41 files) with comments in markdown and the actual BibTeX embedded in blocks with class "bib". With vim folding, the raw documents are easy to read and sometimes when I am looking up a reference I just open the files directly (CtrlP helps a lot).
The files can be converted to HTML using pandoc as follows:

$ pandoc -f markdown -t html --standalone --mathjax filename > output.html

This works, but the BibTeX is rendered as a block of text. We would really like the BibTeX to be rendered the usual way for references: author, title, journal, etc. I would also like to include direct links to the article, MathSciNet, and arXiv. So I wrote a tool that uses the pandoc API to read the markdown, transform the blocks, and write the document as HTML. The code for the tool is explained in detail below in Literate Haskell, but for those who just want to run the tool, here is how to do that. First, either copy the text of this webpage into a file with extension "lhs" or save this entire file to a file with extension "lhs". This file can then be compiled or run with ghc. For example, if you saved the code as "ref2html.lhs", then

$ ghc --make ref2html.lhs
$ ./ref2html *.markdown

will compile the code and then run it on all the markdown files in the current directory. The program will go through all the files given as input and produce an HTML file for each of them (the file will have the same name, except that the extension changes to html). Alternatively, the code can be run directly, without a separate compile step, with

$ runhaskell ref2html.lhs *.markdown

The Code

Here I document the code. First, some standard imports. The required libraries are the Haskell Platform and the pandoc and split packages from Hackage: "cabal install pandoc split".

module Main where

import Control.Applicative ((<$>))
import Control.Exception (throw)
import Control.Monad (liftM)
import Data.Maybe (fromMaybe)
import Data.Char (toLower, isAlpha)
import qualified Data.List as L
import qualified Data.List.Split as S
import System.FilePath (replaceExtension)
import System.Environment (getArgs)
import qualified Text.ParserCombinators.Parsec as P
import Text.ParserCombinators.Parsec ((<|>))
import Text.Pandoc

Parsing BibTeX

First we need to parse the BibTeX. Originally, I used citeproc-hs to parse and render the data. The main problem with this approach is I wanted access to several custom fields in the BibTeX like URL2 and MRNUMBER, and since citeproc-hs parses everything into a common format, that data was lost. Also, citeproc-hs is somewhat annoying to compile since (for BibTeX) it requires bibutils.

So instead, I just wrote a simple BibTeX parser using parsec. If I were writing this code now I would probably use either attoparsec or trifecta, since these seem to be the two modern successors to parsec. But parsec is still going strong and is more than enough for this simple parsing. For those interested in learning more about parsec, I suggest Chapter 16 of Real World Haskell. In fact, the whole book is great.

In any case, here is the parsec parser. A BibTeX entry consists of a name and a list of keys and attributes.

data Bibtex = Bibtex String [(String,String)]

A BibTeX file is a list of entries separated by spaces.

bibParser :: P.Parser [Bibtex]
bibParser = do x <- P.sepEndBy bibEntry P.spaces
               P.eof
               return x

A BibTeX entry starts with '@'. Next comes the type of entry (e.g. article, book) which we ignore by skipping everything until the first {. Next we parse the open brace and then the name of the entry, which is everything up until the first comma. We then skip over the comma and any whitespace and newlines. Next, we parse the list of attributes using the bibAttr parser, with attributes separated by spaces, newlines, and commas. Finally, we parse the closing }.

bibEntry :: P.Parser Bibtex
bibEntry = do P.char '@'
              P.many $ P.noneOf "{"
              P.char '{'
              name <- P.many $ P.noneOf ","
              P.many $ P.oneOf " \t\n,"
              attrs <- P.sepEndBy bibAttr $ P.many1 $ P.oneOf " \t\n,"
              P.char '}'
              return $ Bibtex name attrs

An attribute consists of a key, which is a run of letters and digits, then an equals sign, and then a value wrapped in braces. We convert the keys to lowercase since BibTeX is case insensitive.

bibAttr :: P.Parser (String, String)
bibAttr = do key <- P.many (P.letter <|> P.digit)
             P.spaces
             P.char '='
             P.spaces
             P.char '{'
             val <- bibVal
             P.char '}'
             return (map toLower key, val)

A value consists of a block of characters. The value is allowed to have embedded braces for LaTeX commands, for example \"{u} might appear to show a u with an umlaut. Or braces can appear to force BibTeX to leave the content alone (BibTeX will change capitalization of titles and other formatting like this, and putting it in braces tells BibTeX not to do this). The following parser loads the value and strips out any embedded braces, since we don't really care about them here. The matched stuff for braces is needed so that the bibVal does not consume the closing brace expected by bibAttr above. (Otherwise, we could just skip over braces.)

bibVal :: P.Parser String
bibVal = liftM concat $ P.many1 (bibValMatched <|> (liftM (:[]) (P.noneOf "{}")))

bibValMatched :: P.Parser String
bibValMatched = P.between (P.char '{') (P.char '}') bibVal
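
To see the brace handling in isolation, here is a small standalone sketch. It repeats the two parsers from above so that it can be loaded on its own, and adds a helper called bracedValue, which is a name I made up for this example and is not part of ref2html. The helper parses one value wrapped in braces, the same way bibAttr does.

```haskell
import Control.Monad (liftM)
import qualified Text.ParserCombinators.Parsec as P
import Text.ParserCombinators.Parsec ((<|>))

-- The two value parsers, repeated so this sketch stands alone.
bibVal :: P.Parser String
bibVal = liftM concat $ P.many1 (bibValMatched <|> liftM (:[]) (P.noneOf "{}"))

bibValMatched :: P.Parser String
bibValMatched = P.between (P.char '{') (P.char '}') bibVal

-- Hypothetical helper (not in ref2html): parse a single braced value,
-- returning Nothing on a parse error.
bracedValue :: String -> Maybe String
bracedValue = either (const Nothing) Just . P.parse value "demo"
  where value = P.between (P.char '{') (P.char '}') bibVal <* P.eof

-- bracedValue "{The {B}ib{TeX} format}"  gives  Just "The BibTeX format"
-- bracedValue "{}"                       gives  Nothing
```

The second case is worth knowing about: because bibVal uses many1, an empty value such as "{}" is a parse error rather than an empty string.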

Rendering BibTeX to Pandoc

Now some code to render a list of Bibtex entries into pandoc's types. The first step is to sort the entries by key and render them as a definition list, calling renderEntry on each entry.

renderEntries :: [Bibtex] -> Block
renderEntries lst = DefinitionList $ map display lst'
    where lst' = L.sortBy (\(Bibtex a _) (Bibtex b _) -> compare a b) lst
          display (Bibtex key b) = ([Strong [Str key]], [[Plain $ renderEntry key b]])

Rendering an entry consists of turning the attributes into pandoc's types.

type BibtexAttr = [(String, String)]

Render some attribute as a string.

render1 :: BibtexAttr -> String -> Inline
render1 b s = case lookup s b of
                  Just x  -> Str x
                  Nothing -> Str ""

Render some attribute as a link to the article.

articleLink :: String -> BibtexAttr -> Inline
articleLink s b = case lookup s b of
                      Just x  -> Link [Str "article"] (x, [])
                      Nothing -> Str ""

Render the mrnumber as a link to MathSciNet.

mrNumber :: BibtexAttr -> Inline
mrNumber b = case lookup "mrnumber" b of
                 Just x  -> mkURL x
                 Nothing -> Str ""
    where mkURL x | length x > 2 = Link [Str "MathSciNet"] (mathSciNet ++ mrNum x, [])
          mkURL _ = Str ""
          mrNum = dropWhile isAlpha . head . words
          mathSciNet = ""
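
The mrNum helper keeps the first word of the field and drops any leading letters. Since head fails on an empty list, a field containing only whitespace would crash it. Below is a minimal sketch of a total variant. The name mrNumSafe and the sample field value in the comment are assumptions made for illustration, not something taken from my reference files.

```haskell
import Data.Char (isAlpha)

-- Hypothetical total variant of mrNum: take the first word of the field
-- and strip leading letters, or return Nothing if there is no word.
mrNumSafe :: String -> Maybe String
mrNumSafe field = case words field of
  (w:_) -> Just (dropWhile isAlpha w)
  []    -> Nothing

-- mrNumSafe "MR2345678 (2009a:05123)"  gives  Just "2345678"
-- mrNumSafe "   "                      gives  Nothing
```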

Render a link to arXiv.

arxiv :: BibtexAttr -> Inline
arxiv b = case lookup "arxiv" b of
              Just x  -> mkURL x
              Nothing -> Str ""
    where mkURL x = Link [Str "arXiv"] (url ++ dropWhile isAlpha x, [])
          url = ""

As mentioned above, the BibTeX might contain LaTeX commands for diacritics. This function expands those into their proper Unicode representation for nice display.

expandTex :: String -> String
expandTex ('\\':a:'{':b:'}':xs) = expandTex ('\\':a:b:xs)
expandTex ('\\':'\'':a:xs) = a' : expandTex xs
    where a' = case a of
                 'a' -> 'á'
                 'e' -> 'é'
                 'o' -> 'ó'
                 _   -> a
expandTex ('\\':'H':'o':xs) = 'ő' : expandTex xs
expandTex ('\\':'"':a:xs) = a' : expandTex xs
    where a' = case a of
                 'a' -> 'ä'
                 'e' -> 'ë'
                 'o' -> 'ö'
                 _   -> a
expandTex (a:xs) = a : expandTex xs
expandTex [] = []
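
If the list of diacritics grows, one option is to move the accents into a lookup table. The following is only a sketch of that idea and not the code the tool uses: the names accentTable and expandTexTable are made up for this example, and the entry for ü is an addition of mine. Unlike expandTex, this variant leaves an unknown accent sequence untouched instead of dropping the accent command.

```haskell
-- Hypothetical table: (accent command, base letter) -> Unicode character.
accentTable :: [((Char, Char), Char)]
accentTable =
  [ (('\'', 'a'), 'á'), (('\'', 'e'), 'é'), (('\'', 'o'), 'ó')
  , (('"', 'a'), 'ä'), (('"', 'e'), 'ë'), (('"', 'o'), 'ö')
  , (('"', 'u'), 'ü')   -- extra entry, not in the original function
  , (('H', 'o'), 'ő')
  ]

expandTexTable :: String -> String
-- Rewrite \c{b} to \cb first, exactly as expandTex does.
expandTexTable ('\\':c:'{':b:'}':xs) = expandTexTable ('\\':c:b:xs)
-- Replace \cb by its Unicode character when the table has an entry;
-- otherwise the guard fails and the next clause copies the backslash.
expandTexTable ('\\':c:b:xs)
  | Just u <- lookup (c, b) accentTable = u : expandTexTable xs
expandTexTable (x:xs) = x : expandTexTable xs
expandTexTable [] = []

-- expandTexTable "Erd\\H{o}s"  gives  "Erdős"
-- expandTexTable "G\\\"odel"   gives  "Gödel"
```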

Convert the author from "Lastname, Firstname" to "Firstname Lastname".

prettyAuthor :: String -> String
prettyAuthor x = L.intercalate ", " $ map fixOne $ S.splitOn " and" x
    where fixOne s = case S.splitOn "," s of
                         []     -> ""
                         [a]    -> a
                         (f:xs) -> concat xs ++ " " ++ f
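
Note that fixOne keeps the space that follows the comma, so the result can contain extra whitespace. That is harmless in HTML, where runs of spaces collapse. For anyone who wants clean strings, here is a sketch of a variant that trims as it goes and uses only base. The names trim, splitAuthors, flipName and prettyAuthorTrimmed are all made up for this example and are not part of ref2html.

```haskell
import Data.Char (isSpace)
import Data.List (dropWhileEnd, intercalate)

-- Remove leading and trailing whitespace.
trim :: String -> String
trim = dropWhileEnd isSpace . dropWhile isSpace

-- Split an author field on the word "and", normalising whitespace.
splitAuthors :: String -> [String]
splitAuthors = map unwords . go . words
  where go ws = case break (== "and") ws of
          (a, [])     -> [a]
          (a, _:rest) -> a : go rest

-- Turn "Lastname, Firstname" into "Firstname Lastname";
-- a name without a comma is returned unchanged (apart from trimming).
flipName :: String -> String
flipName s = case break (== ',') s of
  (lastName, ',':firstNames) -> trim firstNames ++ " " ++ trim lastName
  _                          -> trim s

prettyAuthorTrimmed :: String -> String
prettyAuthorTrimmed = intercalate ", " . map flipName . splitAuthors

-- prettyAuthorTrimmed "Lenz, John and Doe, Jane"  gives  "John Lenz, Jane Doe"
```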

Render a single entry. First, we add raw HTML to label this entry with its name. This allows HTML links like "pagename#name" to work properly. Next, we display the rendered BibTeX attributes, skipping any that are empty and interspersing the rest with commas.

renderEntry :: String -> BibtexAttr -> [Inline]
renderEntry name b = raw ++ entries
  where
    raw = [(RawInline "html" $ "<a name=\"" ++ name ++ "\">")]

    entries = L.intersperse (Str ", ") $ filter (not . isEmptyStr)
        [ mapInline (prettyAuthor . expandTex) $ render1 b "author"
        , mapInline (\a -> "\"" ++ a ++ "\"") $ render1 b "title"
        , Emph [render1 b "journal"]
        , render1 b "year"
        , mrNumber b
        , articleLink "url" b
        , articleLink "url2" b
        , arxiv b
        ]

    mapInline f (Str s) = Str $ f s
    mapInline _ x = x
    isEmptyStr (Str "") = True
    isEmptyStr _        = False


First, a function which processes a pandoc block, converting each code block marked with the "bib" class using the above functions. If the parser fails, the function displays the error instead of the rendered entries, which is helpful for debugging.

transformBlock :: Block -> Block
transformBlock (CodeBlock (_, classes, namevals) contents) | "bib" `elem` classes
    = case P.parse bibParser "" contents of
        Left err -> BlockQuote [Para [Str $ "Error parsing bib data: " ++ show err]]
        Right x  -> renderEntries x
transformBlock x = x

The WriterOptions. I have set the options to use MathML. But if you want to use MathJax or some other method, it is easy to change (see this post, the pandoc user guide, and HTMLMathMethod).

wOptions :: IO WriterOptions
wOptions = do t <- either throw id <$> getDefaultTemplate Nothing "html"
              return $ defaultWriterOptions
                         { writerStandalone = True
                         , writerTemplate = t
                         , writerHTMLMathMethod = MathML Nothing
                         }

A function which reads a file from disk, transforms it, and writes it back to disk.

processFile :: FilePath -> IO ()
processFile filename = do
    r <- readFile filename
    opts <- wOptions
    let (Pandoc m blocks) = readMarkdown defaultParserState r
        newp = Pandoc m $ map transformBlock blocks
        newhtml = writeHtmlString opts newp
    writeFile (replaceExtension filename "html") newhtml

And finally the main function, which loads filenames from the argument list and processes them one by one.

main :: IO ()
main = getArgs >>= mapM_ processFile


I wrote this code a while back, before I knew about Hakyll. The above code could be converted to a Hakyll site: first turn transformBlock from above into a function Pandoc -> Pandoc and then render pages with pageCompilerWithPandoc. Since the script works fine, I haven't bothered to set up Hakyll.