Diving into Pandoc I

2019 May 28

Pandoc works by converting documents to and from its own internal representation of a document. We’re going to look at the process of writing a document from this representation and tackle reading to it in another post.

The most important basic types for Pandoc are actually in the pandoc-types package, which is where we find

A document is thus a list of Blocks with attached metadata. What is a Block, then? The definition is actually well-commented:

OK, now, many of these data constructors make reference to a type for inline elements, for instance a paragraph Para is simply a list of Inline elements.

So far so good. My guess is that we can see HTML’s influence (and, transitively, markdown’s) in particular with this split into inline and block elements. Let’s look at the HTML writer, then, located in the Text.Pandoc.Writers.HTML module.

From each writer module only the write... functions are exported. (Most writer modules only have one or two of these, for instance the Markdown writer only exports writeMarkdown and writePlain.) Notice that we have both a writeHtml5 and a writeHtml5String function. The type signatures of these differ only in the return type:

We know what Text is. The Html type is actually defined in the required blaze-html library (apparently also written by Haskell wunderkind Jasper van der Jeugt, author of Hakyll—I say “…kind” and not “…adult” because he’s a year younger than me). I don’t want to get into blaze now, it will come up later though. We can make progress without really understanding it, though.

writeHtml' is actually defined in terms of this function: we apply writeHtmlString' and feed the result to the preEscapedText function from the blaze library.

On the first line of this function (after the do) we see that we first apply a function called pandocToHtml to our writer options (a value of type WriterOptions from the Text.Pandoc.Options module) and the document d. We will see that this function returns in a transformed monad with state (not sure on the proper wording here) so we run the stateful computation with evalStateT and the starting state st of type WriterState (defined earlier in this module). The result is a tuple within the monad, which unwrapped is assigned to (body, context).

Once we have this, the function looks at the provided options and if no template is set, we discard the context and apply the renderHtml' function to body and then return this. renderHtml' is simply the composition of converting the Html value to the standard library’s lazy text type using the renderHtml from blaze and then applying Data.Text.Lazy.toStrict, converting this to strictly evaluated Text.

If there is a template, we make sure the lang field is set in the context, then add a pagetitle field to the context if one isn’t set already before finally rendering the document with renderTemplate function. This function takes a template and a context and applies the template. For this, our body value is renderHtml'’d and placed in the body field in our context. Let’s look at the type of renderTemplate' just to make sure we’re on the right track here.

If we look at the definition for WriterOptions, we see the writerTemplate field is indeed a Maybe String, so yes, we pattern match on the Maybe to see if there’s a template and if so we have a String to feed to renderTemplate'.

Now let’s take a look at the function that creates an Html value from a Pandoc document.

So, this function takes a set of options (a type defined earlier) and a Pandoc document and returns a tuple (thebody, context) wrapped in the StateT WriterState m monad. context is context, of type Value, from the aeson library.

thebody is an Html value that we see is defined with a let as blocks' >> notes. The line above this binding reveals that by “notes” we mean footnotes; the footnoteSection function takes a WriterOptions and the list of Html values stored in the state and returns something of type StateT WriterState m Html.

As for blocks', let’s analyze what’s going on here. Provided we don’t enable slide output, sects is just blocks converted from [Block] to [Element]. (An Element is either a Block or a list of Elements with some annotations.) Over this list we mapM a function that converts each Element to an Html object; here mapM takes the type (Element -> StateT WriterState m Html) -> [Element] -> StateT WriterState m [Html].

Now we liftM the (mconcat . intersperse (nl opts)) function to operate on the [Html] within the monad. nl looks at opts and returns the Html for a newline if that option is specified, so when we intersperse we’re either inserting a newline or the identity for the Html monoid between each Html block. Finally, we smoosh the [Html] into Html using its monoidal mconcat function. We end up with a value of type StateT WriterState m Html.

blocks' takes the value Html, as does notes. Thus we now see (>>) instantiates (wording?) as Html -> Html -> Html. Html must be a monad. Hmm.

My understanding of how this works is thanks to John MacFarlane, who answered my question on the pandoc-discuss Google group. You might expect, as I did, that since we use >> instead of >>= here, we throw away the blocks' HTML and just return the computation notes. But the Html monad is interesting. If we look at the blaze library where Html is defined, we see it’s just a type synonym for Markup, which itself is a synonym for MarkupM (). So there’s no actual value at the bottom contained in this monad—all the HTML is represented in the innards of the monad itself! This is the kind of thing I need to develop more intuition for. Really it’s no different than storing a state in a monad, I guess.

So, we could definitely look at blaze in another post, but here it’s enough to understand that x >> y, or do x; y yields an Html value that contains x followed by y, where “contains” and “followed by” are specific to how MarkupM is defined.

For reference, from the signatures above, m must be a member of the PandocMonad typeclass. If we’re actually doing IO, this will be of type PandocIO,

which is a stack of IO, State, and Except monads. The only way to get out of the PandocIO monad is with runIO:

We will get into this more in a later post.

So the key here is the elementToHtml function.

Either we have just a block Blk block or a section, which requires more sophisticated pattern matching. Let’s focus on the simple case when we just have a block, and here we simply apply a function blockToHtml. blockToHtml :: WriterOptions -> Block -> StateT WriterState m Html is defined by pattern matching on the Block type, which we showed earlier. For example,

As we would expect, many of the blockToHtml definitions use another function for processing the Inline elements they are made up of.

For a list of Inlines (such as a paragraph), we have

Simple enough: we map an inlineToHtml function over the elements of the list and concat them together using the monoid instance for Html, returning this result into the monad. This inlineToHtml function is similar, but uses a case expression. Here’s a partial definition:

Finally, we need one more function.

Notice we can return "" as a valid html value because Html is an instance of IsString and we have {-# LANGUAGE OverloadedStrings #-} enabled. (I’m oversimplifying here but I don’t want to get into the details of how the blaze library works).

And that’s a very rough overview. Like I mentioned, I next want to go through a reader (probably markdown), and after that get into some of the weeds on how the higher-level processing works.