Writing a Transpiler For a Subset of the Markdown Language

For a personal project, I needed to generate a Table of Contents from given Markdown content; however, the Markdown had to produce different HTML output than the CommonMark specification prescribes. The rules for getting from Markdown to HTML were similar to how Rust's mdBook handles its Table of Contents.

While it's possible to do an ad-hoc conversion without paying too much attention to "how it's supposed to be done", I've been learning more about compilers, lexers, parsers, and transpilers, so I thought this was a good way to dip my toes into the field.

This tool is a transpiler (code to code) rather than a compiler (higher-level to lower-level code), because the two languages we're converting between, Markdown and HTML, are both markup languages at roughly the same level of abstraction.

Goal

We want to go from this:


# Title

[About](index.md)

- [Getting Start]()

- [CLI Usage](usage/usage.md)
  - [init](usage/init.md)

to this:


<ol class="toc">
  <li class="part-title">Title</li>

  <li><a href="/about">About</a></li>

  <li class="chapter-item">
    <strong>1.</strong>
    <a class="active" href="/index.html">About</a>
  </li>

  <li>
    <ol class="section">
      <li class="chapter-item draft">
        <strong>5.1.</strong>
        Table of Contents Format
      </li>
    </ol>
  </li>
  <li class="part-title">Development</li>
  <li class="chapter-item">
    <strong>6.</strong>

    <a class="" href="/contributing.html">Contributing</a>
  </li>
</ol>

How do we go about doing it? Intuitively, we can map sequences of characters in one format onto sequences in the other. For instance,

# Title

maps to

<li class="part-title">Title</li>

and

[About](index.md)

maps to


<li class="chapter-item">
  <strong>1.</strong>
  <a class="active" href="/index.html">About</a>
</li>

and

- [init](usage/init.md)

maps to


<li>
  <ol class="section">
    <li class="chapter-item draft">
      <strong>5.1.</strong>
      Table of Contents Format
    </li>
  </ol>
</li>

To go further into the mapping between the two formats, we can look at specific characters:

  • A hash/number sign # in the Markdown format denotes that we are starting a list item tag <li></li>; perhaps we can call this a title
  • A left bracket [ followed by letters, spaces, and/or other characters, closed by a ], and then followed by a pair of parentheses (), denotes a link tag <a></a>
  • A dash - in the Markdown format maps to the beginning of a list item tag <li></li> in the HTML format. When combined with a link, it also adds numbering, so we can call this a list item

Finally, we wrap all of these elements in an ordered list <ol></ol> tag. As you've probably noticed, we haven't introduced any formal terminology or talked about how we're actually going to map between the two formats, so that's what we'll look at next.

Terminology and Mapping

Before we start delving into the code, let's introduce some useful terminology for the components that we'll need.

First, we'll need a method for reading the characters from the text stream. This component is called the Reader: given a text, we iterate through each of the characters.

Next, we'll start combining characters into meaningful tokens; this is called lexing and is the job of the Lexer.

Afterward, we'll combine the different Tokens to produce a new Syntax Tree. This is called parsing and is done by the Parser.

In the final step, we'll go through the generated Syntax Tree and write a Renderer function which will generate the HTML.

Furthermore, you can divide the different components into two stacks: a frontend and a backend. The frontend is usually responsible for lexing and parsing the code, whereas the backend transforms the result into something else.

In our case the frontend stack consists of the following components:

  • Reader
  • Lexer
  • Parser

And the backend stack consists of only one component:

  • Renderer

Reader

The reader is the simplest of the components; its only purpose is to retrieve characters from different sources, be it a string, a file, etc. The reader keeps internal state of which character it's currently reading and exposes some public helper methods for retrieving a single character from the stream.


'hello + world' -> ['h','e','l','l','o',' ','+',' ','w','o','r','l','d']

Lexer

The Lexer is the next simplest part of the frontend stack; its main purpose is to create meaningful tokens from the stream of characters it gets from the Reader (this is also called lexical analysis). It discards unimportant characters like spaces (unless they are part of the lexical grammar, as in Python). Similarly to the reader, it exposes some public methods that let you extract the newly minted tokens.


['h','e','l','l','o',' ','+',' ','w','o','r','l','d'] -> ['hello', '+', 'world']

Parser

In our program, the Parser is the most advanced component. It is responsible for generating a Syntax Tree or Abstract Syntax Tree (AST): a tree structure that represents the grammatical structure of the input. The difference between a Syntax Tree and an Abstract Syntax Tree is that the Syntax Tree includes all of the information in the source, including superfluous details such as whitespace, whereas the AST is an abstract representation of the source that leaves the superfluous information out.


['hello', '+', 'world']

->

{
  type: 'expression',
  children: [
    { token: 'string', lexeme: 'hello' },
    { token: 'string', lexeme: '+' },
    { token: 'string', lexeme: 'world' },
  ],
}

Renderer

The renderer does the final transformation, from an AST/Syntax Tree to the desired output language, in our case, HTML.


{
  type: 'expression',
  children: [
    { token: 'string', lexeme: 'hello' },
    { token: 'string', lexeme: '+' },
    { token: 'string', lexeme: 'world' },
  ],
}

->

<div>hello</div>
<div>+</div>
<div>world</div>

The Road to HyperText Markup Language

Next up we'll implement the Reader, Lexer, Parser and Renderer.

1 Build the Reader

Let's start with the simplest component, the Reader. In our case, it's simply a function that takes a string as input and provides some methods to interact with that string. Like almost all of the components, it's a stateful function: it keeps track of which character we are currently reading.

The main methods it exposes are:

  • peek: look up the nth character ahead of where we currently are
  • consume: return the current character and move to the next character
  • isEOF: Check if we have reached the end of the string

function TocReader(chars: string) {
  let i = 0;

  function peek(nth: number = 0) {
    return chars[i + nth];
  }

  function consume(nth: number = 0) {
    const c = chars[i + nth];
    i = i + nth + 1;

    return c;
  }

  function isEOF() {
    return chars.length - 1 < i;
  }

  return Object.freeze({
    peek,
    consume,
    isEOF,
  });
}
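
To get a feel for the reader, here's a small usage sketch (assuming the TocReader above is in scope):

const reader = TocReader('# T');

reader.peek();    // '#'  (look at the current character without advancing)
reader.consume(); // '#'  (return the current character and advance)
reader.peek(1);   // 'T'  (look one character past the current one)
reader.isEOF();   // false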

2 Define the Grammar

Next up we'll have to identify the tokens we're going to allow in our DSL (Domain-Specific Language).

We want to take this text:


# Title

[About](index.md)

---

- [Getting Start]()

- [CLI Usage](usage/usage.md)
  - [init](usage/init.md)

and find the tokens that represent it. One interpretation is:


| Text                | Tokens                                                                                  |
|---------------------|-----------------------------------------------------------------------------------------|
| # Title             | [HASH, SPACE, STRING]                                                                   |
| [About](index.md)   | [LEFT_BRACE, STRING, RIGHT_BRACE, LEFT_PAREN, STRING, RIGHT_PAREN]                      |
| ---                 | [HORIZONTAL_RULE]                                                                       |
| - [Getting Start]() | [DASH, SPACE, LEFT_BRACE, STRING, RIGHT_BRACE, LEFT_PAREN, RIGHT_PAREN]                 |
| - [Usage](usage.md) | [DASH, SPACE, LEFT_BRACE, STRING, RIGHT_BRACE, LEFT_PAREN, STRING, RIGHT_PAREN]         |
|  - [Init](init.md)  | [INDENT, DASH, SPACE, LEFT_BRACE, STRING, RIGHT_BRACE, LEFT_PAREN, STRING, RIGHT_PAREN] |
|                     | [EOF]                                                                                   |

Then the tokens are:


enum TokenType {
  LEFT_PAREN,
  RIGHT_PAREN,
  LEFT_BRACE,
  RIGHT_BRACE,
  SPACE,
  INDENT,
  DASH,

  STRING,
  HASH,
  HORIZONTAL_RULE,

  EOF,
}

interface Token {
  type: TokenType;
  lexeme: string | null;
}

The Token interface, which represents our token, consists of two properties: type and lexeme. The type is set to one of the TokenTypes, and the lexeme is the value of the token: null for tokens that carry no value and a string otherwise.
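
For instance, the STRING token for "About" and the valueless RIGHT_BRACE token could be written as the following literals (shown here just to illustrate the interface):

const about: Token = { type: TokenType.STRING, lexeme: 'About' };
const closingBrace: Token = { type: TokenType.RIGHT_BRACE, lexeme: null };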

The rules that determine how a particular language groups characters into lexemes are called its lexical grammar. Next up, we'll build our Lexer.

3 Start Lexing

Similar to the Reader component, the Lexer will have three public methods, which the Parser will use to construct the AST:

  • peek
  • consume
  • isEOF

A lexer is a stateful object: it contains a pointer/index to the current element being read and an array of tokens. The core of the lexer is the function responsible for actually lexing, i.e. creating tokens from the stream of characters.

The private method scanToken is responsible for the lexing itself. It steps through the input stream (the output of the reader) and is driven by the run method, a while loop that keeps calling scanToken until we reach the end of the input, at which point an EOF token is added.

There are several ways to implement the scanToken method. You could, for instance, use a recursive state function that calls itself until it hits EOF; here I chose a simple switch statement, with the outer loop in run responsible for calling scanToken until we hit EOF.

The switch statement looks at the current character being read from the Reader and tries to match it against a token in the lexical grammar. For simple tokens, such as LEFT_BRACE, it can match one-to-one and add the token directly, but for others, such as HORIZONTAL_RULE, it uses reader.peek to look ahead. For more complicated symbols, such as strings, we define a dedicated function that has to look several characters ahead.


function TocLexer(reader: any) {
  let i = 0;
  const tokens: Token[] = [];

  run();

  function run() {
    while (!reader.isEOF()) {
      scanToken();
    }

    addToken(TokenType.EOF);
  }

  function scanToken() {
    const c = reader.consume();

    switch (c) {
      case '\n':
        break;
      case ' ':
        if (reader.peek() === ' ') {
          addToken(TokenType.INDENT);
          reader.consume();
        }
        break;
      case '[':
        addToken(TokenType.LEFT_BRACE);
        break;
      case ']':
        addToken(TokenType.RIGHT_BRACE);
        break;
      case '(':
        addToken(TokenType.LEFT_PAREN);
        break;
      case ')':
        addToken(TokenType.RIGHT_PAREN);
        break;
      case '#':
        header();
        break;
      case '-':
        if (reader.peek() === '-' && reader.peek(1) === '-') {
          reader.consume();
          reader.consume();
          addToken(TokenType.HORIZONTAL_RULE);
        } else {
          addToken(TokenType.DASH);
        }
        break;
      default:
        stringLiteral(c);
        break;
    }
  }

  function header() {
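    // Slurp the rest of the heading line; the trimmed text becomes the HASH
    // token's lexeme, so '# Title' lexes to a single HASH token.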
    let text = '';
    while (reader.peek() !== '\n' && !reader.isEOF()) {
      text += reader.consume();
    }

    text = text.trim();

    addToken(TokenType.HASH, text);
  }

  function stringLiteral(c: string) {
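    // Consume characters until a newline, EOF, or an unmatched closing ']' or
    // ')'. Brackets and parens opened inside the text itself are counted so
    // that they remain part of the string.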
    let text = c;
    let unClosedParens = 0;
    let unClosedBraces = 0;

    while (reader.peek() !== '\n' && !reader.isEOF()) {
      const nth = reader.peek();

      if (nth === '(') {
        unClosedParens += 1;
      } else if (nth === ')' && unClosedParens === 0) {
        break;
      } else if (nth === ')') {
        unClosedParens -= 1;
      } else if (nth === '[') {
        unClosedBraces += 1;
      } else if (nth === ']' && unClosedBraces === 0) {
        break;
      } else if (nth === ']') {
        unClosedBraces -= 1;
      }

      // All BRACE and PARENS must be closed
      text += reader.consume();
    }

    addToken(TokenType.STRING, text);
  }

  function addToken(type: TokenType, lexeme: string | null = null) {
    tokens.push({ type, lexeme });
  }

  function peek(nth: number = 0) {
    return tokens[i + nth].type;
  }

  function consume(nth: number = 0) {
    const t = tokens[i + nth];
    i = i + nth + 1;

    return t;
  }

  function isEOF() {
    return tokens.length - 1 < i;
  }

  function printTokens() {
    tokens.forEach(k => {
      console.log(`type: ${TokenType[k.type]} lexeme: ${k.lexeme}`);
    });
  }

  function printTokenType() {
    tokens.forEach(k => {
      console.log(`${TokenType[k.type]}`);
    });
  }

  return Object.freeze({ peek, consume, isEOF, printTokens, printTokenType });
}
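
Here is a minimal sketch of wiring the reader into the lexer (assuming both factories above are in scope; printTokens is only there for debugging):

const reader = TocReader('[About](index.md)');
const lexer = TocLexer(reader);

// Dump the token stream while debugging
lexer.printTokens();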

Finally, after we have lexed the input, we will end up with an array containing the tokens:


const tokens: Token[] = [
  { type: TokenType.LEFT_BRACE, lexeme: null },
  { type: TokenType.STRING, lexeme: 'About' },
  { type: TokenType.RIGHT_BRACE, lexeme: null },
  { type: TokenType.LEFT_PAREN, lexeme: null },
  { type: TokenType.STRING, lexeme: 'index.md' },
  { type: TokenType.RIGHT_PAREN, lexeme: null },
];

4 Generate an Abstract Syntax Tree

Now that we have our tokens, we can start assembling them to create an AST. To do so, we have to describe how the AST is built, which we will do by creating a formal(ish) grammar.

A formal grammar is a pseudo-code(ish) description of our DSL. There are academic terms for this: one is parsing expression grammar (PEG), another is top-down parsing language (TDPL), and yet another is context-free grammar (CFG). They're all very formal, and since this is a less formal project, we won't be strict with our language.

Next up, we'll create a set of production rules: rewrite rules used to substitute symbols with other symbols, which can be applied recursively to generate new symbol sequences.

The production rules are of the form:

A -> a

where A is a nonterminal symbol and a is a string of terminals and/or nonterminals.


  expression -> hr | header | link | list | expression

  hr     ->  ---
  header ->  # STRING
  link   ->  "[" STRING "]" "(" STRING? ")" | list
  list   ->  "INDENT"+ "-" link

Terminal and nonterminal symbols are defined as follows:

  • terminal: when we encounter terminal symbols, we don't do any substitutions. For instance, --- and STRING are terminal symbols
  • nonterminal: nonterminal symbols can expand into other symbols, so we have to keep expanding them until we reach terminal symbols, as in the derivation sketched below. In our case, expression, hr, header, link, and list are nonterminal symbols
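
To see how the production rules play together, here is a rough derivation (treating the grammar informally, as above) of the nested entry - [init](usage/init.md) from our example, expanding nonterminals until only terminals remain:

  expression
    -> list
    -> INDENT "-" link
    -> INDENT "-" "[" STRING "]" "(" STRING ")"
    -> INDENT "-" "[" init "]" "(" usage/init.md ")"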

Now let's implement this grammar. We begin with the parse function, which runs through the tokens we got from the lexer until it reaches the end of the token stream. For each token, it parses an expression (with a switch statement) using the Lexer's peek and consume methods.


function TocParser(lexer: any) {
  function hr() {
    return { type: 'hr', indent: 0 };
  }

  function header(text: string) {
    return { type: 'header', text, indent: 0 };
  }

  function link(title: string, ref: string = '') {
    return { type: 'link', title, ref, indent: 0 };
  }

  function list(children: any[] = []) {
    return { type: 'list', children };
  }

  function listItem(title: string, ref: string = '', indent: number): any {
    return {
      type: 'listItem',
      title,
      ref,
      indent,
    };
  }

  function createAST(statements: any): any {
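    // Fold the flat list of parsed statements into a tree. listRefs maps each
    // indent level to the list node currently collecting items at that depth:
    // a deeper item opens a nested list under its parent, and a dedent drops
    // the reference to the deeper list so a fresh sibling list can start later.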
    const listRefs: any = { 0: list() };
    statements.forEach((v: any, i: number, arr: any) => {
      if (i > 0 && arr[i - 1].indent > v.indent) {
        delete listRefs[arr[i - 1].indent];
      }

      if (listRefs.hasOwnProperty(v.indent)) {
        listRefs[v.indent].children.push(v);
      } else {
        listRefs[v.indent] = {
          type: 'list',
          children: [v],
        };

        listRefs[v.indent - 1].children.push(listRefs[v.indent]);
      }
    });

    return listRefs[0];
  }

  function parse() {
    const statements = [];
    while (!lexer.isEOF()) {
      const expr = expression();

      if (expr) {
        statements.push(expr);
      }
    }

    return createAST(statements);
  }

  function expression(numIndent = 0): any {
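    // Each INDENT token bumps numIndent and recurses, so a nested list item
    // ends up carrying its indentation depth.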
    const token = lexer.consume();

    switch (token.type) {
      case TokenType.HORIZONTAL_RULE:
        return hr();
      case TokenType.HASH:
        return header(token.lexeme);
      case TokenType.LEFT_BRACE:
        return parseLink();
      case TokenType.INDENT:
        return expression(numIndent + 1);
      case TokenType.DASH:
        return parseListItem(numIndent);
      default:
    }
  }

  function parseListItem(numIndent = 0): any {
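    // A list item has the shape "[title](ref)". The peek/consume offsets below
    // jump over the RIGHT_BRACE and LEFT_PAREN tokens that sit between the
    // title and the (optional) ref string.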
    if (
      lexer.peek(0) === TokenType.LEFT_BRACE &&
      lexer.peek(1) === TokenType.STRING &&
      lexer.peek(2) === TokenType.RIGHT_BRACE
    ) {
      const title = lexer.consume(1);
      let ref: any = '';
      if (lexer.peek(2) === TokenType.STRING) {
        ref = lexer.consume(2);
      }

      lexer.consume(); // Consume right parens

      return listItem(title.lexeme, ref.lexeme, numIndent);
    }
  }

  function parseLink(): any {
    if (
      lexer.peek() === TokenType.STRING &&
      lexer.peek(1) === TokenType.RIGHT_BRACE
    ) {
      const title = lexer.consume();
      const ref = lexer.consume(2);
      lexer.consume();

      return link(title.lexeme, ref.lexeme);
    }
  }

  return parse();
}
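
A minimal sketch of chaining the frontend pieces together (assuming the factories above are in scope; note that TocParser calls parse() itself and returns the finished AST):

const markdown = '# Title\n\n[About](index.md)\n';
const ast = TocParser(TocLexer(TocReader(markdown)));

console.log(JSON.stringify(ast, null, 2));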

For instance, if we were to parse the following markdown snippet:


# markBook

---

[About](index.md)

- [Getting Start](getting-started.md)
  - [Getting Start](getting-started.md)
    - [Getting Start](getting-started.md)

We would end up with this AST:


{
  "type": "list",
  "children": [
      {
          "type": "header",
          "text": "markBook",
          "indent": 0
      },
      {
          "type": "hr",
          "indent": 0
      },
      {
          "type": "link",
          "title": "About",
          "ref": "index.md",
          "indent": 0
      },
      {
          "type": "listItem",
          "title": "Getting Start",
          "ref": "getting-started.md",
          "indent": 0
      },
      {
          "type": "list",
          "children": [
              {
                  "type": "listItem",
                  "title": "Getting Start",
                  "ref": "getting-started.md",
                  "indent": 1
              },
              {
                  "type": "list",
                  "children": [
                      {
                          "type": "listItem",
                          "title": "Getting Start",
                          "ref": "getting-started.md",
                          "indent": 2
                      }
                  ]
              }
          ]
      }
  ]
}

5 Generate HTML

Now that we have our AST, we can implement the Renderer, which will generate the desired HTML. The function responsible for this transformation is TocRender; it takes the AST object as input and walks through it to generate our HTML.

We start by writing our main loop, which handles each of the AST node types and wraps the output in an ol tag:


<ol class="toc">${ast.children.map((e: any) => {
      switch (e.type) {
        case "hr":
          return hr();
        case "header":
          return header(e);
        case "link":
          return link(e);
        case "listItem":
          order += 1;
          return listItem(e, [order]);
        case "list":
          return list(e, [order]);
        default:
      }
    })
    .join("")
  }
</ol>`;

Next, we write functions to handle the different HTML tags:


function formatUrl(currentFileName) {
  let link = stripExtension(currentFileName);
  link = link.replace(site.paths.content, "");

  return link;
}

function stripExtension(url) {
  let link = path.relative(site.paths.content, url);
  link = path.join("/", path.dirname(url), path.basename(url, ".md"));

  if (site.uglyURLs) {
    link += ".html";
  }

  return link;
}

function hr() {
  return '<li class="spacer"></li>';
}

function header(e: any) {
  return `<li class="part-title">${e.text}</li>`;
}

function link(e: any): any {
  let ref = e.ref !== "" ? stripExtension(e.ref) : "";
  const linkClass = ref === activePage ? "active" : "";

  // We treat index.md in root file differently
  if (ref === "/index") {
    ref = "/";
  }

  if (site.rootUrl) {
    ref = `${site.rootUrl}${ref}`
  }

  return ref
    ? `<li><a class="${linkClass}"  href="${ref}">${e.title}</a></li>`
    : `<li class="draft">${e.title}</li>`;
}

function listItem(e: any, order: number[]) {
  let ref = e.ref !== "" ? stripExtension(e.ref) : "";

  const linkClass = ref === activePage ? "active" : "";

  // We treat index.md in root file differently
  if (ref === "/index") {
    ref = "/";
  }

  if (site.rootUrl) {
    ref = `${site.rootUrl}${ref}`
  }

  return ref
    ? `
    <li class="chapter-item">
      <strong>${[...order, ""].join(".")}</strong>
      &nbsp;
      <a class="${linkClass}"
         href="${ref}">${e.title}</a>
    </li>
  `
    : `
    <li class="chapter-item draft">
      <strong>${[...order, ""].join(".")}</strong>
      &nbsp;
      ${e.title}
    </li>
  `;
}

function list(e: any, order: number[]): any {
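  // Render a nested <ol class="section">; recurse into child lists, extending
  // the order array so nested items get hierarchical numbering like "5.1.".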
  return `
      <li>
        <ol class="section">
          ${
    e.children
      .map((node: any, i: number) =>
        node.type === "list"
          ? list(node, [...order, i + 1])
          : listItem(node, [...order, i + 1])
      )
      .join("")
  }
        </ol>
      </li>
    `;
}

There are some ad-hoc manipulations in here, since we need to handle some site-specific functionality: adding an active class to the link for the page currently being rendered, treating the root index.md page specially, and prefixing links with the site's root URL.

Finally, here is the TocRender function in its entirety:


import * as path from "path";

function TocRender(site: Site, ast: any, currentFileName: string) {
  const activePage = formatUrl(currentFileName);

  function formatUrl(currentFileName) {
    let link = stripExtension(currentFileName);
    link = link.replace(site.paths.content, "");

    return link;
  }

  function stripExtension(url) {
    let link = path.relative(site.paths.content, url);
    link = path.join("/", path.dirname(url), path.basename(url, ".md"));

    if (site.uglyURLs) {
      link += ".html";
    }

    return link;
  }

  function hr() {
    return '<li class="spacer"></li>';
  }

  function header(e: any) {
    return `<li class="part-title">${e.text}</li>`;
  }

  function link(e: any): any {
    let ref = e.ref !== "" ? stripExtension(e.ref) : "";
    const linkClass = ref === activePage ? "active" : "";

    // We treat index.md in root file differently
    if (ref === "/index") {
      ref = "/";
    }

    if (site.rootUrl) {
      ref = `${site.rootUrl}${ref}`
    }

    return ref
      ? `<li><a class="${linkClass}"  href="${ref}">${e.title}</a></li>`
      : `<li class="draft">${e.title}</li>`;
  }

  function listItem(e: any, order: number[]) {
    let ref = e.ref !== "" ? stripExtension(e.ref) : "";

    const linkClass = ref === activePage ? "active" : "";

    // We treat index.md in root file differently
    if (ref === "/index") {
      ref = "/";
    }

    if (site.rootUrl) {
      ref = `${site.rootUrl}${ref}`
    }

    return ref
      ? `
      <li class="chapter-item">
        <strong>${[...order, ""].join(".")}</strong>
        &nbsp;
        <a class="${linkClass}"
           href="${ref}">${e.title}</a>
      </li>
    `
      : `
      <li class="chapter-item draft">
        <strong>${[...order, ""].join(".")}</strong>
        &nbsp;
        ${e.title}
      </li>
    `;
  }

  function list(e: any, order: number[]): any {
    return `
        <li>
          <ol class="section">
            ${
      e.children
        .map((node: any, i: number) =>
          node.type === "list"
            ? list(node, [...order, i + 1])
            : listItem(node, [...order, i + 1])
        )
        .join("")
    }
          </ol>
        </li>
      `;
  }

  let order = 0;
  return `<ol class="toc">${
    ast.children
      .map((e: any) => {
        switch (e.type) {
          case "hr":
            return hr();
          case "header":
            return header(e);
          case "link":
            return link(e);
          case "listItem":
            order += 1;
            return listItem(e, [order]);
          case "list":
            return list(e, [order]);
          default:
        }
      })
      .join("")
  }</ol>`;
}
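
Putting it all together, an end-to-end call could look something like this (the fs import, the toc.md filename, and the site configuration object are illustrative placeholders rather than part of the code above):

import * as fs from "fs";

const markdown = fs.readFileSync("toc.md", "utf-8");
const ast = TocParser(TocLexer(TocReader(markdown)));
const html = TocRender(site, ast, "toc.md"); // site: the Site config the renderer expects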

6 Results

Finally, we have our HTML. Here is an overview of the steps we took and what the data structures look like at each stage.

Fig.1 - Flowchart

Final Thoughts

There are a lot of important parts of writing transpilers that I've omitted:

  • Proper error handling (how users writing the DSL will get information about syntax and grammar errors)
  • Writing tests
  • Improving performance
  • Being a bit more idiomatic in how the parser generates the AST
  • Switching to recursive state functions for the Lexer and Parser

But for a pet project, this will suffice. There are also a bunch of lessons I've learned along the way while writing this transpiler:

  • Start with the lexical grammar and the AST definition, not with programming!
  • Transpilers are an excellent candidate for unit testing
  • When debugging the parser, make sure your lexer behaves! Don't just assume the lexer is producing correct tokens

Resources

Some useful resources if you want to learn more about compilers, transpilers, and parsers in general.
