Why You Should Use ArchieML for Your Blog

It’s easy to go from structured data to unstructured data, but trying to go the other way is a losing, uphill battle.

photograph
Feb 2022

If you’ve spent any time with static site generators such as Hugo, Jekyll, or Gatsby, you’ve probably used markdown or mdx. The benefits of using markdown are immediately apparent from the very minimal markup required to add titles, links, bullet points, and various other text formatting options in a plain-text file. Additionally, many implementations of markdown such as GitHub flavored markdown allows HTML to be directly embedded into markdown, giving additional flexibility when the primitives of markdown don’t allow for enough versatility.

However, using markdown files for blog posts reveals a problem. Markdown, like HTML, is unstructured. (Technically HTML is considered “semi-structured”, but that distinction is irrelevant to the discussion in this post) In more tangible terms, there’s no way to explicitly provide SEO tags such as a title or a feature image or any other kind of structured key-values into a blog post with plain markdown. That’s why mdx was created, with a section at the top called frontmatter which allows structured data with a YAML syntax, try it below.

With frontmatter, there are two distinct sections of markdown files: a structured YAML section and a separate section for unstructured data written in markdown or HTML down below. For simple content structures and blogs, this is sufficient. For more complex visual components interleaved inside of content, however, using mdx will create a mess of HTML and CSS abstractions leaking out all over the markdown files, making it brittle, hard to understand, and a pain to update. Even using custom components the way that Gatsby defines in their documentation, stuffing data into HTML/JSX style syntax is possible, but cumbersome and not ideal.

In my career as a graphics journalist, I’ve seen this problem come up over and over again. Making interactive articles in collaboration with a team of journalists, editors, and designers, the complexities of our articles and interactives coupled with the workflow we use to ensure that everyone can work effectively in parallel means that using purely markdown to manage content is untenable.

Instead, we use ArchieML, a structured data markup language that has been widely adopted by newsrooms that make these kinds of interactive articles, but lesser known to the broader ecosystem of bloggers and web developers. After using ArchieML for the past three years at work, I have found that the ways that it can be used and adapted are far more suitable than mere markdown, and I’ve completely rebuilt this site’s static site generator to use ArchieML files. In fact, the blog you are reading uses ArchieML and markdown together, because with a few important exceptions, ArchieML and markdown do not conflict with each other.

ArchieML

ArchieML is named after Archie Tse, the current graphics director at The New York Times. ArchieML’s syntax is, like markdown’s, very succinct and unobtrusive, but also provides advantages over YAML or JSON when it comes to structured data. From the ArchieML documentation:

  • Whitespace is not significant to the document structure In YAML, lines must be indented precisely and variably; the wrong number of spaces to the left of a key invalidates the document, and tabs can't be used. AML ignores all whitespace not within a value. We believe this makes it easier for non-programmers to use, and is essential for use in environments with non-monospaced fonts, like in Google Documents.
  • Unstructured text is ignored; there is no such thing as a parsing error AML was designed so that writers could work in a freeform environment. They should be able to add entire paragraphs as scratch work that do not appear in the output. JSON and YAML have strict schemas that forbid text deviating from a pattern. AML doesn't assume text follows any pattern. If it finds text that looks like data, it treats it as data. Otherwise, it moves on.
  • The notation makes sense to non-programmers Lists of values are noted with bullet points / asterisks, not hyphens or quoted strings that must be separated with commas. An overriding goal was to have an intuitive format that could be passed to a non-technical user — a reporter, an assigning editor or a copy editor — to edit, and to have the format be clear enough that they could make changes without breaking the parsing of the document. If we were using another format, we'd have to explain indentation rules in YAML, or how to match curly braces or properly escape quotation marks in JSON, and so forth.

From ArchieML’s documentation page, The New York Times uses ArchieML in conjunction with Google Docs to allow editors to easily update and collaboratively work on interactive graphics, but the ArchieML specification can be used anywhere, not just on Google Docs.

This is what ArchieML looks like in practice, try it below:

Keep Content Structured

One thing that isn't talked about enough in web development is that it’s easy to go from structured data to unstructured data, but trying to go the other way is a losing, uphill battle. Most content management systems for blogs, websites, or other personal pages work by having a backend with a WYSIWYG editor that generates some kind of HTML that then gets stuffed in between some kind of content wrapper inside the final HTML of the web page. This means that the final webpage’s HTML is generated in two separate places: the editor generates HTML that is wrapped by HTML generated by the theme, which blindly applies CSS to the HTML generated from the editor in hopes that everything works properly.

In this case, adding custom elements becomes bespoke and unwieldy as custom CSS either gets mixed into the editor’s HTML or custom overrides and HTML parsing hacks get written into the frontend logic. In either case, the frontend and the content cease to operate independently of each other, causing future migrations or site redesigns to become a nightmare.

A much better approach is to structure content into blocks, giving each block of content its own type and allowing the front-end to have more context on how that block should be rendered. This means that instead of the editor spitting out a block of spaghetti HTML that simply gets blindly embedded into some spot on the page, the editor spits out an array of content that contains blocks for the front-end to iterate over. This allows the editor to generally be more free of display logic specifics (sometimes it can't be avoided), handing over all of that responsibility to the frontend where it belongs.

While there are open source WYSIWYG editors such as draft.js or prosemirror that allow users to create content in a structured way, neither markdown nor mdx allow for this kind of flexibility in a plain-text format. That’s where ArchieML fits in.

ArchieML + Markdown = 😍

As I alluded to above, ArchieML can be used in conjunction with markdown for some extremely powerful results. The trick here is to parse a document in ArchieML first by putting all the content inside a freeform array, and then parse individual blocks as markdown in order to get the full flexibility of structured and minimalistic text formatting. In fact, all the posts on this blog are structured in this way, you can see the file for this post here.

The caveat is that because ArchieML has parsing logic around newlines, multi-line markdown tags such as lists or tables will not be parsed properly if used directly. Thus, these will have to be wrapped and escaped in a way that ArchieML will ignore the line breaks.

The result is a plain-text markup and formatting specification that is far more flexible than mdx, and allows content and display logic to be defined and maintained in a much more manageable way. There’s a reason the New York Times created ArchieML for graphics, and the wider internet community can take advantage of it in a similar way.