awwx.ws

Creating a parser combinator library to parse JSON

ContentsNext: Choosing a combinator interface

Introduction

Here I describe how I learned how to program with combinators, a powerful technique for writing succinct code, and show how I used them to implement a parser combinator library and a JSON parsing library in Arc.

Why succinct code?

Succinct code is:

Why Arc?

Arc is a new language being designed by Paul Graham and Robert Morris, having the goal of creating a new more powerful language, with “power” meaning being able to implement more features with the same amount of code (or, conversely, implement the same features in less code), to be able to write code more succinctly.

What I’ll be showing you has two parts: the parser combinator library, which can be used to write a number of different parsers, and the JSON parser, which uses the parser combinator library to parse JSON. The parser combinator library is an example of what they call a “domain specific language” (DSL), the idea being that if you need to write a parser, it’s easier to do it in a specialized “parser language” than it would be to have to do it all yourself in a general-purpose langauge.

So if Arc really is more powerful, in the sense of enabling us to write more succinct code, then presumably we’d be able to write the parser combinator part more succinctly. But would using Arc as our base language do anything to help us with the JSON parser, which will be written in the “parser combinator language”?

Yes. Whenever we have a task we need to do (parsing JSON) and a library to help us out (the parser combinator library), the library will do some things for us and other things we’ll need to do ourselves. Some DSL implementations keep the base language and the DSL really separate: for example, writing your code in the DSL language might involve putting it in a separate source code file. But then it becomes awkward and difficult, more work, to write the parts that we need to in the base languages, sometimes to the point where using the DSL instead writing the whole thing in the base language is no longer a net win.

At the other end of the spectrum the DSL extends the base language, so that you can write code in both the DSL and the base lagnauge at the same time. This is more powerful, so why would anyone design their DSL to be separate from the base language? Sometimes the the DSL is being designed for less experienced programmers who don’t know how to program in or aren’t comfortable with the base language, sometimes the base language isn’t very easily extensible. In our case we’re interested in maximizing power for the experienced programmer, so the test for Arc becomes whether it is easy and natural to shift between writing parts of the JSON parser in the parser combinator library and in Arc.

The JSON library not only parses a JSON expression (looks at the JSON input and figures out whether it is JSON literal, object, array, string, or number), but also converts it into an Arc value: an Arc number, string, list, table, etc. We’ll see that it turns out to be pleasantly easy to combine the parsing task with the generating Arc values task, including transforming a recursive JSON data structure into a recursive Arc one.

What is a combinator?

A “combinator” is a function which takes one or more other functions as arguments and returns a new function. For a “parser combinator”, the functions are parsers that take some input and return the parsed value of that input, and by using combinators more complex parsers can be built up out of simpler ones.

For example, the simple alphanum parser parses a single letter or digit character; the many1 parser combinator takes a parser and returns a parser that matches one or more of what the input parser parses; and so (many1 alphanum) is a parser that will match one or more letters or digits.

The power of the combinator comes from that it will work on any input parser: if I had a parser word that matched a word in the input, then (many1 word) would match one or more words.

Combinators are one tool in the abstraction toolbox that allow more succinct code to be written. If I didn’t have a many1 combinator, then each time I needed to match one or more of something I’d need to manually write a function which called the “something” parser repeatedly for as long as it matched the input.

Here I use combinators for parsing, but the technique can be used anywhere that we are combining functions in similar patterns, making it faster and easier to build programs that handle large and complicated tasks without having to make all the connections manually.

Why describe how I learned how to create combinators?

This exposition is longer than if I simply documented my parser combinator library and described how to use it. I describe the steps I went through in creating the library: for example, I describe how I started with one version of optional and changed it later, how I didn’t know how to implement many and so how I snuck up on it by first implementing something simpler.

While having a JSON parser and parser combinator library available for Arc is intrinsicly useful, the real power comes from learning how to use how to use the combinator technique in your own programs.

About JSON

JSON is a “lightweight data-interchange format, easy for humans to read and write, and easy for machines to parse and generate.” It is widely supported with 110 JSON libraries available for 39 languages. It is also a subset of JavaScript (a JSON expression can be used without change as a JavaScript literal), so it is especially popular in web applications.

JSON is a good choice for this kind of explanation because having a JSON library available for Arc is useful (it can be used for AJAX-style programming of web applications in Arc), but JSON is also pretty simple so this description doesn’t go on forever. This would be a lot longer if I were describing how to parse XML, but without adding much more useful information on the combinator technique.


ContentsNext: Choosing a combinator interface


Questions? Comments? Email me andrew.wilcox [at] gmail.com