awwx.ws

The hundred year parser library

[note: I’m out of time for this weekend but I wanted to put this draft up even though it’s still rather rough. In particular, I need to do a better job of distinguishing between what the goals of a “hundred year parser library” might be and my current parse1 implementation, which I think has some good ideas but is naturally far from being a “hundred year parser” yet.]

Preliminary code is at parse1.arc, with some test utility functions at testing0.arc and tests at parse1-tests.arc. A rough draft of a Scheme reader is at scheme-parser0.arc, with tests at scheme-parser0-tests.arc. An example JSON parser is at json-parser0.arc.

Features and anti-features

Parsers written in Arc

Other parser libraries have their own language for writing your parser in, different from the main programming language that you use for the rest of your program. For example, to use a regular expression library, you write regexps like “[A-Za-z]\w*”.

With this library, you can write your parsers in Arc. Here’s a parser to match an identifier:

(def identifier ()
  (match letter)
  (many (match alphadig)))

which says that an identifier is a letter followed by zero or more letters or numbers. Here match and many come from the parser library, while def, letter, and alphadig are standard Arc.

We can have our parser return a value; for instance, we might want to return the matched identifier as a symbol.

(def identifier ()
  (sym (string (cons (match letter)
                     (many (match alphadig))))))

Let’s try it.

arc> (parse "ag445,blue,x88776" (identifier))
ag445
arc> (parse "445" (identifier))
nil

You can see that, by default, parse will return nil if a parse fails.

You’ll notice that we somewhat laboriously constructed the return value from the subparsers’ return values using cons. If all we care about is what part of the input a parser matched, not what the parsers returned, we can use matched-input:

(def identifier ()
  (sym (string (matched-input
                (match letter)
                (many (match alphadig))))))

which in this case turned out not to be any shorter, but can be useful when the parsers are more complicated.

Because parsers are ordinary Arc functions, we can build complicated parsers out of simpler ones:

(def identifiers ()
  (comma-separated (identifier)))

arc> (parse "ag445,blue,x88776" (identifiers))
(ag445 blue x88776)

And, we have all the facilities of Arc available to us to write our parser. For example, if we need to match n of something, we can use Arc’s n-of:

(def four-hex-digits ()
  (n-of 4 (match hexdigit)))
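
For example, a session along these lines should work (hypothetical output, assuming as above that match returns the character it matched):

arc> (parse "a3f9" (four-hex-digits))
(#\a #\3 #\f #\9)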

Flexible handling of parse failure

A parse failure is like throwing an error: we bail out of the currently running code until we are caught by a parse failure handler.

If nothing else handles it, then as we saw above parse will catch the parse failure and return nil.

There are a bunch of facilities to handle parse failures in different ways:

optional catches a parse failure and returns nil instead. This makes a parser optional, because processing doesn’t stop if the parser doesn’t match.
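
For example, here’s a hypothetical sketch (not from the library) of a variable name with an optional leading dollar sign:

(def variable ()
  (optional (match #\$))
  (identifier))

If there’s no dollar sign, the optional simply returns nil and parsing moves on to the identifier.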

must turns a parse failure into an error. For example, if we know that a comma has to be followed by a value, then after we’ve seen the comma we could say,

(must "a comma must be followed by a value"
      (value))

That way if we don’t see a value at that point, we get an error instead of the parse simply failing.

many tries a parser over and over again until it fails. It then returns a list of the return values of the successful parses.
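
For example, a session along these lines (hypothetical output):

arc> (parse "aaab" (many (match #\a)))
(#\a #\a #\a)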

alt tries a list of parsers one at a time until one of them matches. If none of them match, then the alt also fails.
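
For example, a sign could be either a plus or a minus (a hypothetical sketch):

(def sign ()
  (alt (match #\+)
       (match #\-)))

If the next character is neither, the sign parser as a whole fails.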

at looks ahead in the input and succeeds or fails as its parser does, but doesn’t change the parse position. In stream I/O this is called “peeking”, while in regular expressions this is called a “zero width assertion”.

As an example of when at is useful, consider how in a list a period by itself means that we have a dotted pair like (a . 123). However, a period followed immediately by other characters is going to be part of a number or a symbol as in (a .123), which is the list of “a” and the number “.123”. So to find out if we’ve got a period by itself, we want to know if it’s followed by a character that will terminate a symbol:

(match #\.)
(match terminator)

but we don’t want to consume the terminator, because if it’s a closing parenthesis like in “(a.)” that needs to be reported as an error. The solution is to use at:

(match #\.)
(at (match terminator))

Now we know that we have a period by itself, but the value or the delimiter following the period is still available to be parsed.

not inverts the success/failure of a parse, so that a successful parse becomes a parse failure and a parse failure returns t.
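
For example, a session along these lines (hypothetical output) checks that the input doesn’t start with a letter:

arc> (parse "445" (not (match letter)))
t
arc> (parse "abc" (not (match letter)))
nil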

Parsers easily hacked and extended

Again, since parsers are written in plain Arc, they can be easily hacked and played with like any Arc code. Want to try out a different definition of an identifier? Just type “(def identifier () ...)”.

A natural extension point for many parsers is at an alt. For example, a parser for some programming language might say that a value can be a number, string, symbol, etc.:

(def value ()
  (alt (match-number)
       (match-string)
       (match-symbol)
       ...
       ))

Let’s say you come along and want to add your own kind of value to the programming language. alt-extend will extend an existing definition, causing it to first try your parser:

(alt-extend value ()
  (match #\[)
  ...
  (match #\]))

Now value will first try your code, which starts off by looking for an opening square bracket. If your parser fails (there’s no square bracket there), then the original parser will be tried.

The definition that you’re extending doesn’t have to be an alt itself. You can extend any definition with alt-extend, causing your parser to be tried first.
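
For instance, a complete extension that adds a square-bracket list of values might look something like this (a hypothetical sketch which ignores whitespace):

(alt-extend value ()
  (match #\[)
  (do1 (many (value))
       (must "missing closing ] in a bracket list"
             (match #\]))))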

Parse lists of anything

The parser library doesn’t restrict you to parsing characters. You can feed it any list, and all the functions like match will continue to work.

(In fact, in the current implementation when you parse a string or an input stream it’s turned into a list of characters, though a more sophisticated implementation wouldn’t need to do that).

For example, suppose you wanted to find definitions in which the argument “list” is a simple symbol, that is, definitions like:

(def join args
  (if (no args)
    ...

ok, that could be done with something like

(match 'def)
(next)
(match atom)
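
Wrapped up as a parser, and assuming that next consumes and returns the next item of the input, something like this might work (hypothetical code and output; here the input is a quoted list rather than a string):

(def def-with-simple-arg-list ()
  (match 'def)        ; the def itself
  (next)              ; skip the name being defined
  (match atom))       ; the argument list is a bare symbol

arc> (parse '(def join args (if (no args) nil)) (def-with-simple-arg-list))
args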

Incremental parsing

An incremental parser reads enough of the input to parse one “item” (whatever that is) and returns it, without having to read or wait for the entire input. For example, Arc’s read will read enough of the input to parse one Arc value. readline is another example (of a rather simple parser); it reads just until it sees a newline.

Incremental parsing is useful when we want to be able to do something before we’ve seen all the input (for example, in a web server we may want to close the connection based on something we see in the headers, without waiting for the entire body to arrive). It can also make an application more responsive (the total time to do an operation may be the same, but it may feel faster if a user can see the initial results instead of having to wait for the whole thing), and it can save memory if the entire input doesn’t have to be read in at once.

This version of the parser library doesn’t do incremental parsing yet. (You can feed it an input stream if you want to, but the current version rather dumbly reads the entire input and turns it into a list of characters). One possibility is that lazy lists might fit in rather naturally with the current code, so that characters would be read only when they were needed. Alternatively alt and friends could become more complicated, reading ahead in the input when they needed to, and keeping a buffer for backtracking.

No magic

One of the things I like about Arc is that it does what I tell it to. (In fact, the few times that I get annoyed with Arc are when it’s doing something for me that I don’t want).

So this is a parser library that does things because you told it to do them. If it’s looking ahead to the next character, it’s because you told it to look ahead. If it’s backtracking, it’s because you told it to look for alternatives. If it’s sunk into an exponentially exploding backtracking death spiral, it’s because... you get the idea :)

Other parsers have the advantage that you can tell them what you want, and they’ll figure out how to do it. For example, in a regular expression parser, you can say “(a+)(ax)”, which means “match one or more a’s followed by an ax”. In this library you’d write the parser as,

(do (many1 (match #\a))
    (mliteral "ax"))

and it’s not going to work. Feed it a string like “aaaaax”, and the many1 is going to eat up all the a’s, and then “(mliteral "ax")” isn’t going to match because there aren’t any a’s left. A regular expression parser will automatically backtrack, letting the a+ match only “aaaa” and leaving “ax” for the “(ax)” to match.

“No magic” is going to be seen as an anti-feature by some people. “What? It doesn’t do optimizations? None at all?” And I wouldn’t call them wrong. If you want a parser that does X (optimization, automatic backtracking, do what I mean, etc.) then you’ll want a parser that does X.

Implementation as specification

When code can be written concisely enough, it can serve as its own specification, as Arc itself does. For example, here’s a rough-draft implementation (whitespace isn’t handled yet) of a parser for a JSON array:

(def json-array ()
  (match #\[)
  (do1 (comma-separated (json-value)
         "a comma in a JSON array must be followed by an object")
       (must "missing closing ] in JSON array"
             (match #\]))))

That’s getting pretty darn close to the point where, if I want to know exactly what a JSON array can be, I can go look at the code itself.

I suspect that “no magic” may turn out to be important for implementation as specification. If I write a parser using an optimizing library and it doesn’t work because of exponential backtracking, is that because I wrote the parser wrong, or because the optimizer wasn’t able to figure out how to optimize backtracking in this particular instance? And if the behavior of my parser changes depending on which optimizer I use, can I really claim that my implementation can serve as a specification?

Slow

This parser library is slow. My comments on “no magic” optimizations don’t mean that it couldn’t be faster, however. Just as Arc has declarations and specialized functions such as map1, we can add optimizations that are under the control of the programmer to use when needed and desired.

Prerequisites

This hack depends on arc3.1, scheme0, implicit2, xloop0, ret0, and fn-arg-converter0.

License

Same as Arc.

Contact me

Twitter: awwx
Email: andrew.wilcox [at] gmail.com