awwx.ws
The latest version of this hack is hundred-year-parser

parse0

a rough first draft of a new parser

http://awwx.ws/parse0.arc:

(implicit parsepos*)
(implicit parsefail*)

(def parsefail ()
  (parsefail*))

; failfn must be called *outside* the w/parsefail* scope,
; thus the code to pass the normal/fail result of calling f
; out through the w/parsefail*

(def fonparsefail (failfn f)
  (with (return-value nil failed nil)
    (catch
     (w/parsefail* (fn () (set failed) (throw nil))
       (= return-value (f))))
    (if failed (failfn) return-value)))

(mac on-parsefail (onfail . body)
  `(fonparsefail (fn () ,onfail) (fn () ,@body)))

(def fparse (seq body)
  (on-parsefail nil
    (w/parsepos* (if (isa seq 'string) (coerce seq 'cons) seq)
      (body))))

(mac parse (s . body)
  `(fparse ,s (fn () ,@body)))

(def rest ()
  parsepos*)

(def at-end ()
  (no parsepos*))

(def fail-at-end ()
  (if (at-end) (parsefail)))

(def match (item)
  (fail-at-end)
  (let v car.parsepos*
    (if ((testify item) v)
         (do (= parsepos* cdr.parsepos*) v)
         (parsefail))))

(def literal (pat (o value pat))
  (if (begins parsepos* (if (isa pat 'string) (coerce pat 'cons) pat))
       (do (= parsepos* (nthcdr len.pat parsepos*))
           value)
       (parsefail)))

(def falt2 (f1 f2)
  (let mark parsepos*
    (on-parsefail
     (do (= parsepos* mark)
         (f2))
     (f1))))

(mac alt2 (p1 p2)
  `(falt2 (fn () ,p1) (fn () ,p2)))

(mac alt ps
  (if (no ps)
       `(parsefail)
      (no (cdr ps))
       (car ps)
       `(alt2 ,(car ps) (alt ,@(cdr ps)))))

(mac optional body
  `(alt (do ,@body) nil))

(def fmany (fp)
  (accum a
    (on-parsefail nil        
      (xloop ()
        (a (fp))
        (next)))))

(mac many body
  `(fmany (fn () ,@body)))

(mac many1 parser
  `(cons (do ,@parser) (many (do ,@parser))))

(mac must (errmsg . body)
  `(on-parsefail (err ,errmsg) (do ,@body)))

(def munch (f)
  (fail-at-end)
  (let v car.parsepos*
    (= parsepos* cdr.parsepos*)
    (f v)))

(mac parse-intersperse (separator parser must-message)
  `(optional
    (cons ,parser
          (many ,separator
                (must ,must-message ,parser)))))

(def skipwhite ()
  (many (match whitec)))

(mac comma-separated (must-message parser)
  `(parse-intersperse
    (do (skipwhite) (match #\,))
    ,parser
    ,must-message))

I wrote my parser combinator library in conjunction with writing my JSON parser and, as it turns out, it supports writing JSON parsers really nicely, but is painful and awkward for parsing other kinds of things. :-/

One problem is that because parsers in the parser combinator approach return both the result value of the (sub)parse and the next parse position, to use them we have to contantly be extracting the return value to stick someone and feeding the parse position into the next parser. This isn’t so bad in the JSON parser because all the parsing is recursive (JSON objects containing JSON values which can be JSON arrays and JSON object which contain JSON values...), but is a pain we want to say, “parse this, then that, then this”.

I was inspired by Conrad Barski’s regex parser. I’m not using any of his code, but when I saw how in his parser we could parse things in sequence just by using Arc’s regular control flow, I said, “gosh, why couldn’t my parser do that?” :-)

Conrad Barski’s parser, like many regex parsers, does backtracking automatically. I wanted a parser that like parsec of only backtracks when explicitly told to using alt. (There are pros and cons to both approaches, but for the kinds of parsing I’m doing at the moment, I prefer non-automatic backtracking).

I noticed that parsers written to read from an input stream can also use Arc’s regular control flow to parse things in sequence, because the “parse position” is maintained implicitly inside of the input stream (it keeps track of what is the next character to be read for you). For example if we wanted to grab the first three lines of the input and put them in separate variables, it’s simply

(with (one (readline)
       two (readline)
       three (readline))
  ...)

Reading from an input stream doesn’t let us try alternatives (more than one character deep anyway), so the kinds of parsers we can write are limited, but in cases where we can parse a character at a time it’s great.

So in my new approach I’m trying out keeping the parsing state, the parse position and what to do when a (sub)parse fails, in implicit variables. (I could also put both the parse position and the parse fail action in one implicit parse state variable, but my code seems to be cleaner using two varaibles).

So far this looks like a big win. With the parse position handled implicitly, now we can use the regular Arc control flow to parse things one, two, three in sequence.

Look at the old parser combinator code for parsing a JSON string:

(= json-string
  (on-result string
    (seq2 (match-is #\")
          (must "missing closing quote in JSON string"
                (seq1 (many (alt json-backslash-escape
                                 (match [isnt _ #\"])))
                      (match-is #\"))))))

the on-result, seq2 and seq1 are all there to deal with threading the parse state through the execution: to do something with the return value (in this case turn the list of characters into a string) and to parse things in sequence. Compare with the new code:

(def json-string ()
  (match #\")
  (must "missing closing quote in JSON string"
    (do1 (string (many (alt (json-backslash-escape)
                            (match [isnt _ #\"]))))
         (match #\"))))

Now to parse things in sequence we can do that like we do anything in sequence in Arc, just by doing them one, two, three in the body of a function or in a do, and if we want to do something with the return value of parser like call string on it now it’s just a regular function call.

And, because we can now use Arc’s normal control flow, we can also use macros like n-of. Here’s the old code to parse four hexadecimal chars:

(= fourhex
  (must "four hex digits required after \\u"  
    (with-seq (h1 (match hexdigit)
               h2 (match hexdigit)
               h3 (match hexdigit)
               h4 (match hexdigit))
      (coerce (int (coerce (list h1 h2 h3 h4) 'string) 16) 'char))))

and the new:

(def fourhex ()
  (must
   "four hex digits required after \\u"  
   (coerce (int (string (n-of 4 (match hexdigit))) 16) 'char)))

Plus (it also slices and dices!) defining new (sub)parsers can be done with a regular Arc def. The old definition to parse a JSON “false”:

(= json-false (match-literal "false" nil))

Suppose you wanted to be able to distinguish between an empty JSON array (which also parses as nil) and a JSON false. Ideally, after the library had been loaded, you could just type in

(= json-false (match-literal "false" 'false))

choosing whatever value you wanted “false” to parse to. But, when the JSON parser library was loaded, the original json-false value was incorporated into the parsers that used it. Changing it now doesn’t do anything. You’d have to change it and then reload the rest of the library. In the new code:

(def json-false ()
  (literal "false" nil))

you can redefine json-false and it will take effect immediately, because it’s being called as a function.

The parser works just as well on lists, so we’re not restricted to just parsing strings.

Consider the function span which takes an input list and a test function and returns two lists: the first list is the longest initial prefix of the input list where each element of the list meets the test, and the second list returned is the rest of the input list.

arc> (span odd '(1 3 5 2 4))
((1 3 5) (2 4))

Here’s my original code to implement span:

(def span (tst lst)
  ((afn (a lst)
     (if (and lst (tst (car lst)))
          (self (cons (car lst) a) (cdr lst))
          (list (rev a) lst)))
   nil lst))

and using the new parser code:

(def span (f seq)
  (parse seq
    (list (many (match f))
          (rest))))

Here’s some code to parse a URL:

(def snarf (f)
  (string (many1 (match f))))

(def upto stops
  (snarf [~some _ stops]))

(def parse-url (url)
  (parse url
   (obj scheme (do1 (sym (snarf letter)) (mliteral "://"))
        user   (optional (do1 (upto #\@ #\: #\/) (match #\@)))
        host   (upto #\/ #\:)
        port   (optional (match #\:) (int (snarf digit)))
        path   (upto #\? #\#)
        query  (optional (match #\?) (upto #\#))
        frag   (optional (match #\#) (string (rest))))))

Prerequisites

This hack depends on arc3.1, implicit2, and xloop0.

License

Same as Arc.

Contact me

Twitter: awwx
Email: andrew.wilcox [at] gmail.com