awwx.ws

Creating a parser combinator library to parse JSON

Prev: JSON numbers, implemented | Contents | Next: Better error reporting

JSON strings




Here I run into an issue with how to present this material: if I have a JSON string like "\u0041" inside an Arc string literal, I need to escape the quotes and backslashes by prefixing each one with a backslash:

arc> (fromjson "\"\\u0041\"")

That’s pretty ugly, and it’s hard to see what the JSON string is, so I’m going to pretend that Arc has a literal string syntax like this:

arc> (fromjson «"\u0041"»)

where the stuff between the guillemets «...» becomes the contents of the Arc string as-is without any escaping. If you’re following along and want to type one of these examples into Arc, just change the guillemets into double quotes and prefix any double quotes or backslashes inside with a backslash.
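
That conversion is mechanical enough to sketch in code. Here's a rough version in Python (not Arc, just so it runs on its own). `to_arc_literal` is a made-up helper name, and the sketch assumes that double quotes and backslashes are the only characters needing an escape, as described above.

```python
def to_arc_literal(guillemet_text: str) -> str:
    """Turn «...» notation into an Arc string literal with escapes."""
    if not (guillemet_text.startswith("«") and guillemet_text.endswith("»")):
        raise ValueError("expected text wrapped in «...»")
    contents = guillemet_text[1:-1]
    # Escape backslashes first, so the backslashes added in front of
    # quotes are not themselves escaped a second time.
    escaped = contents.replace("\\", "\\\\").replace('"', '\\"')
    return '"' + escaped + '"'


# The Python literal below spells «"\u0041"»; the printed result is the
# escaped form used in the first fromjson example.
print(to_arc_literal('«"\\u0041"»'))   # prints "\"\\u0041\""
```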

OK, so I’ll need to be able to parse the four hexadecimal digits after a Unicode escape sequence \u and turn them into a character:

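; True if c is a character that is a hexadecimal digit.
(def hexdigit (c)
  (and (isa c 'char)
       (or (<= #\a c #\f) (<= #\A c #\F) (<= #\0 c #\9))))

; Match four hex digits and return the character with that code point.
(= fourhex
  (with-seq (h1 (match hexdigit)
             h2 (match hexdigit)
             h3 (match hexdigit)
             h4 (match hexdigit))
    (coerce (int (coerce (list h1 h2 h3 h4) 'string) 16) 'char)))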

Yup, with-seq turned out to be useful.
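
For comparison, here's roughly the same idea as a standalone Python sketch. It isn't the Arc library: it assumes a parser is a function taking the input text and a position and returning a (value, new position) pair, or None on failure. The digits are checked one at a time, as hexdigit does, because Python's int() on its own would also accept input such as a leading sign.

```python
HEX_DIGITS = set("0123456789abcdefABCDEF")


def fourhex(text, pos):
    """Parse exactly four hex digits starting at pos.

    Returns (character, new_pos) on success, or None if there are not
    four hex digits at that position.
    """
    digits = text[pos:pos + 4]
    if len(digits) != 4 or any(d not in HEX_DIGITS for d in digits):
        return None
    # Base-16 string -> integer code point -> character.
    return chr(int(digits, 16)), pos + 4


print(fourhex("0041rest", 0))   # ('A', 4)
print(fourhex("00g1", 0))       # None
```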

Let’s see, I’ll need to parse the other JSON backslash escape sequences:

(def json-backslash-char (c)
  (case c
    #\" #\"
    #\\ #\\
    #\/ #\/
    #\b #\backspace
    #\f #\page
    #\n #\newline
    #\r #\return
    #\t #\tab
    (err "invalid backslash char" c)))
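
The same table can be written as a dictionary. This is a Python sketch rather than Arc. The name `JSON_ESCAPES` and the cross-check against Python's standard json module are additions for illustration, not part of the library.

```python
import json

# Character after the backslash -> character it stands for.
JSON_ESCAPES = {
    '"': '"',
    "\\": "\\",
    "/": "/",
    "b": "\b",   # backspace
    "f": "\f",   # form feed (#\page in the Arc version)
    "n": "\n",
    "r": "\r",
    "t": "\t",
}


def json_backslash_char(c):
    """Return the character for a single-letter escape, or raise."""
    try:
        return JSON_ESCAPES[c]
    except KeyError:
        raise ValueError(f"invalid backslash char: {c!r}") from None


# Cross-check every entry against Python's own JSON decoder.
for letter, expected in JSON_ESCAPES.items():
    assert json.loads('"\\' + letter + '"') == expected
```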

A backslash escape sequence in a JSON string is a backslash followed by one or the other: either a \u Unicode escape or one of the single-character escapes:

(= json-backslash-escape
  (seq (match [is _ #\\])
       (alt (seq (match [is _ #\u])
                        fourhex)
            (fn (p)
              (return cdr.p (json-backslash-char car.p))))))

but oops, seq is giving me lists when all I want is just the character:

arc> (show-parse json-backslash-escape «\u0041»)
returning: (#\\ (#\u #\A)) remaining: 
nil

In both cases I want just the return value of the second parser in the sequence, so I’ll make a combinator to do that:

(def seq2 parsers
  (with-result results (apply seq parsers)
    (results 1)))
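
To make the difference between seq and seq2 concrete, here's a self-contained Python sketch. It assumes parsers are functions from (text, position) to a (value, new position) pair or None, which is a simplification and not how the Arc library represents them. `literal` is a hypothetical stand-in for a single-character matcher.

```python
def seq(*parsers):
    """Run parsers in order; succeed with the list of all their values."""
    def parse(text, pos):
        values = []
        for parser in parsers:
            result = parser(text, pos)
            if result is None:
                return None          # one failure fails the whole sequence
            value, pos = result
            values.append(value)
        return values, pos
    return parse


def seq2(*parsers):
    """Like seq, but keep only the second parser's value."""
    sequence = seq(*parsers)

    def parse(text, pos):
        result = sequence(text, pos)
        if result is None:
            return None
        values, new_pos = result
        return values[1], new_pos
    return parse


def literal(expected):
    """Hypothetical helper: match one specific character."""
    def parse(text, pos):
        if pos < len(text) and text[pos] == expected:
            return expected, pos + 1
        return None
    return parse


quoted_x = (literal('"'), literal("x"), literal('"'))
print(seq(*quoted_x)('"x"', 0))    # (['"', 'x', '"'], 3)
print(seq2(*quoted_x)('"x"', 0))   # ('x', 3)
```

Both versions consume the same input; seq2 only changes which value is handed back.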

I can also factor the repeated (match [is _ ...]) pattern out into a match-is helper:

(def match-is (x)
  (match [is x _]))

Now I have:

(= json-backslash-escape
  (seq2 (match-is #\\)
        (alt (seq2 (match-is #\u)
                   fourhex)
             (fn (p)
               (return cdr.p (json-backslash-char car.p))))))

That’s better:

arc> (show-parse json-backslash-escape «\u0041»)
returning: #\A remaining: 
nil
arc> (show-parse json-backslash-escape «\/»)
returning: #\/ remaining: 
nil
arc> (show-parse json-backslash-escape «\"»)
returning: #\" remaining: 
nil

Other characters in the string can be anything that isn’t a closing quote (backslashes need to be handled by the escape parser, so that has to be tried first):

(match [isnt _ #\"])

Now I have an implementation for json-string:

(= json-string
  (on-result string
    (seq2 (match-is #\")
          (many (alt json-backslash-escape
                     (match [isnt _ #\"])))
          (match-is #\"))))

arc> (show-parse json-string «"\u0041b\\c"»)
returning: "Ab\\c" remaining: 
nil
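
For comparison, here's a hand-rolled Python sketch of the same string parser, written as a plain loop instead of with combinators. The names and the (value, new position) return convention are assumptions for illustration. Note the order inside the loop: a backslash is checked before the "any other character" case, mirroring how json-backslash-escape comes first in the alt, so an escaped quote doesn't end the string early.

```python
import string

# Single-character escapes: character after the backslash -> its value.
ESCAPES = dict(zip('"\\/bfnrt', '"\\/\b\f\n\r\t'))


def json_string(text, pos=0):
    """Parse a JSON string at pos; return (value, new_pos) or None."""
    if text[pos:pos + 1] != '"':
        return None                              # no opening quote
    pos += 1
    chars = []
    while pos < len(text) and text[pos] != '"':
        if text[pos] == "\\":                    # try the escape first ...
            kind = text[pos + 1:pos + 2]
            digits = text[pos + 2:pos + 6]
            if (kind == "u" and len(digits) == 4
                    and all(d in string.hexdigits for d in digits)):
                chars.append(chr(int(digits, 16)))
                pos += 6
            elif kind in ESCAPES:
                chars.append(ESCAPES[kind])
                pos += 2
            else:
                raise ValueError(f"invalid backslash char: {kind!r}")
        else:                                    # ... then any other character
            chars.append(text[pos])
            pos += 1
    if pos >= len(text):
        return None                              # input ended before the closing quote
    return "".join(chars), pos + 1               # step past the closing quote


print(json_string(r'"\u0041b\\c"'))   # ('Ab\\c', 12)
```
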

With json-string in place, I can add it to the alternatives in json-value:

(= json-value
  (skipwhite:alt json-true
                 json-false
                 json-null
                 json-number
                 json-string))

arc> (fromjson «"greetings"»)
"greetings"



Questions? Comments? Email me andrew.wilcox [at] gmail.com