Extended PEG Syntax

PhilippeSigaud edited this page Apr 9, 2012 · 48 revisions

As we saw in PEG Basics and Declaring a Grammar, Pegged implements the entire PEG syntax, exactly as it was defined by its author.

Now, I felt the need to extend this a little bit. Semantic actions were not implemented in Pegged at the time; now that they are, these extensions are not strictly necessary, but they remain useful shortcuts.

Dropping a Node

The first extensions act on the result of a parsing expression. Given a parsing expression 'e':

  • :e will drop e's captures. And, due to the way sequences are implemented in Pegged, the parent expression will forget e's result (that's deliberate). It allows one to write:
mixin(grammar("
    JSON   <- :'{' (Pair (:',' Pair)*)? :'}'
    Pair   <- String :':' Value

    # Rest of JSON grammar ...

"));

On the first rule, see the colon before the curly brace literals and the comma. That means that when called on {"Hello":42, "World!":0}, the JSON parse tree will contain only the interesting parts, not the syntactic signs needed to structure the JSON grammar:

ParseTree("JSON",
    ParseTree("Pair",
        ParseTree("String", ...),
        ParseTree("Number", ...)),
    ParseTree("Pair",
        ParseTree("String", ...),
        ParseTree("Number", ...))
)
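To picture what dropping does to a sequence's captures, here is a minimal sketch in plain D. It does not use Pegged at all: `keep`, `drop` and `sequence` are hypothetical helpers written for this illustration, and they only model the behaviour described above (a sequence collects the captures of its parts, and a dropped part contributes nothing).

```d
import std.array : join;

/// Captures of a kept sub-expression: its matched text.
/// (Hypothetical helper, not part of Pegged.)
string[] keep(string matched)
{
    return [matched];
}

/// Captures of a dropped sub-expression (the ':' prefix): nothing at all.
/// (Hypothetical helper, not part of Pegged.)
string[] drop(string matched)
{
    return null; // an empty capture list
}

/// A sequence concatenates the capture lists of its parts, in order.
string[] sequence(string[][] parts)
{
    return parts.join();
}

unittest
{
    // Models: Pair <- String :':' Value
    auto pair = sequence([keep(`"Hello"`), drop(":"), keep("42")]);
    assert(pair == [`"Hello"`, "42"]);

    // Without the colon operator, the ':' sign would stay in the result.
    auto noisy = sequence([keep(`"Hello"`), keep(":"), keep("42")]);
    assert(noisy == [`"Hello"`, ":", "42"]);
}
```

Compile with `dmd -unittest -main` to run the checks.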

Fusing Captures

The ~ (tilde) operator concatenates an expression's captures into one string. It was chosen for its similarity to D's own concatenation operator, ~. It's useful when an expression would otherwise return a long list of individual captures, whereas you're interested only in the global result:

mixin(grammar("
    # See the ':' before DoubleQuote
    # And the '~' before (Char*)
    String <- :DoubleQuote ~(Char*) :DoubleQuote
    Char <- !DoubleQuote . # Anything but a double quote
"));

Without the tilde operator, using String on a string would return a list of Char results. With tilde, you get the string content, which is most probably what you want:

auto p = String.parse(q{"Hello World!"});
assert(p.capture == ["Hello World!"]);
// without tilde: p.capture == ["H", "e", "l", "l", "o", " ", "W", ...]

The same goes for number-recognizers:

Number <- ~(Digit+)
Digit <- [0-9]
auto n = Number.parse("1234");
assert(n.capture == ["1234"]);
// without tilde: n.capture == ["1", "2", "3", "4"]

Internally, it's used by Identifier and QualifiedIdentifier.
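As a rough picture of what fusing amounts to, the following plain-D sketch joins a list of captures into a single one. `fuse` is a hypothetical helper written for this page, not a Pegged function; it only mirrors the before/after captures shown in the comments above.

```d
import std.array : join;

/// Model of the '~' operator: many captures in, one concatenated capture out.
/// (Hypothetical helper, not part of Pegged.)
string[] fuse(string[] captures)
{
    return [captures.join()];
}

unittest
{
    // What Digit+ would yield on "1234" without the tilde...
    string[] digits = ["1", "2", "3", "4"];
    // ...and what ~(Digit+) yields.
    assert(fuse(digits) == ["1234"]);

    // The D operator that inspired the notation: '~' concatenates arrays.
    assert("12" ~ "34" == "1234");
}
```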

Named Captures

The =name (equal) operator is used to name a particular capture. It's described in Named Captures, but here is the idea:

Email <- QualifiedIdentifier=name :'@' QualifiedIdentifier=domain
enum p = Email.parse("John.Doe@example.org");
assert(p.namedCaptures["name"] == "John.Doe");
assert(p.namedCaptures["domain"] == "example.org");

Semantic Actions

Semantic actions are enclosed in curly braces and placed after the expression they act upon:

XMLNode <- OpeningTag {OpeningAction} (Text / Node)* ClosingTag {ClosingAction}

You can use any delegate from Output to Output as a semantic action. See Semantic Actions.

Range of Chars Extension

The characters - (dash), [ (opening square bracket) and ] (closing square bracket) have a special meaning in char ranges (the [a-z] syntax). In Pegged they can be escaped with \ to represent themselves. As usual, a literal \ is then written \\. Use them like this:

[-+]     # '-' in first position is OK. Already possible in PEG. 
[+\-]    # Escape '-'
[-\[\]]  # Escape '-', '[' and ']'
[\\\-]   # Matches `\` and `-`
[\r\n\t] # Standard escapes are there, too.

Also, the char semantics were extended to deal with Unicode characters (and not only UTF-8). Basic PEG allows a char to be represented as an ASCII octal sequence: `\040`. **Pegged** also recognizes:

*  `\x41`: hexadecimal UTF-8 chars.

*  `\u0041`: hexadecimal UTF-16 chars.

*  `\U00000041`: hexadecimal UTF-32 chars.

I added them to deal with the W3C XML spec.
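These three notations look just like the escapes D itself accepts in string literals. The short sketch below checks, in plain D, that they all denote the same character there. It says nothing about how Pegged decodes them internally; that Pegged follows the same conventions as D is an assumption here. `escapesAgree` is a hypothetical helper.

```d
/// Returns true if D's octal, \x, \u and \U escapes agree on the tested characters.
/// (Hypothetical helper; it exercises the D compiler, not Pegged.)
bool escapesAgree()
{
    return "\x41" == "A"         // two hex digits
        && "\u0041" == "A"       // four hex digits
        && "\U00000041" == "A"   // eight hex digits
        && "\040" == " ";        // octal, as in basic PEG: 040 is a space
}

unittest
{
    assert(escapesAgree());
}
```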

Other Extensions
----------------

**Pegged** has other extensions, such as `@` or `^`, but these are in flux right now and I'll wait for the design to stabilize before documenting them.

Rule-Level Extensions
---------------------

All the previously described extensions act upon expressions. When you want an operator to act upon an entire rule, you can enclose the rule's expression in parentheses:

Rule <- ~(complicated expression I want to fuse)


This need is common enough for **Pegged** to provide a shortcut: put the operator in the arrow:

`<~` (squiggly arrow) concatenates the captures on the right-hand side of the arrow.

`<:` (colon arrow) drops the entire rule result (useful to ignore comments, for example)

`<{Action}` associates an action with a rule.

For example:

    Number <~ Digit+
    Digit  <- [0-9]

    # Nested comments
    Comment <: "/*" (Comment / Text)* "*/"
    Text    <~ (!("/*" / "*/") .)* # Anything but begin/end markers


That makes the `Number` expression a bit more readable (if you use a font that clearly distinguishes between ~ and -, which GitHub does not really do...)

Space Arrow
-----------

There is another kind of rule-level extension, the 'space arrow', which just uses `< ` as an arrow. Instead of treating the parsing expression as a standard non-space-consuming PEG sequence, it consumes spaces between elements.

So, given:

    Rule1 <- A B C
    Rule2 <  A B C

    A <- 'a'
    B <- 'b'
    C <- 'c'


`Rule1` will parse `"abc"` but not `"a   b c"`, whereas `Rule2` will parse both inputs (and will output a parse tree holding only an `A` node, a `B` node and a `C` node, with no space node).

The space consumption is done by the predefined `Spacing` parser (see [[Predefined Parsers]]), which munches blank chars, tabs and the like, as well as line terminators.
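The difference between the two kinds of sequence can be sketched in a few lines of plain D. This is only a hand-written model, not how **Pegged** implements it: `matchSeq` is a hypothetical helper, `std.ascii.isWhite` stands in for the `Spacing` parser, and blanks are skipped only between elements, as described above.

```d
import std.ascii : isWhite;

/// Matches the literals in `parts`, in order, at the start of `input`.
/// With `skipSpaces`, blanks between two elements are consumed first
/// (the 'space arrow' behaviour); without it, the elements must be adjacent.
/// (Hypothetical helper, not part of Pegged.)
bool matchSeq(string input, string[] parts, bool skipSpaces)
{
    size_t pos = 0;
    foreach (i, part; parts)
    {
        if (skipSpaces && i > 0)
        {
            while (pos < input.length && isWhite(input[pos]))
                ++pos;
        }
        if (input.length - pos < part.length
            || input[pos .. pos + part.length] != part)
            return false;
        pos += part.length;
    }
    return true;
}

unittest
{
    auto abc = ["a", "b", "c"];

    // Rule1 <- A B C
    assert( matchSeq("abc", abc, false));
    assert(!matchSeq("a   b c", abc, false));

    // Rule2 <  A B C
    assert( matchSeq("abc", abc, true));
    assert( matchSeq("a   b c", abc, true));
}
```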

As a TODO, I plan to let users define their own `Space` parser (which could, for example, consume both spaces and comments, as is done in the PEG grammar itself), which the space sequence would call behind the scenes. That way, space consumption could be customized for each grammar. **Pegged** is not there yet.


* * * *
Next lesson: [[Parametrized Rules]]

* * * *

[[Pegged Tutorial]]