Extended PEG Syntax

PhilippeSigaud edited this page Mar 20, 2012 · 48 revisions

As we saw in PEG Basics and Declaring a Grammar, Pegged implements the entire PEG syntax, exactly as it was defined by its author.

Still, I felt the need to extend it a little. When I added these extensions, semantic actions were not yet implemented in Pegged. Now that they are, the extensions are no longer strictly necessary, but they remain useful shortcuts.

Dropping a Node

The first extensions act on the result of a parsing expression. Given a parsing expression 'e':

  • :e will drop e's captures. And, due to the way sequences are implemented in Pegged, the parent expression will forget e's result (that's deliberate). This allows one to write:
mixin(grammar("
    JSON   <- :'{' (Pair (:',' Pair)*)? :'}'
    Pair   <- String :':' Value

    # Rest of JSON grammar ...

"));

In the first rule, note the colon before the curly-brace literals and the comma. It means that when JSON is called on {"Hello":42, "World!":0}, the parse tree will contain only the interesting parts, not the punctuation needed to structure the JSON text:

ParseTree("JSON",
    ParseTree("Pair",
        ParseTree("String", ...),
        ParseTree("Number", ...)),
    ParseTree("Pair",
        ParseTree("String", ...),
        ParseTree("Number", ...))
)

Fusing Captures

The ~ (tilde) operator concatenates an expression's captures into one string. It was chosen for its resemblance to the equivalent D operator. It's useful when an expression would otherwise return a long list of individual matches and you're interested only in the global result:

mixin(grammar("
    # See the ':' before DoubleQuote
    # And the '~' before (Char*)
    String <- :DoubleQuote ~(Char*) :DoubleQuote
    Char <- !DoubleQuote . # Anything but a double quote
"));

Without the tilde operator, parsing a string with String would return a list of Char results. With the tilde, you get the string content, which is most probably what you want:

auto p = String.parse(q{"Hello World!"});
assert(p.capture == ["Hello World!"]);
// without tilde: p.capture == ["H", "e", "l", "l", "o", " ", "W", ...]

The same goes for number-recognizers:

Number <- ~(Digit+)
Digit <- [0-9]
auto n = Number.parse("1234");
assert(n.capture == ["1234"]);
// without tilde: n.capture == ["1", "2", "3", "4"]

Internally, it's used by Identifier and QualifiedIdentifier.

Named Captures

The =name (equal) operator is used to name a particular capture. It's described in Named Captures, but here is the idea:

Email <- QualifiedIdentifier=name :'@' QualifiedIdentifier=domain
enum p = Email.parse("John.Doe@example.org");
assert(p.namedCaptures["name"] == "John.Doe");
assert(p.namedCaptures["domain"] == "example.org");

Semantic Actions

Semantic actions are enclosed in curly braces and placed after the expression they act upon:

XMLNode <- OpeningTag {OpeningAction} (Text / Node)* ClosingTag {ClosingAction}

You can use any delegate from Output to Output as a semantic action. See Semantic Actions.

Other Extensions

Pegged has other extensions, such as @ or ^, but these are in flux right now and I'll wait for the design to stabilize before documenting them.

Rule-Level Extensions

All the previously described extensions act upon expressions. When you want an operator to act upon an entire rule, you can enclose the rule's expression in parentheses:

Rule <- ~(complicated expression I want to fuse)

This need is common enough for Pegged to provide a shortcut: put the operator in the arrow:

<~ (squiggly arrow) concatenates the captures on the right-hand side of the arrow.

<: (colon arrow) drops the entire rule result (useful to ignore comments, for example)

<{Action} associates an action with a rule.

For example:

Number <~ Digit+
Digit  <- [0-9]

# Nested comments
Comment <: "/*" (Comment / Text)* "*/"
Text    <~ (!("/*"/"*/") .)+ # One or more chars, stopping at begin/end markers

That makes the Number expression a bit more readable (if you use a font that clearly distinguishes between ~ and -, which GitHub does not really do...)
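To make the shortcut concrete, here is a sketch of how each arrow form lines up with the parenthesized form described above. The pairing follows this page's description rather than Pegged's source, so read each pair as equivalent in intent, not necessarily in every implementation detail. Rule, e and Action are placeholder names, and the two forms of a pair are alternatives, not rules to put in the same grammar.

```
# Fusing: arrow form, then the spelled-out form
Number  <~ Digit+
Number  <- ~(Digit+)

# Dropping: arrow form, then the spelled-out form
Comment <: "/*" (Comment / Text)* "*/"
Comment <- :("/*" (Comment / Text)* "*/")

# Action: arrow form, then the spelled-out form (assumed grouping)
Rule    <{Action} e
Rule    <- (e) {Action}
```

The arrow forms save a pair of parentheses and keep the operator next to the rule name, where it is easier to spot when scanning a long grammar.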

Space Arrow

There is another kind of rule-level extension: the 'space arrow', which uses a bare < as the arrow. Instead of treating the parsing expression as a standard PEG sequence, which does not consume spaces, it consumes spaces between elements.

So, given:

Rule1 <- A B C
Rule2 <  A B C

A <- 'a'
B <- 'b'
C <- 'c'

Rule1 will parse "abc" but not "a b c", whereas Rule2 will parse both inputs (and will output a parse tree holding only an A node, a B node and a C node, with no space node).

The space consumption is done by the predefined Spacing parser (see Predefined Parsers), which munches spaces, tabs and the like, as well as line terminators.
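One way to picture the space arrow is as a plain sequence with a dropped Spacing call between elements. The hand-written form below is an assumed desugaring based only on the description above, not on Pegged's implementation. In particular, this page does not say whether spaces before the first element or after the last one are also skipped.

```
# Rule2 as written with the space arrow
Rule2 <  A B C

# Roughly the same thing spelled out by hand (assumed desugaring).
# ':' drops the Spacing result, so the tree holds only A, B and C.
Rule2 <- A :Spacing B :Spacing C
```

This picture relies on Spacing succeeding even when it consumes nothing, which is implied by Rule2 accepting "abc" as well as "a b c".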

As a TODO, I plan to let users define their own Space parser (one that could, for example, consume both spaces and comments, as is done in the PEG grammar itself), which the space sequence would call behind the scenes. That way, space consumption could be customized for each grammar. Pegged is not there yet.


Next lesson: Parametrized Rules

