Extended PEG Syntax

As we saw in PEG Basics and Declaring a Grammar, Pegged implements the entire PEG syntax, exactly as it was defined by its author.

I felt the need to extend this syntax a little. When I added these extensions, semantic actions were not yet implemented in Pegged. Now that they are, the extensions are no longer strictly necessary, but they remain useful shortcuts.

Dropping a Node

The first extensions act on the result of a parsing expression. Given a parsing expression e:

:e will drop e's captures. And, due to the way sequences are implemented in Pegged, the parent expression will forget e's result (that's deliberate). This allows one to write:

mixin(grammar("
    JSON   <- :'{' (Pair (:',' Pair)*)? :'}'
    Pair   <- String :':' Value

    # Rest of JSON grammar ...

"));

In the first rule, note the colon before the curly-brace literals and before the comma. When called on {"Hello":42, "World!":0}, the JSON parse tree will therefore contain only the interesting parts, not the punctuation needed to structure the JSON grammar:

ParseTree("JSON",
    ParseTree("Pair",
        ParseTree("String", ...),
        ParseTree("Number", ...)),
    ParseTree("Pair",
        ParseTree("String", ...),
        ParseTree("Number", ...))
)
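To make the "parent forgets the dropped result" idea concrete, here is a minimal standalone sketch in plain D. It does not use Pegged at all: Part and keptCaptures are hypothetical names invented for this illustration, and the real implementation inside Pegged may look quite different.

```d
import std.algorithm : filter, map;
import std.array : array;

/// One matched element of a sequence: the matched text, plus a flag
/// saying whether it was prefixed with ':' in the grammar.
/// (Hypothetical type, not part of Pegged.)
struct Part
{
    string text;
    bool dropped;
}

/// Collect the captures a parent sequence keeps:
/// everything that was not marked as dropped.
string[] keptCaptures(Part[] parts)
{
    return parts.filter!(p => !p.dropped).map!(p => p.text).array;
}

void main()
{
    // Models  Pair <- String :':' Value  matched against  "Hello":42
    auto pair = [Part(`"Hello"`, false), Part(":", true), Part("42", false)];
    assert(keptCaptures(pair) == [`"Hello"`, "42"]);
}
```

The colon is matched (the parse would fail without it), but it leaves no trace in what the parent sees.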

Fusing Captures

The ~ (tilde) operator concatenates an expression's captures into one string. It was chosen for its resemblance to D's own concatenation operator, which is also ~. It's useful when an expression would otherwise return a long list of individual parses, whereas you're interested only in the global result:

mixin(grammar("
    # See the ':' before DoubleQuote
    # And the '~' before (Char*)
    String <- :DoubleQuote ~(Char*) :DoubleQuote
    Char <- !DoubleQuote . # Anything but a double quote
"));

Without the tilde operator, parsing a string with String would return a list of Char results. With the tilde, you get the string content, which is most probably what you want:

auto p = String.parse(q{"Hello World!"});
assert(p.capture == ["Hello World!"]);
// without tilde: p.capture == ["H", "e", "l", "l", "o", " ", "W", ...]

The same goes for number-recognizers:

Number <- ~(Digit+)
Digit <- [0-9]

auto n = Number.parse("1234");
assert(n.capture == ["1234"]);
// without tilde: n.capture == ["1", "2", "3", "4"]

Internally, the tilde operator is used by Pegged's Identifier and QualifiedIdentifier.
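The effect of the tilde on a capture list can be modelled in a few lines of plain D. This is only a toy stand-in to show the before/after shape of the captures: fuse is a hypothetical helper written for this page, not a Pegged function.

```d
import std.array : join;

/// Toy stand-in for the `~` operator: collapse a list of captures
/// into a one-element list holding their concatenation.
/// (Hypothetical helper, not part of Pegged.)
string[] fuse(string[] captures)
{
    return [captures.join()];
}

void main()
{
    // What `Digit+` alone would capture on "1234" ...
    string[] digits = ["1", "2", "3", "4"];
    // ... and what `~(Digit+)` turns it into.
    assert(fuse(digits) == ["1234"]);
}
```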

Named Captures

The =name (equal) operator is used to name a particular capture. It is described in Named Captures, but here is the idea:

Email <- QualifiedIdentifier=name :'@' QualifiedIdentifier=domain

enum p = Email.parse("John.Doe@example.org");
assert(p.namedCaptures["name"] == "John.Doe");
assert(p.namedCaptures["domain"] == "example.org");

Semantic Actions

Semantic actions are enclosed in curly braces and placed after the expression they act upon:

XMLNode <- OpeningTag {OpeningAction} (Text / Node)* ClosingTag {ClosingAction}

You can use any delegate from Output to Output as a semantic action. See Semantic Actions.
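To give a feel for the "Output to Output" shape, here is a small standalone sketch. The Output struct below is a hypothetical stand-in that models only the two fields used elsewhere on this page (capture and namedCaptures); the real Pegged type may well have more, or differ. The action name shout is also invented for the example.

```d
import std.algorithm : map;
import std.array : array;
import std.uni : toUpper;

/// Hypothetical stand-in for Pegged's Output type. Only the fields
/// that appear in this tutorial are modelled; this is an assumption.
struct Output
{
    string[] capture;
    string[string] namedCaptures;
}

void main()
{
    // A delegate with the Output -> Output shape: it receives the
    // result of an expression and returns a (possibly modified) result.
    Output delegate(Output) shout = (Output o) {
        o.capture = o.capture.map!(s => s.toUpper).array;
        return o;
    };

    auto result = shout(Output(["hello", "world"]));
    assert(result.capture == ["HELLO", "WORLD"]);
}
```

In a grammar, such an action would be attached as {shout} after the expression whose result it should transform.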

Other Extensions

Pegged has other extensions, such as @ or ^, but these are in flux right now and I'll wait for the design to stabilize before documenting them.

Rule-Level Extensions

All the previously described extensions act upon expressions. When you want an operator to act upon an entire rule, you can wrap the rule's whole right-hand side in parentheses and put the operator in front:

Rule <- ~(complicated expression I want to fuse)

This need is common enough for Pegged to provide a shortcut: put the operator in the arrow:

<~ (squiggly arrow) concatenates the captures on the right-hand side of the arrow.

<: (colon arrow) drops the entire rule result (useful to ignore comments, for example).

<{Action} associates an action with a rule.

For example:

Number <~ Digit+
Digit  <- [0-9]

# Nested comments
Comment <: "/*" (Comment / Text)* "*/"
Text    <~ (!("/*"/"*/") .)* # Anything but begin/end markers

That makes the Number rule a bit more readable (provided your font clearly distinguishes ~ from -, which GitHub's does not really do...)

Space Arrow

There is another rule-level extension, the 'space arrow', which uses a bare < as the arrow. Instead of treating the parsing expression as a standard PEG sequence, which does not consume spaces, it consumes spaces between the elements.

So, given:

Rule1 <- A B C
Rule2 <  A B C

A <- 'a'
B <- 'b'
C <- 'c'

Rule1 will parse "abc" but not "a b c", whereas Rule2 will parse both inputs (and will output a parse tree holding only an A node, a B node and a C node, with no space node).

The space consumption is done by the predefined Spacing parser (see Predefined Parsers), which munches blank characters, tabs and the like, and line terminators.
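The difference between Rule1 and Rule2 can be sketched with a small hand-written matcher in plain D. This is not how Pegged implements the space arrow: matchSequence is a hypothetical helper, and std.ascii.isWhite is used here as a rough approximation of what Spacing consumes.

```d
import std.ascii : isWhite;

/// Match the literals in `parts`, one after another, at the start of
/// `input`. When `skipSpaces` is true, whitespace between two elements
/// is consumed first, mimicking the behaviour described for `<`.
/// (Hypothetical helper, not part of Pegged.)
bool matchSequence(string input, string[] parts, bool skipSpaces)
{
    size_t pos = 0;

    foreach (i, part; parts)
    {
        // Only skip *between* elements, never before the first one.
        if (skipSpaces && i > 0)
        {
            while (pos < input.length && isWhite(input[pos]))
                ++pos;
        }

        if (input.length - pos < part.length
            || input[pos .. pos + part.length] != part)
            return false;

        pos += part.length;
    }
    return true;
}

void main()
{
    auto abc = ["a", "b", "c"];

    // Rule1 <- A B C : plain sequence, no space allowed.
    assert( matchSequence("abc",   abc, false));
    assert(!matchSequence("a b c", abc, false));

    // Rule2 <  A B C : spaces between elements are consumed.
    assert( matchSequence("abc",   abc, true));
    assert( matchSequence("a b c", abc, true));
}
```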

As a TODO, I plan to let users define their own Space parser (one that could, for example, consume both spaces and comments, as is done in the PEG grammar itself), which the space sequence would call behind the scenes. That way, space consumption could be customized for each grammar. Pegged is not there yet.

Next lesson: Parametrized Rules
