Extended PEG Syntax

PhilippeSigaud edited this page Mar 10, 2012 · 48 revisions

As we saw in PEG Basics and Declaring a Grammar, Pegged implements the entire PEG syntax, exactly as it was defined by its author.

I nevertheless felt the need to extend it a little. When these extensions were added, semantic actions were not yet implemented in Pegged. Now that they are, the extensions are no longer strictly necessary, but they remain useful shortcuts.

Dropping a Node:

The first extensions act on the result of a parsing expression. Given a parsing expression e:

  • :e drops e's captures. Because of the way sequences are implemented in Pegged, the parent expression also discards e's result (that's deliberate). This allows one to write:
mixin(grammar("
    JSON   <- :'{' (Pair (:',' Pair)*)? :'}'
    Pair   <- String :':' Value

    # Rest of JSON grammar ...

"));

In the first rule, note the colon before the curly-brace literals and the comma. When the parser is called on {"Hello":42, "World!":0}, the JSON parse tree will contain only the interesting parts, not the punctuation needed to structure the JSON grammar:

ParseTree("JSON",
    ParseTree("Pair",
        ParseTree("String", ...),
        ParseTree("Number", ...)),
    ParseTree("Pair",
        ParseTree("String", ...),
        ParseTree("Number", ...))
)
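
A smaller, self-contained sketch of the same idea follows. It is written against the API as it appears on this page (grammar, .parse, .capture), and newer Pegged releases may expose a different interface. The List and Item rules and the itemsOf helper are made up for illustration:

```d
import pegged.grammar;
import std.stdio;

// Hypothetical grammar: a comma-separated list of single lowercase letters.
// The ':' in front of ',' discards the separator's capture.
mixin(grammar("
    List <- Item (:',' Item)*
    Item <- [a-z]
"));

// Hypothetical helper: returns whatever List captured from the input.
auto itemsOf(string input)
{
    return List.parse(input).capture;
}

void main()
{
    // Per the description above, only the letters should be listed,
    // with no "," entries between them.
    writeln(itemsOf("a,b,c"));
}
```

Removing the colon from the grammar and running the program again is a quick way to see which captures the operator discards.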

Fusing Captures:

The ~ (tilde) operator concatenates an expression's captures into a single string. I chose it for its similarity to the equivalent D operator. It's useful when an expression would otherwise return a long list of individual captures, whereas you're only interested in the overall result:

mixin(grammar("
    # See the ':' before DoubleQuote
    # And the '~' before (Char*)
    String <- :DoubleQuote ~(Char*) :DoubleQuote 
"));

Without the tilde operator, using String on a string would return a list of Char results. With tilde, you get the string content, which is most probably what you want:

auto p = String.parse(q{"Hello World!"});
assert(p.capture == ["Hello World!"]);
// without tilde: p.capture == ["H", "e", "l", "l", "o", " ", "W", ...]

The same goes for number-recognizers:

Number <- ~(Digit+)
Digit <- [0-9]
auto n = Number.parse("1234");
assert(n.capture == ["1234"]);
// without tilde: n.capture == ["1", "2", "3", "4"]

Internally, it's used by Identifier and QualifiedIdentifier.

Named Captures

The =name (equals sign) operator is used to name a particular capture. It is described in detail in Named Captures, but here is the idea:

Email <- QualifiedIdentifier=name :'@' QualifiedIdentifier=domain
enum p = Email.parse("John.Doe@example.org");
assert(p.namedCaptures["name"] == "John.Doe");
assert(p.namedCaptures["domain"] == "example.org");

Semantic Actions

Semantic actions are enclosed in curly braces and placed after the expression they act upon:

XMLNode <- OpeningTag {OpAction} (Text / Node)* ClosingTag {CloseAction}

You can use any delegate from Output to Output as a semantic action. See Semantic Actions.

Other Extensions

Pegged has other extensions, such as @ and ^, but these are in flux right now and I'll wait for the design to stabilize before documenting them.

Rule-Level Extensions

All the extensions described so far act on expressions. When you want an operator to act on an entire rule, you can enclose the rule's right-hand side in parentheses:

Rule <- ~(complicated expression I want to fuse)

This need is common enough for Pegged to provide a shortcut: put the operator in the arrow:

<~ (squiggly arrow) concatenates the captures on the right-hand side of the arrow.

<: (colon arrow) drops the entire rule result (useful to ignore comments, for example)

<{Action} associates an action with a rule.

For example:

Number <~ Digit+
Digit  <- [0-9]

Comment <: "/*" (Comment / Text)* "*/"
Text    <~ (!("/*"/"*/") .)*

That makes the Number rule a bit more readable (if you use a font that clearly distinguishes ~ from -, which GitHub does not really do...).
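
To check that the arrow form and the parenthesized form behave the same, one could compare them side by side. This is only a sketch using the API shown earlier on this page (grammar, .parse, .capture), which may differ in later Pegged versions. NumberA, NumberB and sameCaptures are hypothetical names:

```d
import pegged.grammar;

// Two hypothetical rules that should be equivalent:
// NumberA fuses with the prefix operator, NumberB with the squiggly arrow.
mixin(grammar("
    NumberA <- ~(Digit+)
    NumberB <~ Digit+
    Digit   <- [0-9]
"));

// Hypothetical helper: true when both rules yield the same captures.
bool sameCaptures(string input)
{
    return NumberA.parse(input).capture == NumberB.parse(input).capture;
}

void main()
{
    // Both forms are described above as fusing the digits into one capture.
    assert(sameCaptures("1234"));
}
```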

Next lesson: Parametrized Rules

