Extended PEG Syntax

As we saw in PEG Basics and Declaring a Grammar, Pegged implements the entire PEG syntax, exactly as it was defined by its author.

I nevertheless felt the need to extend it a little. At the time, semantic actions were not yet implemented in Pegged. Now that they are, these extensions are no longer strictly necessary, but they remain useful shortcuts.

Dropping a Node
---------------

The first extensions act on the result of a parsing expression. Given a parsing expression e:

:e drops e's captures. Because of the way sequences are implemented in Pegged, the parent expression also forgets e's result (that's deliberate). This allows one to write:

mixin(grammar("
    JSON   <- :'{' (Pair (:',' Pair)*)? :'}'
    Pair   <- String :':' Value

    # Rest of JSON grammar ...

"));

In the first rule, note the colon before the curly-brace literals and the comma. When the grammar is applied to {"Hello":42, "World!":0}, the JSON parse tree contains only the interesting parts, not the punctuation needed to structure the JSON text:

ParseTree("JSON",
    ParseTree("Pair",
        ParseTree("String", ...),
        ParseTree("Number", ...)),
    ParseTree("Pair",
        ParseTree("String", ...),
        ParseTree("Number", ...))
)

Fusing Captures
---------------

The ~ (tilde) operator concatenates an expression's captures into a single string. It was chosen for its similarity to D's ~ concatenation operator. It's useful when an expression would otherwise return a long list of individual captures but you are interested only in the overall result:

mixin(grammar("
    # See the ':' before DoubleQuote
    # And the '~' before (Char*)
    String <- :DoubleQuote ~(Char*) :DoubleQuote
    Char <- !DoubleQuote . # Anything but a double quote
"));

Without the tilde operator, using String on a string would return a list of Char results. With tilde, you get the string content, which is most probably what you want:

auto p = String.parse(q{"Hello World!"});
assert(p.capture == ["Hello World!"]);
// without tilde: p.capture == ["H", "e", "l", "l", "o", " ", "W", ...]

The same goes for number-recognizers:

Number <- ~(Digit+)
Digit <- [0-9]

auto n = Number.parse("1234");
assert(n.capture == ["1234"]);
// without tilde: n.capture == ["1", "2", "3", "4"]

Internally, it's used by Identifier and QualifiedIdentifier.
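
To picture what fusing does, here is a minimal sketch in plain D. The `fuse` helper is hypothetical and not part of Pegged; it only mimics the effect of ~ on a list of captures:

```d
import std.array : join;

// Hypothetical helper (not Pegged's code): collapses the individual
// captures into one string, kept in a one-element array.
string[] fuse(string[] captures)
{
    return [captures.join];
}

void main()
{
    // Same shape as the Number example above
    assert(fuse(["1", "2", "3", "4"]) == ["1234"]);

    // D's own ~ operator concatenates as well, hence the choice of symbol
    assert("12" ~ "34" == "1234");
}
```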

Named Captures
--------------

The =name (equal) operator names a particular capture. It is described in Named Captures, but here is the idea:

Email <- QualifiedIdentifier=name :'@' QualifiedIdentifier=domain

enum p = Email.parse("John.Doe@example.org");
assert(p.namedCaptures["name"] == "John.Doe");
assert(p.namedCaptures["domain"] == "example.org");

Semantic Actions
----------------

Semantic actions are enclosed in curly braces and placed after the expression they act upon:

XMLNode <- OpeningTag {OpeningAction} (Text / Node)* ClosingTag {ClosingAction}

You can use any delegate from Output to Output as a semantic action. See Semantic Actions.
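
As a rough illustration, here is what an action could look like. The `Output` struct below is only a stand-in so that the sketch compiles without Pegged (the real type carries more than a capture list), and `toUpperAction` is a hypothetical name:

```d
import std.algorithm.iteration : map;
import std.array : array;
import std.uni : toUpper;

// Stand-in for Pegged's Output type (an assumption, made so that this
// sketch is self-contained); only the capture list is modelled.
struct Output
{
    string[] capture;
}

// Hypothetical action: takes an Output, returns an Output.
// It upper-cases every capture and leaves the input untouched.
Output toUpperAction(Output o)
{
    o.capture = o.capture.map!(s => s.toUpper).array;
    return o;
}

void main()
{
    auto result = toUpperAction(Output(["div", "class"]));
    assert(result.capture == ["DIV", "CLASS"]);
}
```

Assuming such a function is in scope, it would be attached to an expression as {toUpperAction}.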

Range of Chars Extension
------------------------

The characters - (dash), [ (opening square bracket) and ] (closing square bracket) have a special meaning in char ranges (the [a-z] syntax). In Pegged they can be escaped with \ to represent themselves. As usual, a literal \ is then written \\. Use them like this:

[-+]     # '-' in first position is OK. Already possible in PEG. 
[+\-]    # Escape '-'
[-\[\]]  # Escape '-', '[' and ']'
[\\\-]   # Matches `\` and `-`
[\r\n\t] # Standard escapes are there, too.
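
To make the escaping rules concrete, here is a small sketch of how such a char class body could be interpreted. It is not Pegged's actual implementation; `Item`, `tokenize` and `inClass` are hypothetical names, and only single-byte characters are handled:

```d
// One member of a char class: the character and whether it was escaped.
struct Item
{
    char ch;
    bool escaped;
}

// Splits the class body (the text between '[' and ']') into members,
// resolving backslash escapes along the way.
Item[] tokenize(string spec)
{
    Item[] items;
    for (size_t i = 0; i < spec.length; ++i)
    {
        if (spec[i] == '\\' && i + 1 < spec.length)
        {
            ++i; // skip the backslash, look at the escaped character
            char decoded;
            switch (spec[i])
            {
                case 'n': decoded = '\n'; break;
                case 'r': decoded = '\r'; break;
                case 't': decoded = '\t'; break;
                // an escaped dash, bracket or backslash stands for itself
                default:  decoded = spec[i]; break;
            }
            items ~= Item(decoded, true);
        }
        else
        {
            items ~= Item(spec[i], false);
        }
    }
    return items;
}

// Tests whether c belongs to the class described by spec.
bool inClass(string spec, char c)
{
    auto items = tokenize(spec);
    for (size_t i = 0; i < items.length; ++i)
    {
        // An unescaped '-' between two members denotes a range, as in a-z.
        if (i + 2 < items.length && items[i + 1].ch == '-' && !items[i + 1].escaped)
        {
            if (items[i].ch <= c && c <= items[i + 2].ch)
                return true;
            i += 2; // the dash and the upper bound are consumed
        }
        else if (items[i].ch == c)
        {
            return true;
        }
    }
    return false;
}

void main()
{
    assert(inClass(`-+`, '-') && inClass(`-+`, '+'));     // [-+]
    assert(inClass(`+\-`, '-') && !inClass(`+\-`, ','));  // [+\-] is not a range
    assert(inClass(`-\[\]`, ']'));                        // [-\[\]]
    assert(inClass(`\\\-`, '\\'));                        // [\\\-]
    assert(inClass(`a-z`, 'q') && !inClass(`a-z`, 'Q'));  // an ordinary range
}
```

The `[+\-]` case shows why the escape matters: read as a range from + to -, the class would also accept a comma.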

Also, the char semantics were extended to deal with Unicode characters (and not only UTF-8). Basic PEG allows a char to be represented as an ASCII octal sequence: `\040`. **Pegged** also recognizes:

*  `\x41`: hexadecimal UTF-8 chars.

*  `\u0041`: hexadecimal UTF-16 chars.

*  `\U00000041`: hexadecimal UTF-32 chars.

I added them to deal with the W3C XML spec.
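
D's own string literals happen to accept the same escape forms, so plain D is enough to check that the notations name the same character. The sketch below is about D literals only, not about Pegged's char parser, and `firstCodePoint` is a hypothetical helper:

```d
// Hypothetical helper: returns the first code point of a UTF-8 string.
dchar firstCodePoint(string s)
{
    foreach (dchar c; s) // iterating as dchar decodes the UTF-8 sequence
        return c;
    assert(0, "empty string");
}

void main()
{
    // Four notations for 'A' (code point 65, octal 101, hex 41)
    static assert("\101" == "A");       // octal, as in basic PEG
    static assert("\x41" == "A");       // two hex digits
    static assert("\u0041" == "A");     // four hex digits
    static assert("\U00000041" == "A"); // eight hex digits

    // Outside ASCII, the \u and \U forms name a code point directly
    assert(firstCodePoint("\u00E9") == 0xE9);
    assert(firstCodePoint("\U0001F600") == 0x1F600); // beyond the 16-bit range
}
```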

Other Extensions
----------------

**Pegged** has other extensions, such as `@` or `^` but these are in flux right now and I'll wait for the design to stabilize before documenting them.

Rule-Level Extensions
---------------------

All the previously described extensions act upon expressions. When you want an operator to act upon an entire rule, you can enclose the rule's whole expression in parentheses and apply the operator to it:

Rule <- ~(complicated expression I want to fuse)


This need is common enough for **Pegged** to provide a shortcut: put the operator in the arrow:

`<~` (squiggly arrow) concatenates the captures on the right-hand side of the arrow.

`<:` (colon arrow) drops the entire rule result (useful to ignore comments, for example)

`<{Action}` associates an action with a rule.

For example:

Number <~ Digit+
Digit  <- [0-9]

# Nested comments
Comment <: "/*" (Comment / Text)* "*/"
Text    <~ (!("/*" / "*/") .)* # Anything but begin/end markers


That makes the `Number` rule a bit more readable (provided your font clearly distinguishes ~ from -, which GitHub's does not really do...)

Space Arrow
-----------

There is one more rule-level extension, the 'space arrow', which uses `< ` as the arrow. Instead of treating the parsing expression as a standard PEG sequence, which does not consume spaces, it consumes spaces between elements.

So, given:

Rule1 <- A B C
Rule2 <  A B C

A <- 'a'
B <- 'b'
C <- 'c'


`Rule1` will parse `"abc"` but not `"a   b c"`, whereas `Rule2` will parse both inputs (and will output a parse tree holding only an `A` node, a `B` node and a `C` node, with no space node).

The space consumption is done by the predefined `Spacing` parser (see [[Predefined Parsers]]), which munches blank chars, tabs and similar whitespace, and line terminators.
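
The difference between the two arrows can be sketched in plain D. This is not how Pegged implements it: `matchSequence` is a hypothetical helper that matches string literals only, and `std.ascii.isWhite` stands in for the `Spacing` parser:

```d
import std.algorithm.searching : startsWith;
import std.ascii : isWhite;

// Hypothetical helper: matches the literals in `parts` one after another
// at the start of `input`. With `skipSpaces`, whitespace between two
// elements is consumed first, which is what the space arrow adds.
bool matchSequence(string input, string[] parts, bool skipSpaces)
{
    foreach (i, part; parts)
    {
        if (skipSpaces && i > 0)
        {
            while (input.length && isWhite(input[0]))
                input = input[1 .. $];
        }
        if (!input.startsWith(part))
            return false;
        input = input[part.length .. $];
    }
    return true;
}

void main()
{
    auto abc = ["a", "b", "c"];

    // Like Rule1 <- A B C
    assert(matchSequence("abc", abc, false));
    assert(!matchSequence("a   b c", abc, false));

    // Like Rule2 < A B C
    assert(matchSequence("abc", abc, true));
    assert(matchSequence("a   b c", abc, true));
}
```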

As a TODO, I plan to let users define their own `Space` parser (which could, for example, consume both spaces and comments, as is done in the PEG grammar itself), which the space sequence would call behind the scenes. That way, space consumption could be customized for each grammar. **Pegged** is not there yet.


* * * *
Next lesson: [[Parametrized Rules]]

* * * *

[[Pegged Tutorial]]
