Extended PEG Syntax
As we saw in PEG Basics and Declaring a Grammar, Pegged implements the entire PEG syntax, exactly as it was defined by its author.
Even so, I felt the need to extend it a little. At the time, semantic actions were not yet implemented in Pegged. Now that they are, these extensions are no longer strictly necessary, but they remain useful shortcuts.
The first extensions act on the result of a parsing expression. Given a parsing expression e:
- :e will drop e's captures. And, due to the way sequences are implemented in Pegged, the parent expression will forget e's result (that's deliberate). It allows one to write:
mixin(grammar("
JSON <- :'{' (Pair (:',' Pair)*)? :'}'
Pair <- String :':' Value
# Rest of JSON grammar ...
"));
In the first rule, note the colon before the curly-brace literals and the comma. That means that when called on {"Hello":42, "World!":0}, the JSON parse tree will contain only the interesting parts, not the syntactic signs necessary to structure the JSON grammar:
ParseTree("JSON",
    ParseTree("Pair",
        ParseTree("String", ...),
        ParseTree("Number", ...)),
    ParseTree("Pair",
        ParseTree("String", ...),
        ParseTree("Number", ...))
)
The ~ (tilde) operator concatenates an expression's captures into one string. It was chosen for its similarity to the equivalent D operator. It's useful when an expression would otherwise return a long list of individual parses, whereas you're interested only in the global result:
mixin(grammar("
# See the ':' before DoubleQuote
# And the '~' before (Char*)
String <- :DoubleQuote ~(Char*) :DoubleQuote
Char <- !DoubleQuote . # Anything but a double quote
"));
Without the tilde operator, using String on a string would return a list of Char results. With tilde, you get the string content, which is most probably what you want:
auto p = String.parse(q{"Hello World!"});
assert(p.capture == ["Hello World!"]);
// without tilde: p.capture == ["H", "e", "l", "l", "o", " ", "W", ...]
The same goes for number-recognizers:
Number <- ~(Digit+)
Digit <- [0-9]
auto n = Number.parse("1234");
assert(n.capture == ["1234"]);
// without tilde: n.capture == ["1", "2", "3", "4"]
Internally, it's used by Identifier and QualifiedIdentifier.
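The fusing done by the tilde can be pictured with plain D string concatenation. The following is only a sketch of the semantics, not Pegged's internals: fuse is a hypothetical helper name, and the capture list is modeled as a plain string[].

```d
import std.array : join;

// Hypothetical helper (not part of Pegged): models what `~` does to a
// list of captures by concatenating them, in order, into a single entry.
string[] fuse(string[] captures)
{
    return [captures.join()];
}

unittest
{
    // What Digit+ would capture on "1234" without the tilde...
    string[] digits = ["1", "2", "3", "4"];
    // ...and what remains once the captures are fused.
    assert(fuse(digits) == ["1234"]);
}
```

The block has no main function; compiling it with dmd -main -unittest runs the check.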
The =name (equal) operator is used to name a particular capture. It's described in Named Captures, but here is the idea:
Email <- QualifiedIdentifier=name :'@' QualifiedIdentifier=domain
enum p = Email.parse("John.Doe@example.org");
assert(p.namedCaptures["name"] == "John.Doe");
assert(p.namedCaptures["domain"] == "example.org");
Semantic actions are enclosed in curly braces and put behind the expression they act upon:
XMLNode <- OpeningTag {OpeningAction} (Text / Node)* ClosingTag {ClosingAction}
You can use any delegate from Output to Output as a semantic action. See Semantic Actions.
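To make the "Output to Output" shape concrete, here is a minimal sketch of what such an action could look like. The Output struct below is a stand-in defined only for this example, reduced to a single capture field; it is an assumption and not Pegged's real type, which has more members. The name upperCase is likewise hypothetical.

```d
import std.algorithm : map;
import std.array : array;
import std.uni : toUpper;

// Stand-in for Pegged's Output type (assumption: only the field this
// sketch needs is modeled here).
struct Output
{
    string[] capture;
}

// A semantic action: takes an Output, returns a (possibly modified) Output.
// This one upper-cases every capture and leaves the caller's data untouched.
Output upperCase(Output o)
{
    o.capture = o.capture.map!(c => c.toUpper).array;
    return o;
}

unittest
{
    auto result = upperCase(Output(["hello", "world"]));
    assert(result.capture == ["HELLO", "WORLD"]);
}
```

In a grammar, following the syntax shown above, such a function would be attached as Rule <- Expression {upperCase}.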
Pegged has other extensions, such as @ or ^, but these are in flux right now and I'll wait for the design to stabilize before documenting them.
All the previously described extensions act upon expressions. When you want an operator to act upon an entire rule, you can enclose the rule's whole right-hand side in parentheses:
Rule <- ~(complicated expression I want to fuse)
This need is common enough for Pegged to provide a shortcut: put the operator in the arrow:
- <~ (squiggly arrow) concatenates the captures on the right-hand side of the arrow.
- <: (colon arrow) drops the entire rule result (useful to ignore comments, for example).
- <{Action} associates an action with a rule.
For example:
Number <~ Digit+
Digit <- [0-9]
# Nested comments
Comment <: "/*" (Comment / Text)* "*/"
Text <~ (!("/*"/"*/") .)* # Anything but begin/end markers
That makes the Number rule a bit more readable (provided your font clearly distinguishes ~ from -, which GitHub's does not really do...)
There is another kind of arrow, the 'space arrow', written simply as <. Instead of treating the parsing expression as a standard non-space-consuming PEG sequence, it consumes spaces between elements.
So, given:
Rule1 <- A B C
Rule2 < A B C
A <- 'a'
B <- 'b'
C <- 'c'
Rule1 will parse "abc" but not "a b c", whereas Rule2 will parse both inputs (and will output a parse tree holding only an A node, a B and a C, with no space node).
The space consumption is done by the predefined Spacing parser (see Predefined Parsers), which munches blank characters, tabs and the like, and line terminators.
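The difference between the two arrows can be modeled in a few lines of plain D. This is a sketch of the behavior described above, not Pegged's implementation: matchSequence is a hypothetical function, the rules are reduced to string literals, and std.ascii.isWhite stands in for the Spacing parser.

```d
import std.algorithm : startsWith;
import std.ascii : isWhite;

// Hypothetical model (not Pegged code): match a sequence of literals at the
// start of the input. With skipSpaces set, blanks between two elements are
// consumed and discarded, as the space arrow does.
bool matchSequence(string input, string[] literals, bool skipSpaces)
{
    size_t pos = 0;
    foreach (i, lit; literals)
    {
        if (skipSpaces && i > 0)
        {
            // Stand-in for the predefined Spacing parser.
            while (pos < input.length && isWhite(input[pos]))
                ++pos;
        }
        if (!input[pos .. $].startsWith(lit))
            return false;
        pos += lit.length;
    }
    return true;
}

unittest
{
    auto abc = ["a", "b", "c"];
    assert( matchSequence("abc",   abc, false)); // Rule1 <- A B C
    assert(!matchSequence("a b c", abc, false)); // Rule1 rejects spaces
    assert( matchSequence("abc",   abc, true));  // Rule2 <  A B C
    assert( matchSequence("a b c", abc, true));  // Rule2 accepts both
}
```

Whether the real space arrow also consumes blanks before the first element or after the last one is not stated here, so the model leaves the edges alone.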
As a TODO, I plan to let users define their own Space parser (which could, for example, consume both spaces and comments, as is done in the PEG grammar itself), which the space sequence would call behind the scenes. That way, space consumption could be customized for each grammar. Pegged is not there yet.
Next lesson: Parametrized Rules