W3cubDocs

/Nim

Module pegs

Simple PEG (Parsing expression grammar) matching. Uses no memorization, but uses superoperators and symbol inlining to improve performance. Note: Matching performance is hopefully competitive with optimized regular expression engines.

PEG syntax and semantics

A PEG (Parsing expression grammar) is a simple deterministic grammar, that can be directly used for parsing. The current implementation has been designed as a more powerful replacement for regular expressions. UTF-8 is supported.

The notation used for a PEG is similar to that of EBNF:

notation meaning
A / ... / Z Ordered choice: Apply expressions A, ..., Z, in this order, to the text ahead, until one of them succeeds and possibly consumes some text. Indicate success if one of expressions succeeded. Otherwise do not consume any text and indicate failure.
A ... Z Sequence: Apply expressions A, ..., Z, in this order, to consume consecutive portions of the text ahead, as long as they succeed. Indicate success if all succeeded. Otherwise do not consume any text and indicate failure. The sequence's precedence is higher than that of ordered choice: A B / C means (A B) / Z and not A (B / Z).
(E) Grouping: Parenthesis can be used to change operator priority.
{E} Capture: Apply expression E and store the substring that matched E into a capture that can be accessed after the matching process.
$i Back reference to the i``th capture. ``i counts from 1.
$ Anchor: Matches at the end of the input. No character is consumed. Same as !..
^ Anchor: Matches at the start of the input. No character is consumed.
&E And predicate: Indicate success if expression E matches the text ahead; otherwise indicate failure. Do not consume any text.
!E Not predicate: Indicate failure if expression E matches the text ahead; otherwise indicate success. Do not consume any text.
E+ One or more: Apply expression E repeatedly to match the text ahead, as long as it succeeds. Consume the matched text (if any) and indicate success if there was at least one match. Otherwise indicate failure.
E* Zero or more: Apply expression E repeatedly to match the text ahead, as long as it succeeds. Consume the matched text (if any). Always indicate success.
E? Zero or one: If expression E matches the text ahead, consume it. Always indicate success.
[s] Character class: If the character ahead appears in the string s, consume it and indicate success. Otherwise indicate failure.
[a-b] Character range: If the character ahead is one from the range a through b, consume it and indicate success. Otherwise indicate failure.
's' String: If the text ahead is the string s, consume it and indicate success. Otherwise indicate failure.
i's' String match ignoring case.
y's' String match ignoring style.
v's' Verbatim string match: Use this to override a global \i or \y modifier.
i$j String match ignoring case for back reference.
y$j String match ignoring style for back reference.
v$j Verbatim string match for back reference.
. Any character: If there is a character ahead, consume it and indicate success. Otherwise (that is, at the end of input) indicate failure.
_ Any Unicode character: If there is an UTF-8 character ahead, consume it and indicate success. Otherwise indicate failure.
@E Search: Shorthand for (!E .)* E. (Search loop for the pattern E.)
{@} E Captured Search: Shorthand for {(!E .)*} E. (Search loop for the pattern E.) Everything until and exluding E is captured.
@@ E Same as {@} E.
A <- E Rule: Bind the expression E to the nonterminal symbol A. Left recursive rules are not possible and crash the matching engine.
\identifier Built-in macro for a longer expression.
\ddd Character with decimal code ddd.
\", etc Literal ", etc.

Built-in macros

macro meaning
\d any decimal digit: [0-9]
\D any character that is not a decimal digit: [^0-9]
\s any whitespace character: [ \9-\13]
\S any character that is not a whitespace character: [^ \9-\13]
\w any "word" character: [a-zA-Z0-9_]
\W any "non-word" character: [^a-zA-Z0-9_]
\a same as [a-zA-Z]
\A same as [^a-zA-Z]
\n any newline combination: \10 / \13\10 / \13
\i ignore case for matching; use this at the start of the PEG
\y ignore style for matching; use this at the start of the PEG
\skip pat skip pattern pat before trying to match other tokens; this is useful for whitespace skipping, for example: \skip(\s*) {\ident} ':' {\ident} matches key value pairs ignoring whitespace around the ':'.
\ident a standard ASCII identifier: [a-zA-Z_][a-zA-Z_0-9]*
\letter any Unicode letter
\upper any Unicode uppercase letter
\lower any Unicode lowercase letter
\title any Unicode title letter
\white any Unicode whitespace character

A backslash followed by a letter is a built-in macro, otherwise it is used for ordinary escaping:

notation meaning
\\ a single backslash
\* same as '*'
\t not a tabulator, but an (unknown) built-in

Supported PEG grammar

The PEG parser implements this grammar (written in PEG syntax):

# Example grammar of PEG in PEG syntax.
# Comments start with '#'.
# First symbol is the start symbol.

grammar <- rule* / expr

identifier <- [A-Za-z][A-Za-z0-9_]*
charsetchar <- "\\" . / [^\]]
charset <- "[" "^"? (charsetchar ("-" charsetchar)?)+ "]"
stringlit <- identifier? ("\"" ("\\" . / [^"])* "\"" /
                          "'" ("\\" . / [^'])* "'")
builtin <- "\\" identifier / [^\13\10]

comment <- '#' @ \n
ig <- (\s / comment)* # things to ignore

rule <- identifier \s* "<-" expr ig
identNoArrow <- identifier !(\s* "<-")
prefixOpr <- ig '&' / ig '!' / ig '@' / ig '{@}' / ig '@@'
literal <- ig identifier? '$' [0-9]+ / '$' / '^' /
           ig identNoArrow /
           ig charset /
           ig stringlit /
           ig builtin /
           ig '.' /
           ig '_' /
           (ig "(" expr ig ")")
postfixOpr <- ig '?' / ig '*' / ig '+'
primary <- prefixOpr* (literal postfixOpr*)

# Concatenation has higher priority than choice:
# ``a b / c`` means ``(a b) / c``

seqExpr <- primary+
expr <- seqExpr (ig "/" expr)*

Note: As a special syntactic extension if the whole PEG is only a single expression, identifiers are not interpreted as non-terminals, but are interpreted as verbatim string:

abc =~ peg"abc" # is true

So it is not necessary to write peg" 'abc' " in the above example.

Examples

Check if s matches Nim's "while" keyword:

s =~ peg" y'while'"

Exchange (key, val)-pairs:

"key: val; key2: val2".replacef(peg"{\ident} \s* ':' \s* {\ident}", "$2: $1")

Determine the #include'ed files of a C file:

for line in lines("myfile.c"):
  if line =~ peg"""s <- ws '#include' ws '"' {[^"]+} '"' ws
                   comment <- '/*' @ '*/' / '//' .*
                   ws <- (comment / \s+)* """:
    echo matches[0]

PEG vs regular expression

As a regular expression \[.*\] matches the longest possible text between '[' and ']'. As a PEG it never matches anything, because a PEG is deterministic: .* consumes the rest of the input, so \] never matches. As a PEG this needs to be written as: \[ ( !\] . )* \] (or \[ @ \]).

Note that the regular expression does not behave as intended either: in the example * should not be greedy, so \[.*?\] should be used instead.

PEG construction

There are two ways to construct a PEG in Nim code:

  1. Parsing a string into an AST which consists of Peg nodes with the peg proc.
  2. Constructing the AST directly with proc calls. This method does not support constructing rules, only simple expressions and is not as convenient. Its only advantage is that it does not pull in the whole PEG parser into your executable.

Imports

strutils, unicode

Types

NonTerminal = ref NonTerminalObj
Peg = Node
type that represents a PEG
Captures = object
  matches: array[0 .. 20 - 1, tuple[first, last: int]]
  ml: int
  origStart: int
contains the captured substrings.
EInvalidPeg = object of ValueError
raised if an invalid PEG has been detected

Consts

MaxSubpatterns = 20
defines the maximum number of subpatterns that can be captured. More subpatterns cannot be captured!

Procs

proc term(t: string): Peg {.nosideEffect, gcsafe, extern: "npegs$1Str", raises: [],
                        tags: [].}
constructs a PEG from a terminal string
proc termIgnoreCase(t: string): Peg {.nosideEffect, gcsafe, extern: "npegs$1",
                                  raises: [], tags: [].}
constructs a PEG from a terminal string; ignore case for matching
proc termIgnoreStyle(t: string): Peg {.nosideEffect, gcsafe, extern: "npegs$1",
                                   raises: [], tags: [].}
constructs a PEG from a terminal string; ignore style for matching
proc term(t: char): Peg {.nosideEffect, gcsafe, extern: "npegs$1Char", raises: [],
                      tags: [].}
constructs a PEG from a terminal char
proc charSet(s: set[char]): Peg {.nosideEffect, gcsafe, extern: "npegs$1", raises: [],
                              tags: [].}
constructs a PEG from a character set s
proc `/`(a: varargs[Peg]): Peg {.nosideEffect, gcsafe, extern: "npegsOrderedChoice",
                             raises: [], tags: [].}
constructs an ordered choice with the PEGs in a
proc sequence(a: varargs[Peg]): Peg {.nosideEffect, gcsafe, extern: "npegs$1",
                                  raises: [], tags: [].}
constructs a sequence with all the PEGs from a
proc `?`(a: Peg): Peg {.nosideEffect, gcsafe, extern: "npegsOptional", raises: [],
                    tags: [].}
constructs an optional for the PEG a
proc `*`(a: Peg): Peg {.nosideEffect, gcsafe, extern: "npegsGreedyRep", raises: [],
                    tags: [].}
constructs a "greedy repetition" for the PEG a
proc `!*`(a: Peg): Peg {.nosideEffect, gcsafe, extern: "npegsSearch", raises: [], tags: [].}
constructs a "search" for the PEG a
proc `!*\`(a: Peg): Peg {.noSideEffect, gcsafe, extern: "npgegsCapturedSearch",
                      raises: [], tags: [].}
constructs a "captured search" for the PEG a
proc `+`(a: Peg): Peg {.nosideEffect, gcsafe, extern: "npegsGreedyPosRep", raises: [],
                    tags: [].}
constructs a "greedy positive repetition" with the PEG a
proc `&`(a: Peg): Peg {.nosideEffect, gcsafe, extern: "npegsAndPredicate", raises: [],
                    tags: [].}
constructs an "and predicate" with the PEG a
proc `!`(a: Peg): Peg {.nosideEffect, gcsafe, extern: "npegsNotPredicate", raises: [],
                    tags: [].}
constructs a "not predicate" with the PEG a
proc any(): Peg {.inline, raises: [], tags: [].}
constructs the PEG any character (.)
proc anyRune(): Peg {.inline, raises: [], tags: [].}
constructs the PEG any rune (_)
proc newLine(): Peg {.inline, raises: [], tags: [].}
constructs the PEG newline (\n)
proc unicodeLetter(): Peg {.inline, raises: [], tags: [].}
constructs the PEG \letter which matches any Unicode letter.
proc unicodeLower(): Peg {.inline, raises: [], tags: [].}
constructs the PEG \lower which matches any Unicode lowercase letter.
proc unicodeUpper(): Peg {.inline, raises: [], tags: [].}
constructs the PEG \upper which matches any Unicode uppercase letter.
proc unicodeTitle(): Peg {.inline, raises: [], tags: [].}
constructs the PEG \title which matches any Unicode title letter.
proc unicodeWhitespace(): Peg {.inline, raises: [], tags: [].}
constructs the PEG \white which matches any Unicode whitespace character.
proc startAnchor(): Peg {.inline, raises: [], tags: [].}
constructs the PEG ^ which matches the start of the input.
proc endAnchor(): Peg {.inline, raises: [], tags: [].}
constructs the PEG $ which matches the end of the input.
proc capture(a: Peg): Peg {.nosideEffect, gcsafe, extern: "npegsCapture", raises: [],
                        tags: [].}
constructs a capture with the PEG a
proc backref(index: range[1 .. MaxSubpatterns]): Peg {.nosideEffect, gcsafe,
    extern: "npegs$1", raises: [], tags: [].}
constructs a back reference of the given index. index starts counting from 1.
proc backrefIgnoreCase(index: range[1 .. MaxSubpatterns]): Peg {.nosideEffect,
    gcsafe, extern: "npegs$1", raises: [], tags: [].}
constructs a back reference of the given index. index starts counting from 1. Ignores case for matching.
proc backrefIgnoreStyle(index: range[1 .. MaxSubpatterns]): Peg {.nosideEffect,
    gcsafe, extern: "npegs$1", raises: [], tags: [].}
constructs a back reference of the given index. index starts counting from 1. Ignores style for matching.
proc nonterminal(n: NonTerminal): Peg {.nosideEffect, gcsafe, extern: "npegs$1",
                                    raises: [], tags: [].}
constructs a PEG that consists of the nonterminal symbol
proc newNonTerminal(name: string; line, column: int): NonTerminal {.nosideEffect,
    gcsafe, extern: "npegs$1", raises: [], tags: [].}
constructs a nonterminal symbol
proc `$`(r: Peg): string {.nosideEffect, gcsafe, extern: "npegsToString", raises: [],
                       tags: [].}
converts a PEG to its string representation
proc bounds(c: Captures; i: range[0 .. 20 - 1]): tuple[first, last: int] {.raises: [],
    tags: [].}
returns the bounds [first..last] of the i'th capture.
proc rawMatch(s: string; p: Peg; start: int; c: var Captures): int {.nosideEffect,
    gcsafe, extern: "npegs$1", raises: [], tags: [].}
low-level matching proc that implements the PEG interpreter. Use this for maximum efficiency (every other PEG operation ends up calling this proc). Returns -1 if it does not match, else the length of the match
proc matchLen(s: string; pattern: Peg; matches: var openArray[string]; start = 0): int {.
    nosideEffect, gcsafe, extern: "npegs$1Capture", raises: [], tags: [].}
the same as match, but it returns the length of the match, if there is no match, -1 is returned. Note that a match length of zero can happen. It's possible that a suffix of s remains that does not belong to the match.
proc matchLen(s: string; pattern: Peg; start = 0): int {.nosideEffect, gcsafe,
    extern: "npegs$1", raises: [], tags: [].}
the same as match, but it returns the length of the match, if there is no match, -1 is returned. Note that a match length of zero can happen. It's possible that a suffix of s remains that does not belong to the match.
proc match(s: string; pattern: Peg; matches: var openArray[string]; start = 0): bool {.
    nosideEffect, gcsafe, extern: "npegs$1Capture", raises: [], tags: [].}
returns true if s[start..] matches the pattern and the captured substrings in the array matches. If it does not match, nothing is written into matches and false is returned.
proc match(s: string; pattern: Peg; start = 0): bool {.nosideEffect, gcsafe,
    extern: "npegs$1", raises: [], tags: [].}
returns true if s matches the pattern beginning from start.
proc find(s: string; pattern: Peg; matches: var openArray[string]; start = 0): int {.
    nosideEffect, gcsafe, extern: "npegs$1Capture", raises: [], tags: [].}
returns the starting position of pattern in s and the captured substrings in the array matches. If it does not match, nothing is written into matches and -1 is returned.
proc findBounds(s: string; pattern: Peg; matches: var openArray[string]; start = 0): tuple[
    first, last: int] {.nosideEffect, gcsafe, extern: "npegs$1Capture", raises: [],
                     tags: [].}
returns the starting position and end position of pattern in s and the captured substrings in the array matches. If it does not match, nothing is written into matches and (-1,0) is returned.
proc find(s: string; pattern: Peg; start = 0): int {.nosideEffect, gcsafe,
    extern: "npegs$1", raises: [], tags: [].}
returns the starting position of pattern in s. If it does not match, -1 is returned.
proc findAll(s: string; pattern: Peg; start = 0): seq[string] {.nosideEffect, gcsafe,
    extern: "npegs$1", raises: [], tags: [].}
returns all matching substrings of s that match pattern. If it does not match, @[] is returned.
proc contains(s: string; pattern: Peg; start = 0): bool {.nosideEffect, gcsafe,
    extern: "npegs$1", raises: [], tags: [].}
same as find(s, pattern, start) >= 0
proc contains(s: string; pattern: Peg; matches: var openArray[string]; start = 0): bool {.
    nosideEffect, gcsafe, extern: "npegs$1Capture", raises: [], tags: [].}
same as find(s, pattern, matches, start) >= 0
proc startsWith(s: string; prefix: Peg; start = 0): bool {.nosideEffect, gcsafe,
    extern: "npegs$1", raises: [], tags: [].}
returns true if s starts with the pattern prefix
proc endsWith(s: string; suffix: Peg; start = 0): bool {.nosideEffect, gcsafe,
    extern: "npegs$1", raises: [], tags: [].}
returns true if s ends with the pattern suffix
proc replacef(s: string; sub: Peg; by: string): string {.nosideEffect, gcsafe,
    extern: "npegs$1", raises: [ValueError], tags: [].}
Replaces sub in s by the string by. Captures can be accessed in by with the notation $i and $# (see strutils.`%`). Examples:
"var1=key; var2=key2".replacef(peg"{\ident}'='{\ident}", "$1<-$2$2")

Results in:

"var1<-keykey; val2<-key2key2"
proc replace(s: string; sub: Peg; by = ""): string {.nosideEffect, gcsafe,
    extern: "npegs$1", raises: [], tags: [].}
Replaces sub in s by the string by. Captures cannot be accessed in by.
proc parallelReplace(s: string; subs: varargs[tuple[pattern: Peg, repl: string]]): string {.
    nosideEffect, gcsafe, extern: "npegs$1", raises: [ValueError], tags: [].}
Returns a modified copy of s with the substitutions in subs applied in parallel.
proc replace(s: string; sub: Peg;
            cb: proc (match: int; cnt: int; caps: openArray[string]): string): string {.
    gcsafe, extern: "npegs$1cb", raises: [], tags: [].}
Replaces sub in s by the resulting strings from the callback. The callback proc receives the index of the current match (starting with 0), the count of captures and an open array with the captures of each match. Examples:
proc handleMatches*(m: int, n: int, c: openArray[string]): string =
  result = ""
  if m > 0:
    result.add ", "
  result.add case n:
    of 2: c[0].toLower & ": '" & c[1] & "'"
    of 1: c[0].toLower & ": ''"
    else: ""

let s = "Var1=key1;var2=Key2;   VAR3"
echo s.replace(peg"{\ident}('='{\ident})* ';'* \s*", handleMatches)

Results in:

"var1: 'key1', var2: 'Key2', var3: ''"
proc transformFile(infile, outfile: string;
                  subs: varargs[tuple[pattern: Peg, repl: string]]) {.gcsafe,
    extern: "npegs$1", raises: [IOError, ValueError],
    tags: [ReadIOEffect, WriteIOEffect].}
reads in the file infile, performs a parallel replacement (calls parallelReplace) and writes back to outfile. Raises EIO if an error occurs. This is supposed to be used for quick scripting.
proc split(s: string; sep: Peg): seq[string] {.nosideEffect, gcsafe, extern: "npegs$1",
    raises: [], tags: [].}
Splits the string s into substrings.
proc parsePeg(pattern: string; filename = "pattern"; line = 1; col = 0): Peg {.
    raises: [ValueError, EInvalidPeg, Exception], tags: [RootEffect].}
constructs a Peg object from pattern. filename, line, col are used for error messages, but they only provide start offsets. parsePeg keeps track of line and column numbers within pattern.
proc peg(pattern: string): Peg {.raises: [ValueError, EInvalidPeg, Exception],
                             tags: [RootEffect].}
constructs a Peg object from the pattern. The short name has been chosen to encourage its use as a raw string modifier:
peg"{\ident} \s* '=' \s* {.*}"
proc escapePeg(s: string): string {.raises: [], tags: [].}
escapes s so that it is matched verbatim when used as a peg.
proc handleMatches(m: int; n: int; c: openArray[string]): string {.raises: [], tags: [].}

Iterators

iterator findAll(s: string; pattern: Peg; start = 0): string {.raises: [], tags: [].}
yields all matching substrings of s that match pattern.
iterator split(s: string; sep: Peg): string {.raises: [], tags: [].}

Splits the string s into substrings.

Substrings are separated by the PEG sep. Examples:

for word in split("00232this02939is39an22example111", peg"\d+"):
  writeLine(stdout, word)

Results in:

"this"
"is"
"an"
"example"

Templates

template letters(): Peg
expands to charset({'A'..'Z', 'a'..'z'})
template digits(): Peg
expands to charset({'0'..'9'})
template whitespace(): Peg
expands to charset({' ', '\9'..'\13'})
template identChars(): Peg
expands to charset({'a'..'z', 'A'..'Z', '0'..'9', '_'})
template identStartChars(): Peg
expands to charset({'A'..'Z', 'a'..'z', '_'})
template ident(): Peg
same as [a-zA-Z_][a-zA-z_0-9]*; standard identifier
template natural(): Peg
same as \d+
template `=~`(s: string; pattern: Peg): bool
This calls match with an implicit declared matches array that can be used in the scope of the =~ call:
if line =~ peg"\s* {\w+} \s* '=' \s* {\w+}":
  # matches a key=value pair:
  echo("Key: ", matches[0])
  echo("Value: ", matches[1])
elif line =~ peg"\s*{'#'.*}":
  # matches a comment
  # note that the implicit ``matches`` array is different from the
  # ``matches`` array of the first branch
  echo("comment: ", matches[0])
else:
  echo("syntax error")

© 2006–2017 Andreas Rumpf
Licensed under the MIT License.
https://nim-lang.org/docs/pegs.html