Parsing, Regular Expressions, DSLs

— Christopher Genovese and Alex Reinhart

Introduction #

Parsing is the process of checking that an input (typically text) conforms to a specified language and if so, deriving a representation of the syntactic form of the input within the language.

In practice: we get a string and get either a

  1. parse tree describing the syntax of the string in the language, or
  2. an error describing how the string failed to conform to the language.

Examples:

  • Programming language
  • Configuration files
  • Data files
  • Natural language*

Some questions to answer:

  1. How do we specify a language?
  2. What kind of languages can we parse?
  3. How do we build a parser from a language specification?
  4. How do we detect errors?
  5. What is a parse tree?

An abstract view of languages #

A language is a set of strings (including the empty string) with characters taken from a set \(\Sigma\), called the alphabet. We assume the alphabet is ordered.

Let \(\Sigma^*\) denote the set of all finite sequences of zero or more consecutive symbols from the alphabet \(\Sigma\). This is called the free monoid on \(\Sigma\). (What’s the monoid structure?) We use \(\varepsilon\) to denote the empty string.

Because the alphabet is ordered, we can order \(\Sigma^*\) using lexicographic ordering inherited from \(\Sigma\).

A language over \(\Sigma\) is a subset of \(\Sigma^*\).

Cantor’s diagonal argument shows that the set of languages on a nontrivial alphabet is uncountable. We can represent \(L\) as a bit string where the ith bit is 1 if the ith “word” in the maximal language \(\Sigma^*\) is also in \(L\), 0 otherwise.

Grammars #

One way to define a language is to give a finite recipe, where we start with a small object and build a string in the language by successively transforming that object via specified rules.

A grammar is such a specification for a language.

Example: we want to build strings of the form “Tom, Dick and Harry” for a set of names

  • Tom is a name, Dick is a name, Harry is a name, Sue is a name, Flo is a name, Daria is a name
  • A name is a sentence
  • A sentence followed by ‘, name’ is a sentence
  • At the end of input, replace the ending ‘, name’ with ‘and name’

Here is a grammar that specifies that language:

Name <- Tom | Dick | Harry | Sue | Flo | Daria
Sentence <- Name
          | List and Name
List <- Name
      | Name , List

Here, each section gives a recipe for constructing an entity. On the left hand sides are non-terminal symbols like Name, Sentence, List. These are symbols that have expansions in terms of other symbols.

Each non-terminal symbols’s recipe is given by a rule of the form Symbol <- <recipe1> | <recipe2> | ... | <recipen>. The <- shows the association and the | is read “or” and denotes alternatives.

Symbols like and, Tom, Sue, and , are literals that have no expansion. These are called terminal symbols. These terminal symbols need not be explicit strings in practice, but often represent tokens that categorize a piece of input (and hold data to distinguish the specific string). See below.

A formal grammar has four components:

  1. A set of terminal symbols
  2. A set of non-terminal symbols
  3. A collection of rules
  4. A single rule designated as the start or top-level rule.

Usually the start rule is specified by being first.

Chomsky Hierarchy #

  • Type 0: Recursively Enumerable

    Unrestricted rules

  • Type 1: Context Sensitive (or Monotonic)

    alpha A beta <- alpha gamma beta

  • Type 2: Context Free

    A <- gamma

  • Type 3: Regular

    A <- a and A <- a B (multiple terminals allowed here)

  • Type 4: Finite Choice

    A <- a (multiple terminals allowed here and choices thereof)

Type 0 are unsolvable even for recognition in finite time. Type 1 can parse in exponential time.

Types 2 and 3 are the most practical and are thus commonly used. Most programming languages, for instance, have context-free grammars.

Lexing and Parsing #

In practice, parsing is usually divided into two stages:

  1. Lexical analysis :: converts strings into tokens, which are data tagged with one of a finite number of categories.

  2. Parsing :: convert a sequence of tokens into a parse tree.

The procedure doing the first is called a lexer; and the procedure doing the second is called a parser.

The lexer is a simpler analysis that recognizes patterns in the input that match a certain kind of input structure, e.g., numbers, identifiers, punctuation. The patterns are usually expressed as regular expressions.

(New methods are exploring merging these two stages.)

Exercise #

Write a grammar for simple arithmetic expressions with addition, multiplication, variables, and integers.

Assume you have terminal symbols IDENT and NUMBER representing identifier and numeric tokens, respectively.

Do any surprising issues arise in parsing with your grammar?

Regular Expressions #

Some people, when confronted with a problem, think “I know, I’ll use regular expressions.” Now they have two problems.

</?p>|
<brs?/?>| (?# allow space at end)
</?b>|
</?strong>|
</?i>|
</?em>|
</?s>|
</?strike>|
</?blockquote>|
</?sub>|
</?super>|
</?h(1|2|3)>| (?# h1,h2,h3)
</?pre>|
<hrs?/?>|  (?# allow space at end)
</?code>|
</?ul>|
</?ol>|
</?li>|
</a>|
<a[^>]+>|  (?# allow attribs)
<img[^>]+/?> (?# allow attribs)

A regular expression is a pattern that specifies a set of strings (in a given alphabet) in a Type 3 (or 4) language. Put another way, a Type 3 language is one that can be described by a regular expression. It is also the languages that can be described by a particular type of deterministic automaton.

A regular expression describes a set of operations on patterns: concatenation, repetition, alternation, and more.

There are different flavors of regular expressions that support different operations and have different performance characteristics. In most programming languages, a variant of Perl Compatible Regular Expressions (PCRE) is very common.

A general description of regular expressions is given by the following grammar:

regex <- 0                # empty set
       | e                # varepsilon, empty string
       | [A]              # any symbol in A subset Sigma
       | (regex)          # grouping
       | regex regex      # concatenation
       | regex*           # zero or more repetitions
       | regex '|' regex  # logical or: alternation
       | regex '&' regex  # logical and: intersection
       | !regex           # logical complement

Many implementations drop the last two operators and add others, e.g., regexp+ for 1 or more repetitions, regex? for 0 or 1 repetitions, various kinds of assertions, various kinds of groups, and escapes.

Different implementations have different criteria for matches, backtracking, ….

See regex101.com for a practice playground.

Common constructs #

  • Metacharacters

    Some characters have special meaning *?+.|^$\[](){}

    • . matches any non-newline character
    • ^ a zero-length assertion that matches at beginning of line
    • $ a zero-length assertion that matches at end of line
  • Escaping

    You escape a metacharacter with \, e.g., \* or \. or even \\.

    Some special sequences begin with backslash

    \w, \s, \A, \Z match word characters, whitespace, and the beginning and end of the string (zero-length) respectively.

  • Alternation

    regex1 | regex2 matches either regex1 or regex2 but tries regex1 first.

  • Repetition

    Greedy repetition matches as much as possible

    • regex?: match regex 0 or 1 times
    • regex*: match regex 0 or more times
    • regex+: match regex 1 or more times
    • regex{x,y}: match regex from x to y times

    Lazy repetition matches as little as possible

    • regex??: match regex 0 or 1 times (lazy)
    • regex*?: match regex 0 or more times (lazy)
    • regex+?: match regex 1 or more times (lazy)
  • Grouping

    (regex) and (?:regex) define groups

    The first kind captures the matching string for later use; the second kind does not.

    Some implementations offer named groups:

    (?<name>:regex) captures the matching string and associates it with name

  • Character Classes

    [abc] [a-zA-Z0-9_] [-1-9] [^abc]

    metacharacters are escaped except: ^ at beginning complements the set, - must be at beginning or it means a range.

    Special unicode categories, e.g., \p{Letter}

    Posix character classes [[:ascii:]], [[:digit:]], …

  • Assertions

    \A, \Z, ^, $, (?=regex), …

Exercises: #

Construct a regex for floating point numbers.

Construct a regex for a valid R/Python variable name

Can you construct a regular expression to match a balanced parenthetical expression, say with ()’s and the letter a repeated?

Limitations #

Regular language can be described by a finite-state machine

FSM resets on repeated state; can only count mod a fixed number

Parsing Context-Free Languages #

Context Free Grammars #

Derivations (Left and Right) #

SheepNoise <- baa
            | SheepNoise baa
Statement  <- Identifier '=' Expression
Expression <- Identifier
            | Number
            | Expression Op Expression
Op <- '*' | '+'

Ambiguity, Precedence #

A grammar can be ambiguous meaning that there is a parse string that produces more than one distinct valid parse tree/derivation of the input.

It is often a good idea to re-write a grammar to eliminate ambiguities if not too hard.

Rewrite:

Statement  <- Identifier '=' Expression
Expression <- Term
            | Expression '+' Term
Term <- Primary
      | Term * Primary
Primary <- Number
         | Identifier

It is possible with most available parsing algorithms to specify precedence rules for e.g., operators that help disambiguate. Example: arithmetic operators.

Modern methods of generalized parsing can handle ambiguity when present, returning all the valid parses it can find. This is a nice feature, but eliminating ambiguity if convenient still has advantages.

Parsing methods (ex: Recursive Descent, combinators, parser generators, Earley parsers) #

Various algorithms can handle different classes of languages, with various restrictions on the grammar corresponding to guarantees on parsing performance.

Issues:

  • Ambiguity allowed?
  • Left or right recursive rules
  • Possible exponential run time?
  • Degree of backtracking
  • Type safety
  • Just a parse tree or ability to execute arbitrary code
  • Running as a separate program generating code (parser generator) or as part of a program

Basic tool: bison (derived from yacc)

Parsing libraries in many languages offer a wide variety of algorithms. Many have built-in tools for lexical analysis as well.

Exercise: A Recursive Descent Parser #

In a recursive descent parser, we encode our grammar with a procedure for each non-terminal. Each procedure is responsible for parsing its rule and returning a parse tree or related structure.

Sketch a recursive descent parser for your expression grammar. Your parser takes in a stream of tokens and returns a parse tree for your language.

Activity: Parser Combinators #

A combinator is a higher-order function that uses only function application and the results from other defined combinators to obtain a result from its arguments.

Combinators are functions that do not reference any outside state; they are typically used as simple building blocks that are combined to implement more complicated constructs.

Let’s see an example of a parser combinator.

We will consider parsing content in a given string for simplicity. First, we need a few types to describe the entities and a little infrastructure

alias ParseError = String
alias Position = Integer

type ParserState = record ParserState where
                     source : String
                     start  : Position
                     point  : Position

type ParseResult a = Success a ParserState
                   | Failure ParseError Position

type Parser a = Parser (ParserState -> ParseResult a)

-- More generally: type Parser m a = Parser (ParserState -> m (ParseResult a))
-- More generally: type Parser s m a = Parser (s -> m (a, s))

prepare : String -> Position -> ParseState
prepare input pos = ParserState {source=input, start=pos, point=pos}

-- A view of the current input at point
view : ParserState -> String

-- Advance point in state (immutably)
advance : ParserState -> Integer -> ParserState

Now a few simple combinators:


def char(c: str) -> Parser[str]:
    def this_char(state: ParserState) -> ParseResult[str]:
        input = state.view    # view(state) defined as a property
        if len(input) > 0 and c == input.substring(0, 1):
            return Success(c, state.advance(1))
        return Failure(f"Expected character {c}", state.point)
    return this_char

def lexeme(s: str) -> Parser[str]:
    n = len(s)
    def this_str(state: ParserState) -> ParseResult[str]:
        input = state.view
        if len(input) >= n and s == input.substring(0, n):
            return Success(s, state.advance(n))
        return Failure(f"Expected string {s}", state.point)
    return this_str

def whitespace(state: ParseState) -> ParseResult[str]:
    "This has type Parser str"
    m = re.match(r'^\s*', state.view)
    if m:
        matched = m.group(0)
        return Success(matched, state.advance(len(matched)))
    return Failure("Expected whitespace", state.point)

def fmap(p: Parser[A], f: Callable[[A], B]) -> Parser[B]:
    "Transforms the parse result by a function."
    def mapped(state: ParserState) -> ParseResult[B]:
        parse_A = p(state)
        match parse_A:
            case Success(resA, stateA):
                return Success(f(resA), stateA)
            case _:  # Failure
                return parse_A
    return mapped

def pure(x: A) -> Parser[A]:
    "Returns a parser that just returns x on the empty string."
    def pureP(state: ParserState) -> ParseResult[A]:
        return Success(x, state)
    return pureP

Exercises:

  1. In this style and accommodating the demands of your language, write a combinator that parses a given regular expression.

  2. Then use fmap to write a parser that parses an integer with an actual integer in the result.

  3. Write a combinator that corresponds to alternation |. It should only advance the state on a branch that successfully matches

  4. Write a combinator pipe with type signature

    pipe: Parser a -> (a -> Parser b) -> Parser b

    This combines two parsers so that the second can use the result of the first in parsing.

    Use this and your previous work to parse a pair of integers into a tuple of two integers.

    (It might help to define combinators follows and followedBy that run two parsers and ignore the result of the first and second respectively:

    follows : Parser a -> Parser b -> Parser b
    followedBy : Parser a -> Parser b -> Parser a
    

    You can do this with pipe and pure.)

  5. Write combinators some and many that run a given parser one or more and zero or more times.

  6. Create a parser that parses a list of integers in python syntax like [x, y, z]. Trailing commas are ok but not required.