Battle of the parsers: PEG vs combinators

Written by Maxime Desbrus - 25/07/2024 - in Development, Tools, System

In this article we will compare two parsing strategies, PEG-based and combinator-based, from a developer's perspective, by parsing strace output for the SHH (Systemd Hardening Helper) Rust project.

Context

For the SHH project, we need to parse the output of the strace Linux tool, which logs system calls made by an executable.

The strace format is line-based; however, its structure is highly variable, depending on the syscall being made and its arguments. The format can contain integer literals, but also integer expressions with multiple operands, arrays, structs, named constants, raw buffers, C-style comments, and more. According to the strace manual page, "In most cases, arguments are formatted in the most C-like fashion possible."

Indeed, at a quick glance, the call part can look like perfectly valid C syntax for a function call, as in this example:
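
openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3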

But lines like the following use a syntax that departs from valid C:
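
getsockname(3, {sa_family=AF_INET, sin_port=htons(8080), sin_addr=inet_addr("127.0.0.1")}, [128 => 16]) = 0
rt_sigprocmask(SIG_SETMASK, ~[RTMIN RT_1], NULL, 8) = 0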

In the first line, => is used to represent input/output arguments (the value before and after the call), and in the second one, ~[RTMIN RT_1] is a bit field.

The initial SHH version used a very crude regex-based parser to process this data and turn it into a structured type. While it mostly worked, it had many hardcoded (wrong) assumptions, and, contrary to what was initially anticipated, was not very robust when facing the complexities of the many possible syscall inputs.

When the need to write a new parser became obvious, the first task was choosing a parsing library, each candidate coming with its own philosophy.

Parser ecosystem

The Rust parser ecosystem is already quite rich, with several popular, established crates representing diverse strategies for writing a parser.

Here we will focus on comparing Pest and Nom, two very popular crates for this task, by writing a parser that has the same behavior with each approach.

We deliberately excluded other parsing libraries from this analysis, because it is simply not possible to consider them all in a time-efficient manner. However, the two we have chosen represent quite different philosophies, and many other parsers can be assigned to one of these two families. Additionally, many of the libraries we did not consider focus on error reporting, making it more accurate or user-friendly, for example by reporting the exact position or range where parsing fails when the input is provided by the user. In our use case, we simply do not care about this class of features: strace output parsing is done purely in the background, is not user facing (the user does not even need to know strace is being used under the hood), and, more importantly, should never fail.

What follows is in no way an exhaustive comparison analyzing every aspect of these two parsing libraries. However, for our context and needs we believe it is a fair comparison, because we wrote two parsers which produce exactly the same results and pass the same automated tests. Since the same behavior has been implemented with each parsing approach, we can focus on aspects like ease of development, performance and maintainability.

It is also important to note that both parsers are by design robust in the face of invalid input: they are written in a safe language (Rust) that eliminates undefined behavior and memory safety bugs, and they handle input defensively, so they will return an error rather than crash the program in such cases. Additionally, they are covered by unit tests that were enriched with many corner cases encountered while writing the two parsers.

PEG parser: Pest

Grammar

PEG (Parsing Expression Grammar) based parsers are, unsurprisingly, centered on their grammar.

The first task when developing such a parser is to write rules describing the expected tokens, and how they are delimited and chained. In the Pest grammar, the basic primitives are the ~ operator to chain rules, and | to try alternate rules if one fails to match.

As an example we could describe a syscall struct argument as "an opening brace followed by a struct member (defined in another rule), followed by a comma and another struct member, repeated any number of times, followed by a closing brace", which would translate to the following Pest grammar:

struct = {
    "{" ~
    (
        struct_member ~
        (", " ~ struct_member)*
    )? ~
    "}"
}
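
For instance, given an appropriate struct_member rule, this rule would match strace struct arguments such as this one (an illustrative extract, not taken from the article's original trace):

{tv_sec=0, tv_nsec=123456789}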

The struct rule itself is used when defining the generic expression rule:

expression = {
    (
        macro |
        int |
        struct |
        buffer |
        set |
        array
    ) ~
    comment?
}

This way of writing a parser from high-level operations feels mostly natural, because it mimics how we reason about and mentally divide the input string when reading it. However, there is one caveat to keep in mind, especially when coming from the regular expression world familiar to many developers.

Let's take a simple example: parsing a C comment. If we were to extract such a comment with a regular expression (ignoring multiline comments and multiple comments per line for simplicity), we could simply use /\* .* \*/ as the pattern:

$ echo 'before comment /* comment */ after comment' | grep -o '/\* .* \*/'
/* comment */

Similar logic expressed as a Pest grammar rule would be:

comment = { "/* " ~ ANY* ~ " */" }

However, this would not work as intended, because Pest's parsing logic is eager (or greedy) and non-backtracking. This roughly means that rules are tested in the order they appear in the grammar, matching input bytes from left to right, with each rule consuming as many bytes as it can, before continuing either with the "next" rule if it succeeded, or the "alternate" rule if it failed. In this example, Pest would match the "/* " token, then ANY* would consume every character until the end of the line, so the final " */" token would never match, and the complete comment rule would never succeed.

A possible improved Pest rule that gives the same behavior as the regular expression would be:

comment = { "/* " ~ (!" */" ~ ANY)* ~ " */" }

This behavior can be confusing at first; however, compared to regular expressions, it simplifies reasoning about the parser's sequencing and makes writing optimal rules easier.

Writing the Pest grammar is usually done separately from any Rust code; the grammar is then processed when compiling the Rust program, and is used to generate a Rule Rust enum type. The compile-time grammar processing of course includes checking that the grammar has a valid syntax, but it also detects some problematic cases, like rules that could trigger an infinite loop by matching themselves recursively in nested rules, which is nice to be warned about early on.
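
For instance, a directly left-recursive rule like this deliberately broken one is rejected when the grammar is processed, instead of looping forever at runtime:

expression = { expression ~ "+" ~ int }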

Pest also provides a Language Server Protocol (LSP) implementation, which allows checking the grammar at edit time, auto-completing symbols, etc., all while remaining editor-agnostic. Overall this makes for a pleasant developer experience when writing a Pest grammar.

Processing pairs

Once the grammar has been written, we need to write the Rust code to make use of it. With Pest, the Rust code generated from our grammar is used to parse our input strace lines into a tree of Pair objects. Each pair contains a rule that matched a part of the input text, and all the nested rules that matched inside it.

To convert the tree produced by Pest into our target Syscall struct type, we match on rules from the root, and recurse into inner rules with a lot of code looking like this (simplified example):

impl TryFrom<Pair<'_, Rule>> for IntegerExpression {
    // ...

    fn try_from(pair: Pair<Rule>) -> Result<Self, Self::Error> {
        match pair.as_rule() {
            Rule::named_constant => {
                // parse named constant...
                let mut children = pair.into_inner();
                let name_pair = children
                    .next()
                    .unwrap();
                Ok(IntegerExpression {
                    value: IntegerExpressionValue::NamedConst(name_pair.as_str().to_owned()),
                    // ...
                })
            }
            Rule::literal_int => {
                // parse literal integer...
            }
            // ...
        }
    }
}

While this kind of code is not particularly complex to write or understand, and ultimately does the job, it showcases an important limitation of the implementation offered by Pest.

When we wrote the grammar for the named_constant rule, we defined exactly what its content is. Hence, when we then match on the Rule::named_constant enum member, we know with certainty that the inner name string is there, yet we still have to extract it manually, with error-prone boilerplate code.

A simple change in the grammar for the named_constant rule could break the above code, without this being detected at compile time.

In the above example, and in every other rule, we are essentially doing the job twice: first by defining the grammar rules and their content, and then by writing the code that puts the pieces together for each rule, parsing pairs and their strings into our target types.

Rust has a powerful type system, which allows modeling a program's state in a very expressive way. In an ideal world, the Rule type would encapsulate all the inner matching rules in each enum member, with knowledge taken from the grammar, resulting in a generated Rule type that looks like this:

enum Rule {
    named_constant(String),
    literal_int(...),
    // ...
}

That way, when processing a pair that matched the named_constant rule, we would get its inner string without writing any additional code.
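
With such a hypothetical generated type (again, this is not the current Pest API, only a sketch of what it could be), the processing code would shrink to a simple destructuring match:

// Hypothetical API: each rule carries its inner data directly
match rule {
    Rule::named_constant(name) => Ok(IntegerExpression {
        value: IntegerExpressionValue::NamedConst(name),
        // ...
    }),
    // ...
}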

This issue might seem minor; however, for our parser it had a non-negligible effect on development time. In a realistic developer workflow, the correct grammar is rarely written on the first try. Some cases are added later, for example when we realize we did not generalize a rule enough, because it might contain an integer expression (which may include arithmetic, bit shifts, etc.) instead of just an integer literal. Changing the grammar this way will typically add a nesting level to the resulting tree, and break the code that processes it in multiple places, which can only be detected at runtime, a very undesirable property for Rust programs.

Some crates have attempted to alleviate this specific problem, notably pest-ast, by creating types for each rule and binding them to the grammar using Rust macros. At first glance, this seems to be exactly what we need, filling the gap between the grammar and our target types, and removing the need to write error-prone, low-value code in between. However, when we tried it, it actually required us to write more code to bind rules to each type and their attributes. This led to very verbose code, unlike, for example, the experience of using the well-known serde library to serialize or deserialize existing types. While the path taken by pest-ast does not yet seem mature (as its developers themselves acknowledge), it shows that Pest can be improved in this area, and likely will be in the future.

Thoughts on Pest

After writing our first parser with Pest, our general impression is that it is a powerful parsing library, easy to approach and reason about.

However, we also felt that the step generating Rust code from the grammar at compile time could be pushed further, to also encode the rule hierarchy in the generated types, along with the exact target types for leaf rules.

Combinators & Nom

Principle

Nom takes a very different approach to parsing than Pest. Its basic building blocks are two kinds of functions: parsers and parser combinators.

Parsers are typically functions that you write yourself, although Nom provides a few basic ones for primitive types. They take an input to parse (usually &str for text or &[u8] for binary data), and return an IResult, a special kind of Result that, in case of success, contains the remaining input to parse and the resulting value.
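
As a minimal hand-written example (parse_sign is an illustrative name of our own, not SHH code), such a function can be as simple as:

use nom::IResult;

// A hand-written parser: on success, the IResult carries the remaining
// input along with the parsed value
fn parse_sign(i: &str) -> IResult<&str, bool> {
    match i.strip_prefix('-') {
        Some(rest) => Ok((rest, true)),
        None => Ok((i, false)),
    }
}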

Combinators are also functions that you can write yourself, although this is rarely needed given the very rich set Nom provides out of the box. They take one or several parser functions as input, and output a new parser function. They can be used to match or strip prefixes or suffixes, run parsers repeatedly until specific conditions are met, post-process parser output, etc.

This is basically all there is to know about how Nom works. Writing a complete parser mostly consists of choosing combinators and combining them judiciously to obtain the desired result for each parser.

Example

Let's break down a very simple Nom parser to extract C comments similar to the example mentioned previously:

use nom::{bytes::complete::{tag, take_until}, combinator::opt, sequence::delimited, IResult};

fn parse_comment(i: &str) -> IResult<&str, Option<&str>> {
    opt(delimited(tag("/* "), take_until(" */"), tag(" */")))(i)
}

This parser takes a &str as input, and when it succeeds, returns the content of the comment, or None. It should be noted that this parser is zero-copy: it does not perform any copy or heap allocation, it only handles references to the input string, and as such it is very efficient.

The combinators and parsers used here are:

  • opt: makes the wrapped parser optional; it returns None instead of failing if the inner parser does not match
  • delimited: parses input (second parser argument), preceded by a prefix (first parser argument) and followed by a suffix (third parser argument)
    • tag("/* "): matches the fixed string marking the comment start
    • take_until(" */"): the comment content; takes all text until the comment end string starts
    • tag(" */"): matches the fixed string marking the comment end

These chained combinators return a new parser, which is then called on the input string i.
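
For example (an illustrative check of the behavior):

let (rest, comment) = parse_comment("/* comment */ after").unwrap();
assert_eq!(comment, Some("comment"));
assert_eq!(rest, " after");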

Writing parser functions

The order in which parser functions are called is dictated by the combinators used; however, the general logic is very similar to what Pest does with grammar rules. When a parser succeeds, it consumes some of the input, then the call chain continues and the next parser is called on the rest of the input. If a parser fails, another one may be called with the same input if the alt combinator is used, similar to alternate rules with the | symbol in Pest. As with Pest, care must be taken that a parser does not consume too much input, which would risk breaking the next parser.
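
As a hedged sketch of that alternation (the parser and its name are illustrative, not SHH code):

use nom::{branch::alt, bytes::complete::tag, combinator::value, IResult};

// Try each alternative in order on the same input until one succeeds,
// mirroring Pest's `|` alternation between rules
fn parse_bool(i: &str) -> IResult<&str, bool> {
    alt((value(true, tag("true")), value(false, tag("false"))))(i)
}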

When writing a parser function, some discipline has to be adopted regarding whitespace handling. Unlike grammar-based parsers, where we can explicitly describe where spaces are expected so that they are easily ignored, with Nom we define functions to parse the data of interest, and it is their responsibility to consume whitespace that may be present before or after it. In our case, we simply decided that each parser function is responsible for consuming any whitespace that may be present after what it matches. This convention frees us from handling spaces before each input, and makes the code more concise, because consistently applying the rule guarantees that spaces will never be present at the beginning of any input coming from another parser.
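
A minimal sketch of such a convention (the lexeme helper is our illustrative name, not part of Nom or SHH):

use nom::{character::complete::space0, error::Error, sequence::terminated, IResult, Parser};

// Wrap a parser so it also consumes any whitespace after its match,
// guaranteeing the next parser never sees leading spaces
fn lexeme<'a, O>(
    parser: impl Parser<&'a str, O, Error<&'a str>>,
) -> impl FnMut(&'a str) -> IResult<&'a str, O> {
    terminated(parser, space0)
}

For example, lexeme(tag(",")) matches a comma and swallows any spaces that follow it.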

Comparing with Pest

Overall, using Nom to write our parser resulted in much less code than with Pest, and, more importantly, in more maintainable code.

First, the fact that there is no separation between the grammar and the code that handles the resulting tree makes the code much simpler to reason about and maintain. For example, adding support for parsing integers in octal notation is implemented by adding a parse_int_literal_oct function and calling it from parse_int_literal, as sketched below; no other part of the code needs to change. This separation into small functions offers a few other advantages: it makes it easy to write quick unit tests to ensure each function behaves as expected, and to trace their execution at runtime with logging.
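
As a sketch of what such an addition can look like (the actual SHH code may differ):

use nom::{bytes::complete::tag, character::complete::oct_digit1, combinator::map_res, sequence::preceded, IResult};

// Parse an octal integer literal like "0755": a leading 0, then octal digits
fn parse_int_literal_oct(i: &str) -> IResult<&str, u64> {
    map_res(preceded(tag("0"), oct_digit1), |s: &str| {
        u64::from_str_radix(s, 8)
    })(i)
}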

Secondly, each parser function can directly return the entity it is supposed to parse as a high-level type (for example IntegerExpression), unlike with Pest, where everything is a string until it is manually parsed again from the tree. Typing the data as early as possible provides another level of verification by the compiler, for example when combining parsers with each other.

The fact that using Nom requires no boilerplate code also makes it suitable for very small parsers, in addition to more advanced ones.

The existence of an explicit grammar with Pest does however provide a few advantages that are missing with Nom. First, Pest can detect possible problems and warn about them early on, like a rule that ends up calling itself indefinitely. With Nom, this kind of problem results in recursion at runtime, likely ending in a stack overflow crash.

Also, forcing the developer to describe the input in a structured way by writing the grammar as a first step is actually useful to understand special cases. Ironically, having first written the grammar for the Pest parser helped us greatly when writing the Nom-based parser, which does not need one. For this reason, we would recommend that anyone writing a complex parser first write a sketch of the expected grammar, even when using a combinator-based parser.

Performance comparison

Another interesting aspect to compare between the two parsers is performance. In the context of SHH, performance is not critical; however, since both parsers produce the same output and are equally safe regarding incorrect input, a few percent of CPU usage saved is still a nice added benefit, all other things being equal.

To compare the runtime performance, we run a complex program, Firefox, with the sole purpose of generating a large strace log with many real system calls:

$ shh run -l firefox.log firefox

This generates a 123 MB log file with 474319 lines. For our benchmark we don't need that much, and will only consider the time taken to parse the first 5000 lines. For this, we wrote a test to be run with cargo bench, which requires a Rust nightly toolchain, but has the advantage of running the measured code many times to eliminate measurement noise and perturbations due to host system scheduling. Each of the parsers we have written is conditionally compiled and enabled with a feature flag, which makes comparing them very easy.
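
As a rough sketch (simplified, and assuming a parse_line function; the real SHH benchmark differs), such a benchmark relies on the nightly-only test crate:

#![feature(test)]
extern crate test;

#[bench]
fn bench_parse_line(b: &mut test::Bencher) {
    let log = std::fs::read_to_string("firefox.log").unwrap();
    let lines: Vec<&str> = log.lines().take(5000).collect();
    b.iter(|| {
        // Parse the same 5000 lines on each iteration; cargo bench
        // reports the averaged time per iteration
        for line in &lines {
            // parse_line is the parser under test (name assumed here)
            let _ = test::black_box(parse_line(line));
        }
    });
}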

Now let's run the benchmark with the Pest parser:

$ cargo +nightly bench --no-default-features --features nightly,strace-parser-peg bench_parse_line
test strace::parser::benchs::bench_parse_line ... bench: 92,093,088.30 ns/iter (+/- 8,150,205.43)

And with the Nom parser:

$ cargo +nightly bench --no-default-features --features nightly,strace-parser-combinator bench_parse_line
test strace::parser::benchs::bench_parse_line ... bench: 29,028,041.10 ns/iter (+/- 1,199,187.95)

From this, we can see that the Nom-based parser is more than three times as fast as the Pest-based one. This gap is not really a surprise, as Pest does not advertise itself as faster than other parsers, and even shows a graph on its homepage where Nom and Serde are both faster.

Conclusion

Through this experiment, we have tested two different strategies to write a parser for our needs. Both approaches are valid and produced code that matched our requirements. However, the parser built on Nom is both faster and easier to maintain, and for these reasons we have enabled it by default in SHH.