Writing a (toy) symbolic interpreter, and solving challenges, part 1
This serie of articles will describe, throughout the summer, how such an engine is built, and showcase implementation tricks and some trade offs to be aware off. Do not worry, the interpreter will be kept as simple as possible though!
In the end, we will have a toy interpreter that can solve (at least one) real world reverse engineering challenge(s), but, more importantly, by the time you finished reading these articles, you should be sufficiently familiar with the concepts to be able to use a fully fledged symbolic execution engine for more serious endeavors.
This is a series of articles. Fast forward to part 2, part 3 or part 4 if you want!
What is symbolic execution?
Let us look at the following C function:
int compute(int a, int b) {
int x = a * 3;
int y = b - 2;
return x * y;
}
If you know the value of a and b, you can write an interpreter that runs that program, and computes the result of this function in a straightforward way. For example, running compute(3, 4) returns 18.
Symbolic execution is also about running the same program, but this time keeping some or all data as unknown (symbolic) variables. For example, running compute(x, 4) should return an expression equivalent to x * 6.
In the presence of conditional statements, several possible execution flows for a given function are possible. For example, with the following function:
int compute_branch(int x) {
int a = x * 2;
if (x > 3)
a--;
if (x & 1)
return a;
else
return x;
}
If x is symbolic, there is no way to determine in advance if the (x > 3) and (x & 1) conditions will be true or false. In that case, we have to run both cases, while remembering if the condition was true or false.
We get four execution paths:
- When x > 3 and x & 1, the result is x * 2 - 1
- When x > 3 and !(x & 1), the result is x
- When x <= 3 and x & 1, the result is x * 2
- When x <= 3 and !(x & 1), the result is x
In the general case, the number of execution paths will grow exponentially with the number of branches. Even with reasonably complex logic, too many paths might be generated for us to process.
There are many other potential problems that one will encounter, but we will think about them in due time!
What's next?
Step 1 - writing a concrete interpreter
The first step in our series is about choosing an interesting architecture, and writing a boring, concrete interpreter for it.
For this series, we will write a WebAssembly (WASM) symbolic interpreter, as it is a simple architecture, and it has unusual branching instructions, which will make for a nice change.
Step 2 - turning the concrete interpreter into a symbolic interpreter
Turning a concrete interpreter into a (toy) symbolic interpreter should require almost no code change, thanks to the good decisions we will take during step 1 (to be honest, it took me a while to find them). This part should be straightforward code-wise, but perhaps a bit more challenging to entirely understand.
Step 3 - wiring it into an SMT solver, and solving the reverse engineering challenge
An SMT solver (satisfiability modulo theories) is a piece of software that basically tries to solve a bunch of equations. You just feed it the problem, push a button, and wait for a solution. This is very convenient as you do not really need to understand anything about how this is done, and it is usually reasonably efficient. Some, like z3, can also do optimization, and they are surprisingly useful in a variety of situations.
A reverse engineering challenge is a compiled program that accepts user input, and that usually returns "you lost" or "you won" based on the input. With symbolic execution, it is possible to turn this program into a bunch of equations, and ask the SMT solver to find an input that will lead the program execution to the "you won" reply.
Things that will not be covered
Intermediate language (IL)
Intermediate languages are very simple, minimalist, languages that are used to model the effect of the actual language we want to analyze. All processing is then performed on that language.
In theory, if the intermediate language is sufficiently well designed, it is possible to use it to describe many different source languages. In our case, it means, for example, that the same intermediate language can describe several architectures (x86, x64, ARM, Java bytecode, WASM, etc.).
This seems awesome as we can then write our analysis engine once, and run it on several architectures. However, if the IL design is lacking in expressivity, describing several architecture might be unpleasant, inefficient, or flat-out impossible.
So, basically, porting a new architecture to an IL is very much like writing a concrete interpreter for it, except we will write it in the IL instead of the host programming language. This means that is it usually easier to write a concrete interpreter directly, as we can fully use the host language instead of writing some assembly-like instructions in the IL.
As we will see, it is entirely possible to transform a concrete interpreter into a symbolic interpreter at almost no cost, as long as we expose a good interface.
Advanced heuristics
Everything will be kept as simple as possible, because:
- I want the series to be concluded before the end of summer ;
- I am not very knowledgeable about the topic :)
This means that important topics for real life symbolic execution will not be covered. One such topic, symbolic memory access, is one of the thorniest problems with program analysis in general. There are approximation heuristics that can drastically reduce the complexity of such analysis. But those are out of scope of this series.
Keep your eyes peeled for the next article, about the WASM architecture and writing a concrete interpreter!