Rosetta's Way Back to the Source

Towards Reverse Engineering of Complex Software

What is this project about?

The Rosetta project, funded by the EU in the form of an ERC grant, aims to develop techniques to enable reverse engineering of complex software sthat is available only in binary form. To the best of our knowledge we are the first to start working on a comprehensive and realistic solution for recovering the data structures in binary programs (which is essential for reverse engineering), as well as techniques to recover the code. The main success criterion for the project will be our ability to reverse engineer a realistic, complex binary. Additionally, we will show the immediate usefulness of the information that we extract from the binary code (that is, even before full reverse engineering), by automatically hardening the software to make it resilient against memory corruption bugs (and attacks that exploit them).

In the Rosetta project, we target common processors like the x86, and languages like C and C++ that are difficult to reverse engineer, and we aim for full reverse engineering rather than just decompilation (which typically leaves out data structures and semantics). However, we do not necessarily aim for fully automated reverse engineering (which may well be impossible in the general case). Rather, we aim for techniques that make the process straightforward. In short, we will push reverse engineering towards ever more complex programs.

Our methodology revolves around recovering data structures, code and semantic information iteratively. Specifically, we will recover data structures not so much by statically looking at the instructions in the binary program (as others have done), but mainly by observing how the data is used

Research question. The project addresses the question whether the compilation process that translates source code to binary code is irreversible for complex software. Irreversibility of compilation is an assumed property that underlies most of the commercial software today. Specifically, the project aims to demonstrate that the assumption is false.

There has been quite some interest in the Rosetta project. For instance, an article about the project on Tweakers (Dutch) led to hundreds of reactions. Some informed, some not. Of course, it is fine to have opinions, but it is better to have informed opinions. On this page, I will try to explain the project in a bit more detail. For those who cannot wait: I answer some of the comments in the Tweakers' posts here.

Why do we want it?

Most software is available only in binary form, without any access to source code. The reasons are two-fold:

Reverse engineering is commonly used in security settings, both by hackers to discover vulnerabilities in commercial software, and by security experts to discover the workings of malicious code. In either case, reverse engineering requires a huge, mostly manual effort. For large programs, reverse engineering is sufficiently hard to make it impractical, and software vendors treat compilation as an (in practice) unbreakable code. We will refer to this view as the irreversibility assumption.

If the source code is lost, the lack of access to the source is clearly undesirable. It is impossible for users to maintain, or even rewrite the software without access to source code. Perhaps the software has security vulnerabilities or other bugs that might be easy to fix if only we had access to the source. Rewriting the program from scratch is often not an option. The investment required for rewriting, testing and debugging the software from scratch may be gigantic and risky. What if the production line or security control system stops working because of wrinkles in the new code?

The second case, where vendors deliberately close the source, appears to be more complex. Who could argue against vendors trying to protect their intellectual investment and competitive edge? Some think that closed source is unfair or evil (e.g., because dominant vendors may give their own other software an advantage over that of competitors, by allowing it to use undocumented features that are very difficult or impossible to use for competing products). Others think that it is the only way to maintain the pace of innovation and that the absence of it will lead to software patents (typically said with a shudder).

I will not go into this at this point. Personally, I use mainly open source products, but go ahead if you want to use closed source stuff. I do, sometimes. Skype for instance (with some misgivings). However, beyond the question of whether closed source is good or evil, there are issues:

Explicitly then: the project does not aim at illegal activities, violating copyrights and plagiarising other people's code. It does aim to make it possible to analyse code that you bought and paid for, and to make it possible to fix broken software without the code. In particular, one thing I want to do with the data structures after I have excavated them, is protect them. Against common attacks, like buffer overflows. The idea is simple enough (the execution is not!): if you know there is a buffer of length 128 at location X, make sure that all accesses to the buffer stay within the buffer boundaries.

The Challenge

Whatever the motivation, the scientific question is whether it is even possible to revert an existing, complex program to readable source code. At first sight, the task is hopeless (which is why companies like Microsoft have been able to build on the irreversibility assumption). Binary code lacks most of the features that enable programmers to understand well-structured programs in a higher-level language. Let us consider some of the concepts that are lacking at the binary level:

A tiny bit of technical stuff (bit not much!)

Let us zoom in on these difficulties a bit. To do so, we cannot avoid looking at the way modern computers and compilers operate. Nevertheless, by abstracting away the lowest-level details, the explanation should be at a sufficiently high level to be understood by a general audience, and sufficiently specific to satisfy domain experts.

Missing data structures and semantics. Arguably the hardest challenge is the lack of information about data structures and their semantics. Programmers are taught to design their data structures first and write the code second. Phrased differently, the data structures form the heart of the program. Not having the data structures makes reverse engineering an arduous, manual task. Unfortunately, after compilation the data structures themselves are completely invisible. To illustrate this, suppose a function in a program has a variable that represents an employee record. The data type used for the record consists of an array of up to 128 characters (bytes) to represent the employeeĆ¢s name, a 16 bit integer number to represent the year of birth, a byte to represent the month, and another byte to represent the day. Most programming languages have language constructs to define such a record. For instance, the left hand side of the figure below shows how this could be defined in the language C.

For a human reading the data structure on the left, it will be immediately clear what each of the fields means and how big it is. After compilation, however, the data structure simply results in the reservation of 132 anonymous bytes of memory (as shown on the right). No individual fields are visible. The program simply uses the first 128 bytes in this memory area to store names, the next two bytes to store the year and so on. No need to define it.

Things get much worse. For instance, compiler optimizations may change the binary beyond recognition. This is known as WYSINWYX (What You See Is Not What You eXecute). Also, allocations on the stack are more difficult than explicit allocations on the heap.

Missing higher-level code and semantics. Besides the data structures, we also lack the actual code in a highlevel programming language, but here at least we have the instructions at the assembly level (shown as 2 instructions of pseudo-assembly in the figure on the right). Although we no longer have the names of fields and variables, and compiler optimizations may well have changed the program structure considerably, good programmers are able to recognize certain blocks of assembly code and translate them to a higher-level equivalent. Doing so is still manual labour, but unlike with data structures, we know where to start.

Some misconceptions

The project has received already a huge amount of attention, even before it started . Some things need clarification: