Rosetta's Way Back to the Source: Towards Reverse Engineering of Complex Software
The Rosetta project, funded by the EU in the form of an ERC grant, aims to develop techniques to enable reverse engineering of complex software that is available only in binary form. To the best of our knowledge we are the first to start working on a comprehensive and realistic solution for recovering the data structures in binary programs (which is essential for reverse engineering), as well as techniques to recover the code. The main success criterion for the project will be our ability to reverse engineer a realistic, complex binary. Additionally, we will show the immediate usefulness of the information that we extract from the binary code (that is, even before full reverse engineering), by automatically hardening the software to make it resilient against memory corruption bugs (and attacks that exploit them).
In the Rosetta project, we target common processors like the x86, and languages like C and C++ that are difficult to reverse engineer, and we aim for full reverse engineering rather than just decompilation (which typically leaves out data structures and semantics). However, we do not necessarily aim for fully automated reverse engineering (which may well be impossible in the general case). Rather, we aim for techniques that make the process straightforward. In short, we will push reverse engineering towards ever more complex programs.
Our methodology revolves around recovering data structures, code and semantic information iteratively. Specifically, we will recover data structures not so much by statically looking at the instructions in the binary program (as others have done), but mainly by observing how the data is used.
Research question. The project addresses the question of whether the compilation process that translates source code to binary code is irreversible for complex software. Irreversibility of compilation is an assumption that underlies most commercial software today. Specifically, the project aims to demonstrate that this assumption is false.
There has been quite some interest in the Rosetta project. For instance, an article about the project on Tweakers (Dutch) led to hundreds of reactions. Some informed, some not. Of course, it is fine to have opinions, but it is better to have informed opinions. On this page, I will try to explain the project in a bit more detail. For those who cannot wait: I answer some of the comments in the Tweakers' posts here.
Most software is available only in binary form, without any access to source code. The reasons are two-fold: either the source code has been lost, or the vendor has deliberately closed the source.
Reverse engineering is commonly used in security settings, both by hackers to discover vulnerabilities in commercial software, and by security experts to discover the workings of malicious code. In either case, reverse engineering requires a huge, mostly manual effort. For large programs, reverse engineering is sufficiently hard to make it impractical, and software vendors treat compilation as an (in practice) unbreakable code. We will refer to this view as the irreversibility assumption.
The first case, where the source code is lost, is clearly undesirable. Without the source, users cannot maintain, let alone rewrite, the software. Perhaps the software has security vulnerabilities or other bugs that would be easy to fix if only we had access to the source. Rewriting the program from scratch is often not an option: the investment required for rewriting, testing and debugging the software from scratch may be gigantic and risky. What if the production line or security control system stops working because of wrinkles in the new code?
The second case, where vendors deliberately close the source, appears to be more complex. Who could argue against vendors trying to protect their intellectual investment and competitive edge? Some think that closed source is unfair or evil (e.g., because dominant vendors may give their own other software an advantage over that of competitors, by allowing it to use undocumented features that are very difficult or impossible to use for competing products). Others think that it is the only way to maintain the pace of innovation and that the absence of it will lead to software patents (typically said with a shudder).
I will not go into this at this point. Personally, I use mainly open source products, but go ahead if you want to use closed source stuff. I do, sometimes. Skype for instance (with some misgivings). However, beyond the question of whether closed source is good or evil, there are issues:
In a security-sensitive environment, these are unacceptable properties. At the very least, it should be possible to analyse the code to determine whether the software is benign or malicious, secure or vulnerable. This is true not just for Skype, but also for less obfuscated code, such as media players, browsers, plugins, etc.
Explicitly then: the project does not aim at illegal activities, violating copyrights, or plagiarising other people's code. It does aim to make it possible to analyse code that you bought and paid for, and to fix broken software without the source code. In particular, one thing I want to do with the data structures after I have excavated them is to protect them against common attacks, like buffer overflows. The idea is simple enough (the execution is not!): if you know there is a buffer of length 128 at location X, make sure that all accesses to the buffer stay within the buffer boundaries.
Whatever the motivation, the scientific question is whether it is even possible to revert an existing, complex program to readable source code. At first sight, the task is hopeless (which is why companies like Microsoft have been able to build on the irreversibility assumption). Binary code lacks most of the features that enable programmers to understand well-structured programs in a higher-level language. Let us consider some of the concepts that are lacking at the binary level:
Let us zoom in on these difficulties a bit. To do so, we cannot avoid looking at the way modern computers and compilers operate. Nevertheless, by abstracting away the lowest-level details, the explanation should be at a sufficiently high level to be understood by a general audience, and sufficiently specific to satisfy domain experts.
Missing data structures and semantics. Arguably the hardest challenge is the lack of information about data structures and their semantics. Programmers are taught to design their data structures first and write the code second. Phrased differently, the data structures form the heart of the program. Not having the data structures makes reverse engineering an arduous, manual task. Unfortunately, after compilation the data structures themselves are completely invisible. To illustrate this, suppose a function in a program has a variable that represents an employee record. The data type used for the record consists of an array of up to 128 characters (bytes) to represent the employee's name, a 16 bit integer number to represent the year of birth, a byte to represent the month, and another byte to represent the day. Most programming languages have language constructs to define such a record. For instance, the left hand side of the figure below shows how this could be defined in the language C.
For a human reading the data structure on the left, it will be immediately clear what each of the fields means and how big it is. After compilation, however, the data structure simply results in the reservation of 132 anonymous bytes of memory (as shown on the right). No individual fields are visible. The program simply uses the first 128 bytes in this memory area to store names, the next two bytes to store the year and so on. No need to define it.
Things get much worse. For instance, compiler optimizations may change the binary beyond recognition; this is known as WYSINWYX (What You See Is Not What You eXecute). Also, variables allocated on the stack are harder to recover than explicit allocations on the heap.
Missing higher-level code and semantics. Besides the data structures, we also lack the actual code in a high-level programming language, but here at least we have the instructions at the assembly level (shown as two instructions of pseudo-assembly in the figure on the right). Although we no longer have the names of fields and variables, and compiler optimizations may well have changed the program structure considerably, good programmers are able to recognize certain blocks of assembly code and translate them to a higher-level equivalent. Doing so is still manual labour, but unlike with data structures, we know where to start.
The project has already received a huge amount of attention, even before it started. Some things need clarification:
Will all the tools be released as open source? To this question I replied that I cannot promise this yet, but that this is what I normally do. I stress that this is what I have always done. Many of my projects (Argos, Streamline) are available as open source. However, I have to be a bit cautious, as I am not sure about legal issues, licenses of software that we will use, etc. Also, the code has to be stable (and I know from experience that software, once released, typically takes a lot of time in maintenance, bug fixes, etc.). At any rate, the results of this research will be made public. As always, we intend to publish how we do this and how well things work, so anyone can reproduce the results.
Is the aim to destroy the software industry? What a strange idea. I do not aim to plagiarise commercial software, kill off any industries, or work for miscreants bent on evil. I do want to enable you to analyse and improve the software you bought or downloaded.
Compiler optimisations will make it impossible to reverse engineer. Ah yes, they make it harder. An unrolled loop may no longer look like a loop, and if the unrolled loop accesses an array, it is difficult to recognise the accesses to the array as such (and thus to recover the array). But impossible? No.
Will the RE be fully automated? No. I do not think it can be. Some higher-level semantics and naming will still be done by a human engineer. The plan is to make RE straightforward rather than close to impossible.