Wednesday, March 18, 2009

The art of Decompilation

Decompilation


Decompilation is the reverse process of compilation i.e. creating high level language code from machine/assembly language code. At the basic level, it just requires to understand the machine/assembly code and rewrite it into a high level language, but things are not as simple as they seem, particularly when it comes to implementing a decompiler. Throughout this discussion, we will be using the C language for the high level language, and the 8086 assembly language for the low level language.

The ethics of decompilation

Is decompilation legal, and is it allowed?

There are many situations when decompilation can be used...

  1. To recover lost source code. You may have written a program for which you only have the executable now (or you got the exe of a program you wrote long back, from someone else!). If you want to have the source for such a program, you can use decompilation to recover it. In all rights, you are the owner of the program, so nobody is going to question you.
  2. Just as stated above, applications written long back for a legacy computer may not have the source code now, and you may need to port it to a new platform. Either you have to rewrite the application from the scratch, or use decompilation to understand the working of the application and write it again.
  3. Say you have code written in some language for which you cant find a compiler today! If you have the executable, just decompile it and rewrite the logic in the language of your choice today.
  4. To discover the internals of someone else's program (like what algorithm they have used...)

Usually all software are copyrighted by the authors. This means, copying or expressing the same idea in another program is prohibited. Hence if you are using decompilation to discover the internals of a program and if that particular part is breaching the copyright of the owner, you are liable for legal action. However, there are some permitted uses of decompilation, like the first three cases stated above. Also, decompilation of parts of software which do not come under the copyright laws (e.g. algorithms) is permitted. In any case, it is better to contact your legal advisor if you are doing any serious work with decompilation.

In all practical purposes, decompiling programs which were created by you can't be questioned! After all, you are the owner of all rights to the program. But be careful if you are trying it out on someone else's programs.

Is decompilation possible?

Let's take a look at a normal C compiler. When a C program is compiled, the first stage of the compiler will generate a very rudimentary assembly language output (or nearly equivalent to it), which is nothing but a line-by-line translation of the C source code. If any optimizations are chosen to be done, the next stages perform the code optimization to replace redundant instructions and improve the overall efficiency of the output program. This output is then linked with the standard libraries for any library function calls, and saved in the executable format of the platform.

void main()
{
int i, j, k;

i = 10;
j = 20;

k = i*j + 5;
}

 ;---- i = 10;
mov [bp+2], 10
;---- j = 20;
mov [bp+4], 20
;---- k = i*j + 5;
mov ax, [bp+2]
mov bx, [bp+4]
mul bx
add ax, 5
mov [bp+6], ax
Sample C code
Possible output of a compiler
(without optimizations)

If no optimizations are performed while the code is generated, it is very easy to understand what the output code does, and an equivalent C code can be written/generated automatically. Note that we can only generate "an equivalent C code", and not the "same C code" which was compiled to get this executable. In other words, it is always impossible to get the exact source code, but we can generate an equivalent program which will function in the same way.

However, things are a bit different if any compiler optimizations are used while building the original executable. It turns out to be more difficult to understand the flow of the program now, and the more rigorous the optimizations are, the worse are our chances to figure out what the code is doing exactly.

void main()
{
int i, j, k;

i = 10;
j = 20;

k = i*j + 5;
}

 ;---- i = 10;
mov si, 10
;---- j = 20;
mov di, 20
;---- k = i*j + 5;
mov ax, si
mov bx, di
mul bx
add ax, 5
mov [bp+6], ax

 ;---- k = i*j + 5;
mov ax, 10
mov bx, 20
mul bx
add ax, 5
mov [bp+6], ax
Sample C code
Possible output of a compiler
(using register variables)

Possible output of a compiler
(using code optimization)

One more factor which joins the opposition is that each compiler generates code in its own way. There are very few points where all compilers generate similar code. This means that we need to tailor the decompilation procedure for each compiler, so that if we know which compiler was used to generate this executable, we have a better chance of understanding the code.

For example, the simple "if" statement can be compiled in many ways, like the one given below.

  if (i>20)
{
j=30;
}
else
{
j=40;
}
j++;

    cmp [bp+2], 20
jle lab1

mov [bp+4], 30
jmp lab2
lab1:
mov [bp+4], 40
lab2:
inc [bp+4]

    cmp [bp+2], 20
jg lab1
jmp lab2
lab1:
mov [bp+4], 30
jmp lab3
lab2:
mov [bp+4], 40
lab3:
inc [bp+4]
Sample "if" statement
Possible output #1
Possible output #2

Some other factors which hinder the decompilation process are

1. Self modifying code
Code written for critical applications like games, where every machine cycle counts for the performance, sometimes self-modifying code is used. For example, if the same condition is used at many places in a critical loop, it is evaluated once and the code after that is modified to branch to the same location without evaluating the condition again. But with today's processor speeds, this technique is fast becoming antique!
2. User-defined datatypes
User defined data types, like "struct"s, "typedef"s, "union"s and bit-fields add more to the confusion of the decompiler. Though it is easy to use structures while writing the code, it is almost impossible to figure out if a variable is a part of a structure or is it a basic type on its own, by looking at the compiled output.
3. Use of processor-specific instructions/optimizations

Inspite of all these reasons, it is still possible to decompile a binary executable (though not 100% automated, and not 100% accurate!). Carry on reading...

How to decompile?

A simple approach to decompile a binary executable, is to first parse it and separate it into functions (C style). Once we know where the entry point of the program ("main" for a C program), we can start decompiling that function, and any other function it calls. This way, we can focus on one function at a time carefully.

To use this approach, the first thing that should be known is the entry-point of the program. For a normal C program, it is the "main" function, and for a Win32 program, it is the "WinMain" function. But to find out where these functions begin, we must analyze the executable and figure out which compiler was used, because each compiler has its own entry/exit code added to the program which we need not decompile. If the compiler used is known, we can trace out where the "main" function is starting, and from thereon, till we get a "return" instruction, we can separate out the function.

A simple method to decompile a code snippet

Let's say we have separated out the instructions for a function. To understand how this piece of code works, we need to emulate the processor, and interpret each and every machine instruction! (though it seems a roundabout way, there is nothing called the best method to understand the working of the code, so this is not the only way. If you think this can be done in some other way, please try it out and i'd be interested to hear about it.). And as we are interpreting the code, we need to combine logical group of instructions into simple high level language statements. And voila! We've decompiled the function!

The high level language statements can be grouped as assignment statements, condition evaluation and branches, and function calls. While interpreting the machine code for a function, we have to look for instructions which fall in one of these categories, and accordingly generate the high level code.

Understanding assignment statements and expressions

Take a look at the following code snippet.

   mov ax, [bp+4]
mov bx, 20
mul bx
add ax, 4
mov [bp+4], ax
Sample expression eval & assignment.

It is not difficult to understand what it does. To put it in steps,

  1. AX is loaded with value from memory [bp+4]
  2. BX is loaded with value 20
  3. AX and BX are multiplied, and the result is in DX:AX (of which we will ignore DX for the time being)
  4. A value of 4 is added with AX
  5. The result in AX is stored in memory location [bp+4]

And we can easily write a program which will interpret this code. Let's say we have two variables wAX, wBX. The above 5 steps can be translated by our program as

  1. wAX = [bp+4]
  2. wBX = 20
  3. wAX = wAX*wBX = [bp+4]*20
  4. wAX = wAX+4 = ([bp+4]*20) + 4
  5. [bp+4] = wAX = ([bp+4]*20) + 4

And if we substitute a variable name "i" for "[bp+4]", we have arrived at the high level language statement

i = ( i*20 ) + 4;

That was pretty easy, wasn't it? But then, when do we decide that we have got a full high-level language statement? Whenever there is a instruction which stores some value to external memory, it is a full high-level language statement by itself. Just think about it. All the following high-level language statements inevitably end with a memory store operation!

i = 0;
x = i*30 + (j>>1)+(i/2)
k = (i & 2) + z;

Understanding condition evaluation and branches

A condition evaluation almost always uses a "compare" instruction. So while interpreting the machine code, if we get a "cmp" instruction, we translate it into a "if" statement, and the branch instruction that follows this compare will lead us to the block of code which gets executed if the condition is true, and if it is false.

The following code snippet illustrates this.

   mov ax, [bp+4]
cmp ax, 10
jnz lab1

mov bx, 15
mov [bp+2], bx
jmp lab2
lab1:
mov bx,20
mov [bp+2], bx
lab2:
Sample condition eval & branch.

When we are interpreting this code and come across the "cmp" instruction, the value in our variable wAX is "[bp+4]" (which is a memory variable). So we now form a condition between "[bp+4]" and "10". But what are we comparing for? That can be found using the branch instruction which comes next. Since we find a "jnz" instruction ("jnz" means "jump if not zero", or "jump if not equal"), we can conclude that this is a equality compare i.e. we are comparing for "[bp+4]!=10". The rest is pretty easy, and after replacing "[bp+4]" with a name "i", we can generate the C code for this as below.

if (i != 10) goto lab1;
j = 15;
goto lab2;
lab1:
j = 20;
lab2:

After a bit of rearrangement, and removing the "goto" statements, we can get the 'beautified' C code as

  if (i!= 10)
{
j = 20;
}
else
{
j = 15;
}

But beware! Conditions need not always be evaluated using "compare" instructions. Even arithmetic instructions can set/reset the conditions flag bits in the processor's FLAGS register, and it is these condition flag bits which are actually used to determine if a branch should be taken or not! A classic example is the C statement "i--", as shown below.

 if (i--)
j = 20;
j++;

   sub [bp+4], 1
jz lab1
mov [bp+2], 20
lab1:
add [bp+2], 1
Sample "if" statement
Branch without "compare"

So if we come across such code, we must look at the previous arithmetic instruction, and find out which variable/expression is actually used as the condition.

Interpreting function calls and return values

Function calls are a bit different. The usual "C" convention of function calling, is to push the parameters (last parameter first!) onto the stack, and then call the function. The function will read the values from the stack, and the return value is usually placed in some register (usually the "AX" register). So the trick we can use here is, whenever we find a "push" instruction (which pushes a value on the top of the stack) we make a note of it. And eventually when we reach a "call" instruction (which is actually a subroutine branch instruction), we look at our variables to see what all values/variables have been pushed to the stack, and put them in reverse order to create the high-level C language statement for the function call!

The example for this method is given below. Please note that the code on the right is obtained after replacing "[bp+4]" with "i" and "[bp+2]" with "j".

 mov ax, [bp+4];
push ax
mov ax, [bp+2];
push ax
call _func
mov [bp+4], ax

 
i = func(j, i);
Sample function call
Equivalent C source code.

No comments:

Post a Comment

LinkWithin

Related Posts with Thumbnails