Title : Next-Gen. Runtime Binary Encryption
 Author : zvrba
              Volume 0x0b, Issue 0x3f, Phile #0x0d of 0x14
|=------=[ cryptexec: Next-generation runtime binary encryption ]=-------=|
|=------=[             using on-demand function extraction      ]=-------=|
|=-----------------------------------------------------------------------=|
|=----------------=[ Zeljko Vrba <[email protected]> ]=-----------------=|
|=-----------------------------------------------------------------------=|
                                ABSTRACT
Please excuse my awkward English, it is not my native language.
    What is binary encryption and why encrypt at all? For the answer to
this question the reader is referred to the Phrack#58 [1] and article
therein titled "Runtime binary encryption".  This article describes a
method to control the target program that doesn't does not rely on
any assistance from the OS kernel or processor hardware. The method
is implemented in x86-32 GNU AS (AT&T syntax). Once the controlling
method is devised, it is relatively trivial to include on-the-fly
code decryption.
1    Introduction
2    OS- and hardware-assisted tracing
3    Userland tracing
3.1    Provided API
3.2    High-level description
3.3    Actual usage example
3.4    XDE bug
3.5    Limitations
3.6    Porting considerations
4    Further ideas
5    Related work
5.1    ELFsh
5.2    Shiva
5.3    Burneye
5.4    Conclusion
6    References
7    Credits
A    Appendix: source code
A.1    crypt_exec.S
A.2    cryptfile.c
A.3    test2.c
Note: Footnotes are marked by # and followed by the number. They are
listed at the end of each section.
--[ 1.0 - Introduction
    First let me introduce some terminology used in this article so that
the reader is not confused.
o The attributes "target", "child" and "traced" are used interchangeably
  (depending on the context) to refer to the program being under the
  control of another program.
o The attributes "controlling" and "tracing" are used interchangeably to
  refer to the program that controls the target (debugger, strace, etc.)
--[ 2.0 - OS- and hardware-assisted tracing
    Current debuggers (both under Windows and UNIX) use x86 hardware 
features for debugging. The two most commonly used features are the trace 
flag (TF) and INT3 instruction, which has a convenient 1-byte encoding of 
0xCC.
    TF resides in bit 8 of the EFLAGS register and when set to 1 the pro-
cessor generates exception 1 (debug exception) after each instruction
is executed. When INT3 is executed, the processor generates exception 3
(breakpoint).
    The traditional way to trace a program under UNIX is the ptrace(2)
syscall. The program doing the trace usually does the following
(shown in pseudocode):
fork()
child:   ptrace(PT_TRACE_ME)
           execve("the program to trace")
parent:  controls the traced program with other ptrace() calls
    Another way is to do ptrace(PT_ATTACH) on an already existing process.
Other operations that ptrace() interface offers are reading/writing target
instruction/data memory, reading/writing registers or continuing the
execution (continually or up to the next system call - this capability is
used by the well-known strace(1) program).
Each time the traced program receives a signal, the controlling program's
ptrace() function returns. When the TF is turned on, the traced program
receives a SIGTRAP after each instruction. The TF is usually not turned
on by the traced program#1, but from the ptrace(PT_STEP).
    Unlike TF, the controlling program places 0xCC opcode at strategic#2
places in the code. The first byte of the instruction is replaced with
0xCC and the controlling program stores both the address and the original
opcode. When execution comes to that address, SIGTRAP is delivered and
the controlling program regains control. Then it replaces (again using
ptrace()) 0xCC with original opcode and single-steps the original
instruction. After that the original opcode is usually again replaced
with 0xCC.
Although powerful, ptrace() has several disadvantages:
1. The traced program can be ptrace()d only by one controlling program.
2. The controlling and traced program live in separate address spaces,
   which makes changing traced memory awkward.
3. ptrace() is a system call:  it is slow  if used for full-blown tracing
   of larger chunks of code.
    I won't go deeper in the mechanics of ptrace(), there are available
tutorials [2] and the man page is pretty self-explanatory.
__
#1 Although nothing prevents it to do so - it is in the user-modifiable
   portion of EFLAGS.
#2 Usually the person doing the debugging decides what is strategic.
--[ 3.0 - Userland  tracing
    The tracing can be done solely from the user-mode:  the instructions
are executed natively, except control-transfer instructions (CALL, JMP,
Jcc, RET, LOOP, JCXZ). The background of this idea is explained
nicely in [3] on the primitive 1960's MIX computer designed by Knuth.
Features of the method I'm about to describe:
o It allows that only portions of the executable file are encrypted.
o Different portions of the executable can be encrypted with different
  keys provided there is no cross-calling between them.
o It allows encrypted code to freely call non-encrypted code. In this
  case the non-encrypted code is also executed instruction by instruction.
  When called outside of encrypted code, it still executes without
  tracing.
o There is never more than 24 bytes of encrypted code held in memory in
  plaintext.
o OS- and language-independent.
    The rest of this section explains the provided API, gives a high-level
description of the implementation, shows a usage example and discusses
Here are the details of my own implementation.
----[ 3.1 - Provided API
    No "official" header file is provided. Because of the sloppy and
convenient C parameter passing and implicit function declarations, you
can get away with no declarations whatsoever.
The decryption API consists of one typedef and one function.
typedef (*decrypt_fn_ptr)(void *key, unsigned char *dst, const unsigned
  char *src);
    This is the generic prototype that your decryption routine must fit. It
is called from the main decryption routine with the following arguments:
o key: pointer to decryption key data. Note that in most cases this is
       NOT the raw key but pointer to some kind of "decryption context".
o dst: pointer to destination buffer
o src: pointer to source buffer
    Note that there is no size argument: the block size is fixed to 8 
bytes. The routine should not read more than 8 bytes from the src and NEVER
output more than 8 bytes to dst.
    Another unusual constraint is that the decryption function MUST NOT
modify its arguments on the stack. If you need to do this, copy the stack 
arguments into local variables. This is a consequence of how the routine 
is called from within the decryption engine - see the code for details.
    There are no constraints whatsoever on the kind of encryption which can
be used. ANY bijective function which maps 8 bytes to 8 bytes is suitable.
Encrypt the code with the function, and use its inverse for the
decryption. If you use the identity function, then decryption becomes
simple single-stepping with no hardware support -- see section 4 for
related work.
The entry point to the decryption engine is the following function:
int crypt_exec(decrypt_fn_ptr dfn, const void *key, const void *lo_addr,
  const void *hi_addr, const void *F, ...);
    The decryption function has the capability to switch between executing
both encrypted and plain-text code. The encrypted code can call the 
plain-text code and plain-text code can return into the encrypted code. 
But for that to be possible, it needs to know the address bounds of the 
encrypted code.
    Note that this function is not reentrant! It is not allowed for ANY 
kind of code (either plain-text or encrypted) running under the crypt_exec
routine to call crypt_exec again. Things will break BADLY because the
internal state of previous invocation is statically allocated and will
get overwritten.
The arguments are as follows:
o dfn: Pointer to decryption function. The function is called with the
  key argument provided to crypt_exec and the addresses of destination
  and source buffers.
o key: This are usually NOT the raw key bytes, but the initialized
  decryption context. See the example code for the test2 program: first
  the user-provided raw key is loaded into the decryption context and the
  address of the _context_ is given to the crypt_exec function.
o lo_addr, hi_addr: These are low and high addresses that are encrypted
  under the same key. This is to facilitate calling non-encrypted code
  from within encrypted code.
o F: pointer to the code which should be executed under the decryption
  engine. It can be an ordinary C function pointer. Since the tracing
  routine was written with 8-byte block ciphers in mind, the F function
  must be at least 8-byte aligned and its length must be a multiple of 8.
  This is easier to achieve (even with standard C) than it sounds. See the
  example below.
o ... become arguments to the called function.
    crypt_exec arranges to function F to be called with the arguments 
provided in the varargs list. When crypt_exec returns, its return value is
what the F returned. In short, the call
  x = crypt_exec(dfn, key, lo_addr, hi_addr, F, ...);
has exactly the same semantics as
  x = F(...);
would have, were F plain-text.
    Currently, the code is tailored to use the XDE disassembler. Other
disassemblers can be used, but the code which accesses results must be
changed in few places (all references to the disbuf variable).
    The crypt_exec routine provides a private stack of 4kB. If you use your
own decryption routine and/or disassembler, take care not to consume too
much stack space. If you want to enlarge the local stack, look for the
local_stk label in the code.
__
#3 In the rest of this article I will call this interchangeably tracing
   or decryption routine. In fact, this is a tracing routine with added
   decryption.
----[ 3.2 - High-level description
    The tracing routine maintains two contexts: the traced context and
its own context. The context consists of 8 32-bit general-purpose
registers and flags. Other registers are not modified by the routine.
Both contexts are held on the private stack (that is also used for
calling C).
    The idea is to fetch, one at a time, instructions from the traced 
program and execute them natively. Intel instruction set has rather 
irregular encoding, so the XDE [5] disassembler engine is used to find 
both the real opcode and total instruction length. During experiments on 
FreeBSD (which uses LOCK- prefixed MOV instruction in its dynamic loader) 
I discovered a bug in XDE which is described and fixed below.
    We maintain our own EIP in traced_eip, round it down to the next lower
8-byte boundary and then decrypt#4 24 bytes#5 into our own buffer.  Then
the disassembly takes place and the control is  transferred to emulation
routines via the opcode control table.  All instructions, except control
transfer, are executed natively (in traced context which is restored at
appropriate time).  After single instruction execution, the control is
returned to our tracing routine.
    In order to prevent losing control, the control transfer instructions#6
are emulated. The big problem was (until I solved it) emulating indirect
JMP  and  CALL instructions (which can appear with any kind of complex EA
that i386 supports). The problem is solved by replacing the  CALL/JMP
instruction with  MOV to register opcode,  and modifying bits 3-5 (reg
field) of  modR/M byte to set the target register (this field holds the
part of opcode in the  CALL/JMP case). Then we let the processor to
calculate the EA for us.
    Of course, a means are needed to stop the encrypted execution and to
enable encrypted code to call plaintext code:
1. On entering, the tracing engine pops the return address and its
   private arguments and then pushes the return address back to the
   traced stack. At that moment:
     o The stack frame is good for executing a regular C function (F).
     o The top of stack pointer (esp) is stored into end_esp.
2. When the tracing routine encounters a RET instruction it first checks
   the traced_esp. If it equals end_esp, it is a point where the F
   function would have ended. Therefore, we restore the traced context
   and do not emulate RET, but let it execute natively. This way the
   tracing routine loses control and normal instruction execution
   continues.
    In order to allow encrypted code to call plaintext code, there are 
lo_addr and hi_addr parameters. These parameters determine the low and high
boundary of encrypted code in memory. If the traced_eip falls out of
[lo_addr, hi_addr) range, the decryption routine pointer is swapped with
the pointer to a no-op "decryption" that just copies 8 bytes from source
to destination. When the traced_eip again falls into that interval, the
pointers are again swapped.
__
#4 The decryption routine is called indirectly for reasons described
   later.
#5 The number comes from worst-case considerations: if an instruction
   begins at a boundary that is 7 (mod 8), given maximum instruction
   length of 15 bytes, yields a total of 22 bytes = 3 blocks. The buffer
   has 32 bytes in order to accommodate an additional JMP indirect
   instruction after the traced instruction. The JMP jumps indirectly to
   place in the tracing routine where execution should continue.
#6 INT instructions are not considered as control transfer. After (if)
   the OS returns from the invoked trap, the program execution continues
   sequentially, the instruction right after INT. So there are no special
   measures that should be taken.
----[ 3.3 - Actual usage example
    Given encrypted execution engine, how do we test it? For this purpose I
have written a small utility named cryptfile that encrypts a portion of
the executable file ($ is UNIX prompt):
$ gcc -c cast5.c
$ gcc cryptfile.c cast5.o -o cryptfile
$ ./cryptfile
USAGE: ./cryptfile <-e_-d> FILE KEY STARTOFF ENDOFF
KEY MUST be 32 hex digits (128 bits).
The parameters are as follows:
o -e,-d: one of these is MANDATORY and stands for encryption
  or decryption.
o FILE: the executable file to be encrypted.
o KEY: the encryption key. It must be given as 32 hex digits.
o STARTOFF, ENDOFF: the starting and ending offset in the file that should
  be encrypted. They must be a multiple of block size (8 bytes). If not,
  the file will be correctly encrypted, but the encrypted execution will
  not work correctly.
    The whole package is tested on a simple program, test2.c. This program
demonstrates that encrypted functions can call both encrypted and plaintext
functions as well as return results. It also demonstrates that the engine 
works even when calling functions in shared libraries.
Now we build the encrypted execution engine:
$ gcc -c crypt_exec.S
$ cd xde101
$ gcc -c xde.c
$ cd ..
$ ld -r cast5.o crypt_exec.o xde101/xde.o -o crypt_monitor.o
    I'm using patched XDE. The last step is to combine several relocatable
object files in a single relocatable file for easier linking with other
programs.
    Then we proceed to build the test program. We must ensure that 
functions that we want to encrypt are aligned to 8 bytes. I'm specifying 16
, just in case. Therefore:
$ gcc -falign-functions=16 -g test2.c crypt_monitor.o -o test2
    We want to encrypt functions f1 and f2. How do wemap from function 
names to offsets in the executable file? Fortunately, this can be simply 
done for ELF with the readelf utility (that's why I chose such an awkward
way - I didn't want to bother with yet another ELF 'parser').
$ readelf -s test2
Symbol table '.dynsym' contains 23 entries:
 Num:     Value   Size Type     Bind    Vis       Ndx Name
   0: 00000000      0 NOTYPE   LOCAL  DEFAULT   UND
   1: 08048484     57 FUNC     GLOBAL DEFAULT   UND printf
   2: 08050aa4      0 OBJECT   GLOBAL DEFAULT   ABS _DYNAMIC
   3: 08048494      0 FUNC     GLOBAL DEFAULT   UND memcpy
   4: 08050b98      4 OBJECT   GLOBAL DEFAULT   20 __stderrp
   5: 08048468      0 FUNC     GLOBAL DEFAULT    8 _init
   6: 08051c74      4 OBJECT   GLOBAL DEFAULT   20 environ
   7: 080484a4     52 FUNC     GLOBAL DEFAULT   UND fprintf
   8: 00000000      0 NOTYPE   WEAK   DEFAULT   UND __deregister_frame..
   9: 0804fc00      4 OBJECT   GLOBAL DEFAULT   13 __progname
  10: 080484b4    172 FUNC     GLOBAL DEFAULT   UND sscanf
  11: 08050b98      0 NOTYPE   GLOBAL DEFAULT   ABS __bss_start
  12: 080484c4      0 FUNC     GLOBAL DEFAULT   UND memset
  13: 0804ca64      0 FUNC     GLOBAL DEFAULT   11 _fini
  14: 080484d4    337 FUNC     GLOBAL DEFAULT   UND atexit
  15: 080484e4    121 FUNC     GLOBAL DEFAULT   UND scanf
  16: 08050b98      0 NOTYPE   GLOBAL DEFAULT   ABS _edata
  17: 08050b68      0 OBJECT   GLOBAL DEFAULT   ABS _GLOBAL_OFFSET_TABLE_
  18: 08051c78      0 NOTYPE   GLOBAL DEFAULT   ABS _end
  19: 080484f4    101 FUNC     GLOBAL DEFAULT   UND exit
  20: 08048504      0 FUNC     GLOBAL DEFAULT   UND strlen
  21: 00000000      0 NOTYPE   WEAK   DEFAULT   UND _Jv_RegisterClasses
  22: 00000000      0 NOTYPE   WEAK   DEFAULT   UND __register_frame_info
Symbol table '.symtab' contains 145 entries:
 Num:     Value   Size Type     Bind    Vis       Ndx Name
  0: 00000000      0 NOTYPE   LOCAL   DEFAULT   UND
  1: 080480f4      0 SECTION LOCAL   DEFAULT    1
  2: 08048110      0 SECTION LOCAL   DEFAULT    2
  3: 08048128      0 SECTION LOCAL   DEFAULT    3
  4: 080481d0      0 SECTION LOCAL   DEFAULT    4
  5: 08048340      0 SECTION LOCAL   DEFAULT    5
  6: 08048418      0 SECTION LOCAL   DEFAULT    6
  7: 08048420      0 SECTION LOCAL   DEFAULT    7
  8: 08048468      0 SECTION LOCAL   DEFAULT    8
  9: 08048474      0 SECTION LOCAL   DEFAULT    9
 10: 08048520      0 SECTION LOCAL   DEFAULT   10
 11: 0804ca64      0 SECTION LOCAL   DEFAULT   11
 12: 0804ca80      0 SECTION LOCAL   DEFAULT   12
 13: 0804fc00      0 SECTION LOCAL   DEFAULT   13
 14: 08050aa0      0 SECTION LOCAL   DEFAULT   14
 15: 08050aa4      0 SECTION LOCAL   DEFAULT   15
 16: 08050b54      0 SECTION LOCAL   DEFAULT   16
 17: 08050b5c      0 SECTION LOCAL   DEFAULT   17
 18: 08050b64      0 SECTION LOCAL   DEFAULT   18
 19: 08050b68      0 SECTION LOCAL   DEFAULT   19
 20: 08050b98      0 SECTION LOCAL   DEFAULT   20
 21: 00000000      0 SECTION LOCAL   DEFAULT   21
 22: 00000000      0 SECTION LOCAL   DEFAULT   22
 23: 00000000      0 SECTION LOCAL   DEFAULT   23
 24: 00000000      0 SECTION LOCAL   DEFAULT   24
 25: 00000000      0 SECTION LOCAL   DEFAULT   25
 26: 00000000      0 SECTION LOCAL   DEFAULT   26
 27: 00000000      0 SECTION LOCAL   DEFAULT   27
 28: 00000000      0 SECTION LOCAL   DEFAULT   28
 29: 00000000      0 SECTION LOCAL   DEFAULT   29
 30: 00000000      0 SECTION LOCAL   DEFAULT   30
 31: 00000000      0 SECTION LOCAL   DEFAULT   31
 32: 00000000      0 FILE    LOCAL   DEFAULT   ABS crtstuff.c
 33: 08050b54      0 OBJECT  LOCAL   DEFAULT   16 __CTOR_LIST__
 34: 08050b5c      0 OBJECT  LOCAL   DEFAULT   17 __DTOR_LIST__
 35: 08050aa0      0 OBJECT  LOCAL   DEFAULT   14 __EH_FRAME_BEGIN__
 36: 08050b64      0 OBJECT  LOCAL   DEFAULT   18 __JCR_LIST__
 37: 0804fc08      0 OBJECT  LOCAL   DEFAULT   13 p.0
 38: 08050b9c      1 OBJECT  LOCAL   DEFAULT   20 completed.1
 39: 080485b0      0 FUNC    LOCAL   DEFAULT   10 __do_global_dtors_aux
 40: 08050ba0     24 OBJECT  LOCAL   DEFAULT   20 object.2
 41: 08048610      0 FUNC    LOCAL   DEFAULT   10 frame_dummy
 42: 00000000      0 FILE    LOCAL   DEFAULT   ABS crtstuff.c
 43: 08050b58      0 OBJECT  LOCAL   DEFAULT   16 __CTOR_END__
 44: 08050b60      0 OBJECT  LOCAL   DEFAULT   17 __DTOR_END__
 45: 08050aa0      0 OBJECT  LOCAL   DEFAULT   14 __FRAME_END__
 46: 08050b64      0 OBJECT  LOCAL   DEFAULT   18 __JCR_END__
 47: 0804ca30      0 FUNC    LOCAL   DEFAULT   10 __do_global_ctors_aux
 48: 00000000      0 FILE    LOCAL   DEFAULT   ABS test2.c
 49: 08048660     75 FUNC    LOCAL   DEFAULT   10 f1
 50: 080486b0     58 FUNC    LOCAL   DEFAULT   10 f2
 51: 08050bb8     16 OBJECT  LOCAL   DEFAULT   20 key.0
 52: 080486f0    197 FUNC    LOCAL   DEFAULT   10 decode_hex_key
 53: 00000000      0 FILE    LOCAL   DEFAULT   ABS cast5.c
 54: 0804cba0   1024 OBJECT  LOCAL   DEFAULT   12 s1
 55: 0804cfa0   1024 OBJECT  LOCAL   DEFAULT   12 s2
 56: 0804d3a0   1024 OBJECT  LOCAL   DEFAULT   12 s3
 57: 0804d7a0   1024 OBJECT  LOCAL   DEFAULT   12 s4
 58: 0804dba0   1024 OBJECT  LOCAL   DEFAULT   12 s5
 59: 0804dfa0   1024 OBJECT  LOCALDEFAULT   12 s6
 60: 0804e3a0   1024 OBJECT  LOCAL   DEFAULT   12 s7
 61: 0804e7a0   1024 OBJECT  LOCAL   DEFAULT   12 sb8
 62: 0804a3c0   3734 FUNC    LOCAL   DEFAULT   10 key_schedule
 63: 0804b408      0 NOTYPE  LOCAL   DEFAULT   10 identity_decrypt
 64: 08051bf0      0 NOTYPE  LOCAL   DEFAULT   20 r_decrypt
 65: 08051be8      0 NOTYPE  LOCAL   DEFAULT   20 key
 66: 08050bd4      0 NOTYPE  LOCAL   DEFAULT   20 lo_addr
 67: 08050bd8      0 NOTYPE  LOCAL   DEFAULT   20 hi_addr
 68: 08050bcc      0 NOTYPE  LOCAL   DEFAULT   20 traced_eip
 69: 08050be0      0 NOTYPE  LOCAL   DEFAULT   20 end_esp
 70: 08050bd0      0 NOTYPE  LOCAL   DEFAULT   20 traced_ctr
 71: 0804b449      0 NOTYPE  LOCAL   DEFAULT   10 decryptloop
 72: 08050bc8      0 NOTYPE  LOCAL   DEFAULT   20 traced_esp
 73: 08051be4      0 NOTYPE  LOCAL   DEFAULT   20 stk_end
 74: 0804b456      0 NOTYPE  LOCAL   DEFAULT   10 decryptloop_nocontext
 75: 0804b476      0 NOTYPE  LOCAL   DEFAULT   10 .store_decrypt_ptr
 76: 08051bec      0 NOTYPE  LOCAL   DEFAULT   20 decrypt
 77: 0804fc35      0 NOTYPE  LOCAL   DEFAULT   13 insn
 78: 08051bf4      0 NOTYPE  LOCAL   DEFAULT   20 disbuf
 79: 08051be4      0 NOTYPE  LOCAL   DEFAULT   20 ilen
 80: 080501f0      0 NOTYPE  LOCAL   DEFAULT   13 continue
 81: 0804fdf0      0 NOTYPE  LOCAL   DEFAULT   13 control_table
 82: 0804fc20      0 NOTYPE  LOCAL   DEFAULT   13 _unhandled
 83: 0804fc21      0 NOTYPE  LOCAL   DEFAULT   13 _nonjump
 84: 0804fc33      0 NOTYPE  LOCAL   DEFAULT   13 .execute
 85: 0804fc55      0 NOTYPE  LOCAL   DEFAULT   13 _jcc_rel8
 86: 0804fc5e      0 NOTYPE  LOCAL   DEFAULT   13 _jcc_rel32
 87: 0804fc65      0 NOTYPE  LOCAL   DEFAULT   13 ._jcc_rel32_insn
 88: 0804fc71      0 NOTYPE  LOCAL   DEFAULT   13 ._jcc_rel32_true
 89: 0804fc6b      0 NOTYPE  LOCAL   DEFAULT   13 ._jcc_rel32_false
 90: 0804fc72      0 NOTYPE  LOCAL   DEFAULT   13 rel_offset_fixup
 91: 0804fc7d      0 NOTYPE  LOCAL   DEFAULT   13 _retn
 92: 0804fca6      0 NOTYPE  LOCAL   DEFAULT   13 ._endtrace
 93: 0804fcbe      0 NOTYPE  LOCAL   DEFAULT   13 _loopne
 94: 0804fce0      0 NOTYPE  LOCAL   DEFAULT   13 ._loop_insn
 95: 0804fcd7      0 NOTYPE  LOCAL   DEFAULT   13 ._doloop
 96: 0804fcc7      0 NOTYPE  LOCAL   DEFAULT   13 _loope
 97: 0804fcd0      0 NOTYPE  LOCAL   DEFAULT   13 _loop
 98: 0804fcec      0 NOTYPE  LOCAL   DEFAULT   13 ._loop_insn_true
 99: 0804fce2      0 NOTYPE  LOCAL   DEFAULT   13 ._loop_insn_false
100: 0804fcf6      0 NOTYPE  LOCAL   DEFAULT   13 _jcxz
101: 0804fd0a      0 NOTYPE  LOCAL   DEFAULT   13 _callrel
102: 0804fd0f      0 NOTYPE  LOCAL   DEFAULT   13 _call
103: 0804fd38      0 NOTYPE  LOCAL   DEFAULT   13 _jmp_rel8
104: 0804fd41      0 NOTYPE  LOCAL   DEFAULT   13 _jmp_rel32
105: 0804fd49      0 NOTYPE  LOCAL   DEFAULT   13 _grp5
106: 0804fda4      0 NOTYPE  LOCAL   DEFAULT   13 ._grp5_continue
107: 08050bdc      0 NOTYPE  LOCAL   DEFAULT   20 our_esp
108: 0804fdc9      0 NOTYPE  LOCAL   DEFAULT   13 ._grp5_call
109: 0804fdd0      0 NOTYPE  LOCAL   DEFAULT   13 _0xf
110: 08050be4      0 NOTYPE  LOCAL   DEFAULT   20 local_stk
111: 00000000      0 FILE    LOCAL   DEFAULT   ABS xde.c
112: 0804b419      0 NOTYPE  GLOBAL  DEFAULT   10 crypt_exec
113: 08048484     57 FUNC    GLOBAL  DEFAULT   UND printf
114: 08050aa4      0 OBJECT  GLOBAL  DEFAULT   ABS _DYNAMIC
115: 08048494      0 FUNC    GLOBAL  DEFAULT   UND memcpy
116: 0804b684   4662 FUNC    GLOBAL  DEFAULT   10 xde_disasm
117: 08050b98      4 OBJECT  GLOBAL  DEFAULT   20 __stderrp
118: 0804fc04      0 OBJECT  GLOBAL  HIDDEN    13 __dso_handle
119: 0804b504    384 FUNC    GLOBAL  DEFAULT   10 reg2xset
120: 08048468      0 FUNC    GLOBAL  DEFAULT    8 _init
121: 0804c8bc    364 FUNC    GLOBAL  DEFAULT   10 xde_asm
122: 08051c74      4 OBJECT   GLOBAL DEFAULT   20 environ
123: 080484a4     52 FUNC     GLOBAL DEFAULT   UND fprintf
124: 00000000      0 NOTYPE   WEAK   DEFAULT   UND __deregister_frame..
125: 0804fc00      4 OBJECT   GLOBAL DEFAULT   13 __progname
126: 08048520    141 FUNC     GLOBAL DEFAULT   10 _start
127: 0804b258    431 FUNC     GLOBAL DEFAULT   10 cast5_setkey
128: 080484b4    172 FUNC     GLOBAL DEFAULT   UND sscanf
129: 08050b98      0 NOTYPE   GLOBAL DEFAULT   ABS __bss_start
130: 080484c4      0 FUNC     GLOBAL DEFAULT   UND memset
131: 080487c0    318 FUNC     GLOBAL DEFAULT   10 main
132: 0804ca64      0 FUNC     GLOBAL DEFAULT   11 _fini
133: 080484d4    337 FUNC     GLOBAL DEFAULT   UND atexit
134: 080484e4    121 FUNC     GLOBAL DEFAULT   UND scanf
135: 08050200   2208 OBJECT   GLOBAL DEFAULT   13 xde_table
136: 08050b98      0 NOTYPE   GLOBAL DEFAULT   ABS _edata
137: 08050b68      0 OBJECT   GLOBAL DEFAULT   ABS _GLOBAL_OFFSET_TABLE_
138: 08051c78      0 NOTYPE   GLOBAL DEFAULT   ABS _end
139: 08049660   3421 FUNC     GLOBAL DEFAULT   10 cast5_decrypt
140: 080484f4    101 FUNC     GLOBAL DEFAULT   UND exit
141: 08048900   3421 FUNC     GLOBAL DEFAULT   10 cast5_encrypt
142: 08048504      0 FUNC     GLOBAL DEFAULT   UND strlen
143: 00000000      0 NOTYPE   WEAK   DEFAULT   UND _Jv_RegisterClasses
144: 00000000      0 NOTYPE   WEAK   DEFAULT   UND __register_frame_info
    We see that function f1 has address 0x8048660 and size 75 = 0x4B. 
Function f2 has address 0x80486B0 and size 58 = 3A. Simple calculation 
shows that they are in fact consecutive in memory so we don't have to 
encrypt them separately but in a single block ranging from 0x8048660 to 
0x80486F0.
$ readelf -l test2
Elf file type is EXEC (Executable file)
Entry point 0x8048520
There are 6 program headers, starting at offset 52
Program Headers:
Type            Offset    VirtAddr    PhysAddr    FileSiz MemSiz
Flg Align
PHDR           0x000034 0x08048034 0x08048034 0x000c0 0x000c0 R E 0x4
INTERP         0x0000f4 0x080480f4 0x080480f4 0x00019 0x00019 R   0x1
    [Requesting program interpreter: /usr/libexec/ld-elf.so.1]
LOAD           0x000000 0x08048000 0x08048000 0x06bed 0x06bed R E 0x1000
LOAD           0x006c00 0x0804fc00 0x0804fc00 0x00f98 0x02078 RW  0x1000
DYNAMIC        0x007aa4 0x08050aa4 0x08050aa4 0x000b0 0x000b0 RW  0x4
NOTE           0x000110 0x08048110 0x08048110 0x00018 0x00018 R   0x4
 Section to Segment mapping:
Segment Sections...
 00
 01     .interp
 02     .interp .note.ABI-tag .hash .dynsym .dynstr .rel.dyn .rel.plt
         .init .plt .text .fini .rodata
 03     .data .eh_frame .dynamic .ctors .dtors .jcr .got .bss
 04     .dynamic
 05     .note.ABI-tag
>From this we see that both addresses (0x8048660 and 0x80486F0) fall into
the first LOAD segment which is loaded at VirtAddr 0x804800 and is placed
at offset 0 in the file. Therefore, to map virtual address to file offset
we simply subtract 0x8048000 from each address giving 0x660 = 1632 and
0x6F0 = 1776.
    If you obtain ELFsh [7] then you can make your life much easier. The
following transcript shows how ELFsh can be used to obtain the same
information:
$ elfsh
       Welcome to The ELF shell 0.51b3 .::.
       .::. This software is under the General Public License
       .::. Please visit http://www.gnu.org to know about Free Software
[ELFsh-0.51b3]$ load test2
 [*] New object test2 loaded on Mon Jun 13 20:45:33 2005
[ELFsh-0.51b3]$ sym f1
 [SYMBOL TABLE]
 [Object test2]
 [059] 0x8048680   FUNCTION f1
 size:0000000075 foffset:001632 scope:Local   sctndx:10 => .text + 304
[ELFsh-0.51b3]$ sym f2
 [SYMBOL TABLE]
 [Object test2]
 [060] 0x80486d0   FUNCTION f2
 size:0000000058 foffset:001776 scope:Local   sctndx:10 => .text + 384
[ELFsh-0.51b3]$ exit
 [*] Unloading object 1 (test2) *
       Good bye ! .::. The ELF shell 0.51b3
    The field foffset gives the symbol offset within the executable, while
size is its size. Here all the numbers are decimal.
    Now we are ready to encrypt a part of the executable with a very
'imaginative' password and then test the program:
$ echo -n "password" | openssl md5
5f4dcc3b5aa765d61d8327deb882cf99
$ ./cryptfile -e test2 5f4dcc3b5aa765d61d8327deb882cf99 1632 1776
$ chmod +x test2.crypt
$ ./test2.crypt
    At the prompt enter the same hex string and then enter numbers 12 and 
34 for a and b. The result must be 1662, and esp before and after must be
the same.
    Once you are sure that the program works correctly, you can strip(1)
symbols from it.
----[ 3.4 - XDE bug
    During the development,  a I have found a bug in the XDE disassembler
engine: it didn't correctly handle the LOCK (0xF0) prefix. Because of the
bug XDE claimed that 0xF0 is a single-byte instruction. This is the
needed patch to correct the disassembler:
--- xde.c       Sun Apr 11 02:52:30 2004
+++ xde_new.c   Mon Aug 23 08:49:00 2004
@@ -101,6 +101,8 @@
   if (c == 0xF0)
   {
     if (diza->p_lock != 0) flag |= C_BAD;      /* twice */
+       diza->p_lock = c;
+       continue;
   }
   break;
    I also needed to remove __cdecl on functions, a 'feature' of Win32 C
compilers not needed on UNIX platforms.
----[ 3.5 - Limitations
o XDE engine (probably) can't handle new instructions (SSE, MMX, etc.).
  For certain it can't handle 3dNow! because they begin with 0x0F 0x0F,
  a byte sequence for which the XDE claims is an invalid instruction
  encoding.
o The tracer shares the same memory with the traced program. If the traced
  program is so badly broken that it writes to (random) memory it doesn't
  own, it can stumble upon and overwrite portions of the tracing routine.
o Each form of tracing has its own speed impacts. I didn't measure how
  much this method slows down program execution (especially compared to
  ptrace()).
o Doesn't handle even all 386 instructions (most notably far calls/jumps
  and RET imm16).  In this case the tracer stops with HLT which should
  cause GPF under any OS that runs user processes in rings other than 0.
o The block size of 8 bytes is hardcoded in many places in the program.
  The source (both C and ASM) should be parametrized by some kind of
  BLOCKSIZE #define.
o The tracing routine is not reentrant! Meaning, any code being executed
  by crypt_exec can't call again crypt_exec because it will overwrite its
  own context!
o The code itself isn't optimal:
    - identity_decrypt could use 4-byte moves.
    - More registers could be used to minimize memory references.
----[ 3.6 - Porting considerations
    This is as heavy as it gets - there isn't a single piece of machine-
independent code in the main routine that could be used on an another
processor architecture. I believe that porting shouldn't be too difficult,
mostly rewriting the mechanics of the current program. Some points to
watch out for include:
o Be sure to handle all control flow instructions.
o Move instructions could affect processor flags.
o Write a disassembly routine. Most RISC architectures have regular
  instruction set and should be far easier to disassemble than x86 code.
o This is self-modifying code: flushing the instruction prefetch queue
  might be needed.
o Handle delayed jumps and loads if the architecture provides them. This
  could be tricky.
o You might need to get around page protections before calling the
  decryptor (non-executable data segments).
    Due to unavailability of non-x86 hardware I wasn't able to implement 
the decryptor on another processor.
--[ 4 - Further ideas
o Better  encryption  scheme.   ECB  mode  is  bad,  especially  with
  small block size of 8 bytes. Possible alternative is the following:
    1.  Round the traced_eip down to a multiple of 8 bytes.
    2.  Encrypt the result with the key.
    3.  Xor the result with the instruction bytes.
  That way the encryption depends on the location in memory. Decryption
  works the same way.  However, it would complicate cryptfile.c program.
o Encrypted data. Devise a transparent (for the C programmer) way to
  access the encrypted data. At least two approaches come to mind:
  1)  playing  with  page  mappings  and  handling  read/write  faults,
  or 2) use XDE to decode all accesses to memory and perform encryption
  or decryption, depending on the type of access (read or write). The
  first approach seems too slow (many context switches per data read)
  to be practical.
o New instruction sets and architectures. Expand XDE to handle new x86
  instructions. Port the routine to architectures other than i386 (first
  comes to mind AMD64, then ARM, SPARC...).
o Perform decryption on the smart card. This is slow, but there is no
  danger of key compromise.
o Polymorphic decryption engine.
----[ 5 - Related Work
This section gives a brief overview of existing work, either because of
similarity in coding techniques (ELFsh and tracing without ptrace) or
because of the code protection aspect.
5.1 ELFsh
---------
The ELFsh crew's article on elfsh and e2dbg [7], also in this Phrack
issue.  A common point in our work is the approach to program tracing
without using ptrace(2). Their latest work is a scriptable embedded ELF
debugger, e2dbg. They are also getting around PaX protections, an issue I
didn't even take into account.
5.2 Shiva
---------
The Shiva binary encryptor [8], released in binary-only form. It tries
really hard to prevent reverse engineering by including features such as
trap flag detection, ptrace() defense, demand-mapped blocks (so that
fully decrpyted image can't be dumped via /proc), using int3 to emulate
some instructions, and by encryption in layers. The 2nd, password
protected layer, is optional and encrypted using 128-bit AES. Layer 3
encryption uses TEA, the tiny encryption algorithm.
According to the analysis in [9], "for sufficiently large programs, no
more than 1/3 of the program will be decrypted at any given time". This
is MUCH larger amount of decrypted program text than in my case: 24
bytes, independent of any external factors. Also, Shiva is heavily
tied to the ELF format, while my method is not tied to any operating
system or executable format (although the current code IS limited to
the 32-bit x86 architecture).
5.3 Burneye
-----------
There are actually two tools released by team-teso: burneye and burneye2
(objobf) [10].
Burneye is a powerful binary encryption tool. Similarly to Shiva, it has
three layers: 1) obfuscation, 2) password-based encryption using RC4 and
SHA1 (for generating the key from passphrase), and 3) the fingerprinting
layer.
The fingerprinting layer is the most interesting one: the data about the
target system is collected (e.g. amount of memory, etc..) and made into
a 'fingeprint'. The executable is encrypted taking the fingerprint into
account so that the resulting binary can be run only on the host with the
given fingerprint. There are two fingerprinting options:
o Fingeprint tolerance can be specified so that Small deviations are
  allowed. That way, for example, the memory can be upgraded on the
  target system and the executable will still work. If the number of
  differences in the fingeprint is too large, the program won't work.
o Seal: the program produced with this option will run on any system.
  However, the first time it is run, it creats a fingerprint of the
  host and 'seals' itself to that host. The original seal binary is
  securely deleted afterwards.
The encrypted binary can also be made to delete itself when a certain
environment variable is set during the program execution.
objobf is just relocatable object obfuscator. There is no encryption
layer. The input is an ordinary relocatable object and the output is
transformed, obfuscated, and functionally equivalent code. Code
transformations include: inserting junk instructions, randomizing the
order of basic blocks, and splitting basic blocks at random points.
5.4 Conclusion
--------------
Highlights of the distinguishing features of the code encryption
technique presented here:
o Very small amount of plaintext code in memory at any time - only 24
  bytes. Other tools leave much more plain-text code in memory.
o No special loaders or executable format manipulations are needed. There
  is one simple utility that encrypts the existing code in-place. It is
  executable format-independent since its arguments are function offsets
  within the executable (which map to function addresses in runtime).
o The code is tied to the 32-bit x86 architecture, however it should be
  portable without changes to any operating system running on x86-32.
  Special arrangements for setting up page protections may be necessary
  if PaX or NX is in effect.
On the downside, the current version of the engine is very vulnerable
with respect to reverse-engineering. It can be easily recognized by
scanning for fixed sequences of instructions (the decryption routine).
Once the decryptor is located, it is easy to monitor a few fixed memory
addresses to obtain both the EIP and the original instruction residing at
that EIP. The key material data is easy to obtain, but this is the case
in ANY approach using in-memory keys.
However, the decryptor in its current form has one advantage: since it is
ordinary code that does no special tricks, it should be easy to combine
it with a tool that is more resilient to reverse-engineering, like Shiva
or Burneye.
----[ 6 - References
1.  Phrack magazine.
    https://phrack.org
2.  ptrace tutorials:
    http://linuxgazette.net/issue81/sandeep.html
    http://linuxgazette.net/issue83/sandeep.html
    http://linuxgazette.net/issue85/sandeep.html
3.  D. E. Knuth:  The Art of Computer Programming, vol.1:  Fundamental
    Algorithms.
4.  Fenris.
    http://lcamtuf.coredump.cx/fenris/whatis.shtml
5.  XDE.
    http://z0mbie.host.sk
6.  Source code for described programs. The source I have written is
    released under MIT license.  Other files have different licenses. The
    archive also contains a patched version of XDE.
    http://www.core-dump.com.hr/software/cryptexec.tar.gz
7.  ELFsh, the ELF shell. A powerful program for manipulating ELF files.
    http://elfsh.devhell.org
8.  Shiva binary encryptor.
    http://www.securereality.com.au
9.  Reverse Engineering Shiva.
    http://blackhat.com/presentations/bh-federal-03/bh-federal-03-eagle/
     bh-fed-03-eagle.pdf
10. Burneye and Burneye2 (objobf).
    http://packetstormsecurity.org/groups/teso/indexsize.html
----[ 7 - Credits
Thanks go to mayhem who has reviewed this article. His suggestions were
very helpful, making the text much more mature than the original.
--[ A - Appendix: Source code
    Here I'm providing only my own source code. The complete source package
can be obtained from [6]. It includes:
o All source listed here,
o the patched XDE disassembler, and
o the source of the CAST5 cryptographic algorithm.
----[ A.1 - The tracer source: crypt_exec.S
/*
Copyright (c) 2004 Zeljko Vrba
Permission is hereby granted, free of charge, to any person obtaining
a copy of this software and associated documentation files (the
"Software"), to deal in the Software without restriction, including
without limitation the rights to use, copy, modify, merge, publish,
distribute, sublicense, and/or sell copies of the Software, and to permit
persons to whom the Software is furnished to do so, subject to the
following conditions:
The above copyright notice and this permission notice shall be included
in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY
CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT
OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR
THE USE OR OTHER DEALINGS IN THE SOFTWARE.
*/
.text
/************************************************************************
 * void *crypt_exec(
 *   decrypt_fn_ptr dfn, const void *key,
 *   const void *lo_addr, const void *hi_addr,
 *   const void *addr, ...)
 * typedef (*decrypt_fn_ptr)(
 *   void *key, unsigned char *dst, const unsigned char *src);
 *
 * - dfn is pointer to deccryption function
 * - key is pointer to crypt routine key data
 * - addr is the addres where execution should begin. due to the way the
 *   code is decrypted and executed, it MUST be aligned to 8 (BLOCKSIZE)
 *   bytes!!
 * - the rest are arguments to called function
 *
 * The crypt_exec stops when the stack pointer becomes equal to what it
 * was on entry, and executing 'ret' would cause the called function to
 * exit. This works assuming normal C compiled code.
 *
 * Returns the value the function would normally return.
 *
 * This code calls:
 * int xde_disasm(unsigned char *ip, struct xde_instr *outbuf);
 * XDE disassembler engine is compiled and used with PACKED structure!
 *
 * It is assumed that the encryption algorithm uses 64-bit block size.
 * Very good protection could be done if decryption is executed on the
 * SMART CARD.
 *
 * Some terminology:
 * 'Traced' refers to the original program being executed instruction by
 * instruction. The technique used resembles Knuth's tracing routine (and
 * indeed, we get true tracing when decryption is dropped).
 *
 * 'Our' refers to our data stack, etc.
 *
 * TODOs and limitations:
 * - some instructions are not emulated (FAR CALL/JMP/RET, RET NEAR imm16)
 * - LOOP* and JCXZ opcodes haven't been tested
 * - _jcc_rel32 has been tested only indirectly by _jcc_rel8
 ***********************************************************************/
/*
  Offsets into xde_instr struct.
*/
#define OPCODE  23
#define OPCODE2 24
#define MODRM   25
/*
 Set up our stack and save traced context. The context is saved at the end
 of our stack.
*/
#define SAVE_TRACED_CONTEXT \
    movl %esp, traced_esp   ;\
    movl $stk_end, %esp     ;\
    pusha                   ;\
    pushf
/*
 Restore traced context from the current top of stack. After that restores
 traced stack pointer.
*/
#define RESTORE_TRACED_CONTEXT \
    popf                    ;\
    popa                    ;\
    movl traced_esp, %esp
/*
 Identity decryption routine. This just copies 8 bytes (BLOCKSIZE) from
 source to destination. Has normal C calling convention. Is not global.
*/
identity_decrypt:
    movl 8(%esp), %edi              /* destination address */
    movl 12(%esp), %esi             /* source address */
    movl $8, %ecx                   /* 8 bytes */
    cld
    rep movsb
    ret
crypt_exec:
.globl crypt_exec
.extern disasm
    /*
     Fetch all arguments. We are called from C and not expected to save
     registers. This is the stack on entry:
     [ ret_addr dfn key lo_addr hi_addr addr ...args ]
    */
    popl %eax                       /* return address */
    popl r_decrypt                  /* real decryption function pointer */
    popl key                        /* encryption key */
    popl lo_addr                    /* low traced eip */
    popl hi_addr                    /* high traced eip */
    popl traced_eip                 /* eip to start tracing */
    pushl %eax                      /* put return addr to stack again */
    /*
     now the stack frame resembles as if inner function (starting at
     traced_eip) were called by normal C calling convention (after return
     address, the vararg arguments folow)
    */
    movl %esp, end_esp              /* this is used to stop tracing. */
    movl $0, traced_ctr             /* reset counter of insns to 0 */
decryptloop:
    /*
     This loop traces a single instruction.
     The CONTEXT at the start of each iteration:
     traced_eip: points to the next instruction in traced program
     First what we ever do is switch to our own stack and store the traced
     program's registers including eflags.
     Instructions are encrypted in ECB mode in blocks of 8 bytes.
     Therefore, we always must start decryption at the lower 8-byte
     boundary. The total of three blocks (24) bytes are decrypted for one
     instruction. This is due to alignment and maximum instruction length
     constraints: if the instruction begins at addres that is congruent
     to 7 mod 8 + 16 bytes maximum length (given some slack) gives
     instruction span of three blocks.
     Yeah, I know ECB sucks, but this is currently just a proof-of
     concept.  Design something better for yourself if you need it.
    */
    SAVE_TRACED_CONTEXT
decryptloop_nocontext:
    /*
     This loop entry point does not save traced context. It is used from
     control transfer instruction emulation where we doall work ourselves
     and don't use traced context.
     The CONTEXT upon entry is the same as for decryptloop.
     First decide whether to decrypt or just trace the plaintext code.
    */
    movl traced_eip, %eax
    movl $identity_decrypt, %ebx    /* assume no decryption */
    cmpl lo_addr, %eax
    jb .store_decrypt_ptr           /* traced_eip < lo_addr */
    cmpl hi_addr, %eax
    ja .store_decrypt_ptr           /* traced_eip > hi_addr */
    movl r_decrypt, %ebx            /* in bounds, do decryption */
.store_decrypt_ptr:
    movl %ebx, decrypt
    /*
     Decrypt three blocks starting at eax, reusing arguments on the stack
     for the total of 3 calls. WARNING! For this to work properly, the
     decryption function MUST NOT modify its arguments!
    */
    andl $-8, %eax                  /* round down traced_eip to 8 bytes */
    pushl %eax                      /* src buffer */
    pushl $insn                     /* dst buffer */
    pushl key                       /* key data pointer */
    call *decrypt                   /* 1st block */
    addl $8, 4(%esp)                /* advance dst */
    addl $8, 8(%esp)                /* advance src */
    call *decrypt                   /* 2nd block */
    addl $8, 4(%esp)                /* advance dst */
    addl $8, 8(%esp)                /* advance src */
    call *decrypt                   /* 3rd block */
    addl $12, %esp                  /* clear args from stack */
    /*
     Obtain the real start of instruction in the decrypted buffer. The
     traced eip is taken modulo blocksize (8) and added to the start
     address of decrypted buffer. Then XDE is called (standard C calling
     convention) to get necessary information about the instruction.
    */
    movl traced_eip, %eax
    andl $7, %eax                   /* traced_eip mod 8 */
    addl $insn, %eax                /* offset within decrypted buffer */
    pushl $disbuf                   /* address to disassemble into */
    pushl %eax                      /* insn offset to disassemble */
    call xde_disasm                 /* disassemble and return len */
    movl %eax, ilen                 /* store instruction length */
    popl %eax                       /* decrypted insn start */
    popl %ebx                       /* clear remaining arg from stack */
    /*
     Calculate the offset in control table of the instruction handling
     routine. Non-control transfer instructions are just executed in
     traced context, other instructions are emulated.
     Before executing the instruction, the traced eip is advanced by
     instruction length, and the number of executed instructions is
     incremented. We also append indirect 'jmp *continue' after the
     instruction, to continue execution at appropriate place in our
     tracing.  The JMP indirect opcodes are 0xFF 0x25.
    */
    movl ilen, %ebx
    addl %ebx, traced_eip           /* advance traced eip */
    incl traced_ctr                 /* increment counter */
    movw $0x25FF, (%eax, %ebx)      /* JMP indirect; little-endian! */
    movl $continue, 2(%eax, %ebx)   /* store address */
    movzbl OPCODE+disbuf, %esi      /* load instruction byte */
    jmp *control_table(,%esi,4)     /* execute by appropirate handler */
.data
    /*
     Emulation routines start here. They are in data segment because code
     segment isn't writable and we are modifying our own code. We don't
     want yet to mess around with mprotect(). One day (non-exec page table
     support on x86-64) it will have to be done anyway..
     The CONTEXT upon entry on each emulation routine:
     eax        : start of decrypted (CURRENT) insn addr to execute
     ilen       : instruction length in bytes
     stack top -> [traced: eflags edi esi ebp esp ebx edx ecx eax]
     traced_esp : original program's esp
     traced_eip : eip of next insn to execute (NOT of CURRENT insn!)
    */
_unhandled:
    /*
     Unhandled opcodes not normally generated by compiler. Once proper
     emulation routine is written, they become handled :)
     Executing privileged instruction, such as HLT, is the easiest way to
     terminate the program. %eax holds the address of the instruction we
     were trying to trace so it can be observed from debugger.
    */
     hlt
_nonjump:
    /*
     Common emulation for all non-control transfer instructions.
     Instruction buffer (insn) is already filled with decrypted blocks.
     Decrypted instruction can begin in the middle of insn buffer, so the
     relative jmp instruction is adjusted to jump to the traced insn,
     skipping 'junk' at the beginning of insn.
     When the instruction is executed, our execution continues at location
     where 'continue' points to. Normally, this is decryptloop, but
     occasionaly it is temporarily changed (e.g. in _grp5).
    */
    subl $insn, %eax                /* insn begin within insn buffer */
    movb %al, .execute+1            /* update jmp instruction */
    RESTORE_TRACED_CONTEXT
.execute:
    jmp insn                        /* relative, only offset adjusted */
insn:
    .fill 32, 1, 0x90
_jcc_rel8:
    /*
     Relative 8-bit displacement conditional jump. It is handled by
     relative 32-bit displacement jump, once offset is adjusted. Opcode
     must also be adjusted: short jumps are 0x70-0x7F, long jumps are 0x0F
     0x80-0x8F. (conditions correspond directly). Converting short to long
     jump needs adding 0x10 to 2nd opcode.
    */
    movsbl 1(%eax), %ebx            /* load sign-extended offset */
    movb (%eax), %cl                /* load instruction */
    addb $0x10, %cl                 /* adjust opcode to long form */
    /* drop processing to _jcc_rel32 as 32-bit displacement */
_jcc_rel32:
    /*
     Emulate 32-bit conditional relative jump. We pop the traced flags,
     let the Jcc instruction execute natively, and then adjust traced eip
     ourselves, depending whether Jcc was taken or not.
     CONTEXT:
     ebx: jump offset, sign-extended to 32 bits
     cl : real 2nd opcode of the instruction (1st is 0x0F escape)
    */
    movb %cl, ._jcc_rel32_insn+1    /* store opcode to instruction */
    popf                            /* restore traced flags */
._jcc_rel32_insn:
    /*
     Explicit coding of 32-bit relative conditional jump. It is executed
     with the traced flags. Also the jump offset (32 bit) is supplied.
    */
    .byte 0x0F, 0x80
    .long ._jcc_rel32_true - ._jcc_rel32_false
._jcc_rel32_false:
    /*
     The Jcc condition was false. Just save traced flags and continue to
     next instruction.
    */
    pushf
    jmp decryptloop_nocontext
._jcc_rel32_true:
    /*
     The Jcc condition was true. Traced flags are saved, and then the
     execution falls through to the common eip offset-adjusting routine.
    */
    pushf
rel_offset_fixup:
    /*
     Common entry point to fix up traced eip for relative control-flow
     instructions.
     CONTEXT:
     traced_eip: already advanced to the would-be next instruction. this
                 is done in decrypt_loop before transferring control to
                 any insn-handler.
     ebx       : sign-extended 32-bit offset to add to eip
    */
    addl %ebx, traced_eip
    jmp decryptloop_nocontext
_retn:
    /*
     Near return (without imm16). This is the place where the end-of
     trace condition is checked. If, at this point, esp equals end_esp,
     this means that the crypt_exec would return to its caller.
    */
    movl traced_esp, %ebp           /* compare curr traced esp to esp */
    cmpl %ebp, end_esp              /* when crypt_exec caller's return */
    je ._endtrace                   /* address was on top of the stack */
    /*
     Not equal, emulate ret.
    */
    movl %esp, %ebp                 /* save our current stack */
    movl traced_esp, %esp           /* get traced stack */
    popl traced_eip                 /* pop return address */
    movl %esp, traced_esp           /* write back traced stack */
    movl %ebp, %esp                 /* restore our current stack */
    jmp decryptloop_nocontext
._endtrace:
    /*
     Here the traced context is completely restored and RET is executed
     natively. Our tracing routine is no longer in control after RET.
     Regarding C calling convention, the caller of crypt_exec will get
     the return value of traced function.
     One detail we must watch for: the stack now looks like this:
     stack top -> [ ret_addr ...args ]
     but we have been called like this:
     stack top -> [ ret_addr dfn key lo_addr hi_addr addr ...args ]
     and this is what compiler expects when popping arg list. So we must
     fix the stack. The stack pointer can be just adjusted by -20 instead
     of reconstructing the previous state because C functions are free to
     modify their arguments.
     CONTEXT:
     ebp: current traced esp
    */
    movl (%ebp), %ebx               /* return address */
    subl $20, %ebp                  /* fake 5 extra args */
    movl %ebx, (%ebp)               /* put ret addr on top of stack */
    movl %ebp, traced_esp           /* store adjusted stack */
    RESTORE_TRACED_CONTEXT
    ret                             /* return without regaining control */
    /*
     LOOPNE, LOOPE and LOOP instructions are executed from the common
     handler (_doloop). Only the instruction opcode is written from
     separate handlers.
     28 is the offset of traced ecx register that is saved on our stack.
    */
_loopne:
    movb $0xE0, ._loop_insn         /* loopne opcode */
    jmp ._doloop
_loope:
    movb $0xE1, ._loop_insn         /* loope opcode */
    jmp ._doloop
_loop:
    movb $0xE2, ._loop_insn         /* loop opcode */
._doloop:
    /*
     * Get traced context that is relevant for LOOP* execution: signed
     * offset, traced ecx and traced flags.
     */
    movsbl 1(%eax), %ebx
    movl 28(%esp), %ecx
    popf
._loop_insn:
    /*
     Explicit coding of loop instruction and offset.
    */
    .byte 0xE0                      /* LOOP* opcodes: E0, E1, E2 */
    .byte ._loop_insn_true - ._loop_insn_false
._loop_insn_false:
    /*
     LOOP* condition false. Save only modified context (flags and ecx)
     and continue tracing.
    */
    pushf
    movl %ecx, 28(%esp)
    jmp decryptloop_nocontext
._loop_insn_true:
    /*
     LOOP* condition true. Save only modified context, and jump to the
     rel_offset_fixup to fix up traced eip.
    */
    pushf
    movl %ecx, 28(%esp)
    jmp rel_offset_fixup
_jcxz:
    /*
     JCXZ. This is easier to simulate than to natively execute.
    */
    movsbl 1(%eax), %ebx            /* get signed offset */
    cmpl $0, 28(%esp)               /* test traced ecx for 0 */
    jz rel_offset_fixup             /* if so, fix up traced EIP */
    jmp decryptloop_nocontext
_callrel:
    /*
     Relative CALL.
    */
    movb $1, %cl                    /* 1 to indicates relative call */
    movl 1(%eax), %ebx              /* get offset */
_call:
    /*
     CALL emulation.
     CONTEXT:
     cl : relative/absolute indicator.
     ebx: absolute address (cl==0) or relative offset (cl!=0).
    */
    movl %esp, %ebp                 /* save our stack */
    movl traced_esp, %esp           /* push traced eip onto */
    pushl traced_eip                /* traced stack */
    movl %esp, traced_esp           /* write back traced stack */
    movl %ebp, %esp                 /* restore our stack */
    testb %cl, %cl                  /* if not zero, then it is a */
    jnz rel_offset_fixup            /* relative call */
    movl %ebx, traced_eip           /* store dst eip */
    jmp decryptloop_nocontext       /* continue execution */
_jmp_rel8:
    /*
     Relative 8-bit displacement JMP.
    */
    movsbl 1(%eax), %ebx            /* get signed offset */
    jmp rel_offset_fixup
_jmp_rel32:
    /*
     Relative 32-bit displacement JMP.
    */
    movl 1(%eax), %ebx              /* get offset */
    jmp rel_offset_fixup
_grp5:
    /*
     This is the case for 0xFF opcode which escapes to GRP5: the real
     instruction opcode is hidden in bits 5, 4, and 3 of the modR/M byte.
    */
    movb MODRM+disbuf, %bl          /* get modRM byte */
    shr $3, %bl                     /* shift bits 3-5 to 0-2 */
    andb $7, %bl                    /* and test only bits 0-2 */
    cmpb $2, %bl                    /* < 2, not control transfer */
    jb _nonjump
    cmpb $5, %bl                    /* > 5, not control transfer */
    ja _nonjump
    cmpb $3, %bl                    /* CALL FAR */
    je _unhandled
    cmpb $5, %bl                    /* JMP FAR */
    je _unhandled
    movb %bl, %dl                   /* for future reference */
    /*
     modR/M equals 2 or 4 (near CALL or JMP).
     In this case the reg field of modR/M (bits 3-5) is the part of
     instruction opcode.
     Replace instruction byte 0xFF with 0x8B (MOV r/m32 to reg32 opcode).
     Replace reg field with 3 (ebx register index).
    */
    movb $0x8B, (%eax)          /* replace with MOV_to_reg32 opcode */
    movb 1(%eax), %bl           /* get modR/M byte */
    andb $0xC7, %bl             /* mask bits 3-5 */
    orb $0x18, %bl              /* set them to 011=3: ebx reg index */
    movb %bl, 1(%eax)           /* set MOV target to ebx */
    /*
     We temporarily update continue location to continue execution in
     this code instead of jumping to decryptloop. We execute MOV in TRACED
     context because it must use traced registers for address calculation.
     Before that we save OUR esp so that original TRACED context isn't
     lost (MOV updates ebx, traced CALL wouldn't mess with any registers).
     First we save OUR context, but after that we must restore TRACED ctx.
     In order to do that, we must adjust esp to point to traced context
     before restoration.
    */
    movl $._grp5_continue, continue
    movl %esp, %ebp             /* save traced context pointer into ebp */
    pusha                       /* store our context; eflags irrelevant */
    movl %esp, our_esp          /* our context pointer */
    movl %ebp, %esp             /* adjust traced context pointer */
    jmp _nonjump
._grp5_continue:
    /*
     This is where execution continues after MOV calculates effective
     address for us.
     CONTEXT upon entry:
     ebx: target address where traced execution should continue
     dl : opcode part (bits 3-5) of modR/M, shifted to bits 0-2
    */
    movl $decryptloop, continue     /* restore continue location */
    movl our_esp, %esp              /* restore our esp */
    movl %ebx, 16(%esp)             /* so that ebx is restored anew */
    popa                            /* our context along with new ebx */
    cmpb $2, %dl                    /* CALL near indirect */
    je ._grp5_call
    movl %ebx, traced_eip           /* JMP near indirect */
    jmp decryptloop_nocontext
._grp5_call:
    xorb %cl, %cl                   /* mark: addr in ebx is absolute */
    jmp _call
_0xf:
    /*
     0x0F opcode esacpe for two-byte opcodes. Only 0F 0x80-0x8F range are
     Jcc rel32 instructions. Others are normal instructions.
    */
    movb OPCODE2+disbuf, %cl        /* extended opcode */
    cmpb $0x80, %cl
    jb _nonjump                     /* < 0x80, not Jcc */
    cmpb $0x8F, %cl
    ja _nonjump                     /* > 0x8F, not Jcc */
    movl 2(%eax), %ebx              /* load 32-bit offset */
    jmp _jcc_rel32
control_table:
    /*
     This is the jump table for instruction execution dispatch. When the
     real opcode of the instruction is found, the tracer jumps indirectly
     to execution routine based on this table.
    */
    .rept 0x0F                      /* 0x00 - 0x0E */
    .long _nonjump                  /* normal opcodes */
    .endr
    .long _0xf                      /* 0x0F two-byte escape */
    .rept 0x60                      /* 0x10 - 0x6F */
    .long _nonjump                  /* normal opcodes */
    .endr
    .rept 0x10                      /* 0x70 - 0x7F */
    .long _jcc_rel8                 /* relative 8-bit displacement */
    .endr
    .rept 0x10                      /* 0x80 - 0x8F */
    .long _nonjump                  /* long displ jump handled from */
    .endr                           /* _0xf opcode escape */
    .rept 0x0A                      /* 0x90 - 0x99 */
.long _nonjump
    .endr
    .long _unhandled                /* 0x9A: far call to full pointer */
    .rept 0x05                      /* 0x9B - 0x9F */
    .long _nonjump
    .endr
    .rept 0x20                      /* 0xA0 - 0xBF */
    .long _nonjump
    .endr
    .long _nonjump, _nonjump        /* 0xC0, 0xC1 */
    .long _unhandled                /* 0xC2: retn imm16 */
    .long _retn                     /* 0xC3: retn */
    .rept 0x06                      /* 0xC4 - 0xC9 */
    .long _nonjump
    .endr
    .long _unhandled, _unhandled    /* 0xCA, 0xCB : far ret */
    .rept 0x04
    .long _nonjump
    .endr
    .rept 0x10                      /* 0xD0 - 0xDF */
    .long _nonjump
    .endr
    .long _loopne, _loope           /* 0xE0, 0xE1 */
    .long _loop, _jcxz              /* 0xE2, 0xE3 */
    .rept 0x04                      /* 0xE4 - 0xE7 */
    .long _nonjump
    .endr
    .long _callrel                  /* 0xE8 */
    .long _jmp_rel32                /* 0xE9 */
    .long _unhandled                /* far jump to full pointer */
    .long _jmp_rel8                 /* 0xEB */
    .rept 0x04                      /* 0xEC - 0xEF */
    .long _nonjump
    .endr
    .rept 0x0F                      /* 0xF0 - 0xFE */
    .long _nonjump
    .endr
    .long _grp5                     /* 0xFF: group 5 instructions */
.data
continue:   .long decryptloop       /* where to continue after 1 insn */
.bss
.align  4
traced_esp: .long 0                 /* traced esp */
traced_eip: .long 0                 /* traced eip */
traced_ctr: .long 0                 /* incremented by 1 for each insn */
lo_addr:    .long 0                 /* low encrypted eip */
hi_addr:    .long 0                 /* high encrypted eip */
our_esp:    .long 0                 /* our esp... */
end_esp:    .long 0                 /* esp when we should stop tracing */
local_stk:  .fill 1024, 4, 0        /* local stack space (to call C) */
stk_end = .                         /* we need this.. */
ilen:       .long 0                 /* instruction length */
key:        .long 0                 /* pointer to key data */
decrypt:    .long 0                 /* USED decryption function */
r_decrypt:  .long 0                 /* REAL decryption function */
disbuf:     .fill 128, 1, 0         /* xde disassembly buffer */
----[ A.2 - The file encryption utility source: cryptfile.c
/*
Copyright (c) 2004 Zeljko Vrba
Permission is hereby granted, free of charge, to any person obtaining
a copy of this software and associated documentation files (the
"Software"), to deal in the Software without restriction, including
without limitation the rights to use, copy, modify, merge, publish,
distribute, sublicense, and/or sell copies of the Software, and to permit
persons to whom the Software is furnished to do so, subject to the
following conditions:
The above copyright notice and this permission notice shall be included
in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY
CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT
OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR
THE USE OR OTHER DEALINGS IN THE SOFTWARE.
*/
/*
 * This program encrypts a portion of the file, writing new file with
 * .crypt appended. The permissions (execute, et al) are NOT preserved!
 * The blocksize of 8 bytes is hardcoded.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <errno.h>
#include "cast5.h"
#define BLOCKSIZE 8
#define KEYSIZE 16
typedef void (*cryptblock_f)(void*, u8*, const u8*);
static unsigned char *decode_hex_key(char *hex)
{
    static unsigned char key[KEYSIZE];
    int i;
    if(strlen(hex) != KEYSIZE << 1) {
        fprintf(stderr, "KEY must have EXACTLY %d hex digits.\n",
          KEYSIZE << 1);
       exit(1);
    }
    for(i = 0; i < KEYSIZE; i++, hex += 2) {
        unsigned int x;
        char old = hex[2];
        hex[2] = 0;
        if(sscanf(hex, "%02x", &x) != 1) {
            fprintf(stderr, "non-hex digit in KEY.\n");
            exit(1);
        }
        hex[2] = old;
        key[i] = x;
    }
    return key;
}
static void *docrypt(
    FILE *in, FILE *out,
    long startoff, long endoff,
    cryptblock_f crypt, void *ctx)
{
    char buf[BLOCKSIZE], enc[BLOCKSIZE];
    long curroff = 0;
    size_t nread = 0;
    while((nread = fread(buf, 1, BLOCKSIZE, in)) > 0) {
        long diff = startoff - curroff;
        if((diff < BLOCKSIZE) && (diff > 0)) {
            /*
            this handles the following mis-alignment (each . is 1 byte)
            ...[..|......]....
                ^  ^      ^ curoff+BLOCKSIZE
                |  startoff
             curroff
            */
            if(fwrite(buf, 1, diff, out) < diff) {
                perror("fwrite");
                exit(1);
            }
            memmove(buf, buf + diff, BLOCKSIZE - diff);
            fread(buf + BLOCKSIZE - diff, 1, diff, in);
            curroff = startoff;
        }
        if((curroff >= startoff) && (curroff < endoff)) {
            crypt(ctx, enc, buf);
        } else {
            memcpy(enc, buf, BLOCKSIZE);
        }
        if(fwrite(enc, 1, nread, out) < nread) {
            perror("fwrite");
            exit(1);
        }
        curroff += nread;
    }
}
int main(int argc, char **argv)
{
    FILE *in, *out;
    long startoff, endoff;
    char outfname[256];
    unsigned char *key;
    struct cast5_ctx ctx;
    cryptblock_f mode;
    if(argc != 6) {
        fprintf(stderr, "USAGE: %s <-e|-d> FILE KEY STARTOFF ENDOFF\n",
          argv[0]);
        fprintf(stderr, "KEY MUST be 32 hex digits (128 bits).\n");
        return 1;
    }
    if(!strcmp(argv[1], "-e")) {
        mode = cast5_encrypt;
    } else if(!strcmp(argv[1], "-d")) {
        mode = cast5_decrypt;
    } else {
        fprintf(stderr, "invalid mode (must be either -e od -d)\n");
        return 1;
    }
    startoff = atol(argv[4]);
    endoff = atol(argv[5]);
    key = decode_hex_key(argv[3]);
    if(cast5_setkey(&ctx, key, KEYSIZE) < 0) {
        fprintf(stderr, "error setting key (maybe invalid length)\n");
        return 1;
    }
    if((endoff - startoff) & (BLOCKSIZE-1)) {
        fprintf(stderr, "STARTOFF and ENDOFF must span an exact multiple"
          " of %d bytes\n", BLOCKSIZE);
        return 1;
    }
    if((endoff - startoff) < BLOCKSIZE) {
        fprintf(stderr, "STARTOFF and ENDOFF must span at least"
          " %d bytes\n", BLOCKSIZE);
        return 1;
    }
    sprintf(outfname, "%s.crypt", argv[2]);
    if(!(in = fopen(argv[2], "r"))) {
        fprintf(stderr, "fopen(%s): %s\n", argv[2], strerror(errno));
        return 1;
    }
    if(!(out = fopen(outfname, "w"))) {
        fprintf(stderr, "fopen(%s): %s\n", outfname, strerror(errno));
        return 1;
    }
    docrypt(in, out, startoff, endoff, mode, &ctx);
    fclose(in);
    fclose(out);
    return 0;
}
----[ A.3 - The test program: test2.c
/*
Copyright (c) 2004 Zeljko Vrba
Permission is hereby granted, free of charge, to any person obtaining
a copy of this software and associated documentation files (the
"Software"), to deal in the Software without restriction, including
without limitation the rights to use, copy, modify, merge, publish,
distribute, sublicense, and/or sell copies of the Software, and to permit
persons to whom the Software is furnished to do so, subject to the
following conditions:
The above copyright notice and this permission notice shall be included
in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY
CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT
OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR
THE USE OR OTHER DEALINGS IN THE SOFTWARE.
*/
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include "cast5.h"
#define BLOCKSIZE 8
#define KEYSIZE 16
/*
 * f1 and f2 are encrypted with the following 128-bit key:
 * 5f4dcc3b5aa765d61d8327deb882cf99 (MD5 of the string 'password')
 */
static int f1(int a)
{
    int i, s = 0;
    for(i = 0; i < a; i++) {
        s += i*i;
    }
    printf("called plaintext code: f1 = %d\n", a);
    return s;
}
static int f2(int a, int b)
{
    int i;
    a = f1(a);
    for(i = 0; i < b; i++) {
        a += b;
    }
    return a;
}
static unsigned char *decode_hex_key(char *hex)
{
    static unsigned char key[KEYSIZE];
    int i;
    if(strlen(hex) != KEYSIZE << 1) {
        fprintf(stderr, "KEY must have EXACTLY %d hex digits.\n",
          KEYSIZE << 1);
        exit(1);
    }
    for(i = 0; i < KEYSIZE; i++, hex += 2) {
        unsigned int x;
        char old = hex[2];
        hex[2] = 0;
        if(sscanf(hex, "%02x", &x) != 1) {
            fprintf(stderr, "non-hex digit in KEY.\n");
            exit(1);
        }
        hex[2] = old;
        key[i] = x;
    }
    return key;
}
int main(int argc, char **argv)
{
    int a, b, result;
    char op[16], hex[256];
    void *esp;
    struct cast5_ctx ctx;
    printf("enter decryption key: ");
    scanf("%255s", hex);
    if(cast5_setkey(&ctx, decode_hex_key(hex), KEYSIZE) < 0) {
        fprintf(stderr, "error setting key.\n");
        return 1;
    }
    printf("a b = "); scanf("%d %d", &a, &b);
    asm("movl %%esp, %0" : "=m" (esp));
    printf("esp=%p\n", esp);
    result = crypt_exec(cast5_decrypt, &ctx, f1, decode_hex_key,
      f2, a, b);
    asm("movl %%esp, %0" : "=m" (esp));
    printf("esp=%p\n", esp);
    printf("result = %d\n", result);
    return 0;
}