Title : Once upon a free()
 Author : anonymous author
                             ==Phrack Inc.==
               Volume 0x0b, Issue 0x39, Phile #0x09 of 0x12
|=---------------------=[ Once upon a free()... ]=-----------------------=|
|=-----------------------------------------------------------------------=|
|=--------------=[ anonymous <[email protected]> ]=-------------=|
On the Unix system, and later in the C standard library there are functions
to handle variable amounts of memory in a dynamic way. This allows programs
to dynamically request memory blocks from the system. The operating system
only provides a very rough system call 'brk' to change the size of a big
memory chunk, which is known as the heap.
On top of this system call the malloc interface is located, which provides
a layer between the application and the system call. It can dynamically
split the large single block into smaller chunks, free those chunks on
request of the application and avoid fragmentation while doing so. You can
compare the malloc interface to a linear file system on a large, but
dynamically sized raw device.
There are a few design goals which have to be met by the malloc interface:
	- stability
	- performance
	- avoidance of fragmentation
	- low space overhead
There are only a few common malloc implementations. The most common ones
are the System V one, implemented by AT&T, the GNU C Library implementation
and the malloc-similar interface of the Microsoft operating systems
(RtlHeap*).
Here is a table of algorithms and which operating systems use them:
Algorithm               | Operating System
------------------------+--------------------------------------------------
BSD kingsley            | 4.4BSD, AIX (compatibility), Ultrix
BSD phk                 | BSDI, FreeBSD, OpenBSD
GNU Lib C (Doug Lea)    | Hurd, Linux
System V AT&T           | Solaris, IRIX
Yorktown                | AIX (default)
RtlHeap*                | Microsoft Windows *
------------------------+--------------------------------------------------
It is interesting to see that most of the malloc implementations are very
easy to port and that they are architecture independent. Most of those
implementations just build an interface with the 'brk' system call. You can
change this behaviour with a #define. All of the implementations I have
come across are written in ANSI C and just do very minimal or even no
sanity checking. Most of them have a special compilation define that
includes asserts and extra checks. Those are turned off by default in the
final build for performance reasons. Some of the implementations also
offer extra reliability checks that will detect buffer overflows. Those
are made to detect overflows while development, not to stop exploitation
in the final release.
Storing management info in-band
Most malloc implementations share the behaviour of storing their own
management information, such as lists of used or free blocks, sizes of
memory blocks and other useful data within the heap space itself. Since the
whole idea of malloc/free is based on the dynamic requirements the
application has, the management info itself occupies a variable amount of
data too. Because of this, the implementation can seldomly just reserve a
certain amount of memory for its own purposes, but stores the management
information "in-band", right after and before the blocks of memory that are
used by the application.
Some applications do request a block of memory using the malloc interface,
which later happens to be vulnerable to a buffer overflow. This way, the
data behind the chunk can be changed. Possibly the malloc management
structures can be compromised. This has been demonstrated first by Solar
Designer's wizard-like exploit [1].
The central attack of exploiting malloc allocated buffer overflows is to
modify this management information in a way that will allow arbitrary
memory overwrites afterwards. This way pointers can be overwritten within
the writeable process memory, hence allowing modification of return
addresses, linkage tables or application level data.
To mount such an attack, we have to take a deep look within the internal
workings of the implementation we want to exploit. This article discusses
the commonly used GNU C Library and the System V implementation and how to
gain control over a process using buffer overflows which occur in malloced
buffers under Linux, Solaris and IRIX systems.
System V malloc implementation
==============================
IRIX and Solaris use an implementation which is based on self-adjusting
binary trees. The theoretical background of this implementation has been
described in [2].
The basic idea of this implementation is to keep lists of equally sized
malloc chunks within a binary tree. If you allocate two chunks of the
same size, they will be within the same node and within the same list of this
node. The tree is ordered by the size of its elements.
The TREE structure
The definition of the TREE structure can be found in the mallint.h, along
with some easy-to-use macros to access its elements. The mallint.h file
can be found in the source distribution of the Solaris operating system
[4]. Although I cannot verify that IRIX is based on the same source, there
are several similarities which indicated this. The malloc interface
internally creates the same memory layout and functions, besides some 64
bit alignments. You can utilize the Solaris source for your IRIX exploits,
too.
To allow each tree element to be used for a different purpose to avoid
overhead and force an alignment, each TREE structure element is defined
as a union:
/* the proto-word; size must be ALIGN bytes */
typedef union _w_ {
	size_t		w_i;		/* an unsigned int */
	struct _t_	*w_p;		/* a pointer */
	char		w_a[ALIGN];	/* to force size */
} WORD;
Central TREE structure definition:
/* structure of a node in the free tree */
typedef struct _t_ {
	WORD	t_s;	/* size of this element */
	WORD	t_p;	/* parent node */
	WORD	t_l;	/* left child */
	WORD	t_r;	/* right child */
	WORD	t_n;	/* next in link list */
	WORD	t_d;	/* dummy to reserve space for self-pointer */
} TREE;
The 't_s' element of the chunk header contains the rounded up value of the
size the user requested when he called malloc. Since this size is always
rounded up to a word boundary, at least the lower two bits of the 't_s'
elements are unused - they normally would have the value of zero all the
time. Instead of being zero, they are ignored for all size-related
operations. They are used as flag elements. 
From the malloc.c source it reads:
   BIT0: 1 for busy (block is in use), 0 for free.
   BIT1: if the block is busy, this bit is 1 if the preceding block in
         contiguous memory is free. Otherwise, it is always 0.
TREE Access macros:
/* usable # of bytes in the block */
#define	SIZE(b)		(((b)->t_s).w_i)
/* free tree pointers */
#define	PARENT(b)	(((b)->t_p).w_p)
#define	LEFT(b)		(((b)->t_l).w_p)
#define	RIGHT(b)	(((b)->t_r).w_p)
/* forward link in lists of small blocks */
#define	AFTER(b)	(((b)->t_p).w_p)
/* forward and backward links for lists in the tree */
#define	LINKFOR(b)	(((b)->t_n).w_p)
#define	LINKBAK(b)	(((b)->t_p).w_p)
For all allocation operations a certain alignment and minimum size is
enforced, which is defined here:
#define	WORDSIZE	(sizeof (WORD))
#define	MINSIZE		(sizeof (TREE) - sizeof (WORD))
#define	ROUND(s)	if (s % WORDSIZE) s += (WORDSIZE - (s % WORDSIZE))
The tree structure is the central element of each allocated chunk. Normally
only the 't_s' and 't_p' elements are used, and user data is stored from
't_l' on. Once the node is freed, this changes and the data is reused to
manage the free elements more efficiently. The chunk represents an element
within the splay tree. As more chunks get freed, the malloc implementation
tries to merge the free chunks right next to it. At most FREESIZE (32 by
default) chunks can be in this dangling free state at the same time. They
are all stored within the 'flist' array. If a call to free is made while
the list is already full, the old element at this place falls out and is
forwarded to realfree. The place is then occupied by the newly freed
element.
This is done to speed up and avoid defragmentation in cases where a lot of
calls to free are made in a row. The real merging process is done by
realfree.  It inserts the chunk into the central tree, which starts at the
'Root' pointer. The tree is ordered by the size of its elements and
is self-balancing. It is a so called "splay tree", in which the elements
cycle in a special way to speed up searches (see google.com "splay tree"
for further information). This is not much of importance here, but keep in
mind that there are two stages of free chunks: one being within the flist
array, and one within the free-elements tree starting at 'Root'.
There are some special management routines for allocating small chunks of
memory, which happen to have a size below 40 bytes. Those are not
considered here, but the basic idea is to have extra lists of them, not
keeping them within a tree but in lists, one for each WORD matching size
below 40.
There is more than one way to exploit a malloc based buffer overflow,
however here is one method which works against both, IRIX and Solaris.
As a chunk is realfree'd, it is checked whether the neighbor-chunks are
already within the realfree'd tree. If it is the case, the only thing
that has to be done is to logically merge the two chunks and reorder its
position within the tree, as the size has changed.
This merging process involves pointer modification within the tree, which
consists of nodes. These nodes are represented by the chunk header
itself. Pointers to other tree elements are stored there. If we can
overwrite them, we can possibly modify the operation when merging the
chunks.
Here is, how it is done in malloc.c:
(modified to show the interesting part of it)
static void
realfree(void *old)
{
	TREE	*tp, *sp, *np;
	size_t	ts, size;
	/* pointer to the block */
	tp = BLOCK(old);
	ts = SIZE(tp);
	if (!ISBIT0(ts))
		return;
	CLRBITS01(SIZE(tp));
	/* see if coalescing with next block is warranted */
	np = NEXT(tp);
	if (!ISBIT0(SIZE(np))) {
		if (np != Bottom)
			t_delete(np);
		SIZE(tp) += SIZE(np) + WORDSIZE;
	}
We remember NEXT points to the chunk directly following the current one. So
we have this memory layout:
          tp               old              np
          |                |                |
          [chunk A header] [chunk A data] | [chunk B or free ....]
                                          |
                                          chunk boundary
In the usual situation the application has allocated some space and got a
pointer (old) from malloc. It then messes up and allows a buffer overflow
of the chunk data. We cross the chunk boundary by overflowing and hit the
data behind, which is either free space or another used chunk.
	np = NEXT(tp);
Since we can only overflow data behind 'old', we cannot modify the header
of our own chunk. Therefore we cannot influence the 'np' pointer in any
way. It always points to the chunk boundary.
Now a check is made to test if it is possible to merge forward, that is our
chunk and the chunk behind it. Remember that we can control the chunk
to the right of us.
	if (!ISBIT0(SIZE(np))) {
		if (np != Bottom)
			t_delete(np);
		SIZE(tp) += SIZE(np) + WORDSIZE;
	}
 
BIT0 is zero if the chunk is free and within the free elements tree. So if
it is free and not the last chunk, the special 'Bottom' chunk, it is
deleted from the tree. Then the sizes of both chunks are added and later in
the code of the realfree function the whole resized chunk is reinserted
into the tree.
One important part is that the overflowed chunk must not be the last chunk
within the malloc space, condition:
        1. Overflowed chunk must not be the last chunk
Here is how the 't_delete' function works:
static void
t_delete(TREE *op)
{
	TREE	*tp, *sp, *gp;
	/* if this is a non-tree node */
	if (ISNOTREE(op)) {
		tp = LINKBAK(op);
		if ((sp = LINKFOR(op)) != NULL)
			LINKBAK(sp) = tp;
		LINKFOR(tp) = sp;
		return;
	}
There are other cases, but this is the one easiest to exploit. As I am
already tired of this, I will just explain this one here. The others are
very similar (look at malloc.c).
ISNOTREE compares the 't_l' element of the TREE structure with -1. -1 is
the special marker for non-tree nodes, which are used as doubly linked list,
but that does not matter.
Anyway, this is the first condition we have to obey:
        2. fake->t_l = -1;
Now the unlinking between FOR (t_n) and BAK (t_p) is done, which can be
rewritten as:
	t1 = fake->t_p
	t2 = fake->t_n
	t2->t_p = t1
	t1->t_n = t2
Which is (written in pseudo-raw-assignments which happen at the same time):
	[t_n + (1 * sizeof (WORD))] = t_p
	[t_p + (4 * sizeof (WORD))] = t_n
This way we can write to arbitrary addresses together with valid
addresses at the same time. We choose to use this:
	t_p = retloc - 4 * sizeof (WORD)
	t_n = retaddr
This way retloc will be overwritten with retaddr and *(retaddr + 8) will be
overwritten with retloc. If there is code at retaddr, there should be a
small jump over the bytes 8-11 to not execute this address as code. Also,
the addresses can be swapped if that fits the situation better.
Finally our overwrite buffer looks like this:
  | <t_s> <t_p> <t_l> <j: t_r> <t_n> <j: t_d>
  |
  chunk boundary
Where: t_s = some small size with lower two bits zeroed out
       t_p = retloc - 4 * sizeof (WORD)
       t_l = -1
       t_r = junk
       t_n = retaddr
       t_d = junk
Note that although all of the data is stored as 32 bit pointers, each
structure element occupies eight bytes. This is  because of the WORD
union, which forces at least ALIGN bytes to be used for each element.
ALIGN is defined to eight by default.
So a real overflow buffer behind the chunk boundary might look like:
ff ff ff f0 41 41 41 41  ef ff fc e0 41 41 41 41  | ....AAAA....AAAA
ff ff ff ff 41 41 41 41  41 41 41 41 41 41 41 41  | ....AAAAAAAAAAAA
ef ff fc a8 41 41 41 41  41 41 41 41 41 41 41 41  | ....AAAAAAAAAAAA
All 'A' characters can be set arbitrarily. The 't_s' element has been
replaced with a small negative number to avoid NUL bytes. If you want to use
NUL bytes, use very few. Otherwise the realfree function will crash later.
The buffer above will overwrite:
	[0xeffffce0 + 32] = 0xeffffca8
	[0xeffffca8 +  8] = 0xeffffce0
See the example code (mxp.c) for a more in-depth explanation.
To summarize down the guts if you happen to exploit a malloc based buffer
overflow on IRIX or Solaris:
       1. Create a fake chunk behind the one you overflow
       2. The fake chunk is merged with the one you overflow as it is
          passed to realfree
       3. To make it pass to realfree it has to call malloc() again or
          there have to be a lot of successive free() calls
       4. The overflowed chunk must not be the last chunk (the one before
          Bottom)
       5. Prepend the shellcode/nop-space with jump-aheads to not execute
          the unavoidable unlink-overwrite address as code
       6. Using the t_splay routines attacks like this are possible too, so
          if you cannot use the attack described here (say you cannot
          write 0xff bytes), use the source luke.
There are a lot of other ways to exploit System V malloc management, way
more than there are available in the GNU implementation. This is a result
of the dynamic tree structure, which also makes it difficult to understand
sometimes. If you have read until here, I am sure you can find your own
ways to exploit malloc based buffer overflows.
GNU C Library implementation
============================
The GNU C library keeps the information about the memory slices the
application requests in so called 'chunks'. They look like this (adapted
from malloc.c):
             +----------------------------------+
    chunk -> | prev_size                        |
             +----------------------------------+
             | size                             |
             +----------------------------------+
      mem -> | data                             |
             : ...                              :
             +----------------------------------+
nextchunk -> | prev_size ...                    |
             :                                  :
Where mem is the pointer you get as return value from malloc(). So if you
do a:
        unsigned char *	mem = malloc (16);
Then 'mem' is equal to the pointer in the figure, and (mem - 8) would be
equal to the 'chunk' pointer.
The 'prev_size' element has a special function: If the chunk before the
current one is unused (it was free'd), it contains the length of the chunk
before. In the other case - the chunk before the current one is used -
'prev_size' is part of the 'data' of it, saving four bytes.
The 'size' field has a special meaning. As you would expect, it contains
the length of the current block of memory, the data section. As you call
malloc(), four is added to the size you pass to it and afterwards the size
is padded up to the next double-word boundary. So a malloc(7) will become a
malloc(16), and a malloc(20) will become malloc(32). For malloc(0) it will
be padded to malloc(8). The reason for this behaviour will be explained in
the latter.
Since this padding implies that the lower three bits are always zero and
are not used for real length, they are used another way. They are used to
indicate special attributes of the chunk. The lowest bit, called
PREV_INUSE, indicates whether the previous chunk is used or not. It is set
if the next chunk is in use. The second least significant bit is set if the
memory area is mmap'ed -- a special case which we will not consider.  The
third least significant bit is unused.
To test whether the current chunk is in use or not, we have to check the
next chunk's PREV_INUSE bit within its size value.
Once we free() the chunk, using free(mem), some checks take place and the
memory is released. If its neighbour blocks are free, too (checked using
the PREV_INUSE flag), they will be merged to keep the number of reuseable
blocks low, but their sizes as large as possible. If a merge is not
possible, the next chunk is tagged with a cleared PREV_INUSE bit, and the
chunk changes a bit:
             +----------------------------------+
    chunk -> | prev_size                        |
             +----------------------------------+
             | size                             |
             +----------------------------------+
      mem -> | fd                               |
             +----------------------------------+
             | bk                               |
             +----------------------------------+
             | (old memory, can be zero bytes)  |
             :                                  :
nextchunk -> | prev_size ...                    |
             :                                  :
You can see that there are two new values, where our data was previously
stored (at the 'mem' pointer). Those two values, called 'fd' and 'bk' -
forward and backward, that is, are pointers. They point into a double
linked list of unconsolidated blocks of free memory. Every time a new free
is issued, the list will be checked, and possibly unconsolidated blocks
are merged. The whole memory gets defragmented from time to time to release
some memory.
Since the malloc size is always at least 8 bytes, there is enough space for
both pointers. If there is old data remaining behind the 'bk' pointer, it
remains unused until it gets malloc'd again.
The interesting thing regarding this management, is that the whole internal
information is held in-band -- a clear channeling problem.
(just as with format string commands within the string itself, as control
channels in breakable phonelines, as return addresses within stack memory,
etc).
Since we can overwrite this internal management information if we can
overwrite a malloced area, we should take a look at how it is processed
later on. As every malloc'ed area is free()'d again in any sane program,
we take a look at free, which is a wrapper to chunk_free() within malloc.c
(simplified a bit, took out #ifdef's):
static void
chunk_free(arena *ar_ptr, mchunkptr p)
{
  size_t     hd = p->size; /* its head field */
  size_t     sz;           /* its size */
  int        idx;          /* its bin index */
  mchunkptr  next;         /* next contiguous chunk */
  size_t     nextsz;       /* its size */
  size_t     prevsz;       /* size of previous contiguous chunk */
  mchunkptr  bck;          /* misc temp for linking */
  mchunkptr  fwd;          /* misc temp for linking */
  int        islr;         /* track whether merging with last_remainder */
  check_inuse_chunk(ar_ptr, p);
  sz = hd & ~PREV_INUSE;
  next = chunk_at_offset(p, sz);
  nextsz = chunksize(next);
Since the malloc management keeps chunks within special structures called
'arenas', it is now tested whether the current chunk that should be free
directly borders to the 'top' chunk -- a special chunk. The 'top' chunk is
always the top-most available memory chunk within an arena, it is the border
of the available memory. The whole if-block is not interesting for typical
buffer overflows within the malloc space.
  if (next == top(ar_ptr))                         /* merge with top */
  {
    sz += nextsz;
    if (!(hd & PREV_INUSE))                    /* consolidate backward */
    {
      prevsz = p->prev_size;
      p = chunk_at_offset(p, -(long)prevsz);
      sz += prevsz;
      unlink(p, bck, fwd);
    }
    set_head(p, sz | PREV_INUSE);
    top(ar_ptr) = p;
      if ((unsigned long)(sz) >= (unsigned long)trim_threshold)
        main_trim(top_pad);
    return;
  }
Now the 'size' of the current chunk is tested whether the previous chunk is
unused (testing for the PREV_INUSE flag). If this is the case, both chunks
are merged.
  islr = 0;
  if (!(hd & PREV_INUSE))                    /* consolidate backward */
  {
    prevsz = p->prev_size;
    p = chunk_at_offset(p, -(long)prevsz);
    sz += prevsz;
    if (p->fd == last_remainder(ar_ptr))     /* keep as last_remainder */
      islr = 1;
    else
      unlink(p, bck, fwd);
  }
Now the same is done vice versa. It is checked whether the chunk in front
of the current chunk is free (testing for the PREV_INUSE flag of the size
two chunks ahead). If this is the case the chunk is also  merged into the
current one.
  if (!(inuse_bit_at_offset(next, nextsz)))   /* consolidate forward */
  {
    sz += nextsz;
    if (!islr && next->fd == last_remainder(ar_ptr))
                                              /* re-insert last_remainder */
    {
      islr = 1;
      link_last_remainder(ar_ptr, p);
    }
    else
      unlink(next, bck, fwd);
    next = chunk_at_offset(p, sz);
  }
  else
    set_head(next, nextsz);                  /* clear inuse bit */
  set_head(p, sz | PREV_INUSE);
  next->prev_size = sz;
  if (!islr)
    frontlink(ar_ptr, p, sz, idx, bck, fwd);
}
As Solar Designer showed us, it is possible to use the 'unlink' macro to
overwrite arbitrary memory locations. Here is how to do:
A usual buffer overflow situation might look like:
        mem = malloc (24);
        gets (mem);
        ...
        free (mem);
This way the malloc'ed chunk looks like this:
[ prev_size ] [ size P] [ 24 bytes ... ] (next chunk from now)
       [ prev_size ] [ size P] [ fd ] [ bk ] or [ data ... ]
As you can see, the next chunk directly borders to our chunk we overflow.
We can overwrite anything behind the data region of our chunk, including
the header of the following chunk.
If we take a closer look at the end of the chunk_free function, we see this
code:
  if (!(inuse_bit_at_offset(next, nextsz)))   /* consolidate forward */
  {
    sz += nextsz;
    if (!islr && next->fd == last_remainder(ar_ptr))
                                              /* re-insert last_remainder */
    {
      islr = 1;
      link_last_remainder(ar_ptr, p);
    }
    else
      unlink(next, bck, fwd);
    next = chunk_at_offset(p, sz);
  }
The inuse_bit_at_offset, is defined as macro in the beginning of malloc.c:
#define inuse_bit_at_offset(p, s)\
 (((mchunkptr)(((char*)(p)) + (s)))->size & PREV_INUSE)
Since we control the header of the 'next' chunk we can trigger the whole if
block at will. The inner if statement is uninteresting, except our chunk is
bordering to the top-most chunk. So if we choose to trigger the outer if
statement, we will call unlink, which is defined as macro, too:
#define unlink(P, BK, FD)                                                \
{                                                                        \
  BK = P->bk;                                                            \
  FD = P->fd;                                                            \
  FD->bk = BK;                                                           \
  BK->fd = FD;                                                           \
}
The unlink is called with a pointer to a free chunk and two temporary
pointer variables, called bck and fwd. It does this to the 'next' chunk
header:
  *(next->fd + 12) = next->bk
  *(next->bk + 8) = next->fd
They are not swapped, but the 'fd' and 'bk' pointers point to other chunks.
This two chunks being pointed to are linked, zapping the current chunk from
the table.
So to exploit a malloc based buffer overflow, we have to write a bogus
header in the following chunk and then wait for our chunk getting free'd.
        [buffer .... ] | [ prev_size ] [ size ] [ fd ] [ bk ]
'|' is the chunk boundary.
The values we set for 'prev_size' and 'size' do not matter, but two
conditions have to be met, in case it should work:
  a) the least significant bit of 'size' has to be zero
  b) both, 'prev_size' and 'size' should be add-safe to a pointer that is
     read from. So either use very small values up to a few thousand, or -
     to avoid NUL bytes - use big values such as 0xfffffffc.
  c) you have to ensure that at (chunk_boundary + size + 4) the lowest bit
     is zeroed out (0xfffffffc will work just fine)
'fd' and 'bk' can be set this way (as used in Solar Designers Netscape
Exploit):
  fd = retloc - 12
  bk = retaddr
But beware, that (retaddr + 8) is being written to and the content there
will be destroyed. You can circumvent this by a simple '\xeb\x0c' at
retaddr, which will jump twelve bytes ahead, over the destroyed content.
Well, however, exploitation is pretty straight forward now:
<jmp-ahead, 2> <6> <4 bogus> <nop> <shellcode> |
        \xff\xff\xff\xfc \xff\xff\xff\xfc <retloc - 12> <retaddr>
Where '|' is the chunk boundary (from that point we overflow). Now, the
next two negative numbers are just to survive a few checks in free() and to
avoid NUL bytes. Then we store (retloc - 12) properly encoded and then the
return address, which will return to the 'jmp-ahead'. The buffer before the
'|' is the same as with any x86 exploit, except for the first 12 bytes,
because we have to take care of the extra write operation by the unlink
macro.
Off-by-one / Off-by-five
------------------------
It is possible to overwrite arbitrary pointers, even in cases where you can
overwrite only five bytes, or - in special cases - only one byte. When
overwriting five bytes the memory layout has to look like:
        [chunk a] [chunk b]
Where chunk a is under your control and overflowable. Chunk b is already
allocated as the overflow happens. By overwriting the first five bytes of
chunk b, we trash the 'prev_size' element of the chunks header and the
least significant byte of the 'size' element. Now, as chunk b is free()'d,
backward consolidation pops in, since 'size' has the PREV_INUSE flag
cleared (see below). If we supply a small value for 'prev_size', which is
smaller than the size of chunk a, we create a fake chunk structure:
        [chunk a ... fakechunk ...] [chunk b]
                     |
                     p 
Where prev_size of chunk b points relativly negative to the fake chunk.
The code which is exploitable through this setting was already discussed:
  if (!(hd & PREV_INUSE))                    /* consolidate backward */
  {
    prevsz = p->prev_size;
    p = chunk_at_offset(p, -(long)prevsz);
    sz += prevsz;
    if (p->fd == last_remainder(ar_ptr))     /* keep as last_remainder */
      islr = 1;
    else
      unlink(p, bck, fwd);
  }
'hd' is the size element of chunk b. When we overwrite it, we clear out the
lower two bits, so PREV_INUSE is cleared and the if condition is matched
(NUL will do it in fact). In the next few instructions 'p', which was a
pointer to chunk b originally, is relocated to our fakechunk. Then the
unlink macro is called and we can overwrite the pointers as usual. We use
backward consolidation now, while in the previous description we have used
forward consolidation. Is this all confusing? Well, when exploiting malloc
overflows, do not worry about the details, they will become clearer as you
understand the malloc functions from a broader scope.
  For a really well done overview and description of the malloc
implementation in the GNU C Library, take a look at the GNU C Library
reference manual [3]. It makes a good read for non-malloc related things,
too.
Possible obstacles and how to get over with them
================================================
As with any new exploitation technique people will show up which have the
'perfect' solution to the problem in their head or in form of a patch to
the malloc functions. Those people - often ones who have never written
an exploit themselves - are misleading into a wrong sense of security and I
want to leave a few words about those approaches and why they seldomly work.
There are three host based stages where you can stop a buffer overflow
resulting in a compromise:
 1. The bug/overflow stage
    This is the place where the real overflow happens, where data is
overwritten. If this place is known, the origin of the problem can be fixed
(at source level). However, most approaches argue that this place is not
known and therefore the problem cannot be fixed yet.
 2. The activation stage
    After the overflow happened parts of the data of the application are
corrupted. It does not matter what kind of data, whether it is a stack
frame, a malloc management record or static data behind a buffer. The
process is still running its own path of code, the overwritten data is
still passive. This stage is everything after the overflow itself and
before the seize of execution control. This is where the natural,
non-artificially introduced hurdles for the attacker lies, code which must
be passed in a certain way.
 3. The seized stage
    This is everything after control has been redirected from its original
path of execution. This is the stage where nopspace and shellcode is
executed, where no real exploitation hurdles are left.
Now for the protection systems. Most of the "non-exec stack" and "non-exec
heap" patches try to catch the switch from stage two to three, where
execution is seized, while some proprietary systems check for the origin of
a system call from within kernel space. They do not forbid you to run code
this way, they try to limit what code can be run.
Those systems which allow you to redirect execution in the first place are
fundamentally flawed. They try to limit the exploitation in a black-listing
way, by trying to plug the places you may want to go to. But if you can
execute legal code within the process space its almost everytime enough to
compromise the process as a whole.
Now for the more challenging protections, which try to gripe you in stage
two. Those include - among others - libsafe, StackGuard, FormatGuard, and
any compiler or library based patches. They usually require a recompilation
or relinking of your existing code, to insert their security 'measures'
into your code. This includes canary values, barriers of check bytes or
reordering and extensive checking of sanity before doing things which might
be bad. While sanity checking in general is a good policy for security, it
cannot fix stuff that was broken before. Every of those protections is
assuming a certain situation of a bug which might appear in your program
and try to predict the results of an attacker abusing the bug. They setup
traps which they assume you will or have to trigger to exploit the bug.
This is done before your control is active, so you cannot influence it
much except by choosing the input data. Those are, of course much more
tight than protection systems which only monitor stage three, but still
there are ways around them. A couple of ways have been discussed in the
past, so I will not go into depth here. Rather, I will briefly address on a
protection which I already see on the horizon under a name like
'MallocGuard'.
Such a protection would not change the mechanism of malloc management
chunks much, since the current code has proved to be effective. The malloc
function plays a key role in overall system performance, so you cannot
tweak freely here. Such a protection can only introduce a few extra checks,
it cannot verify the entire consistency everytime malloc() is called. And
this is where it is flawed: Once you seize control over one malloc chunk
information, you can seize control over other chunks too. Because chunks
are 'walked' by using either stored pointers (SysV) or stored lengths
(GlibC), it is possible to 'create' new chunks. Since a sanity check would
have to assume inconsistency of all chunks in the worst case, it would have
to check all chunks by walking them. But this would eat up too much
performance, so its impossible to check for malloc overflows easily while
still keep a good performance. So, there will be no 'MallocGuard', or it
will be a useless guard, in the tradition of useless pseudo protections. As
a friend puts it - 'for every protection there is an anti-protection'.
Thanks
======
I would like to thank all proofreaders and correctors. For some really
needed corrections I thank MaXX, who wrote the more detailed article about
GNU C Library malloc in this issue of Phrack, kudos to him ! :)
References
==========
[1] Solar Designer,
    http://www.openwall.com/advisories/OW-002-netscape-jpeg.txt
[2] DD Sleator, RE Tarjan, "Self-Adjusting Binary Trees", 1985,
    http://www.acm.org/pubs/citations/journals/jacm/1985-32-3/p652-sleator/
    http://www.math.tau.ac.il/~haimk/adv-ds-2000/sleator-tarjan-splay.pdf
[3] The GNU C Library
    http://www.gnu.org/manual/glibc-2.2.3/html_node/libc_toc.html
[4] Solaris 8 Foundation Source Program
    http://www.sun.com/software/solaris/source/
|=[ EOF ]=---------------------------------------------------------------=|