[ News ] [ Paper Feed ] [ Issues ] [ Authors ] [ Archives ] [ Contact ]

..[ Phrack Magazine ]..
.:: Power cell buffer overflow ::.

Issues: [ 1 ] [ 2 ] [ 3 ] [ 4 ] [ 5 ] [ 6 ] [ 7 ] [ 8 ] [ 9 ] [ 10 ] [ 11 ] [ 12 ] [ 13 ] [ 14 ] [ 15 ] [ 16 ] [ 17 ] [ 18 ] [ 19 ] [ 20 ] [ 21 ] [ 22 ] [ 23 ] [ 24 ] [ 25 ] [ 26 ] [ 27 ] [ 28 ] [ 29 ] [ 30 ] [ 31 ] [ 32 ] [ 33 ] [ 34 ] [ 35 ] [ 36 ] [ 37 ] [ 38 ] [ 39 ] [ 40 ] [ 41 ] [ 42 ] [ 43 ] [ 44 ] [ 45 ] [ 46 ] [ 47 ] [ 48 ] [ 49 ] [ 50 ] [ 51 ] [ 52 ] [ 53 ] [ 54 ] [ 55 ] [ 56 ] [ 57 ] [ 58 ] [ 59 ] [ 60 ] [ 61 ] [ 62 ] [ 63 ] [ 64 ] [ 65 ] [ 66 ] [ 67 ] [ 68 ] [ 69 ] [ 70 ]
Current issue : #66 | Release date : 2009-11-06 | Editor : The Circle of Lost Hackers
Phrack Prophile on The PaX TeamTCLH
Phrack World NewsTCLH
Abusing the Objective C runtimenemo
Backdooring Juniper FirewallsGraeme
Exploiting DLmalloc frees in 2009huku
Persistent BIOS infectionaLS and Alfredo
Exploiting UMA : FreeBSD kernel heap exploitsargp and karl
Exploiting TCP Persist Timer Infinitenessithilgore
Malloc Des-Maleficarumblackngel
A Real SMM RootkitCore Collapse
Alphanumeric RISC ARM ShellcodeYYounan and PPhilippaerts
Power cell buffer overflowBSDaemon
Binary Mangling with Radarepancake
Linux Kernel Heap Tampering DetectionLarry H
Developing MacOS X Kernel Rootkitsghalen and wowie
How close are they of hacking your braindahut
Title : Power cell buffer overflow
Author : BSDaemon
                             ==Phrack Inc.==

               Volume 0x0d, Issue 0x42, Phile #0x0D of 0x11

|=---------=[ Hacking the Cell Broadband Engine Architecture ]=---------=|
|=-------------------=[ SPE software exploitation ]=--------------------=|
|=--------------=[ By BSDaemon                              ]=----------=|
|=--------------=[ <bsdaemon *noSPAM* risesecurity_org>     ]=----------=|

					"There are two ways of
					constructing a software design.
					One way is to make it so simple
					that there are obviously no
					deficiencies.  And the other way
					is to make it so complicated that
					there are no obvious deficiencies"
						- C.A.R. Hoare

------[  Index

  1 - Introduction
    1.1 - Paper structure

  2 - Cell Broadband Engine Architecture
    2.1 - What is Cell
    2.2 - Cell History
	2.2.1 - Problems it solves
	2.2.2 - Basic Design Concept
	2.2.3 - Architecture Components
	2.2.4 - Processor Components
    2.3 - Debugging Cell
    	2.3.1 - Linux on Cell
	2.3.2 - Extensions to Linux - User-mode - Kernel-mode
	2.3.3 - Debugging the SPE
    2.4 - Software Development for Linux on Cell
	2.4.1 - PPE/SPE hello world
	2.4.2 - Standard Library Calls from SPE
	2.4.3 - Communication Mechanisms 
	2.4.4 - Memory Flow Control (MFC) Commands
	2.4.5 - Direct Memory Access (DMA) Commands - Get/Put Commands - Resources - SPE 2 SPE Communication

  3 - Exploiting Software Vulnerabilities on Cell SPE
    3.1 - Memory Overflows
        3.1.1 - SPE memory layout
        3.1.2 - SPE assembly basics - Registers - Local Storage Addressing Mode - External Devices - Instruction Set
	3.1.3 - Exploiting software vulnerabilities in SPE - Avoiding Null Bytes
	3.1.4 - Finding software vulnerabilities on SPE

  4 - Future and other uses

  5 - Acknowledgements

  6 - References

  7 - Notes on SDK/Simulator Environment

  8 - Sources 

------[ 1 - Introduction

This article is all about Cell Broadband Architecture Engine [1], a new
hardware designed by a joint between Sony [2], Toshiba [3] and IBM [4].

As so, lots of architecture details will be explained, and also many
development differences for this platform.

The biggest differentiator between this article and others released about
this subject, is the focus on the architecture exploitation and not the
use of the powerful processor resources to break code [5] and of course,
the focus in the differentiators of the architecture, which means the SPU
(synergestic processor unit) and not in the core (PPU - power processor
unit) [6], since the core is a small-modified power processor (which
means, all shellcodes for Linux on Power will also works for the core and
there is just small differences in the code allocation and stuffs like

It's important to mention that everything about Cell tries to focus in the
Playstation3 hardware, since it's cheap and widely deployed, but there is
also big machines made with this processor [7], including the #1 in the
list of supercomputers [8].

---[ 1.1 - Paper structure

The idea of this paper is to complete the studies about Cell, putting all
the information needed to do security research, focused in software
exploitation for this architecture together.

For that, the paper have been structured in two important portions:
Chapter 2 will be all about the Cell Architecture and how to develop for
this architecture.  It includes many samples and explains the
modifications done to Linux in order to get the best from this
architecture.  Also, it gives the knowledge needed in order to go further
in software exploitation for this arch.  Chapter 3 is focused in the
exploitation of the SPU processor, showing the simple memory layout it has
and how to write a shellcode for the purpose of gaining control over an
application running inside the SPU.

------[ 2 - Cell Broadband Engine Architecture

From the IBM Research [9]: "The Cell Architecture grew from a challenge
posed by Sony and Toshiba to provide power-efficient and cost-effective
high-performance processing for a wide range of applications, including
the most demanding consumer appliance: game consoles. Cell - also known as
the Cell Broadband Engine Architecture (CBEA) - is an innovative solution
whose design was based on the analysis of a broad range of workloads in
areas such as cryptography, graphics transform and lighting, physics,
fast-Fourier transforms (FFT), matrix operations, and scientific
workloads. As an example of innovation that ensures the clients' success,
a team from IBM Research joined forces with teams from IBM Systems
Technology Group, Sony and Toshiba, to lead the development of a novel
architecture that represents a breakthrough in performance for consumer
applications. IBM Research participated throughout the entire development
of the architecture, its implementation and its software enablement,
ensuring the timely and efficient application of novel ideas and
technology into a product that solves real challenges."

It's impossible to not get excited with this.  A so 'powerful' and
versatile architecture, completely different from what we usually seen is
an amazing stuff to research for software vulnerabilities.  Also, since
it's supposed to be widely deployed, there will be an infinite number of
new vulnerabilities coming on in the near future.  I wanted to exploit
those vulnerabilities.

---[ 2.1 - What is Cell

As must be already clear to the reader, I'm not talking about phones here.
Cell is a new architecture, which cames to solve some of the actual
problems in the computer industry.

It's compatible with a well-known architecture, which are the Power
Architecture, keeping most of it's advantages and solving most of it's
problems (if you cannot wait until know what problems, go to 2.2.1

---[ 2.2 - Cell History

The focus of this section is just to give a timeline vision for the
reader, not been detailed at all.

The architecture was born from a joint between IBM, Sony and Toshiba,
formed in 2000. 

They opened a design center in March 2001, based in Austin, Texas (USA).

In the spring of 2004, a single Cell BE became operational.  In the summer
of the same year, a 2-way SMP version was released.

The first technical disclosures came just in February 2005, with the
simulator [10] and open-source SDK [11] (more on that later) been released
in November of the same year.  In the same month, Mercury started to sell
Cell (yeah, sell Cell sounds funny) machines.

Cell Blades was announced by IBM in February of 2006.  The SDK 1.1 was
released in July of the same year, with many improvements.  The latest
version is 3.1.

---[ 2.2.1 - Problems it solves

The computer technology have been evolving along the years, but always
suffering and trying to avoid some barriers.

Those barriers are physically impossible to be bypassed and that's why the
processor clock stopped to grow and multi-core architectures been focused.

Basically we have three big walls (barriers) to the speedy grow:
	- Power wall
	It's related to the CMOS technology limits and the hard limit to 
	the acceptable system power

	- Memory wall
	Many comparisons and improvements trying to avoid the DRAM latency
	when compared to the processor frequency

	- Frequency wall
	Diminishing return from deeper pipelines

For a new architecture to work and be widely deployed, it was also
important to keep the investments in software development.

Cell accomplish that being compatible with the 64 bits Power Architecture,
and attacks the walls in the following ways:

	- Non-homogeneous coherent multi-processor and high design 
	  frequency at a low operating voltage with advanced power
	  management attacks the 'power wall'.
	- Streaming DMA architecture and three-level memory model (main 
	  storage, local storage and register files) attacks the 'memory 
	- Non-homogeneous coherent multi-processor, highly-optimized
implementation and large shared register files with software controlled
branching to allow deeper pipelines attacks the 'frequency wall'.

It have been developed to support any OS, which means it supports
real-time operating system as well non-real time operating systems.

---[ 2.2.2 - Basic Design Concept

The basic concept behind cell is it's asymmetric multi-core design.  That
permits a powerful design, but of course requires specific-developed
applications to achieve the most of the architecture.

Knowing that, becomes clear that the understanding of the new component,
which is called SPU (synergistic processor unit) or SPE (synergistic
processor element) proofs to be essential - see the next section for a
better understanding of the differences between SPU and SPE.

---[ 2.2.3 - Architecture Components

In cell what we have is a core processor, called Power Processor Element
(PPE) which control tasks and synergistic processor elements (SPEs) for
data-intensive processing.

The SPE consists of the synergistic processor unit (SPU), which are a
processor itself and the memory flow control (MFC), responsible for the
data movements and synchronization, as well for the interface with the
high-performance element interconnect bus (EIB).

Communications with the EIB are done in a 16B/cycle, which means that each
SPU is interconnected at that speedy with the bus, which supports

Refer to the picture architecture-components.jpg in the directory images
of the attached file for a visual of the above explanation.

---[ 2.2.4 - Processor Components

As said, the Power Processor Element (PPE) is the core processor which
control tasks (scheduling).  It is a general purpose 64 bit RISC processor
(Power architecture).

It's 2-way hardware multithreaded, with a L1: 32KB I and D caches and L2:
512KB cache.

Has support for real-time operations, like locking the L2 cache and the
TLB (also it supports managed TLB by hardware and software).  It has
bandwidth and resource reservation and mediated interrupts.

It's also connected to the EIB using a 16B/cycle channel (figure 

The EIB itself supports four 16 bytes data rings with simultaneous
transfers per ring (it will be clarified later).

This bus supports over 100 simultaneous transactions achieving in each bus
data port more than 25.6 Gbytes/sec in each direction.

On the other side, the synergistic processor element is a simple RISC
user-mode architecture supporting dual-issue VMX-like, graphics SP-float
and IEEE DP-float.

Important to note that the SPE itself has dedicated resources: unified 128
x 128 bit register files and 256KB local storage.  Each SPE has a
dedicated DMA engine, supporting 16 requests.

The memory management on this architecture simplified it's use, with the
local storage of the SPE being aliased into the PPE system memory (figure

MFC in the SPE acts as the MMU providing controls over the SPE DMA access
and it's compatible with the PowerPC Virtual Memory layout and is software
controllable using PPE MMIO.

DMA access supports 1,2,4,8...n*16 bytes transfer, with a maximum of 16 KB
for I/O, and with two different queues for DMA commands:  Proxy & SPU
(more on this later).

EIB is also connected in a broadband interface controller (BIC).  The
purpose of this controller is to provide external connectivity for
devices.  It supports two configurable interfaces (60 GB/s) with a
configurable number of bytes, coherent (BIF) and/or I/O (IOIFx) protocols,
using two virtual channels per interface, and multiple system

The memory interface controller (MIC) is also connected to the EIB and is
a Dual XDR controller (25.6 GB/s) with ECC and suspended DRAM support
(figure processor-components3.jpg).

Still are missing two more components:  The internal interrupt controller
(IIC) and the I/O Bus Master Translation (IOT) (figure

The IIC handles the SPE interrupts as well as the external interrupts and
interrupts comming from the coherent interconnect and the IOIF0 and IOIF1.
It is also responsible for the interrupt priority level control and for
the interrupt generation ports for IPI.  Note that the IIC is duplicated
for each PPE hardware thread.

IOT translates bus addresses to system real addresses, supporting two
level translations:
	- I/O segments (256 MB)
	- I/O pages (4K, 64K, 1M, 16M bytes)

Interesting is the resource of I/O device identifier per page for LPAR use
(blades) and IOST/IOPT caches managed by software and hardware.

---[ 2.3 - Debugging Cell

As the bus is a high-speedy circuit, it's really difficult to debug the
architecture and better seen what is going on.

For that, and also to made it easy to develop software for Cell, IBM
Research developed a Cell simulator [10] in which you may run Linux and
install the software development kit [11].

The IBM Linux Technology Center brazilian team developed a plugin for
eclipse as an IDE for the debugger and SDK.  Putting it all together is
possible to have the toolkit installed in a Linux machine, running the
frontends for the simulator and for the SDK.  The debugging interface is
much better using this frontends.  Anyway, it's important to notice that
it's just a frontend for the normal and well know linux tools with
extended support to Cell processor (GDB and GCC).

---[ 2.3.1 - Linux on Cell

Linux on cell is an open-source git branch and is provided in the PowerPC
64 kernel line.

It started in the 2.6.15 and is evolving to support many new features,
like the scheduling improvements for the SPUs (actually it can be
preempted, and my big friend Andre Detsch who reviewed this article was
one of the biggest contributors to create an stable code here).

On Linux it added heterogeneous lwp/thread model, with a new SPE thread
model (really similar to the pthreads library as we will see later),
supporting user-mode direct and indirect SPE access, full-preemptive SPE
context management and for that, spe_ptrace() was create and it's support
added to GDB, spe_schedule() for thread to physical spe assigment (it is
not anymore FIFO - run until completion).

As a note, the SPE threads shares it's address space with the parent PPE
process (using DMA), demand paging for SPE access and shared hardware page
table with PPE.

An implementation detail is the PPE proxy thread allocated for each SPE to
provide a single namespace for both PPE and SPE and assist in SPE
initiated C99 and Posix library services.

All the events, error and signal handling for SPEs are done by the parent
PPE thread.

The ELF objects for SPE are wrapped into PPE objects with an extended GLD.

---[ 2.3.2 - Extensions to Linux

Here I'll try to provide some details for Linux running under a Cell
Hardware.  The base hardware used for this reference is a Playstation 3,
which has 8 SPUs, but one is reserved with the purpose of redundancy and
another one is used as hypervisor for a custom OS (in this case, Linux).

All the details are valid for any Linux on Cell and we will provide an
top-down view approach.

---[ - User-mode

Cell supports both power 32 and 64 bits applications, as well as 32 and 64
cell workloads.  It has different programming modes, like RPC, devices
subsystems and direct/indirect access.

As already said, it has heterogeneous threads:  single SPU, SPU groups and
shared memory support.

It runs over a SPE management runtime library, with 32 and 64 bits.  This
library interacts with the SPUFS filesystem (/spu/thread#/) in the
following ways:
   * Open, close, read, write the files:
	- mem
	This file provides access to the local storage

	- regs
	Access to the 128 register of 128 bits each

	- mbox
	spe to ppe mailbox

	- liox
	spe to ppe interrupt mailbox

	- xbox_stat
	Get the mailbox status

	- signal1
	Signal notification acess

	- signal2
	Signal notification acess

	- signalx_type
	Signal type

	- npc
	Read/write SPE next program counter (for debugging)

	- fpcr
	SPE floating point control/status register	

	- decr
	SPE decrementer

	- decr_status
	SPE decrementer status

	- spu_tag_mask
	Access tag query mask

	- event_mask
	Access spe event mask

	- srr0
	Access spe state restore register 0

   * open, close mmap the files:
	- mem
	Program State access of the Local Storage
	- signal1
	Direct application access to signal 1

	- signal2
	Direct application access to signal 2

	- cntl
	Direct application access to SPE controls, DMA queues and

The library also provides SPE task control system calls (to interact with
the SPE system calls implemented in kernel-mode), which are:
	- sys_spu_create_thread
	Allocates a SPE task/context and creates a directory in SPUFS

	- sys_spu_run
	  Activates a SPU task/context on a physical SPE and
	  blocks in the kernel as a proxy thread to handle the events
	  already mentioned

Some functions provided by the library are related to the management of
the spe tasks, like spe create group, create thread, get/set affinity,
get/set context, get event, get group, get ls, get ps area, get threads,
get/set priority, get policy, set group defaults, group max, kill/wait,
open/close image, write signal, read in_mbox, write out_mbox, read mbox

Obviously the standard 32 and 64 bits powerpc ELF (binary) interpreters,
it is provided a SPE object loader, responsible for understand the
extension to the normal objects already mentioned and for initiate the
loading of the SPE threads.

Going down, we have the glibc and other GNU libraries, both supporting 32
and 64 bits.

---[ - Kernel-mode

The next layer is the normal system-call interface, where we have the SPU
management framework (through special files in the spufs) and
modifications in the exec* interface, in a 64bit kernel.

This modification is done through a special misc format binary, called SPU
object loader extension.  

Of course there is other kernel extensions, the SPUFS filesystem, which
provides the management interface and the SPU allocation, scheduling and

Also, we do have the Cell BE architecture specific code, supporting multi
and large pages, SPE event & fault handling, IIC and IOMMU.

Everything is controlled by a hypervisor, since Linux is what is called a
custom OS when running in a Playstation3 hardware (the hypervisor is
responsible for the protection of the 'secret key' of the hardware and
knowing how to exploit SPU vulnerabilities plus some fuzzing on the
hypervisor may be the needed knowledge to break the game protection copy
in this hardware).

---[ 2.3.3 - Debugging the SPE

The SDK for Linux on Cell provides good resources for Debugging and better
understanding of what is going on.

It's important to note the environment variables that control the
behaviour of the system.

So, if you set the SPU_INFO, for example, the spe runtime library will
print messages when loading a SPE ELF executable (see above).

---------- begin output ----------
	# export SPU_INFO=1
	# ./test
	Loading SPE program: ./test
	SPU LS Entry Addr  : XXX
---------- end   output ----------

And it will also print messages before starting up a new SPE thread, like:

---------- begin output ----------
	Starting SPE thread 0x..., to attach debugger use: spu-gdb -p XXX
---------- end   output ----------

When planning to use the spu-gdb to debug a SPU thread, it's important to
remember the SPU_DEBUG_START environment variable, which will include
everything provided by the SPU_INFO and will stop the thread until a
debugger is attached or a signal is received.

Since each SPU register can hold multiple fixed (or floating) point values
of different sizes, for GDB is provided a data structure that can be
accessed with different formats.  So, specifying the field in the data
structure, we can update it using different sizes as well:

---------- begin output ----------
(gdb) ptype $r70
type = union __gdb_builtin_type_vec128 {
	int128_t uint128;
	float v4_float[4];
	int32_t v4_int32[4];
	int16_t v8_int16[8];
	int8_t v16_int8[16];

(gdb) p $r70.uint128
$1 = 0x00018ff000018ff000018ff000018ff0
(gdb) set $r70.v4_int[2]=0xdeadbeef
(gdb) p $r70.uint128
$2 = 0x00018ff000018ff0deadbeef00018ff0
---------- end   output ----------

To permit you to better understand when the SPU code starts the execution
and follow it gdb also included an interesting option:

---------- begin output ----------
(gdb) set spu stop-on-load
(gdb) run
(gdb) info registers
---------- end   output ----------

Another important information for debugging your code is to understand the
internal sizes and be prepared for overlapping.  Useful information can
be get using the following fragment code inside your spu program (careful:
It's not freeing the allocated memory).

---   code   ---
extern int _etext;
extern int _edata;
extern int _end;

void meminfo(void)
	printf("\n&_etext: %p", &_etext);
	printf("\n&_edata: %p", &_edata);
	printf("\n&_end: %p", &_end);
	printf("\nsbrk(0): %p", sbrk(0));
	printf("\nmalloc(1024): %p", malloc(1024));
	printf("\nsbrk(0): %p", sbrk(0));
--- end code ---

And of course you can also play with the GCC and LD arguments to have more
debugging info:

---   code   ---
# vi Makefile
CFLAGS += -g
LDFLAGS += -Wl,-Map,map_filename.map
--- end code ---

---[ 2.4 - Software Development for Linux on Cell

In this chapter I will introduce the inners of the Cell development,
giving the basic knowledge necessary to better understand the further

---[ 2.4.1 - PPE/SPE hello world

Every program in Cell that uses the SPEs needs to have at least two source
codes.  One for the PPE and another one for the SPE.

Following is a simple code to run on the SPE (it's also in the attached
tar file :

---   code   ---
#include <stdio.h>

int main(unsigned long long speid, unsigned long long argp, unsigned long long envp)
	printf("\nHello World!\n");
	return 0;
--- end code ---

The Makefile for this code will look like:

---   code   ---
PROGRAM_spu     = hello_spu
LIBRARY_embed   = hello_spu.a
IMPORTS 	= $(SDKLIB_spu)/libc.a
include 	($TOP)/make.footer
--- end code ---

Of course it looks like any normal code.  The PPE as already explained is
the responsible for the creation of the new thread and allocation in the

---   code   ---
#include <stdio.h>
#include <libspe.h> 

extern spe_program_handle_t hello_spu;

int main(void)
	int speid, status;

	speid=spe_create_thread(0, &hello_spu, NULL, NULL, -1, 0);
	spe_wait(speid, &status, 1);
	return 0;
--- end code ---

With the following Makefile:

---   code   ---
DIRS 		= spu
PROGRAM_ppu 	= hello_ppu
IMPORTS 	= ../spu/hello_spu.a -lspe
include $(TOP)/make.footer
--- end code ---

The reader will notice that the speid in the PPE program will be the same
value as the speid in the main of the SPE.

Also, the arguments passed to the spe_create_thread() are the ones
received by the SPE program when running (argp and envp equals to NULL in
our sample).

Important to remember that when compiled this program will generate a
binary in the spu directory, called hello_spu and another one in the root
directory of this example called hello_ppu, which CONTAINS embedded the

---[ 2.4.2 - Standard Library Calls from SPE

When the SPE program needs to use any standard library call, like for
example, printf or exit, it has to call back to the PPE main thread.

It uses a simple stop-and-signal assembly instruction with standardized
arguments value (important to remember that since it's needed in
shellcodes for SPE).

That value is returned from the ioctl call and the user thread must react
to that.  This means copying the arguments from the SPE Local Storage,
executing the library call and then calling ioctl again.

The instruction according to the manual:
	"stop u14 - Stop and signal.  Execution is stopped, the current
	address is written to the SPU NPC register, the value u14 is
	written to the SPU status register, and an interrupt is sent to 
	the PPU."

This is a disassembly output of the hello_spu program:

---------- begin output ----------
# spu-gdb ./hello_spu 
GNU gdb 6.3
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "--host=powerpc64-unknown-linux-gnu --target=spu"...
(gdb) disassemble main
Dump of assembler code for function main:
0x00000170 <main+0>:    ila     $3,0x340 <.rodata>
0x00000174 <main+4>:    stqd    $0,16($1)
0x00000178 <main+8>:    nop     $127
0x0000017c <main+12>:   stqd    $1,-32($1)
0x00000180 <main+16>:   ai      $1,$1,-32
0x00000184 <main+20>:   brsl    $0,0x1a0 <puts> # 1a0
0x00000188 <main+24>:   ai      $1,$1,32        # 20
0x0000018c <main+28>:   fsmbi   $3,0
0x00000190 <main+32>:   lqd     $0,16($1)
0x00000194 <main+36>:   bi      $0
0x00000198 <main+40>:   stop
0x0000019c <main+44>:   stop
End of assembler dump.
---------- end   output ----------

---[ 2.4.3 - Communication Mechanisms

The architecture offers three main communications mechanism:
	- DMA 
	  Used to move data and instructions between main storage and
	  a local storage.  SPEs rely on asyncronous DMA transfers to hide
	  memory latency and transfer overhead by moving information in
	  parallel with SPU computation.

	- Mailbox
	  Used for control communications between a SPE and the
	  PPE or other devices.  Mailboxes holds 32-bit messages.  Each
	  SPE has two mailboxes for sending messages and one mailbox for
	  receiving messages.

	- Signal Notification
	  Used for control communications from PPE or
	  other devices.  Signal notification (also known as signalling)
	  uses 32-bit registers that can be configured for
	  one-sender-to-one-receiver signalling or
	  many-senders-to-one-receiver signalling.

All three are controlled and implemented by the SPE MFC and it's
importance is related to the way the vulnerable program will receive it's

---[ 2.4.4 - Memory Flow Control (MFC) Commands

This is the main mechanism for the SPE to access the main storage and
maintain syncronization with other processors and devices in the system.

MFC commands can be issued either by the SPE itself, or by the processor
and other devices, as follow:
	- A code running on the SPU issue a MFC command by executing a
	  series of writes and/or reads using channel instructions.
	- A code running on the PPU or any other device issue a MFC
	  command by performing a serie of stores and/or loads to
	  memory-mapped I/O (MMIO) registers in the MFC.

The MFC commands are then queued in one of those independent queues:
	- MFC SPU Command Queue - For channel-initiated commands by the
	  associated SPU
	- MFC Proxy Command Queue - For MMIO-initiated commands by the PPE
	  or other devices.

---[ 2.4.5 - Direct Memory Access (DMA) Commands

The MFC commands that transfers data are referred as DMA commands. The
transfer direction for DMA commands are based on the SPE point of view:
	- Into a SPE (from main storage to the local storage)   -> get
	- Out of a SPE (from local storage to the main storage) -> put

---[ - Get/Put Commands

DMA get from the main memory to the local storage:
	(void) mfc_get (volatile void *ls, uint64_t ea, uint32_t size,
			uint32_t tag, uint32_t tid, uint32_t rid)

DMA put into the main memory from the local storage:
	(void) mfc_put (volatile void *ls, uint64_t ea, uint32_t size,
			uint32_t tag, uint32_t tid, uint32_t rid)

To guarantee the synchronization of the writes to the main memory, there
is the options:
	- mfc_putf: the 'f' means fenced, or, that all commands executed
	  before within the same tag group must finish first, later ones
	  could be before

	- mfc_putb: the 'b' here means barrier, or, that the barrier
	  command and all commands issued thereafter are NOT executed
	  until all previously issued commands in the same tag group have
	  been performed

---[ - Resources

For DMA operations the system uses DMA transfers with variable length
sizes (1, 2, 4, 8 and n*16 bytes (n an integer, of course).  There is a
maximum of 16 KB per DMA transfer and 128b aligments offer better

The DMA queues are defined per SPU, with 16-element queue for
SPU-initiated requests and 8-element queue for PPU-initiated requests.
The SPU-initiated request has always a higher priority.

To differentiate each DMA command, they receive a tag, with a 5-bit
identifier.  Same identifier can be applied to multiple commands since
it's used for polling status or waiting on the completion of the DMA

A great feature provided is the DMA lists, where a single DMA command can
cause execution of a list of transfers requests (in local storage).  Lists
implements scatter-gather functions and may contain up to 2K transfer

---[ - SPE 2 SPE Communication

An address in another SPE local storage is represented as a 32-bit
effective address (global address).

SPE issuing a DMA command needs a pointer to the other SPE's local
storage.  The PPE code can obtain effective address of an SPE's local

---   code   ---
#include <libspe.h>

speid_t speid;
void *spe_ls_addr;
--- end code ---

This permits the PPE to give to the SPEs each other local addresses and
control the communications.  Vulnerabilities may arise don't matter what
is the communication flow, even without involving the PPE itself.

Follow is a simple DMA demo program between PPE and SPE (see the attached
file for the complete version) - This program will send an address in the
PPE to the SPE through DMA:

--- PPE code ---
information_sent	is[1] __attribute__ ((aligned 128)));	
spe_git_t		gid;
int * pointer=(int *)malloc(128);

gid=spe_create_group(SCHED_OTHER, 0, 1);

if (spe_group_max(gid) < 1 ) {
	printf("\nOps, there is no free SPE to run it...\n");

is[0].addr = (unsigned int) pointer;

/* Create the SPE thread */
speid=spe_create_thread (gid, &hello_dma, (unsigned long long *) &is[0], NULL, -1, 0);

/* Wait for the SPE to complete */
spe_wait(speids[0], &status[0], 0);

/* Best pratice:  Issue a sync before ending - This is good for us ;) */
__asm__ __volatile__ ("sync" : : : "memory");
--- end code ---

--- SPE code ---
information_sent	is __attribute__ ((aligned 128)));	

int main(unsigned long long speid, unsigned long long argp, unsigned long long envp)
	/* Where:
		is   ->  Address in local storage to place the data
		argp ->  Main memory address
		sizeof(is) -> Number of bytes to read
		31   ->  Associated tag to this DMA (from 0 to 31)
		0    ->  Not useful here (just when using caching)
		0    ->  Not useful here (just when using caching)
	mfc_get(&is, argp, sizeof(is), 31, 0, 0);

	mfc_write_tag_mask(1<<31); /* Always 1 left-shifted the value of your tag mask */

	/* Issue the DMA and wait until completion */
--- end code ---

And now between two SPEs (also for the complete code, please refer to the 
attached sources):

--- PPE code ---
speid_t speid[2]
speid[0]=spe_create_thread (0, &dma_spe1, NULL, NULL, -1, 0);
speid[1]=spe_create_thread (0, &dma_spe2, NULL, NULL, -1, 0);

for (i=0; i<2; i++) local_store[i]=spe_get_ls(speid[i]); /* Get local storage address */

for (i=0; i<2; i++) spe_kill(speid[i], SIGKILL); /* Send SIGKILL to the SPE 
threds */
--- end code ---

--- SPE code ---
/* Write something to the PPE */

/* Read something from the PPE */
pointer = spu_read_in_mbox();

/* DMA interface */
mfc_get(buffer, pointer, size, tag, 0, 0);

/* DMA something to the second SPE */
mfc_put(buffer, local_store[1], size, tag, 0, 0);

/* Notify the PPE */
--- end code ---

------[ 3 - Exploiting Software Vulnerabilities on Cell SPE

I love the architecture manuals and the engineers and the way they talk
about really dumb design choices:

"The SPU Local Store has no memory protection, and memory access wraps
from the end of the Local Store back to the beginning.  An SPU program is
free to write anywhere in the Local Store including its own instruction
space.  A common problem in SPU programming is the corruption of the SPU
program text when the stack area overflows into the program area.  This
problem typically does not become apparent until some later point in the
program execution when the program attempts to execute code in area that
was corrupted, which typically results in illegal instruction exception.
Even with a debugger it can be difficult to track down this type of
problem because the cause and effect can occur far apart in the program
execution.  Adding printf's just moves failure point around".

---[ 3.1 - Memory Overflows

In the aforementioned memory design of the SPU is already cleaver that
when an attacker controls the overwrite size it's really easy to exploit a
SPU vulnerability, just replacing the original program .text with the
attacker's one.

It's important to note that the SPU interrupt facility can be configured
to branch to an interrupt handler at address 0 if an external condition is
true (bisled - branch indirect and set link if external data is the
instruction used to check if there is external data available).  Since the
memory layout loops around, it's always possible to overwrite this handler
if it's been used.

Another important note is the fact that instructions on memory MUST be
aligned on word boundaries.

There is instruction and data caches for the local storage (depending on
the implementation details), so it's important to assure:
	- You are overflowing a large enough amount of data to avoid
	- You are not using a self-modifying shellcode unless you issue
	  the sync instruction (see [13] for references)

---[ 3.1.1 - SPE memory layout

The memory layout for the SPE looks like:

	------------------------  -> 0x3FFFF
	 SPU ABI Reserved Usage
	------------------------			| Stack grows from the
	     Runtime Stack				| higher addresses to 
	------------------------			| the lower addresses.
	       Global Data				| 
	------------------------		        \/
	------------------------  -> 0x00000

For the purpose of test your application, it's really interesting to use the
'size' application:

---------- begin output ----------
	# size hello_spu
	text	data	bss	dec	hex	filename
	1346	928	32	2306	902	hello_spu
---------- end   output ----------

---[ 3.1.2 - SPE assembly basics

It's important in order to develop a shellcode to understand the
differences in the SPE assembly when comparing to PowerPC.

The SPE uses risc-based assembly, which means there is a small set of 
instructions and everything in the SPE runs in user-mode (there is no 
kernel-mode for the SPE).  That said we need to remember there is no 
system-calls, but instead there is the PPE calls (stop instructions).

It is also a big endian architecture (keep that in mind while reading the
following sections).

This architecture provides many ways to avoid branches in the code for
maximum efficiency.  Since it's not a real problem while exploiting
software, I'll just avoid to talk about and will also avoid to talk about
SIMD instructions.  For more informations on that refer to the SPU
Instruction Set Architecture document [12].

---[ - Registers

I already explained a little about the way the architecture works and in
this section I'll just include what is the available register set and how
to use it .

The SPE does not define a conditional register, so the comparison
operations will set results that are either 0 (false) or 1 (true) with the
same width as the operands been tested.  This results are used to do
bitwise masking, instruction selection or conditional branching.

As any other platform, there is general purposes registers and special
purpose registers in the SPE:
	- General Purpose Registers (0-127) Used in different ways by the
	  instructions.  In the second word of R1 you have the information
	  about the amount of free space in the stack (the room between
	  end of the heap and the start of the stack).

	- Special Purpose Registers
	The SPE also supports 128 special-purpose registers.  Some
	interesting ones:
		* SRR0 - Save and Restore Register 0 - Holds the address 
		  used by the interrupt return (iret) instruction
		* LR - Link Register - All branch instructions that set 
		  the link register will force the address of the next
		  instruction to be loaded on this register
		* CTR - Count Register - Usually it's used to hold a loop
                  counter (like the loop instruction and %ecx register in
		  intel x86 architecture)
		* CR - Condition Register - Used to perform conditional

To move data between Special Purpose Registers and General Purpose
Registers we have the instructions
	* mtspr (move to special purpose register) mfspr (move from
	* special purpose register)

---[ - Local Storage Addressing Mode 

In order to address information to/from Local Storage the instructions 
uses the following structure:
	Instruction_Opcode  l10_field	RA_field 	RT_field
		8-bit	    10-bit	  7-bit		  7-bit

Where: The signed value of the l10 field is appended with 4 zeros and then
added to the preferred slot in the RA, forcing the 4-rightmost bits of the
sum to zero.  After, the 16 bytes of the local storage address are
inserted in the RT field. 

	Preferred slot for the architecture point of view are the leftmost
	4 bytes (not bits).

Important to note here that the IBM convention specifies that:
	l10 means a 10-bit immediate value
	RA means a general purpose register to be used as 
	RT means a general purpose register to be used as destination

Knowing that makes it easier to understand why the Local Storage Address
Space is limited to 4 GB.

The actual size of the Local Storage can be viewed accessing the LSLR
(local storage limit register).  All effective address are ANDed with the
value in the LSLR before used.

---[ - External Devices

The SPU can send/receive data to/from external devices using the channel
interface.  The channel instructions uses quadwords (128bits) to transfer
data to/from general purpose registers and the channel device (which
supports 128 channels).

---[ - Instruction Set 

Here are some useful instructions to be used while developing a shellcode
for the SPE.

Instruction		Operands		Description		
lqd (load quadword)	rt,symbol(ra)		load a value (16 bytes)
from Local Storage (pointed by RA to the general purpose register RT)
	lqd $0, 16($1)

stqd (store quadword)	rt,symbol(ra)		the contents of the
register (RT) are stored at the local storage address pointed by RA
	stqd	$0, 16($1)

ilh (immediate load halfword) rt,symbol		the value of l16 is placed
in register RT
	ilh $0, 0x1a0

il (immediate load word) rt, symbol		the value of l16 is
expanded to 32bits replicating the leftmost bit and then written to the RT
	il $0, 0x1a0

nop (no operation)	rt			this instruction uses a
false RT and nothing is changed
	nop $127

ila (immediate load address) rt, symbol		the value of the l18 is
placed in the rightmost 18bits of RT (the remaining bits of RT are zeroed)
	ila $3, 0x340

a (add word)		rt,ra,rb		the operand on register ra
is added to the operand on register rb and the result is written to RT
	a $0, $1, $2

ai (add word immediate)	rt,ra,value		the value (l10 field) is
added to the operand in ra and the result written to RT
	ai $1, $1, -32

brsl (branch relative and set link) rt,symbol	execution proceeds to the
target instruction and a link register is set (the symbol is a l16 type
and it is extended to the rigth with two 0 bits) - The address of the
current instruction is added to the symbol address for the branch.  The
address of the next instruction is written to the preferred byte of the RT
        brsl $0, 0x1a0

fsmbi (form select mask for bytes immediate) rt,symbol	the symbol is a
l16 value used to create a mask in the register RT copying eight times
each bit.  Bits in the operand are related to bytes in the result in a
left-to-right correspondence.  fsmbi $3, 0

bi (branch indirect)	ra			execution proceeds to the
preferred slot of RA.  The right two bits in the RA are ignored (supposed
to be zero).  There is two flags, D and E to disable and Enable
	bi $0

---[ 3.1.3 - Exploiting Software Vulnerabilities in SPE

First of all it's important to make it even more clear that it is
impossible to, for example, force the SPE process to execute a new command
(a.k.a. execve() shellcodes).  The same happens for network-based library
functions and others, as already explained we need the PPE to proxy that
for us.

So it open two new paths:
	- Create a PPE shellcode to be used while exploiting PPE software
	  vulnerabilities that will spawn a proxy for commands received by
	  the SPE and will create a SPE thread to do all the job -> This
	  is pure PPC shellcode and this article already discussed
	  everything needed to achieve that.  In the attached sources you
	  have samples in the directory cell-ppe/ [16].
	- Create a vulnerability specific code for the SPE, that will
	  print out internal program information related to the exploited
	  SPE.  This is specially interesting and difficult because:
		* Need to remember that the SPE uses instruction-cache, so
                  sometimes if you overflow just a small amount of bytes, 
		  it will be specially difficult to get it executed
		* If you use the wrap-around characteristics of the memory
                  layout for the SPE, you will probably overwrite also the
		  information you are interested in.

In the other hand, it's important to say that everything the information
will be in the same place (or easier to understand:  there is no ASLR in
the SPE).  Running the attached samples (specially the SPE-SPE
communications because it's printing the pointers addresses will make it
clear to the reader).

---[ - Avoiding Null Bytes

It is important to avoid null bytes, so we cannot use the NOP instruction
in our shellcode.

This creates a problem, since the ori instruction will also generate null
byte if used with 0 as an argument (e.g: ori $1, $1, 0).

A good replacement is the instruction or (e.g: or $1, $1, $1) or the usage
of multiple instructions (which will reduce the probability of your return

---[ 3.1.4 - Finding software vulnerabilities on SPE

The simulator provided by IBM has a feature that monitors selected
addresses or regions on the Local Store for read and write accesses.  This
feature can help identify stack overflows conditions.o

Invoked from the simulator command windows as follows:
	enable_stack_checking [spu_number] [spu_executable_filename]

This procedure uses the nm system utility to determine the area of the
Local Storage that will contain the program code and creates trigger
functions to trap writes by the SPU into this region.

Important to notice that this approach are just looking for writes in the
text and static data and not to the heap.  Of course the same approach
used by this feature could be used to help the creation of a fuzzer using
TCL scripts based on the one provided.

------[ 4 - Future and other uses

I can't foresee the future, but this kind of architectures are becoming
more and more common and will open a wide range of new vulnerabilities.

The complexity behind this kind of asymmetric multi-threaded architectures
are even higher than the normal ones.  The lack of memory protection will
help also the attackers on how to subvert those systems.  The main
processor been based on an already well-known architecture (powerpc) also
helps the dissemination of malicious codes.

Many other researchers are doing stuff using Cell:
	- Nick Breese presented on Crackstation project in BlackHat [5]
	  Basically he used the SIMD capabilities and big registers
	  provided by the architecture to crack passwords [5]

	- IBM Researchers released a study about the usage of the Cell SPU
	  as a Garbage Collector Co-processor [14]

	- Maybe there is JTAG-based interfaces on the cell machines to try
	  to use RiscWatch [15]

	- Unfortunelly the SPU access are controlled by the PPE so run
	  integrity protection mechanisms from SPU seens infeasible ->
	  Anyway, I wrote a network traffic analyzer using cell as base

------[ 5 - Acknowledgments

A lot of people helped me in the long way for these researches that
resulted in something funny to be published, you all know who you are.

Special thanks to the Phrack Staff for the great review of the article,
giving a lot of important insights about how to better structure it and
giving a real value to it.

I always need to thanks to Filipe Balestra, my research partner, for
sharing with me his ideas, feedbacks, comments and experiences improving a
lot the article and the samples.

I'll never ever forget to say thanks to my research team and friends at
RISE Security (http://www.risesecurity.org) for always keeping me
motivated studying completely new things.   Be sure that the unix-asm [16]
project will be updated soon with all the stuff showed here and much more
different types of shellcodes for the architecture.  Also, of course the
updates will be available for Metasploit.

Big thanks to the Cell Kernel guru, Andre Detsch for sharing with me his
ideas and discussing the internals of the Linux implementation for Cell.

Conference organizers who invited me to talk about Cell Software
Exploitation, even after many people already talked about Cell they
trusted that my talk was not about brute-forcing (yeah, a lot of fun in
completely different cultures).

To my girlfriend who waited for me (alone, I suppose) during this travels.

It's impossible to not say thanks to COSEINC, for let me keep doing this
research using important company time.

------[ 6 - References

[1] Cell Broadband Engine Architecture, v1.01 October 2006

[2] Sony Computer Entertainment

[3] Toshiba Corporation

[4] IBM Corporation

[5] Breese, Nick; "Crackstation"; Black Hat Europe 2008 

[6] IBM Power Architecture

[7] IBM Bladecenter QS21

[8] IBM Roadrunner Supercomputer

[9] The cell project at IBM Research

[10] Cell Simulator

[11] Cell resource center at developerWorks (SDK download)

[12] Synergistic Processor Unit Instruction Set Architecture v1.2

[13] Moore, H.D; "Mac OS X PPC Shellcode Tricks"; Uninformed Magazine 2005

[14] Cher, Chen-Yong; Gschwind, Michael; "Cell GC: Using the Cell Synergistic Processor as a Garbage Collector Coprocessor"; 2008

[15] RISCWatch Debugger

[16] Carvalho, Ramon de; "Cell PPE Shellcodes"; RISE Security;


PowerPC User Instruction Set Architecture, Book I, v2.02 January 2005

PowerPC Virtual Environment Architecture, Book II, v2.02 January 2005

PowerPC Operating Environment Architecture, Book III, v2.02 January 2005

Cell developer's corner at power.org

Linux info at the Barcelona Supercomputing Center website

------[ 7 - Notes on SDK/Simulator Environment

There is some pictures on the simulator and sdk running on the attached file:
	images/cell-sim1.jpg and images/cell-sim2.jpg

To install the SDK/Simulator, do:
	- Download the Cell SDK ISO image from the IBM alphaWorks website. 
	- Mount the disk image on the mount directory: mount -o loop
	  CellSDK<version>.iso /mnt/phrack
	- Change directory to /mnt/phrack/software:
	- Install the SDK by using the following command and answer any
	  prompts: ./cellsdk install
To start the simulator: cd /opt/IBM/systemsim-cell/run/cell/linux
../run_gui Click on the 'go' button to start the simulated system	

To copy files to the simulated system (inside it run):
	callthru source /home/bsdaemon/Phrack/hello_ppu > hello_ppu

Then give the correct permissions and execute:
	chmod +x hello_ppu

------[ 8 - Sources [cell_samples.tgz]

Attached all the samples used on this article to be compiled in a Linux
running on Cell machine.

Further updates will be available in the RISE Security website at:

For the author's public key:

begin 644 cell_samples.tgz

--------[ EOF
[ News ] [ Paper Feed ] [ Issues ] [ Authors ] [ Archives ] [ Contact ]
© Copyleft 1985-2021, Phrack Magazine.