The curious case of CVE-2020-14381

Today is the one-year anniversary of this interesting kernel bug I worked on last year with @bluec0re, and as it turns out I wrote something about it during one of these lockdown weekends so I thought I'd release it. The bug itself was discovered by Jann Horn of Project Zero. While I touch most of the elements required to exploit the bug, I stay superficial here since the exploit itself is not particularly exciting. What makes this bug interesting to me is its lifecycle, in particular how unevenly the patch was applied to the various distributions. I also talk briefly about hardware side-channels since it was the first time I had ever used one.

The bug

It’s already well-described in the bug tracker, but here is another summary. The futex syscall's main parameter is a userland address, and this address may belong to a file-backed mapping. In that case, the futex key kernel object took and kept a reference to the inode object, but didn’t hold a reference to the file’s mountpoint. If the mountpoint went away, its associated kernel structures would be freed, but the inode wouldn’t. That’s an issue because the inode itself has fields that point to some of these structures, such as its super_block struct.

Further use of the inode by futex code paths may therefore trigger use-after-frees. One particular code path highlighted by Jann in the bug happens when the futex is destroyed: the last reference to the inode is released and the inode needs to be freed. This is done in iput which then calls iput_final. iput_final and its subcalls will then call inode management functions stored in the super_operations struct accessed from the super_block object. The first instance happens right at the beginning of iput_final with a call to the drop_inode function.

Exploiting this bug requires being able to:

  • Successfully umount a mountpoint. A no-go a few years ago, but possible nowadays with the normalization of unprivileged user namespaces. It’s a good example of how this feature was never a trivial security tradeoff (unprivileged sandboxes vs. an augmented kernel attack surface), which in turn makes it somewhat surprising that all mainstream distributions enabled it by default without much debate
  • Survive the op->drop_inode() execution (non-SMEP or a KASLR bypass)
  • Survive the op->drop_inode indirection just before that (non-SMAP or a stack/heap leak)
  • Do everything in one call, because with an incorrect inode state, a corrupted super_block and some linked-list unlinks to do in the remainder of iput_final, it’s doubtful we can even get as far as the second super_operations function pointer call (evict_inode)


The first exploitation pathway that comes to mind goes as follows:

  • wait for the super_block to be freed. It’s done in an RCU callback so one way or another you need to wait for the end of the RCU grace period after umount returns, e.g. with membarrier. For a PoC, spraying allocs for the duration of the expedited grace period works well enough since the super_block slab, kmalloc-2k, is not super busy.
  • overwrite the freed super_block via a dynamic heap allocation primitive (e.g. sendmsg ancillary data).
  • point s_op to an attacker-controlled buffer
  • point drop_inode to a chain of gadgets that pivots the stack to either the super_block or the super_operations buffer (pointers to both are necessarily in registers, and their contents are almost fully controlled). Examples of common gadgets that would work in this situation are push reg; jmp/call [reg+x], which can then be chained with a pop rsp; ret gadget placed at [reg+x]
  • do whatever with your unconstrained ROP, fixup the stack and return

This would be a sucky exploit to maintain as it relies on precise knowledge of the kernel image, but that’s as good as it gets for a raw function pointer execution without a read primitive in kernel space. The portability issues for exploits like this are in themselves a significant bonus of SMEP: it rarely prevents exploitation but makes many candidates much less appealing for weaponization.

We can take SMEP for granted. It’s only one CPU generation / 2 years older than SMAP, but not having it is getting really rare. Plus if your exploit does rely on no-SMEP but your target ends up having software SMEP enabled, which you sometimes can't really tell at runtime, you've just turned a privesc attempt into a lost foothold. No-SMAP however is still a thing for the time being. As a random example the AWS EC2 CPU roster shows some CPUs that do not support SMAP.

On infoleak bugs

In any case, to exploit this bug one needs at least one infoleak. The most important one is the kernel base, for the gadgets; then we could use a heap leak or similar to support SMAP-capable CPUs (so that the "attacker-controlled buffer" from the third step above lives in kernel space). A heap/stack leak can often yield a .text address as well, so having one would kill two birds with one stone. But not everyone has the right infoleak in their stash ready to go, contrary to a common anti-KASLR argument. And even when you do have an infoleak bug, it doesn't mean it will help with your current exploit.

For instance, a good infoleak candidate released around the same time last year would be the uninitialized memory in coredumps, CVE-2020-10732. But short of a public proof-of-concept, one needs to understand the coredump generation code, then find an object in that slab that leaks .text, and another one from which to deduce a heap address you control. In short, at least as much work as the rest of the exploit we are looking at. And that's without considering that using two bugs in one exploit means taking both bugs' limitations into account: unprivileged user namespaces for the main bug we are looking at (not a thing on e.g. RHEL 7), and for the coredump, well, the ability to retrieve the core files, i.e. not running in a container. Luckily for our project, we already knew we were targeting non-SMAP containers, so we were able to avoid spending all that effort on an infoleak bug that would have ended up being worthless; a luxury that real exploit developers preparing capabilities ahead of time do not have. But if we had been targeting SMAP containers, that would have been the end of it, since the extra effort would have exceeded our resource budget for this project.

Hardware side-channels

For kernel .text however, the situation is different since there are generic, publicly-documented ways to obtain kernel base: hardware vulns. I personally hadn’t ever used any and even saw them as a niche exploitation technique relying on opaque CPU heuristics that don’t hold across models - not something to be considered for resilient exploits. I was simply wrong, but thankfully had access to many specialists (@tehjh, @_fel1x, @_tsuro) who knew better.

While side-channels that allow leaking memory across security boundaries are hopefully bound to be mitigated, there are many side-channels that leak addresses and which we haven’t heard much about since Spectre and friends. These ones are probably here to stay even longer. For this project I used Jump Over ASLR, which was published before Spectre in 2016. It’s simple to understand (especially with access to the aforementioned people) and there are PoCs that are just waiting to be adjusted to your own scenario (e.g. mario_baslr from @_fel1x). Jump Over ASLR relies on the inner workings of the Branch Target Buffer where user and kernel branches may collide. When that happens, the CPU has more work to do and that can be observed. This allows leaking kernel base as long as you have offsets of branches hit during a short kernel path you can trigger at will: you can then leverage the low entropy of KASLR to try all possible base addresses and find the one where the branches are hit.

For the parameters (the branches to measure) you can really use whatever you want. I only tried the creat syscall with arguments that cause a fast return to userland, and then measured whether the sys_creat and do_sys_open offsets had been hit. The offsets need to be fairly precise but not to the byte since there seems to be some aliasing going on in the branch predictor: I originally used __fentry__ as an additional branch target at a +5 offset for both symbols which still worked even though I later learned these calls get dynamically patched out.

With proper filtering of both false negatives and false positives (essentially double checking each address) this works like a charm on recent Intel CPUs, and it’s one of many such techniques that have been published in the past 6 years or so. That makes it something we should be able to rely on as exploit developers for the foreseeable future. So for a known kernel image at least, we are essentially back to pre-KASLR times - and keep in mind that it’s a field I know fairly poorly so other side-channels are probably even better.

Patch gap

Ok, here is what I personally found really interesting, because I had never looked into kernel bug timelines before. This bug was initially reported on February 28, 2020, and fixed in tip on March 3. At this point it’s essentially public for anyone keeping an eye out for interesting kernel patches - even if you don’t spend too much time on it, a Reported-by: Jann Horn tag is worth looking into. The main kernel lines were fixed either on March 25 or April 2. If you’re thinking “oh wow, one whole month”, please be seated for what’s coming.

Some distros applied the patch almost immediately:

  • Arch Linux: Mar 25
  • Gentoo: Mar 25
  • Fedora: Mar 26

I know these distros are not supposed to target workstations specifically, but outside of personal servers I don't think I have ever seen them used for anything else. The second batch of distributions that fixed the bug is arguably more server-oriented:

  • Ubuntu 18.04 LTS: Apr 7
  • Ubuntu 16.04 LTS: Apr 24
  • Debian Buster (stable): Apr 27
  • Debian Stretch (oldstable): Jul 6

Debian is trailing a bit, but all in all that’s within one month of the patch being released, which sounds reasonable considering the additional processes required to ensure better performance and stability guarantees. Oldstable, well, it’s old after all - it's just interesting to observe that Ubuntu's oldstable did a much better job for this one bug. Of course, that means that observant attackers had between 5 and 8 weeks to exploit the vulnerability on Ubuntu/Debian stable releases.

On May 7, the Project Zero bug is unrestricted so it actually becomes public for real. And around half a year later:

  • openSUSE: Oct 11
  • RHEL 8: Nov 4
  • Oracle Linux 8: Nov 10
  • CentOS 8: not sure, I stopped monitoring on Nov 19

So, attackers looking to exploit the bug after it was publicly disclosed had 5 months to exploit SUSE and between 6 and 7 months to exploit Red Hat and derivatives! Keeping in mind that some groups may have noticed the bug 2 months earlier, that's 7 and 8-9 months respectively. Of course, this assumes that the servers are updated and rebooted as soon as the patch is released - which is far from what happens on average.

If you want a really scary prospect, consider that the Linux kernel receives literally hundreds of commits every month - who knows how many of them fix bugs that are theoretically exploitable. In my opinion this goes to show one main thing: kernel forks which integrate patches based on cherry-picking strategies are doomed from a security perspective. It's not to throw stones at anyone, but there is simply no way a maintainer can properly triage all those commits to identify and apply every one that fixes a potential security problem. This bug is a great example of that process failing big time. And all enterprise-facing kernels seem to be maintained that way.

From the offensive standpoint, it was already useful to keep an oldish kernel exploit around given how rarely some companies both patch their kernels and reboot their machines. But what was really enlightening here was that even someone doing everything right could be exposed to an 8-month patch gap. It means that relying solely on N-days to keep a Linux privesc capability in your arsenal is a viable strategy - especially if you focus on bugs that no one talks about.

And I'll see you in 4 years for the next post I guess.