18 juin 2018

gVisor in depth

In my previous blog post I described gVisor as 'some stuff I hardly can really understand'. 

Technology is not only about code, understanding where it comes from, why and how it has been done, do helps to understand the design decision and actual goals.

tl;dr: gVisor has been open-sourced recently but it has been running Google App Engine and Google Cloud Functions for years. It is a security sandbox for application, acting as a "Virtual kernel", but not relying on an hypervisor (unlike KataContainers). Now being open-source we can expect gVisor to support more application runtimes and being portable enough so it can replace Docker's runc at some point for those interested in this additional isolation level.

Being in San Francisco for DockerCon'18 I went to visit Google office to meet Googler Ludovic Champenois, and Google Developer Advocate David Gageot, who kindly explained me gVisor history and design. In the meantime some of the informations required to fully understand gVisor became public, so I now can blog on this topic. By the way, Ludo made a presentation on this topic at BreizhCamp, even gVisor name has not been used this was all about it.


gVisor introduce itself as "sandboxing for linux applications". To fully understand this, we should ask "Where does it come from" ?

I assume you already heard about Google App Engine. GAE was launched 10 years ago, and allowed to run Python application (then later Java) on google infrastructure for the cost of actually consumed resources. No virtual machine to allocate. Nothing to pay when application is not in use. If they did this in 2018, they probably would have named something like "Google Serverless Engine". 

Compared to other cloud hosting platform like Amazon, Google don't rely on virtual machines to isolate applications running on his infrastructure. They made this crazy bet they can provide enough security layers to directly run arbitrary user payload on a shared operating system.

A public cloud platform like Google Cloud is a privileged target for any hacker. in addition, GAE applications do run on the exact same Borg infrastructure as each and every Google services. So the need for security in depth, and Google did invest a lot in security. For sample, the hardware they use in DataCenters do include a dedicated security chip to prevent hardware/firmware backdoors. 

When GAE for java was introduced in 2009, it came with some restrictions. This wasn't the exact same JVM you used to run, but some curated version of it, with some API missing. Cause for those restrictions is for google engineers to analyse each and every low level feature of the JRE that would require some dangerous privileges on their infrastructure. Typically, java.lang.Thread was a problem. 

Java 7 support for GAE has been announced in 2013, 2 years after Java 7 was launched. Not because Google didn't wanted to support Java, nor because they're lazy, but because this one came with new internal feature invokedynamic. This one introduced a significant new attack surface and required a huge investment to implement adequate security barriers and protections. 

Then came Java 8, with lambdas and many other internal changes. And plans for Java 9 with modules was a promise for yet more complicated and brain-burner challenges to support Java on GAE. So they looked for another solution, and here started internal project that became gVisor.


gVisor code you can find on Google's github repository is the actual code running Google App Engine and Google Cloud Function (minus some Google specific pieces which are kept private and wouldn't make any sense outside Google infrastructure). 

When Kubernetes was launched, it was introduced as a simplified (re)implementation of Google's Borg architecture, designed for lower payloads (Borg is running *all* Google infra as a huge cluster of hundreds thousands nodes). gVisor isn't such a "let's do something similar in oss" project, but a proven solution, at least for payloads supported by Google Cloud platform. 

To better understand it's design and usage, we will need to get into details. Sorry if you get lost in following paragraph, if you don't care you can directly scroll down to the kitten. 

What's a kernel by the way ?

"Linux containers", like the ones you use to run with Docker (actually runc, default low level container runtime), but also LXC, Rkt or just systemd (yes, systemd is a plain container runtime, just require a way longer command line to setup :P), all are based on Linux kernel features to filter system calls, applying visibility and usage restrictions on shared system resources (cpu, memory, i/o). They all delegate to kernel responsibility to do this right, which as you can guess is far from being trivial and is the result of a decade of development by kernel experts.

Linux defines a "user-space" (ring 3) and 'kernel-space" (ring 0) as CPU execution levels. "rings" are protection levels implemented by hardware: on can get into a higher ring (during boot), but not the opposite, and each ring only can access a subset of hardware operations.

An application runs in user-space. Doing so there's many hardware related operation it can't use: for sample, allocating memory, which require interactions with hardware and is only available in kernel-space. To get some memory, application has to invoke a system call, a predefined procedure implemented by kernel. When application execute malloc, it actually delegates to kernel the related memory operation. Buy remember : there's no way to move from user-space to kernel-space, so this not just a function call.

System calls implementation depends on architecture. on intel architectures it relies on interruption, which is a signal the hardware uses to handle asynchronous tasks and external events, like timers, a key pressed on keyboard or incoming network packet. Software also can trigger some interruptions, and passing parameters to kernel relies on values set in CPU's registries.

When an interruption happens, the execution of the current program on the CPU is suspended, and a trap assigned to the interruption is executed in kernel-space. When the trap completes, the initial program is restored and follow up it's execution. As interruption only allows to pass few parameters, typically a system call number and some arguments, there's no risk for application to inject illegal code in kernel-space (as long as there's no bug in kernel implementation, typically a buffer overflow weakness).

Kernel trap handling the system call interruption will proceed to memory allocation. Doing so it can apply some for restrictions (so your application can't allocate more that xxx Mb as defined by control-group) and implement memory allocation on actual hardware.

What's wrong with that ? Nothing from a general point of view, this is a very efficient design, and system call mechanism acts as a very efficient filter ... as long as everything in kernel is done right. In real world software comes both with bugs and unexpected security design issues (not even considering hardware ones), so does the kernel. And as Linux kernel protections use by Linux Containers take place within kernel space, anything wrong here can be abused to break security barriers.

I you check number of CVE per year for linux kernel you will understand being a security engineer is a full time job.  Not that linux kernel is badly designed, just that a complex software used by billions devices and responsible to manage shared resources with full isolation on a large set of architectures is ... dawn a complex beast ! 

Congrats to Linux kernel maintainer by the way, they do an awesome job !

Google do have it's own army of kernel security engineers, maintaining a custom kernel : both on purpose for hardware optimisation and to enforce security by removing/replacing/strenghtening everything that may impact their infrastructure, also contributing to mainstream Linux kernel when it makes sense.

But that's still risky : if someone discover an exploit on linux kernel, he might not be smart enough to keep this private or could even try to hack Google. 

Additional isolation : better to be safe than sorry.

A possible workaround to this risk is to add one additional layer of isolation / abstraction : hypervisor isolation. 

To provide more abstraction, a Virtual Machine do rely on hardware capability (typically: intel VT-X) to offer yet another level of interruption based isolation. Let's see how malloc will operate when application runs inside a VM :

- application calls libC's malloc which actually invoke system call number 12 by triggering  an interruption.
- interruption is trapped in kernel-space as configured on hardware during operating system early stage boot.
- kernel access hardware to actually allocate some physical memory if legitimate. On bare metal the process would end here, but we are running in a VT-X enabled virtual machine
- as guest kernel is virtualized, it actually run on hosts as a user-space program. VT-X make it possible to have two parallel ring levels. So attempt to access hardware do trigger VMEXIT and let hypervisor to execute trapping instructions to act accordingly. in KVM architecture this means switching into hosts' user-mode as soon as possible (!) and use user-mode QEmu for hardware emulation.

Hypervisor is configured to trap this interruption, and translating the low level hardware access into some actual physical memory allocation, based on emulated hardware and Virtual Machine configuration. So when VM's kernel things it's allocating memory block xyz on physical memory, it's actually asking hypervisor to allocate on an emulated memory model, and hypervisor can detect an illegal memory range usage. security++

This second level of isolation would prevent a bug in virtual machine kernel to expose actual physical resources. It also ensure the resources management logic implemented by guest kernel is strictly limited to a set of higher-level allocated resources. Hacking both the kernel then the hypervisor is possible in theory, but extremely hard in practice.

KataContainers is an exact implementation of this idea : a docker image when ran by runV (KataContainers' alternative to Docker's runC) do use a KVM hypervisor to run a just-enough virtual machine so the container can start. And thanks to OpenContainerInitiative and docker's modular design you can switch from one to the other.

Google's wish list for application isolation

Google decided to explore another approach. a Virtual Machine comes with some footprint. With a dedicated kernel and hardware emulation, a significant amount of cpu/memory is consumed by translation, and attempts by guest kernel to optimise resource usage are non-sense without a full platform vision and duplicate host's kernel effort. When you run Billions containers, any useless byte has a cost.

On the other side, kernel-based isolation is far from being enough. They are part of a global solution, but Google needs more. Goole wanted to :

  • limit the kernel's attack surface : minimize lines of code involves, so potential bugs
  • limit the kernel's risk for bugs : rely on a structured language. They selected Go (some  advocate Rust would have been a better choice...)
  • limit the impact of kernel being hacked

Virtual Kernel to the rescue.

Google designed a "User-Space Thin Virtual Kernel" (this is how I call this, not sure about their own name for this concept). 

gVisor kernel is a tiny simple thing. It only implement a subset of Linux system calls (~250 over 400), and do this without any attempt to do some clever optimisations. This thin kernel is more or less a kernel firewall, and acts as a barrier to kernel exploits, for sample to prevent a buffer overflow.

buffer overflow is a security exploit relying on kernel to not detect some system call parameter do imply a larger amount of data to be written in some well known kernel-memory location. As a result the sibling kernel memory get overridden, and can allow hackers to execute some code in kernel mode. gVisor kernel is pretty simple in implementation, which drastically reduce the risk for such an attack to find. Linux Kernel in comparison is about thousands line of C code, with a significant attack surface whenever best experts review it's code on a regular basis.

Sounds crazy ? Look at this for a disruptive proof of abusing a credit card payment terminal being hacked by buffer overflow.

gVisor kernel do trap application system calls and (re)implement them as a kernel proxy on host's, without any hardware emulation / hypervisor. Being implemented in Go, it doesn't suffer the permissive C model which force developer to check for buffers size, allocated pointers, reference removal, etc. This for sur comes with some cost (typically: a garbage collector), I bet Google isn't using the standard Go compiler/runtime for internal use.
gVisor do only implement legitimate system calls for payload supported on Google App Engine. Java 8 support for Google App Engine in 2017 means that all system calls a JRE 8 require have been implemented by gVisor. It probably could run many other runtimes, but Google prefer to double check before any public announcement and commitment with customers.

But the most disruptive architecture decision with gVisor is for it to run this Thin Virtual Kernel in user-space. Some magic has to happen so that user program system call get actually trapped by Virtual Kernel running in user-space.

How to trap a system call in user-space ?

gVisor comes with a plugable platforms, offering two options : ptrace and kvm.

ptrace is documented as "reference" implementation on gVisor docs. One should read "portable", as sole guaranteed way to run gVisor on arbitrary Linux systems. ptrace is Linux system call debugger, it's designed to trap system calls in kernel-space and execute a user-space fonction in reaction.

Sounds good but devils is in the details: the actual design has some communication glitches which make it pretty inefficient when accessing large amounts of memory. Not an issue for a debugger, but a huge one for a container runtime. User Mode Linux was designed with this exact same idea, ans is mostly abandoned for bad performances.

The other option is kvm, so ... an hypervisor. This is claimed to be experimental, my guess is that google custom flavour of kvm and Linux kernel has been optimised for this usage. 

Who the hell will use gVisor ?

Anyone running Google App Engine or Google Cloud Functions for sure, but by design of the platform they don't know, and they don't have to care.

For others, without a portable, production ready platform, gVisor so far is "only" an interesting project, which tell us more about how Google do host random code on a shared infrastructure. If one want to run containers with kvm isolation, it's pretty unclear to me if gVisor is a better option vs KataContainer, as this one is public for a longer time with a larger community. On the other side, gVisor project already received request features and pull-requests to add more system-calls. Maybe this can help Google expand its Cloud platform to new application runtimes ?

The other option is for another platform to be implemented. Typically, Google new operating system Fuchsia, designed to run both on mobiles, IoT devices and clusters might be designed with this use-case in mind, offering an efficient syscall-to-userspace mechanism (or maybe using more ring levels ?).

Last but not least, gVisor project demonstrates creativity in an alternative approach. Someone might come with some fresh new idea using this new piece of software in combination with another feature, and build something unexpected... this happened already as Linux kernel had all those namespace and cgroup things, and some technology enthusiasts came with this emergent concept of "_containers_", creating a whole ecosystem and changing the way we build and deliver software today. 


1 commentaires:

The Road Never Travel a dit…

Is any information about kvm optimization to support gVisor? I have tested the performance of gVisor, and the performance is not very good when using normal kvm module.