18 June 2018

gVisor in depth

In my previous blog post I described gVisor as 'some stuff I hardly can really understand'. 

Technology is not only about code. Understanding where it comes from, and why and how it was built, helps to understand the design decisions and the actual goals.

tl;dr: gVisor has been open-sourced recently but it has been running Google App Engine and Google Cloud Functions for years. It is a security sandbox for applications, acting as a "virtual kernel", but not relying on a hypervisor (unlike KataContainers). Now that it is open source, we can expect gVisor to support more application runtimes and to become portable enough to replace Docker's runc at some point, for those interested in this additional isolation level.


Being in San Francisco for DockerCon'18, I visited the Google office to meet Googler Ludovic Champenois and Google Developer Advocate David Gageot, who kindly explained gVisor's history and design to me. In the meantime, some of the information required to fully understand gVisor became public, so I can now blog on this topic. By the way, Ludo gave a presentation on this topic at BreizhCamp: even though the gVisor name was not used, this was all about it.

History

gVisor introduces itself as "sandboxing for Linux applications". To fully understand this, we should ask "Where does it come from?"

I assume you have already heard about Google App Engine. GAE was launched 10 years ago, and allowed you to run Python applications (then later Java) on Google infrastructure for the cost of the resources actually consumed. No virtual machine to allocate. Nothing to pay when the application is not in use. If they did this in 2018, they would probably have named it something like "Google Serverless Engine".

Compared to other cloud hosting platforms like Amazon, Google doesn't rely on virtual machines to isolate applications running on its infrastructure. They made this crazy bet that they could provide enough security layers to directly run arbitrary user payloads on a shared operating system.

A public cloud platform like Google Cloud is a prime target for any hacker. In addition, GAE applications run on the exact same Borg infrastructure as each and every Google service. Hence the need for security in depth, and Google did invest a lot in security. For example, the hardware they use in their data centers includes a dedicated security chip to prevent hardware/firmware backdoors.

When GAE for Java was introduced in 2009, it came with some restrictions. This wasn't the exact same JVM you were used to running, but a curated version of it, with some APIs missing. The reason for those restrictions is that Google engineers had to analyse each and every low-level feature of the JRE that would require some dangerous privileges on their infrastructure. Typically, java.lang.Thread was a problem.

Java 7 support for GAE was announced in 2013, 2 years after Java 7 was launched. Not because Google didn't want to support Java, nor because they're lazy, but because this release came with a new internal feature, invokedynamic. It introduced a significant new attack surface and required a huge investment to implement adequate security barriers and protections.

Then came Java 8, with lambdas and many other internal changes. And the plans for Java 9 with modules promised yet more complicated, brain-burning challenges to support Java on GAE. So they looked for another solution, and this is where the internal project that became gVisor started.

Status

The gVisor code you can find in Google's GitHub repository is the actual code running Google App Engine and Google Cloud Functions (minus some Google-specific pieces which are kept private and wouldn't make any sense outside Google's infrastructure).

When Kubernetes was launched, it was introduced as a simplified (re)implementation of Google's Borg architecture, designed for smaller workloads (Borg runs *all* of Google's infrastructure as a huge cluster of hundreds of thousands of nodes). gVisor isn't such a "let's do something similar in OSS" project, but a proven solution, at least for the payloads supported by Google Cloud Platform.


To better understand its design and usage, we will need to get into details. Sorry if you get lost in the following paragraphs; if you don't care, you can scroll directly down to the kitten.


What's a kernel, by the way?

"Linux containers", like the ones you run with Docker (actually runc, the default low-level container runtime), but also LXC, rkt or just systemd (yes, systemd is a plain container runtime, it just requires a much longer command line to set up :P), are all based on Linux kernel features to filter system calls and to apply visibility and usage restrictions on shared system resources (CPU, memory, I/O). They all delegate to the kernel the responsibility to do this right, which as you can guess is far from trivial and is the result of a decade of development by kernel experts.

Linux defines a "user-space" (ring 3) and a "kernel-space" (ring 0) as CPU execution levels. "Rings" are protection levels implemented by hardware: one can move to a higher-numbered, less privileged ring (as happens during boot), but not the other way around, and each ring can only access a subset of hardware operations.



An application runs in user-space. As such, there are many hardware-related operations it can't use: for example, allocating memory, which requires interactions with the hardware and is only available in kernel-space. To get some memory, the application has to invoke a system call, a predefined procedure implemented by the kernel. When the application executes malloc, it actually delegates the related memory operation to the kernel. But remember: there's no way to move from user-space to kernel-space, so this is not just a function call.

System call implementation depends on the architecture. On Intel architectures it relies on an interrupt, which is a signal the hardware uses to handle asynchronous tasks and external events, like timers, a key pressed on the keyboard or an incoming network packet. Software can also trigger some interrupts, and passing parameters to the kernel relies on values set in the CPU's registers.



When an interrupt happens, the execution of the current program on the CPU is suspended, and a trap assigned to the interrupt is executed in kernel-space. When the trap completes, the initial program is restored and resumes its execution. As an interrupt only allows passing a few parameters, typically a system call number and some arguments, there's no risk of the application injecting illegal code into kernel-space (as long as there's no bug in the kernel implementation, typically a buffer overflow weakness).

The kernel trap handling the system call interrupt will proceed with the memory allocation. In doing so it can apply some restrictions (so your application can't allocate more than xxx MB, as defined by its control group) and implement the memory allocation on actual hardware.
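To make this a bit more concrete, here is a toy model of the mechanism, written in Python only because it is easy to read. Nothing here is real kernel code: ToyKernel, toy_malloc and the "registers" dictionary are names I made up, the real brk takes an address rather than a size, and the memory limit is just a stand-in for what a control group would enforce. The only point is that the application can't do anything except fill a few registers and trigger the trap.

```python
# Toy model of a system call: NOT real kernel code, every name is illustrative.
SYS_BRK = 12  # brk's number on x86-64 Linux; the only system call this toy knows


class ToyKernel:
    """Pretend kernel: owns the memory accounting, reachable only through trap()."""

    def __init__(self, memory_limit):
        self.memory_limit = memory_limit  # stand-in for a cgroup memory limit
        self.allocated = 0

    def trap(self, registers):
        """Entry point executed 'in kernel-space' when the interrupt fires."""
        handlers = {SYS_BRK: self._sys_brk}
        handler = handlers.get(registers["number"])
        if handler is None:
            return -1  # unknown system call number
        return handler(registers["arg0"])

    def _sys_brk(self, extra_bytes):
        # The kernel, not the application, decides whether the request is legal.
        if extra_bytes < 0 or self.allocated + extra_bytes > self.memory_limit:
            return -1
        self.allocated += extra_bytes
        return self.allocated


def toy_malloc(kernel, size):
    """What a libc-like wrapper does: fill the 'registers', then trigger the trap."""
    registers = {"number": SYS_BRK, "arg0": size}
    return kernel.trap(registers)


kernel = ToyKernel(memory_limit=100)
print(toy_malloc(kernel, 60))  # fits within the limit
print(toy_malloc(kernel, 60))  # would exceed the limit, so it is refused
```

The application side only ever sees a number coming back: all the decisions are taken on the other side of the trap.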

What's wrong with that? Nothing from a general point of view: this is a very efficient design, and the system call mechanism acts as a very efficient filter... as long as everything in the kernel is done right. In the real world, software comes with both bugs and unexpected security design issues (not even considering hardware ones), and so does the kernel. And as the Linux kernel protections used by Linux containers take place within kernel-space, anything wrong there can be abused to break security barriers.

If you check the number of CVEs per year for the Linux kernel, you will understand that being a security engineer is a full-time job. Not that the Linux kernel is badly designed, just that a complex piece of software used by billions of devices and responsible for managing shared resources with full isolation on a large set of architectures is... damn, a complex beast!

Congrats to the Linux kernel maintainers by the way, they do an awesome job!

Google has its own army of kernel security engineers maintaining a custom kernel, both for hardware optimisation and to enforce security by removing/replacing/strengthening everything that may impact their infrastructure, while also contributing to the mainline Linux kernel when it makes sense.

But that's still risky: if someone discovers an exploit in the Linux kernel, they might not be smart enough to keep it private, or could even try to hack Google.


Additional isolation: better safe than sorry.

A possible workaround for this risk is to add one more layer of isolation / abstraction: hypervisor isolation.

To provide more abstraction, a virtual machine relies on a hardware capability (typically Intel VT-x) to offer yet another level of interrupt-based isolation. Let's see how malloc operates when the application runs inside a VM:

- the application calls libc's malloc, which actually invokes system call number 12 (brk on x86-64) by triggering an interrupt.
- the interrupt is trapped in kernel-space, as configured on the hardware during the early stage of the operating system boot.
- the kernel accesses the hardware to actually allocate some physical memory if legitimate. On bare metal the process would end here, but we are running in a VT-x enabled virtual machine.
- as the guest kernel is virtualized, it actually runs on the host as a user-space program. VT-x makes it possible to have two parallel ring levels. So an attempt to access the hardware triggers a VMEXIT and lets the hypervisor handle the trapped instructions and act accordingly. In the KVM architecture this means switching into the host's user mode as soon as possible (!) and using user-mode QEMU for hardware emulation.



The hypervisor is configured to trap this interrupt and to translate the low-level hardware access into some actual physical memory allocation, based on the emulated hardware and the virtual machine configuration. So when the VM's kernel thinks it's allocating memory block xyz on physical memory, it's actually asking the hypervisor to allocate on an emulated memory model, and the hypervisor can detect illegal memory range usage. security++

This second level of isolation prevents a bug in the virtual machine's kernel from exposing actual physical resources. It also ensures the resource management logic implemented by the guest kernel is strictly limited to a set of higher-level allocated resources. Hacking both the kernel and then the hypervisor is possible in theory, but extremely hard in practice.
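Here is how I picture this second check, again as a toy in Python. ToyHypervisor and ToyGuestKernel are names invented for this sketch, and a real hypervisor relies on hardware page tables rather than an addition, so take it as an illustration of the idea only: whatever the guest kernel believes, the range check is done outside of it.

```python
class ToyHypervisor:
    """Maps a guest's 'physical' addresses onto a slice of host memory."""

    def __init__(self, host_base, guest_size):
        self.host_base = host_base    # where this guest's memory starts on the host
        self.guest_size = guest_size  # how much memory the VM was configured with

    def translate(self, guest_addr, length):
        # Second line of defence: this runs outside the guest, so a buggy or
        # compromised guest kernel cannot skip the check.
        if guest_addr < 0 or length < 0 or guest_addr + length > self.guest_size:
            raise MemoryError("guest access outside of its configured memory")
        return self.host_base + guest_addr


class ToyGuestKernel:
    """Guest kernel: believes it owns physical memory, really asks the hypervisor."""

    def __init__(self, hypervisor):
        self.hypervisor = hypervisor

    def allocate(self, guest_addr, length):
        return self.hypervisor.translate(guest_addr, length)


hypervisor = ToyHypervisor(host_base=0x10000, guest_size=0x1000)
guest = ToyGuestKernel(hypervisor)

print(hex(guest.allocate(0x200, 0x100)))  # legitimate range, mapped into the host slice
try:
    guest.allocate(0xF80, 0x100)  # runs past the end of the guest's memory
except MemoryError as error:
    print("blocked by the hypervisor:", error)
```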

KataContainers is an exact implementation of this idea: a Docker image, when run by runV (KataContainers' alternative to Docker's runc), uses a KVM hypervisor to run a just-enough virtual machine so the container can start. And thanks to the Open Container Initiative and Docker's modular design, you can switch from one to the other.

Google's wish list for application isolation


Google decided to explore another approach. A virtual machine comes with some footprint. With a dedicated kernel and hardware emulation, a significant amount of CPU/memory is consumed by translation, and attempts by the guest kernel to optimise resource usage make no sense without a view of the full platform and duplicate the host kernel's effort. When you run billions of containers, any useless byte has a cost.


On the other side, kernel-based isolation is far from being enough. It is part of a global solution, but Google needs more. Google wanted to:

  • limit the kernel's attack surface: minimize the lines of code involved, and so the potential bugs
  • limit the kernel's risk of bugs: rely on a structured language. They selected Go (some advocate Rust would have been a better choice...)
  • limit the impact of the kernel being hacked

Virtual Kernel to the rescue.

Google designed a "User-Space Thin Virtual Kernel" (this is what I call it, not sure about their own name for this concept).

The gVisor kernel is a tiny, simple thing. It only implements a subset of Linux system calls (~250 out of 400), and does this without any attempt at clever optimisations. This thin kernel is more or less a kernel firewall, and acts as a barrier to kernel exploits, for example to prevent a buffer overflow.

A buffer overflow is a security exploit relying on the kernel not detecting that some system call parameter implies a larger amount of data to be written to some well-known kernel memory location. As a result the adjacent kernel memory gets overwritten, which can allow hackers to execute some code in kernel mode. The gVisor kernel is pretty simple in implementation, which drastically reduces the risk of such an attack being found. The Linux kernel in comparison is millions of lines of C code, with a significant attack surface, even though the best experts review its code on a regular basis.

Sounds crazy? Look at this for a striking demonstration of a credit card payment terminal being hacked through a buffer overflow.
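If you have never seen one, a buffer overflow can be sketched in a few lines. This is a simulation in Python over a plain bytearray, with function names I made up (unchecked_write, checked_write): real kernel memory obviously doesn't look like this, but the mistake is the same, trusting the length that comes with the data.

```python
# 16 bytes of pretend kernel memory: an 8-byte buffer followed by 8 bytes
# of neighbouring data that the caller must never be able to touch.
BUFFER_SIZE = 8


def new_memory():
    return bytearray(b"\x00" * BUFFER_SIZE + b"SECRET!!")


def unchecked_write(memory, data):
    """Trusts the caller: copies everything, whatever its length."""
    memory[0:len(data)] = data


def checked_write(memory, data):
    """Refuses anything that does not fit in the destination buffer."""
    if len(data) > BUFFER_SIZE:
        raise ValueError("payload larger than the destination buffer")
    memory[0:len(data)] = data


payload = b"A" * 12  # 4 bytes more than the buffer can hold

victim = new_memory()
unchecked_write(victim, payload)
print(victim[BUFFER_SIZE:])  # part of the neighbouring data has been overwritten

safe = new_memory()
try:
    checked_write(safe, payload)
except ValueError as error:
    print("rejected:", error)
print(safe[BUFFER_SIZE:])  # neighbouring data is intact
```

In C, nothing forces you to write the check. A language like Go does the bounds check for you at runtime on every slice access, which is one of the reasons it is attractive for this kind of code.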



The gVisor kernel traps application system calls and (re)implements them as a kernel proxy on the host, without any hardware emulation / hypervisor. Being implemented in Go, it doesn't suffer from the permissive C model, which forces the developer to check buffer sizes, allocated pointers, freed references, etc. This for sure comes with some cost (typically a garbage collector); I bet Google isn't using the standard Go compiler/runtime for internal use.
gVisor only implements the legitimate system calls for the payloads supported on Google App Engine. Java 8 support for Google App Engine in 2017 means that all the system calls a JRE 8 requires have been implemented by gVisor. It could probably run many other runtimes, but Google prefers to double check before any public announcement and commitment to customers.
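The "kernel proxy implementing only a subset" idea can be pictured like this. It is a toy in Python, not gVisor code (gVisor is written in Go and works on real system call numbers): ToyVirtualKernel, ToyHost and the two supported calls are all assumptions of mine, chosen to show that anything outside the reviewed list never reaches the host.

```python
import errno


class ToyHost:
    """Stands in for the real host kernel; records what actually reaches it."""

    def __init__(self):
        self.calls = []

    def write(self, fd, data):
        self.calls.append(("write", fd, data))
        return len(data)


class ToyVirtualKernel:
    """User-space 'kernel': re-implements a few calls, refuses everything else."""

    def __init__(self, host):
        self.host = host
        # Only the system calls someone reviewed and re-implemented are listed.
        self.table = {
            "getpid": self._getpid,
            "write": self._write,
        }

    def syscall(self, name, *args):
        handler = self.table.get(name)
        if handler is None:
            return -errno.ENOSYS  # "function not implemented"
        return handler(*args)

    def _getpid(self):
        return 1  # answered locally, the host is never asked

    def _write(self, fd, data):
        if fd not in (1, 2):  # only stdout/stderr are proxied in this toy
            return -errno.EBADF
        return self.host.write(fd, data)


host = ToyHost()
sandbox = ToyVirtualKernel(host)
sandbox.syscall("write", 1, b"hello")      # proxied to the host
sandbox.syscall("mount", "/dev/sda", "/")  # not in the table, refused
print(host.calls)  # only the write made it through
```

Adding support for a new runtime then means adding (and reviewing) entries in that table, which matches the careful pace described above.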

But the most disruptive architecture decision in gVisor is to run this Thin Virtual Kernel in user-space. Some magic has to happen so that the user program's system calls actually get trapped by the Virtual Kernel running in user-space.

How to trap a system call in user-space?

gVisor comes with pluggable platforms, offering two options: ptrace and kvm.

ptrace is documented as the "reference" implementation in the gVisor docs. One should read "portable", as it is the sole guaranteed way to run gVisor on arbitrary Linux systems. ptrace is the Linux system call used by debuggers: it's designed to trap system calls in kernel-space and execute a user-space function in reaction.


Sounds good, but the devil is in the details: the actual design has some communication glitches which make it pretty inefficient when accessing large amounts of memory. Not an issue for a debugger, but a huge one for a container runtime. User Mode Linux was designed with this exact same idea, and is mostly abandoned because of its poor performance.

The other option is kvm, so... a hypervisor. This is claimed to be experimental; my guess is that Google's custom flavour of kvm and of the Linux kernel has been optimised for this usage.

Who the hell will use gVisor?

Anyone running Google App Engine or Google Cloud Functions for sure, but by design of the platform they don't know, and they don't have to care.

For others, without a portable, production-ready platform, gVisor is so far "only" an interesting project, which tells us more about how Google hosts random code on a shared infrastructure. If one wants to run containers with kvm isolation, it's pretty unclear to me whether gVisor is a better option than KataContainers, as the latter has been public for longer and has a larger community. On the other hand, the gVisor project has already received feature requests and pull requests to add more system calls. Maybe this can help Google expand its Cloud platform to new application runtimes?

The other option is for another platform to be implemented. Typically, Google's new operating system Fuchsia, designed to run on mobiles, IoT devices and clusters alike, might be designed with this use case in mind, offering an efficient syscall-to-userspace mechanism (or maybe using more ring levels?).

Last but not least, the gVisor project demonstrates creativity in an alternative approach. Someone might come up with some fresh new idea using this new piece of software in combination with another feature, and build something unexpected... This already happened when the Linux kernel had all those namespace and cgroup things, and some technology enthusiasts came up with this emergent concept of "_containers_", creating a whole ecosystem and changing the way we build and deliver software today.


40 comments:

The Road Never Travel said…

Is there any information about kvm optimizations to support gVisor? I have tested the performance of gVisor, and it is not very good when using the normal kvm module.

Hyokyung said…

Thank you for this awesome post! It was really helpful :)

jamsroot888 said…

hi
