20 juin 2018

multi-release jar with Maven

Java 9 is here, and comes with some "surprises". So do Java 10, 11 ...

A recurring problem I have as a library developer is that I'd like to use some recent APIs but still need to support users on a not-so-recent JRE. Java 9 makes this even harder with the introduction of Java modules, and a very common issue is getting a warning like this at runtime:

WARNING: Illegal reflective access by com.foo.Bar to field org.Zot.qix

The issue here is that Java 9 doesn't just deprecate a method, it makes the legacy reflection model obsolete and warns you because this will be strictly unsupported in a future release. This impacts many popular frameworks: Spring, Hibernate, Guava ... (and Jenkins for sure). This is the kind of backward-incompatible change we will need to live with, as more will come with future versions of the Java platform.

There's a workaround for such issues, relying on a brand new API introduced by Java 9 (VarHandles for this specific reflection problem), but does this mean your favourite framework will only support Java 9+ for new releases?
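For the reflection problem specifically, here is a minimal sketch of what the VarHandle-based replacement looks like. The Counter class and its value field are invented for the example, and this requires Java 9+:

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;

public class VarHandleDemo {

    static class Counter {
        private int value = 42;
    }

    // Reads Counter.value through a VarHandle instead of
    // Field.setAccessible(), so no illegal-access warning on Java 9+.
    static int readValue(Counter c) throws Exception {
        VarHandle h = MethodHandles
                .privateLookupIn(Counter.class, MethodHandles.lookup())
                .findVarHandle(Counter.class, "value", int.class);
        return (int) h.get(c);
    }

    public static void main(String[] args) throws Exception {
        System.out.println(readValue(new Counter())); // prints 42
    }
}
```

Note that privateLookupIn only succeeds when the caller is legitimately allowed to access the target class, which is exactly the point: access is checked once, up front, instead of being bypassed.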

For example, this code was used by Jenkins for a while:

import java.lang.reflect.Field;

public class ProcessUtil {
  public static long getPid(Process p) {
    try {
      // "pid" is a private field of the UNIXProcess implementation
      Field f = p.getClass().getDeclaredField("pid");
      f.setAccessible(true);
      return ((Number) f.get(p)).longValue();
    } catch (ReflectiveOperationException e) {
      return -1;
    }
  }
}

This (ab)use of reflection to access the Process pid attribute can be replaced in Java 9 by a brand new API.

public class ProcessUtil {
  public static long getPid(Process p) {
    try {
      return p.pid(); // Process.pid() is new in Java 9
    } catch (UnsupportedOperationException e) {
      return -1;
    }
  }
}

If we want Jenkins to run on Java 9 we need to replace the legacy ProcessUtil implementation with this new code. But on the other hand we still want Jenkins to run on Java 8.

Here comes JEP 238 "Multi-Release JAR Files". The idea is to bundle in a JAR several implementations of the exact same class, targeting distinct Java releases. Anything before Java 9 will pick the plain old class file, but Java 9 will also look into META-INF/versions/9, Java 10 into META-INF/versions/10, and so on. So we can write the ProcessUtil class twice, for Java 8 and Java 9, get both included in the JAR, and the right one used according to the platform which actually runs the code.
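One detail worth noting: JEP 238 also requires an explicit Multi-Release: true attribute in the JAR manifest, otherwise the versioned directories are ignored. The resulting JAR looks roughly like this (com/foo/ProcessUtil is just an illustrative path):

```
foo.jar
├── com/foo/ProcessUtil.class                        Java 8 bytecode
├── META-INF/MANIFEST.MF                             contains "Multi-Release: true"
└── META-INF/versions/9/com/foo/ProcessUtil.class    Java 9 bytecode
```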

Looks good, but now comes the funny part: how do you write and bundle a class file twice in a JAR?

JetBrains' IntelliJ IDEA, which I'm using, doesn't support setting a distinct Java level per source folder, and neither does Maven (see MCOMPILER-323), so I can't adopt a Maven project structure like this one:

So I had to convert the library into a multi-module Maven project, one of the sub-modules being dedicated to re-implementing some classes for Java 9:

And here comes a Maven chicken-and-egg issue. The classes we want to re-implement with Java 9 APIs rely on some classes defined by the main library as type references. So core has to be built first by Maven, then java9. But we still want to distribute a single JAR, with a single POM deployed to public repositories.

My current setup for this scenario is to let Maven think I'm building a multi-module JAR, then hack the build lifecycle to get the Java 9 classes bundled into the "core" JAR. For this purpose, I had to rely on some Ant tasks in my pom.xml:

                <mkdir dir="${project.build.outputDirectory}/META-INF/versions/9"/>
                <javac classpath="${project.build.outputDirectory}" destdir="${project.build.outputDirectory}/META-INF/versions/9" includeantruntime="false" source="9" srcdir="../java9/src/main/java" target="9"/>

This hack runs the Java 9 compilation on the sibling "java9" source directory from within the core Maven module. As a result I can deploy artifacts from this single module without polluting my pom.xml with unnecessary sub-module dependencies.
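For context, those Ant tasks might be wired into the core module's build with the maven-antrun-plugin, along these lines. The prepare-package phase binding is my assumption, not necessarily the exact setup used:

```
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-antrun-plugin</artifactId>
  <executions>
    <execution>
      <phase>prepare-package</phase>
      <goals><goal>run</goal></goals>
      <configuration>
        <target>
          <mkdir dir="${project.build.outputDirectory}/META-INF/versions/9"/>
          <javac classpath="${project.build.outputDirectory}"
                 destdir="${project.build.outputDirectory}/META-INF/versions/9"
                 includeantruntime="false"
                 source="9" target="9"
                 srcdir="../java9/src/main/java"/>
        </target>
      </configuration>
    </execution>
  </executions>
</plugin>
```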

The java9 module is configured as a Java 9 JAR module so my IDE detects it accordingly, and it depends on the core module, so I can access all the types needed to re-implement the classes I want to replace.

Yes, this is a hack, but as it took me some time to get this running I thought it could be useful to others. You can see this in action on a tiny library I created to offer a Java 9-compliant way to access private fields by reflection, on all versions of Java: https://github.com/ndeloof/fields

18 juin 2018

gVisor in depth

In my previous blog post I described gVisor as 'some stuff I hardly can really understand'. 

Technology is not only about code: understanding where it comes from, and why and how it was built, helps to understand the design decisions and actual goals.

tl;dr: gVisor has been open-sourced recently but it has been running Google App Engine and Google Cloud Functions for years. It is a security sandbox for applications, acting as a "virtual kernel", but not relying on a hypervisor (unlike KataContainers). Now that it is open source we can expect gVisor to support more application runtimes and become portable enough to replace Docker's runc at some point, for those interested in this additional isolation level.

Being in San Francisco for DockerCon'18 I went to visit the Google office to meet Googler Ludovic Champenois and Google Developer Advocate David Gageot, who kindly explained gVisor's history and design to me. In the meantime some of the information required to fully understand gVisor became public, so I can now blog on this topic. By the way, Ludo gave a presentation on this topic at BreizhCamp; even though the gVisor name was not used, it was all about it.


gVisor introduces itself as "sandboxing for Linux applications". To fully understand this, we should ask: where does it come from?

I assume you already heard about Google App Engine. GAE was launched 10 years ago, and allowed you to run Python applications (and later Java) on Google infrastructure for the cost of the resources actually consumed. No virtual machine to allocate. Nothing to pay when the application is not in use. If they launched this in 2018, they probably would have named it something like "Google Serverless Engine".

Compared to other cloud hosting platforms like Amazon, Google doesn't rely on virtual machines to isolate applications running on its infrastructure. They made the crazy bet that they could provide enough security layers to directly run arbitrary user payloads on a shared operating system.

A public cloud platform like Google Cloud is a privileged target for any hacker. In addition, GAE applications run on the exact same Borg infrastructure as every other Google service. Hence the need for security in depth, and Google invested a lot in security. For example, the hardware they use in data centers includes a dedicated security chip to prevent hardware/firmware backdoors.

When GAE for Java was introduced in 2009, it came with some restrictions. This wasn't the exact same JVM you were used to running, but a curated version of it, with some APIs missing. The cause of those restrictions is that Google engineers had to analyse every low-level feature of the JRE that would require dangerous privileges on their infrastructure. Typically, java.lang.Thread was a problem.

Java 7 support for GAE was announced in 2013, two years after Java 7 was launched. Not because Google didn't want to support Java, nor because they're lazy, but because this release came with a new internal feature: invokedynamic. It introduced a significant new attack surface and required a huge investment to implement adequate security barriers and protections.

Then came Java 8, with lambdas and many other internal changes. And the plans for Java 9 with modules promised yet more complicated, brain-burning challenges to support Java on GAE. So they looked for another solution, and here started the internal project that became gVisor.


The gVisor code you can find on Google's GitHub repository is the actual code running Google App Engine and Google Cloud Functions (minus some Google-specific pieces which are kept private and wouldn't make any sense outside Google's infrastructure).

When Kubernetes was launched, it was introduced as a simplified (re)implementation of Google's Borg architecture, designed for smaller payloads (Borg runs *all* of Google's infrastructure as a huge cluster of hundreds of thousands of nodes). gVisor isn't such a "let's do something similar in OSS" project, but a proven solution, at least for the payloads supported by the Google Cloud platform.

To better understand its design and usage, we need to get into details. Sorry if you get lost in the following paragraphs; if you don't care you can scroll directly down to the kitten.

What's a kernel by the way ?

"Linux containers", like the ones you run with Docker (actually runc, the default low-level container runtime), but also LXC, rkt or plain systemd (yes, systemd is a full container runtime, it just requires a way longer command line to set up :P), are all based on Linux kernel features that filter system calls and apply visibility and usage restrictions on shared system resources (CPU, memory, I/O). They all delegate to the kernel the responsibility to do this right, which as you can guess is far from trivial and is the result of decades of development by kernel experts.

Linux defines a "user space" (ring 3) and a "kernel space" (ring 0) as CPU execution levels. "Rings" are protection levels implemented in hardware: code can move to a less privileged ring (as done during boot), but not the opposite directly, and each ring only has access to a subset of hardware operations.

An application runs in user space. As a result there are many hardware-related operations it can't perform: for example, allocating memory, which requires interactions with hardware and is only available in kernel space. To get some memory, the application has to invoke a system call, a predefined procedure implemented by the kernel. When the application executes malloc, it actually delegates the related memory operation to the kernel. But remember: there's no direct way to move from user space to kernel space, so this is not just a function call.

The system call implementation depends on the architecture. On Intel architectures it relies on interrupts, which are signals the hardware uses to handle asynchronous tasks and external events, like timers, a key pressed on the keyboard or an incoming network packet. Software can also trigger interrupts, and passing parameters to the kernel relies on values set in CPU registers.

When an interrupt happens, the execution of the current program on the CPU is suspended, and the trap assigned to the interrupt is executed in kernel space. When the trap completes, the initial program is restored and resumes execution. As an interrupt only allows passing a few parameters, typically a system call number and some arguments, there's no risk of the application injecting illegal code into kernel space (as long as there's no bug in the kernel implementation, typically a buffer overflow weakness).

The kernel trap handling the system call interrupt will proceed with the memory allocation. Doing so it can apply restrictions (so your application can't allocate more than xxx MB, as defined by its control group) and implement the memory allocation on the actual hardware.

What's wrong with that? Nothing from a general point of view: this is a very efficient design, and the system call mechanism acts as a very efficient filter ... as long as everything in the kernel is done right. In the real world software comes with both bugs and unexpected security design issues (not even considering hardware ones), and so does the kernel. And as the Linux kernel protections used by Linux containers take place within kernel space, anything wrong there can be abused to break the security barriers.

If you check the number of CVEs per year for the Linux kernel you will understand that being a security engineer is a full-time job. Not that the Linux kernel is badly designed, just that a complex piece of software used by billions of devices, responsible for managing shared resources with full isolation on a large set of architectures, is ... damn, a complex beast!

Congrats to the Linux kernel maintainers by the way, they do an awesome job!

Google does have its own army of kernel security engineers maintaining a custom kernel: both for hardware optimisation purposes and to enforce security by removing/replacing/strengthening everything that may impact their infrastructure, while also contributing to the mainstream Linux kernel when it makes sense.

But that's still risky: if someone discovers an exploit in the Linux kernel, they might not be smart enough to keep it private, or could even try to hack Google.

Additional isolation: better safe than sorry.

A possible workaround for this risk is to add one additional layer of isolation/abstraction: hypervisor isolation.

To provide more abstraction, a virtual machine relies on hardware capabilities (typically Intel VT-x) to offer yet another level of interrupt-based isolation. Let's see how malloc operates when the application runs inside a VM:

- The application calls libc's malloc, which actually invokes system call number 12 by triggering an interrupt.
- The interrupt is trapped in kernel space, as configured on the hardware during the operating system's early boot stage.
- The kernel accesses the hardware to actually allocate some physical memory, if the request is legitimate. On bare metal the process would end here, but we are running in a VT-x enabled virtual machine.
- As the guest kernel is virtualized, it actually runs on the host as a user-space program. VT-x makes it possible to have two parallel sets of ring levels. So an attempt to access hardware triggers a VMEXIT and lets the hypervisor execute the trapped instruction and act accordingly. In the KVM architecture this means switching back to the host's user mode as soon as possible (!) and using the user-mode QEMU process for hardware emulation.

The hypervisor is configured to trap this interrupt, translating the low-level hardware access into an actual physical memory allocation, based on the emulated hardware and the virtual machine configuration. So when the VM's kernel thinks it's allocating memory block xyz in physical memory, it's actually asking the hypervisor to allocate it in an emulated memory model, and the hypervisor can detect an illegal memory range usage. security++

This second level of isolation prevents a bug in the virtual machine's kernel from exposing actual physical resources. It also ensures the resource management logic implemented by the guest kernel is strictly limited to a set of higher-level allocated resources. Hacking both the kernel and then the hypervisor is possible in theory, but extremely hard in practice.

KataContainers is an exact implementation of this idea: a Docker image, when run by runV (KataContainers' alternative to Docker's runC), uses a KVM hypervisor to run a just-enough virtual machine so the container can start. And thanks to the Open Container Initiative and Docker's modular design you can switch from one to the other.

Google's wish list for application isolation

Google decided to explore another approach. A virtual machine comes with some footprint: with a dedicated kernel and hardware emulation, a significant amount of CPU/memory is consumed by translation, and the guest kernel's attempts to optimise resource usage are nonsense without a full platform vision, merely duplicating the host kernel's effort. When you run billions of containers, any useless byte has a cost.

On the other side, kernel-based isolation is far from enough. It is part of a global solution, but Google needs more. Google wanted to:

  • limit the kernel's attack surface: minimize the lines of code involved, and thus the potential bugs
  • limit the risk of kernel bugs: rely on a structured language. They selected Go (some advocate Rust would have been a better choice...)
  • limit the impact of the kernel being hacked

Virtual Kernel to the rescue.

Google designed a "user-space thin virtual kernel" (this is what I call it; I'm not sure about their own name for this concept).

The gVisor kernel is a tiny, simple thing. It only implements a subset of the Linux system calls (~250 out of 400), and does this without any attempt at clever optimisations. This thin kernel is more or less a kernel firewall, and acts as a barrier against kernel exploits, for example to prevent a buffer overflow.

A buffer overflow is a security exploit relying on the kernel not detecting that some system call parameter implies a larger amount of data being written to some well-known kernel memory location. As a result the sibling kernel memory gets overwritten, which can allow hackers to execute code in kernel mode. The gVisor kernel is pretty simple in its implementation, which drastically reduces the risk of such an attack being found. The Linux kernel in comparison is millions of lines of C code, with a significant attack surface, even though the best experts review its code on a regular basis.

Sounds crazy? Look at this disruptive proof of concept: a credit card payment terminal hacked via buffer overflow.

The gVisor kernel traps application system calls and (re)implements them as a kernel proxy on the host, without any hardware emulation or hypervisor. Being implemented in Go, it doesn't suffer from the permissive C model which forces the developer to check buffer sizes, allocated pointers, reference removal, etc. This for sure comes with some cost (typically, a garbage collector); I bet Google isn't using the standard Go compiler/runtime for internal use.
gVisor only implements the legitimate system calls for the payloads supported on Google App Engine. Java 8 support for Google App Engine in 2017 means that all the system calls a JRE 8 requires have been implemented by gVisor. It probably could run many other runtimes, but Google prefers to double-check before making any public announcement and commitment to customers.

But the most disruptive architectural decision in gVisor is that this thin virtual kernel runs in user space. Some magic has to happen so that user program system calls actually get trapped by a virtual kernel running in user space.

How to trap a system call in user space?

gVisor comes with pluggable platforms, offering two options: ptrace and kvm.

ptrace is documented as the "reference" implementation in the gVisor docs. One should read "portable", as it is the sole guaranteed way to run gVisor on arbitrary Linux systems. ptrace is the Linux system call debugging facility; it's designed to trap system calls in kernel space and execute a user-space function in reaction.

Sounds good, but the devil is in the details: the actual design has some communication glitches which make it pretty inefficient when accessing large amounts of memory. Not an issue for a debugger, but a huge one for a container runtime. User Mode Linux was designed with this exact same idea, and is mostly abandoned due to bad performance.

The other option is kvm, so ... a hypervisor. This is claimed to be experimental; my guess is that Google's custom flavour of KVM and the Linux kernel has been optimised for this usage.

Who the hell will use gVisor ?

Anyone running Google App Engine or Google Cloud Functions for sure, but by design of the platform they don't know, and they don't have to care.

For others, without a portable, production-ready platform, gVisor so far is "only" an interesting project, which tells us more about how Google hosts random code on a shared infrastructure. If one wants to run containers with KVM isolation, it's pretty unclear to me whether gVisor is a better option than KataContainers, as the latter has been public for a longer time and has a larger community. On the other side, the gVisor project has already received feature requests and pull requests to add more system calls. Maybe this can help Google expand its Cloud platform to new application runtimes?

The other option is for another platform to be implemented. Typically, Google's new operating system Fuchsia, designed to run on mobiles, IoT devices and clusters, might be designed with this use case in mind, offering an efficient syscall-to-userspace mechanism (or maybe using more ring levels?).

Last but not least, the gVisor project demonstrates creativity in an alternative approach. Someone might come up with some fresh new idea using this new piece of software in combination with another feature, and build something unexpected... this happened already when the Linux kernel had all those namespace and cgroup things, and some technology enthusiasts came up with this emergent concept of "_containers_", creating a whole ecosystem and changing the way we build and deliver software today.


04 mai 2018

gVisor, WTF ?

Google has just announced a new runtime for OCI containers (Docker images, that is): gVisor.

Let's decrypt. (I'm stealing the illustrations from the gVisor page because I'm too lazy to redo them)

Linux has no native concept of a "container", unlike BSD with its Jails or Solaris with its Zones. What we call a "container" is therefore only the emergence of a stack of various security and isolation mechanisms. In a sense this is convenient because you can tune all of it and disable two or three things when needed, but what about the security of the thing as a "sandbox" allowing you to run arbitrary code without risking exposing the machine?

The Docker-style runtime (containerd) sets up the arsenal provided by the Linux kernel: resource visibility isolation (namespaces), restrictions (cgroups) and filters on kernel calls (capabilities, seccomp, AppArmor, SELinux). Roughly, we know how to enforce that a given application may only access given files, read-only, with a maximum I/O throughput of xx, which allows other applications to also get I/O resources and to put their own data on the same disk without risk.

Configured correctly (the default configuration is already designed to allow everything reasonable for a standard application, and to block everything that clearly isn't justified), we are in principle protected from any overreach by a malicious / badly coded / both-at-once application.

Note in passing that user-namespace support in Docker is still minimal (and disabled by default). It's what allows you to be "root" in the container but actually not: an unprivileged user on the host system. Ideally each container should run with a distinct user ID, just as we run the various services of a Unix system with different uid/gid to properly isolate who is allowed to do what. End of parenthesis.
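For the record, enabling user-namespace remapping is a one-line daemon setting (this sketch uses the built-in "default" mapping; check the Docker daemon documentation for your version):

```
{
  "userns-remap": "default"
}
```

placed in /etc/docker/daemon.json, followed by a daemon restart.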

That's the theory; obviously all of this relies on the solidity of these mechanisms, and therefore on the absence of bugs in the kernel. We've seen a few critical articles, targeting Docker, which highlight a resource-sharing problem at the kernel level when a container behaves in a particular way and ends up penalizing the others. We're not talking about privilege escalation or information leaks, but it's already a concern. In the articles I've read the problem was solved by upgrading the kernel (!). In short, everything rests on the shoulders of the kernel and its mechanisms for fair sharing and resource protection. And even with all the good will in the world, you can guess there's always a risk.

Hence the idea of not sharing the kernel between containers, and of having one kernel per container. That's the principle of KataContainers, the merger of the ClearContainers (Intel) and runV (Hyper) projects. In both cases, a KVM hypervisor and a minimal kernel dedicated to each container are used. It's not magic: here too the host machine has to ensure fair and secure sharing of resources between the various virtual machines + containers. But by adding this indirection we limit the risk that a given kernel bug significantly exposes the host machine.

The "virtual hardware" part is where the shoe pinches: if we really emulated the CPU, the disks, the memory... we would end up with the performance of a MO5 (that's actually how you build an MO5 emulator on a PC, but that's a whole other story).

As you can guess, given the massive use of VMs in the cloud, it doesn't work like that: the "hardware" exposed to the virtual machine is a pre-digested version, super easy to use, which just passes through to the host system's drivers. The hosted VM's kernel also has the right driver to use this "hardware", which is so simple the code is a few lines long (well, a few hundred, it's still C code :P).
We gain simplicity (so we limit the risk of bugs, hence security holes) and performance.

This "para-virtualisation" technique (virtio) allows keeping some isolation without giving up performance. It is sometimes implemented at the hardware level, in particular by the CPU (otherwise virtual machines would crawl), but also on some architectures by the memory controller or the network interface...

In short, VMs work (no kidding?) and give you belt plus suspenders, which is the security expert's favourite outfit (under the bulletproof vest).

OK, so where does gVisor fit into all this?

gVisor also wants to separate the kernel used by a container from the host kernel. For that it uses a technique that isn't new: User Mode Linux had already explored the same approach, and the fact that its website is on SourceForge gives you an idea of its age.

The idea is to start a classic user-space process (i.e. not root) which will act as the kernel of a hosted subsystem, but without a virtualization layer as a hypervisor provides. The system calls the application makes to the hosted kernel are then implemented on top of the legitimate resources allocated to the process.

Main concern: how do you get a standard process to intercept system calls (which by definition are ... at the heart of the system, i.e. in the kernel)? The approach used by UML and gVisor is to divert ptrace, the POSIX debugging tool, which roughly allows you to set a breakpoint on every system call and answer in place of the kernel. A good big hack, really. It works, but it's not free in terms of performance. No surprise there: it's designed for debugging!

gVisor also uses a kernel developed in Go by Google. I'm willing to believe they are very, very good and that gVisor didn't come out of a hat, but I didn't see that one coming. When I talked about rewriting Jenkins in Rust it was a joke; writing an OS kernel in Go is, er... disruptive?

But OK, let's accept it. After all, this "kernel" is only there to filter what we are willing to expose to the container, and can be a simple implementation since it's based not on hardware but on a Linux system with everything it needs.

Still, I remain perplexed: sure, this pseudo-kernel exposes something cleaner to containers than a Docker runtime does:

On one side "real" mounts (well camouflaged, that is); on the other, mounts that clearly show the underlying runtime.

Sure, this purpose-built kernel can implement things that would make no sense in the upstream Linux kernel. Sure, Google has experience on the subject which makes it credible. It still seems strange to me to develop a network stack in this Go kernel when Linux has offered virtual network stacks (veth) for ages, which apparently satisfies everybody. So there is a will to take control over some very fine details of these aspects, which are not (yet) documented:

We plan to release a full paper with technical details and will include it here when available.

Which brings me to the main point, rather than debating the technical choices: what is the use case that justifies all this? Clearly it's not intended for mainstream use by your average sysadmin. And that, I haven't found in the docs.

My hypothesis, undeniably wrong, but I'll propose it anyway:

A full Linux kernel, even one using virtio, is far too complex and "smart" for a context as trivial as a single application in its container, which increases the risk of bugs. Google's idea is therefore to have a dumbed-down, stripped kernel that exposes only the strict minimum, with ultra-simple internal logic to limit vulnerabilities, and then to hand over to an "adult" for the more delicate management of the real hardware.

NB: my understanding of the kernel is far too limited to say whether this view of things is realistic/relevant, so take it for what it is.

What bothers me is that this specification is ... seccomp's :-/ So this would be an alternative approach to avoid relying on seccomp, which

  1. runs in kernel space
  2. only allows partial filtering of system call parameters (direct values). A limitation that Landlock, which may succeed it, doesn't have?

I don't think we'll see gVisor in production right away (!). Still, the approach raises questions, and I can't find satisfying answers. So I probably haven't given you the answer you may have expected from this article. Send me comments if you've read interesting information about it or have an opinion on the question.

Another black mark: gVisor has no logo, and that sucks when you're looking for an illustration for a blog post. So too bad for you, I'm giving you a David Gageot instead.

09 février 2018

Breizh to the Camp

If you read this blog you probably already know me a bit:
I tinker on stuff at CloudBees, I run a YouTube channel about Docker, and I organize a conference in Rennes: BreizhCamp.

It's rare that I can bring these three components together in the same post, so here is the exception that proves the rule:

a YouTube video full of hacks to promote BreizhCamp 2018

Enjoy it, share it, help us make some noise!

A second batch of tickets to register for BreizhCamp 2018 will be released soon ...

19 janvier 2018

to DinD or not to DinD ?

A colleague of mine pointed me to this interesting article about using Docker-in-Docker ("DinD") as a valid solution for the continuous integration use case.

Don't bind-mount docker.sock !

This article is pretty interesting as it explains very well the issue with exposing the underlying Docker socket to a container. tl;dr: you just give up security and isolation. Remember this single excerpt:

"they can create Docker containers that are privileged or bind mount host paths, potentially creating havoc on the host"

This article starts with a reference to Jérôme's blog post explaining why one should not use DinD for CI, so it is interesting to understand the reasoning behind adopting a solution the original author explicitly disclaimed for this usage.

Let's now have a look at the follow-up article on DinD : A case for Docker-in-Docker on Kubernetes (Part 2)

Here again, the issue of exposing the underlying Docker infrastructure is well described. Please read it; I'm exhausted trying to explain why '-v /var/run/docker.sock:/var/run/docker.sock' is an option you should never type.

Then the DinD solution applied to Kubernetes is demonstrated, and one point I want you to notice is this one in the pod's YAML definition:

    privileged: true 

Privileged ?

What does this option imply? It sounds like few people actually understand the impact. The option name should ring a bell anyway.

Let's have a look at the reference documentation:

"The --privileged flag gives all capabilities to the container, and it also lifts all the limitations enforced by the device cgroup controller"

Such a container can then access every hardware resource exposed at the lowest level within the host's /dev pseudo-filesystem, which includes all your disks: the most obvious security issue. Are you comfortable that your build can also access /dev/mem (physical memory)?

Allowing all capabilities also means your container can use all the system calls gated by the cap_sys_admin capability, which, as a short overview of Linux capabilities will tell you, means... there's no restriction on what this process can do on the system. Typically, with cap_sys_admin you can use mknod to create /dev/* entries if you didn't already have access to them from the container...
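To make this concrete, here is a sketch of the classic escape once device access is unrestricted. The commands are only echoed, since actually running them requires root inside a --privileged container, and the device numbers (8, 0) are the conventional ones for the host's first SCSI disk:

```shell
# Inside a privileged container nothing prevents recreating host device
# nodes and mounting them. Echoed only: running these for real requires
# root inside a --privileged container.
ESCAPE_STEP_1='mknod /dev/sda b 8 0'   # recreate the host's first disk device
ESCAPE_STEP_2='mount /dev/sda /mnt'    # mount it and read the host filesystem
echo "$ESCAPE_STEP_1"
echo "$ESCAPE_STEP_2"
```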

--privileged is sort of a sudo++. Just as if you could do this:

  ~ echo hello
Permission denied
  ~ sudo echo hello
Permission denied
  ~ echo --privileged hello

So, a DinD container runs as root, with no restriction on the system calls it can make, and with access to all devices. Sounds like a good place to run arbitrary processes and build pull requests, doesn't it?

Maybe you consider docker resource isolation as "we want to prevent the development team from shooting itself in the foot" and just want to ensure no build process will start an infinite fork loop or break the CI service with a memory leak. Only public cloud services need to prevent hackers from breaking the isolation and stealing secrets, right? If so, please take a few minutes to talk with your Ops team :P

So, is DinD such a bad idea?

Actually, one can use a privileged container and still enforce security, using a fine-grained AppArmor profile to ensure only adequate resources are reachable. You can also use docker's --device to restrict the devices your DinD container can actually use, and --cap-drop to restrict the allowed system calls to the strict minimum. This is actually how play-with-docker is built, but as you can guess this wasn't created within a day, and it requires an advanced understanding of those security mechanisms.
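As a sketch of that hardening direction, the flags could look like this. The capability set and the AppArmor profile name (my-dind) are assumptions for illustration, not a vetted configuration, so the command is echoed rather than run:

```shell
# Hardened (non---privileged) DinD launch sketch: drop every capability,
# add back only a plausible minimum, and whitelist devices explicitly.
# Capability list and profile name are assumptions, not a vetted setup.
DIND_FLAGS='--cap-drop=ALL --cap-add=SYS_ADMIN --cap-add=NET_ADMIN'
DIND_FLAGS="$DIND_FLAGS --device=/dev/fuse --security-opt apparmor=my-dind"
echo "docker run -d $DIND_FLAGS docker:dind"
```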

Is there any alternative?

My guess is that Applatix's solution is driven by the lack of a simple and viable alternative. Exposing the underlying docker infrastructure is just a no-go, as you then lose kubernetes' management control over your side containers. Your nodes would quickly be running thousands of orphaned containers. From this point of view, using DinD lets you keep all your containers under cluster management.

How do others solve this issue?

CircleCI, for example, does allow access to a docker infrastructure to build your own images. The documentation explains that a dedicated, remote docker machine will be allocated for your build. So they just create VMs (or something comparable) to let your build access a dedicated docker daemon with strong isolation. This is far from transparent for the end user, but at least it doesn't give up on security.

My recommendation is to have your build include the required logic to set up such a dedicated docker box. In terms of a Jenkinsfile pipeline, you could mimic CircleCI with a shared library offering a setup_remote_docker() high-level function to jobs within your company. This library would allocate a short-lived VM on your infrastructure to host docker commands, and inject the DOCKER_HOST environment variable accordingly.
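A minimal sketch of what such a setup_remote_docker() step could boil down to, assuming a hypothetical provision_build_vm helper that allocates the short-lived VM and prints its IP (stubbed here with a placeholder address so the script stays runnable):

```shell
# Stub for the infrastructure-specific provisioning call: in a real
# shared library this would allocate a short-lived VM and print its IP.
provision_build_vm() {
  echo '10.0.0.42'   # placeholder IP, for illustration only
}

# Point every subsequent docker CLI call at the dedicated daemon.
VM_IP=$(provision_build_vm)
export DOCKER_HOST="tcp://${VM_IP}:2376"
export DOCKER_TLS_VERIFY=1
echo "builds now talk to $DOCKER_HOST"
```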

What's next?

Another solution I've been investigating is to create a docker API proxy, which does expose the underlying docker infrastructure but filters all API calls to reject anything you're not supposed to do:
  • only proxy supported API calls (whitelist)
  • parse the API payload and rebuild the one sent to the underlying infrastructure. This ensures only supported options are passed to the docker daemon.
  • reject security-related options: bind mounts, privileged, cap-add, etc.
  • block access to containers/volumes/networks you didn't create
  • filter API responses so you only see legitimate resources (for example, docker ps only lists your own containers)
This proxy also transparently adds constraints to API commands: it enforces that all containers you create inherit from the same cgroup hierarchy. So if your build is constrained to 2GB of memory, you can't get more by running side containers. It also adds labels, which can be used by infrastructure monitoring to track resource ownership.

So, generally speaking, this proxy adds a "namespace" feature on top of the Docker API.
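The whitelist/reject decision can be sketched as follows. A real proxy would parse and rebuild the JSON payload; grepping the raw body, with a made-up sample payload, is only meant to illustrate the idea:

```shell
# Naive illustration of rejecting a container-create payload that carries
# a security-sensitive option. A real proxy parses the JSON and rebuilds
# a clean payload; grepping the raw body is only for demonstration.
payload='{"Image":"maven:3","HostConfig":{"Privileged":true}}'
if echo "$payload" | grep -q -e '"Privileged":true' -e '"Binds"' -e '"CapAdd"'; then
  VERDICT='rejected'
else
  VERDICT='accepted'
fi
echo "$VERDICT"
```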

This is just a prototype so far, and sorry: it's not open-source...