21 octobre 2015

First experiment with Tutum

I heard about Tutum from the time CloudBees was still a PaaS company and I considered them a newcomer competitor. I just read today announcement they have been acquired by Docker Inc, so wanted to know more. 

Tutum is actually not a PaaS but a deployment orchestrator and infrastructure manager you plug to your IaaS account(s). As I'm using Google Compute and this provider isn't supported, online help guided me to "bring your own node" option. This one installed an agent and immediately appeared connected on Tutum web UI.

First impression is important in IT for adoption, and Tutum do offer an awesome UX. Within few seconds I had my account setup and have found the adequate configuration informations. Need to admit they made an impressive work here.

So I have my infra setup and ready to host my app. Next step for me is to reproduce my environment, as I'm using docker-compose for local testing. 

Kubernetes or Amazon ECS both do offer the concept of running a set of containers as a single entity. Tutum has comparable concepts with Services (N replicas of a docker image) and Stacks (composition of services).  But all of them do rely on custom descriptor I would have to keep in sync and can't test locally. But according to announcement blog, Tutum also do support docker-compose so I could just use my existing setup, especially as my application does not require horizontal scaling nor replicas. 

After looking into details, Tutum stack syntax is actually the same (maybe a subset/superset ?) or docker-compose yaml syntax. 

So, I can see two significant benefits of Tutum :

1. no IaaS lock-in. I can rely on support Cloud providers or run my app on my own nodes, but still will benefit node management by Tutum. This looks to me like a real "private PaaS" i.e. my own hardware but still "as-a-service" experience.

2. same bits from dev to production. Both my docker image and stack definition are used on my development laptop and on my production service. So I can reproduce what happens on production at any time.

Need to experiment more, but looks very promising so far.

How to run 2500 webservers on a Raspberry Pi

If you didn't saw the announcement, I'm part of the winner team for DockerCon RPi Challenge. This blog post is about giving some details on our setup to get such a high number of webservers on a small device.

Some might thing you have to make your Docker image as small as possible, but this isn't actually the case. The image will result into space on disk for /var/lib/docker but not memory consumption. Also, a big process loaded into memory would only consume memory once, then kernel will share code page between equivalent processes, so hundred of them would only consume memory once. My first idea was to build a webserver to include the html and image content into source code. But then Yoann explained me sendfile can be used to fully delegate this to kernel and make the process even simpler. For Java developers, consider sendfile as some kernel-level IOUtils.copy(File, OutputStream).

We used hypriot's nano http image. This one is a webserver developped in assembly code to just serve files from disk using kernel sendfile call. Such a program as a minimal memory footprint and a 1-depth stack. The memory allocation for kernel to handle such a process can then be as compact as possible.

Next step was to run some tests and tweak Docker to run as much webservers as possible. We applied various strategies, without any methodology but just apply various recipes we had in mind and check the result (it takes hours to run thousand servers...)

Free memory

We tweaked the Raspberry and OS to reduce memory usage. Some low level tweaks allow to disable useless features at boot, some system level one are used to disable linux feature we don't need for this challenge.

Swap !

Yoann tried to explain me what zRAM is and I probably didn't got it right, but the general idea is that classic swap on disk is incredibly slow, and is only your last chance to free memory. A better, modern approach is to compress memory, which CPU can do very efficiently, a lot faster than accessing disk (especially on a RPi as disk is a SD card).

So our setup do use 5 zram 4 of them for swap (on per CPU, to allow concurrent access) + one for /var/lib/docker filesystem

What? Yes, we use a RamDisk for /var/lib/docker, even we did all those efforts to reduce memory usage... Main issue for this challenge is that running a test and start thousands containers takes hours. Having /var/lib/docker on the SD card made it terribly slow. If we had to get further on the challenge we would have used an external USB SSD disk.

Tweak docker command

Web servers are started by docker from a script. We selected docker options to reduce resource consumed by each web server. Especially, running with a a dedicated IP stack per container involve a huge resource usage, so a key hack was to run with --net=host. We also disabled log driver so docker don't have to collect logs and as such uses less resources. This seem to not work as expected (read later)

Tweak docker process

Linux also allows to tweak the way a process is managed in kernel, we used it to ensure docker run with minimal required resources and use swap

Tweak docker daemon config

Docker is ran by systemd on hypriot OS image, so we had to tweak it a few to unlock limitations. My naive understanding of Linux was that being ran as root, docker deamon could do anything. This isn't the case and it actually can't run more than few dozen processes with default configuration.

Docker daemon has many options we used to reduce it's memory usage.  Generally speaking we tried to disable everything that is not required to run a webserver with docker engine. logs, network, proxies. We expected this to prevent Docker daemon to run threads to collect logs or proxy signals to the contained processes.

2499 Limit

Then we hit the 2499 limit, with this in daemon.log :

docker[307]: runtime: program exceeds 10000-thread limit

Go language did introduce a thread limit to prevent misuse of threading. 10000 was considered enough for any reasonable usage. I indeed would not consider running so much thread a correct design, but here we hit such a limit because docker daemon do run 4 threads per container. It's not yet clear to me what those threads are used for. 

Using Go thread dump (SIGQUIT) I noticed some of them are related to logging, even we ran with --log-driver=none as an attempt to get further. I guess docker design here is to always collect then dispatch to "none" log driver which is NoOp, not to fully disable logging feature.

 So, 2499 is our best official score considering the RpiDocker Challenge rules.


We also wanted to know the upper limit. We made experiments running the plain httpd webserver without docker, and were able to run 27000 of them on the Raspberry. Docker daemon actually grows in memory usage and at some point as some bad impact on the system so you can't run more process. Please note this isn't relevant for arguments against docker on production system, until your business is to run thousands containers on a extra small server.

So, we hacked docker source code to force the MaxThread limit to 12000, built ARM docker executable and ran the script. We were able to run ~2740 web servers before we reach our first, real OOM

[21112.371259] INFO: rcu_preempt detected stalls on CPUs/tasks:

[21112.377124]  Tasks blocked on level-0 rcu_node (CPUs 0-3):

What's next ?

We'd like to better understand Docker threading model, and discuss this issue with docker core team. Using Non-Blocking IO might be an option to rely on a minimal set of threads. I have no idea yet how Golang do handle NIO, I just know it's a pain in Java so I wouldn't do it until I have good reasons to...