If you didn't see the
announcement, I'm part of the winning team for the DockerCon RPi Challenge. This blog post gives some details on our setup to run such a high number of web servers on such a small device.
Some might think you have to make your Docker image as small as possible, but this isn't actually the case. The image only costs disk space in /var/lib/docker, not memory. Also, a big process loaded into memory only consumes memory once: the kernel shares code pages between identical processes, so hundreds of them still only pay that cost once. My first idea was to build a web server that embeds the HTML and image content in its source code. But then Yoann explained to me
sendfile can be used to fully delegate this to the kernel and make the process even simpler. For Java developers, consider sendfile as a kernel-level
IOUtils.copy(File, OutputStream).
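To give an idea, here is a minimal Go sketch of a sendfile-based responder. This is not the actual challenge code, and the file name and port are arbitrary:

    package main

    import (
        "fmt"
        "net"
        "os"
        "syscall"
    )

    func main() {
        ln, err := net.Listen("tcp", ":8080") // arbitrary port
        if err != nil {
            panic(err)
        }
        for {
            conn, err := ln.Accept()
            if err != nil {
                continue
            }
            go serve(conn.(*net.TCPConn))
        }
    }

    func serve(conn *net.TCPConn) {
        defer conn.Close()
        f, err := os.Open("index.html") // arbitrary file name
        if err != nil {
            return
        }
        defer f.Close()
        info, err := f.Stat()
        if err != nil {
            return
        }
        // Write the HTTP header ourselves...
        fmt.Fprintf(conn, "HTTP/1.0 200 OK\r\nContent-Length: %d\r\n\r\n", info.Size())
        // ...then let the kernel push the file body straight to the
        // socket: one syscall, no user-space buffer.
        sock, err := conn.File()
        if err != nil {
            return
        }
        defer sock.Close()
        syscall.Sendfile(int(sock.Fd()), int(f.Fd()), nil, int(info.Size()))
    }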
We used hypriot's
nano http image, a web server developed in assembly code that just serves files from disk using the kernel sendfile call. Such a program has a minimal memory footprint and a 1-deep stack, so the memory the kernel allocates to handle the process can be as compact as possible.
The next step was to run some tests and tweak Docker to run as many web servers as possible. We applied various strategies without any real methodology: just try the recipes we had in mind and check the result (it takes hours to run a thousand servers...).
Free memory
We tweaked the Raspberry Pi and the OS to reduce memory usage. Some
low-level tweaks disable useless features at boot; some
system-level ones disable Linux features we don't need for this challenge.
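I won't paste our full configuration here, but to give an idea of the kind of knobs involved (values are illustrative, not our exact settings):

    # /boot/config.txt - we run headless, give the GPU the bare minimum
    gpu_mem=16

    # /etc/sysctl.conf - swap aggressively, leave room for many threads
    vm.swappiness=100
    kernel.threads-max=20000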
Swap!
Yoann tried to explain to me what
zRAM is and I probably didn't get it all, but the general idea is that classic swap on disk is incredibly slow and is only your last chance to free memory. A better, modern approach is to compress memory, which the CPU can do very efficiently, a lot faster than accessing disk (especially on a RPi, where the disk is an SD card).
So our setup uses
5 zram devices: 4 of them for swap (one per CPU core, to allow concurrent access) plus one for the
/var/lib/docker filesystem.
What? Yes, we use a RAM disk for /var/lib/docker, even after all those efforts to reduce memory usage... The main issue for this challenge is that running a test and starting thousands of containers takes hours, and having /var/lib/docker on the SD card made it terribly slow. If we had to push the challenge further we would have used an external USB SSD.
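For the record, a zram setup along these lines looks like this (sizes are illustrative; our actual boot script may differ):

    # create 5 zram devices: 4 for swap + 1 for /var/lib/docker
    modprobe zram num_devices=5

    # one swap device per CPU core; equal priority lets the kernel
    # stripe swap writes across them concurrently
    for i in 0 1 2 3; do
        echo 100M > /sys/block/zram$i/disksize
        mkswap /dev/zram$i
        swapon -p 5 /dev/zram$i
    done

    # the 5th device holds the /var/lib/docker filesystem
    echo 400M > /sys/block/zram4/disksize
    mkfs.ext4 /dev/zram4
    mount /dev/zram4 /var/lib/docker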
Tweak docker command
Web servers are started by docker from
a script. We selected docker options to reduce the resources consumed by each web server. In particular, running a dedicated IP stack per container involves a huge resource usage, so a key hack was to run with
--net=host. We also disabled the log driver so docker doesn't have to collect logs and as such uses fewer resources. This didn't work as expected (read on).
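Concretely, each web server is started with something like this (the image name here is from memory; the exact command is in the script):

    # share the host network stack (no per-container veth/iptables),
    # don't collect logs; each instance listens on its own port
    docker run -d --net=host --log-driver=none hypriot/rpi-nano-httpd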
Tweak docker process
Linux also allows you to
tweak the way a process is managed by the kernel; we used this to ensure docker runs with the minimal required resources and makes use of swap.
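I won't reproduce our exact settings, but this is the flavour of tweak we're talking about (illustrative, via /proc):

    # shield the docker daemon from the OOM killer; its children
    # (the web servers) will be sacrificed first
    echo -1000 > /proc/$(pidof docker)/oom_score_adj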
Tweak docker daemon config
Docker is run by systemd on the hypriot OS image, so we had to tweak it a bit to
unlock some limitations. My naive understanding of Linux was that, being run as root, the docker daemon could do anything. This isn't the case: with the default configuration it actually can't run more than a few dozen processes.
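The fix is a systemd drop-in raising the unit's limits, something like:

    # /etc/systemd/system/docker.service.d/limits.conf
    [Service]
    LimitNOFILE=1048576
    LimitNPROC=infinity

    # then reload and restart
    systemctl daemon-reload
    systemctl restart docker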
The Docker daemon has many options we used to
reduce its memory usage. Generally speaking, we tried to disable everything that isn't required to run a web server with the docker engine: logs, network, proxies. We expected this to prevent the Docker daemon from running threads to collect logs or to proxy signals to the contained processes.
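The daemon command line ends up looking something like this (all standard docker engine flags, though I'm quoting the set from memory):

    # no bridge, no NAT rules, no userland proxy: nothing but
    # what's needed to fork web servers
    docker daemon --bridge=none --ip-forward=false \
                  --iptables=false --userland-proxy=false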
2499 Limit
Then we hit the 2499 limit, with this in daemon.log:
    docker[307]: runtime: program exceeds 10000-thread limit
The Go language introduced a thread limit to prevent misuse of threading; 10000 was considered enough for any reasonable usage. I indeed would not consider running so many threads a correct design, but here we hit the limit because the docker daemon runs 4 threads per container. It's not yet clear to me what those threads are used for.
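The limit lives in the Go runtime, and a program can raise it itself via runtime/debug.SetMaxThreads. A minimal illustration (not Docker's actual code):

    package main

    import (
        "fmt"
        "runtime/debug"
    )

    func main() {
        // The runtime default is 10000 OS threads; exceeding it kills
        // the process with "runtime: program exceeds 10000-thread limit".
        old := debug.SetMaxThreads(12000)
        fmt.Println("thread limit raised from", old)
    }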
Using a Go thread dump (SIGQUIT) I noticed some of them are related to logging, even though we ran with --log-driver=none as an attempt to get further. I guess the docker design here is to always collect logs, then dispatch them to the "none" log driver, which is a no-op, rather than fully disabling the logging feature.
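Getting such a dump from a running Go program is a one-liner; beware that the process exits after dumping (the stacks land in the journal on this image):

    # ask the Go runtime to dump all goroutine/thread stacks to stderr
    kill -QUIT $(pidof docker)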
So, 2499 is our best official score considering the RpiDocker Challenge rules.
More
We also wanted to know the upper limit. We ran experiments with the plain httpd web server, without docker, and were able to run 27000 of them on the Raspberry Pi. The Docker daemon actually grows in memory usage and at some point has a bad enough impact on the system that you can't run more processes. Please note this isn't a relevant argument against docker on production systems, unless your business is to run thousands of containers on an extra-small server.
So, we hacked the docker source code to force the MaxThreads limit to 12000, built an
ARM docker executable and ran the script. We were able to run ~2740 web servers before we hit our first, real OOM:
    [21112.371259] INFO: rcu_preempt detected stalls on CPUs/tasks:
    [21112.377124] Tasks blocked on level-0 rcu_node (CPUs 0-3):
What's next?
We'd like to better understand the Docker threading model and discuss this issue with the docker core team. Using non-blocking IO might be an option to rely on a minimal set of threads. I have no idea yet how Golang handles NIO; I just know it's a pain in Java, so I wouldn't go there until I have good reasons to...