21 October 2015

How to run 2500 webservers on a Raspberry Pi

In case you missed the announcement, I'm part of the winning team of the DockerCon RPi Challenge. This blog post gives some details on the setup we used to get such a high number of webservers running on such a small device.

Some might think you have to make your Docker image as small as possible, but this isn't actually the case. The image size translates into disk space under /var/lib/docker, not into memory consumption. Also, even a big binary loaded into memory only costs its code once: the kernel shares code pages between processes running the same executable, so hundreds of them consume that memory only once. My first idea was to build a webserver that embeds the HTML and image content in its source code. But then Yoann explained to me that sendfile can be used to delegate this entirely to the kernel and make the process even simpler. For Java developers, think of sendfile as a kernel-level IOUtils.copy(File, OutputStream).

We used hypriot's nano http image. It is a webserver written in assembly that just serves files from disk using the kernel sendfile call. Such a program has a minimal memory footprint and a 1-depth stack. The memory the kernel has to allocate to handle such a process can then be as compact as possible.
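
To make the sendfile idea concrete, here is a minimal sketch in Go rather than assembly. It is not the code of the nano http image: the function names and the temporary file are made up for illustration. It relies on the fact that, on Linux, Go's net package can use sendfile(2) when io.Copy sends a regular file to a TCP connection.

```go
package main

import (
	"fmt"
	"io"
	"net"
	"os"
)

// serveFile streams the file at path to conn. On Linux, when conn is a TCP
// connection and the source is an *os.File, io.Copy lets the runtime use
// sendfile(2), so the bytes do not pass through a user-space buffer.
func serveFile(conn net.Conn, path string) (int64, error) {
	f, err := os.Open(path)
	if err != nil {
		return 0, err
	}
	defer f.Close()
	return io.Copy(conn, f)
}

// roundTrip (hypothetical helper) writes content to a temporary file, serves
// it once over a loopback TCP connection and returns what the client read.
func roundTrip(content []byte) ([]byte, error) {
	tmp, err := os.CreateTemp("", "page-*.html")
	if err != nil {
		return nil, err
	}
	defer os.Remove(tmp.Name())
	if _, err := tmp.Write(content); err != nil {
		tmp.Close()
		return nil, err
	}
	if err := tmp.Close(); err != nil {
		return nil, err
	}

	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return nil, err
	}
	defer ln.Close()

	errc := make(chan error, 1)
	go func() {
		conn, err := ln.Accept()
		if err != nil {
			errc <- err
			return
		}
		_, err = serveFile(conn, tmp.Name())
		conn.Close() // closing signals end of file to the client
		errc <- err
	}()

	client, err := net.Dial("tcp", ln.Addr().String())
	if err != nil {
		return nil, err
	}
	defer client.Close()
	got, err := io.ReadAll(client)
	if err != nil {
		return nil, err
	}
	return got, <-errc
}

func main() {
	got, err := roundTrip([]byte("<html>hello</html>"))
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("received %d bytes\n", len(got))
}
```

The server side never reads the file content itself; it only hands two descriptors to the kernel, which is why such a process needs so little memory.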

The next step was to run some tests and tweak Docker to run as many webservers as possible. We had no real methodology: we just applied the various recipes we had in mind and checked the result (it takes hours to start thousands of servers...).

Free memory

We tweaked the Raspberry Pi and the OS to reduce memory usage. Some low-level tweaks disable unneeded features at boot; some system-level ones disable Linux features we don't need for this challenge.

Swap!

Yoann tried to explain to me what zRAM is, and I probably didn't get it right, but the general idea is that classic swap on disk is incredibly slow and is only your last chance to free memory. A better, modern approach is to compress memory, which the CPU can do very efficiently and much faster than accessing the disk (especially on an RPi, where the disk is an SD card).

So our setup uses 5 zram devices: 4 of them for swap (one per CPU, to allow concurrent access) plus one for the /var/lib/docker filesystem.

What? Yes, we use a RAM disk for /var/lib/docker, even after all those efforts to reduce memory usage... The main issue for this challenge is that running a test and starting thousands of containers takes hours. Having /var/lib/docker on the SD card made it terribly slow. If we had to go further in the challenge, we would have used an external USB SSD.
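
As a rough sketch of that five-device layout, the small Go program below prints the shell commands one could run as root to create it. The device sizes, the swap priority and the ext4 filesystem are assumptions for illustration only; they are not the values used for the challenge.

```go
package main

import "fmt"

// zramPlan returns the shell commands for a layout with swapDevices zram
// swap areas plus one extra zram device holding /var/lib/docker.
// swapSize, dockerSize and the ext4 filesystem are illustrative assumptions.
func zramPlan(swapDevices int, swapSize, dockerSize string) []string {
	cmds := []string{fmt.Sprintf("modprobe zram num_devices=%d", swapDevices+1)}
	for i := 0; i < swapDevices; i++ {
		dev := fmt.Sprintf("zram%d", i)
		cmds = append(cmds,
			fmt.Sprintf("echo %s > /sys/block/%s/disksize", swapSize, dev),
			fmt.Sprintf("mkswap /dev/%s", dev),
			// Swap areas with the same priority are used in turn by the kernel.
			fmt.Sprintf("swapon -p 5 /dev/%s", dev),
		)
	}
	dev := fmt.Sprintf("zram%d", swapDevices)
	cmds = append(cmds,
		fmt.Sprintf("echo %s > /sys/block/%s/disksize", dockerSize, dev),
		fmt.Sprintf("mkfs.ext4 /dev/%s", dev),
		fmt.Sprintf("mount /dev/%s /var/lib/docker", dev),
	)
	return cmds
}

func main() {
	// 4 swap devices (one per core) and hypothetical sizes.
	for _, c := range zramPlan(4, "128M", "256M") {
		fmt.Println(c)
	}
}
```

Printing the commands instead of executing them keeps the sketch safe to run on any machine.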

Tweak docker command

Web servers are started by docker from a script. We selected docker options to reduce the resources consumed by each web server. In particular, running with a dedicated IP stack per container involves a huge resource usage, so a key hack was to run with --net=host. We also disabled the log driver so docker doesn't have to collect logs and thus uses fewer resources. This doesn't seem to work as expected (see below).
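
A launcher along those lines can be sketched as follows. Only --net=host and --log-driver=none come from the text above; the -d flag, the image name and the helper names are assumptions for the example, and the real script is not reproduced here.

```go
package main

import (
	"fmt"
	"os/exec"
	"strings"
)

// runArgs builds the docker arguments for one webserver container.
func runArgs(image string) []string {
	return []string{
		"run", "-d", // detached (assumption, not stated in the post)
		"--net=host",        // share the host network stack
		"--log-driver=none", // ask docker not to store container logs
		image,
	}
}

// startServers launches count containers one after another and returns how
// many were started before the first failure.
func startServers(count int, image string) (int, error) {
	for i := 0; i < count; i++ {
		out, err := exec.Command("docker", runArgs(image)...).CombinedOutput()
		if err != nil {
			return i, fmt.Errorf("container %d: %v: %s", i+1, err, out)
		}
	}
	return count, nil
}

func main() {
	// The image name is an assumption. main only prints the command line;
	// call startServers to actually launch containers.
	image := "hypriot/rpi-nano-httpd"
	fmt.Println("docker " + strings.Join(runArgs(image), " "))
	_ = startServers
}
```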

Tweak docker process

Linux also allows tweaking the way a process is managed by the kernel; we used this to ensure docker runs with the minimal required resources and uses swap.

Tweak docker daemon config

Docker is run by systemd on the hypriot OS image, so we had to tweak it a bit to lift some limits. My naive understanding of Linux was that, running as root, the docker daemon could do anything. This isn't the case, and with the default configuration it actually can't run more than a few dozen processes.
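
Which limits had to be raised is not detailed here. As an illustration that even a root process runs under per-process limits (the kind a systemd unit can set with directives such as LimitNPROC or LimitNOFILE), this sketch prints two rows of /proc/self/limits on Linux. The choice of rows is an assumption about what may matter; the helper name is made up.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// findLimit looks up one row of a /proc/<pid>/limits table and returns its
// soft and hard values as printed by the kernel (a number or "unlimited").
func findLimit(table, name string) (soft, hard string, ok bool) {
	for _, line := range strings.Split(table, "\n") {
		if !strings.HasPrefix(line, name) {
			continue
		}
		fields := strings.Fields(strings.TrimPrefix(line, name))
		if len(fields) >= 2 {
			return fields[0], fields[1], true
		}
	}
	return "", "", false
}

func main() {
	data, err := os.ReadFile("/proc/self/limits") // Linux only
	if err != nil {
		fmt.Println("cannot read limits:", err)
		return
	}
	for _, name := range []string{"Max processes", "Max open files"} {
		if soft, hard, ok := findLimit(string(data), name); ok {
			fmt.Printf("%s: soft=%s hard=%s\n", name, soft, hard)
		}
	}
}
```

Reading /proc/<pid>/limits for the daemon's pid instead of "self" shows the limits the daemon actually inherited from its service unit.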

The Docker daemon has many options, which we used to reduce its memory usage. Generally speaking, we tried to disable everything that is not required to run a webserver with the docker engine: logs, network, proxies. We expected this to prevent the Docker daemon from running threads to collect logs or proxy signals to the contained processes.

2499 Limit

Then we hit the 2499 limit, with this in daemon.log:

docker[307]: runtime: program exceeds 10000-thread limit


The Go runtime introduced a thread limit to prevent misuse of threading; 10000 was considered enough for any reasonable usage. I indeed would not consider running so many threads a correct design, but here we hit the limit because the docker daemon runs 4 threads per container: with the daemon's own threads on top, the 2500th container pushes it past 10000. It's not yet clear to me what those threads are used for.
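
The arithmetic behind the 2499 figure can be checked with a few lines of Go. The 10000 limit and the 4 threads per container come from the text; the number of threads the daemon uses for itself (baseline) is an assumption, and any value from 1 to 4 gives the same result.

```go
package main

import "fmt"

// maxContainers returns how many containers fit under a thread limit when the
// daemon itself uses baseline threads and each container adds perContainer.
func maxContainers(limit, baseline, perContainer int) int {
	if perContainer <= 0 || limit < baseline {
		return 0
	}
	return (limit - baseline) / perContainer
}

func main() {
	// baseline=1 is an assumption, see above.
	fmt.Println(maxContainers(10000, 1, 4)) // prints 2499
}
```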

Using a Go thread dump (SIGQUIT), I noticed some of them are related to logging, even though we ran with --log-driver=none in an attempt to get further. I guess the docker design here is to always collect logs and then dispatch them to the "none" log driver, which is a no-op, rather than to fully disable the logging feature.
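
For readers who have never seen such a dump, the sketch below produces the same kind of output from inside a Go program with runtime.Stack. It lists goroutines rather than OS threads, and it is only an illustration: it is not how the docker daemon was inspected.

```go
package main

import (
	"fmt"
	"runtime"
)

// dumpGoroutines returns the stack traces of all goroutines of the current
// process, growing the buffer until the whole dump fits.
func dumpGoroutines() string {
	buf := make([]byte, 1<<16)
	for {
		n := runtime.Stack(buf, true) // true: include every goroutine
		if n < len(buf) {
			return string(buf[:n])
		}
		buf = make([]byte, 2*len(buf)) // dump was truncated, retry with more room
	}
}

func main() {
	block := make(chan struct{})
	for i := 0; i < 3; i++ {
		go func() { <-block }() // goroutines parked on a channel receive
	}
	runtime.Gosched() // give them a chance to start
	fmt.Print(dumpGoroutines())
	close(block)
}
```

Each entry starts with a "goroutine N [state]:" header followed by its call stack, which is how the logging-related stacks can be spotted.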

So, 2499 is our best official score under the RpiDocker Challenge rules.


More

We also wanted to know the upper limit. We ran experiments with the plain httpd webserver without docker and were able to run 27000 of them on the Raspberry Pi. The Docker daemon actually grows in memory usage and at some point has such a bad impact on the system that you can't run more processes. Please note this isn't a relevant argument against docker on production systems, unless your business is to run thousands of containers on an extra-small server.

So, we hacked the docker source code to force the MaxThread limit to 12000, built an ARM docker executable and ran the script. We were able to run ~2740 web servers before we reached our first real OOM:

[21112.371259] INFO: rcu_preempt detected stalls on CPUs/tasks:

[21112.377124]  Tasks blocked on level-0 rcu_node (CPUs 0-3):
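
For reference, the Go runtime exposes this thread limit through runtime/debug.SetMaxThreads. The exact patch applied to docker is not shown in this post, so the sketch below only demonstrates the public API with the 12000 value mentioned above; the wrapper name is made up.

```go
package main

import (
	"fmt"
	"runtime/debug"
)

// setThreadLimit changes the maximum number of OS threads the Go runtime may
// use and returns the previous setting (10000 by default).
func setThreadLimit(n int) int {
	return debug.SetMaxThreads(n)
}

func main() {
	prev := setThreadLimit(12000)
	fmt.Println("previous limit:", prev)
}
```

Raising the limit only moves the wall: the program still crashes if it needs more threads than the new value, and every extra thread costs memory.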

What's next?

We'd like to better understand the Docker threading model and discuss this issue with the docker core team. Using non-blocking IO might be an option to rely on a minimal set of threads. I have no idea yet how Golang handles NIO; I just know it's a pain in Java, so I wouldn't do it unless I had good reasons to...
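
As a starting point for that investigation, here is a small experiment. In Go, network code is written in a blocking style, and the runtime parks a goroutine waiting on a socket using the operating system's poller (epoll on Linux) instead of dedicating an OS thread to it. The sketch parks one goroutine per loopback connection and reports how many goroutines exist at that point; the function name and the connection count are arbitrary.

```go
package main

import (
	"fmt"
	"net"
	"runtime"
	"sync"
)

// blockedReaders opens n loopback TCP connections, parks one goroutine in a
// blocking Read on each, and reports how many goroutines exist at that point.
func blockedReaders(n int) (int, error) {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return 0, err
	}
	defer ln.Close()

	var started, done sync.WaitGroup
	var conns []net.Conn
	defer func() {
		for _, c := range conns {
			c.Close() // closing unblocks the pending reads
		}
		done.Wait()
	}()

	for i := 0; i < n; i++ {
		client, err := net.Dial("tcp", ln.Addr().String())
		if err != nil {
			return 0, err
		}
		conns = append(conns, client)
		server, err := ln.Accept()
		if err != nil {
			return 0, err
		}
		conns = append(conns, server)

		started.Add(1)
		done.Add(1)
		go func(c net.Conn) {
			defer done.Done()
			started.Done()
			buf := make([]byte, 1)
			c.Read(buf) // blocking-style read, returns once c is closed
		}(server)
	}
	started.Wait()
	return runtime.NumGoroutine(), nil
}

func main() {
	g, err := blockedReaders(100)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("goroutines while readers are parked:", g)
}
```

Blocking system calls and cgo calls are a different story, since those can tie up a real thread each, which may be one place to look for the 4 threads per container.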

5 comments:

David Gageot said…

Nice! And Crazy. But nice!

Emmanuel Lécharny said…

10 000 threads limitation is a bit stupid, especially when it's a language limitation. When you are doing blocking IO, with a connected protocol (like LDAP), you might want to be able to deal with more than 10 000 incoming connections, thus you will need as many threads as you have connections.

OTOH, you still can use non-blocking IO, but you will pay a huge performance penalty for doing so (around 30%).

Anonymous said…

Emmanuel, the 10,000 thread limit in Go is 10,000 OS threads, not 10,000 goroutines (of which you can comfortably have hundreds of thousands if you wanted).

In your example (or any blocking IO example really) Go would handle all those incoming connections in a few threads only, and multiplex goroutines onto them as and when connections were unblocked and ready to do something. Goroutines are cheap, while threads are quite expensive, so handling a blocking IO on a per-thread basis would probably be quite slow as you scaled up threads beyond the number of CPUs available on the system.

Nicolas De Loof said…

@Edd is there some doc that explains how goroutines are implemented?

Darren Gordon said…

@Nicolas http://dave.cheney.net/2015/08/08/performance-without-the-event-loop