
At work, we recently switched from Tutum (now Docker Cloud) to Kubernetes. Part of that work was building up a load balancer. Kubernetes has built-in load balancing capabilities, but they only work with Google Compute Engine or AWS, which we are not using. They also require a public IP address for each service, which usually means extra (and unnecessary) costs.

Having previously worked with Tutum's HAProxy image, I figured I could do the same thing with Kubernetes. A quick web search didn't turn up any existing project, so I quickly wrote my own. Basically, we have HAProxy handling all incoming HTTP(S) connections and passing them off to different services based on the Host header. A watcher monitors Kubernetes for relevant changes, such as new or deleted pods, and updates the HAProxy configuration so that requests always go to the right place. I also improved the setup by adding a Varnish cache for some of our services. Here's how it all works.

We have two sets of pods: a set of HAProxy pods and a set of Varnish pods. Each pod has a Python process that watches etcd for Kubernetes changes, updates the appropriate (HAProxy or Varnish) configuration, and tells HAProxy/Varnish about the new configuration. Why do we watch etcd instead of using the Kubernetes API directly? Because, as far as I can tell, the Kubernetes API only lets you watch one type of object (pods, ConfigMaps, Secrets, and so on) at a time, whereas we need to watch several types at once. Going through the Kubernetes API would mean making multiple simultaneous watch requests, which would just make things more complicated.
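
To give a rough idea of what that looks like, here's a minimal sketch of the watch loop using etcd's v2 HTTP API. The etcd URL, the /registry prefix, the file paths, and the bare-bones reload are simplifications for illustration; config rendering is left out (see the Jinja2 sketch below), and the real code has to handle errors and re-syncs.

```python
import os
import subprocess

import requests

# Assumed values for illustration -- not necessarily what runs in production.
ETCD_URL = "http://127.0.0.1:4001/v2/keys/registry"   # etcd v2 API, k8s prefix
HAPROXY_CFG = "/etc/haproxy/haproxy.cfg"
HAPROXY_PID = "/var/run/haproxy.pid"


def reload_haproxy():
    """Soft reload: start new HAProxy processes and hand off from the old ones."""
    old_pids = []
    if os.path.exists(HAPROXY_PID):
        with open(HAPROXY_PID) as f:
            old_pids = f.read().split()
    subprocess.check_call(
        ["haproxy", "-f", HAPROXY_CFG, "-p", HAPROXY_PID, "-sf"] + old_pids)


def watch_forever():
    # Ask etcd for the current index so we know where to start watching.
    resp = requests.get(ETCD_URL, params={"recursive": "true"})
    index = int(resp.headers["X-Etcd-Index"])

    while True:
        # Long-poll until anything under the Kubernetes prefix changes.
        resp = requests.get(ETCD_URL, params={
            "wait": "true", "recursive": "true", "waitIndex": index + 1})
        index = resp.json()["node"]["modifiedIndex"]

        # ...regenerate the HAProxy (or Varnish) configuration from the
        # current cluster state here (see the Jinja2 sketch below)...
        reload_haproxy()


if __name__ == "__main__":
    watch_forever()
```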

Unlike Tutum's HAProxy image, which only allows you to change certain settings through environment variables, our entire configuration is generated from Jinja2 templates. This gives us a lot more flexibility, including being able to plug in Varnish fairly easily without making any code changes to the HAProxy configurator. Also, configuration variables for services are stored in their own ConfigMap, rather than as environment variables on the target pods, which allows us to make configuration changes without restarting those pods.
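
As a sketch of what that templating looks like (the template and the service data here are heavily simplified stand-ins; the real templates are in the repositories linked at the end):

```python
import jinja2

# A drastically simplified stand-in for the real HAProxy template.
HAPROXY_TEMPLATE = """\
frontend http-in
    bind *:80
{% for svc in services %}
    use_backend {{ svc.name }} if { hdr(host) -i {{ svc.host }} }
{% endfor %}
{% for svc in services %}
backend {{ svc.name }}
{% for ip in svc.endpoints %}
    server {{ svc.name }}-{{ loop.index }} {{ ip }}:{{ svc.port }} check
{% endfor %}
{% endfor %}
"""

# In the real setup this data comes from watching etcd, with per-service
# settings pulled from each service's own ConfigMap.
services = [
    {"name": "blog", "host": "blog.example.com", "port": 8080,
     "endpoints": ["10.2.1.4", "10.2.3.7"]},
]

print(jinja2.Template(HAPROXY_TEMPLATE).render(services=services))
```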

When combining HAProxy and Varnish, one question to ask is how to arrange them: HAProxy in front of Varnish, or Varnish in front of HAProxy? We are using a setup similar to the one recommended on the HAProxy blog. In that setup, HAProxy handles all requests and passes non-cacheable requests directly to the backend servers. Cacheable requests are, of course, passed to Varnish. If Varnish has a cache miss, it passes the request back to HAProxy, which then hands it off to the backend server. As the article points out, a cache miss involves a lot of hops, but misses should be very infrequent since Varnish only sees cacheable content. One main difference between our setup and the one in the article is that in the article, HAProxy listens on two IP addresses: one for requests coming from the public, and one for requests coming from Varnish. In our setup, we don't have two IP addresses for HAProxy to use. Instead, Varnish adds a request header indicating that the request came from it, and HAProxy checks for that header.
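
Written out as plain logic, the flow looks something like this. This is purely illustrative: the actual decision is made by HAProxy ACLs in the generated config, and the header name and backend names here are made up.

```python
CACHEABLE_HOSTS = {"blog.example.com"}                    # served via Varnish
BACKENDS = {"blog.example.com": "blog", "api.example.com": "api"}


def choose_backend(method, headers):
    """The routing decision HAProxy effectively makes for each request."""
    host = headers.get("Host", "")

    # A cache miss coming back from Varnish carries a marker header, so it
    # goes straight to the real service instead of looping through Varnish.
    if "X-From-Varnish" in headers:
        return BACKENDS[host]

    # Cacheable traffic is handed to Varnish first.
    if method == "GET" and host in CACHEABLE_HOSTS:
        return "varnish"

    # Everything else goes directly to the service.
    return BACKENDS[host]


assert choose_backend("GET", {"Host": "blog.example.com"}) == "varnish"
assert choose_backend("GET", {"Host": "blog.example.com",
                              "X-From-Varnish": "1"}) == "blog"
assert choose_backend("POST", {"Host": "api.example.com"}) == "api"
```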

At first, I set the Python process as the pod's command (the pod's PID 1), but ran into a slight issue. HAProxy reloads its configuration by, well, not reloading its configuration: it starts a new set of processes with the new configuration and lets the old ones finish up and exit. Since the Python process wasn't reaping its children, we ended up with a lot of zombie processes. To fix this, I could have changed the Python process to reap the zombies, but it was easier to just use Yelp's dumb-init instead.
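
For reference, the reaping approach we skipped would have looked roughly like this inside the Python process (standard PID 1 behaviour, sketched under the assumption that nothing else is handling SIGCHLD):

```python
import os
import signal


def reap_children(signum, frame):
    # As PID 1, this process inherits every orphaned process, so on SIGCHLD
    # we keep calling waitpid() until there is nothing left to collect.
    while True:
        try:
            pid, _status = os.waitpid(-1, os.WNOHANG)
        except OSError:      # no children at all
            return
        if pid == 0:         # children exist, but none have exited yet
            return


signal.signal(signal.SIGCHLD, reap_children)
```

dumb-init already does this (plus proper signal forwarding), so it wasn't worth reimplementing.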

We have the HAProxy pods managed as a DaemonSet, so HAProxy runs on every node, and the pods are set to use host networking for better performance. HAProxy itself is small enough that, at least with our current traffic, it doesn't affect the nodes much, so running it on every node isn't a problem for us right now. If we get enough traffic that it does make a difference, we can dedicate a node to it without much trouble. One nice thing about this setup is that, even though it uses Kubernetes, HAProxy and Varnish don't need to be managed by Kubernetes; they just need to be able to talk to etcd. So if we ever need a dedicated load balancer, we can spin up a node (or nodes) that just runs HAProxy and/or Varnish, say, using a DaemonSet and a nodeSelector. Varnish is managed as a normal Kubernetes deployment and uses the normal container networking, so there's a bit of overhead there, but that's fine for now. Again, if we have more concerns about performance, we can change our configuration easily enough.

It all seems to be working fairly well so far. There are some configuration tweaks that I'll have to go make, and there's one strange issue where Varnish doesn't like one of our services and just returns an empty response. But other than that, Varnish and HAProxy are just doing what they're supposed to do.

All the code is available on GitHub (HAProxy, Varnish).