April 29, 2016

Let's encrypt errata

10:06 -0400

Back in February, I posted about Automatic Let's Encrypt certificates on nginx. One of the scripts had a problem: it downloaded the Let's Encrypt X1 intermediate certificate. Let's Encrypt recently switched to their X3 intermediate, which meant that Firefox was unable to reach sites using the generated certificates, and Chrome/IE/Safari needed to make an extra download to verify the certificate.

Of course, instead of just changing the script to download the X3 certificate, it's best to automatically download the right certificate. So I whipped up a quick Python script, cert-chain-resolver-py (inspired by the Go version) that checks a certificate and downloads the other certificates in the chain.

I've updated my original blog post. The changed script is /usr/local/sbin/letsencrypt-renew, and of course you'll need to install cert-chain-resolver-py (the script expects it to be in /opt/cert-chain-resolver-py).

0 Comments
April 8, 2016

Antagonistic Co-operation

08:56 -0400

This article was originally written for our housing co-operative's newsletter. Even though it was written in the context of a housing co-operative, I think the idea is useful in other contexts as well.

-

The word "antagonistic" and its relatives generally have negative connotations. Nobody likes to be antagonized. In literature, the antagonist in a story works against the protagonist or main character, so we do not like to see the antagonist succeed. However, antagonism can be essential in some cases. Many of our muscles come in what are called "antagonistic pairs," without which you would not be able to move. Muscles can only pull (by contracting) and relax; muscles cannot push. If you only had biceps, you would only be able to bend your arm; you also need your triceps in order to be able to straighten your arm. Your basic movements rely on muscles that oppose each other, yet work together to allow you to walk, lift, or write.

But sometimes our muscles do not work as they should. If you have ever experienced a cramp, you know how painful this can be sometimes. A cramp happens when a muscle suddenly tightens and will not loosen. Many cases of back pain are also due to muscles that fail to relax as they should. Some people require regular massage therapy due to pain caused by tight muscles.

As a co-operative, we should strive to operate like a well functioning body. As members of our co-operative, we all have different opinions and priorities, and we pull our co-operative in different directions. Some people may be more focused on providing activities for our children, and some are more focused on helping our elders adapt to new challenges. Some people prefer to be frugal, while others may wish to spend money to improve the quality of life here. Some people value a strict adherence to our bylaws, while others adopt a more "live and let live" attitude. Each of these views is welcome in our co-operative, and we should celebrate our differences. Indeed, without different opinions pulling us in different directions, our co-operative would be as lifeless as a skeleton with no muscles.

But in order for our co-operative to get anywhere, we must be willing, not just to pull in the direction that we want to go, but also to sometimes let go when others are pulling in a different direction. Sometimes we must allow other members to go ahead with their opinions and priorities without getting in their way.

Unlike our bodies, however, our co-operative does not have a central "brain" coordinating our actions, telling us when to pull and when to let go. Instead we must, as a co-operative, come to an agreement among ourselves. We must communicate with each other, and come to understand the perspectives of other members. Then we can decide when each member should have an opportunity to pull so that we do not prevent our co-op from moving forward by pulling in opposite directions at the same time.

We often see people with opposing viewpoints as adversaries. But while we may be antagonistic, we can still be co-operative.

-

This article may be copied under the terms of the Creative Commons Attribution-ShareAlike license.

0 Comments
March 23, 2016

leftpad improved

13:26 -0400

Improved versions of left-pad

Shorter version (removed an unneeded variable and switched to a for loop: 9 lines of code instead of 11):

module.exports = leftpad;

function leftpad (str, len, ch) {
  str = String(str);

  if (!ch && ch !== 0) ch = ' ';

  for (len -= str.length; len > 0; len--) {
    str = ch + str;
  }

  return str;
}

Faster version (performs only O(log n) concatenations, where n is the number of characters needed to pad to the right length):

module.exports = leftpad;

function leftpad (str, len, ch) {
  str = String(str);

  if (!ch && ch !== 0) ch = ' ';
  ch = String(ch);

  len -= str.length;

  while (len > 0) {
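    // if the lowest remaining bit of len is set, prepend the current pad chunk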
    if (len & 1) {
      str = ch + str;
    }
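    // move to the next bit: halve the remaining length and double the pad chunk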
    len >>>= 1;
    ch += ch;
  }

  return str;
}

ES6 version (which may be faster if String.prototype.repeat is implemented natively):

module.exports = leftpad;

function leftpad (str, len, ch) {
  str = String(str);

  if (!ch && ch !== 0) ch = ' ';
  ch = String(ch);

  len -= str.length;

  if (len > 0) str = ch.repeat(len) + str;

  return str;
}

Of course, you could combine the last two by detecting whether String.prototype.repeat is defined:

module.exports = String.prototype.repeat ?
function leftpad (str, len, ch) {
  str = String(str);

  if (!ch && ch !== 0) ch = ' ';
  ch = String(ch);

  len -= str.length;

  if (len > 0) str = ch.repeat(len) + str;

  return str;
}
:
function leftpad (str, len, ch) {
  str = String(str);

  if (!ch && ch !== 0) ch = ' ';
  ch = String(ch);

  len -= str.length;

  while (len > 0) {
    if (len & 1) {
      str = ch + str;
    }
    len >>>= 1;
    ch += ch;
  }

  return str;
}
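
For reference, all of the versions above behave identically; here's a quick usage sketch (the require path is just an assumption):

var leftpad = require('./leftpad');

leftpad('foo', 5);     // "  foo"
leftpad(17, 5, 0);     // "00017"
leftpad('foobar', 3);  // "foobar" (already long enough, so unchanged)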

As with the original left-pad, this code is released under the WTFPL.

(See also pad-left, which uses another dependency, written by the same author, for repeating strings.)

0 Comments
February 18, 2016

Automating Let's Encrypt certificates on nginx

17:57 -0500

Let's Encrypt is a new Certificate Authority that provides free SSL certificates. It is designed to be automated, so that certificates can be renewed without manual intervention. We're using Let's Encrypt certificates for our set of free Calculus practice problems. Our front end is currently served by an Ubuntu server running nginx, and here's how we have it scripted on that machine. In a future post, I'll describe how it's automated on our Docker setup with HAProxy.

First of all, we're using acme-tiny instead of the official Let's Encrypt client, since it's much smaller and, IMHO, easier to use. It takes a bit more work to set up, but it works well once everything is in place.

We installed acme-tiny in /opt/acme-tiny, and created a new letsencrypt user. The letsencrypt user is only used to run the acme-tiny client with reduced privileges. In theory, you could run the entire renewal process as a reduced-privilege user, but the rest of the process is just basic shell commands, and my paranoia level is not that high.

We then installed cert-chain-resolver-py into /opt/cert-chain-resolver-py. This script requires the pyOpenSSL library; on Debian/Ubuntu systems, it's provided by the python-openssl package.

We created an /opt/acme-tiny/challenge directory, owned by the letsencrypt user, and we created /etc/acme-tiny with the following contents:

  • account.key: the account key created in step 1 from the acme-tiny README. This file should be readable only by the letsencrypt user.
  • certs: a directory containing a subdirectory for each certificate that we want. Each subdirectory should have a domain.csr file, which is the certificate signing request created in step 2 from the acme-tiny README. The certs directory should be publicly readable, and the subdirectories should be writable by the user that the cron job will run as (which does not have to be the letsencrypt user).
  • private: a directory containing a subdirectory for each certificate that we want, like we had with the certs directory. Each subdirectory has a file named privkey.pem, which will be the private key associated with the certificate. To match the common setup on Debian systems, the private directory should be readable only by the ssl-cert group.

Instead of creating the CSR files as described in the acme-tiny README, I created a script called gen_csr.sh:

#!/bin/bash
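# generate a CSR listing every domain in /etc/acme-tiny/certs/<name>/domains as a subjectAltName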
openssl req -new -sha256 -key /etc/acme-tiny/private/"$1"/privkey.pem -subj "/" -reqexts SAN -config <(cat /etc/ssl/openssl.cnf <(printf "[SAN]\nsubjectAltName=DNS:") <(cat /etc/acme-tiny/certs/"$1"/domains | sed "s/\\s*,\\s*/,DNS:/g")) > /etc/acme-tiny/certs/"$1"/domain.csr

The script is invoked as gen_csr.sh <name>. It reads a file named /etc/acme-tiny/certs/<name>/domains, which is a text file containing a comma-separated list of domains, and it writes the /etc/acme-tiny/certs/<name>/domain.csr file.

Now we need to configure nginx to serve the challenge files. We created a /etc/nginx/snippets/acme-tiny.conf file with the following contents:

location /.well-known/acme-challenge/ {
    auth_basic off;
    alias /opt/acme-tiny/challenge/;
}

(The "auth_basic off;" line is needed because some of our virtual hosts on that server use basic HTTP authentication.) We then added the line "include snippets/acme-tiny.conf;" to each site in /etc/nginx/sites-enabled that should use a Let's Encrypt certificate.

After this is set up, we created a /usr/local/sbin/letsencrypt-renew script that will be used to request a new certificate:

#!/bin/sh
set +e

# only renew if certificate will expire within 20 days (=1728000 seconds)
openssl x509 -checkend 1728000 -in /etc/acme-tiny/certs/"$1"/cert.pem && exit 255

set -e
DATE=`date +%FT%R`
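# request a signed certificate from Let's Encrypt via acme-tiny, running as the unprivileged letsencrypt user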
su letsencrypt -s /bin/sh -c "python /opt/acme-tiny/acme_tiny.py --account-key /etc/acme-tiny/account.key --csr /etc/acme-tiny/certs/\"$1\"/domain.csr --acme-dir /opt/acme-tiny/challenge/" > /etc/acme-tiny/certs/"$1"/cert-"$DATE".pem
ln -sf cert-"$DATE".pem /etc/acme-tiny/certs/"$1"/cert.pem
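# fetch the intermediate certificate(s) needed to complete the chain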
python /opt/cert-chain-resolver-py/cert-chain-resolver.py -o /etc/acme-tiny/certs/"$1"/chain-"$DATE".pem -i /etc/acme-tiny/certs/"$1"/cert.pem -n 1
ln -sf chain-"$DATE".pem /etc/acme-tiny/certs/"$1"/chain.pem
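# assemble the full chain (certificate plus intermediates) that nginx will serve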
cat /etc/acme-tiny/certs/"$1"/cert-"$DATE".pem /etc/acme-tiny/certs/"$1"/chain-"$DATE".pem > /etc/acme-tiny/certs/"$1"/fullchain-"$DATE".pem
ln -sf fullchain-"$DATE".pem /etc/acme-tiny/certs/"$1"/fullchain.pem

The script will only request a new certificate if the current certificate will expire within 20 days. The certificates are stored in /etc/acme-tiny/certs/<name>/cert-<date>.pem (symlinked to /etc/acme-tiny/certs/<name>/cert.pem). The full chain (including the intermediate CA certificate) is stored in /etc/acme-tiny/certs/<name>/fullchain-<date>.pem (symlinked to /etc/acme-tiny/certs/<name>/fullchain.pem).

If you have pyOpenSSL version 0.15 or greater, you can replace the -n 1 option for cert-chain-resolver.py with something like -t /etc/ssl/certs/ca-certificates.crt, pointing at a bundle of trusted CA certificates in PEM format.

As-is, the script must be run as root, since it does a su to the letsencrypt user. It should be trivial to modify it to use sudo instead, so that it can be run by any user that has the appropriate permissions on /etc/acme-tiny.

The letsencrypt-renew script is run by another script that restarts the necessary servers if needed. For us, that script looks like this:

#!/bin/sh

letsencrypt-renew sbscalculus.com

RV=$?

set -e

if [ $RV -eq 255 ] ; then
  # renewal not needed
  exit 0
elif [ $RV -eq 0 ] ; then
  # restart servers
  service nginx reload;
else
  exit $RV;
fi

This is then called from a cron job of the form chronic /usr/local/sbin/letsencrypt-renew-and-restart. chronic is a tool from the moreutils package that runs a command and only passes its output through if the command fails. Since the renewal script checks whether the certificate is about to expire, we run the cron job daily.

Of course, once you have the certificate, you want to tell nginx to use it. We have another file in /etc/nginx/snippets that, aside from setting various SSL parameters, includes

ssl_certificate /etc/acme-tiny/certs/sbscalculus.com/fullchain.pem;
ssl_certificate_key /etc/acme-tiny/private/sbscalculus.com/privkey.pem;

This is the setup we use for one of our servers. I tried to keep it fairly general, and it should be easy to adapt to other setups.

Update (Apr. 29, 2016): Let's Encrypt changed their intermediate certificate, so the old instructions for downloading the intermediate certificate are incorrect. Instead of using a static location for the intermediate certificate, it's best to use a tool such as https://github.com/muchlearning/cert-chain-resolver-py to fetch the correct intermediate certificate. The instructions have been updated accordingly.

0 Comments
January 27, 2016

Automating browser-side unit tests with nodeunit and PhantomJS

11:24 -0500

I love unit tests, but they're only useful if they get run. For one of my projects at work, I have a set of server-side unit tests, and a set of browser-side unit tests. The server-side unit tests get run automatically on "git push" via Buildbot, but the browser-side tests haven't been run for a long time because they don't work in Firefox, which is my primary browser, due to differences in the way it iterates through object keys.

Of course, automation would help, in the same way that automating the server-side tests ensured that they were run regularly. Enter PhantomJS, which is a scriptable headless WebKit environment. Unfortunately, even though PhantomJS can support many different testing frameworks, there is no existing support for nodeunit, which is the testing framework that I'm using in this particular project. Fortunately, it isn't hard to script support for nodeunit.

nodeunit's built-in browser support just dynamically builds a web page with the test results and a test summary. If we ran it as-is in PhantomJS, it would happily run the tests for us, but we wouldn't be able to see the results, and it would just sit there doing nothing when it was done. What we want is for the test results to be output to the console, and for the process to exit when the tests are done (with an error code if any tests failed). To do this, we will create a custom nodeunit reporter that will communicate with PhantomJS.

First, let's deal with the PhantomJS side. Our custom nodeunit reporter will use console.log to print the test results, so we will pass through console messages in PhantomJS.

page.onConsoleMessage = function (msg) {
    console.log(msg);
};

We will use PhantomJS's callback functionality to signal the end of the tests. The callback data will just be an object containing the total number of assertions, the number of failed assertions, and the time taken.

page.onCallback = function (data) {
    if (data.failures)
    {
        console.log("FAILURES: " + data.failures + "/" + data.length + " assertions failed (" + data.duration + "ms)")
    }
    else
    {
        console.log("OK: " + data.length + " assertions (" + data.duration + "ms)");
    }
    phantom.exit(data.failures);
};

(Warning: the callback API is marked as experimental, so may be subject to change.)

If the test page fails to load for whatever reason, PhantomJS will just sit there doing nothing, which is not desirable behaviour, so we will exit with an error if anything goes wrong.

phantom.onError = function (msg, trace) {
    console.log("ERROR:", msg);
    for (var i = 0; i < trace.length; i++)
    {
        var t = trace[i];
        console.log(i, (t.file || t.sourceURL) + ': ' + t.line + (t.function ? ' (in function ' + t.function + ')' : ""));
    }
    phantom.exit(1);
};
page.onError = function (msg, trace) {
    console.log("ERROR:", msg);
    for (var i = 0; i < trace.length; i++)
    {
        var t = trace[i];
        console.log(i, (t.file || t.sourceURL) + ': ' + t.line + (t.function ? ' (in function ' + t.function + ')' : ""));
    }
    phantom.exit(1);
};
page.onLoadFinished = function (status) {
    if (status !== "success")
    {
        console.log("ERROR: page failed to load");
        phantom.exit(1);
    }
};
page.onResourceError = function (resourceError) {
    console.log("ERROR: failed to load " + resourceError.url + ": " + resourceError.errorString + " (" + resourceError.errorCode + ")");
    phantom.exit(1);
};
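
For completeness, here's a rough sketch of how these pieces might be tied together in the driver script; the page creation, the test page URL, and the timeout are assumptions for illustration rather than part of the original setup:

var page = require('webpage').create();

// ...attach the onConsoleMessage, onCallback, onError, onLoadFinished,
// and onResourceError handlers shown above...

// load the PhantomJS-specific test page (the URL is a placeholder)
page.open("http://localhost:8000/test-phantom.html");

// safety net in case the tests hang; the five-minute limit is arbitrary
setTimeout(function () {
    console.log("ERROR: tests timed out");
    phantom.exit(1);
}, 5 * 60 * 1000);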

Now for the nodeunit side. The normal test page looks like this:

<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <title>ML Editor Test Suite</title>
    <link rel="stylesheet" href="stylesheets/nodeunit.css" type="text/css" />
    <script src="javascripts/module-requirejs.js" type="text/javascript"></script>
    <script src="javascripts/requirejs-config.js" type="text/javascript"></script>
    <script data-main="test" src="javascripts/require.js" type="text/javascript"></script>
  </head>
  <body>
    <h1 id="nodeunit-header">ML Editor Test Suite</h1>
  </body>
</html>

If you're not familiar with RequireJS pages, the <script data-main="test" src="javascripts/require.js" type="text/javascript"></script> line means that the main JavaScript file is called "test.js". We want to use the same script file for both a normal browser test and the PhantomJS-based test, so in PhantomJS, we will set window.nodeunit_reporter to our custom reporter. In "test.js", then, we will check for window.nodeunit_reporter, and if it is present, we will replace nodeunit's default reporter. Although there's no documented way of changing the reporter in the browser version of nodeunit, looking at the code, it's pretty easy to do.

if (window.nodeunit_reporter) {
    nodeunit.reporter = nodeunit_reporter;
    nodeunit.run = nodeunit_reporter.run;
}

(Disclaimer: since this uses an undocumented interface, it may break some time in the future.)

So what does a nodeunit reporter look like? It's just an object with two properties: info (a textual description) and run, a function that calls the nodeunit runner with a set of callbacks. I based the reporter on a combination of nodeunit's default console reporter and its browser reporter.

window.nodeunit_reporter = {
    info: "PhantomJS-based test reporter",
    run: function (modules, options) {
        var opts = {
            moduleStart: function (name) {
                console.log("\n" + name);
            },
            testDone: function (name, assertions) {
                if (!assertions.failures())
                {
                    console.log("✔ " + name);
                }
                else
                {
                    console.log("✖ " + name);
                    assertions.forEach(function (a) {
                        if (a.failed()) {
                            console.log(a.message || a.method || "no message");
                            console.log(a.error.stack || a.error);
                        }
                    });
                }
            },
            done: function (assertions) {
                window.callPhantom({failures: assertions.failures(), duration: assertions.duration, length: assertions.length});
            }
        };
        nodeunit.runModules(modules, opts);
    }
};

Now in PhantomJS, I just need to get it to load a modified test page that sets window.nodeunit_reporter before loading "test.js", and voilà, I have browser tests running on the console. All that I need to do now is to add it to my Buildbot configuration, and then I will be alerted whenever I break a browser test.

The script may or may not work in SlimerJS, allowing the tests to be run in a Gecko-based rendering engine, but I have not tried it since, as I said before, my tests don't work in Firefox. One main difference, though, is that SlimerJS doesn't honour the exit code, so Buildbot would need to parse the output to determine whether the tests passed or failed.

0 Comments
January 18, 2016

When native code is slower than interpreted code

16:56 -0500

At work, I'm working on a document editor, and it needs to be able to read in HTML data. Well, that's simple, right? We're in a browser, which obviously is able to parse HTML, so just offload the HTML parsing to the browser, and then traverse the DOM tree that it creates.

var container = document.createElement("div");
container.innerHTML = html;

The browser's parser is native code, built to be robust and well tested. What could go wrong?
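
For what it's worth, here's a rough way to measure that step on its own (html is assumed to already hold the markup to parse, and your numbers will vary):

console.time("innerHTML parse");
var container = document.createElement("div");
container.innerHTML = html;
console.timeEnd("innerHTML parse");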

Unfortunately, going this route, it ended up taking about 70 seconds to parse a not-very-big document on my four-year-old laptop. 70 seconds. Not good.

Switching to a JavaScript-based HTML parser saw the parsing time drop down to about 9 seconds. Further code optimizations in other places brought it down to about 3 seconds. Not too bad.

So why is the JavaScript parser faster than the browser's native parser? Without digging into what the browser is actually doing, my best guess is that the browser isn't just parsing the HTML, but is also calculating styles, layouts, etc. This guess seems to be supported by the fact that not all HTML is parsed slowly; some other HTML of similar size is parsed very quickly (faster than using the JavaScript-based parser). But it can't be the whole story, because the browser is able to display that same HTML fairly quickly.

I may have to investigate further, but I guess the moral of the story is not to assume that offloading work to native code is always the fastest solution.

0 Comments
November 19, 2015

My development-to-production workflow so far

12:04 -0500

As a follow-up to my previous post, I've fleshed out my CI/(almost-)CD workflow a bit more. I write "(almost-)CD", because I've decided that I don't really want deployment to production to be completely automatic; I want to be able to manually mark a build as ready for production, at least for now.

When we last left off, I had Buildbot watching our git repository, and when it detected a change, it ran unit tests, and if the tests passed, updated the code on our VPS and triggered a reload. Since then, I've added email notifications (for all failures, and for success on the final step), and we've switched over to a Docker-based deployment. Here's what the process looks like now:

Buildbot still watches our git repository. When it detects a change, it checks out the code, and runs a docker build on a remote Docker instance to build a new image of the application; the buildbot slave is still running on our VPS, which does not support Docker, so we need to run Docker on a separate host. I could have also created a new buildbot slave on a Docker-capable host, but that seems like it would have been more work.

The new image is tagged with the git commit hash, as well as the "staging" tag. Next Buildbot runs the unit tests on the image itself. If the tests all pass, it pushes the image to our Docker repository on Tutum with the "staging" tag. Tutum watches for changes in the Docker repository, and when a new "staging" image is pushed, it redeploys our staging server. In the meantime, Buildbot sends me an email telling me that the build has passed.

Up to this point, everything since the git push has been automatic. After I get the email from Buildbot, I do a quick sanity check on our staging server, just in case the unit tests missed anything, and if all goes well, I re-push the Docker image, but this time with the "latest" tag. Again, Tutum will notice a new "latest" image, and this time redeploys our production server.

There are a couple of things I really like about this setup. First of all, the number of manual steps involved is minimal; the only things I do after pushing the code are checking the staging site and re-pushing the image. Everything else is done automatically, which means that there's less chance of me forgetting to do something. Secondly, by using Docker images, I'm sure that the staging and production environments are exactly the same (or at least as close as possible).

One downside is that redeploying an image means there's a slight amount of downtime. This can be solved by using Blue/Green deployment instead of production/staging, but it's a bit more complicated to set up. This will probably be the next thing for me to look into.

0 Comments
September 17, 2015

simple process respawning

16:29 -0400

File this under "I can't believe how long it took me to figure it out".

I'm Dockerizing some of our services at work. One of them (by design) kills itself after handling a number of requests. Of course, I want it to restart after it kills itself. Most solutions seem like overkill. Using a service supervisor like runit is great, but requires too much setup for monitoring just a single process. Forever is probably a good option, but I don't want to have to install Node.js in the image just to monitor it. Not to mention, Node.js has a non-trivial memory footprint.

Basically, I want something small and simple. No extra dependencies, minimal extra setup, minimal extra resource usage. After too much time looking for a solution, I came up with a 5-line shell script:

#!/bin/sh
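# run the given command in an infinite loop, restarting it whenever it exits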
while :
do
    "$@"
done

Name it forever.sh, put it somewhere in your PATH, and use it as forever.sh <command> (e.g. forever.sh server -p 8080). It's just an infinite loop that executes its arguments until it gets killed.

0 Comments
August 26, 2015

Limiting concurrency in Node.js with promises

21:41 -0400

The nice thing about Node.js is its asynchronous execution model, which means that it can handle many requests very quickly. The flip side of this is that it can also generate many requests very quickly, which is fine if they can then be handled quickly, and not so good when they can't. For one application that I'm working on, some of the work gets offloaded to an external process; a new process is created for each request. (Unfortunately, that's the architecture that I'm stuck with for now.) And when doing a batch operation, Node.js will happily spawn hundreds of processes at once, without caring that doing so will cause everything on my workstation to slow to a crawl.

Limiting concurrency in Node.js has been written about elsewhere, but I'd like to share my promise-based version of this solution. In particular, this was built for the bluebird flavour of promises.

Suppose that we have a function f that performs some task, and returns a promise that is fulfilled when that task is completed. We want to ensure that we don't have too many instances of f running at the same time.

We need to keep track of how many instances are currently running, we need a queue of instances when we've reached our limit, and of course we need to define what our limit is.

var queue = [];
var numRunning = 0;
var max = 10;

Our queue will just contain functions that, when called, will call f with the appropriate arguments, as well as perform the record keeping necessary for calling f. So to process the queue, we just check whether we are below our run limit, check whether the queue is non-empty, and run the function at the front of the queue.

function runnext()
{
    numRunning--;
    if (numRunning < max && queue.length)
    {
        queue.shift()();
    }
}

Now we create a wrapper function f1 that will limit the concurrency of f. We will call f with the same arguments that f1 is called with. If we have already reached our limit, we queue the request; otherwise, we run f immediately. Whenever we run f, whether immediately or later, we must first increment our counter. After f is done, we process the next element in the queue. We must process the queue whether f succeeds or not, and we don't want to change the resolution of f's promise, so we tack a finally onto the promise returned by f.

function f1 ()
{
    var args = Array.prototype.slice.call(arguments);
    return new Promise(function (resolve, reject) {
        function run() {
            numRunning++;
            resolve(f.apply(undefined, args)
                    .finally(runnext));
        }
        if (numRunning >= max)
        {
            queue.push(run);
        }
        else
        {
            run();
        }
    });
}

Of course, if you need to do this a lot, you may want to wrap this all up in a higher-order function. For example:

function limit(f, max)
{
    var queue = [];
    var numRunning = 0;

    function runnext()
    {
        numRunning--;
        if (numRunning < max && queue.length)
        {
            queue.shift()();
        }
    }

    return function ()
    {
        var args = Array.prototype.slice.call(arguments);
        return new Promise(function (resolve, reject) {
            function run() {
                numRunning++;
                resolve(f.apply(undefined, args)
                        .finally(runnext));
            }
            if (numRunning >= max)
            {
                queue.push(run);
            }
            else
            {
                run();
            }
        });
    };
}

This would be used as:

f = limit(f, 10);

which would replace f with a new function that is equivalent to f, except that only 10 instances will be running at a time.
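
As a purely hypothetical illustration (convertFile and the file list below are made up for this example, and Promise.map comes from bluebird), a batch job could then fan out over many files without flooding the machine with processes:

var Promise = require("bluebird");

// stand-in worker: in the real application this would spawn an external process
function convertFile(file)
{
    return Promise.delay(100).return(file + ".done");
}

// never run more than 10 conversions at once
var convertLimited = limit(convertFile, 10);

var files = ["a.doc", "b.doc", "c.doc"];  // stand-in for a real batch
Promise.map(files, convertLimited).then(function (results) {
    console.log("processed " + results.length + " files");
});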

0 Comments
May 24, 2015

First steps in CI/CD with Buildbot

15:59 -0400

At work, I've been looking into Continuous Integration and Continuous Delivery/Deployment. So far, our procedures have been mostly manual, which means that some things take longer than necessary, and sometimes things get missed. The more that can be automated, the less developer time has to be spent on mundane tasks, and the less brain power needed to remember all the steps.

There are many CI solutions out there, and after investigating a bunch of them, I settled on using Buildbot for a few reasons:

  • it can manage multiple codebases for the same project, unlike many of the simpler CI tools. This is important since the back end for the next iteration of our product is based on plugins that live in individual git repositories.
  • it is lightweight enough to run on our low-powered VPS.
  • it has a flexible configuration language (its configuration file is Python code) and is easily extendable.

Right now, we're in development mode for our product, and I want to make sure that our development test site is always running the latest available code. That means combining plugins together, running unit tests, and if everything checks out, deploying. Eventually, my hope is to be able to tag a branch and have our production site update automatically.

The setup

Our code has one main tree, with plugins each in their own directory within a special plugins directory. The development test site should track the master branch on the main tree and all relevant plugins. For ease of deployment (especially for new development environments), we want to use git submodules to pull in all the relevant plugins. However, the master branch will be the basis of all deployments, which may use different plugins, or different versions of plugins, and so should not have any plugins specified in itself. Instead, we have one branch for each deployed version, which includes as submodules the plugins that are used for that build.

The builds

Managing git submodules can be a bit of a pain, especially since we're not developing on the branch that actually contains the submodules: keeping them up to date would require switching branches, pulling in the correct versions of the submodules, and pushing.

The first step in automation, then, is to automatically update the deployment branch whenever a plugin or the main tree is updated. Buildbot has a list of the plugins that are used in a deployment branch, along with the branch that it follows. Each plugin is associated with a repository, and we use the codebase setting in Buildbot to keep the plugins separate. We then have a scheduler listen on the appropriate codebases and trigger a build whenever a push is made. A Buildbot slave then checks out the latest version of the deployment branch, merges in the changes to the main tree and the submodules, and pushes out a new version of the deployment branch.

Naturally, pushes to plugins and to the main tree are generally grouped. For example, changes in one plugin may require, or enable, changes in other plugins. We don't want a separate commit in our deployment branch for each change in each plugin, so we take advantage of Buildbot's ability to merge changes. We also wait for the git repositories to be stable for two minutes before running the build, to make sure that all the changes are caught. This reduces the number of commits we have in our deployment branch, making things a bit cleaner.

When Buildbot pushes out a new version of the deployment branch, this in turn triggers another build in Buildbot. Buildbot checks out the full sources, including submodules, installs the required node modules, installs configuration files for testing, and then runs the unit tests. If the tests all pass, then this triggers yet another build.

The final build checks out the latest sources into the web application directory for the test site, and then notifies the web server (currently using Passenger) to restart the web application.

Next steps

This setup seems to be working fairly well so far, but it isn't complete yet. Since this is a first attempt, I'm sure there are many improvements that can be made, in terms of both flexibility and performance. In particular, since we only have one site being updated, the configuration works fine for now, but it could probably be made more general to make deploying multiple sites easier.

One major issue in the current setup, though, is the lack of notifications. Currently, in order to check the state of the build, I need to view Buildbot's web UI, which is inconvenient. Buildbot has email notification built in, but I just haven't had the chance to set it up yet. When I do set it up, I will likely set it to notify on any failure, as well as whenever a deployment is made.

I'd also like to get XMPP notifications, which isn't built into Buildbot, and so is something that I would have to write myself. Buildbot is based on Twisted, which has an XMPP module built in, so it should be doable. I think the XMPP module is a bit outdated, but we don't need any fancy functionality, so hopefully it will work well enough.

I'm looking into using Docker for deployments once we're ready to push this project to production, so I'll need to look into creating build steps for Docker. The VPS that we're currently using for Buildbot is OpenVZ-based and does not support Docker, so we'd need to put a Buildbot slave on a Docker-capable host for building and testing the Docker images, or even use a Docker container as a Buildbot slave, which would be even better.

There's probably a lot that can be done to improve the output in the UI too. For example, when the unit tests are run, it only reports whether the tests passed or failed. It should be possible to create a custom build step that will report how many tests failed.

Assessment

Although Buildbot seems like the best fit for our setup, it isn't perfect. The main thing that I'd like is better project support. Buildbot allows you to set projects on change sets, but I'd like to be able to set projects on builds as well, in order to filter by projects in the waterfall view.

All in all, Buildbot seems like a worthwhile tool that is flexible, yet easy enough to configure. It's no-nonsense and just does what it claims to do. The documentation is well done, and for simple projects, you should be able to just dive right in without any issues. For more complex projects, it's helpful to understand what's going on before charging right in. Of course, I just charged right in without understanding certain concepts, so I had to redo some stuff to make it work better, but the fact that I was able to get it to work in the first place, even doing it the wrong way, gives some indication of its power.

0 Comments