May 7, 2018

A New Vector for my Career

10:19 -0400

In three weeks, I will be joining the team at New Vector, working with Matrix, an open communications protocol. It's exciting to be working full time on Free/Open Source software again (I used to work for a Moodle partner). Matrix itself is pretty exciting, with features such as federation (the ability to host your own server and communicate with anyone else using Matrix), bridging together different communication networks, and end-to-end encryption.

My tasks at New Vector will be quite varied. At some point I will be working on bridges, but to start with, I'll probably be helping out with some of the more pressing tasks such as spec wrangling (both documenting missing parts of the spec, and working with the community on spec improvements), doing some work on Dendrite, and helping out with some of the outstanding end-to-end encryption UX work.

I've been doing some Matrix-related things in my spare time, and I've been enjoying it, both working with the technology and interacting with the community. But my free time has been quite limited, so I'm really looking forward to being able to work on Matrix full-time starting in a few weeks.

December 1, 2016

Let's Encrypt for Kubernetes

21:08 -0500

A while ago, I blogged about automatic Let's Encrypt certificate renewal with nginx. Since then, I've also set up renewal in our Kubernetes cluster.

Like with nginx, I'm using acme-tiny to do the renewals. For Kubernetes, I created a Docker image. It reads the Let's Encrypt secret key from /etc/acme-tiny/secrets/account.key, and CSR files from /etc/acme-tiny/csrs/{name}.csr. In Kubernetes, these can be set up by mounting a Secrets store and a ConfigMap, respectively. It also reads the current certificates from /etc/acme-tiny/certs/{name}, which should also be set up by mounting a ConfigMap (called certificates), since that is where the container will put the new certificates.
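For reference, those volumes can be populated with plain kubectl commands. This is just a sketch (the Secret and CSR ConfigMap names are made up for illustration; only the certificates ConfigMap is named above):

kubectl create secret generic acme-tiny-account --namespace=lb \
    --from-file=account.key=account.key
kubectl create configmap acme-tiny-csrs --namespace=lb \
    --from-file=sbscalculus.com.csr
kubectl create configmap certificates --namespace=lb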

Starting an acme-tiny pod will start an nginx server to serve the .well-known directory for the Acme challenge. Running /opt/acme-tiny-utils/renew in the pod will renew the certificate if it will expire within 20 days (running it with the -f option will disable the check). Of course, we want the renewal to be automated, so we need a sort of cron task. Kubernetes has had cron jobs since 1.4, but at the time I was setting this up, we were still on 1.3. Kubernetes also runs cron jobs by creating a new pod, whereas the way I want this to work is to run a program in an existing pod (though it could be set up to work the other way too). So I have another cron Docker image, which I have set up to run

kubectl exec `kubectl get pods --namespace=lb -l role=acme -o name | cut -d / -f 2` --namespace=lb ./renew sbscalculus.com

every day. That command finds the acme-tiny pod and runs the renew command, telling it to renew the sbscalculus.com certificate.
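In other words, the cron image's crontab has a single entry along these lines (the time of day is arbitrary):

# illustrative crontab entry in the cron image; runs the renewal check once a day
0 3 * * * kubectl exec `kubectl get pods --namespace=lb -l role=acme -o name | cut -d / -f 2` --namespace=lb ./renew sbscalculus.com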

Now in order for the Acme challenge to work, HTTP requests to /.well-known/acme-challenge/ get redirected to acme-tiny rather than to the regular pods serving those services. Our services are behind our HAProxy image. So I have a 0acmetiny entry (the 0 causes it to be sorted before all other entries) in the services ConfigMap for HAProxy that reads:

    {
      "namespace": "lb",
      "selector": {
        "role": "acme"
      },
      "hostnames": ["^.*"],
      "path": "^/\\.well-known/acme-challenge/.*$",
      "ports": [80]
    }

This causes HAProxy to send all the Acme challenges to the acme-tiny pod, while leaving all the other requests alone.

And that's how we have our certificates automatically renewed from Let's Encrypt.

September 6, 2016

Buildbot latent build slaves

12:28 -0400

I've blogged before about using Buildbot to build our application server. One problem with it is that the build (and testing) process can be memory intensive, which can sometimes exceed the memory that we have available in our Kubernetes cluster. I could add another worker node, but that would be a waste, since we do builds infrequently.

Fortunately, the Buildbot developers have already built a solution to this: latent buildslaves. A latent buildslave is a virtual server that is provisioned on demand. That means that when no build is active, we don't have to pay for an extra server; we only pay for the compute time that we actually need for builds (plus a bit of storage space).

I chose to use AWS EC2 as the basis of our buildslave. Buildbot also supports OpenStack, so I could have just used DreamCompute, which we already use for our Kubernetes cluster, but with AWS EC2, we can take advantage of spot instances and save even more money if we needed to. In any event, the setup would have been pretty much the same.

Setting up a latent buildslave on AWS is pretty straightforward. First, create an EC2 instance in order to build a base image for the buildslave. I started with a Debian image. Then, install any necessary software for the buildslave. For us, that included the buildslave software itself (Debian package buildbot-slave), git, tinc, npm, and Docker. Most of our build process happens inside of Docker containers, so we don't need anything else. We use tinc to build a virtual network with our Kubernetes cluster, so that we can push Docker images to our own private Docker repository.
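The install step on the base image is essentially just the following (Debian package names for the software listed above; Docker itself comes from its upstream repository, and its package name varies by version):

apt-get update
apt-get install -y buildbot-slave git tinc npm
# plus Docker from its own apt repository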

After installing the necessary software, we need to configure it. It's configured just like a normal buildslave would be: I configured tinc, added an ssh key so that it could check out our source code, configured Docker so that it could push to our repository, and of course configured the Buildbot slave itself. Once it was configured, I cleaned up the image a bit (truncated logs, cleared the bash history, etc.), and then took a snapshot in the AWS control panel, giving it a name so that it would show up as an AMI.
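I took the snapshot in the AWS control panel, but the same step can be done with the AWS CLI; roughly this, where the instance ID is a placeholder:

aws ec2 create-image --instance-id i-0123456789abcdef0 --name buildslave-base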

Finally, I added the latent buildslave to our Buildbot master configuration, giving it the name of the AMI that was created. Once set up, it ran pretty much as expected. I pushed out a change, Buildbot master created a new EC2 instance, built our application server, pushed and deployed it to our Kubernetes cluster, and after a short delay (to make sure there were no other builds), deleted the EC2 instance. In all, the EC2 instance ran for about 20 minutes. Timings will vary, of course, but it will run for less than an hour. If we were paying full price for a t2.micro instance in us-east-1, each build would cost just over 1 cent. We also need to add in the storage cost for the AMI, which, given that I started with an 8GB image, will cost us at most 80 cents per month (since EBS snapshots don't store empty blocks, it should be less than that). We probably average about two builds a month, giving us an average monthly cost of at most 83 cents, which is not too bad.

June 22, 2016

Load balancing Kubernetes pods

10:10 -0400

At work, we recently switched from Tutum (now Docker Cloud) to Kubernetes. Part of that work was building up a load balancer. Kubernetes has built-in load balancing capabilities, but it only works with Google Compute or AWS, which we are not using. It also requires a public IP address for each service, which usually means extra (unnecessary) costs.

Having previously worked with Tutum's HAProxy image, I figured I could do the same thing with Kubernetes. A quick web search didn't turn up any existing project, so I quickly wrote my own. Basically, we have HAProxy handling all incoming HTTP(S) connections and passing them off to different services based on the Host header. There's a watcher that watches Kubernetes for relevant changes, such as new or deleted pods, and updates the HAProxy configuration so that it always sends requests to the right place. I also improved the setup by adding in a Varnish cache for some of our services. Here's how it all works.

We have two sets of pods: a set of HAProxy pods and a set of Varnish pods. Each pod has a Python process that watches etcd for Kubernetes changes, updates the appropriate (HAProxy or Varnish) configuration, and tells HAProxy/Varnish about the new configuration. Why do we watch etcd instead of using the Kubernetes API directly? Because, as far as I can tell, in the Kubernetes API, you can only watch one type of object (pods, configmaps, secrets, etc.) for changes, whereas we need to watch multiple types at once, so dealing with the Kubernetes API would mean making multiple simultaneous API requests, which would just make things more complicated.
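The actual watcher is a Python process, but conceptually it is doing no more than this, expressed here with the etcd v2 command-line client (Kubernetes keeps its state under /registry):

# watch for any change to pods or configmaps and regenerate the config
etcdctl watch --forever --recursive /registry/pods
etcdctl watch --forever --recursive /registry/configmaps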

Unlike Tutum's HAProxy image, which only allows you to change certain settings using environment variables, our entire configuration is generated from a Jinja2 template. This gives us a lot more flexibility, including being able to plug in Varnish fairly easily without having to make any code changes to the HAProxy configurator. Also, configuration variables for services are stored in their own ConfigMap, rather than as environment variables in the target pods, which allows us to make configuration changes without restarting the pods.

When combining HAProxy and Varnish, one question to ask is how to arrange them: HAProxy in front of Varnish, or Varnish in front of HAProxy? We are using a setup similar to the one recommended in the HAProxy blog. In that setup, HAProxy handles all requests and passes non-cacheable requests directly to the backend servers. Cacheable requests are, of course, passed to Varnish. If Varnish has a cache miss, then it passes the request back to HAProxy, which then hands off the request to the backend server. As the article points out, in the event of a cache miss, there are a lot of requests, but cache misses should be very infrequent since Varnish only sees cacheable content. One main difference between the setup we have and the one in the article is that in the article, HAProxy listens on two IP addresses: one for requests coming from the public, and one for requests coming from Varnish. In our setup, we don't have two IP addresses for HAProxy to use. Instead, Varnish adds a request header that indicates that the request is coming from it, and HAProxy checks for that header.

At first, I set the Python process as the pod's command (the pod's PID 1), but ran into a slight issue. HAProxy reloads its configuration by, well, not reloading its configuration; it starts a new set of processes with the new configuration, which meant that we ended up with a lot of zombie processes. To fix this, I could have changed the Python process to reap the zombies, but it was easier to just use Yelp's dumb-init instead.
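With dumb-init, the pod's command just wraps the watcher, along these lines (the watcher path is a placeholder):

# dumb-init runs as PID 1, forwards signals, and reaps the zombies left
# behind by HAProxy reloads
dumb-init python /opt/haproxy-config/watcher.py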

We have the HAProxy pods managed as a DaemonSet, so one runs on every node, and the pods are set to use host networking for better performance. HAProxy itself is small enough that, at least with our current traffic, it doesn't affect the nodes much, so it isn't a problem for us right now to run it on every node. If we get enough traffic that it does make a difference, we can dedicate a node to it without much problem. One thing about this setup is that, even though it uses Kubernetes, HAProxy and Varnish don't need to be managed by Kubernetes; they just need to be able to talk to etcd. So if we ever need a dedicated load balancer, we can spin up a node (or nodes) that just runs HAProxy and/or Varnish, say, using a DaemonSet and nodeSelector. Varnish is managed as a normal Kubernetes deployment and uses the normal container networking, so there's a bit of overhead there, but that's fine for now. Again, if we have more concerns about performance, we can change our configuration easily enough.

It all seems to be working fairly well so far. There are some configuration tweaks that I'll have to go make, and there's one strange issue where Varnish doesn't like one of our services and just returns an empty response. But other than that, Varnish and HAProxy are just doing what they're supposed to do.

All the code is available on GitHub (HAProxy, Varnish).

June 9, 2016

Kubernetes vs Docker Cloud

09:42 -0400

Note: this is not a comprehensive comparison of Kubernetes and Docker Cloud. It is just based on my own experiences. I am also using Tutum and Docker Cloud more or less interchangeably, since Tutum became Docker Cloud.

At work, we used to use Tutum for orchestrating our Docker containers for our Calculus practice problems site. While it was in beta, Tutum was free, but Tutum has now become Docker Cloud and costs about $15 per managed node per month, on top of server costs. Although we got three free nodes since we were Tutum beta testers, we still felt the pricing was a bit steep, since the management costs would be more than the hosting costs. Even more so since we would have needed more private Docker repositories than what would have been included.

So I started looking for self-hosted alternatives. The one I settled on was Kubernetes, which originated at Google. Obviously, if you go self-hosted, you need to have enough system administration knowledge to do it, whereas with Docker Cloud, you don't need to know anything about system administration. It's also a bit more time consuming to set up — it took me about a week to set up Kubernetes (though most of that time was scripting the process so that we could do it again more quickly next time), whereas with Tutum, it took less than a day to get up and running.

Kubernetes will require at least one server for itself — if you want to ensure high availability, you'll want to run multiple masters. We're running on top of CoreOS, and a 512MB node seems a bit tight for the master in our setup. A 1GB node was big enough that I allowed the master to schedule other pods as well, although that is not recommended.

Kubernetes seems to have a large-ish overhead on the worker nodes (a.k.a. minions). Running top shows the system processes taking up at least 200MB, which means that on a 512MB node, you'd only have about 300MB to run your own pods unless you have swap space. I have no idea what the overhead on a Tutum/Docker Cloud node was, since I didn't have access to check.

Previously, under Tutum, we were running on 5*512MB nodes, each of which had 512MB swap space. Currently, we're running on 3*1GB worker nodes plus 1*1GB master node (which also serves as a worker), with no swap. (We'll probably need to add another worker, or maybe another combined master/worker, in the near future; though under Tutum, with the changes that I'm planning, we would probably have needed another node anyway.) Since we also moved from DigitalOcean to DreamHost (affiliate link) and their new DreamCompute service (which just came out of Beta as we were looking into self-hosting), our new setup ended up costing $1 less per month.

Under Tutum, the only way to pass in configuration (other than baking it into your Docker image, or running your own configuration server) is through environment variables. With Kubernetes, you have more options, such as ConfigMaps and Secrets. That gives you more flexibility and (depending on your setup) allows configuration to be changed on the fly. For example, I created an auto-updating HAProxy configuration that allows you to specify a configuration template via a ConfigMap. When you update the ConfigMap, HAProxy gets reconfigured immediately with almost no downtime. This is in contrast to the Tutum equivalent, in which a configuration change (via environment variables) would require a restart and hence more downtime.
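Updating such a ConfigMap from a template file is a one-liner with a reasonably recent kubectl (the ConfigMap and file names here are illustrative); the watcher notices the change and reloads HAProxy:

kubectl create configmap haproxy-template --from-file=haproxy.cfg.tmpl \
    --dry-run -o yaml | kubectl apply -f -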

The other configuration methods also allow the configuration to be more decoupled. For example, with Tutum's HAProxy, the configuration for a service, such as its virtual host names, is specified using the target container's environment variables, which means that if you want to change the set of virtual hosts or the SSL certificate, you need to restart your application containers. Since our application server takes a little while to restart, we want to avoid having to do that. On the other hand, if the configuration were set in HAProxy's environment, then it would be lost to other services that might want to use it (such as monitoring software that might use the HTTP_CHECK variable). With a ConfigMap, however, the configuration does not need to belong to one side or the other; it can stand on its own, so it doesn't interfere with the application container, and it can be accessed by other pods.

Kubernetes can be configured entirely using YAML (or JSON) files, which means that everything can be version controlled. Under Tutum, things are primarily configured via the web interface, though they do have a command-line tool that you could use as well. However, the command-line tool uses a different syntax for creating versus updating, whereas with Kubernetes, you can just "kubectl apply -f". So even if you use the Tutum CLI and keep a script under version control for creating your services, it's easy to forget to change the script after you've changed a service.

There are a few things that Tutum does that Kubernetes doesn't do. For example, Tutum has built-in node management (if you use AWS, DigitalOcean, or one of the other providers that it is made to work with), whereas with Kubernetes, you're responsible for setting up your own nodes. There are apparently tools built on top of Kubernetes that do similar things, but I never really looked into them, since we currently don't need to bring up or take down nodes very frequently. Tutum also has more deployment strategies (such as "emptiest node" and "high availability"), which is not that important for us, but might be more important for others.

Based on my experience so far, Kubernetes seems to be a better fit for us. For people who are unable/unwilling to administer their own servers, Docker Cloud would definitely be the better choice, and starting with Tutum definitely gave me time to look around in the Docker ecosystem before diving into a self-hosted solution.

April 29, 2016

Let's encrypt errata

10:06 -0400

Back in February, I posted about Automatic Let's Encrypt certificates on nginx. One of the scripts had a problem in that it downloaded the Let's Encrypt X1 intermediate certificate. Let's Encrypt recently switched to using their X3 intermediate, which meant that Firefox was unable to reach sites using the generated certificates, and Chrome/IE/Safari needed to make an extra download to verify the certificate.

Of course, instead of just changing the script to download the X3 certificate, it's best to automatically download the right certificate. So I whipped up a quick Python script, cert-chain-resolver-py (inspired by the Go version) that checks a certificate and downloads the other certificates in the chain.

I've updated my original blog post. The changed script is /usr/local/sbin/letsencrypt-renew, and of course you'll need to install cert-chain-resolver-py (the script expects it to be in /opt/cert-chain-resolver-py).

February 18, 2016

Automating Let's Encrypt certificates on nginx

17:57 -0500

Let's Encrypt is a new Certificate Authority that provides free SSL certificates. It is intended to be automated, so that certificates are renewed automatically. We're using Let's Encrypt certificates for our set of free Calculus practice problems. Our front end is currently served by an Ubuntu server running nginx, and here's how we have it scripted on that machine. In a future post, I'll describe how it's automated on our Docker setup with HAProxy.

First of all, we're using acme-tiny instead of the official Let's Encrypt client, since it's much smaller and, IMHO, easier to use. It takes a bit more to set up, but works well once it's set up.

We installed acme-tiny in /opt/acme-tiny, and created a new letsencrypt user. The letsencrypt user is only used to run the acme-tiny client with reduced privileges. In theory, you could run the entire renewal process as a reduced-privilege user, but the rest of the process is just basic shell commands, and my paranoia level is not that high.

We then installed cert-chain-resolver-py into /opt/cert-chain-resolver-py. This script requires the pyOpenSSL library, so make sure that it's installed; on Debian/Ubuntu systems, it's the python-openssl package.

We created an /opt/acme-tiny/challenge directory, owned by the letsencrypt user, and we created /etc/acme-tiny with the following contents (a rough sketch of the corresponding commands follows the list):

  • account.key: the account key created in step 1 from the acme-tiny README. This file should be readable only by the letsencrypt user.
  • certs: a directory containing a subdirectory for each certificate that we want. Each subdirectory should have a domain.csr file, which is the certificate signing request created in step 2 from the acme-tiny README. The certs directory should be publicly readable, and the subdirectories should be writable by the user that the cron job will run as (which does not have to be the letsencrypt user).
  • private: a directory containing a subdirectory for each certificate that we want, like we had with the certs directory. Each subdirectory has a file named privkey.pem, which will be the private key associated with the certificate. To coincide with the common setup on Debian systems, the private directory should be readable only by the ssl-cert group.
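None of this is exotic; on a Debian-ish system, the layout above boils down to something like the following (a sketch, not the exact commands we ran; sbscalculus.com is the certificate used later in this post):

# the reduced-privilege user and the challenge directory
adduser --system --group letsencrypt
mkdir -p /opt/acme-tiny/challenge
chown letsencrypt: /opt/acme-tiny/challenge

# per-certificate directories under /etc/acme-tiny
mkdir -p /etc/acme-tiny/certs/sbscalculus.com /etc/acme-tiny/private/sbscalculus.com

# account key (step 1 of the acme-tiny README), readable only by letsencrypt
openssl genrsa 4096 > /etc/acme-tiny/account.key
chown letsencrypt: /etc/acme-tiny/account.key
chmod 400 /etc/acme-tiny/account.key

# a private key for the certificate, readable only by the ssl-cert group
openssl genrsa 4096 > /etc/acme-tiny/private/sbscalculus.com/privkey.pem
chgrp -R ssl-cert /etc/acme-tiny/private
chmod -R o-rwx /etc/acme-tiny/private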

Instead of creating the CSR files as described in the acme-tiny README, I created a script called gen_csr.sh:

#!/bin/bash
openssl req -new -sha256 -key /etc/acme-tiny/private/"$1"/privkey.pem -subj "/" -reqexts SAN -config <(cat /etc/ssl/openssl.cnf <(printf "[SAN]\nsubjectAltName=DNS:") <(cat /etc/acme-tiny/certs/"$1"/domains | sed "s/\\s*,\\s*/,DNS:/g")) > /etc/acme-tiny/certs/"$1"/domain.csr

The script is invoked as gen_csr.sh <name>. It reads a file named /etc/acme-tiny/certs/<name>/domains, which is a text file containing a comma-separated list of domains, and it writes the /etc/acme-tiny/certs/<name>/domain.csr file.

Now we need to configure nginx to serve the challenge files. We created a /etc/nginx/snippets/acme-tiny.conf file with the following contents:

location /.well-known/acme-challenge/ {
    auth_basic off;
    alias /opt/acme-tiny/challenge/;
}

(The "auth_basic off;" line is needed because some of our virtual hosts on that server use basic HTTP authentication.) We then modify the sites in /etc/nginx/sites-enabled that we want to use Let's Encrypt certificates to include the line "include snippets/acme-tiny.conf;".

After this is set up, we created a /usr/local/sbin/letsencrypt-renew script that will be used to request a new certificate:

#!/bin/sh
set +e

# only renew if certificate will expire within 20 days (=1728000 seconds)
openssl x509 -checkend 1728000 -in /etc/acme-tiny/certs/"$1"/cert.pem && exit 255

set -e
DATE=`date +%FT%R`
su letsencrypt -s /bin/sh -c "python /opt/acme-tiny/acme_tiny.py --account-key /etc/acme-tiny/account.key --csr /etc/acme-tiny/certs/\"$1\"/domain.csr --acme-dir /opt/acme-tiny/challenge/" > /etc/acme-tiny/certs/"$1"/cert-"$DATE".pem
ln -sf cert-"$DATE".pem /etc/acme-tiny/certs/"$1"/cert.pem
python /opt/cert-chain-resolver-py/cert-chain-resolver.py -o /etc/acme-tiny/certs/"$1"/chain-"$DATE".pem -i /etc/acme-tiny/certs/"$1"/cert.pem -n 1
ln -sf chain-"$DATE".pem /etc/acme-tiny/certs/"$1"/chain.pem
cat /etc/acme-tiny/certs/"$1"/cert-"$DATE".pem /etc/acme-tiny/certs/"$1"/chain-"$DATE".pem > /etc/acme-tiny/certs/"$1"/fullchain-"$DATE".pem
ln -sf fullchain-"$DATE".pem /etc/acme-tiny/certs/"$1"/fullchain.pem

The script will only request a new certificate if the current certificate will expire within 20 days. The certificates are stored in /etc/acme-tiny/certs/<name>/cert-<date>.pem (symlinked to /etc/acme-tiny/certs/<name>/cert.pem). The full chain (including the intermediate CA certificate) is stored in /etc/acme-tiny/certs/<name>/fullchain-<date>.pem (symlinked to /etc/acme-tiny/certs/<name>/fullchain.pem).

If you have pyOpenSSL version 0.15 or greater, you can replace the -n 1 option for cert-chain-resolver.py with something like -t /etc/ssl/certs/ca-certificates.crt, where /etc/ssl/certs/ca-certificates.crt should be set to the location of a set of trusted CA certificates in PEM format.

As-is, the script must be run as root, since it does a su to the letsencrypt user. It should be trivial to modify it to use sudo instead, so that it can be run by any user that has the appropriate permissions on /etc/acme-tiny.

The letsencrypt-renew script is run by another script that will restart the necessary servers if needed. For us, the script looks like this:

#!/bin/sh

letsencrypt-renew sbscalculus.com

RV=$?

set -e

if [ $RV -eq 255 ] ; then
  # renewal not needed
  exit 0
elif [ $RV -eq 0 ] ; then
  # restart servers
  service nginx reload;
else
  exit $RV;
fi

This is then called by a cron job of the form chronic /usr/local/sbin/letsencrypt-renew-and-restart. Chronic is a script from the moreutils package that runs a command and only passes through its output if it fails. Since the renewal script checks whether the certificate will expire, we run the cron task daily.
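Concretely, the cron job is just an entry like this (the schedule is arbitrary, since the script itself decides whether a renewal is needed):

# e.g. /etc/cron.d/letsencrypt-renew
30 4 * * * root chronic /usr/local/sbin/letsencrypt-renew-and-restart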

Of course, once you have the certificate, you want to tell nginx to use it. We have another file in /etc/nginx/snippets that, aside from setting various SSL parameters, includes

ssl_certificate /etc/acme-tiny/certs/sbscalculus.com/fullchain.pem;
ssl_certificate_key /etc/acme-tiny/private/sbscalculus.com/privkey.pem;

This is the setup we use for one of our servers. I tried to make it fairly general, and it should be fairly easy to modify for other setups.

Update (Apr. 29, 2016): Let's Encrypt changed their intermediate certificate, so the old instructions for downloading the intermediate certificate are incorrect. Instead of using a static location for the intermediate certificate, it's best to use a tool such as https://github.com/muchlearning/cert-chain-resolver-py to fetch the correct intermediate certificate. The instructions have been updated accordingly.

February 3, 2016
14:03 -0500
Hubert Chathi: The fastest query is the one you don't do. Sped up the response time of some @sbscalculus.com pages from about 3s to about 0.3s by not fetching some irrelevant data.
January 27, 2016

Automating browser-side unit tests with nodeunit and PhantomJS

11:24 -0500

I love unit tests, but they're only useful if they get run. For one of my projects at work, I have a set of server-side unit tests, and a set of browser-side unit tests. The server-side unit tests get run automatically on "git push" via Buildbot, but the browser-side tests haven't been run for a long time because they don't work in Firefox, which is my primary browser, due to differences in the way it iterates through object keys.

Of course, automation would help, in the same way that automating the server-side tests ensured that they were run regularly. Enter PhantomJS, which is a scriptable headless WebKit environment. Unfortunately, even though PhantomJS can support many different testing frameworks, there is no existing support for nodeunit, which is the testing framework that I'm using in this particular project. Fortunately, it isn't hard to script support for nodeunit.

nodeunit's built-in browser support just dynamically builds a web page with the test results and a test summary. If we just ran it as-is in PhantomJS, it would happily run the tests for us, but we wouldn't be able to see the results, and it would just sit there doing nothing when it was done. What we want is for the test results to be output to the console, and to exit when the tests are done (and exit with an error code if tests failed). To do this, we will create a custom nodeunit reporter that will communicate with PhantomJS.

First, let's deal with the PhantomJS side. Our custom nodeunit reporter will use console.log to print the test results, so we will pass through console messages in PhantomJS.

page.onConsoleMessage = function (msg) {
    console.log(msg);
};

We will use PhantomJS's callback functionality to signal the end of the tests. The callback data will just be an object containing the total number of assertions, the number of failed assertions, and the time taken.

page.onCallback = function (data) {
    if (data.failures)
    {
        console.log("FAILURES: " + data.failures + "/" + data.length + " assertions failed (" + data.duration + "ms)")
    }
    else
    {
        console.log("OK: " + data.length + " assertions (" + data.duration + "ms)");
    }
    phantom.exit(data.failures);
};

(Warning: the callback API is marked as experimental, so may be subject to change.)

If the test page fails to load for whatever reason, PhantomJS will just sit there doing nothing, which is not desirable behaviour, so we will exit with an error if something fails.

phantom.onError = function (msg, trace) {
    console.log("ERROR:", msg);
    for (var i = 0; i < trace.length; i++)
    {
        var t = trace[i];
        console.log(i, (t.file || t.sourceURL) + ': ' + t.line + (t.function ? ' ' + t.function : ""));
    }
    phantom.exit(1);
};
page.onError = function (msg, trace) {
    console.log("ERROR:", msg);
    for (var i = 0; i < trace.length; i++)
    {
        var t = trace[i];
        console.log(i, (t.file || t.sourceURL) + ': ' + t.line + (t.function ? ' ' + t.function : ""));
    }
    phantom.exit(1);
};
page.onLoadFinished = function (status) {
    if (status !== "success")
    {
        console.log("ERROR: page failed to load");
        phantom.exit(1);
    }
};
page.onResourceError = function (resourceError) {
    console.log("ERROR: failed to load " + resourceError.url + ": " + resourceError.errorString + " (" + resourceError.errorCode + ")");
    phantom.exit(1);
};

Now for the nodeunit side. The normal test page looks like this:

<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <title>ML Editor Test Suite</title>
    <link rel="stylesheet" href="stylesheets/nodeunit.css" type="text/css" />
    <script src="javascripts/module-requirejs.js" type="text/javascript"></script>
    <script src="javascripts/requirejs-config.js" type="text/javascript"></script>
    <script data-main="test" src="javascripts/require.js" type="text/javascript"></script>
  </head>
  <body>
    <h1 id="nodeunit-header">ML Editor Test Suite</h1>
  </body>
</html>

If you're not familiar with RequireJS pages, the <script data-main="test" src="javascripts/require.js" type="text/javascript"></script> line means that the main JavaScript file is called "test.js". We want to use the same script file for both a normal browser test and the PhantomJS-based test, so in PhantomJS, we will set window.nodeunit_reporter to our custom reporter. In "test.js", then, we will check for window.nodeunit_reporter, and if it is present, we will replace nodeunit's default reporter. Although there's no documented way of changing the reporter in the browser version of nodeunit, looking at the code, it's pretty easy to do.

if (window.nodeunit_reporter) {
    nodeunit.reporter = nodeunit_reporter;
    nodeunit.run = nodeunit_reporter.run;
}

(Disclaimer: since this uses an undocumented interface, it may break some time in the future.)

So what does a nodeunit reporter look like? It's just an object with two items: info (which is just a textual description) and run. run is a function that calls the nodeunit runner with a set of callbacks. I based the reporter off of a combination of nodeunit's default console reporter and its browser reporter.

window.nodeunit_reporter = {
    info: "PhantomJS-based test reporter",
    run: function (modules, options) {
        var opts = {
            moduleStart: function (name) {
                console.log("\n" + name);
            },
            testDone: function (name, assertions) {
                if (!assertions.failures())
                {
                    console.log("✔ " + name);
                }
                else
                {
                    console.log("✖ " + name);
                    assertions.forEach(function (a) {
                        if (a.failed()) {
                            console.log(a.message || a.method || "no message");
                            console.log(a.error.stack || a.error);
                        }
                    });
                }
            },
            done: function (assertions) {
                window.callPhantom({failures: assertions.failures(), duration: assertions.duration, length: assertions.length});
            }
        };
        nodeunit.runModules(modules, opts);
    }
};

Now in PhantomJS, I just need to get it to load a modified test page that sets window.nodeunit_reporter before loading "test.js", and voilà, I have browser tests running on the console. All that I need to do now is to add it to my Buildbot configuration, and then I will be alerted whenever I break a browser test.

The script may or may not work in SlimerJS, allowing the tests to be run in a Gecko-based rendering engine, but I have not tried it since, as I said before, my tests don't work in Firefox. One main difference, though, is that SlimerJS doesn't honour the exit code, so Buildbot would need to parse the output to determine whether the tests passed or failed.

January 18, 2016

When native code is slower than interpreted code

16:56 -0500

At work, I'm working on a document editor, and it needs to be able to read in HTML data. Well, that's simple, right? We're in a browser, which obviously is able to parse HTML, so just offload the HTML parsing to the browser, and then traverse the DOM tree that it creates.

var container = document.createElement("div");
container.innerHTML = html;

The browser's parser is native code, built to be robust, well tested. What could go wrong?

Unfortunately, going this route, it ended up taking about 70 seconds to parse a not-very-big document on my 4 year old laptop. 70 seconds. Not good.

Switching to a JavaScript-based HTML parser saw the parsing time drop down to about 9 seconds. Further code optimizations in other places brought it down to about 3 seconds. Not too bad.

So why is the JavaScript parser faster than the browser's native parser? Without digging into what the browser is actually doing, my best guess is that the browser isn't just parsing the HTML, but is also calculating styles, layouts, etc. This guess seems to be supported by the fact that not all HTML is parsed slowly; some other HTML of similar size is parsed very quickly (faster than using the JavaScript-based parser). But it can't be the whole story, because the browser is able to display that same HTML fairly quickly.

I may have to do some further investigations, but I guess the moral of the story is to not assume that offloading work is the fastest solution.

December 18, 2015
11:35 -0500
Hubert Chathi: Out of context comment of the day: "do nothing, since we don't care about users"
November 19, 2015

My development-to-production workflow so far

12:04 -0500

As a follow-up to my previous post, I've fleshed out my CI/(almost-)CD workflow a bit more. I write "(almost-)CD", because I've decided that I don't really want deployment to production to be completely automatic; I want to be able to manually mark a build as ready for production, at least for now.

When we last left off, I had Buildbot watching our git repository, and when it detected a change, it ran unit tests, and if the tests passed, updated the code on our VPS and triggered a reload. Since then, I've added email notifications (for all failures, and for success on the final step), and we've switched over to a Docker-based deployment. Here's what the process looks like now:

Buildbot still watches our git repository. When it detects a change, it checks out the code, and runs a docker build on a remote Docker instance to build a new image of the application; the buildbot slave is still running on our VPS, which does not support Docker, so we need to run Docker on a separate host. I could have also created a new buildbot slave on a Docker-capable host, but that seems like it would have been more work.

The new image is tagged with the git commit hash, as well as the "staging" tag. Next Buildbot runs the unit tests on the image itself. If the tests all pass, it pushes the image to our Docker repository on Tutum with the "staging" tag. Tutum watches for changes in the Docker repository, and when a new "staging" image is pushed, it redeploys our staging server. In the meantime, Buildbot sends me an email telling me that the build has passed.
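Stripped of the Buildbot plumbing, the Docker side of this is roughly the following (the remote host and repository names are placeholders, not our real ones):

# point the Docker client at the remote, Docker-capable host
export DOCKER_HOST=tcp://docker-build-host:2376 DOCKER_TLS_VERIFY=1

COMMIT=$(git rev-parse HEAD)
docker build -t registry.example.com/ourapp:$COMMIT .
# (Buildbot runs the unit tests against this image before going any further)
docker tag registry.example.com/ourapp:$COMMIT registry.example.com/ourapp:staging
docker push registry.example.com/ourapp:staging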

Up to this point, everything since the git push has been automatic. After I get the email from Buildbot, I do a quick sanity check on our staging server, just in case the unit tests missed anything, and if all goes well, I re-push the Docker image, but this time with the "latest" tag. Again, Tutum will notice a new "latest" image, and this time redeploys our production server.
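The manual promotion step is equally small; once I'm happy with the staging site, it's essentially just this (same placeholder names as above):

docker tag registry.example.com/ourapp:staging registry.example.com/ourapp:latest
docker push registry.example.com/ourapp:latest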

There are a couple of things I really like about this setup. First of all, the number of manual steps involved is minimal; the only thing I do after pushing the code is checking the staging site and re-pushing the image. Everything else is done automatically, which means that there's less chance of me forgetting to do something. Secondly, by using Docker images, I'm sure that the staging and production environments are exactly the same (or at least as close as possible).

One downside is that redeploying an image means there's a slight amount of downtime. This can be solved by using Blue/Green deployment instead of production/staging, but it's a bit more complicated to set up. This will probably be the next thing for me to look into.

September 17, 2015

simple process respawning

16:29 -0400

File this under "I can't believe how long it took me to figure it out".

I'm Dockerizing some of our services at work. One of them (by design) kills itself after handling a number of requests. Of course, I want it to restart after it kills itself. Most solutions seem like overkill. Using a service supervisor like runit is great, but requires too much setup for monitoring just a single process. Forever is probably a good option, but I don't want to have to install Node.js in the image just to monitor it. Not to mention, Node.js has a non-trivial memory footprint.

Basically, I want something small and simple. No extra dependencies, minimal extra setup, minimal extra resource usage. After too much time looking for a solution, I came up with a 5-line shell script:

#!/bin/sh
while :
do
    "[email protected]"
done

Name it forever.sh and put it in your PATH, and use it as: forever.sh <command> (e.g. forever.sh server -p 8080). It's just an infinite loop that executes its arguments until it gets killed.

August 26, 2015

Limiting concurrency in Node.js with promises

21:41 -0400

The nice thing about Node.js is its asynchronous execution model, which means that it can handle many requests very quickly. The flip side of this is that it can also generate many requests very quickly, which is fine if they can then be handled quickly, and not so good when they can't. For one application that I'm working on, some of the work gets offloaded to an external process; a new process is created for each request. (Unfortunately, that's the architecture that I'm stuck with for now.) And when doing a batch operation, Node.js will happily spawn hundreds of processes at once, without caring that doing so will cause everything on my workstation to slow to a crawl.

Limiting concurrency in Node.js has been written about elsewhere, but I'd like to share my promise-based version of this solution. In particular, this was built for the bluebird flavour of promises.

Suppose that we have a function f that performs some task, and returns a promise that is fulfilled when that task is completed. We want to ensure that we don't have too many instances of f running at the same time.

We need to keep track of how many instances are currently running, we need a queue of instances when we've reached our limit, and of course we need to define what our limit is.

var queue = [];
var numRunning = 0;
var max = 10;

Our queue will just contain functions that, when called, will call f with the appropriate arguments, as well as perform the record keeping necessary for calling f. So to process the queue, we just check whether we are below our run limit, check whether the queue is non-empty, and run the function at the front of the queue.

function runnext()
{
    numRunning--;
    if (numRunning < max && queue.length)
    {
        queue.shift()();
    }
}

Now we create a wrapper function f1 that will limit the concurrency of f. We will call f with the same arguments that f1 is called with. If we have already reached our limit, we queue the request; otherwise, we run f immediately. When we run f, whether immediately or in the future, we must first increment our counter. After f is done, we process the next element in the queue. We must process the queue whether f succeeds or not, and we don't want to change the resolution of f's promise, so we tack a finally onto the promise returned by f.

function f1 ()
{
    var args = Array.prototype.slice.call(arguments);
    return new Promise(function (resolve, reject) {
        function run() {
            numRunning++;
            resolve(f.apply(undefined, args)
                    .finally(runnext));
        }
        if (numRunning >= max)
        {
            queue.push(run);
        }
        else
        {
            run();
        }
    });
}

Of course, if you need to do this a lot, you may want to wrap this all up in a higher-order function. For example:

function limit(f, max)
{
    var queue = [];
    var numRunning = 0;

    function runnext()
    {
        numRunning--;
        if (numRunning < max && queue.length)
        {
            queue.shift()();
        }
    }

    return function ()
    {
        var args = Array.prototype.slice.call(arguments);
        return new Promise(function (resolve, reject) {
            function run() {
                numRunning++;
                resolve(f.apply(undefined, args)
                        .finally(runnext));
            }
            if (numRunning >= max)
            {
                queue.push(run);
            }
            else
            {
                run();
            }
        });
    };
}

This would be used as:

f = limit(f, 10);

which would replace f with a new function that is equivalent to f, except that only 10 instances will be running at a time.

May 24, 2015

First steps in CI/CD with Buildbot

15:59 -0400

At work, I've been looking into Continuous Integration and Continuous Delivery/Deployment. So far, our procedures have been mostly manual, which means that some things take longer than necessary, and sometimes things get missed. The more that can be automated, the less developer time has to be spent on mundane tasks, and the less brain power needed to remember all the steps.

There are many CI solutions out there, and after investigating a bunch of them, I settled on using Buildbot for a few reasons:

  • it can manage multiple codebases for the same project, unlike many of the simpler CI tools. This is important since the back end for the next iteration of our product is based on plugins that live in individual git repositories.
  • it is lightweight enough to run on our low-powered VPS.
  • it has a flexible configuration language (its configuration file is Python code) and is easily extendable.

Right now, we're in development mode for our product, and I want to make sure that our development test site is always running the latest available code. That means combining plugins together, running unit tests, and if everything checks out, deploying. Eventually, my hope is to be able to tag a branch and have our production site update automatically.

The setup

Our code has one main tree, with plugins each in their own directory within a special plugins directory. The development test site should track the master branch on the main tree and all relevant plugins. For ease of deployment (especially for new development environments), we want to use git submodules to pull in all the relevant plugins. However, the master branch will be the basis of all deployments, which may use different plugins, or different versions of plugins, and so should not have any plugins specified in itself. Instead, we have one branch for each deployed version, which includes as submodules the plugins that are used for that build.

The builds

Managing git submodules can be a bit of a pain. Especially since we're not developing on the branch that actually contains the submodules, managing them would require switching branches, pulling the correct versions of the submodules, and pushing.

The first step in automation, then, is to automatically update the deployment branch whenever either a plugin or the main tree is updated. Buildbot has a list of the plugins that are used in a deployment branch, along with the branch that it follows. Each plugin is associated with a repository, and we use the codebase setting in Buildbot to keep the plugins separate. We then have a scheduler listen on the appropriate codebases and trigger a build whenever any pushes are made. A Buildbot slave then checks out the latest version of the deployment branch, merges in the changes to the main tree and the submodules, and then pushes out a new version of the deployment branch.
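The merge that the slave performs is plain git; a rough sketch, with illustrative branch names rather than our real ones:

git checkout deploy-test
git merge --no-edit origin/master
# point each submodule at the latest commit on the branch it tracks
git submodule update --init --remote
git commit -am "Update plugins"
git push origin deploy-test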

Naturally, pushes to plugins and to the main tree are generally grouped. For example, changes in one plugin may require, or enable, changes in other plugins. We don't want a separate commit in our deployment branch for each change in each plugin, so we take advantage of Buildbot's ability to merge changes. We also wait for the git repositories to be stable for two minutes before running the build, to make sure that all the changes are caught. This reduces the number of commits we have in our deployment branch, making things a bit cleaner.

When Buildbot pushes out a new version of the deployment branch, this in turn triggers another build in Buildbot. Buildbot checks out the full sources, including submodules, installs the required node modules, installs configuration files for testing, and then runs the unit tests. If the tests all pass, then this triggers yet another build.

The final build checks out the latest sources into the web application directory for the test site, and then notifies the web server (currently using Passenger) to restart the web application.
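Since Passenger restarts an application when its tmp/restart.txt is touched, that final step boils down to something like this (the deployment path is a placeholder):

# update the working copy that Passenger serves, then ask Passenger to restart
cd /var/www/testsite && git pull && git submodule update --init
touch /var/www/testsite/tmp/restart.txt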

Next steps

This setup seems to be working fairly well so far, but it isn't complete yet. Being a first attempt, I'm sure there are many improvements that can be made to the setup, both in terms of flexibility and performance. Especially since we only have one site being updated, the configuration works fine for now, but it can probably be made more general to make it easier to deploy multiple sites.

One major issue in the current setup, though, is the lack of notifications. Currently, in order to check the state of the build, I need to view Buildbot's web UI, which is inconvenient. Buildbot has email notification built in, but I just haven't had the chance to set it up yet. When I do set it up, I will likely set it to notify on any failure, as well as whenever a deployment is made.

I'd also like to get XMPP notifications, which isn't built into Buildbot, and so is something that I would have to write myself. Buildbot is based on Twisted, which has an XMPP module built in, so it should be doable. I think the XMPP module is a bit outdated, but we don't need any fancy functionality, so hopefully it will work well enough.

I'm looking into using Docker for deployments once we're ready to push this project to production, so I'll need to look into creating build steps for Docker. The VPS that we're currently using for Buildbot is OpenVZ-based, and so does not support Docker, so we'd need to put a Buildbot slave on a Docker-capable host for building and testing the Docker images, or even use a Docker container as a Buildbot slave, which would be even better.

There's probably a lot that can be done to improve the output in the UI too. For example, when the unit tests are run, it only reports whether the tests passed or failed. It should be possible to create a custom build step that will report how many tests failed.

Assessment

Although Buildbot seems like the best fit for our setup, it isn't perfect. The main thing that I'd like is better project support. Buildbot allows you to set projects on change sets, but I'd like to be able to set projects on builds as well, in order to filter by projects in the waterfall view.

All in all, Buildbot seems like a worthwhile tool that is flexible, yet easy enough to configure. It's no-nonsense and just does what it claims to do. The documentation is well done, and for simple projects, you should be able to just dive right in without any issues. For more complex projects, it's helpful to understand what's going on before charging right in. Of course, I just charged right in without understanding certain concepts, so I had to redo some stuff to make it work better, but the fact that I was able to actually get it to work in the first place, even doing it the wrong way, gives some indication of its power.

April 15, 2015

Switching to nginx

09:17 -0400

I think that I've been running lighttpd for almost as long as I've had a VPS, but I've recently decided to switch to nginx.

The main reason that I've decided to switch is that lighttpd no longer seems to be actively developed. They still do bug fix releases, but aside from that, development seems to have stalled. They have been working on their 1.5 branch for years, without marking it as stable. In fact, they even started working on a 2.0 branch without first releasing 1.5, which was a warning sign that development was losing focus.

Nginx has some weirdnesses and unexpected design decisions of its own, though.

One feature that I will miss from lighttpd is its ability to automatically split SCRIPT_NAME and PATH_INFO based on what files are actually on the filesystem. I depend on that feature in my own CMS, which means I'll have to implement it myself, which is slightly inconvenient, but not too big of a deal.

I slightly prefer the lighttpd configuration file format, but that could be just a matter of what I'm used to.

Switching to nginx means that I'll be able to try out Passenger, which seems like a very interesting application server.

I've already switched my dev machine. Next I'll switch our home server, and once I have the CMS changes done, I'll switch my VPS.

December 30, 2013

Random testing

17:55 -0500

My current project at work requires implementing non-trivial data structures and algorithms, and despite my best efforts (unit testing consisting of over 600 assertions), I don't have everything unit tested. In order to find bugs in my code, I've created a randomized tester.

First of all, the code is structured so that all operations are decoupled from the interface, which means that it can be scripted; anything that a user can do from the interface can also be done programmatically. Of course, this is a requirement for any testable code.

I want to make sure that the code is tested in a variety of scenarios, but without having to create the tests manually. So I let the computer generate it (pseudo)randomly. Basically, my test starts with a document (which, for now, is hard-coded). The program then creates a series of random operations to apply to the document: it randomly selects a type of operation, and then randomly generates an operation of that type. It then runs some tests on the resulting document, and checks for errors.

Most of the time, when doing random things, you don't want things to be repeatable; if you write a program to generate coin flips, you don't want the same results every time you run the program. In this case, however, I need to be able to re-run the exact same tests over and over; if the tests find a bug, I need to be able to recreate the conditions leading to the bug, so that I can find the cause. Unfortunately, JavaScript's default random number generator (unlike many other programming languages) is automatically seeded, and provides no way of setting the seed. That isn't a major problem, though — we just need to use an alternate random number generator. In this case, I used an implementation of the Mersenne Twister. Now, I just hard-code the seed, and every time I run the tester, I get the same results. And if I want a different test, I just need to change the seed.

It seems to be working well so far. I've managed to squish some bugs, uncover some logic errors, and, of course, some silly errors too. Of course, the main downside is that I can't be sure that the random tests cover all possible scenarios, but the sheer number of tests that are generated far exceeds what I would be able to reasonably do by hand, and my hand-written tests weren't covering all possible scenarios anyways.

Addendum: I should add that when the randomized tester finds a bug, I try to distill it to a minimal test case and create a new unit test based on the randomized tester result.

April 22, 2013

Thoughts on literate programming

12:50 -0400

At work, I've been implementing a data structure to make our collaborative editor run quickly. As part of that work, I've had to write a couple of complex functions (a couple of 200+ line functions), which got me thinking about comments, readability, and presentation.

If you've never heard of literate programming, it's an idea introduced by Knuth (surely you've heard of him) that combines programming with documentation intended for human consumption. The program is presented in a document written for people to read, and transformed by a program into something a computer can execute. (The Wikipedia article on literate programming gives a decent description.)

I've dabbled a bit with literate programming in the past. In fact, I'm the maintainer for the noweb package in Debian. One of my (very) long-term projects is to build a free data structure library written for people to learn how the data structures work, and I've started implementing a couple simple data structures in literate programming style. However, looking at literate programming again, it seems to me that it has a few deep limitations.

First of all, if you want to describe something in depth, you're forcing everyone to read it, even if they aren't interested. For example, in the wc example, “#include <stdio.h>” takes 3 lines, even though anyone who has read an introductory C programming book will know immediately why that's there. On the other hand, you might want to include that for beginner programmers. One of the frustrating things I found when writing research papers was that I often had to go into too much detail, to make sure that every single step was covered, which I felt sometimes turned a short, simple proof into something unwieldy. What I would have liked to do was something like Leslie Lamport's (of LaTeX fame) hierarchical proofs (though it doesn't translate well to printed text, and needs a more dynamic medium like a web page).

This limitation is partially due to the time that literate programming was conceived. With printed text, either you write something and everyone sees it (even if they just skim it, it's still there for them to see), or you omit it and nobody sees it. With something like a web page, however, you don't have this limitation. You can write “#include <stdio.h>”, and hide the descriptive text unless the reader wants to learn more.

Another limitation that I find with literate programming is that one of its underlying implications is that code is a lesser way of communicating between people, and that people communicate best using natural language. Each code chunk is intended to be described in words. While natural language is the best tool for general human communication, a small chunk of well-written code, like well-written mathematical notation, can be very effective in communicating certain ideas. Literate programming would encourage you to write the chunk twice, once as code and once in natural language, even if the code is a sufficient (or even sometimes better) way of communicating the idea. Going back to the stdio.h example, just writing “#include <stdio.h> // we send formatted output to stdout and stderr” would be a sufficient description for most programmers.

Related to this, literate programming pulls code chunks out of their context, which is sometimes an important part of understanding how the code works. Seeing the code in context gives clues about what state the computer is in before it is executed, and what is expected after it executes. Of course you can always describe that in text, but seeing the code in context sometimes gives experienced programmers a more intuitive feel for how the code works.

One thing that I like about literate programming, though, is that it emphasizes understanding over a line-by-line presentation. For example, if you have two chunks of code that operate on the same data (say, one reads and the other writes), or two chunks that operate similarly, then you can present them together, instead of having them spread out according to how the computer would execute them. It also allows you to deal with the more important or interesting parts first and leave the more mundane parts for later (I would have put “#include <stdio.h>” near the end of the document).

It is also useful to have document-writing tools at your disposal, such as sectioning, lists, mathematical equations, and beautifully formatted text (and not having to make sure that your lines are wrapped properly).

While I think that literate programming is a great idea for presenting code in an understandable manner, I think that it has a lot of room for improvement, especially if we can take advantage of some of the features of the web. I'm doing some experimentation, and I hope to have some positive results.

0 Comments
April 1, 2013

Useless metrics

15:47 -0400

Just for fun, I decided to run David A. Wheeler's SLOCCount on my current work project. Here is the output (with the default options, slightly cleaned up):

SLOC	Directory	SLOC-by-Language (Sorted)
10656   mleditor        js=10656
2299    util            js=2299

Totals grouped by language (dominant language first):
js:           12955 (100.00%)

Total Physical Source Lines of Code (SLOC)                = 12,955
Development Effort Estimate, Person-Years (Person-Months) = 2.95 (35.34)
 (Basic COCOMO model, Person-Months = 2.4 * (KSLOC**1.05))
Schedule Estimate, Years (Months)                         = 0.81 (9.69)
 (Basic COCOMO model, Months = 2.5 * (person-months**0.38))
Estimated Average Number of Developers (Effort/Schedule)  = 3.65
Total Estimated Cost to Develop                           = $ 397,833
 (average salary = $56,286/year, overhead = 2.40).
SLOCCount, Copyright (C) 2001-2004 David A. Wheeler
SLOCCount is Open Source Software/Free Software, licensed under the GNU GPL.
SLOCCount comes with ABSOLUTELY NO WARRANTY, and you are welcome to
redistribute it under certain conditions as specified by the GNU GPL license;
see the documentation for details.
Please credit this data as "generated using David A. Wheeler's 'SLOCCount'."

Note: This includes some, but not all, unit tests. I had to modify SLOCCount to support JavaScript — I just used the C parser.

I started working on the project in October, so I've spent about six months on it. According to the COCOMO model, then, I've produced almost $400,000 worth of work (at 2004 wages) in six months.
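
For the curious, here is a quick back-of-the-envelope check of where those numbers come from, using the Basic COCOMO formulas and constants printed in the output above (the small difference from the printed cost is just rounding):

    // Basic COCOMO, using the constants from the SLOCCount output above
    const sloc = 12955;
    const ksloc = sloc / 1000;

    const personMonths = 2.4 * Math.pow(ksloc, 1.05);          // ≈ 35.3
    const scheduleMonths = 2.5 * Math.pow(personMonths, 0.38); // ≈ 9.7
    const developers = personMonths / scheduleMonths;          // ≈ 3.65

    const salary = 56286;   // average yearly salary assumed by SLOCCount
    const overhead = 2.40;
    const cost = (personMonths / 12) * salary * overhead;      // ≈ $398,000

    console.log(personMonths, scheduleMonths, developers, cost);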

I think I need a raise. ;-)

(P.S. If you're lucky enough, you'll get the Bill Gates quote in the random quote section on the right-hand side of this page.)

0 Comments
February 16, 2013

Wave, drawing, and what not to do

22:36 -0500

A few months ago, I wrote a blog post about Wave, in which I said that Wave wouldn't be my first choice as a protocol for collaborative vector graphics. Here is an expansion on that statement.

Obviously, when designing something, you want to avoid reinventing things that you don't need to. The Wave protocol operates on documents that have a similar model to XML, or at least the most common parts of XML. SVG is an XML-based format for vector graphics. So a temptation would be to slap Wave on top of SVG to do collaborative drawing. Here are some reasons why that wouldn't be the best idea.

Note that we will be taking a simplified view of SVG, so some of my statements may not be completely accurate if you want to nitpick. However, the ideas behind them should still be valid.

Locking

First of all, I've always thought that a collaborative drawing protocol should include some sort of locking; it would probably be confusing if two users tried to drag the same object at the same time. So that eliminates the possibility of using a stock Wave implementation, but it shouldn't be too hard to add locking on top of the Wave protocol.
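
Just to make the idea concrete, here is a rough sketch of what advisory locking layered on top of the document might look like; the names and structure are purely illustrative and not part of Wave:

    // Illustrative only: record the lock holder as an attribute on the object
    // being dragged. In a real system this would be an attribute operation on
    // the document, with the server rejecting conflicting acquisitions.
    function tryLock(obj, userId) {
      if (obj.attrs.lockedBy && obj.attrs.lockedBy !== userId) {
        return false;               // someone else is already dragging it
      }
      obj.attrs.lockedBy = userId;
      return true;
    }

    function unlock(obj, userId) {
      if (obj.attrs.lockedBy === userId) {
        delete obj.attrs.lockedBy;
      }
    }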

Rendering order

In SVG, objects are rendered based (more or less) on their order in the document tree. That is, objects that appear earlier in the document are rendered first. Now consider what happens when someone tries to change the order of the objects (for example, moving an object to the back or to the front). The only way to do this with the Wave protocol is to delete the object from its current position in the document, and re-insert it in its new position.

However, if another user is modifying the object at the same time, since the Wave server has no way of knowing that the deletion and insertion represent the same object, when the server resolves the conflict, the unmodified object will be re-inserted, losing the second user's changes. In addition, if two users try to change the rendering order of the same object at the same time, the object could get re-inserted twice.

We have a similar issue with object grouping, but we will only look at object ordering.

How do we fix this? One way might be to change the document model: we could use an attribute to store the object ordering, rather than relying on the document ordering. If we use a simple integer sequence (0, 1, 2, …) for the ordering attribute, then changing an object's order may mean changing most of these attributes, so if multiple users reorder objects at the same time, a naive server may not be able to resolve the conflicts while still keeping the attributes a simple integer sequence. We could instead use decimal numbers (e.g. to move an object between the objects ordered 3 and 4, we give it an order of 3.5), but then we may end up with the ugliness of extremely long decimal numbers. That could be solved by a watcher that periodically renumbers the objects, as long as it doesn't try to renumber them while a user is also reordering them. A rough sketch of this approach is below.
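
(The function names and object shapes here are just illustrative, not from any real API.)

    // Illustrative sketch: a fractional "order" attribute instead of document
    // order. Lower order values are rendered first, i.e. further towards the back.

    // Compute an order value that places an object between two neighbours.
    function orderBetween(before, after) {
      if (before === undefined) return after - 1;  // send to the back
      if (after === undefined) return before + 1;  // bring to the front
      return (before + after) / 2;                 // e.g. between 3 and 4 -> 3.5
    }

    // A watcher could periodically renumber the objects to keep the values
    // short, as long as nobody is reordering at the same time.
    function renumber(objects) {
      objects
        .slice()
        .sort((a, b) => a.order - b.order)
        .forEach((obj, i) => { obj.order = i; });
    }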

Another way is to change the protocol: add an operation for reordering objects. Of course, adding operations means more work figuring out how each new operation fits in with the existing ones, so the fewer operations we need to add, the better. For object ordering, we can probably get by with just one operation that specifies an object and a delta in its rendering order.
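
As a sketch, such an operation might look something like this; the shape is hypothetical, since Wave has no such operation:

    // Hypothetical reorder operation: identify an object and give a delta in
    // its rendering order.
    const op = {
      type: "reorder",
      objectId: "path7",  // illustrative identifier
      delta: -2,          // move two steps towards the back
    };

    // Two concurrent reorders of the same object can then be reconciled by
    // adding their deltas, rather than conflicting.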

Another option is to just say that these types of conflicts should happen rarely enough that we don't care about them, and use Wave and SVG unchanged. This is certainly a valid option, as long as the users are prepared for it (or as long as you are prepared to deal with the users). It can be argued that textual documents have a similar issue when users move text from one area to another, but it is probably less of a problem there, since not all editors include a "move" operation, and even when it is included, it is not commonly used. Instead, users usually "copy-and-paste", which arguably makes this type of conflict less confusing.

Object nodes

Now consider the actual description of an object. Let's just look at the <path> element. The nodes of a path are represented in its d attribute, which consists of a number of commands indicating how the cursor moves. If the nodes of a path are changed, the corresponding Wave operation is to replace the entire contents of the d attribute. If multiple users try to change the same object at the same time, the server has no way of resolving the conflict, unless it runs a diff on the attribute value, and even then the result might not be reliable.

One way to fix this is to change the document model: instead of using a single attribute to store the path, we could use sub-elements to represent the nodes. This would allow individual nodes to be modified independently, as well as inserting and deleting nodes without conflict.
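
As a purely illustrative sketch of the difference between the two models (written as plain JavaScript objects rather than real SVG):

    // SVG-style: the whole path is one attribute value, so any edit to a node
    // replaces the entire attribute.
    const pathAsAttribute = {
      tag: "path",
      attrs: { d: "M 10 10 L 50 10 L 50 50 Z" },
    };

    // Alternative document model: one child element per node, so individual
    // nodes can be modified, inserted, or deleted without touching the others.
    const pathAsNodes = {
      tag: "path",
      children: [
        { tag: "node", attrs: { cmd: "M", x: 10, y: 10 } },
        { tag: "node", attrs: { cmd: "L", x: 50, y: 10 } },
        { tag: "node", attrs: { cmd: "L", x: 50, y: 50 } },
        { tag: "node", attrs: { cmd: "Z" } },
      ],
    };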

Another issue is that in SVG, the path data for an object is expressed in the document's coordinate system. That means that if a user moves an object, every node's coordinates change, so if another user is modifying an individual node at the same time, the modifications will conflict.

This can be fixed by preventing users from modifying an object's nodes while the object is being moved (and vice versa); this is probably a rare enough occurrence that users would not notice it. Another option is to give the path a "position" attribute and make the node coordinates relative to that position. (In fact, SVG does allow transformations to change the coordinate system for objects, so we could require that each object gets its own coordinate system.)
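
Continuing the illustrative model from above, the object's position could live in a single attribute, with node coordinates relative to it:

    // Moving the object only touches the transform attribute; the node
    // coordinates stay the same, so concurrent node edits don't conflict.
    const pathWithOwnCoordinates = {
      tag: "path",
      attrs: { transform: "translate(100, 100)" },
      children: [
        { tag: "node", attrs: { cmd: "M", x: 0, y: 0 } },
        { tag: "node", attrs: { cmd: "L", x: 40, y: 0 } },
        { tag: "node", attrs: { cmd: "L", x: 40, y: 40 } },
      ],
    };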

Summary

Now I should clarify something: if you slapped Wave on top of SVG, then you would still get a system where every user's copy of the document is synchronized, and all editing conflicts would be resolved. However, the conflicts might not be resolved in a way that makes sense for the users.

In general, there are two options for resolving these issues: change the document model, or change the operations. One option may be more appropriate than the other in different circumstances.

I should also add that not using SVG within the Wave document doesn't mean that you can't base your editor on SVG; depending on how you have modified the document model, it should be possible to translate between the two formats.

So how would I do collaborative drawing? Well, maybe that will be a topic for a future blog post.

0 Comments