May 10, 2013

CloudFlare

14:51 -0400

I've started trying CloudFlare as a CDN for my web site. They have a nice free plan that works well for personal and smaller sites. So far it seems to be going well. Not that I was having any problems with my hosting, but it's a fun thing to try. And it's nice to have some security in knowing that my site won't blow up if I accidentally post something wildly popular and get Slashdotted. And in the case of a server failure, as happened yesterday, my site is still (somewhat) accessible.

As a brief description of how CloudFlare works, you host your site as normal, and visitors will connect to CloudFlare's servers rather than your own. If CloudFlare has a cached version of your page (that is still current), then the visitor gets sent the cached copy; otherwise it queries your server for the contents.
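To make the caching behaviour above concrete, here is a minimal sketch in Python. It is not CloudFlare's implementation: the EdgeCache class, the fixed 60-second freshness window, and the origin callable are all assumptions made up for illustration.

```python
import time


class EdgeCache:
    """Toy model of a caching proxy sitting in front of an origin server."""

    def __init__(self, fetch_from_origin, ttl_seconds=60.0, clock=time.monotonic):
        self.fetch_from_origin = fetch_from_origin  # hypothetical callable: url -> body
        self.ttl_seconds = ttl_seconds              # assumed freshness window
        self.clock = clock
        self._store = {}                            # url -> (body, time it was fetched)

    def get(self, url):
        now = self.clock()
        cached = self._store.get(url)
        if cached is not None and now - cached[1] < self.ttl_seconds:
            return cached[0]                        # still current: origin is not contacted
        body = self.fetch_from_origin(url)          # missing or stale: ask the origin
        self._store[url] = (body, now)
        return body


origin_hits = []


def origin(url):
    """Stand-in for my own web server; records every request it receives."""
    origin_hits.append(url)
    return "contents of " + url


cache = EdgeCache(origin, ttl_seconds=60.0)
first = cache.get("/index.html")
second = cache.get("/index.html")   # served from the cache
```

After the two requests, origin_hits contains a single entry, which is the whole point: the second visitor never touches my server.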

The Good

One nice thing about CloudFlare is that it includes IPv6 support, even in the free plan, which is great for when your host doesn't have IPv6 support. Of course, I still don't have IPv6 ssh, email or Jabber, but it's a start.

CloudFlare also doesn't charge for bandwidth. If your site gets Slashdotted, you don't have to worry about increased charges (at least, not from CloudFlare — you might still exceed your host's bandwidth allowance). You just pay your plan's rates (or not pay, in the case of the free plan).

Redundancy. I'll say it again: redundancy. If my host should ever become unavailable for whatever reason (yesterday, it was hardware failure), my site (at least whatever CloudFlare has cached) is still accessible. If CloudFlare ever goes down for an extended period of time, I can change the DNS records so that visitors go directly to my server. (Of course, due to propagation delays, that would have to be quite an extended CloudFlare outage.)

CloudFlare is also a very transparent company. They had a network failure not too long ago, and they were very upfront about what happened. No matter what you do, something will always go wrong at some point. It's how you react to things going wrong that differentiates companies. Even when things don't go wrong, CloudFlare shares a lot of their technical details on their blog.

The Bad

Unfortunately, there are some limitations. CloudFlare takes over your DNS service. This is necessary for how CloudFlare operates: they must be able to return different DNS responses depending on the visitor's location. But this means that you must use CloudFlare's DNS editing interface, which isn't as flexible as editing a zone file by hand. I had to give up my CSA and RP records. Given that almost nobody uses CSA, and my contact information is easily found on my site, it isn't a great loss.

Another limitation is that the free plan doesn't include SSL support. This is perfectly reasonable, given that it's a free plan; you need to give people a reason to pay. But it's something to be aware of. I only really needed SSL for my OpenID service, so I just put it on a different host name, and set it to not be handled by CloudFlare.

The Questionable

CloudFlare is a very popular CDN, which means that if it should ever go down, it would take down a lot of sites. It would also be a popular target for attack. As more sites use CloudFlare, it's becoming an Internet monoculture. So if anyone knows of a similar service, let me know and I'll give it a try.

All CloudFlare plans include analytics. Unfortunately, their analytics is very basic; it only indicates how many visitors (regular, potential threat, web crawler) you got over certain periods of time. Threats are broken down by country, but regular visitors are not. The statistics are also not broken down by URL, browser, etc. Of course, some analytics is better than the no analytics that I had before. However, since not all web requests reach your server, and you don't have access to the raw logs on their server (unless you have a Business plan), you can't run your own (reliable) log-based analytics, should you ever find the time to set it up. Of course, you can use your own (or a 3rd-party) JavaScript-based analytics system, but it isn't as accurate.

All in all, CloudFlare seems like a good service, and the price is right. I'll keep it on trial for a bit longer, but it seems like I'll be keeping it.

0 Comments
April 22, 2013

Thoughts on literate programming

12:50 -0400

At work, I've been implementing a data structure to make our collaborative editor run quickly. As part of that work, I've had to write a couple of complex functions (a couple 200+ line functions), which got me thinking about comments, readability, and presentation.

If you've never heard of literate programming, it's an idea introduced by Knuth (surely you've heard of him) that combines programming with documentation intended for human consumption. The program is presented in a document written for people to read, and transformed by a program into something a computer can execute. (The Wikipedia article on literate programming gives a decent description.)

I've dabbled a bit with literate programming in the past. In fact, I'm the maintainer for the noweb package in Debian. One of my (very) long-term projects is to build a free data structure library written for people to learn how the data structures work, and I've started implementing a couple simple data structures in literate programming style. However, looking at literate programming again, it seems to me that it has a few deep limitations.

First of all, if you want to describe something in depth, you're forcing everyone to read it, even if they aren't interested. For example, in the wc example, “#include <stdio.h>” takes 3 lines, even though anyone who has read an introductory C programming book will know immediately why that's there. On the other hand, you might want to include that for beginner programmers. One of the frustrating things I found when writing research papers was that I often had to go into too much detail, to make sure that every single step was covered, which I felt sometimes turned a short, simple proof into something unwieldy. What I would have liked to do was something like Leslie Lamport's (of LaTeX fame) hierarchical proofs (though it doesn't translate well to printed text, and needs a more dynamic medium like a web page).

This limitation is partially due to the time that literate programming was conceived. With printed text, either you write something and everyone sees it (even if they just skim it, it's still there for them to see), or you omit it and nobody sees it. With something like a web page, however, you don't have this limitation. You can write “#include <stdio.h>”, and hide the descriptive text unless the reader wants to learn more.

Another limitation that I find with literate programming is that one of its underlying implications is that code is a lesser way of communicating between people, and that people communicate best using natural language. Each code chunk is intended to be described in words. While natural language is the best tool for general human communication, a small chunk of well-written code, like well-written mathematical notation, can be very effective in communicating certain ideas. Literate programming would encourage you to write the chunk twice, once in code and once in natural language, even if the code is a sufficient (or even sometimes better) way of communicating the idea. Going back to the stdio.h example, just writing “#include <stdio.h> // we send formatted output to stdout and stderr” would be a sufficient description for most programmers.

Related to this, literate programming pulls code chunks out of context, which sometimes is an important part in understanding how the code works. Seeing the code in context gives clues about what state the computer is in before it is executed, and what is expected after it executes. Of course you can always describe that in text, but seeing the code in context sometimes gives experienced programmers a more intuitive feel for how the code works.

One thing that I like about literate programming, though, is that it emphasizes understanding over a line-by-line presentation. For example, if you have two chunks of code that operate on the same data (say one reads and the other writes), or if you have two chunks that operate similarly, then you would write those together, instead of having them spread out according to how the computer would execute them. It also allows you to deal with more important or interesting parts first, and leave the more mundane parts for later (I would have put “#include <stdio.h>” near the end of the document).

It is also useful to have at your disposal some of the document-writing tools, such as sectioning, lists, mathematical equations, and beautifully formatted text (and not having to make sure that your lines are wrapped properly).

While I think that literate programming is a great idea for presenting code in an understandable manner, I think that it has a lot of room for improvement, especially if we can take advantage of some of the features of the web. I'm doing some experimentation, and I hope to have some positive results.

0 Comments
April 1, 2013

Useless metrics

15:47 -0400

Just for fun, I decided to run David A. Wheeler's SLOCCount on my current work project. Here is the output (with the default options, slightly cleaned up):

SLOC	Directory	SLOC-by-Language (Sorted)
10656   mleditor        js=10656
2299    util            js=2299

Totals grouped by language (dominant language first):
js:           12955 (100.00%)

Total Physical Source Lines of Code (SLOC)                = 12,955
Development Effort Estimate, Person-Years (Person-Months) = 2.95 (35.34)
 (Basic COCOMO model, Person-Months = 2.4 * (KSLOC**1.05))
Schedule Estimate, Years (Months)                         = 0.81 (9.69)
 (Basic COCOMO model, Months = 2.5 * (person-months**0.38))
Estimated Average Number of Developers (Effort/Schedule)  = 3.65
Total Estimated Cost to Develop                           = $ 397,833
 (average salary = $56,286/year, overhead = 2.40).
SLOCCount, Copyright (C) 2001-2004 David A. Wheeler
SLOCCount is Open Source Software/Free Software, licensed under the GNU GPL.
SLOCCount comes with ABSOLUTELY NO WARRANTY, and you are welcome to
redistribute it under certain conditions as specified by the GNU GPL license;
see the documentation for details.
Please credit this data as "generated using David A. Wheeler's 'SLOCCount'."

Note: This includes some, but not all, unit tests. I had to modify SLOCCount to support JavaScript — I just used the C parser.

I started working on the project in October, so I've spent 6 months on it. So according to the COCOMO model, I've produced almost $400,000 worth of work (at 2004 wages) in 6 months.
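For anyone who wants to check the arithmetic, the formulas printed in the SLOCCount output above are easy to reproduce. This is a small Python sketch using only the constants shown in that output; the function name basic_cocomo is my own, and this is not SLOCCount's actual code.

```python
def basic_cocomo(sloc, salary=56286.0, overhead=2.40):
    """Apply the Basic COCOMO formulas exactly as printed in the SLOCCount output."""
    ksloc = sloc / 1000.0
    person_months = 2.4 * ksloc ** 1.05              # development effort
    schedule_months = 2.5 * person_months ** 0.38    # calendar time
    developers = person_months / schedule_months     # average team size
    cost = (person_months / 12.0) * salary * overhead
    return person_months, schedule_months, developers, cost


effort, schedule, developers, cost = basic_cocomo(12955)
print("effort (person-months): %.2f" % effort)
print("schedule (months):      %.2f" % schedule)
print("average developers:     %.2f" % developers)
print("estimated cost:         $%.0f" % cost)
```

The results should agree with the figures in the report (35.34 person-months, 9.69 months, 3.65 developers, roughly $397,833), up to rounding.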

I think I need a raise. ;)

(P.S. If you're lucky enough, you'll get the Bill Gates quote in the random quote section on the right-hand side of this page.)

0 Comments
February 16, 2013

Wave, drawing, and what not to do

22:36 -0500

A few months ago, I wrote a blog post about Wave, in which I said that Wave wouldn't be my first choice as a protocol for collaborative vector graphics. Here is an expansion on that statement.

Obviously, when designing something, you want to avoid reinventing things that you don't need to. The Wave protocol operates on documents that have a similar model to XML, or at least the most common parts of XML. SVG is an XML-based format for vector graphics. So a temptation would be to slap Wave on top of SVG to do collaborative drawing. Here are some reasons why that wouldn't be the best idea.

Note that we will be taking a simplified view of SVG, so some of my statements may not be completely accurate, if you want to nitpick. However, the ideas behind the statements should still be valid.

Locking

First of all, I've always thought that a collaborative drawing protocol should include some sort of locking; it would probably be confusing if two users tried to drag the same object at the same time. So that eliminates the possibility of using a stock Wave implementation, but it shouldn't be too hard to add locking on top of the Wave protocol.

Rendering order

In SVG, objects are rendered based (more or less) on their order in the document tree. That is, objects that appear earlier in the document are rendered first. Now consider what happens when someone tries to change the order of the objects (for example, moving an object to the back or to the front). The only way to do this with the Wave protocol is to delete the object from its current position in the document, and re-insert it in its new position.

However, if another user is modifying the object at the same time, since the Wave server has no way of knowing that the deletion and insertion represent the same object, when the server resolves the conflict, the unmodified object will be re-inserted, losing the second user's changes. In addition, if two users try to change the rendering order of the same object at the same time, the object could get re-inserted twice.
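The lost-update problem just described can be shown with a toy model in Python. This is only a sketch under my own assumptions: real Wave operations address positions in the document and are transformed against each other, whereas this Doc class (entirely hypothetical) addresses nodes by a server-assigned id. The effect on the second user's edit is the same.

```python
import itertools

_next_id = itertools.count(1)


class Doc:
    """Toy document: a list of (node_id, attributes) in rendering order."""

    def __init__(self):
        self.nodes = []

    def insert(self, index, attrs):
        node_id = next(_next_id)            # every inserted node is brand new to the server
        self.nodes.insert(index, (node_id, dict(attrs)))
        return node_id

    def delete(self, node_id):
        self.nodes = [n for n in self.nodes if n[0] != node_id]

    def set_attr(self, node_id, key, value):
        for nid, attrs in self.nodes:
            if nid == node_id:
                attrs[key] = value
                return True
        return False                        # target is gone, so the edit is dropped


doc = Doc()
circle = doc.insert(0, {"shape": "circle", "fill": "red"})
square = doc.insert(1, {"shape": "square", "fill": "blue"})

# User A brings the circle to the front: delete it, then re-insert A's copy of it.
a_copy = {"shape": "circle", "fill": "red"}
doc.delete(circle)
doc.insert(len(doc.nodes), a_copy)

# User B recoloured the circle at the same time; the edit targets the deleted node.
b_applied = doc.set_attr(circle, "fill", "green")

final_fills = [attrs["fill"] for _, attrs in doc.nodes]
```

Here b_applied ends up False and the circle is still red: the server had no way to tell that the re-inserted node was "the same" circle that user B was editing.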

We have a similar issue with object grouping, but we will only look at object ordering.

How do we fix this? One way might be to change the document model: we could use an attribute to store the object ordering, rather than using the document ordering. If we try to just use a simple integer sequence (that is, 0, 1, 2,…) for the object ordering attribute, then changing object ordering may result in most of these attributes changing, and so if multiple users try to change ordering at the same time, the server (if it is naive) may not be able to resolve the conflicts while still maintaining the property that the attributes are a simple integer sequence. We could try using decimal numbers (e.g. to move an object between the objects ordered 3 and 4, we give the object an order of 3.5), but then we may get the ugliness of extremely long decimal numbers. This could be solved by having a watcher that periodically renumbers the objects, as long as it doesn't try to renumber the objects while a user is also reordering them.
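As a rough sketch of the fractional-ordering idea, here it is in Python. The function names key_between and renumber are made up for this example, and I use exact fractions rather than decimal strings so that the midpoint is always representable; locking against concurrent reordering is deliberately left out.

```python
from fractions import Fraction


def key_between(lo, hi):
    """Return an ordering key strictly between two existing keys."""
    return (Fraction(lo) + Fraction(hi)) / 2


def renumber(order):
    """The 'watcher': reassign keys 0, 1, 2, ... while keeping the relative order."""
    ranked = sorted(order, key=order.get)
    return {name: Fraction(i) for i, name in enumerate(ranked)}


# Two objects with ordering attributes 3 and 4.
order = {"a": Fraction(3), "b": Fraction(4)}

# Move "c" between them, then "d" between "a" and "c". Only the moved
# object's attribute changes each time; the keys just keep getting longer.
order["c"] = key_between(order["a"], order["b"])   # 7/2, i.e. 3.5
order["d"] = key_between(order["a"], order["c"])   # 13/4, i.e. 3.25

sorted_names = sorted(order, key=order.get)
compact = renumber(order)
```

The appeal is that a reorder touches a single attribute, so two users moving different objects don't conflict. The cost is the renumbering pass, which itself rewrites most of the attributes.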

Another way is to change the protocol: add an operation for reordering the objects. Of course, adding operations means that we need to do more work figuring out how it fits in with the others. So the fewer operations that we need to add, the better. For object ordering, we can probably get by with just one operation, specifying an object, and a delta in its rendering order.

Another option is to just say that these types of conflicts should happen rarely enough that we don't care about them, and just use Wave and SVG unchanged. This is certainly a valid option, as long as the users are prepared for this (or as long as you are prepared to deal with the users). It can be argued that textual documents have a similar issue when users move text from one area to another, but it is probably less of an issue with textual documents since not all editors include a "move" operation, and even if it is included, it is not commonly used. Instead, users usually "copy-and-paste", which arguably makes this type of conflict less confusing.

Object nodes

Now consider the actual description of an object. Let's just look at the <path> element. The nodes of a path are represented in its d attribute, which consists of a number of commands, indicating how the cursor moves. If the nodes of a path are changed, then the corresponding Wave operation is to replace the entire contents of the d attribute. If multiple users try to change the same object at the same time, then the server has no way of resolving the conflict, unless it runs a diff on the attribute value, and even then, it might not be reliable.

One way to fix this is to change the document model: instead of using a single attribute to store the path, we could use sub-elements to represent the nodes. This would allow individual nodes to be modified independently, as well as inserting and deleting nodes without conflict.
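Here is a sketch of what that changed document model might look like, in Python. The <node> sub-element and its cmd, x and y attributes are invented for this example (they are not part of SVG), and the parser only handles a simplified path made of space-separated absolute M and L commands.

```python
import xml.etree.ElementTree as ET


def path_to_nodes(d):
    """Turn a simplified path string ('M x y L x y ...') into one child element per node."""
    tokens = d.split()
    path = ET.Element("path")
    for i in range(0, len(tokens), 3):
        cmd, x, y = tokens[i:i + 3]
        ET.SubElement(path, "node", {"cmd": cmd, "x": x, "y": y})  # hypothetical element
    return path


path = path_to_nodes("M 0 0 L 10 0 L 10 10")

# Two users drag different nodes at the same time. Each edit is an attribute
# change on a different element, so neither replaces the other's work.
path[1].set("x", "12")   # user A moves the second node
path[2].set("y", "15")   # user B moves the third node

coords = [(n.get("x"), n.get("y")) for n in path]
print(ET.tostring(path, encoding="unicode"))
```

With the original single d attribute, both of those edits would have been "replace the whole attribute value", and one of them would have had to lose.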

Another issue is that in SVG, the path data for an object is relative to the document. That means that if a user moves an object, then every node gets changed, and so if another user is modifying an individual node, then the modifications will conflict.

This can be fixed by disallowing users from modifying an object's nodes while the object is being moved (and vice versa); this is probably a rare-enough occurrence that users would not notice it. Another option is to specify a "position" for the path, and have the node positions relative to the position. (In fact, SVG does allow for transformations, to change the coordinate system for objects, so we could enforce that each object gets its own coordinate system.)

Summary

Now I should clarify something: if you slapped Wave on top of SVG, then you would still get a system where every user's copy of the document is synchronized, and all editing conflicts would be resolved. However, the conflicts might not be resolved in a way that makes sense for the users.

In general, there are two options for resolving these issues: change the document model, or change the operations. One option may be more appropriate than the other in different circumstances.

I should also add that just because you don't use SVG within the Wave document doesn't mean that you can't base your editor on SVG; depending on how you have modified the document model, it should be possible to translate between the two formats.

So how would I do collaborative drawing? Well, maybe that will be a topic for a future blog post.

0 Comments
January 19, 2013

Languages for web applications

22:03 -0500

A question was asked at work quite a while ago: if you had to write a web application (server-side), and could use any language you wanted, what language would you use?

I think that the answer depends on a lot of factors, but here are some of my thoughts about various languages:

C/C++

Fast, fast, fast. As long as it doesn't crash. If speed was your greatest concern, then this would probably be the answer. But being a lower-level language, you need to do more things on your own. As they say, it gives you enough rope to hang yourself with, so make sure you have really good people working on it. But with some other languages featuring JIT compilers, the speed advantage may not be as great as it used to be. And almost any language can call C libraries (and some languages can make good use of C++ libraries too), so you can code performance-critical sections in C/C++, and use a higher-level language for other parts. Personally, I would skip C and just use C++. Even if you don't use all the C++ features, C++ can smooth over some of C's nastier bits (like strings).

Perl

If C/C++ gives you enough rope to hang yourself with, Perl gives you enough rope to make the most horrid knot that nobody (not even yourself) can untangle. The language has wonderful features designed to make programs easier to read, especially if you know the idioms. Ironically, the result is that most Perl programs are undecipherable. I'm sure things have improved in recent versions, but the last time I used Perl seriously, object-oriented programming was, shall we say, odd. I would say that Perl has its uses, and it's extremely handy for those situations, so it's worth having at least some knowledge of it. And sometimes I wish that other languages had some (keyword: "some") of Perl's syntax. But aside from that, for most web applications, I would look elsewhere.

PHP

No. Just, no. The only reason to use PHP ever is that you're working on something that someone else wrote. Or your web host doesn't support anything else. For any serious web application written from scratch, PHP would not even be a consideration. The only reason I even mention it here is because of its popularity.

Python

Great language features, and lots of libraries. But sometimes it eats a lot of memory, and can be slow. Efforts such as PyPy and Unladen Swallow try to address the slowness. In contrast to Perl's "There Is More Than One Way To Do It", Python generally says that there's only one way to do it. So ironically, Python offers a multitude of ways of writing web applications, from Django, to gevent, to web.py, to TurboGears, to Twisted, etc. Of the lot, my current pick would be gevent, or maybe Tornado. Because blocking I/O is stupid.

JavaScript

JavaScript is generally associated with browser-side code, but several frameworks (Node.js being the most popular) bring it to the server side. The good thing about JavaScript is that everyone and their monkey are working on making it fast. The bad thing about it is that it is not a very well-designed language (though it isn't terrible, and it has some redeeming qualities). One notable deficiency is the lack of a built-in true map/dictionary/hash container. JavaScript is certainly a consideration, but if I had my way, I would prefer using JavaScript as an intermediate language, and code the bulk of it in something like CoffeeScript or Amber. One big advantage of JavaScript is that if you have a complex data model and/or functions that are shared between the browser and server, such as the current project that I'm working on at work, you only have to write it once.

Haskell

Fascinating language that I'd like an excuse to learn. Tempting as it is, though, it's probably not practical yet for me to learn it just for a project, when other languages are "good enough".

Java

Java eats memory for breakfast, lunch, and supper. And afternoon tea. And elevensies. And dinner. And then it complains that it doesn't have enough. I've said that Java takes the worst parts of C++ and combines them with the worst parts of Smalltalk. I would not willingly use Java for a code-from-scratch project. One of the other languages that runs on the JVM might be a consideration (such as Scala), but the JVM itself would be a disadvantage.

Lua

Extremely small and light language. And with a JIT compiler available, it can be fast too. Unfortunately, it doesn't have as much library support, and lacks some big features (like Unicode support). It's worth considering, but in the end, it probably wouldn't make the cut.

C#

I don't have any experience with it, but I've heard that it's actually a not-too-terribly-designed language. But with poor Linux support, that will probably remain the extent of my knowledge of the language.

Summary

So, basically, my finalists would be C++, Python, and JavaScript (mostly as an intermediate language), depending on the application. For most projects, Python would probably be first choice, followed by JavaScript, and then C++ for those situations where you just need as much speed as you can get. I'd also love to learn more Haskell to see how it compares with the others, and Lua is a worthwhile consideration (especially if you happen to need to integrate with a good Jabber server).

0 Comments
January 10, 2013

Refactoring

17:30 -0500

The kind of commits that I like: 143 insertions (about 40 of which are comments), vs. 211 deletions (not counting the unit tests), which means about halving the complexity of that section of code.

0 Comments
January 4, 2013

Experiences with the Buffalo LinkStation Pro Duo

18:58 -0500

Recently, one of the hard drives in our home server died. Annoyingly, it was the newest drive, and the drive that had our data on it, which wasn't backed up properly, which meant that it needed to be sent off for recovery, but that's a different story.

Rather than just replace the drive, I opted to replace the whole machine, which was getting pretty old, and I was considering replacing in the next couple of years anyways. After a bit of research, I decided on the Buffalo LinkStation Pro Duo, which is a dual-drive NAS, and is relatively (see below) hackable. I got the dual-drive version, which I set up in a RAID1 configuration — this should let us survive another hard-drive crash. The pro version has a faster processor, and more RAM (don't expect to do anything more than basic file serving with the non-pro version). It basically has the same amount of RAM, and a similar clock speed as our old server.

Of course, with devices where hacking is not officially supported, hacking can be a moving target — the community-supplied instructions didn't work fully, and I had to try some other methods. Also, it looks like nobody has successfully built a recent kernel, which means that we're stuck with whatever Buffalo gives us. Still, it's mostly good enough; I got a full Debian install running on it. (Though I managed to muck up the install at one point, and had to mount the drives in our server to fix it up. If you plan on hacking this, make sure that you have the capability to mount the drives in another machine — you'll probably want to be able to mount both drives at once, since the default configuration has the boot and root partitions on RAID1 volumes.)

In general, I'm happy with the device; it mostly does what I want it to: I can run a web server, database, file server, share our (non-network) printer, share our scanner, etc.

One major complaint that I have is that the Buffalo kernel doesn't include IPv6 support. Really, there is absolutely no reason to not support IPv6, and given that you can't compile your own kernel, this means that it's a showstopper for anyone who needs IPv6.

It also has limited filesystem support. This would kind of make sense, if it only had to access its own drives, but given that it has a USB port that is meant to accept external drives, this can be a limitation for anyone who has a drive that isn't formatted with the most common filesystems. In other words, don't expect to use your Btrfs, JFS, or even HFS+ drive with it. (Of course, this also means that you can't reformat the drive with one of those filesystems.) And again, since you can't compile your own kernel, if you need it, this box isn't for you.

I have yet to try putting any real load on the system, so I can't say how well it performs other than as a basic file server, which it seems to do a good enough job of doing. Once our old drive gets fully recovered and copied over, I'll probably start installing more software, and trying to do more stuff on it.

0 Comments
December 23, 2012

Merry Christmas

21:25 -0500

(Created using Inkscape.)

0 Comments
November 22, 2012

Collaboration, Operational Transformation, Wave

15:37 -0500

At work, we are building a real-time collaborative editor. If you've ever used Google Docs with multiple people working on the same document at the same time, that's the sort of thing we're trying to do. I don't think I'm being too bold in saying that real-time communication and collaboration will soon go from "killer feature" to "feature that people will assume that you have and are frustrated if you don't". Like WYSIWYG for word processors.

The major issue with collaborative editing is synchronization: making sure that everyone sees the same thing. In thinking about synchronization, it is important to not just consider whether everyone's copy of the document is the same, but also that the document makes sense. For example, a text-based protocol is not suitable for XML-like data, and XML is a bad way of storing text formatting. Consider two users editing: "The quick brown fox jumped over the lazy dog". One user makes "quick brown" bold, and another user makes "brown fox" italics. Using a naive XML method, you would get "The <b>quick <i>brown</b> fox</i> ...", which is invalid XML.
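The bold/italics example can be reproduced with a few lines of Python. The "naive merge" here is my own stand-in for a text-based protocol that simply applies both users' tag insertions at the character offsets they computed against the original sentence; it is not how any particular editor works.

```python
import xml.etree.ElementTree as ET

text = "The quick brown fox jumped over the lazy dog"


def merge_insertions(s, insertions):
    """Apply (offset, markup) insertions, highest offset first so earlier offsets stay valid."""
    for offset, markup in sorted(insertions, key=lambda item: item[0], reverse=True):
        s = s[:offset] + markup + s[offset:]
    return s


def is_well_formed(fragment):
    """Check whether the fragment parses as XML when wrapped in a root element."""
    try:
        ET.fromstring("<p>" + fragment + "</p>")
        return True
    except ET.ParseError:
        return False


bold = [(4, "<b>"), (15, "</b>")]      # user 1: "quick brown"
italic = [(10, "<i>"), (19, "</i>")]   # user 2: "brown fox"

merged = merge_insertions(text, bold + italic)
print(merged)
print(is_well_formed(merge_insertions(text, bold)))   # one user's edit alone is fine
print(is_well_formed(merged))                         # both together are not
```

Each edit is valid on its own; it is only the combination that produces overlapping tags, which is why the synchronization layer needs to know something about the structure it is merging.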

For collaborating on textual documents, the Wave Protocol is certainly appropriate, but it isn't appropriate for all things. For example, it wouldn't be my first choice to use for vector graphics (consider: how do you move an object forward or backwards in the drawing stack?). Even tables can cause problems unless your server understands them, has some way of cleaning them up, or you come up with a clever way of representing them. Say we have a 2x2 table:

<table>
  <tr>
    <td></td><td></td>
  </tr>
  <tr>
    <td></td><td></td>
  </tr>
</table>

One user adds a row (which adds another <tr><td></td><td></td></tr> at the end), while another user adds a column (which adds a <td></td> to each <tr>). If both edits happen at the same time, the result will be:

<table>
  <tr>
    <td></td><td></td><td></td>
  </tr>
  <tr>
    <td></td><td></td><td></td>
  </tr>
  <tr>
    <td></td><td></td>
  </tr>
</table>

That is, the first two rows will have three columns, and the third row will have two columns. There are several ways of dealing with this, but you need to know about the potential problem before you can address it.
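The same thing can be simulated in Python, along with one possible clean-up. The operation names and the pad_rows repair are assumptions for illustration: padding short rows is just one of the several ways of dealing with the problem, and not necessarily the right one for every application.

```python
# The 2x2 table, as a list of rows of (empty) cells.
table = [["", ""], ["", ""]]

# Both users computed their low-level operations against the 2x2 table.
add_row = [("append_row", ["", ""])]                    # user 1: one new two-cell row
add_column = [("append_cell", 0), ("append_cell", 1)]   # user 2: one new cell in each existing row


def apply_ops(rows, ops):
    """Apply element-level operations without any knowledge of what a table is."""
    for op in ops:
        if op[0] == "append_row":
            rows.append(list(op[1]))
        elif op[0] == "append_cell":
            rows[op[1]].append("")
    return rows


def pad_rows(rows):
    """One possible repair: pad every row to the width of the widest row."""
    width = max(len(r) for r in rows)
    return [r + [""] * (width - len(r)) for r in rows]


merged = apply_ops([row[:] for row in table], add_row + add_column)
widths = [len(r) for r in merged]                # ragged: the new row missed the new column
repaired_widths = [len(r) for r in pad_rows(merged)]
```

Every copy of the document converges to the same ragged table, so the synchronization has technically worked. A server that understands tables can run something like pad_rows afterwards to restore a rectangular shape.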

The moral is: make sure that your synchronization method is appropriate for your document types, and/or make sure that your server understands your document type enough to make sane conflict resolution decisions.

0 Comments
September 22, 2012

Switching

20:10 -0400

This past week, we changed our Internet, home phone and cell phone providers. We are now using Eyesurf for both our Internet and home phone, and PC Telecom for our cell phone.

We had been having intermittent issues with our old home phone. One of the issues was that it was a VOIP phone, and so any problems with our Internet connectivity translated to phone problems. There were also other unknown issues. So now we have a POTS line. The Internet + phone bundle price also turned out to be less than we previously paid. It has almost the same features as our old VOIP line, which is impressive for a POTS line. The main disadvantage is that voicemails are no longer sent to our email. While VOIP is the future, it's still a bit finicky for now.

As for our cell phone, we hardly use it, so it was time to look for a better plan. PC Telecom's "Anytime plan" (pay-as-you-go) seems to be the perfect fit for us. I estimate that we'll be paying about $100 - $150 per year. We also get caller ID, and free incoming text messages. PC Telecom also runs on Bell's network, so we'll have good coverage.

All in all, it looks like we'll be saving about $30 per month, and getting better service.

0 Comments