Article / 13th Nov 2012

One Change Is Not Enough

Last week, the rest of the Lanyrd team and I pulled off two major infrastructure changes in two hours - see our blog post on the move for some background information as well as an overview of the timeline.

I'd like to take a closer look at the technical details of this move and show how exactly we pulled off changing our hosting provider, database vendor and operating system version all in one go.

Hosting

For many startups, there comes a time when the switch to dedicated hardware and a more 'traditional' hosting model makes sense, and Lanyrd is one of those - we're fortunate to be at least partially a content-based site, and have two distinct advantages:

Most traffic spikes are to one or two conference pages, and are an easily cacheable pattern.
Conferences are planned well in advance, so we get good notification of when to expect high-traffic periods for more interactive use.

This means that we're not subject to a huge amount of daily variation or unpredicted traffic spikes, and even if we get some, we have enough headroom to deal with them as we've overprovisioned the new servers slightly.

Softlayer were our dedicated hosts of choice - I've had good experiences with them in the past (Minecraft Forums ran off of them for a while after we outgrew Linode), as have some of my colleagues in the ops side of the industry. In addition, they have billed-by-the-hour virtual servers available in the same datacentres as your dedicated servers, meaning that even if we outgrow our fixed capacity we can have a new frontend box up and serving requests in about 10 minutes.

The last time I did a move like this was moving Epio from AWS to real hardware back in 2011. That was slightly tougher since it was itself a hosting service, and we had to have a few minutes of downtime per app.

Changing hosting is a well-known challenge, and one I've done several times before, so we used the tried-and-tested method of putting the old site into read-only mode, syncing the databases across to our new site, repointing the DNS (and proxying requests from the old IP) and enabling the new site.
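To make the read-only step concrete, here is a minimal sketch of the general idea as a plain WSGI wrapper. It is not how Lanyrd implemented read-only mode; the class name, the 503 response and the Retry-After value are all assumptions chosen for illustration.

```python
# Hypothetical sketch: refuse anything that could write while the move is in progress.
SAFE_METHODS = {"GET", "HEAD", "OPTIONS"}


class ReadOnlyMiddleware:
    """Wrap a WSGI app and reject non-safe HTTP methods while `enabled` is true."""

    def __init__(self, app, enabled=True):
        self.app = app
        self.enabled = enabled

    def __call__(self, environ, start_response):
        method = environ.get("REQUEST_METHOD", "GET")
        if self.enabled and method not in SAFE_METHODS:
            body = b"The site is temporarily read-only while we move servers.\n"
            start_response(
                "503 Service Unavailable",
                [
                    ("Content-Type", "text/plain; charset=utf-8"),
                    ("Content-Length", str(len(body))),
                    ("Retry-After", "3600"),  # assumed value, in seconds
                ],
            )
            return [body]
        # Reads pass straight through to the wrapped application.
        return self.app(environ, start_response)
```

A real site would also want to hide or disable its forms and show a banner, so that users are not surprised by a refused request.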

For maximum reliability, I scripted the data sync step, and did several dry runs of it. Of course, the move was nowhere near as simple as that.

Lanyrd may or may not run on something like this now. Via scjody on Flickr.

Database

It's well known that I'm not a fan of MySQL, but I'm also a practical person - Lanyrd's been running on it for over two years, and there's no point in fixing what isn't broken.

That said, our tables are now at the size where adding a new column or index takes several minutes, and in the case of one table over fifteen minutes - locking the entire table in the process. We can't take that kind of downtime on our core tables, and so our primary reason to move was the ability to get transactional and/or zero-time column additions out of PostgreSQL.

There are other benefits to moving - PostgreSQL will scale to the eight cores we have in our new servers a lot better, the query planner is a lot more capable at dealing with JOINs, and dump and restore is an order of magnitude faster, allowing us to test code changes against a copy of live data much more readily.

However, changing your database is one of the most difficult things you can do to yourself if you're running ops for a decently-sized website. The main difficulty is converting from MySQL to PostgreSQL format.

We bandied around a few ideas, including a wonderfully crazy idea that we could intercept the MySQL replication protocol and translate it into commands on a PostgreSQL slave. In the end, after a few prototypes and timing tests, I settled on a dump-convert-load strategy: we'd dump the data from MySQL, convert the dump into PostgreSQL format, and load it into a blank PostgreSQL database.

MySQL has a 'PostgreSQL compatible' dump format which, although it isn't actually compatible, is a good place to start from.

The trick to making the conversion step fast is doing as little work as possible on the INSERT lines. My first converter loaded every INSERT, converted the types (for example, boolean columns were TINYINT in MySQL but BOOLEAN in PostgreSQL) and wrote the new INSERT out; that was very slow, taking several hours on our tens of gigabytes of data.

Instead, I settled on just copying the old MySQL types into the new tables, re-using the old INSERT statements, and then performing type conversion at the end using ALTER TABLE statements. This reduced the conversion time from several hours to a couple of minutes, and only added around ten minutes to the restore time from the extra ALTER statements.
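As a rough illustration of the "convert at the end" step, the sketch below builds the ALTER TABLE statements for boolean columns. It is not the actual Lanyrd converter: the function name and the input format are hypothetical, and it assumes the flag columns were loaded into PostgreSQL as an integer type with no column default that would block the type change.

```python
def boolean_fixups(columns):
    """Build ALTER TABLE statements that turn staged integer flags into booleans.

    `columns` maps a table name to the names of its columns that were
    TINYINT flags in MySQL (a hypothetical input format for this sketch).
    """
    statements = []
    for table, names in sorted(columns.items()):
        for name in names:
            # USING tells PostgreSQL how to compute the new value from the old one;
            # any non-zero integer becomes true.
            statements.append(
                'ALTER TABLE "{t}" ALTER COLUMN "{c}" TYPE boolean '
                'USING ("{c}" <> 0);'.format(t=table, c=name)
            )
    return statements


if __name__ == "__main__":
    # Table and column names here are made up for the example.
    for statement in boolean_fixups({"conference": ["is_public"]}):
        print(statement)
```

The INSERT lines are never parsed, which is what keeps the conversion pass quick; the cost moves to the database, which rewrites each altered table once at the end of the load.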

The script we used to do the conversion in the end is available from Lanyrd's GitHub account - it's quite specific to Lanyrd and Django in particular, but I hope that it will prove useful if you're considering a similar move.

Lanyrd's read-only mode banner

Operating System

After the other two moves, this one seemed like a doddle. We were only moving from Ubuntu 10.04 to 12.04 - ensuring we were on the latest LTS.

We have a few custom packages and the layout of Ubuntu has changed in the intervening two years, so we edited our Puppet files in the appropriate places to depend on $lsbdistrelease. Once we'd ironed out the edge cases, thanks to the extensive testing we did on this move, it was relatively pain-free.

(There are a few other Facter facts you can use for distinguishing OS releases; this was just the most convenient.)

Points go to Redis and Solr for being able to read older versions of their data format, meaning we could upgrade their versions without any expensive re-writing process.

All Together Now

Of course, a move like this was planned a month ahead and rehearsed for a week - see the Lanyrd blog post about that - but it's still quite exciting on the day to execute the plan and hope it works as well as it did the day before.

Excitingly, it didn't for the very first stage of the switchover - turning on read-only mode - due to a rogue comment in our requirements.txt file that had slipped in the previous day. Fortunately, we fixed that once we'd spotted it and the rest of the move went off without a hitch.

(Comments at the end of lines in requirements files aren't harmless - pip will try to interpret them.)
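A check along these lines, run before a deploy, would have caught the problem. This is a hypothetical sketch rather than anything we actually had in place: the function name is made up, and it simply flags any requirement line with a whitespace-then-# comment at the end, regardless of how a particular pip version treats them.

```python
import re

# A "#" preceded by whitespace, after some real content on the line.
# URL fragments such as "#egg=name" have no whitespace before the "#",
# so they are not flagged.
_TRAILING_COMMENT = re.compile(r"\S\s+#")


def find_trailing_comments(text):
    """Return (line_number, line) pairs for lines that end in a comment."""
    hits = []
    for number, line in enumerate(text.splitlines(), start=1):
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue  # blank lines and full-line comments are fine
        if _TRAILING_COMMENT.search(stripped):
            hits.append((number, stripped))
    return hits
```

Feeding it the contents of requirements.txt and failing the deploy when the returned list is non-empty is enough to keep this class of surprise out of a migration day.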

I'm not going to run off and move everything again next week, but it's always nice to have something like this go pretty much exactly to plan. All I need to do now is convince the rest of Lanyrd that we really want to run everything off of a box of Raspberry Pis powered by a hamster.

After all, it's much more green.