13th Nov 2012
One Change Is Not Enough
Last week, the rest of the Lanyrd team and I pulled off three major infrastructure
changes in two hours - see our blog post on the move
for some background information as well as an overview of the timeline.
I'd like to take a closer look at the technical details of this move and show
how exactly we pulled off changing our hosting provider, database vendor and
operating system version all in one go.
For many startups, there comes a time when the switch to dedicated hardware
and a more 'traditional' hosting model makes sense, and Lanyrd is one of them -
we're fortunate to be at least partially a content-based site, and have two
things working in our favour:
- Most traffic spikes are to one or two conference pages, and are an easily cacheable pattern.
- Conferences are planned well in advance, so we get good notification of when to expect high-traffic periods for more interactive use.
This means that we're not subject to a huge amount of daily variation or
unpredicted traffic spikes, and even if we get some, we have enough overhead to
deal with them as we've overprovisioned the new servers slightly.
Softlayer were our dedicated hosts of choice - I've had good experiences with
them in the past (Minecraft Forums ran on them for a while after we outgrew
Linode), as have some of my colleagues on the ops side of the industry. In
addition, they also have billed-by-the-hour virtual servers available in the
same datacentres as your dedicated servers, meaning that even if we outgrow
our fixed capacity we can have a new frontend box up and serving requests in
about 10 minutes.
The last time I did a move like this was moving Epio from AWS to real hardware back in 2011. That was slightly tougher since it was itself a hosting service, and we had to have a few minutes of downtime per app.
Changing hosting is a well-known challenge, and one I've done several times
before, so we used the tried-and-tested method of putting the old site into
read-only mode, syncing the databases across to our new site, repointing the
DNS (and proxying requests from the old IP) and enabling the new site.
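To give a feel for how the scripted version can be structured, here's a rough sketch of a cutover runner - every hostname, path and command in it is invented for illustration, not taken from our actual scripts:

```python
import subprocess

# The cutover, step by step. All hosts, paths and commands are hypothetical.
CUTOVER_STEPS = [
    ("enable read-only mode",
     "ssh old-web 'touch /srv/lanyrd/READ_ONLY'"),
    ("dump the database",
     "ssh old-db 'mysqldump --single-transaction lanyrd | gzip > /tmp/lanyrd.sql.gz'"),
    ("copy the dump to the new hardware",
     "scp old-db:/tmp/lanyrd.sql.gz new-db:/tmp/"),
    ("load the dump into the new database",
     "ssh new-db 'gunzip -c /tmp/lanyrd.sql.gz | psql lanyrd'"),
    ("proxy requests from the old IP to the new one",
     "ssh old-web 'service nginx reload'"),
    ("enable the new site",
     "ssh new-web 'rm /srv/lanyrd/READ_ONLY'"),
]

def run_cutover(steps, dry_run=True):
    """Run each step in order, stopping at the first failure. With
    dry_run=True it only reports the steps, which makes rehearsals cheap."""
    completed = []
    for description, command in steps:
        print("==> " + description)
        if not dry_run:
            subprocess.check_call(command, shell=True)
        completed.append(description)
    return completed
```

The dry-run switch is the important part - it's what lets you rehearse the exact sequence repeatedly before the day.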
For maximum reliability, I scripted the data sync step, and did several dry
runs of it. Of course, the move was nowhere near as simple as that.
Lanyrd may or may not run on something like this now. Via scjody on Flickr.
It's well known that I'm not a fan of MySQL, but I'm also a practical person -
Lanyrd's been running on it for over two years, and there's no point in fixing
what isn't broken.
That said, our tables are now at the size where adding a new column or index
takes several minutes, and in the case of one table over fifteen minutes - locking
the entire table in the process. We can't take that kind of downtime on our core
tables, and so our primary reason to move was the ability to get transactional
and/or zero-downtime column additions from PostgreSQL.
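This is PostgreSQL's transactional DDL at work: schema changes run inside transactions and can be rolled back, and adding a NULLable column is a quick catalogue update rather than a full table rewrite. As a small illustration of the rollback half - using SQLite here purely because it also has transactional DDL and fits in a blog post, with a made-up table name:

```python
import sqlite3

# SQLite, like PostgreSQL, has transactional DDL, so schema changes
# can be rolled back just like data changes.
conn = sqlite3.connect(":memory:", isolation_level=None)  # manual transactions
conn.execute("CREATE TABLE conferences (id INTEGER PRIMARY KEY, name TEXT)")

conn.execute("BEGIN")
conn.execute("ALTER TABLE conferences ADD COLUMN is_featured INTEGER")
conn.execute("ROLLBACK")  # and the column is gone again

cols = [row[1] for row in conn.execute("PRAGMA table_info(conferences)")]
print(cols)  # -> ['id', 'name']
```

Try the same thing in 2012-era MySQL and the ALTER commits implicitly - there's no way back short of another slow ALTER.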
There are other benefits to moving - PostgreSQL will scale to the eight cores
we have in our new servers a lot better, the query planner is a lot more capable
at dealing with JOINs, and dump and restore is an order of magnitude faster,
allowing us to test code changes against a copy of live data much more readily.
However, changing your database is one of the most difficult things you can do
to yourself if you're running ops for a decently-sized website. The main
difficulty is converting your data from MySQL's format to PostgreSQL's.
We bandied around a few ideas, including a wonderfully crazy idea that we could
intercept the MySQL replication protocol and translate it into commands on a
PostgreSQL slave, but in the end, and after a few prototypes and timing tests,
I settled on a dump-convert-load strategy: we'd dump the data from MySQL,
convert the dump into PostgreSQL format, and load it into a blank PostgreSQL
database.
The trick to making the conversion step fast is doing the minimal amount of
work possible to the INSERT lines. My first converter parsed every
INSERT, converted the types (for example, boolean columns were TINYINT in MySQL
but BOOLEAN in PostgreSQL) and wrote the new INSERT out, but that was very slow,
taking several hours over our tens of gigabytes of data.
MySQL has a 'PostgreSQL compatible' dump format, which actually isn't compatible, but it's a good place to start from.
Instead, I settled on just copying the old MySQL types into the new tables,
re-using the old INSERT statements, and then performing type conversion
at the end using an ALTER TABLE statement. This reduced the conversion time
from several hours to a couple of minutes, and only added around ten minutes
to the restore time from the extra ALTER statements.
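A heavily simplified sketch of that strategy - with a one-entry type table invented for illustration - might look like this:

```python
import re

# Illustrative one-entry type table; the real converter handled many more types.
LOAD_AS = {"tinyint(1)": "smallint"}    # a type PostgreSQL will accept at load time
FINAL_TYPE = {"tinyint(1)": "boolean"}  # what the column should end up as

def convert_dump(lines):
    """Rewrite CREATE TABLE column types just enough to load, pass INSERT
    lines through untouched, and collect ALTERs to run after the load."""
    out, alters, table = [], [], None
    col_re = re.compile(r'\s*`?(\w+)`?\s+(tinyint\(1\))')
    for line in lines:
        create = re.match(r'CREATE TABLE `?(\w+)`?', line)
        if create:
            table = create.group(1)
        col = col_re.match(line)
        if table and col and not line.startswith("INSERT"):
            name, mysql_type = col.groups()
            line = line.replace(mysql_type, LOAD_AS[mysql_type])
            alters.append("ALTER TABLE %s ALTER COLUMN %s TYPE %s USING %s::%s;"
                          % (table, name, FINAL_TYPE[mysql_type],
                             name, FINAL_TYPE[mysql_type]))
        out.append(line)
    return out, alters

out, alters = convert_dump([
    "CREATE TABLE attendees (",
    "  id int(11) NOT NULL,",
    "  is_speaker tinyint(1) NOT NULL",
    ");",
    "INSERT INTO attendees VALUES (1, 0);",
])
print(alters[0])
```

The INSERT lines, which make up the bulk of the dump, are appended untouched; only the handful of CREATE TABLE lines get rewritten, which is what makes the pass fast.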
The script we used to do the conversion in the end is available from Lanyrd's
GitHub account - it's quite specific to Lanyrd
and Django in particular, but I hope that it will prove useful if you're
considering a similar move.
Lanyrd's read-only mode banner
After the other two moves, this one seemed like a doddle. We were only
moving from Ubuntu 10.04 to 12.04 - ensuring we were on the latest LTS.
There are a few other Facter facts you can use for distinguishing OS releases; this was just the most convenient.
We have a few custom packages and the layout of Ubuntu has changed in the
intervening two years, so we edited our Puppet files in the appropriate places
to depend on $lsbdistrelease, and once we'd ironed out the edge cases
(thanks to the extensive testing we did on this move) it was relatively pain-free.
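To give a flavour of the kind of edit involved - the package names here are hypothetical, not from our actual manifests - a Puppet class can branch on the release fact like so:

```puppet
# Hypothetical example: a dev package whose name changed between releases.
if $lsbdistrelease == '12.04' {
    $jpeg_dev = 'libjpeg8-dev'
} else {
    $jpeg_dev = 'libjpeg62-dev'
}

package { $jpeg_dev:
    ensure => installed,
}
```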
Points go to Redis and Solr for being able to read older versions of their
data format, meaning we could upgrade their versions without any expensive
data migration.
All Together Now
Of course, a move like this was planned a month ahead and rehearsed for a week - see the Lanyrd blog post about that - but it's still quite exciting on the
day to execute the plan and hope it works as well as it did the day before.
Comments at the end of lines in requirements files aren't harmless - pip will try to interpret them.
Excitingly, it didn't for the very first stage of the switchover - turning on
read-only mode - due to a rogue comment in our requirements.txt file that
had slipped in the previous day. Fortunately, we fixed that once we'd spotted
it, and the rest of the move went off without a hitch.
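If you're wondering what that looks like, a requirements file along these lines is all it takes (the pins here are invented) - pip of that era would try to parse the trailing comment as part of the requirement:

```
# a full-line comment like this one is fine
Django==1.4.1
South==0.7.6  # TODO: upgrade -- this trailing comment is the problem
```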
I'm not going to run off and move everything again next week, but it's always nice
to have something like this go pretty much exactly to plan. All I need to do
now is convince the rest of Lanyrd that we really want to run everything off of
a box of Raspberry Pis powered by a hamster.
After all, it's much more green.