Migrating my personal projects to Mercurial

Introduction

My first acquaintance with version control came soon after the start of my professional career, at IBM in 2000. We were using RCS at that time, and later moved to CVS. Three years ago, I started using Subversion for my personal projects at home, and since then I can't imagine leaving my code outside source control for any prolonged period. Lately, Google's excellent source code hosting service has been my online repository of choice.

Staying married to a single technology or tool isn't a good strategy, however. The world of software advances quickly, and better solutions for old problems get invented all the time. Distributed version control is one such solution. It has gained a lot of popularity in the past few years and is, slowly but surely, taking over the world of source control. In this post I want to show how I discovered that Subversion is no longer good enough for my needs, and began using Mercurial in its place for managing all my personal projects and code.

The need

This week I was planning to do some self-educational hacking on the source code of Python [1], and it occurred to me that I'm going to have a problem keeping my explorations safe in a source-control system. Here's why:

Python has an official Subversion repository at python.org - you can check out a read-only copy from it, but there your benefit from source control ends. Since I don't have Python commit rights, my checked-out sandbox is just a local snapshot - I can't create branches or commit my changes anywhere.

What I could do is create a personal SVN repository, import Python into it and play around. But how would I keep up with advances in Python itself? Subversion doesn't offer a convenient way to merge changes between two separate repositories.

Another, unrelated qualm with Subversion came up with my own personal repositories. It's not new - it's a sorrow that has been accumulating over a long time. The problem with SVN is that the local copy only contains the latest revision, so the only thing it can show you quickly is the difference between that revision and your local changes. For anything else, you must turn to the repository itself over the network. And that's really slow.

Unfortunately, high Internet connection speeds aren't of much help here. The bandwidth may be sufficient, but latency is the culprit. A simple ping roundtrip to code.google.com from my PC (located in Israel) takes about 100 ms. I'm sure that the time it takes Google to dispatch my request to an SVN server, and that server to parse and understand my request, isn't negligible either. Subversion's protocol has to send and receive multiple commands for simple operations like viewing the project log or diffing older revisions. These latencies add up, making me constantly stare at a frozen screen. Even a simple and commonly needed operation like viewing the repository log takes a few seconds, and diffing old revisions takes much longer than that. This can quickly become really annoying.
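To see how the latencies add up, here is a back-of-the-envelope sketch in Python. The 100 ms round trip is the figure measured above; the number of round trips per operation and the server-side processing time are made-up assumptions, since I haven't measured what Subversion actually sends.

```python
def estimated_wait_seconds(round_trips, rtt_ms=100.0, server_ms=0.0):
    """Rough wait for an operation that needs several sequential round trips.

    round_trips and server_ms are hypothetical inputs, not measured values.
    """
    return round_trips * (rtt_ms + server_ms) / 1000.0


# Assume an operation needs 20 sequential request/response exchanges.
print(estimated_wait_seconds(20))

# Same operation if the server also spends 50 ms handling each request.
print(estimated_wait_seconds(20, server_ms=50.0))
```

With these assumed numbers the wait is 2 seconds on network latency alone and 3 seconds with server time included, no matter how much bandwidth is available.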

Mercurial is the answer

As it turns out, the first problem I mentioned bothered the Python core developers quite a bit, so about a year ago they decided to switch Python itself to Mercurial [2]. The official repository hasn't switched yet, but a Mercurial mirror exists, reflecting everything going on in the SVN repository practically in real time.

This made my decision much easier. A DVCS (Distributed Version Control System) addresses both my needs:

It allows each developer to have a full copy of the repository locally, history included. Updates from the official repository are done by pulling, but local changes can be made under full source control. You only really have to merge when you plan to push into the official repository. This is very convenient for people without commit privileges, because they can experiment with the source, incrementally tweaking stuff and saving it in the local repository.
With a local repository, everything becomes fast - you mostly work with a local copy, and only access the network to push and pull changes. Now I can leisurely explore the history of my project, diffing old revisions, all at the speed of a local hard-drive access.
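The pull, commit and merge cycle described above can be tried out entirely on a local disk. What follows is a minimal Python sketch, assuming the `hg` command is installed and on the PATH. The directory names, file names and user name are made up for illustration, and a local `upstream` repository stands in for the official one. If `hg` isn't found, the function does nothing and returns None.

```python
import os
import shutil
import subprocess
import tempfile


def hg(repo_dir, *args):
    """Run one hg command inside repo_dir and return its standard output."""
    env = dict(os.environ, HGPLAIN="1")  # ignore user aliases and defaults
    cmd = ["hg", "--config", "ui.username=demo <demo@example.com>", *args]
    result = subprocess.run(cmd, cwd=repo_dir, env=env, check=True,
                            capture_output=True, text=True)
    return result.stdout


def write_file(repo_dir, name, text):
    with open(os.path.join(repo_dir, name), "w") as f:
        f.write(text)


def local_experiment_demo():
    """Return the number of changesets in the clone, or None without hg."""
    if shutil.which("hg") is None:
        return None
    with tempfile.TemporaryDirectory() as tmp:
        upstream = os.path.join(tmp, "upstream")  # stands in for the official repo
        work = os.path.join(tmp, "work")          # my private clone
        os.mkdir(upstream)
        hg(upstream, "init")
        write_file(upstream, "README", "official code\n")
        hg(upstream, "commit", "-A", "-m", "initial import")

        hg(tmp, "clone", "upstream", "work")

        # Commit an experiment locally; no commit rights on upstream needed.
        write_file(work, "experiment.txt", "my tweak\n")
        hg(work, "commit", "-A", "-m", "local experiment")

        # Meanwhile the official repository moves on.
        write_file(upstream, "NEWS", "upstream change\n")
        hg(upstream, "commit", "-A", "-m", "upstream progress")

        # Pull the new upstream work, then merge it with the local line.
        hg(work, "pull")
        hg(work, "merge")
        hg(work, "commit", "-m", "merge upstream into my experiments")

        log = hg(work, "log", "--template", "{rev}\n")
        return len(log.splitlines())


changesets = local_experiment_demo()
print(changesets)
```

The clone ends up holding both upstream changesets, the local experiment and the merge, and nothing was ever written to the stand-in official repository from the clone.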

But what about all the disk space? Aren't repositories huge? Isn't keeping them on every computer wasteful? Far from it, as it turns out. My local Python source directory (the py3k branch, last pulled today) is about 100 MB in size. The repository part (the .hg directory), with all the history (thousands of revisions), takes less than half of this space - about 46 MB. This is due to Mercurial's highly optimized storage system, which is both diff-based and efficiently compressed. Is this a high price to pay for all the convenience? Hardly, with a 1 TB hard-drive available for less than $100 these days.
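The arithmetic behind that claim is easy to spell out. This small Python sketch uses only the figures quoted above, plus one assumption of mine: that 1 TB is counted as 1,000,000 MB.

```python
# Figures quoted in the post.
checkout_mb = 100      # whole py3k directory, working files plus .hg
history_mb = 46        # the .hg directory alone
drive_price_usd = 100  # upper bound on the price of a 1 TB drive

# Assumption: 1 TB is taken as 1,000,000 MB (decimal units).
drive_mb = 1_000_000

working_files_mb = checkout_mb - history_mb
history_share = history_mb / checkout_mb
history_cost_usd = history_mb / drive_mb * drive_price_usd

print(f"working files: {working_files_mb} MB")
print(f"history share of the directory: {history_share:.0%}")
print(f"disk cost of the history: under ${history_cost_usd:.4f}")
```

In other words, the complete history costs less than half a cent of disk space at these prices.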

Mercurial has a lot of tricks in its bag when it comes to saving space. If you create a clone of a local repository, Mercurial uses hard links (even on Windows!) to bring the overhead of the new clone to almost 0. Having multiple local clones is handy if you want to explore a separate line of development, or keep both a maintenance branch and a development trunk easily available for your project.
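To get a feel for why hard links make a second copy nearly free, here is a small Python sketch of the operating-system feature itself. It is not Mercurial's code, and the file names and size are made up. It assumes a filesystem that supports hard links.

```python
import os
import tempfile


def hardlink_demo(size_bytes=1_000_000):
    """Create a file and a hard link to it; report whether they share storage."""
    with tempfile.TemporaryDirectory() as tmp:
        original = os.path.join(tmp, "original.bin")
        linked = os.path.join(tmp, "linked.bin")

        with open(original, "wb") as f:
            f.write(b"\0" * size_bytes)

        # A hard link is a second directory entry for the same data on disk.
        os.link(original, linked)

        same_file = os.path.samefile(original, linked)
        link_count = os.stat(original).st_nlink
        return same_file, link_count


same_file, link_count = hardlink_demo()
print(same_file, link_count)
```

Both names point at the same data, so the second name adds a directory entry but no second copy of the contents.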

Windows users used to TortoiseSVN won't be disappointed - TortoiseHg is a similar tool, and it works just as well.

Overall, Mercurial has been quick and fun to learn and start using. When a tool fits your mental model, aims to solve exactly your problem, and performs its job well, it's a smooth, seamless experience. For me, there's only one thing left that feels funny, and this is the need to remember to push after I've committed. One of my uses for the online repository is to synchronize the same code between multiple computers. With SVN I got used to just committing on one machine, and having my changes available on the other with a simple update. With Mercurial, a couple more steps are required: after committing I must push, and then at the other machine pull and update [3]. I'm confident that this isn't a big issue, however, and I'll get used to it quickly.
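One way to stop forgetting the push would be a tiny wrapper script. The Python sketch below is only an illustration; the function names are hypothetical and nothing here is part of Mercurial. It builds the `hg` command lines for each machine and, by default, only prints them.

```python
import shlex
import subprocess


def commit_and_push_commands(message):
    """Commands for the machine where the change was made."""
    return [["hg", "commit", "-m", message], ["hg", "push"]]


def pull_and_update_commands():
    """Commands for the other machine; -u updates the working copy after pulling."""
    return [["hg", "pull", "-u"]]


def run_all(commands, repo_dir="."):
    """Run each command in order, stopping at the first failure."""
    for cmd in commands:
        subprocess.run(cmd, cwd=repo_dir, check=True)


# Print the plan instead of running it, so this is safe to try anywhere.
for cmd in commit_and_push_commands("Fix typo") + pull_and_update_commands():
    print(shlex.join(cmd))
```

Calling `run_all(commit_and_push_commands("Fix typo"))` inside a repository would actually execute the two steps, assuming `hg` is installed and a default push path is configured.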

Why Mercurial and not another DVCS

This question just had to surface, and the Python devs have struggled with the same dilemma. They've actually done most of the work with a great comparison of the options in PEP 374, so all I have left is to reiterate their conclusions:

I prefer a Python-based system because, well... because I like Python! It's fun reading about Mercurial's internals and then being able to peruse the Python source code that implements it. So this throws Git out of the window [4].
As for Bazaar, I don't have a strong preference so I go with the crowd. Mercurial is more popular. It's used by huge projects like Mozilla, Vim, XEmacs, and Python. The last, in particular, seals the deal. If I want to hack on Python, Mercurial is the natural choice.

Resources

Here are some resources I've found very useful in the transition, in no particular order:

Hg Init: an amazing Mercurial tutorial by Joel Spolsky. Highly recommended, to understand both the how and the why of Mercurial.
Mercurial: The Definitive Guide: a complete book, freely available online.
Python PEPs 374 and 385
hgrc: Documents the Mercurial configuration file
Official Mercurial FAQ
The Google project hosting Mercurial FAQ

[1]	More specifically CPython, the "official" implementation.

[2]	The reasons for the switch, along with various considerations of the competing SCMs, are described in detail in PEP 374.

[3]	Pulling and updating can be done in a single step by issuing `hg pull -u`.

[4]	For the sake of fairness I must note that the C source code of Git is pretty good. I dug into it a while ago for other purposes and was pleased by its readability and overall quality.