Saturday, March 3, 2007

Cutting Edge Revision Control

I have to admit that I’m kind of a tools guy. Every now and then I get this hankering to try out new tools and see if there is a better way to work. For the past few weeks I’ve been researching the latest and greatest revision control tools that are available as Free and Open Source software. I’ve done some quick comparisons of what appear to be the front-runners as of this third month of 2007.

Background

Before this adventure of mine I had only used three revision control systems. At work I have used CVS on a couple projects, and currently my team uses Clearcase. At home I’ve used CVS and Subversion (svn) for my various little bits of code. All are annoying in their own special ways. You’ve probably heard the annoyances before: CVS doesn’t keep history of renaming files; Clearcase is complicated, slow, completely reliant on the LAN, and complicated; and Subversion, well, I’ll get to that in a second. First I need to talk about their good points.

All of these systems I’ve used did have good points. CVS is simple, fast, and well documented. Subversion has all that, and adds rename support and a cool web server interface. Clearcase’s good point is that it introduced me to the real fun of branching. Branching is handy in that when you need to develop a tricky new feature, you can create a branch, work on your feature and check minor changes in as you go with no regard to breaking the build or disturbing other developers, and then when you are done you simply merge the new feature back into the main branch. Other changes and bugfixes can be done at the same time on other branches. It’s very nice.

So back to Subversion’s annoyances. This may or may not be something Subversion inherited from CVS, I never looked into it, but Subversion does do branching. I tried to use a branch with Subversion for a big change on a home project of mine, and it actually went pretty well. I learned something quite disconcerting though. Subversion doesn’t track merges automatically. The hassle of tracking merged revisions manually was just too much to ask. I was discouraged from branching with Subversion ever again. There had to be a better way.

Turns out there is a better way in the world of open source revision control. About a million better ways, in fact. Check out this comparison chart.

After reading a lot of comparisons like the above, I concluded that of all those revision control choices, the ones that are being actively developed and used by fairly big and/or prominent projects are Bazaar (AKA bazaar-ng), Mercurial, Darcs, and git. And actually I’ll state right here that it’s somewhat questionable whether Darcs meets my criteria (though I wanted it to, really). What follows is my comparison of these four systems. I compared them for speed and ease of use, the two things I cared about most. I also briefly looked at how they work over a network and their windoze support, but not in great detail. That would be a good sequel to this review.

Speed

By speed, I mean that I used the UNIX time command to see how long various basic operations took. Performing those operations gave me some insight into the usability of each tool as well. For this test I used a directory with 266 MB of files, 258 KB of which were text files, with the rest being image files. I know, kind of weird to version all those binary files, but that was the project I was interested in testing this out on. Your mileage may vary and all that. Here’s a table summarizing the real times reported by time(1):

| Tool | initialize repository | initial file import | initial commit | branch/clone repository | non-conflicting merge | total |
| --- | --- | --- | --- | --- | --- | --- |
| bzr | 0m1.144s | 0m0.839s | 1m7.836s | 0m31.145s | 0m1.154s | 1m42.118s |
| darcs | 0m0.429s | 12m50.321s | 0m0.164s | 0m5.691s | 0m6.717s | 13m3.322s |
| git | 0m0.081s | 1m1.918s | 0m6.679s | 1m37.630s | 0m1.910s | 2m48.218s |
| hg | 0m0.781s | 0m0.377s | 0m49.015s | 0m8.831s | 0m0.342s | 0m59.356s |

As you can see, Mercurial (hg) was the fastest. I was a little disappointed in git, whose whole purpose in life (depending on what you read) is to be fast. I’m thinking maybe it just doesn’t handle the binary files as well. Whatever. In the end I decided performance wasn’t that important a feature for me. (But it still totally rocks that an app written in Python (well, most of hg is Python) kicked the pants off an app written in bare-metal, hard-core, “efficient” C. OK, I’ll stop being juvenile now.)
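The total column is just the sum of the five real times in each row. Here is a minimal sketch for re-adding them, assuming the figures exactly as printed in the table (the helper names `to_ms` and `fmt` are made up for this example):

```python
import re

# Real times copied from the table, in column order:
# initialize, import, commit, branch/clone, merge.
TIMINGS = {
    "bzr":   ["0m1.144s", "0m0.839s", "1m7.836s", "0m31.145s", "0m1.154s"],
    "darcs": ["0m0.429s", "12m50.321s", "0m0.164s", "0m5.691s", "0m6.717s"],
    "git":   ["0m0.081s", "1m1.918s", "0m6.679s", "1m37.630s", "0m1.910s"],
    "hg":    ["0m0.781s", "0m0.377s", "0m49.015s", "0m8.831s", "0m0.342s"],
}

def to_ms(text):
    """Convert a time(1) value such as '1m7.836s' to integer milliseconds."""
    match = re.fullmatch(r"(\d+)m(\d+)\.(\d{3})s", text)
    if match is None:
        raise ValueError(f"unrecognized time: {text!r}")
    minutes, seconds, millis = (int(part) for part in match.groups())
    return (minutes * 60 + seconds) * 1000 + millis

def fmt(ms):
    """Format integer milliseconds back into the time(1) style."""
    minutes, rest = divmod(ms, 60_000)
    return f"{minutes}m{rest // 1000}.{rest % 1000:03d}s"

# Sum each row in whole milliseconds to avoid floating-point rounding.
totals = {tool: sum(to_ms(t) for t in times) for tool, times in TIMINGS.items()}

# Print the tools from fastest to slowest overall.
for tool, total in sorted(totals.items(), key=lambda item: item[1]):
    print(f"{tool:6s}{fmt(total)}")
```

Summed this way, the hg row comes to 0m59.346s, ten milliseconds under the total printed in the table, so one of the hg figures was probably mistyped somewhere along the way. The ranking is the same either way.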

Usability

General Usage

The general workflow and command set for each is very similar. Darcs is the only outlier here, having chosen to diverge from familiar cvs-like commands in favor of “record” instead of “commit,” and “changes” instead of “log” or “history,” and “whatsnew” instead of “diff.”

The other area where they are slightly different is in handling merges. For both git and darcs, it’s a one-command operation, as long as there are no conflicts. For hg you do an hg pull from one branch to the other, and then there is an hg merge command, followed by an hg commit to finalize the merge. With bzr it’s similar, except you use bzr pull only if the branches haven’t diverged; if they have, it will tell you that you need to use bzr merge instead (let’s all shake our heads at that one together… if it knows, and can tell you about it, why doesn’t it just do it?). Then after your merge you have to do a bzr commit. Bazaar also throws in a few extra steps when resolving conflicts that the others don’t have. It tells you all about those when the need arises, similar to the pull vs. merge issue.
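To put the four merge recipes side by side, here is a rough sketch that only prints the commands and runs nothing. The hg and bzr steps are the ones described above. For git and darcs I'm assuming the single command is `pull`. The `../parent` path and the commit message are placeholders.

```python
import shlex

# Merge recipes for pulling the parent's changes into the child branch.
# "../parent" and the commit message are placeholders for illustration.
MERGE_STEPS = {
    "git":   [["git", "pull", "../parent"]],      # assumed one-command form
    "darcs": [["darcs", "pull", "../parent"]],    # assumed one-command form
    "hg":    [["hg", "pull", "../parent"],
              ["hg", "merge"],
              ["hg", "commit", "-m", "merge from parent"]],
    # For diverged branches; bzr pull only works when they have not diverged.
    "bzr":   [["bzr", "merge", "../parent"],
              ["bzr", "commit", "-m", "merge from parent"]],
}

def show_steps(tool):
    """Return one tool's merge recipe as shell-quoted command lines."""
    return [shlex.join(step) for step in MERGE_STEPS[tool]]

for tool in MERGE_STEPS:
    print(f"{tool}: {len(MERGE_STEPS[tool])} step(s)")
    for line in show_steps(tool):
        print("   ", line)
```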

I should also mention the nefarious and notorious git index. Technically, when you make a change to a file, you can’t just commit it; you need to add it to “the index” first. I never dug deep enough to fully understand what that really was and why it was needed, because you can just add a -a to the git commit command, and then it automatically adds changed files to the index and works just like everything else. But before I figured that out it was pretty annoying.
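For what it's worth, the behaviour can be mimicked with a toy model. This is a deliberate simplification with made-up names (`ToyRepo`, `all_tracked`), not how git is implemented: a file's content lives in the working tree, in the index, and in the last commit, and a commit only ever records what is in the index.

```python
class ToyRepo:
    """Toy model: content lives in the working tree, the index, and the last commit."""

    def __init__(self):
        self.worktree = {}   # path -> content on disk
        self.index = {}      # path -> content staged for the next commit
        self.head = {}       # path -> content in the most recent commit

    def add(self, path):
        """Roughly `git add PATH`: copy the on-disk content into the index."""
        self.index[path] = self.worktree[path]

    def commit(self, all_tracked=False):
        """Roughly `git commit`, or `git commit -a` when all_tracked is True."""
        if all_tracked:
            # -a restages files that are already tracked; new files still need add().
            for path in list(self.index):
                if path in self.worktree:
                    self.index[path] = self.worktree[path]
        # The commit records the index, not the working tree.
        self.head = dict(self.index)


repo = ToyRepo()
repo.worktree["notes.txt"] = "v1"
repo.add("notes.txt")
repo.commit()

repo.worktree["notes.txt"] = "v2"   # edit the file without staging it
repo.commit()                        # plain commit: the edit is left out
print(repo.head["notes.txt"])        # v1

repo.commit(all_tracked=True)        # commit -a: the edit is picked up
print(repo.head["notes.txt"])        # v2
```

The model ignores deletions, and unlike this sketch the real git refuses to make a commit when nothing is staged.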

Lastly on the topic of general usability, I had a strange thing happen amidst all this version control software testing. I’d get this really happy feeling when I was using Mercurial, even though it was doing some weird things like requiring three steps to merge something and having difficulty with file and directory renames (more on this later). I’d get a similar happy feeling when using git, even though it has some UI oddities of its own (namely, the index). Upon further consideration, pondering, and meta-cognition, I believe it’s because hg and git are so dang easy to type, and bzr and darcs are one-handed contortions to type. Try it:

hg st
hg ci
hg
hg
hg

git
git
git
git

bzr
bzr

darcs
darcs

Weird, but it made a difference. Pondering a little further, I realized that git and hg are also just so dang fast compared to bzr and darcs. That makes a big difference too, at least when running these little test cases in rapid succession. There’s a software usability lesson to be learned here somewhere.

Renames

The other part of my evaluation was to see how each of these tools handled file and directory renaming. I came up with some scenarios that may seem pathological, but I’m pretty sure I’ve seen, or at least come close to seeing, each one of these in my Clearcase usage (and usually it’s quite impressive in its handling of them).

In each case I created a repository and made a branch of that repository. I refer to the initial repository as the parent branch, or just parent, and the branch as the child branch, or just child. Read on to see how it all came out.

Scenario 1
  1. renamed a file in parent
  2. edited same file in child
  3. merged from parent to child

It was expected that the merge would preserve the edits made in child and the file would be renamed properly. That would be “the right thing.”

How they did

bzr did the right thing. It has a bzr mv command to rename files or directories. When you do the rename and then diff it just tells you that the file was renamed. In the child branch the merge worked flawlessly.

hg started asking me confusing questions on the merge that I didn’t want to have to think about, and we both got confused. I ended up with two copies of the file in the child, one with the old name and one with the new. At least the default version that apt-get installed on Ubuntu Edgy Eft, 0.9.1, did this. I downloaded and installed the latest version, 0.9.3, and repeated the exercise. It did the right thing that time. One complaint though, is that when you hg mv a file and then do hg status it shows you a delete of the original file name and an add of the new one. An hg diff shows you the entire contents of the file, twice, once for the deleted one, once for the added one.

git did the right thing. The interesting thing about git is that you can either use the git mv command to do this operation, just like any other version control tool, or you can use the regular old UNIX ‘mv’ command. After renaming the file with mv, git will notice that you have a new file which you can then ‘git add’. It will then figure out that it’s really just a renamed version of the original file. When you go do the merge in the child directory it does the right thing, either way you do it. I should note that using just the UNIX ‘mv’ command you get the full file when you do a git diff, similar to hg. If you use ‘git mv’ then ‘git diff’ will just say the file was renamed.
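I don't know the details of git's heuristic, but the general idea of noticing a rename from content alone can be sketched like this. The function name `guess_renames` and the 0.5 threshold are assumptions for this example, not something taken from git.

```python
from difflib import SequenceMatcher

def guess_renames(deleted, added, threshold=0.5):
    """Pair each deleted path with the most similar added path.

    deleted and added map path -> file content. threshold is an assumed
    cut-off for this sketch, not a value taken from git.
    """
    renames = {}
    for old_path, old_text in deleted.items():
        best_path, best_score = None, threshold
        for new_path, new_text in added.items():
            if new_path in renames.values():
                continue  # already claimed by another deleted file
            score = SequenceMatcher(None, old_text, new_text).ratio()
            if score >= best_score:
                best_path, best_score = new_path, score
        if best_path is not None:
            renames[old_path] = best_path
    return renames

# A file that vanished under one name and showed up unchanged under another.
deleted = {"README": "line one\nline two\nline three\n"}
added = {
    "README.txt": "line one\nline two\nline three\n",
    "TODO": "something else entirely\n",
}
print(guess_renames(deleted, added))  # {'README': 'README.txt'}
```

Nothing here depends on having told the tool about the move, which is why a plain UNIX ‘mv’ followed by ‘git add’ can end up recorded as a rename.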

darcs did the right thing, very similar to bzr but with fewer commands needed.

Scenario 2
  1. rename a directory in parent
  2. edit a file in that directory in child
  3. merge from parent to child

It was expected that the merge would rename the directory, and preserve the changes made in the file under that directory. That would be the right thing.

How they did

bzr did the right thing.

hg did the right thing (using 0.9.3 from here on out), but the child ended up with two copies of the directory, one under the original name and one under the new name. If I clone the child though, it comes out with just the newly renamed directory and the correctly edited file. Kinda weird, but I would be surprised if it’s not fixed soon.

git did the right thing, using ‘git mv’ or just ‘mv’ to rename the directory. That’s just so cool how it figures these renames out like that.

darcs did it just fine.

Scenario 3
  1. move a file from one directory to another in parent
  2. edit that file in the child
  3. merge from parent to child

It was expected that the file would be moved in the child while preserving the edits made to that file in the child.

How they did

bzr handled it just fine.

hg handled it just fine.

git handled it just fine.

darcs handled it just fine.

Scenario 4
  1. rename file in parent
  2. edit same file in parent
  3. make a conflicting edit of same file in child (no rename)
  4. merge from parent to child

It was expected that on the merge, the file would be renamed with some sort of conflict resolution taking place.

How they did

A few more details this time, to try to give a feel for what working with each one is like.

bzr: I remembered to merge, not pull, and it informed me there was a conflict in the file, with the new filename. I then manually opened the file, found the cvs-like conflict markers it had inserted in the file, and resolved the conflict. Then I couldn’t just commit after resolving the conflict; I had to ‘bzr resolve FILE’, then ‘bzr commit’. A lot of steps, but at least it was helpful and walked me through it. It could have been even more helpful and just done it for me!

hg said it was merging the oldfilename with the newfilename, and then popped up my three-way diff tool that I had configured (emacs ediff, awesome tool, by the way). After resolving the conflicts with that (no manual editing or cvs-like conflict markers needed) the diff showed the whole freaking file, twice. Not very helpful. Then I checked in and everything was fine.

git told me there was a conflict in the newly renamed file. I then manually opened the file, found the cvs-like conflict markers, and resolved the conflict. After resolving the conflict the diff just listed the new filename, twice, which is kinda weird. Then it just needed a ‘git commit -a’.

darcs informed me there was a conflict in the file, with the new filename. I then manually opened the file, found the cvs-like conflict markers, and resolved the conflict. Then it just needed a ‘darcs record’. Very straightforward.
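For anyone who hasn't met cvs-like conflict markers, here is a small sketch of what a conflicted region looks like, plus a check that none are left before committing. The labels after the markers (`child`, `parent`) are made up; each tool writes its own.

```python
MARKERS = ("<<<<<<<", "=======", ">>>>>>>")

def render_conflict(ours, theirs, ours_label="child", theirs_label="parent"):
    """Wrap two competing versions of a region in cvs-like conflict markers.

    ours and theirs are expected to end with a newline.
    """
    return (
        f"<<<<<<< {ours_label}\n"
        f"{ours}"
        "=======\n"
        f"{theirs}"
        f">>>>>>> {theirs_label}\n"
    )

def has_conflict_markers(text):
    """True if any line still starts with one of the three markers."""
    return any(line.startswith(MARKERS) for line in text.splitlines())

conflicted = render_conflict("color = blue\n", "color = green\n")
print(conflicted, end="")
print(has_conflict_markers(conflicted))         # True
print(has_conflict_markers("color = green\n"))  # False
```

Resolving by hand means keeping one side (or a blend of both) and deleting the three marker lines. The check is deliberately crude and will also flag a legitimate line that happens to start with `=======`.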

Conclusion

All four of these revision control tools handle merging as well as Clearcase, without the need for a dedicated IT professional supporting a specialized server. They also do renames, as well as all the other basics you expect from a revision control tool. They also have some innovative new features beyond branching and renaming that I haven’t talked about, things like emailed patches, bisect, tarball exports, hooks, plugins, and so forth. You really should try one of them out.

But which one? All seem to be reasonably usable. Darcs still reportedly has a deep, serious bug. Don’t use it (though it is nice). The other three have slight differences. Git supports the easiest renaming and moving of files, because you can just use the UNIX commands to do it all, then a single ‘git add’ to pick up all the changes. However, its diffs don’t show you what happened with all the renaming as well as bzr’s do. Hg’s diffs are just as unhelpful as git’s, maybe even less helpful. So for a project where I expect to do a lot of renaming and moving of files, hg probably isn’t the way to go for now. I’m leaning slightly toward bzr because of the more straightforward diff output. For a project where files are pretty much going to stay put I’ll probably use hg because it’s so fun to type, and it’s just so dang fast. In the end you are a big boy or girl. You can decide for yourself.


P.S. O CSS wizards, my table is too wide for my blogger template. I tried really hard to get Google to tell me how to fix it. I thought maybe I could use the overflow property to make it scroll horizontally, like my preformatted text, but to no avail. Any help would be greatly appreciated.

UPDATE: Thanks for the css tips in the comments! Wrapping the table with a div and then adding the overflow:auto for the div was what finally worked. Well, at least on Firefox. Who cares about anything else, right? ;-)

22 comments:

Anonymous said...

In git, creating a branch and cloning the repository are not the same thing. You create a branch with git checkout -b branchname. That writes one small value to storage, and that takes virtually no time. And for faster local cloning, use git clone -l. If you also want to use the base object store as a reference, use git clone -l -s, but take care when pruning the original store.

As usual, the person crowing about high-level languages doesn't know what's going on beneath the hood, and so doesn't know that the timing comparison is bogus. ;)

Bryan said...

Thank you for your comment. I'm glad that you've pointed out a couple more efficient ways to use git. I hope to see more advice and corrections in the comments here.

I almost didn't put in the timing comparison because it is really dependent on what the setup and development model of your project is. What if you really need to do a lot of full repository clones? What if you need to do a lot of clones over the network? What if you don't plan on doing much branching at all? That's why this kind of timing comparison is almost always bogus.

That being said, the final column of my timing table is the simple sum of all the other times for each tool. Mercurial's total was smallest, and that's why I declared it the winner. It just so happens that if you completely throw out git's clone and merge times, it's still slower than hg.

But please remember, that's just for this particular setup on my particular machine.

Anonymous said...

Do you have some more details on the claim "Darcs still reportedly has a deep, serious bug. Don’t use it (though it is nice)"?

Masklinn said...

Two questions:

1. Which version of Darcs did you use? The "last stable" 1.0.8 or the 1.0.9 release candidates (which I found to be much faster, esp. on binary files)

> Darcs still reportedly has a deep, serious bug.

Would it be possible to have more information on that?

Anonymous said...

For the darcs issue:

http://zooko.com/darcs_demystified.html

Scroll down to the bottom half of the page. These are corner cases so are unlikely to occur.

That's all I can guess the author is referring to.

Anonymous said...

while svn can't track rename itself, CL Kao's SVK tool can - and also enables local branches for offline commit and all sorts of other similar goodies. Since svn repos have become something of a lingua franca a lot of people are now using svn backend with svk on the frontend for OSS dev.

Anonymous said...

re: CSS...

I would try:

.post-body table { overflow: auto !important; }

I personally use darcs; I chose it mostly because of the features you covered here, and the fact that its entire manual is 1/10 the size of SVN's (the online O'Reilly book). I'm well aware of the 'serious bug', and recognize it for what it is: a corner case. I would be surprised if I got bit by it, as I don't share my darcs repo with other people, and that seems to be a requirement to trigger the bug. Most importantly, if it does lock up, you can kill the process without leaving your repo in an unknown state (based on what I was able to discover).

Therefore, *I* would suggest that people should give it a look, as long as they're aware of where it can fall down (and besides, it gets so many things *right*, it's hard for me to say 'stay away').

Kurt said...

google darcs doppelganger. Look for "ICFP meetup notes".

It's not a corner case if you are a package maintainer. Here's what happens:

1. Upstream distributes by tar.gz
2. You maintain the packaged source in a darcs repo. You find a bug, and send the patch upstream.
3. Upstream makes a new release incorporating your patch.
4. You maintain a darcs repo of the pristine upstream; you update this repo from the new tar.gz.
5. You pull from the pristine repo into your packaging repo.

You've just duplicated your patch. Darcs will probably lock up on the pull.

It can also happen if two developers make identical, but obvious, changes.

Anonymous said...

Hey,

Definitely check out SVK.

It's compatible with SVN, but gives you just about all of the features that the other packages provide.

With SVN compatibility, you are able to mirror other open source repos on your local box. It's a very cool feature.

Anonymous said...

Kurt,

if you pull from the packaging repo into the pristine repo, would the problem still happen?

Bryan: you can add overflow by doing two things:

1) wrap a div tag around the table
2) add .post-body div { overflow: auto; }

Bryan said...

Masklinn: It is darcs 1.0.8. I hope the other comments about the darcs bug, or corner-case as the case may be, answered your second question. I haven't personally experienced it in my small amount of testing, but it sounds nasty enough that I figured some warning was due. I hope I'm not just spreading FUD.

Anonymous #7, your css tip wins. Thank you so much.

Others, thanks for the comments. I have heard of SVK but haven't tried it out. It sounds like too weird of a hybrid for me. One of the things I love about these distributed revision control systems is that there is no need for a central server. You can enter any existing directory (your home directory, /etc, whatever) and turn it into a repository in place. Heck, you could turn it into four different repositories if you have each tool ignore the other tools' meta-data directories (that's just crazy, don't really do it). No fiddling with apache configs and webdav for network support either. Do you get any of that with svk?

Vineet said...

Subversion's branching/merging support becomes decent if you use svnmerge.py to help track merges. It stores merge metadata and allows you to cherry-pick changesets across branches, avoiding changesets that have already been merged and/or explicitly ignored. Compared to using an SCM that does merge-tracking internally, it is sort of a hack, but it actually works pretty well if you're using subversion.

Vineet said...

Also, in Dvorak, the 'h' and 'g' are both on the right-hand index finger, so it doesn't score any typeability points from me =)

Tart said...

I liked your high-level overview of the tools. I have been using hg for a while mainly because we have Win machines that need to be part of the development. Regarding the ease of typing the names... alias is your friend! I use dvorak, and hg is on the same finger. Well, that is, it _was_ that way until alias h=hg came around :)

Interesting bits about the "auto-guess-your-rename" feature in git. Not sure whether I like it or not. PFM (Pure Freaking Magic) behind my back. I'll have to check it out. Renames are the one thing I don't like in hg.

Anonymous said...

In regards to darcs, it does have some serious issues. It is also probably the nicest to use when it is not having said issues. Just read the archives for the darcs mailing list, you'll see tons of people asking about hangs, memory issues, slowness, etc.

John Goerzen said...

A couple of things about Mercurial...

#1. The "hg fetch" command automates the pull/update or pull/merge/commit process. Just use "hg fetch" instead of "hg pull" and the three steps magically turn into one.

#2. hg addremove -s uses a rename detection heuristic like git to detect renames that may have occurred in the filesystem. The nice thing is that this is optional in Mercurial. You do have a choice about telling the system exactly what renamed, rather than having it guess.

#3. hg export and hg export --git give you nicer diffs than hg log -vp or hg diff will. Note especially the --switch-parent option to hg export, which when used on a merge changeset, switches which parent the changeset is diffed against. Very slick IMHO.

With hg export --git, you get the exact same patch format as git, which Mercurial can both import and export.

Bryan said...

John,

Thanks for your mercurial tips. Your revision control blog entry was one of my introductions to these tools.

My version of hg (0.9.3) doesn't seem to understand the fetch command, but it does have addremove which I hadn't noticed before. Very cool.

I'll have to try out your diff tips too.

Martin said...

Bazaar has tried to follow the edict of "make it work, make it right, make it fast." We're focussing on fast now, and the 0.15 release makes some operations twice as fast, with some more gains still to be picked up.

Anonymous said...

In git you get the renames/copies in diff with the flags -M/-C.

Anonymous said...

Clarification: git has the -M flag for rename detection _purely_ for compatibility reasons. The default is _not_ to use it, since for example GNU patch does not grok the format. (Many use git to work around other SCMs' shortcomings, and you will never know that it was actually git under the hood.)

And the -C flag is for copy detection, which is also a nice feature.

Mark Stosberg said...

Hi,

The "serious darcs bugs" referred to have now been addressed with Darcs 2, which is in pre-release. You can read about it here.

David Eisner said...

Thanks for a well written, pleasant to read article.

Though I am quite happy with your pragmatic, ymmv benchmarking, I thought you might be interested in a comment on what hg and git are doing on that initial commit. This will not be a very authoritative one.

hg stores its repository in a structure mirroring the working directory, and as you commit it calculates and stores differentially compressed logs on a per-file basis. The head copy is in fact basically the working directory copy, rather than it being stored anywhere else at this point.

The initial commit, then, and added files in subsequent commits, have nothing much to do, apart from create an essentially empty structure, O(#files) rather than the size of all content. It's more or less a lndir but with hard links, if that helps.

So, it's no surprise this doesn't take very long. Not that that undermines the advantage.

git, on the other hand, makes no assumption that the previous version of a file is a good one to differentially compress against, nor that files with different paths are essentially incomparable. Rather, you might say it stores by content or at least not according to the original working directory paths. Everything is duplicately stored in the git repository directory by a hash of the content. Then, loose associations and differential compression can be made according to some clever match-testing for what would make good deltas.
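As a small illustration of storing by content: assuming git's SHA-1 blob format (a `blob <size>` header, a NUL byte, then the file's bytes), the key a file is stored under depends only on its content, never on its path. The `store` dictionary and `put` helper are stand-ins invented for this sketch, not git internals.

```python
import hashlib

def git_blob_id(content: bytes) -> str:
    """Object id for a file's content under git's SHA-1 blob format."""
    header = b"blob %d\0" % len(content)
    return hashlib.sha1(header + content).hexdigest()

store = {}  # object id -> content; a stand-in for the object database

def put(content: bytes) -> str:
    """Store content under its own hash and return the key."""
    oid = git_blob_id(content)
    store[oid] = content  # identical content always lands on the same key
    return oid

a = put(b"same bytes\n")   # say, committed as old/name.txt
b = put(b"same bytes\n")   # the same file after being renamed
print(a == b, len(store))  # True 1
```

A renamed but otherwise untouched file therefore maps to an object that is already in the store.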

An initial commit has as much work as a similarly sized subsequent commit. That is, it has to hash the path and content of every file in the commit, and depending on your settings may also be packing according to this similarity testing. One of the properties that spins out of this is the magic rename handling that you observed. The other is that compression and performance gains are observed as much for inter-file similarities as for the intra-file-history similarities usually considered. Naturally, the usually considered cases are prominent in most source repositories and efficiency in working with these dominates in many.

So, anyway, it's a minute of what some might call clever, cool indexing-fu.

You will detect that I have more experience with git and have very few complaints. I would choose git by default for a new project of just about any scale. I think hg has the edge for portability, and I keep fossil on a usb stick as a trial of that aspect.