Friday, June 28, 2013

Git Branches Are Not Branches

Git branches have confused me (someone who uses mercurial a lot and git a little) for a while, and I have finally realized why. The problem is that "branch" is a poorly chosen name for what a git branch really is. You see, all the changeset history in git is stored as a Directed Acyclic Graph (DAG). The code history might be simple and linear, which gives the DAG a simple path like so (o's are nodes in the graph, called changesets; -'s are references from one node to another, with time progressing from left to right):

o-o-o-o-o

Or the code history and corresponding DAG could be more complicated:

                 o-o-o
                /     
    o-o-o     o-o-o-o-o
   /     \   /         \
o-o-o-o-o-o-o-o-o-o-----o-o

Most English speakers would agree that those parts of the DAG (code history) where a node has two children (representing two parallel lines of development) are called branches. The above example has four branches in the history, four branches in the DAG, right? The confusion with git branches, however, is that the above diagram may actually represent a git repository with only one git branch, and the diagram above that, with the linear history, could represent a git repository with any number of git branches. A git branch is not a branch in the DAG representation of the changeset history.

This is possible because a git branch is actually just a label attached to a changeset. It's just a name associated with a node in the DAG, and you can add labels to any node you want. You can also delete these labels any time you want. I believe the git developers chose to use the term branch for these labels because the labels are primarily used to keep track of DAG branches, but in practice the overloading of the term causes a lot of confusion. When a git user says he's deleting a branch, he's really just deleting the label on the branch in the DAG. When a git user shows you a linear history like in the first diagram and then starts talking about the branches contained in that history, he's really just talking about the different labels applied to various changesets in that history.
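To make the distinction concrete, here is a minimal sketch in Python of a toy model, not of git's actual storage. The changeset ids (`c1` to `c5`), the label names, and the helper `dag_branch_points` are all made up for illustration. It models the linear history from the first diagram, counts the DAG branches (nodes with two or more children), and then attaches and deletes labels without touching the history.

```python
from collections import defaultdict

# Linear history o-o-o-o-o, stored as changeset -> list of parents.
# The ids are hypothetical; real git uses hashes.
parents = {
    "c1": [],
    "c2": ["c1"],
    "c3": ["c2"],
    "c4": ["c3"],
    "c5": ["c4"],
}


def dag_branch_points(parents):
    """Return the changesets that have two or more children."""
    children = defaultdict(list)
    for node, node_parents in parents.items():
        for parent in node_parents:
            children[parent].append(node)
    return sorted(node for node, kids in children.items() if len(kids) >= 2)


# Three "git branches" on a history that has no DAG branches at all.
labels = {"master": "c3", "mywork": "c5", "experiment": "c5"}

# Deleting a "branch" removes only the label; the changesets stay put.
del labels["experiment"]

print(dag_branch_points(parents))  # nodes with two or more children
print(sorted(labels))              # the labels that remain
```

In this model the number of labels and the number of DAG branches are simply unrelated quantities, which is the source of the confusion described above.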

Labels such as these are very common in computer programs, and there are a number of common English terms that convey a much clearer picture of their function and purpose: label, tag, pointer, and bookmark come to mind. There are pages and pages on the internet that try to explain what git branches are and what you can and can't do with them when, I believe, a better name would remove the need for most of that. Personally, I now just say label or tag or bookmark in my head whenever I read branch in a git context, and things are much less confusing.

I hope that helps someone besides me who is learning git. Next week I'll talk about how the git index is nothing like an index :-)

(By the way, if you have a choice of which to use, mercurial works about the same as git and has better names for things.)

11 comments:

Anonymous said...

Actually Mercurial also messed up the branch terminology.

If you use the command "hg branch", it will just mark your working copy with a stamp, or better yet: a color. A commit of this working copy will then have the color. That's all.

Much of the confusion about why a Mercurial named-branch can have many heads, can be discontinuous, has a global namespace, etc. comes from that technical background: named-branches are just node-colors, nothing more.

IMHO, DVCS should not talk about commands creating or deleting/moving branches at all, because a branch in the usual meaning is just a commit with 2 or more children. And this happens implicitly in both Mercurial and Git.

Bookmark is the best term for what Git branches are, and color is the best for what Mercurial branches are. But these mistakes are unlikely to be fixed (and evangelists will of course vigorously debate whether they are mistakes or not), because of backwards compatibility.

Perhaps the next generation of version control tools will get it right...

Bryan said...

Being more familiar with Mercurial I hadn't thought about it as much, but you are correct, "branch" in mercurial is an overloaded term as well. I usually substitute "named branch" in my head when I read "branch" in a mercurial context (color doesn't work for me).

So, to summarize:

git branch == bookmark

mercurial branch == named-branch

Mercurial did get one thing right (a somewhat recent addition):

mercurial bookmark == bookmark

aap said...

I see why you would say git branch == bookmark, but really, what they are most used for is swapping the working tree among different branches.

Anonymous said...

Only that they do not have to be branches at all.

o-o-master-o-mywork

Where is the branch here? Topologically, there is none. "master" is just a bookmark on a certain commit in my perfectly linear history. If I switch "branches" here, I actually switch between 2 versions in the same linear ancestry.

With Git, it gets even more complicated. You have to "merge" mywork into master in order to get the changes pushed to the remote. Yet all that the "merge" actually does is "fast-forward" the master pointer to mywork.

Linus (and later Junio) invented many terms like that just for the heck of it. And often did so deliberately in total contrast to established terminology, because they "looked at what CVS did and made the exact opposite". Engineering by arrogance, that is. And the current UI catastrophe of Git is a direct result of that.

It is just funny how many DVCS proponents of the early days called SVN fans "brain-damaged", when a similar Stockholm-syndrome is clearly observable with Git fans these days.

Bryan said...

I might have toned down the rhetoric a little, but you make a good point about fast-forward merges. :-)

That was another point of confusion for me. Why do people say "merge" when the history is all linear? Ah, because it's not actually a merge, it's just moving those labels around.
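The fast-forward case can be sketched with the same kind of toy model. This is a rough illustration in Python under stated assumptions, not git's implementation: the ids `c1` to `c4` and the helpers `is_ancestor` and `fast_forward` are hypothetical names. It mirrors the o-o-master-o-mywork example: when master's changeset is an ancestor of mywork's, "merging" creates no new changeset and only moves the label.

```python
# Linear history: c1 - c2 (master) - c3 - c4 (mywork). Ids are made up.
parents = {"c1": [], "c2": ["c1"], "c3": ["c2"], "c4": ["c3"]}
labels = {"master": "c2", "mywork": "c4"}


def is_ancestor(parents, ancestor, node):
    """True if `ancestor` is reachable from `node` by following parent links."""
    stack, seen = [node], set()
    while stack:
        current = stack.pop()
        if current == ancestor:
            return True
        if current not in seen:
            seen.add(current)
            stack.extend(parents[current])
    return False


def fast_forward(parents, labels, target, source):
    """Move label `target` to the changeset `source` points at.

    Only allowed when no new changeset is needed, i.e. when the target's
    changeset is already an ancestor of the source's changeset.
    """
    if not is_ancestor(parents, labels[target], labels[source]):
        raise ValueError("histories have diverged; a merge changeset is needed")
    labels[target] = labels[source]


fast_forward(parents, labels, "master", "mywork")
print(labels["master"])  # master now points at the same changeset as mywork
print(len(parents))      # the history itself did not grow
```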

Peter Eisentraut said...

Git comes from a Unix file system background (arguably). If you look at file systems in the way you propose, a file on a Unix system isn't really a file, it's just a "bookmark" to an inode. That's not wrong, but it's just not a practical terminology.

Also, removing a branch in Git does remove the data. Not immediately, just like removing a "file" doesn't remove the file's data. But it will be cleaned up if it's not reachable from any other "bookmark".

I would also consider that Git branches are kind of automatically moving bookmarks. A better analogy for a bookmark is a tag.

Bryan said...

Somehow the file abstraction in UNIX works a lot better than the branch abstraction in git. Much less leaky. A version control tool's data structure stores and conveys a lot more meaning than a filesystem's data structure.

Anonymous said...

Peter: indeed; "automatically moving bookmarks" is precisely what a branch is in git. Each 'branch' is just a HEAD commit reference, stored in a file under .git/refs/heads (containing the hash for the relevant 'bookmark' commit). When the branch is modified, that file is updated. (And of course a Tag is a 'bookmark' which will never change.)
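As a rough illustration of the file-per-branch layout described above, the following Python sketch writes and reads a ref file under refs/heads in a temporary directory that stands in for a .git directory. The helpers `write_ref` and `read_ref` and the two 40-character hashes are invented for the example, and a real repository may also keep refs in a packed-refs file, which this sketch ignores.

```python
import tempfile
from pathlib import Path


def write_ref(git_dir, name, commit_hash):
    """Point the branch `name` at `commit_hash` by (over)writing its ref file."""
    ref = Path(git_dir) / "refs" / "heads" / name
    ref.parent.mkdir(parents=True, exist_ok=True)
    ref.write_text(commit_hash + "\n")


def read_ref(git_dir, name):
    """Return the hash stored in the ref file for branch `name`."""
    return (Path(git_dir) / "refs" / "heads" / name).read_text().strip()


git_dir = tempfile.mkdtemp()  # stand-in for a .git directory, not a real repo

FIRST = "1" * 40   # made-up hashes for illustration
SECOND = "2" * 40

write_ref(git_dir, "master", FIRST)
write_ref(git_dir, "master", SECOND)  # a new commit "moves the bookmark"

current = read_ref(git_dir, "master")
print(current)
```

Seen this way, committing on a branch amounts to rewriting one small file, which is why the bookmark moves automatically.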

Bryan said...

Peter, you say deleting a branch does delete the data (eventually). If I create a branch, make some commits, merge the branch, and then "delete" it, do the commits that are on the branch (in the DAG sense of the word) get garbage collected?

They don't, right? They can't, because the merge commit is based on both those parent commits. I know, it's the same with hard links in a filesystem, right? Because the commits still have something referring to them (the merge commit) they won't get deleted, just like a file that still has hard links to it won't get deleted.
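The reachability argument can be checked with a small Python sketch. It is a toy model under stated assumptions: the commit ids, label names, and the `reachable` helper are hypothetical, and it ignores everything else that can keep commits alive in real git (the reflog, tags, and so on). A commit counts as collectable only if no remaining label can reach it through parent links.

```python
# a - b - m          (master points at the merge commit m)
#  \     /
#   t1-t2            (topic points at t2, already merged into m)
#  \
#   u1               (scratch points at u1, never merged)
parents = {
    "a": [],
    "b": ["a"],
    "t1": ["a"],
    "t2": ["t1"],
    "m": ["b", "t2"],  # merge commit with two parents
    "u1": ["a"],
}
labels = {"master": "m", "topic": "t2", "scratch": "u1"}


def reachable(parents, labels):
    """Return every commit reachable from any label via parent links."""
    seen = set()
    stack = list(labels.values())
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(parents[node])
    return seen


# Delete the merged branch: its commits are still reachable through m.
del labels["topic"]
after_topic = set(parents) - reachable(parents, labels)

# Delete the unmerged branch: nothing refers to u1 any more.
del labels["scratch"]
after_scratch = set(parents) - reachable(parents, labels)

print(sorted(after_topic))
print(sorted(after_scratch))
```

In this model, deleting the merged label frees nothing, while deleting the unmerged one leaves its commit unreferenced, which matches the hard-link analogy.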

So tell me, how often do people use hard links? How often do people use git branches?

P.S. Thanks for pointing out the filesystem similarities, that is a very interesting insight into why git works the way it does.

Anonymous said...

The implementation details are certainly interesting and useful to know, but when it comes to the terminology itself, Git's use of the term "branch" is entirely consistent with all the VCS systems I've used -- branching is a way of working on divergent development paths, so that developers can work on multiple different independent versions of the code base, which may or may not be subsequently merged back together.

The fact that git doesn't require you to clone the entire repository to create a branch, or that in some instances a merge can be simplified using the "fast-forward" behaviour (which is optional) doesn't seem entirely relevant here? The approaches taken by Git vs Subversion may be worlds apart in detail, but they're both providing a solution to the same problem, which they call by the same name.

So while the word "branch" might not be the best description of what is happening inside git, it's a very reasonable way to describe the standard purpose to which these mechanisms are put. I certainly work on development "branches" not development "bookmarks", so I'm not convinced that the terminology is bad.

Anonymous said...

I work on "development topics", or a "main line". If there wasn't a VCS to begin with, I wouldn't even talk about branches at all.
You coined the more natural term "development path", too. Many other VCS actually used these terms instead of "branch". There was not only CVS and SVN out there.

I also think that git branches are not at all consistent with the other systems, because a branch in git (or better yet: the consistent meaning of a branch, compared to other systems, as the set of commits belonging to that branch) will not go away even when the branch pointer is deleted. You can't distinguish branches anymore once the pointer has been deleted, though. Sometimes you can't even do it with the pointers at all. Other systems don't do this, and wouldn't understand it as a branch to begin with.

It is arguable if this new workflow is good or not, but it is definitely not consistent with other systems, and certainly not with the meaning of the term "branch" in traditional systems. Everybody believing this will soon get hit by the internals and wonder why it behaves the way it does.

So in light of this, the term "bookmark" is simply better. Or to put it more into another beloved git (culture) term: superior.