git – a distributed version control system

[Please note that this article was written in June 2008. Although I have kept the article largely the same, I have (in 2011) updated the text in one or two places to reflect more accurate/relevant information].

Git is a distributed version control system that has been around for 3 years now. It is fast, stable, and, best of all, quite unobtrusive. While early adopters of git faced a steep learning curve, a number of tutorials (see https://git.or.cz/) and user-friendly commands, including a GUI, have now made using git easy. This article is aimed at developers, so, rather than repeat things you can easily find for yourself at the site mentioned above, I will instead illustrate the concepts behind git, and give you a sense of why it is so exciting.

Here’s a taste: Figure 1 is a screenshot from the GUI version of what SVN folks would know as svn blame.

Figure 1

As you can see, git says that the function called find_tree_entry was originally written in sha1_name.c and later moved to tree-walk.c (this file). However, no programmer told git all this – it inferred it by looking at the revisions. That’s an amazing level of traceability in a world where most current VCSs can’t handle even a simple file-rename without being explicitly told, such as with the svn rename command!
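
If you want to play with this yourself, here is a quick sketch (it assumes you are inside a clone of git's own source tree, since that is where tree-walk.c lives):

    # launch the graphical blame viewer on a file
    git gui blame tree-walk.c

    # or, on the command line, ask blame to detect lines moved or
    # copied from other files (-C); give -C twice to search harder
    git blame -C -C tree-walk.c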

Some history and background

The reason git is able to work this bit of magic, and many others, is that it tackles the problem from the point of view of a file system designer. Git’s object store was designed to be a content-addressable file system (git even has an fsck command!), and the data model actually tracks content, not files.

Git was created in April 2005, by Linus Torvalds, due to a frustrating state of affairs with BitKeeper, the VCS that the Linux kernel folks were using at that time. The problems with BitKeeper were non-technical, and we will not go into them here, except to say that BitKeeper's workflows were a large part of the inspiration behind git.

Git development was very rapid – it became self-hosting in just 4 days of development (but remember, that’s in “Linus-days” :-)), and in about 2.5 months it was helping Linus manage Linux kernel development. Currently, apart from the Linux kernel and related projects, it is used by many other projects, including Wine, X.org, OLPC, Ruby on Rails, VLC, and Samba. [Update: recent high-profile additions are Perl and GNOME]

Git has two key strengths: it is very smart at merging, so developers can do as much parallel development as they wish, and it is extremely fast, so it never gets in your way.

I must add, at this point, a plug for the one webpage that helped me more than anything else to understand the git model quickly and easily. It is called “Git for Computer Scientists”, and it is at https://eagain.net/articles/git-for-computer-scientists/. I consider this essential reading for anyone trying to understand the git object store.

Merging – git’s first and biggest strength

The most basic function of a VCS is to give us a way to rollback to a known state, should we really mess up whatever we are working on. In that sense, a VCS is notionally the same as backup software: a check-in is like a “save” and a check-out is like a “restore”.
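
In git terms, that minimal save/restore cycle looks something like this (the file name is just an example):

    # "save": record the current state of all tracked files
    git commit -a -m "checkpoint before risky change"

    # "restore": bring one file back to how it was at the last commit
    git checkout HEAD -- main.c

    # or roll the whole working tree back to the previous commit
    # (careful: this throws away uncommitted changes)
    git reset --hard HEAD~1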

That function is important, but it is also boring. The real fun starts when you consider how well a VCS enables parallel development. This is where the project workflow changes from a long chain of strictly sequential changes to a much more efficient model where many people are simultaneously working on the same set of files, doing different things to them.

Some really old VCSs solve this by not allowing any concurrency at all: you have to lock any file you want to edit, and until you release the lock no one else can edit it. This is horribly inefficient in terms of programmer time, and one wonders why such VCSs are still in use at all!

To be fair, most recent VCSs do allow branching, with varying degrees of ease of use and speed. But as Douglas Adams said in one of his novels, “it’s not the fall that kills you, it’s the sudden stop at the end”. Similarly, branching is not the problem; the real test of a VCS is how well it manages to merge those branches later! Almost any recent/decent system can manage to merge when the changes are not anywhere near each other. But what if, for example, one developer is making changes to a file, and another one, perhaps responsible for a much larger refactoring effort, renames and moves that file to some other directory?

Git handles such situations much better than any of the old style VCSs. It currently has 5 different merge strategies that you can use (don’t worry – you don’t have to think about them if you don’t want to!) and it’s designed so that if some new strategy to handle merging is discovered, it can be added in very easily.
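
If you are curious, the strategy can be named explicitly with the -s option; a small sketch (the branch names here are made up):

    # let git pick the default strategy for a two-branch merge
    git merge topic

    # or choose one yourself
    git merge -s recursive topic            # the usual default
    git merge -s ours obsolete-branch       # keep our content, but record the merge
    git merge -s octopus topicA topicB      # merge more than two branches at once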

The end result is a tremendous boost to a team’s productivity, as people realise that a lot of the manual co-ordination and sequencing that was till now considered a necessity is in fact not needed. Also, and this is not obvious, parallelism is just as useful for a single developer working on a system: ask yourself if you’ve ever postponed a small, quick change because you were in the middle of a much larger activity, or had to plan your work carefully to avoid having two separate tasks interfere with each other.

In real life, it’s often impossible to cleanly finish one job before starting another, yet most current VCSs force you to do exactly that, if you want to retain your sanity. With git, on the other hand, you can even create a separate branch for each issue you’re handling (bugfix, enhancement, whatever), and work on them almost simultaneously, switching from one to the other as you wish, secure in the knowledge that when it’s time to merge, git will do the right thing.
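
In command terms, that branch-per-issue workflow might look like this (all the branch names are illustrative):

    # start a branch for a quick bugfix, without disturbing other work
    git checkout -b bugfix-1234 master

    # ...edit, test... then record the fix
    git commit -a -m "fix off-by-one in parser"

    # switch to a longer-running piece of work, and back, at will
    git checkout feature-foo
    git checkout bugfix-1234

    # when a branch is done, merge it into the mainline
    git checkout master
    git merge bugfix-1234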

In addition, git has excellent GUI tools to visualise this sort of branching easily. Figure 2, for example, is from the development tree of a project called cobbler.

Figure 2

Merging is a very big priority for Linus Torvalds. He says, “I merge 22,000 files several times a day, and I get unhappy if a merge takes more than 5 seconds” (https://git.or.cz/gitwiki/LinusTalk200705Transcript). During git’s first 2 years, the Linux kernel “averaged 4.5 merges per day, every single day” (same URL).

The idea behind this is simple: making many small merges at regular intervals (and resolving any conflicts if needed) is much easier than dealing with all of them together – you have to divide and conquer.
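
On a long-lived topic branch, this simply means pulling in the mainline at regular intervals; a minimal sketch (assuming the mainline is the master branch):

    # on the topic branch, periodically bring in the latest mainline
    git checkout topic
    git merge master    # a small merge; resolve any conflicts while they're fresh

    # repeat every few days, instead of facing one giant merge months later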

Another situation in which good branching and merging helps is, for instance, when a relatively stable product needs to have more than one major “feature” added or enhanced simultaneously. With a less powerful VCS, the team will either develop the features one by one in order to avoid merging problems, or develop each in complete isolation and prepare for some very difficult, frustrating, manual merging as each feature reaches stability. (In systems like these, especially in the corporate world, you have special people whose job it is to integrate branches from different development streams!)

The truth is, though, that most other VCSs consider reliable merging to be so difficult that branching is hardly ever used in non-trivial situations. As a result, lots of inherent parallelism that could have been exploited goes unused.

Even where merging is available, systems like SVN have quite a few problems with how it is implemented. For a good description of these issues take a look at https://git.or.cz/gitwiki/GitSvnComparsion. Although SVN is working on these issues, I am convinced that once the productivity gains of a distributed VCS become well known, SVN will eventually fade. SVN’s aims have always been to fix the problems in CVS rather than do anything revolutionary, but today, in 2008, merely fixing problems in a system designed more than 15 years ago is too little, too late.

Rebasing instead of merging

Merging is very nice, but for long-lived branches, the frequent merges tend to make the history look like Figure 3 (note: this is not a real project!). When this branch is eventually merged into the mainline (like when it is deemed stable enough), this “stitching pattern” remains.

Figure 3

For some projects it may be easier to make all the branch work look as if it was done on top of the current mainline commit. This is called rebasing. While the branch is under development, instead of doing a git pull (this is somewhat like a cvs update) from the mainline, you’d do a git pull --rebase (there are other ways to do this; this is just the simplest). Each time you do that, git re-applies all your commits on top of the new tip of the mainline, and the history remains linear. When your branch is finally stable enough to be merged into the mainline, your changes can be pushed as a linear set of updates on top of the current tip, rather than a parallel set of updates followed by a merge.
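
As a concrete sketch (assuming your branch tracks a remote mainline called origin/master):

    # ordinary update: fetch and merge, leaving the "stitching" pattern
    git pull

    # rebasing update: fetch, then replay your commits on the new tip
    git pull --rebase

    # the same thing, done by hand
    git fetch origin
    git rebase origin/master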

The biggest advantage of rebasing over normal merging shows up when you try to track down a bug using git bisect, which we will discuss later.

Speed – git’s second big feature

The other explicit design goal was speed, which has been one of the pain points of earlier VCSs. Centralised VCSs can take minutes to do a simple diff, or tens of minutes to create a branch. It is also quite frustrating to have to schedule a code-freeze and set aside a day or two to do a merge. (I believe this is quite normal in centralised VCSs, especially the more mainframe-ish ones – consider yourself lucky if you don’t know which one I am talking about!) With git, on the other hand, it’s common to hear people say that it’s so fast they sometimes wonder if it did anything at all!

It’s not just local commands that are optimised. Even the protocol for exchanging revisions is very efficient, and can quickly and easily figure out what data needs to be sent, including compressing and sending deltas.

While git is really good at digging through a repository breadth first (meaning all files at this revision, then all files at the parent revision, etc), it is not optimised for the case where you wish to view history in a depth-first manner (meaning all revisions for one file first, then all revisions for perhaps some other file). The logic is that we’re more likely to want to examine the entire tree for the current and the previous few revisions – the further a revision is from our current one, the less interesting it usually is for normal work.

Distributed and disconnected

The fact that git is a distributed VCS means that you have your own repository on your machine. Coupled with the excellent branch/merge in git, this allows you, for instance, to quickly create a branch for some experimental code that may or may not work, test things out, and either discard or reuse the changes later. With a centralised system, you are much more limited in what you can do in these situations – you usually resort to the classic “tar/zip” method of saving and restoring your work.
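
Here is what that looks like in practice (the URL and branch name are made up):

    # one-time: get your own, full, copy of the repository
    git clone git://example.com/project.git
    cd project

    # try a risky idea on a private branch
    git checkout -b experiment
    # ...hack, commit, test...

    # either keep the result...
    git checkout master
    git merge experiment

    # ...or (instead) throw it away as if it never happened
    git branch -D experiment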

If you work offline a lot, it’s a lot more useful to be able to commit each change separately, and have these commits kept separate, rather than be forced to commit all your offline work as one huge commit when you reconnect. Keeping commits to a fine granularity makes code reviews easier, but even more important, finding the source of a bug becomes much easier, as we will see later.

The reason git is good at distributed operation is because, when you do a git clone, it brings down and caches the entire repository – the full history of all the branches is now available on your machine. Even Mercurial, the next best among the new breed of VCSs, brings down only the current branch.

(This might lead you to worry that disk space is an issue. It is not – git’s pack format is quite efficient, and many months of revisions often end up taking little more than the space that a full checkout takes.)

A good analysis of how much each VCS caches on a checkout, and its implications, is at https://blogs.gnome.org/newren/2007/11/24/local-caching-a-major-distinguishing-difference-between-vcses/

Cool features – git bisect

Git has a command called bisect (this is one of those “aaaah, why didn’t I think of it?” things!) that is very nice. The idea is this: say you find a bug in version 1.6, which didn’t exist in version 1.5. If the commit history between these two revisions consisted of just a few large commits, you’ll have to go through all the changes in these commits manually to find the bug. In the git model, however, the commit history would have a few hundred small commits.

So you tell git bisect that version 1.5 is “good”, and 1.6 is “bad”. Git then finds a commit about half-way between them and checks it out for you. You test it, and tell git whether this was “good” or “bad”, and git then finds you another version to test based on your answer. Essentially, git is using binary search to zero in on the specific change that caused your bug, which is only possible and useful if commits are small.
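
A typical session looks like this (assuming the two releases are tagged v1.5 and v1.6):

    git bisect start
    git bisect bad  v1.6     # the bug is present here
    git bisect good v1.5     # ...and absent here

    # git checks out a revision roughly half-way; build it, test it, and report:
    git bisect good          # or: git bisect bad

    # repeat until git announces the first bad commit, then clean up
    git bisect reset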

The real power of git bisect is that, if your testing can be automated with a command that returns “0” for “good” and some non-zero value for “bad”, then git will essentially do all this by itself! Who says finding the source of a bug is a programmer’s job? Let the machines take over!
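
That automated mode is git bisect run; here is a sketch, assuming you have written a test script of your own (say test.sh) that exits 0 for “good” and non-zero for “bad”:

    git bisect start
    git bisect bad  v1.6
    git bisect good v1.5

    # let git drive the whole search, running the script at each step
    git bisect run ./test.sh

    git bisect reset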

Summary

I hope this has whetted your interest in git. There’s not enough space in this article to go into all the other wonderful things about git, like git stash (save unfinished work to temporarily do something else), interactive rebasing (reorder commits, combine two commits into one, split one commit into many, etc.), cherry picking (pick and choose specific commits from some other branch to apply to the current one), and so on. I should also mention that all routine functions can be done from the GUI tools available, although the command lines are not very complex either.
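
Just to give you their flavour (the commit ID in the last line is made up):

    # shelve half-done work, do something else on a clean tree, bring it back
    git stash
    # ...make and commit the quick fix...
    git stash pop

    # rewrite the last few commits: reorder, squash, split, edit
    git rebase -i HEAD~5

    # apply one specific commit from another branch to this one
    git cherry-pick f00dcafe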

There are a lot of programmers who do not like their current version control system, or regard it as a necessary evil, a burden that must be borne because it’s better than not having one at all. BitKeeper changed all that for Linus: in his words, BitKeeper was the system that taught him “why there’s a point to them and how you actually do things”.

And for many people today, that privilege might well belong to git. I hope it does, for you.


[This article first appeared in Linux For You, an Indian magazine for open source enthusiasts, in July 2008. It is released under the Creative Commons Attribution Sharealike license.]