git concepts simplified

Sitaram Chamarty (sitaramc@gmail.com)

1 preface

1.1 viewing this slideshow

This presentation uses HTML Slidy, a simple presentation tool from the W3C. Although there's a help button in the footer of each presentation, it's missing some important stuff, so here's a smaller but more complete summary of the keyboard controls.

Navigation

  Next slide: right arrow, page down, space
  Prev slide: left arrow, page up

  Down within slide: down arrow
  Up within slide: up arrow

  First slide: home
  Last slide: end

Display

  Smaller font: "S" or "<" key
  Larger font: "B" or ">" key

  Toggle Current versus All slides: "A" key
  Toggle Table of Contents popup: "C" key
  Toggle footer: "F" key

To search for stuff in the full document using your browser's Ctrl-F, first view all slides (press the "A" key).

1.2 acknowledgements

this document

This document is vaguely inspired by http://eagain.net/articles/git-for-computer-scientists, except this page is not just for CS folks. And it's a lot more detailed. Oh, and it's actively maintained, meaning I will respond to feedback ;-)

this slide show

This slide show was all done with pandoc's slidy output mode, and a few little pre-processing tweaks. Pandoc is absolutely fantastic and bloody powerful! I wish I'd discovered it sooner but now (Dec 2013) that I have, it's eliminating a lot of my home-grown scripts to do various things like this!

The bulk of my tweaks are in terms of embedding graphviz diagrams inside the main document, so there's only one file to maintain!

2 basics

2.1 the 4 git object types

Git keeps all its data inside a special directory called .git at the top level of your repository. Somewhere in there is what we will simply call the object store (if you're not comfortable with that phrase, pretend it's some sort of database).

Git knows about 4 types of objects: blobs (files), trees (directories), commits, and tags.

2.2 what is a SHA

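A commit is uniquely identified by a 160-bit value (the 'SHA'). This is computed from the tree, plus the following pieces of information: the SHA of the parent commit(s), the commit message, the author and committer names and emails, and the timestamps.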

(Actually, all 4 git object types are identified by SHAs, but of course they're computed differently for each object type. However, the SHAs of the other object types are not relevant to this discussion.)

In the end, as I said, it's just a large, apparently random-looking number, which is actually a cryptographically strong checksum. It's usually written out as 40 hex digits.

Humans are not expected to remember this number. For the purposes of this discussion, think of it as something similar to a memory address returned by malloc().

It is also GLOBALLY unique! No commit in any repo anywhere in the world will have the same SHA. (It's not a mathematical impossibility, but just so extremely improbable that we take it as fact. If you didn't understand that, just take it on faith).

An example SHA: a30236028b7ddd65f01321af42f904479eaff549
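
If you'd like to see the "checksum" part for yourself, here's a rough sketch (not from the original slides). It makes a throwaway repo and recomputes the commit's SHA by hashing the raw commit object. The directory name `demo` and the identity are made up for the example.

```shell
# Scratch repo; "demo" and the identity below are hypothetical.
cd "$(mktemp -d)"
git init -q demo
cd demo
git config user.name "Demo User"
git config user.email "demo@example.com"
git commit -q --allow-empty -m "first commit"

# The SHA git assigned to the commit.
sha=$(git rev-parse HEAD)

# Recompute it by hashing the raw commit object
# (tree, author, committer, timestamps, message).
recomputed=$(git cat-file commit HEAD | git hash-object -t commit --stdin)

echo "$sha"
echo "$recomputed"
```

Both lines should print the same 40 hex digits; change anything in the commit (even the message) and you get a different number.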

2.3 what is a repo

A repository ('repo') is a graph of commits. In our figures, we represent SHAs with numbers for convenience. We also represent time going upward (bottom to top).

a simple chain of commits

(Hey, why are the arrows backward in your pictures?)

So why are the arrows pointing backward?

Well... every commit knows what its parent commit is (as described in the "what is a SHA" section above). But it can't know what its child commits are -- they haven't been made yet!

Therefore a repo is like a singly linked list. It cannot be a doubly linked list -- adding a pointer to a child would change the parent commit's contents, and any change to the contents would change the SHA!
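
A small sketch of those backward pointers, assuming a scratch repo (the name `demo` and the identity are invented for illustration): the second commit records its parent, while the first commit records nothing about its child.

```shell
# Scratch repo; names below are hypothetical.
cd "$(mktemp -d)"
git init -q demo
cd demo
git config user.name "Demo User"
git config user.email "demo@example.com"

git commit -q --allow-empty -m "commit 1"
first=$(git rev-parse HEAD)
git commit -q --allow-empty -m "commit 2"

# The newer commit carries a "parent <sha>" line...
git cat-file -p HEAD
parent=$(git rev-parse HEAD^)

# ...but the older commit has no mention of its child at all.
git cat-file -p "$first"
```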

3 branches and tags

3.1 branch

Traditionally, the top of a linked list has a name. That name is a BRANCH name. We show branch names in green circles.

a branch

3.2 more than one branch

(a.k.a "more than one child commit")

Remember we said a repo is a GRAPH? Specifically, more than one child node may be pointing at the same parent node. In this case, each 'leaf node' is a branch, and will have a name.

three branches

3.3 more than one parent commit

Well we can't keep creating more branches without eventually merging them back. So let's say "feature X" is now tested enough to be merged into the main branch, so you git merge feature_X. Here's what you get:

Notice that commit 8 now has 2 parents, showing that it is a "merge commit".

a merge commit

At this point, it's quite common to delete the feature branch, especially if you anticipate no more "large" changes. So you can run git branch -d feature_X, which gives you this:

a deleted feature branch
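
Here's one way to reproduce the merge and the branch deletion in a scratch repo. This is only a sketch: the repo name, the identity, and the empty commits are stand-ins for real work.

```shell
# Scratch repo; the directory name and identity are made up for this sketch.
cd "$(mktemp -d)"
git init -q demo
cd demo
git symbolic-ref HEAD refs/heads/master   # name the first branch "master" whatever the local default is
git config user.name "Demo User"
git config user.email "demo@example.com"
git commit -q --allow-empty -m "common ancestor"

# Work happens on both branches, so they diverge.
git checkout -q -b feature_X
git commit -q --allow-empty -m "feature work"
feature_tip=$(git rev-parse feature_X)
git checkout -q master
git commit -q --allow-empty -m "mainline work"

# Merge: the new commit on master records two parents.
git merge -q --no-edit feature_X
parent_count=$(( $(git rev-list --parents -n 1 master | wc -w) - 1 ))
echo "parents of merge commit: $parent_count"

# Delete the label; the commits it pointed to stay reachable from master.
git branch -q -d feature_X
git merge-base --is-ancestor "$feature_tip" master && echo "feature work still reachable"
```

Note that `git branch -d` only removes the name; nothing in the commit graph goes away.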

3.4 current branch/checked out branch

There is a notion of a 'currently checked out' branch. This is denoted by a special ref called HEAD. HEAD is a symbolic ref, which points to the 'current branch'.

HEAD
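
You can ask git where HEAD points. A minimal sketch, assuming a scratch repo with made-up names:

```shell
# Scratch repo; names below are hypothetical.
cd "$(mktemp -d)"
git init -q demo
cd demo
git symbolic-ref HEAD refs/heads/master   # name the first branch "master" whatever the local default is
git config user.name "Demo User"
git config user.email "demo@example.com"
git commit -q --allow-empty -m "commit 1"

# HEAD is a symbolic ref: it names a branch rather than a commit.
on_master=$(git symbolic-ref HEAD)

# Checking out another branch just re-points HEAD.
git checkout -q -b feature_X
on_feature=$(git symbolic-ref HEAD)

echo "$on_master"
echo "$on_feature"
```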

3.5 committing

When you make a new commit, the current branch moves. Technically, whatever branch HEAD is pointing to will move.

committing

3.6 naming non-leaf nodes

It's not just 'leaf' nodes that can have names; inner nodes can too. Recall the result of merging feature_X earlier (see the "more than one parent commit" section):

non-leaf

At this point, you could leave feature_X as it is forever. Or you could delete the branch (as we showed in that section), in which case that label would simply disappear. (The commit it points to is safely reachable from master because of the merge.)

You can also continue to develop on the feature_X branch, further refining it with a view to once again merging it at some later point in time. Although not relevant to the topic of this document, I should mention that the usual practice is to first merge master back into feature_X to make sure it has all the other stuff that master may have acquired till now (this is shown by commit 9 below) before continuing further development:

further feature development

3.7 tags

More commonly, inner nodes are TAGS. We show tag names in yellow circles.

a tag

3.8 the difference between branches and tags

The main difference between a branch and a tag is that branches move and tags don't. When you make a commit with the "master" branch currently checked out, master will move to point to the new commit.

branch versus tag
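
A quick way to watch this happen, as a sketch in a scratch repo (the tag name `v1.0` and the other names are just examples):

```shell
# Scratch repo; names below are hypothetical.
cd "$(mktemp -d)"
git init -q demo
cd demo
git symbolic-ref HEAD refs/heads/master   # name the first branch "master" whatever the local default is
git config user.name "Demo User"
git config user.email "demo@example.com"
git commit -q --allow-empty -m "commit 1"

# Tag the current commit, and remember where branch and tag point.
git tag v1.0
tag_before=$(git rev-parse v1.0)
branch_before=$(git rev-parse master)

# Commit again with master checked out.
git commit -q --allow-empty -m "commit 2"

tag_after=$(git rev-parse v1.0)        # unchanged
branch_after=$(git rev-parse master)   # moved to the new commit
```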

4 digressions - 1

4.1 what is a git URL?

Git repos are accessed by providing a URL. There are typically 4 kinds of Git URLs:

(see 'man git-clone' for all the allowed syntaxes for git URLs).

4.2 what is a "remote"?

A remote is a short name (like an alias) used to refer to a specific git repository. Instead of always saying git fetch git://sitaramc/gitolite, you can add that as a remote and use that short name instead of the long URL.

For convenience, a 'remote' called 'origin' is automatically created when you clone a repo, pointing to the repo you cloned from.
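
As a sketch of both cases, using a local bare repo as a stand-in for something hosted elsewhere (the names `gitolite.git`, `mine`, and `cloned` are invented for the example):

```shell
cd "$(mktemp -d)"
git init -q --bare gitolite.git     # local stand-in for a repo hosted elsewhere

# Case 1: add a remote by hand and give it a short name.
git init -q mine
cd mine
git remote add gitolite ../gitolite.git
git remote -v
url=$(git config remote.gitolite.url)
cd ..

# Case 2: clone, and git creates the remote "origin" for you.
git clone -q gitolite.git cloned 2>/dev/null   # silences the "empty repository" warning
origin_name=$(git -C cloned remote)
echo "$origin_name"
```

After the first case, `git fetch gitolite` means the same thing as fetching from the full URL.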

5 local and remote repos

5.1 remote branches

Git is a distributed version control system. So when you clone someone's repo, you get all the branches in that repo. Remote branches are prefixed by the name of the remote, and we show them in orange.

a remote
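
To see the prefixed names, here is a sketch that clones a small local repo (the names `upstream`, `mine`, and `feature_X` are only for illustration):

```shell
# A small repo with two branches to clone from; names are hypothetical.
cd "$(mktemp -d)"
git init -q upstream
cd upstream
git symbolic-ref HEAD refs/heads/master   # name the first branch "master" whatever the local default is
git config user.name "Demo User"
git config user.email "demo@example.com"
git commit -q --allow-empty -m "commit 1"
git branch feature_X
cd ..

git clone -q upstream mine
cd mine

# Remote branches show up under the remote's name: origin/master, origin/feature_X.
git branch -r

# Only the checked-out branch gets a local counterpart by default.
git branch
```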

5.2 multiple remotes

You can have several remotes.

several remotes

5.3 fetching and merging from another repo

Now let's say Sita's repo had a couple of new commits on its master, and you run git fetch sitas-repo. (We have pruned the graph a little for clarity, showing only the relevant commits; the rest of the commits and branches are assumed to be present as in the previous picture.)

before merge

Now you want to merge Sita's master branch into yours. Since your master does not have any commits that Sita's master doesn't have (i.e., Sita's master is like a superset of yours), running git merge sitas-repo/master will get you this:

after merge
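
The fetch-then-merge sequence can be sketched with two local repos. Everything here is an assumption for the sake of the example: the directory layout, the identity, and the use of empty commits in place of real changes.

```shell
# Sita's repo, with one commit of shared history.
cd "$(mktemp -d)"
git init -q sitas-repo
cd sitas-repo
git symbolic-ref HEAD refs/heads/master   # name the first branch "master" whatever the local default is
git config user.name "Sita"
git config user.email "sita@example.com"
git commit -q --allow-empty -m "shared history"
cd ..

# Your clone; -o names the remote "sitas-repo" instead of the default "origin".
git clone -q -o sitas-repo sitas-repo mine

# Sita adds two commits after you cloned.
cd sitas-repo
git commit -q --allow-empty -m "commit 10"
git commit -q --allow-empty -m "commit 11"
cd ../mine

# Fetch moves sitas-repo/master; your own master stays where it was.
git fetch -q sitas-repo
behind=$(git rev-list --count master..sitas-repo/master)
echo "master is behind by $behind commits"

# Merge: nothing to reconcile, so master simply moves forward.
git merge -q --ff-only sitas-repo/master
```

The `--ff-only` flag is there only to make the point: git refuses the merge unless it can be done by moving the pointer.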

6 digressions - 2

6.1 the object store

Git stores all your data in an "object store". There are 4 types of objects in this store: files (called "blobs"), trees (which are directories+files), commits, and tags. All objects are referenced by a 160-bit SHA.

(Details, if you like: a blob is the lowest in the hierarchy. One or more blobs and trees make a tree. A commit is a tree, plus the SHA of its parent commit(s), the commit message, author/committer names and emails, and timestamps. Under normal usage, you don't need to deal with all this).
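
If you're curious, `git cat-file -t` reports the type of any object. A minimal sketch in a scratch repo (the file name, tag name, and identity are made up):

```shell
# Scratch repo; names below are hypothetical.
cd "$(mktemp -d)"
git init -q demo
cd demo
git config user.name "Demo User"
git config user.email "demo@example.com"

echo "hello" > file.txt
git add file.txt
git commit -q -m "add file.txt"
git tag -a v1.0 -m "first release"   # an annotated tag is stored as a tag object

git cat-file -t HEAD             # commit
git cat-file -t 'HEAD^{tree}'    # tree
git cat-file -t HEAD:file.txt    # blob
git cat-file -t v1.0             # tag
```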

6.2 what is a repo (again)

Earlier, we saw that a repo was a graph of commits. At the file system level, however, it is basically a directory called .git which looks somewhat like this:

$ ls -al .git
total 40
drwxrwxr-x 7 sitaram sitaram 4096 Sep 14 18:54 ./
drwx------ 3 sitaram sitaram 4096 Sep 14 18:54 ../
drwxrwxr-x 2 sitaram sitaram 4096 Sep 14 18:54 branches/
-rw-rw-r-- 1 sitaram sitaram   92 Sep 14 18:54 config
-rw-rw-r-- 1 sitaram sitaram   73 Sep 14 18:54 description
-rw-rw-r-- 1 sitaram sitaram   23 Sep 14 18:54 HEAD
drwxrwxr-x 2 sitaram sitaram 4096 Sep 14 18:54 hooks/
drwxrwxr-x 2 sitaram sitaram 4096 Sep 14 18:54 info/
drwxrwxr-x 4 sitaram sitaram 4096 Sep 14 18:54 objects/
drwxrwxr-x 4 sitaram sitaram 4096 Sep 14 18:54 refs/

6.3 objects and branches/tags

Hg folks should read this section carefully. Among various crazy notions Hg has is one that encodes the branch name within the commit object in some way. Unfortunately, Hg's vaunted "ease of use" (a.k.a "we support Windows better than git", which in an ideal world would be a negative, but in this world sadly it is not) has caused enormous takeup, and dozens of otherwise excellent developers have been brain-washed into thinking that is the only/right way.

I hope this section gives at least a few of them a "light-bulb" moment.

The really, really important thing to understand is that the object store doesn't care where the commit came from or what "branch" it was part of when it entered the object store. Once it's there, it's there!

Think back to these three diagrams. The first is before you did a fetch.

before fetch

The next two figures are after git fetch sitas-repo and git merge sitas-repo/master, respectively. The fetch command added two new commits (10 and 11) to your object store, along with any other objects those commits reference.

after fetch

after merge

However, note that commits 10 and 11 did not change in any way simply because they are now in your local "master" branch. They continue to have the same SHA values, and the object store does not change at all as a result of the merge command.

All you did was move a pointer from one node to another.
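
One way to check this claim, as a sketch with two scratch repos (names and identity are invented): compare the object counts before and after the fast-forward merge, and compare the SHA on both sides.

```shell
# Sita's repo and your clone of it; all names are hypothetical.
cd "$(mktemp -d)"
git init -q sitas-repo
cd sitas-repo
git symbolic-ref HEAD refs/heads/master   # name the first branch "master" whatever the local default is
git config user.name "Sita"
git config user.email "sita@example.com"
git commit -q --allow-empty -m "shared history"
cd ..
git clone -q -o sitas-repo sitas-repo mine

# Two new commits appear on Sita's side.
cd sitas-repo
git commit -q --allow-empty -m "commit 10"
git commit -q --allow-empty -m "commit 11"
cd ../mine

# The fetch is what brings new objects into your object store.
git fetch -q sitas-repo

# The merge adds nothing; it only moves the master pointer.
before=$(git count-objects -v)
git merge -q --ff-only sitas-repo/master
after=$(git count-objects -v)

# Same commit, same SHA, in both repos.
theirs=$(git -C ../sitas-repo rev-parse master)
mine=$(git rev-parse master)
echo "$theirs"
echo "$mine"
```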

7 advanced operations

7.1 merging

First, let's do merging. The merge you saw earlier was what is called a "fast-forward" merge, because your local master did not have any commits that the remote branch you were merging did not have.

In practice, this is rare, especially on an active project with many developers. So let's see what a non-fast-forward merge looks like. The starting point was this:

before non-ff merge

Now, you made some changes on your local master. Meanwhile, sitas-repo has had some changes which you got by doing a fetch:

after fetch

When you merge, the end result will usually look like this:

after non-ff merge
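
Here is a sketch of the same situation with two scratch repos (names, identities, and the empty commits are placeholders). Because both sides have commits the other lacks, a fast-forward is impossible and git creates a merge commit instead.

```shell
# Sita's repo and your clone of it; all names are hypothetical.
cd "$(mktemp -d)"
git init -q sitas-repo
cd sitas-repo
git symbolic-ref HEAD refs/heads/master   # name the first branch "master" whatever the local default is
git config user.name "Sita"
git config user.email "sita@example.com"
git commit -q --allow-empty -m "shared history"
cd ..
git clone -q -o sitas-repo sitas-repo mine

# Both sides now commit independently.
cd sitas-repo
git commit -q --allow-empty -m "sita's change"
cd ../mine
git config user.name "Demo User"
git config user.email "demo@example.com"
git commit -q --allow-empty -m "my change"
git fetch -q sitas-repo

# A fast-forward is refused because the histories have diverged.
if git merge -q --ff-only sitas-repo/master 2>/dev/null; then
    fast_forward=yes
else
    fast_forward=no
fi
echo "fast-forward possible: $fast_forward"

# A normal merge creates a new commit with two parents.
git merge -q --no-edit sitas-repo/master
parent_count=$(( $(git rev-list --parents -n 1 master | wc -w) - 1 ))
echo "parents of merge commit: $parent_count"
```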

7.2 cherry-pick

A cherry-pick is not very commonly done -- in well designed workflows it should actually be rare. However, it's a good way to illustrate an important concept in git.

We said before that a commit represents a certain set of files and directories, but since most commits have only one parent, you can think of a commit as representing a set of changes too. (In fact, most older VCSs do this).
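
The two views of a commit can be put side by side. In this sketch (scratch repo, made-up file names) `git ls-tree` shows the snapshot and `git diff` against the parent shows the change:

```shell
# Scratch repo; names below are hypothetical.
cd "$(mktemp -d)"
git init -q demo
cd demo
git config user.name "Demo User"
git config user.email "demo@example.com"

echo "one" > a.txt
echo "two" > b.txt
git add a.txt b.txt
git commit -q -m "add two files"

echo "more" >> a.txt
git commit -q -a -m "edit a.txt"

# Snapshot view: every file in the commit.
snapshot=$(git ls-tree --name-only HEAD)
echo "$snapshot"

# Change view: only what differs from the single parent.
changes=$(git diff --name-only HEAD^ HEAD)
echo "$changes"
```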

Let's say one of your collaborators (this mythical "Sita" again!) made a whole bunch of changes to his copy of the repo. You don't like most of these changes, except one specific change which you would like to bring in to your repo.

The starting point is this: