Shallow Git Repositories

When I was getting the code in the previous post ready to share, I ran into a problem: my checkouts of LLVM and Swift were shallow clones, i.e. git repositories that don’t store the full history of each branch. Working with those locally is surprisingly easy; trying to set them up on a server using git push is a bit trickier. While trying to figure out what was going on, I was dismayed by the lack of up-to-date documentation about shallow repositories, even on my usual go-to site, git-scm.com. So here’s a collection of information I’ve gathered about shallow repositories.

What is a shallow repository?

I’ll start with the definition from man gitglossary (emphasis mine):

A shallow repository has an incomplete history some of whose commits have parents cauterized away (in other words, Git is told to pretend that these commits do not have the parents, even though they are recorded in the commit object). This is sometimes useful when you are interested only in the recent history of a project even though the real history recorded in the upstream is much larger. A shallow repository is created by giving the --depth option to git-clone(1), and its history can be later deepened with git-fetch(1).

This explains what a shallow repository is and why you might want one. The implied part of “even though the real history…is much larger” is that you want to avoid the bandwidth and/or storage costs of fetching / keeping around a whole upstream repository.

The last sentence is technically still correct, but there are several ways to get a shallow repository these days:

  • --depth <N> is the oldest way, truncating a branch’s history to its most recent N commits

  • --shallow-since <DATE> includes commits from DATE onward but nothing older; the git internal documentation says the search is equivalent to git rev-list --max-age=<DATE>

  • --shallow-exclude <REVISION> excludes REVISION and its ancestors. You can use a remote tag or branch name here too.

  • --deepen <N> (git fetch only) adds an additional N commits to the branch in an existing shallow repository. N must be positive, though I wouldn’t be totally surprised if support for negative values were added in the future.

These are documented under git-clone and git-fetch.
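
To make a couple of those options concrete, here is a minimal sketch that runs entirely against a throwaway local repository. The upstream directory, its five empty commits, and the Example identity are all made up for illustration; the file:// URL is there because git ignores --depth when cloning from a plain local path.

```shell
set -eu
work=$(mktemp -d)

# Hypothetical stand-in for a real upstream: five empty commits on one branch.
git init -q "$work/upstream"
for i in 1 2 3 4 5; do
  git -C "$work/upstream" -c user.name=Example -c user.email=example@example.invalid \
    commit -q --allow-empty -m "commit $i"
done

# Shallow clone: keep only the two most recent commits.
git clone -q --depth 2 "file://$work/upstream" "$work/shallow"
depth_after_clone=$(git -C "$work/shallow" rev-list --count HEAD)

# Extend the existing shallow history by one more commit.
git -C "$work/shallow" fetch -q --deepen=1
depth_after_deepen=$(git -C "$work/shallow" rev-list --count HEAD)

echo "after clone: $depth_after_clone, after --deepen=1: $depth_after_deepen"
```

With a linear history like this one, the counts are simply the number of commits visible from HEAD; with merges, --depth counts along each line of ancestry.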

An interesting note is that you can use these options on an existing repository to adjust where your shallow history ends. Being git, you won’t see space savings from shortening your history this way without running git gc --prune=now to make sure the now-unreferenced commits get deleted.
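
As a rough sketch of shortening an existing shallow history, assuming the same kind of throwaway local upstream (the directory names and the Example identity are hypothetical):

```shell
set -eu
work=$(mktemp -d)

# Hypothetical upstream with five empty commits.
git init -q "$work/upstream"
for i in 1 2 3 4 5; do
  git -C "$work/upstream" -c user.name=Example -c user.email=example@example.invalid \
    commit -q --allow-empty -m "commit $i"
done

# Start with four commits of history.
git clone -q --depth 4 "file://$work/upstream" "$work/shallow"
before=$(git -C "$work/shallow" rev-list --count HEAD)

# Move the shallow boundary up to the tip of the branch...
git -C "$work/shallow" fetch -q --depth=1
after=$(git -C "$work/shallow" rev-list --count HEAD)

# ...then delete the objects that are no longer reachable.
git -C "$work/shallow" gc --quiet --prune=now

echo "commits visible before: $before, after: $after"
```

In a repository you have actually worked in, reflogs or other refs may still point at older commits, in which case gc will keep them around.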

When you run git log on a shallow repository, you’ll see the “end” commits marked as “grafted”. This refers to the implementation of shallow repositories: they reuse git’s “grafts” machinery (the same idea exposed by git replace --graft) to hide the parents of the end commits, and they record which commits were cut off in the .git/shallow file. (I didn’t look into this too much.)
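
If you want to poke at that bookkeeping yourself, something like the following should do, again using a made-up local upstream. The sketch assumes a git new enough to have rev-parse --is-shallow-repository.

```shell
set -eu
work=$(mktemp -d)

# Hypothetical upstream with five empty commits.
git init -q "$work/upstream"
for i in 1 2 3 4 5; do
  git -C "$work/upstream" -c user.name=Example -c user.email=example@example.invalid \
    commit -q --allow-empty -m "commit $i"
done
git clone -q --depth 2 "file://$work/upstream" "$work/shallow"

# Ask git directly whether this repository is shallow.
is_shallow=$(git -C "$work/shallow" rev-parse --is-shallow-repository)

# The boundary ("end") commits are listed one per line in .git/shallow.
boundary=$(cat "$work/shallow/.git/shallow")

# The oldest commit git is willing to show should be that same boundary commit.
oldest_visible=$(git -C "$work/shallow" rev-list HEAD | tail -n 1)

# The boundary commit is the one decorated as "grafted" here.
git -C "$work/shallow" log --oneline --decorate

echo "shallow: $is_shallow"
echo "boundary: $boundary"
```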

If you want to get the full history of a branch, you can do that with git fetch --unshallow. This has been around nearly as long as the --depth option (the first bit of shallow repositories that got implemented), and will fix pretty much any problems you have with a shallow repository…at the cost of it no longer being shallow.

Pushing a shallow repository

It’s one thing to push from a shallow repository to its original upstream, which has all the commits. It’s another to push to a new repository when you don’t have all the commits! That throws a wrench into the usual way to establish a new server-side repository:

# on the server
git init --bare the-repo
# on the client
git remote add shallow-upstream ...
git push -u shallow-upstream

If you try to do this, you’ll get this sort of response:

 ! [remote rejected]   dev -> dev (shallow update not allowed)

However, if you do have control over the server, you can enable shallow pushes with the receive.shallowUpdate config option.

# on the server
git -C the-repo config receive.shallowUpdate true

Now you’ve got a shallow upstream repository, and one with no connection to the original repository you cloned from. This has some of the benefits of a local shallow repo, but since it’s going to be a source that others clone from, it’s going to have some additional sharp edges.
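
The whole sequence can be tried locally without a real server. This is only a sketch: the "server" is a bare repository in a temporary directory, the upstream and its commits are invented, and the push names an explicit destination (HEAD:refs/heads/dev) so that it doesn’t depend on the default branch name.

```shell
set -eu
work=$(mktemp -d)

# Hypothetical original upstream with five empty commits.
git init -q "$work/upstream"
for i in 1 2 3 4 5; do
  git -C "$work/upstream" -c user.name=Example -c user.email=example@example.invalid \
    commit -q --allow-empty -m "commit $i"
done
git clone -q --depth 2 "file://$work/upstream" "$work/client"

# "Server": a new, empty bare repository with no link to the original upstream.
git init -q --bare "$work/the-repo"
git -C "$work/client" remote add shallow-upstream "$work/the-repo"

# The first attempt is expected to fail with "shallow update not allowed".
if git -C "$work/client" push -q shallow-upstream HEAD:refs/heads/dev 2>/dev/null; then
  first_push=accepted
else
  first_push=rejected
fi

# Opt in on the server side, then retry.
git -C "$work/the-repo" config receive.shallowUpdate true
git -C "$work/client" push -q shallow-upstream HEAD:refs/heads/dev

server_is_shallow=$(git -C "$work/the-repo" rev-parse --is-shallow-repository)
echo "first push: $first_push, server shallow: $server_is_shallow"
```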

Sharp edges of a shallow upstream

  • The so-called “Dumb HTTP” server setup, which is what I was previously using on https://belkadan.com/source/, doesn’t support shallow fetches or clones. It also doesn’t support shallow upstream repositories, and it doesn’t produce a good error message if you try to fetch from one. I set up the “smart” HTTP server to take care of this; see below for more information.

  • If you try to do a shallow fetch from a shallow repository, but accidentally specify an older start point than the shallow repository’s “ends”, you’ll get an unhelpful error message about “failing to traverse parents”.

  • If you want to “unshallow” a clone of the shallow upstream, you’ll have to do it by adding the original upstream as a second remote, and then using git fetch --unshallow original-upstream. Running git fetch --unshallow shallow-upstream still does something reasonable, though: it fetches everything the shallow upstream has.

  • Of course, if the original upstream goes away, that history is (potentially) gone forever.
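
The “unshallow through the original upstream” step from the list above can be sketched like this. Everything here is a local stand-in: the original upstream is a temporary repository with invented commits, and the shallow upstream is simulated with a bare shallow clone rather than a real server.

```shell
set -eu
work=$(mktemp -d)

# Hypothetical original upstream with five empty commits.
git init -q "$work/original"
for i in 1 2 3 4 5; do
  git -C "$work/original" -c user.name=Example -c user.email=example@example.invalid \
    commit -q --allow-empty -m "commit $i"
done

# Simulated shallow upstream: a bare repository holding only the last two commits.
git clone -q --bare --depth 2 "file://$work/original" "$work/shallow-upstream.git"

# An ordinary clone of a shallow upstream is itself shallow.
git clone -q "file://$work/shallow-upstream.git" "$work/clone"
before=$(git -C "$work/clone" rev-list --count HEAD)

# Add the original upstream as a second remote and fetch the missing history from it.
git -C "$work/clone" remote add original-upstream "file://$work/original"
git -C "$work/clone" fetch -q --unshallow original-upstream
after=$(git -C "$work/clone" rev-list --count HEAD)

echo "commits before: $before, after: $after"
```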

I think no one’s really focused on the user experience of shallow upstream repositories (yet?), but as far as I can tell they work fine as long as you do things that will succeed (as opposed to producing error messages).

Appendix: Adding Smart HTTP support to a Gitweb setup

In a previous article I talked about setting up git hosting under Apache using gitweb; one of my criteria for success was using the same URL for web browsing as for cloning. Git’s “smart” HTTP backend is supposed to dramatically cut down on HTTP requests when fetching, and can save overall bandwidth too, but I wasn’t sure of the CPU cost of running that on my shared hosting. And it is one more moving piece. But hosting shallow repositories meant I needed a smart server, and so I set to work modifying my existing configuration.

Part 1: Running git-http-backend without being able to edit Apache’s root configuration

As I mentioned in the original article, I can’t edit my web server’s main configuration files; all I can do is add per-directory configuration. For gitweb, that meant putting the script directly in with the rest of my website files.[1] But gitweb’s just a little(ish) Perl script, while git-http-backend is a whole compiled program. Do I really have to copy that into my website?

Fortunately, someone else has gone through this before. Tiago Alves Macambira documented their own approach to hosting Git repositories on a shared hosting plan (Dreamhost), and while their goals were different from mine they’ve already solved this particular problem. Their answer? Write a wrapper shell script. Here’s mine, which I just named git-http-backend.cgi:

#!/bin/sh
# Look in this directory for projects.
export GIT_PROJECT_ROOT="$PWD"
# Allow any repositories to be exported.
export GIT_HTTP_EXPORT_ALL=""

# Unfortunate hard-coding for my particular hosting.
PATH=/usr/local/cpanel/3rdparty/lib/path-bin/:$PATH
git http-backend

That PATH line is because the install of git in /usr/bin is much older than the one my hosting provides to users, and I want to use the new one.
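
Since CGI is just environment variables in and a response on standard output, the backend can also be exercised by hand before Apache gets involved. Here is a sketch under a few assumptions: git http-backend is installed alongside your git, demo.git is a made-up empty repository, and the REQUEST_METHOD, PATH_INFO, and QUERY_STRING values imitate what the web server would pass for the first request of a fetch.

```shell
set -eu
work=$(mktemp -d)

# Hypothetical repository served out of the project root.
git init -q --bare "$work/demo.git"

# The two variables the wrapper exports, plus the CGI variables Apache would supply.
GIT_PROJECT_ROOT="$work" \
GIT_HTTP_EXPORT_ALL="" \
REQUEST_METHOD=GET \
PATH_INFO=/demo.git/info/refs \
QUERY_STRING=service=git-upload-pack \
  git http-backend > "$work/response"

# The response body contains binary data, so only pull out the header of interest.
content_type=$(grep -a -i '^Content-Type:' "$work/response" | tr -d '\r')
echo "$content_type"
```

A smart-protocol Content-Type here suggests the backend found the repository and is willing to export it; a “Status: 404” line instead usually points at GIT_PROJECT_ROOT or the export setting.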

This works, and I could test it with git ls-remote:

git ls-remote https://belkadan.com/source/git-http-backend.cgi/swift

(Note: at the time I wrote this article I left this endpoint up, but I might close it down in the future so that git-http-backend is only run through the pretty URLs.)

Part 2: Supporting pretty URLs

The final goal here was for git clone https://belkadan.com/source/swift to work, just like it did for my existing repositories. This turned out to be pretty straightforward; taking a hint from the “Accelerated static Apache 2.x” configuration in the git-http-backend docs, I added this line to my .htaccess file, ahead of my previous rules for gitweb:

RewriteRule \
  ^[^/]+/(HEAD|info/refs|objects/info/.+|git-upload-pack)$ \
  git-http-backend.cgi/$0 [L]

This basically says “send requests in an immediate subdirectory for HEAD, info/refs, git-upload-pack, and anything in objects/info/ to git-http-backend.cgi”.[2] Requests for existing objects or packfiles will still be served through Apache, and any other requests will go to gitweb through the rest of my configuration. (That [L] at the end stands for “last”, which keeps the requests intended for git-http-backend from subsequently being routed to gitweb.)
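
To check which request paths that pattern catches, it can be run through grep -E, since this particular expression means the same thing there as in Apache’s regex engine. The route function and the sample paths below are made up for illustration; the paths are written relative to the directory containing .htaccess, which is how a per-directory RewriteRule sees them (without any query string).

```shell
# Same pattern as the RewriteRule above.
pattern='^[^/]+/(HEAD|info/refs|objects/info/.+|git-upload-pack)$'

# Hypothetical helper: report where a relative request path would be sent.
route() {
  if printf '%s\n' "$1" | grep -Eq "$pattern"; then
    echo "git-http-backend.cgi/$1"
  else
    echo "other rules (static files or gitweb)"
  fi
}

for path in swift/HEAD swift/info/refs swift/git-upload-pack swift/objects/info/packs \
            swift/objects/pack/pack-example.pack swift/git-receive-pack swift; do
  printf '%-40s -> %s\n' "$path" "$(route "$path")"
done
```

Only the first four sample paths are rewritten; the packfile, the push endpoint, and the bare repository URL fall through to whatever rules come next.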

Once again I tested it with git ls-remote:

git ls-remote https://belkadan.com/source/swift

and everything seems to be in order.

  1. I suppose I could have used a separate cgi-bin directory, but that always struck me as weird and also more likely to be accidentally insecure if you’re already guarding against arbitrary uploads.

  2. The git-http-backend docs also include git-receive-pack as a possible path, but that’s only used for pushing through HTTPS, and I’m not using that.