Musings on revision-control metadata

One of the axes along which revision-control systems differ from each other is where they choose to store their working-tree metadata.

  • CVS has a non-hidden administrative directory in each working-copy directory. That causes two problems. First, it breaks globbing. Second, it makes it hard to find the top of a working copy, especially since there’s nothing to stop you putting a working copy of one project inside a working copy of another. (Even worse, that’s actually useful, because it makes cvs update at the topmost level update all the children.)

  • Subversion has much the same approach (unsurprisingly, since it aims to be CVS with the design bugs fixed), but using a hidden directory instead of a visible one. That fixes only the first of the problems with the CVS approach. I don’t know enough about svn to know whether it has the CVS trick of letting you have a mixed-source working copy; if it doesn’t, there’s no payoff at all for having administrative information scattered throughout the working copy.

    [Update: Subversion’s svn:externals feature seems to handle mixed-source working copies to some extent, but it sounds like there are some awkward aspects to the way it works. And the particular implementation doesn’t sound like it really relies on having administrative metadata scattered through the working-copy.]

  • SVK is completely different: an SVK working copy contains no administrative information whatsoever. That has the neat property that filesystem walking never accidentally picks up any metadata from your revision-control system. But it pays for that by making other things too hard. You can’t duplicate a working copy just by copying the filesystem tree, and you can’t discard a working copy just by deleting the tree. The official word is that you shouldn’t even rename the directory containing your working copy after you check it out. Instead, you have to tell SVK about every such thing you do.

  • Git (and, as far as I know, Mercurial, Bazaar, and others) uses a mixed approach. Revision-control metadata is part of the working tree (or “repository” as we might more reasonably describe it), so tools still need to ignore the metadata when walking the filesystem. But it’s stored in a hidden directory, so globbing is workable. And there’s only one metadata directory for the entire repo, which also makes it trivial to find the top of a repo: just scan upwards until you find a metadata directory, and give up if you hit the filesystem root.
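
To make the “scan upwards” idea concrete, here is a minimal sketch in Python. The function name find_repo_root and the marker parameter are my own inventions for illustration, not anything Git or the other tools expose; the sketch only assumes that the top of a repo is the nearest ancestor directory containing an entry with the metadata name (.git by default).

```python
from pathlib import Path
from typing import Optional


def find_repo_root(start: Path, marker: str = ".git") -> Optional[Path]:
    """Return the nearest ancestor of `start` (inclusive) that contains
    an entry named `marker`, or None if the filesystem root is reached
    without finding one.

    `find_repo_root` and `marker` are hypothetical names for this sketch.
    """
    current = start.resolve()
    # Path.parents runs from the immediate parent up to the filesystem
    # root, so the loop stops by itself once the root has been checked.
    for candidate in (current, *current.parents):
        # exists() rather than is_dir(): the metadata entry is usually a
        # directory, but this sketch accepts a plain file as well.
        if (candidate / marker).exists():
            return candidate
    return None
```

Calling find_repo_root(Path.cwd()) from anywhere inside a repo would return its top directory. Doing the same for CVS would need extra guesswork, since every directory of a CVS working copy carries its own administrative directory and the scan has no obvious place to stop.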

It sounds reasonable to claim, as SVK does, that users shouldn’t need to worry about whether a tree is revision-controlled — just edit your code, and commit as needed, trusting the software to take care of preserving history. That seems to have the desirable property that you can avoid changing any of your pre-revision-control assumptions about how file trees work, because all of those assumptions still hold.

But the SVK approach for achieving that goal doesn’t actually work. Sure, you can stop worrying about what’s inside a given tree, because it really is just a vanilla tree, as you might get from unpacking a tarball. But you pay for it by having to pay precise attention to where a tree is. Other systems let you copy, move, or rename an entire working tree using standard filesystem tools; SVK forces you to ask the revision-control software to do those things for you.

Further, it seems to me that any other approach for keeping trees metadata-free would rely on necessarily-fragile heuristics. Possibly such heuristics could be constructed in such a way as to be almost always unproblematic in practice, and easily handled when they guess wrong, but that sounds tricky to me.

(Sorry, CL, I do think SVK is cool, and I’m not trying to badmouth it. But I also think that the consequences of centralising working-tree metadata are undesirable.)

I think the single-metadata-directory approach offers the best trade-off. You can to a great extent ignore the metadata directory, because it sits at the root of your tree: while it can still get in the way, it usually does so only minimally. And with Git in particular, the files under the metadata directory match the real working-tree files in neither content (because they’re compressed) nor name (because they’re packed into batches whose names are machine-friendly rather than human-friendly); that significantly reduces the probability that find or grep -r will trip over them, even without using alternative tools.
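
For tools that do walk the tree, ignoring the metadata is cheap. The following is a rough sketch, again in Python, of a walker that prunes metadata directories as it goes. The function name walk_source_files and the METADATA_DIRS set are assumptions of mine for illustration; the set simply lists the conventional directory names used by the systems discussed above.

```python
import os
from typing import Iterator

# Conventional metadata directory names (an illustrative, not exhaustive, set).
METADATA_DIRS = {".git", ".hg", ".bzr", ".svn", "CVS"}


def walk_source_files(top: str) -> Iterator[str]:
    """Yield the path of every file under `top`, skipping anything that
    lives inside a revision-control metadata directory.

    `walk_source_files` is a hypothetical helper for this sketch.
    """
    for dirpath, dirnames, filenames in os.walk(top):
        # Editing dirnames in place tells os.walk (in its default
        # top-down mode) not to descend into the removed directories.
        dirnames[:] = [d for d in dirnames if d not in METADATA_DIRS]
        for name in filenames:
            yield os.path.join(dirpath, name)
```

With a single metadata directory at the root, the prune fires exactly once per repo. With the CVS or Subversion layout it has to fire in every directory of the working copy.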