Filtering Repository History

Since Subversion stores your versioned history using, at the very least, binary differencing algorithms and data compression (optionally in a completely opaque database system), attempting manual tweaks is unwise, if not quite difficult, and at any rate strongly discouraged. And once data has been stored in your repository, Subversion generally doesn't provide an easy way to remove that data. [33] But inevitably, there will be times when you would like to manipulate the history of your repository. You might need to strip out all instances of a file that was accidentally added to the repository (and shouldn't be there for whatever reason). [34] Or, perhaps you have multiple projects sharing a single repository, and you decide to split them up into their own repositories. To accomplish tasks like this, administrators need a more manageable and malleable representation of the data in their repositories—the Subversion repository dump format.

As we described in the section called “Migrating Repository Data Elsewhere”, the Subversion repository dump format is a human-readable representation of the changes that you've made to your versioned data over time. You use the svnadmin dump command to generate the dump data, and svnadmin load to populate a new repository with it (see the section called “Migrating Repository Data Elsewhere”). The great thing about the human-readability aspect of the dump format is that, if you aren't careless about it, you can manually inspect and modify it. Of course, the downside is that if you have three years' worth of repository activity encapsulated in what is likely to be a very large dump file, it could take you a long, long time to manually inspect and modify it.

That's where svndumpfilter becomes useful. This program acts as path-based filter for repository dump streams. Simply give it either a list of paths you wish to keep, or a list of paths you wish to not keep, then pipe your repository dump data through this filter. The result will be a modified stream of dump data that contains only the versioned paths you (explicitly or implicitly) requested.

Let's look a realistic example of how you might use this program. We discuss elsewhere (see the section called “Planning Your Repository Organization”) the process of deciding how to choose a layout for the data in your repositories—using one repository per project or combining them, arranging stuff within your repository, and so on. But sometimes after new revisions start flying in, you rethink your layout and would like to make some changes. A common change is the decision to move multiple projects which are sharing a single repository into separate repositories for each project.

Our imaginary repository contains three projects: calc, calendar, and spreadsheet. They have been living side-by-side in a layout like this:

/
   calc/
      trunk/
      branches/
      tags/
   calendar/
      trunk/
      branches/
      tags/
   spreadsheet/
      trunk/
      branches/
      tags/

To get these three projects into their own repositories, we first dump the whole repository:

$ svnadmin dump /var/svn/repos > repos-dumpfile
* Dumped revision 0.
* Dumped revision 1.
* Dumped revision 2.
* Dumped revision 3.
…
$

Next, run that dump file through the filter, each time including only one of our top-level directories, and resulting in three new dump files:

$ svndumpfilter include calc < repos-dumpfile > calc-dumpfile
…
$ svndumpfilter include calendar < repos-dumpfile > cal-dumpfile
…
$ svndumpfilter include spreadsheet < repos-dumpfile > ss-dumpfile
…
$

At this point, you have to make a decision. Each of your dump files will create a valid repository, but will preserve the paths exactly as they were in the original repository. This means that even though you would have a repository solely for your calc project, that repository would still have a top-level directory named calc. If you want your trunk, tags, and branches directories to live in the root of your repository, you might wish to edit your dump files, tweaking the Node-path and Node-copyfrom-path headers to no longer have that first calc/ path component. Also, you'll want to remove the section of dump data that creates the calc directory. It will look something like:

Node-path: calc
Node-action: add
Node-kind: dir
Content-length: 0
  

Warning

If you do plan on manually editing the dump file to remove a top-level directory, make sure that your editor is not set to automatically convert end-of-line characters to the native format (e.g. \r\n to \n), as the content will then not agree with the metadata. This will render the dump file useless.

All that remains now is to create your three new repositories, and load each dump file into the right repository:

$ svnadmin create calc; svnadmin load calc < calc-dumpfile
<<< Started new transaction, based on original revision 1
     * adding path : Makefile ... done.
     * adding path : button.c ... done.
…
$ svnadmin create calendar; svnadmin load calendar < cal-dumpfile
<<< Started new transaction, based on original revision 1
     * adding path : Makefile ... done.
     * adding path : cal.c ... done.
…
$ svnadmin create spreadsheet; svnadmin load spreadsheet < ss-dumpfile
<<< Started new transaction, based on original revision 1
     * adding path : Makefile ... done.
     * adding path : ss.c ... done.
…
$

Both of svndumpfilter's subcommands accept options for deciding how to deal with “empty” revisions. If a given revision contained only changes to paths that were filtered out, that now-empty revision could be considered uninteresting or even unwanted. So to give the user control over what to do with those revisions, svndumpfilter provides the following command-line options:

--drop-empty-revs

Do not generate empty revisions at all—just omit them.

--renumber-revs

If empty revisions are dropped (using the --drop-empty-revs option), change the revision numbers of the remaining revisions so that there are no gaps in the numeric sequence.

--preserve-revprops

If empty revisions are not dropped, preserve the revision properties (log message, author, date, custom properties, etc.) for those empty revisions. Otherwise, empty revisions will only contain the original datestamp, and a generated log message that indicates that this revision was emptied by svndumpfilter.

While svndumpfilter can be very useful, and a huge timesaver, there are unfortunately a couple of gotchas. First, this utility is overly sensitive to path semantics. Pay attention to whether paths in your dump file are specified with or without leading slashes. You'll want to look at the Node-path and Node-copyfrom-path headers.

…
Node-path: spreadsheet/Makefile
…

If the paths have leading slashes, you should include leading slashes in the paths you pass to svndumpfilter include and svndumpfilter exclude (and if they don't, you shouldn't). Further, if your dump file has an inconsistent usage of leading slashes for some reason, [35] you should probably normalize those paths so they all have, or lack, leading slashes.

Also, copied paths can give you some trouble. Subversion supports copy operations in the repository, where a new path is created by copying some already existing path. It is possible that at some point in the lifetime of your repository, you might have copied a file or directory from some location that svndumpfilter is excluding, to a location that it is including. In order to make the dump data self-sufficient, svndumpfilter needs to still show the addition of the new path—including the contents of any files created by the copy—and not represent that addition as a copy from a source that won't exist in your filtered dump data stream. But because the Subversion repository dump format only shows what was changed in each revision, the contents of the copy source might not be readily available. If you suspect that you have any copies of this sort in your repository, you might want to rethink your set of included/excluded paths, perhaps including the paths that served as sources of your troublesome copy operations, too.

Finally, svndumpfilter takes path filtering quite literally. If you are trying to copy the history of a project rooted at trunk/my-project and move it into a repository of its own, you would, of course, use the svndumpfilter include command to keep all the changes in and under trunk/my-project. But the resulting dump file makes no assumptions about the repository into which you plan to load this data. Specifically, the dump data might begin with the revision which added the trunk/my-project directory, but it will not contain directives which would create the trunk directory itself (because trunk doesn't match the include filter). You'll need to make sure that any directories which the new dump stream expect to exist actually do exist in the target repository before trying to load the stream into that repository.



[33] That's rather the reason you use version control at all, right?

[34] Conscious, cautious removal of certain bits of versioned data is actually supported by real use-cases. That's why an “obliterate” feature has been one of the most highly requested Subversion features, and one which the Subversion developers hope to soon provide.

[35] While svnadmin dump has a consistent leading slash policy—to not include them—other programs which generate dump data might not be so consistent.