Tools for Hacking R: Git + Subversion

In an earlier post, I discussed how to use Subversion to download, edit, and generate a patch against R's source code. Since most of us can't commit our code changes back to R's repository, we need some other way to store and maintain a patch until it is incorporated into R. Of course, our changes may never be incorporated. Even so, we ought to have a record of our work!

The biggest problem in maintaining a patch is ensuring compatibility with upstream changes. In other words, once we've written a patch, we need to ensure that subsequent changes to R's main development branch don't conflict with our changes. The Git version control software can help us here.

Git is similar in purpose to Subversion; it's used to track changes to source code. Git has features that make it easy to maintain a patch against a larger project. Unlike Subversion, which keeps the revision history on a central server, Git stores a complete copy of the repository, history included, locally. In addition, Git is often distributed with tools that make it easy to interact with Subversion repositories.

This is a blog post, so let's just see an example. We first need to install the Git and Git-Subversion packages. In Debian GNU/Linux or Ubuntu, we can use aptitude (run as root, or via sudo):

$ aptitude install git-core git-svn

We can then use git svn to download and initialize a Git repository from the R Subversion repository:

$ git svn clone -r52760 http://svn.r-project.org/R/trunk R-patch

This command tells Git to download the Subversion repository at http://svn.r-project.org/R/trunk, at revision 52760, and use it to initialize a Git repository locally in the directory R-patch. The -r argument here is critical. If the revision is not provided, the entire revision history is downloaded from the Subversion repository (all ~53k revisions)! It's also important to select a revision that is current, because when the Git repository is later updated, every revision after the one we cloned is downloaded.

Now that we have a local Git repository in R-patch, we can modify the code and keep track of our changes using the normal Git conventions. Say we want to increase the number of available R connections. We can modify src/main/connections.c such that the resulting diff is:

$ git diff
diff --git a/src/main/connections.c b/src/main/connections.c
index ee01a9d..7fa73b9 100644
--- a/src/main/connections.c
+++ b/src/main/connections.c
@@ -60,7 +60,7 @@ typedef long long int _lli_t;
   extern UImode  CharacterMode;
 #endif
 
-#define NCONNECTIONS 128 /* snow needs one per slave node */
+#define NCONNECTIONS 256 /* snow needs one per slave node */
 #define NSINKS 21
 
 static Rconnection Connections[NCONNECTIONS];

and commit our changes locally with something like:

$ git commit -a -m"increase available connections"
[master d8e4b62] increase available connections
 1 files changed, 1 insertions(+), 1 deletions(-)

Now that we have a patch against revision 52760, we need to ensure that subsequent changes in the Subversion trunk don't conflict with our code. The Git-Subversion software has a special command to deal with this, called rebase. The git svn rebase command 'unwinds' our local work, applies the changes from the Subversion trunk, and then 'replays' our work on top of those changes. If there are conflicts, Git-Subversion will issue a notification and mark the areas in each file where a conflict occurs. At this point the rebase operation is incomplete, and you must manually resolve the conflicting code. When all conflicts are resolved, the git rebase --continue command completes the rebase operation, and our patch maintenance is complete.
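
The worked example in this post happens to rebase cleanly, so the conflict path is sketched here separately. This is only a minimal illustration under stated assumptions: it uses plain git rebase in a throwaway repository rather than git svn rebase against R's trunk, and the branch names (upstream, patch), the one-line connections.c, and the value 150 are all hypothetical stand-ins.

```shell
#!/bin/sh
# Sketch: simulate "upstream changed the same line as our patch" with plain Git.
# Branch names, file contents, and values are hypothetical, not R's real history.
set -e

demo=$(mktemp -d)
cd "$demo"
git init -q
git config user.name "Demo User"
git config user.email "demo@example.com"

# Stand-in for the Subversion trunk at the revision we cloned.
printf '#define NCONNECTIONS 128\n' > connections.c
git add connections.c
git commit -q -m "upstream: initial import"
git branch upstream              # marks where upstream currently is
git checkout -q -b patch         # our local work happens here

# Our local patch.
printf '#define NCONNECTIONS 256\n' > connections.c
git commit -q -a -m "increase available connections"

# Meanwhile, upstream edits the same line.
git checkout -q upstream
printf '#define NCONNECTIONS 150\n' > connections.c
git commit -q -a -m "upstream: raise connection limit"

# Replay our patch on top of the new upstream state.
git checkout -q patch
if git rebase upstream >/dev/null 2>&1; then
    echo "rebased cleanly"
else
    echo "rebase stopped on a conflict"
    # Resolve by hand; in this sketch we simply keep our own value.
    printf '#define NCONNECTIONS 256\n' > connections.c
    git add connections.c
    # GIT_EDITOR=true accepts the existing commit message without prompting.
    GIT_EDITOR=true git rebase --continue >/dev/null
fi

git log --format=%s
```

After the script finishes, the log should list our commit first, followed by the two upstream commits, which mirrors the ordering that git svn rebase produces against the real trunk.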

To illustrate:

$ git svn rebase
	M	src/main/deparse.c
r52761 = 9d0f32ca4cd8067f1ec5407b40af5c0a21cee5b4 (refs/remotes/git-svn)
	M	src/library/base/man/strptime.Rd
	M	src/main/datetime.c
	M	doc/NEWS.Rd

<snipped for blog post>

r52795 = b7c88c3bc39bf679ed8609111a3390b218823120 (refs/remotes/git-svn)
	M	doc/NEWS.Rd
r52796 = be0b53290415a43d0aa0fab2245553ce2d9e455f (refs/remotes/git-svn)
First, rewinding head to replay your work on top of it...
Applying: increase available connections
Using index info to reconstruct a base tree...
Falling back to patching base and 3-way merge...

Clearly, our local modifications did not result in a conflict, and so we have successfully maintained this trivial patch. In addition, our local commit is now at the top of the Git commit log, just after the latest Subversion commit by Peter Dalgaard, of the R core team:

$ git log
commit 295b642df92af768c3cd0813d6b3593a00061617
Author: Matt Shotwell <matt@biostatmatt.com>
Date:   Mon Aug 23 21:34:34 2010 -0400

    increase available connections

commit be0b53290415a43d0aa0fab2245553ce2d9e455f
Author: pd <pd@00db46b3-68df-0310-9c12-caf00c1e9a41>
Date:   Mon Aug 23 21:13:50 2010 +0000

    camelCase...
    
    git-svn-id: http://svn.r-project.org/R/trunk@52796 00db46b3-68df-0310-9c12-caf00c1e9a41

<snipped for blog post>

We can generate a new patch file against the latest (Subversion trunk) revision using git diff, and specifying only the revision(s) we had committed locally:

$ git diff be0b532..
diff --git a/src/main/connections.c b/src/main/connections.c
index a06d01d..7402552 100644
--- a/src/main/connections.c
+++ b/src/main/connections.c
@@ -60,7 +60,7 @@ typedef long long int _lli_t;
   extern UImode  CharacterMode;
 #endif
 
-#define NCONNECTIONS 128 /* snow needs one per slave node */
+#define NCONNECTIONS 256 /* snow needs one per slave node */
 #define NSINKS 21
 
 static Rconnection Connections[NCONNECTIONS];

where be0b532 is the (partial) Git hash of the latest Subversion trunk revision. The range be0b532.. is shorthand for be0b532..HEAD, so the diff covers everything committed since that revision, i.e., our local changes.
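
To keep the patch as a file, the output of git diff can simply be redirected. The following is a rough sketch in a throwaway repository, not the R source tree: the variable base plays the role of be0b532, and the file name connections.patch and the two-line connections.c are hypothetical. It also checks, with git apply --check, that the saved patch still applies to a pristine copy of the base revision.

```shell
#!/bin/sh
# Sketch: save a patch against a base revision and verify that it applies.
# The repository, file contents, and patch file name are hypothetical.
set -e

work=$(mktemp -d)
mkdir "$work/repo"
cd "$work/repo"
git init -q
git config user.name "Demo User"
git config user.email "demo@example.com"

# Stand-in for the latest upstream (trunk) revision.
printf '#define NCONNECTIONS 128\n#define NSINKS 21\n' > connections.c
git add connections.c
git commit -q -m "upstream: latest trunk revision"
base=$(git rev-parse --short HEAD)   # plays the role of be0b532

# Our local change, committed on top of the base revision.
sed 's/NCONNECTIONS 128/NCONNECTIONS 256/' connections.c > connections.c.new
mv connections.c.new connections.c
git commit -q -a -m "increase available connections"

# Write everything since the base revision to a patch file outside the tree.
git diff "$base".. > ../connections.patch

# Check the patch against a pristine checkout of the base revision.
git checkout -q "$base"
git apply --check ../connections.patch && echo "patch applies cleanly"
git checkout -q -                    # return to the branch with our commit
```

In the R-patch repository from this post, the equivalent step would be git diff be0b532.. redirected to a file of your choosing.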