Palantir Blog

Optimizing Git’s Merge Machinery, #2

9 min read · Mar 8, 2021


Editor’s note: This is the second post in a series by a Palantir Software Engineer on optimizing git’s merge and rename detection machinery. Click to read the first post.


In the first post in this series, I provided some background on how the git merge machinery works, why I was working on improving it, and showed a simple optimization to speed it up that made better use of exact renames. In this post, we’ll dive into the next optimization, which can be summarized as “Find useful heuristics to guide rename detection in order to avoid (or at least reduce) brute-force exhaustive pairwise comparisons.”

Pairing files for three-way content merging

First, a quick refresher; two quotes from the last blog post:

Merging isn’t a two-way operation, but a three-way operation where git consults the merge base, your current commit, and the commit at the tip of the other branch. […] The basic idea behind the merge algorithm is thus to just do a three-way content merge for each file in the repository using the versions of each file from those three commits.

and

Rename detection is a quadratic algorithm. It involves first getting the set of all N filenames unique to the merge base, then all M filenames unique to the given side, then doing a similarity comparison between the content of all N x M files and marking the best matches that are sufficiently similar as renames. Even a small number of renames is sufficient to make rename detection the slowest part of a merge.

One interesting thing that can easily be missed from the second quote above is that git treats files with the same name in the merge base and the given side as being related — and thus automatically pairs them for three-way content merges, regardless of actual similarity of file contents. Git allows this automatic pairing to be turned off for some commands like git diff or git log using command line flags (--break-rewrites or -B); this is referred to as “break detection”. When “break detection” is turned on, git will check similarity of files with the exact same name, and if they aren’t sufficiently similar, treat them as separate files and throw both into the bucket for rename detection.

The merge machinery does not turn break detection on; it would be way, way slower if it did. This essentially means that git says that the filename is a good guide for matching up files for three-way content merging, but only if the filename is an exact match. If the filename is not an exact match, it punts and says that the only thing that matters is content similarity of two files. In short, both filename similarity and content similarity can be consulted for determining which files should be paired for three-way content merges. We’ll come back to this, but for now we’ll concentrate on the content similarity comparisons.
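To make the cost of those content-similarity comparisons concrete, here is a rough Python sketch of the brute-force pairwise approach (purely illustrative: the function names are made up, and difflib's SequenceMatcher stands in for git's real chunk-hashing similarity measure described in footnote [1]):

```python
from difflib import SequenceMatcher

def similarity(a, b):
    # Stand-in for git's chunk-based similarity measure; returns a percentage.
    return SequenceMatcher(None, a, b).ratio() * 100

def brute_force_renames(sources, dests, threshold=50):
    # sources/dests: {filename: content} for the files unique to each side.
    # Compare all N x M pairs, then keep each destination's best match
    # if it clears the minimum similarity threshold.
    renames = {}
    for dst, dst_content in dests.items():
        best_src, best_score = None, threshold
        for src, src_content in sources.items():
            score = similarity(src_content, dst_content)
            if score > best_score:
                best_src, best_score = src, score
        if best_src is not None:
            renames[dst] = best_src
    return renames
```

Note the nested loop: with N rename sources and M destinations this performs N × M similarity computations, which is exactly the quadratic cost that makes rename detection the slowest part of a merge.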

Okay, so what does this pairing of files mean?

Pictorially, we can view this as a matrix. Listing filenames unique to the merge base along the left, and filenames unique to the given side across the top, we could for example see a view like the following:

[Image: matrix of filenames unique to the merge base (rows) versus filenames unique to the given side (columns)]

Then, for each cell we compute the similarity of the two files as a percentage and store it. When we’re done, we look for the best match for each file, and if the best match is sufficiently good, we mark the files as a rename. In the last blog post we noticed that we could exploit the fact that git stores hashes of everything to detect some renames quickly. But what if there are no exact matches, or there are still lots of files left after exact matching? Is there anything else we can do? Well, what if I showed this same matrix slightly differently?

Reframing this in a way that a five-year-old can solve

After looking at a long list of files for a while (much longer than this list here), I noticed something that seemed kind of obvious about matching up files. Let’s reshow this matrix just a little differently:

[Image: the same matrix, redrawn so that files sharing a basename stand out]

Is it any easier to notice likely rename pairs, without looking at actual file contents? I just asked my five-year-old, and she identified all four likely pairs. Sure, we can’t color code things here, but that was just an easy way to highlight the fact that renames often involve moving a file into a different directory while keeping the basename (everything after the last directory separator, including the extension) of the file the same. Is this cheating? Does it help in that many cases in practice? Is it a special case that only applies to certain repositories? So the big question is…

How common is this?

I examined a number of real-world repositories and their historical renames, checking what percentage of renames did NOT change the basename of the file in question. Here’s what I found:

  • linux: 76%
  • gcc: 64%
  • gecko: 79%
  • webkit: 89%
  • git.git: 16%

I also checked a few others, including internal company repositories. All the repositories I checked except git.git tended to keep file basenames the same when renaming files; apparently moving files into different directories while keeping the basename the same is very common. Even in git.git’s case, I’d argue 16% is a high enough percentage to suggest a useful optimization, because matching up that many files and removing them from the matrix comparison could reduce the number of pairs to check by up to 30% (because (1 − 0.16)² is approximately 0.70).
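To spell out the arithmetic behind that “up to 30%” claim: if a fraction f of the rename sources and destinations can each be matched and removed early, only (1 − f)² of the original N × M pairs remain. A quick sketch (`remaining_pair_fraction` is just a hypothetical helper name):

```python
def remaining_pair_fraction(f):
    # Removing a fraction f of both rows and columns of the N x M matrix
    # leaves (1 - f)^2 of the pairwise comparisons.
    return (1 - f) ** 2

# git.git's 16% of renames that preserve the basename:
print(f"{remaining_pair_fraction(0.16):.2f}")  # prints "0.71"
```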

Optimization: basename-guided rename detection

Given the above info, we can add an in-between set of checks (after the exact rename detection talked about in the previous blog post, but before the normal matrix comparison), consisting of the following steps:

  • Create a simple map of basename to fullname for rename sources (i.e. for filenames unique to the merge base). If any basename isn’t unique, just leave it out.
  • Do the same for rename destinations (i.e. files unique to the given side).
  • Find which basenames are in both sets. For each common basename:
      • Compare the similarity of the content of the two full files.
      • If they are sufficiently similar (with a higher threshold than used for the normal matrix comparison), mark them as a rename and remove them from the later matrix comparison.
      • If they are not sufficiently similar, include them in the full matrix of pairwise comparisons later.
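The steps above can be sketched in Python as follows (an illustration, not git’s actual C implementation; in the real code, each candidate pair then goes through the content-similarity check with the raised threshold):

```python
import os

def unique_basenames(paths):
    # Map basename -> full path, dropping any basename that occurs
    # more than once on this side.
    groups = {}
    for path in paths:
        groups.setdefault(os.path.basename(path), []).append(path)
    return {base: ps[0] for base, ps in groups.items() if len(ps) == 1}

def basename_guided_pairs(src_paths, dst_paths):
    # Yield (source, destination) candidates whose unique basenames match;
    # each candidate still has to pass the (higher) similarity threshold.
    srcs = unique_basenames(src_paths)
    dsts = unique_basenames(dst_paths)
    for base in sorted(srcs.keys() & dsts.keys()):
        yield srcs[base], dsts[base]
```

Every pair matched this way removes one row and one column from the later matrix comparison, which is where the speedup comes from.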

Note that this algorithm could end up changing behavior; it’s possible that a docs/ext.txt file is sufficiently similar by the above rules to match a docs/config/ext.txt and be marked as a rename, even though its content could have been even more similar to a docs/ext.md file. So this optimization is ultimately a judgment call: do the gains in performance justify the differences in behavior? Would users understand the results of the new algorithm as well as the old? We do require files matched by basename to meet a higher minimum threshold (>= 75% matching lines[1], versus the mere 50% matching lines needed in the matrix comparisons) as a safety factor. But ultimately, this optimization could yield different results than the old algorithm.


The gains from this change are pretty impressive. We use the same testcase as in the first blog post, which involved a high-level directory rename — basically renaming “drivers” to “pilots” in the linux kernel on one branch, then rebasing a bunch of patches onto that branch. For this testcase, adding the simple algorithm outlined above results in the following speedup:

                Before                  After
mega-renames:   1799.937 s ± 0.493 s    188.754 s ± 0.284 s

So this drops the time from half an hour down to just over 3 minutes. Obviously, the results will vary depending on how many renames are present in your repository and how many of those renames only move files into a new directory without changing the basename, but the statistics gathered above (in the “How common is this?” section) suggest that this optimization will be widely applicable.

Great! Can we do better, though?

What about all those files that don’t have a unique basename? Do those happen a lot? Turns out they do. Repositories may have hundreds of files with the same basename, especially build system files — .gitignore, Makefile, build.gradle, etc. It is also quite common to see source code files with a common name. The linux kernel has 100 setup.c, 87 irq.c, and 112 core.c files. Another repository I know of has many ObjectFactory.java and Plugin.java files.

One solution for handling these files would be making a mini-matrix of files with the same basename, looking for similarity percentages, and pairing up ones with the best matches that are sufficiently similar, leaving the rest to be recompared in the bigger matrix at the end. I implemented that, but there were a couple drawbacks. The biggest was that it worked quite poorly in combination with another later optimization I wanted to make, resulting in a worse solution overall.

However, these non-unique basenames typically represent cases where people have moved files into new directories without renaming the files themselves; there are just multiple possible files in different directories to choose from. If we knew how to predict which directory a given file had moved into, and there was a file in that directory with the same basename, then we’d have a likely rename pairing. Predicting which directory a file moved into sounds like the job of directory rename detection, which was something we implemented earlier and blogged about in Renaming and deep directory hierarchies in Git. There’s just one problem with re-using that code here: directory rename detection was based on aggregating rename detection of individual files within a directory. How can we use it to predict individual renames, if it is an end-result computed from the aggregation of all the renames?

The fact that rename detection is done in stages, though, provides us with an answer. We first detect exact renames, as noted in the first blog post in this series. That doesn’t give us all the renames, but we can use the directory rename detection logic with just this subset of renames to get a prediction about where other files in the same directory will be renamed to. For any given remaining file, if we get a prediction from the directory rename detection logic about where it moves to, and if the resulting directory has a file with the same name, we consider them likely rename pairs and check them for similarity. If they meet the increased threshold required for matching up similar basenames, then we mark them as renames and remove them from the later matrix pairwise comparison.
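Here is a rough sketch of that staged idea (hypothetical Python helpers; git’s actual directory rename detection handles many more cases, such as nested directories and tie-breaking):

```python
import os
from collections import Counter

def predict_dir_renames(exact_renames):
    # From the exact renames found in the first stage ({old_path: new_path}),
    # take a majority vote on where each source directory moved to.
    votes = {}
    for old, new in exact_renames.items():
        src_dir, dst_dir = os.path.dirname(old), os.path.dirname(new)
        votes.setdefault(src_dir, Counter())[dst_dir] += 1
    return {d: counts.most_common(1)[0][0] for d, counts in votes.items()}

def guess_destination(path, dir_renames, dst_files):
    # If the file's directory is predicted to have moved, and a file with
    # the same basename exists there, that's a likely rename candidate
    # (still subject to the higher similarity threshold).
    new_dir = dir_renames.get(os.path.dirname(path))
    if new_dir is not None:
        candidate = os.path.join(new_dir, os.path.basename(path))
        if candidate in dst_files:
            return candidate
    return None
```

The key point is that the prediction only needs the exact renames from the first stage, so it can run before the expensive pairwise comparisons.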

Adding this tweak to the optimization improves our timing results as follows:

                Before                  After
mega-renames:   188.754 s ± 0.284 s     130.465 s ± 0.259 s

Which means that, overall, this pair of optimizations dropped us from half an hour to a little over two minutes. Not bad. Anyone with a repository with lots of renames, most of which just move files into different directories, should see similar results. If you don’t have lots of renames, then you aren’t affected by the slowness. If you have lots of renames but most of them also involve basename changes, then please let me know; I’m very curious to learn of such a repository (other than git.git). Even in that case, though, I have more optimizations up my sleeve that I’ll cover in future posts in this series.

For more details about this pair of optimizations, see the two upstream threads where my patches were reviewed, namely the one on basic basename guided rename detection and the thread on using file basenames even more (via early directory rename detection).

The basic optimization presented here will appear in git-2.31, to be released in mid-March 2021. The follow-on tweak did not make it in time for git-2.31 but will appear in git-2.32, which will probably be released in June 2021.

Author

Elijah Newren

[1] Matching lines isn’t the actual measure, but it’s close enough for a basic understanding. The real algorithm is a bit more complex. The similarity algorithm divides each file into chunks, defined by splitting at every newline or after 64 bytes, whichever comes first. It hashes each chunk to an integer (so frequently repeated lines, such as a lone opening or closing brace, will all hash to the same integer). After mapping all chunks to integers, you have a list of integers for each file, and for any given file you can compare its list of integers to the list of integers of another file, noting the percentage of integers from one list found in the other. There are also a few more wrinkles, such as files whose sizes are too different (e.g., by more than a factor of 2) automatically being assumed to not be similar.
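A toy version of that measure might look like the following (illustrative only; git’s xdiff-based implementation differs in several details, for instance in how repeated chunks are counted):

```python
def chunks(data: bytes):
    # Split at every newline or after 64 bytes, whichever comes first.
    out, start = [], 0
    for i in range(len(data)):
        if data[i] == ord("\n") or (i - start + 1) == 64:
            out.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        out.append(data[start:])
    return out

def similarity_percent(a: bytes, b: bytes):
    # Files whose sizes are too different (more than a factor of 2 here)
    # are assumed to not be similar at all.
    if not a or not b or max(len(a), len(b)) > 2 * min(len(a), len(b)):
        return 0
    # Hash each chunk; report what fraction of a's chunks appear in b.
    b_hashes = {hash(c) for c in chunks(b)}
    a_hashes = [hash(c) for c in chunks(a)]
    return 100 * sum(1 for h in a_hashes if h in b_hashes) // len(a_hashes)
```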
