G+: I thought I'd write an update on git …

David Coles
Linus Torvalds
I thought I'd write an update on git and SHA1, since the SHA1 collision attack was so prominently in the news.

Quick overview first, with more in-depth explanation below:

(1) First off - the sky isn't falling. There's a big difference between using a cryptographic hash for things like security signing, and using one for generating a "content identifier" for a content-addressable system like git.

(2) Secondly, the nature of this particular SHA1 attack means that it's actually pretty easy to mitigate against, and there have already been two sets of patches posted for that mitigation.

(3) And finally, there's actually a reasonably straightforward transition to some other hash that won't break the world - or even old git repositories.

Anyway, that's the high-level overview, you can stop there unless you are interested in some more details (keyword: "some". If you want more, you should participate in the git mailing list discussions - I'm posting this for the casual git users that might just want to see some random comments).

Anyway, on to the "details":

(1) What's the difference between using a hash for security vs using a hash for object identifiers in source control management?

Both want to use cryptographic hashes, but they want to use them for different reasons.

A hash that is used for security is basically a statement of trust: and if you can fool somebody, you can make them trust you when they really shouldn't. The point of a cryptographic hash there is to basically be the source of trust, so in many ways the hash is supposed to fundamentally protect against people you cannot trust other ways. When such a hash is broken, the whole point of the hash basically goes away.

In contrast, in a project like git, the hash isn't used for "trust". I don't pull on people's trees because they have a hash of a4d442663580. Our trust is in people, and then we end up having lots of technology measures in place to secure the actual data.

The reason for using a cryptographic hash in a project like git is because it pretty much guarantees that there are no accidental clashes, and it's also a really really good error detection thing. Think of it like "parity on steroids": it's not able to correct for errors, but it's really really good at detecting corrupt data.

Other SCMs have used things like CRCs for error detection, although honestly the most common error handling method in most SCMs tends to be "tough luck, maybe your data is there, maybe it isn't, I don't care".

So in git, the hash is used for de-duplication and error detection, and the "cryptographic" nature is mainly because a cryptographic hash is really good at those things.
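To make the de-duplication and error-detection roles concrete, here is a minimal sketch of a content-addressable store. The `ObjectStore` class and its method names are made up for illustration and are not git's implementation; it also hashes the raw content directly, whereas git prepends a small header before hashing (this comes up later in the thread).

```python
import hashlib


class ObjectStore:
    """Toy content-addressable store: the key is the SHA-1 of the content."""

    def __init__(self):
        self.objects = {}

    def put(self, data: bytes) -> str:
        key = hashlib.sha1(data).hexdigest()
        # Identical content always maps to the same key, so storing it a
        # second time adds nothing: that is the de-duplication.
        self.objects.setdefault(key, data)
        return key

    def verify(self) -> list:
        """Return the keys whose stored bytes no longer hash to their name."""
        return [key for key, data in self.objects.items()
                if hashlib.sha1(data).hexdigest() != key]


store = ObjectStore()
a = store.put(b"int main(void) { return 0; }\n")
b = store.put(b"int main(void) { return 0; }\n")  # same content again
print(a == b, len(store.objects))

# Simulate corruption: flip one bit in the stored copy.
good = store.objects[a]
store.objects[a] = bytes([good[0] ^ 0x01]) + good[1:]
print(store.verify() == [a])
```

The store cannot repair the flipped bit, but it can tell exactly which object went bad, which is the "parity on steroids" behaviour described above.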

I say "mainly", because yes, in git we also end up using the SHA1 when we use "real" cryptography for signing the resulting trees, so the hash does end up being part of a certain chain of trust. So we do take advantage of some of the actual security features of a good cryptographic hash, and so breaking SHA1 does have real downsides for us.

Which gets us to ...

(2) Why is this particular attack fairly easy to mitigate against at least within the context of using SHA1 in git?

There are two parts to this one: one is simply that the attack is not a pre-image attack, but an identical-prefix collision attack. That, in turn, has two big effects on mitigation:

(a) the attacker can't just generate any random collision, but needs to be able to control and generate both the "good" (not really) and the "bad" object.

(b) you can actually detect the signs of the attack in both sides of the collision.

In particular, (a) means that it's really hard to hide the attack in data that is transparent. What do I mean by "transparent"? I mean that you actually see and react to all of the data, rather than having some "blob" of data that acts like a black box, and you only see the end results.

In the pdf examples, the pdf format acted as the "black box", and what you see is the printout which has only a very indirect relationship to the pdf encoding.

But if you use git for source control like in the kernel, the stuff you really care about is source code, which is very much a transparent medium. If somebody inserts random odd generated crud in the middle of your source code, you will absolutely notice.

Similarly, the git internal data structures are actually very transparent too, even if most users might not consider them so. There are places you could try to hide things in (in particular, things like commits that have a NUL character that ends printout in "git log"), but "git fsck" already warns about those kinds of shenanigans.

So fundamentally, if the data you primarily care about is that kind of transparent source code, the attack is pretty limited to begin with. You'll see the attack. It's not silently switching your data out from under you.

"But I track pdf files in git, and I might not notice them being replaced under me?"

That's a very valid concern, and you'd want your SCM to help you even with that kind of opaque data where you might not see how people are doing odd things to it behind your back. Which is why the second part of mitigation is that (b): it's fairly trivial to detect the fingerprints of using this attack.

So we already have patches on the git mailing list which will detect when somebody has used this attack to bring down the cost of generating SHA1 collisions. They haven't been merged yet, but the good thing about those mitigation measures is that not everybody needs to even run them: if you host your project on something like http://github.com or kernel.org, it's already sufficient if the hosting place runs the checks every once in a while - you'll get notified if somebody poisoned your well.

And finally, the "yes, git will eventually transition away from SHA1". There's a plan, it doesn't look all that nasty, and you don't even have to convert your repository. There's a lot of details to this, and it will take time, but because of the issues above, it's not like this is a critical "it has to happen now" thing.


Matt Giuca
I disagree about (1)... git users do rely on the SHA-1 hash for trust. There's an implicit understanding (built into the system) that I can download git objects from any source into my local repository, and that is a safe, order-independent and side-effect-free operation. The only thing I have to manually verify is that my HEAD hash matches a trusted upstream, and I'm good. A SHA-1 collision breaks that assumption.

Sure, it's tricky to exploit. And if I'm only pulling from a single trusted source over HTTPS or SSH, then I'm fine. But consider this:

1. A malicious user forks a good repository.
2. The malicious user inserts a backdoor into the code (must be in a file that was changed recently). They use a SHA-1 collision to give the backdoored file the exact same SHA-1 hash as the original version. Because git is built on a hash chain, the tree containing the blob, the commit that introduced that change, and all child commits, will all happily refer to the modified blob without needing to be modified themselves.
3. The malicious user makes a few dummy commits which are an innocuous minor patch.
4. The malicious user uploads their modified repo to GitHub or another git hosting provider.
5. They then have to trick me into downloading their bad blob before I download the good blob from upstream. They might tell me they have a patch they want me to look at, so I pull their repo (I haven't pulled from origin/master in a few days). By pulling their repo, I download the bad blob onto my machine. Because I don't trust him, I don't compile or execute the code, I just look at the changes (and note that the bad blob isn't visible to me because it's not in the commits he added onto HEAD, it's hidden in the earlier commits).
6. Later, I pull from origin/master (the "good" upstream), and check out master. Now I'm no longer in the malicious guy's branch and have no expectation of caution, so I build and run the code.

The fatal flaw is that Step 6 did not download the good blob from upstream, because git sees that I already have its hash in my local repo. So when I check out master, git is telling me that I'm at the same hash as what everyone else is at; what the GitHub web page says is the hash for origin/master. But my code is actually different to what everyone else has, and it will remain that way until the file changes upstream. A "git diff" between my master and the malicious guy's branch won't show these changes. The only way to detect this would be to look at the diff for every single commit and compare it to what GitHub is showing.
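The "first copy wins" behaviour in steps 5 and 6 can be sketched with a toy repository. Everything here is hypothetical: `toy_id` keeps only the first byte of SHA-1 so that a colliding blob can be brute-forced in a moment, and `Repo.fetch` is a stand-in for the rule that objects already present locally are not fetched again. Real SHA-1 offers no such shortcut.

```python
import hashlib
from itertools import count


def toy_id(data: bytes) -> str:
    """Deliberately weak stand-in for SHA-1: only the first 2 hex digits."""
    return hashlib.sha1(data).hexdigest()[:2]


def find_evil_twin(good: bytes) -> bytes:
    """Brute-force a different blob with the same toy id as `good`."""
    target = toy_id(good)
    for n in count():
        candidate = b"backdoor();  /* filler %d */\n" % n
        if toy_id(candidate) == target:
            return candidate


class Repo:
    def __init__(self):
        self.objects = {}

    def fetch(self, remote_objects: dict) -> None:
        for oid, data in remote_objects.items():
            # An object id we already hold is assumed to be the same
            # content, so the remote copy is skipped.
            if oid not in self.objects:
                self.objects[oid] = data


good_blob = b"check_password(user);\n"
evil_blob = find_evil_twin(good_blob)
oid = toy_id(good_blob)

upstream = {oid: good_blob}
malicious_fork = {oid: evil_blob}

mine = Repo()
mine.fetch(malicious_fork)  # step 5: pull the fork first
mine.fetch(upstream)        # step 6: pull the trusted upstream afterwards
print(mine.objects[oid] == evil_blob)
```

The object id matches what upstream advertises, yet the bytes behind it are the ones that arrived first.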

I don't think signing helps, because as Linus points out, you only sign the commit object; it still relies on the hash chain's integrity.

Basically this attack is a bit tricky to pull off because it relies on the downstream users downloading the malicious blobs before they download the good upstream blobs. But totally feasible.

(And I don't know enough about this to say whether (2) is a mitigation...)

David Coles
I believe one tricky aspect of the ordeal is that you need control of both the "good" and "evil" versions for this particular attack on SHA1 (really they're both evil twins, one just acts innocent).

However I'd be extremely hesitant to just brush this off. Most security disasters aren't due to a giant "give me root" kind of screw up. Usually they're a collection of apparently innocent looking bugs that taken together can let a malicious attacker do something horrible.

Matt Giuca
You don't need to control good and evil versions. In my scenario, the "good" version is a legitimate blob from an upstream repo. You just need to get someone to download your evil version first.

Anyway, I tried out this attack (using the shattered PDF files) and I was thwarted by one thing ... git blob hashes aren't pure SHA-1 hashes of the file: it prepends the word "blob" and the file size to the front, then hashes. That causes the resultant SHA-1 hash to diverge and I don't think there's a trivial way to engineer a collision for these files. BUT that doesn't stop an attacker from using the same technique to create two files whose git blob hash (as opposed to the bare file hash) matches.
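For reference, a short sketch of how a blob id differs from the bare file hash: git hashes the ASCII header `blob <size>` followed by a NUL byte, then the file content. The helper name `git_blob_id` is made up for illustration.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """SHA-1 over 'blob <size>\\0' + content, the id git gives a blob."""
    header = b"blob %d\x00" % len(content)
    return hashlib.sha1(header + content).hexdigest()


empty = b""
print(hashlib.sha1(empty).hexdigest())  # bare SHA-1 of the content
print(git_blob_id(empty))               # id that `git hash-object` reports
```

For an empty file the bare SHA-1 is da39a3ee5e6b4b0d3255bfef95601890afd80709, while the blob id is e69de29bb2d1d6434b8b29ae775ad8c2e48c5391. The header changes the start of the hashed data, so two files that collide as raw files do not automatically collide as blobs.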

David Coles
The problem with the SHAttered attack is that it creates a pair of documents with common prefixes P and suffixes S. The only difference is in the 2 pairs of 'near-collision' blocks [(M_11, M_12), (M_21, M_22)], paired such that after the blocks the documents still have the same hash:

SHA-1(P || M_11 || M_12 || S) = SHA-1(P || M_21 || M_22 || S)

Thus to mount an attack, the malicious user would have to be the one to create both the "good" and "bad" blob (because only the malicious user could generate the two pairs of blocks). That makes an attack distinctly harder than being able to generate a collision against any existing document.

Though I do agree with you that while git might be immune to a file content collision, there's nothing stopping someone generating a blob SHA-1 collision. (While your attack had identical prefixes due to the documents having the same size, I believe the values of the near-collision pairs are dependent on the value of the prefix chosen due to chaining).
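The role of chaining can be illustrated with a toy iterated hash. This is only a sketch under loud assumptions: the 24-bit state, the 8-byte blocks, and the use of truncated SHA-1 as the compression function are all invented so that a collision can be found by brute force, and a single colliding block per side stands in for the two near-collision block pairs of the real attack.

```python
import hashlib
from itertools import count

BLOCK = 8  # toy block size in bytes
STATE = 3  # toy chaining state in bytes (24 bits)
IV = b"\x00" * STATE


def compress(state: bytes, block: bytes) -> bytes:
    """Toy compression function: truncated SHA-1 of state || block."""
    return hashlib.sha1(state + block).digest()[:STATE]


def absorb(state: bytes, data: bytes) -> bytes:
    """Feed whole blocks through the chain, starting from `state`."""
    assert len(data) % BLOCK == 0
    for i in range(0, len(data), BLOCK):
        state = compress(state, data[i:i + BLOCK])
    return state


def toy_hash(msg: bytes) -> bytes:
    """Iterated hash with length padding, in the style of SHA-1."""
    padded = msg + b"\x80"
    padded += b"\x00" * (-len(padded) % BLOCK)
    padded += len(msg).to_bytes(BLOCK, "big")
    return absorb(IV, padded)


def find_colliding_blocks(prefix: bytes):
    """Search for two blocks that reach the same state after `prefix`."""
    start = absorb(IV, prefix)
    seen = {}
    for n in count():
        block = n.to_bytes(BLOCK, "big")
        out = compress(start, block)
        if out in seen:
            return seen[out], block
        seen[out] = block


P = b"header A"  # exactly one block
Q = b"header B"  # a different prefix of the same length
S = b" followed by any shared suffix"
m1, m2 = find_colliding_blocks(P)

same_prefix = toy_hash(P + m1 + S) == toy_hash(P + m2 + S)
other_prefix = toy_hash(Q + m1 + S) == toy_hash(Q + m2 + S)
print(same_prefix, other_prefix)
```

Once the two chains reach the same state, any common suffix keeps them equal. The colliding blocks were computed for the state that follows P, though, so with prefix Q the states after m1 and m2 differ except by a chance of roughly one in 2^24 per block in this toy. The same effect is why a different header in front of the content invalidates a collision found for the bare files.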

I wonder if we need to start salting content hashes with something like a first-seen timestamp or HEAD commit-ID. Downside to that would be it would break a bunch of nice properties like content deduplication.

Matt Giuca
Ah OK. That makes it harder but still plausible: the bad guy just needs to author a pair of files: one with a legit minor change to an existing file, and one with a backdoor. Have the legit minor change accepted upstream, then do the above attack.

Do you think it's possible for one of the colliding files to have totally human-crafted content (while the other one presumably has some garbage that's just there to make the maths work out)? Or do both files need to have random garbage? If so I would acquiesce that the attack is infeasible because you would have to convince the maintainer to accept a "good" patch with garbage bytes in it.

David Coles
Using this attack to generate documents in a transparent format like source code or text documents would be pretty difficult. You need to insert 2 x 512 bits (2 x 64 bytes) of binary data into your document without anyone noticing (in the attack's PDFs they put it in the JPEG header and used some as yet undisclosed tricks in the JPEG format to make the documents look visually distinct).

Ignoring the difficulty of hiding binary blocks in a text file (the attack might still be possible using a reduced character set), you still don't have a lot of flexibility in crafting documents. They still need to be almost identical (common prefix P and suffix S) and I'm pretty sure that pre-selecting (M_11, M_12) would make the search for (M_21, M_22) computationally much harder (the birthday paradox doesn't "work" if you pre-pick one of the candidates).
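The birthday point can be checked on a small scale. This sketch assumes a made-up 16-bit hash (the first two bytes of SHA-1) and generic brute force, which is not how the real attack searches; it only compares "find any two messages that collide" with "find a message that matches one picked in advance".

```python
import hashlib
from itertools import count


def toy_hash(data: bytes) -> int:
    """First 16 bits of SHA-1, small enough to brute-force."""
    return int.from_bytes(hashlib.sha1(data).digest()[:2], "big")


def free_collision():
    """Attacker picks both messages: stop at the first repeated hash."""
    seen = {}
    for n in count(1):
        msg = b"free-%d" % n
        h = toy_hash(msg)
        if h in seen:
            return seen[h], msg, n
        seen[h] = msg


def match_fixed(target_msg: bytes):
    """One message is pre-picked: try until something hashes to it."""
    target = toy_hash(target_msg)
    for n in count(1):
        msg = b"fixed-%d" % n
        if toy_hash(msg) == target:
            return msg, n


a, b, free_tries = free_collision()
m, fixed_tries = match_fixed(b"pre-selected block")
print(free_tries, fixed_tries)
```

With a 16-bit hash the free search typically needs a few hundred tries (on the order of 2^8), while matching a fixed target typically needs tens of thousands (on the order of 2^16). That gap is the square-root advantage that disappears once one side of the pair is fixed in advance.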

Thus you would need to hide a backdoor in both documents (in which case why bother with using a collision) or in the two near-collision blocks (for which it was already hard to find any candidates).

I also suspect these 2 x 512 bit blocks are pretty easy to detect since the goal of the second block is to "undo" the SHA1 difference between documents caused by the first blocks.