git中的哈希冲突

用正确的“但是”来回答这个问题，而不解释为什么这不是一个问题是不可能的。如果没有很好地理解哈希是什么，是不可能做到这一点的。它比你在计算机科学课程中接触到的简单情况要复杂得多。

There is a basic misunderstanding of information theory here. If you reduce a large amount of information into a smaller amount by discarding some amount (ie. a hash) there will be a chance of collision directly related to the length of the data. The shorter the data, the LESS likely it will be. Now, the vast majority of the collisions will be gibberish, making them that much more likely to actually happen (you would never check in gibberish...even a binary image is somewhat structured). In the end, the chances are remote. To answer your question, yes, git will treat them as the same, changing the hash algorithm won't help, it'll take a "second check" of some sort, but ultimately, you would need as much "additional check" data as the length of the data to be 100% sure...keep in mind you would be 99.99999....to a really long number of digits.... sure with a simple check like you describe. SHA-x are cryptographically strong hashes, which means is't generally hard to intentionally create two source data sets that are both VERY SIMILAR to each other, and have the same hash. One bit of change in the data should create more than one (preferably as many as possible) bits of change in the hash output, which also means it's very difficult (but not quite impossible) to work back from the hash to the complete set of collisions, and thereby pull out the original message from that set of collisions - all but a few will be gibberish, and of the ones that aren't there's still a huge number to sift through if the message length is any significant length. The downside of a crypto hash is that they are slow to compute...in general.

So, what's it all mean then for Git? Not much. The hashes get done so rarely (relative to everything else) that their computational penalty is low overall to operations. The chances of hitting a pair of collisions is so low, it's not a realistic chance to occur and not be detected immediately (ie. your code would most likely suddenly stop building), allowing the user to fix the problem (back up a revision, and make the change again, and you'll almost certainly get a different hash because of the time change, which also feeds the hash in git). There is more likely for it to be a real problem for you if you're storing arbitrary binaries in git, which isn't really what it's primary use model is. If you want to do that...you're probably better off using a traditional database.

思考这个问题并没有错——这是一个很好的问题，很多人只是把它当作“太不可能了，不值得思考”——但实际上比这要复杂一些。如果它确实发生了，它应该很容易被检测到，它不会是正常工作流程中的无声损坏。

2013-06-30 00:03:42

用正确的“但是”来回答这个问题，而不解释为什么这不是一个问题是不可能的。如果没有很好地理解哈希是什么，是不可能做到这一点的。它比你在计算机科学课程中接触到的简单情况要复杂得多。

There is a basic misunderstanding of information theory here. If you reduce a large amount of information into a smaller amount by discarding some amount (ie. a hash) there will be a chance of collision directly related to the length of the data. The shorter the data, the LESS likely it will be. Now, the vast majority of the collisions will be gibberish, making them that much more likely to actually happen (you would never check in gibberish...even a binary image is somewhat structured). In the end, the chances are remote. To answer your question, yes, git will treat them as the same, changing the hash algorithm won't help, it'll take a "second check" of some sort, but ultimately, you would need as much "additional check" data as the length of the data to be 100% sure...keep in mind you would be 99.99999....to a really long number of digits.... sure with a simple check like you describe. SHA-x are cryptographically strong hashes, which means is't generally hard to intentionally create two source data sets that are both VERY SIMILAR to each other, and have the same hash. One bit of change in the data should create more than one (preferably as many as possible) bits of change in the hash output, which also means it's very difficult (but not quite impossible) to work back from the hash to the complete set of collisions, and thereby pull out the original message from that set of collisions - all but a few will be gibberish, and of the ones that aren't there's still a huge number to sift through if the message length is any significant length. The downside of a crypto hash is that they are slow to compute...in general.

So, what's it all mean then for Git? Not much. The hashes get done so rarely (relative to everything else) that their computational penalty is low overall to operations. The chances of hitting a pair of collisions is so low, it's not a realistic chance to occur and not be detected immediately (ie. your code would most likely suddenly stop building), allowing the user to fix the problem (back up a revision, and make the change again, and you'll almost certainly get a different hash because of the time change, which also feeds the hash in git). There is more likely for it to be a real problem for you if you're storing arbitrary binaries in git, which isn't really what it's primary use model is. If you want to do that...you're probably better off using a traditional database.

思考这个问题并没有错——这是一个很好的问题，很多人只是把它当作“太不可能了，不值得思考”——但实际上比这要复杂一些。如果它确实发生了，它应该很容易被检测到，它不会是正常工作流程中的无声损坏。

2013-06-30 00:03:42

你可以在“Git如何处理一个blob上的SHA-1碰撞?”中看到一个很好的研究。

由于SHA1冲突现在是可能的(正如我在回答中用shatat .io提到的)，Git 2.13(2017年第二季度)将通过Marc Stevens (CWI)和Dan Shumow(微软)实现的SHA-1“检测试图创建冲突”的变体来改善/缓解当前的情况。

参见Jeff King (peff)的commit f5f5e7f, commit 8325e43, commit c0c2006, commit 45a574e, commit 28dc98e(2017年3月16日)。 (由Junio C Hamano—gitster—在commit 48b3693中合并，2017年3月24日)

Makefile: make DC_SHA1 the default We used to use the SHA1 implementation from the OpenSSL library by default. As we are trying to be careful against collision attacks after the recent "shattered" announcement, switch the default to encourage people to use DC_SHA1 implementation instead. Those who want to use the implementation from OpenSSL can explicitly ask for it by OPENSSL_SHA1=YesPlease when running "make". We don't actually have a Git-object collision, so the best we can do is to run one of the shattered PDFs through test-sha1. This should trigger the collision check and die.

Git是否可以改进以适应这种情况，或者我是否必须改用新的哈希算法?

2017年12月Git 2.16(2018年第一季度)更新:支持替代SHA的努力正在进行中:参见“为什么Git不使用更现代的SHA?”。

您将能够使用另一种哈希算法:SHA1不再是Git的唯一算法。

Git 2.18(2018年第二季度)记录了这个过程。

参见Ævar Arnfjörð Bjarmason (avar)提交5988eb6，提交45fa195(2018年3月26日)。 (由Junio C Hamano - gitster -在commit d877975中合并，2018年4月11日)

doc hash-function-transition: clarify what SHAttered means Attempt to clarify what the SHAttered attack means in practice for Git. The previous version of the text made no mention whatsoever of Git already having a mitigation for this specific attack, which the SHAttered researchers claim will detect cryptanalytic collision attacks. I may have gotten some of the nuances wrong, but as far as I know this new text accurately summarizes the current situation with SHA-1 in git. I.e. git doesn't really use SHA-1 anymore, it uses Hardened-SHA-1 (they just so happen to produce the same outputs 99.99999999999...% of the time). Thus the previous text was incorrect in asserting that: [...]As a result [of SHAttered], SHA-1 cannot be considered cryptographically secure any more[...] That's not the case. We have a mitigation against SHAttered, however we consider it prudent to move to work towards a NewHash should future vulnerabilities in either SHA-1 or Hardened-SHA-1 emerge.

所以现在新的文档是这样的:

Git v2.13.0 and later subsequently moved to a hardened SHA-1 implementation by default, which isn't vulnerable to the SHAttered attack. Thus Git has in effect already migrated to a new hash that isn't SHA-1 and doesn't share its vulnerabilities, its new hash function just happens to produce exactly the same output for all known inputs, except two PDFs published by the SHAttered researchers, and the new implementation (written by those researchers) claims to detect future cryptanalytic collision attacks. Regardless, it's considered prudent to move past any variant of SHA-1 to a new hash. There's no guarantee that future attacks on SHA-1 won't be published in the future, and those attacks may not have viable mitigations. If SHA-1 and its variants were to be truly broken, Git's hash function could not be considered cryptographically secure any more. This would impact the communication of hash values because we could not trust that a given hash value represented the known good version of content that the speaker intended.

注意:现在同一文档(2018年第三季度，Git 2.19)明确地将“新哈希”引用为SHA-256:参见“为什么Git不使用更现代的SHA?”。

2017-04-11 20:46:38

在10个卫星上挑选原子

SHA-1哈希是一个40十六进制字符串…也就是每个字符4比特乘以40…160位。现在我们知道10位大约是1000(确切地说是1024)，这意味着有1000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000不同的SHA-1哈希值……1048.

这相当于什么?月球是由1047个原子组成的。所以如果我们有10颗卫星……你在这些卫星中随机选择一个原子…然后继续，再随机选择一个原子……那么你两次选择同一个原子的可能性，就是两次给定的git提交将有相同的SHA-1哈希的可能性。

在此基础上，我们可以问这样一个问题……

在您开始担心冲突之前，存储库中需要多少提交?

这与所谓的“生日攻击”有关，而“生日攻击”又指的是“生日悖论”或“生日问题”，即当你从给定的集合中随机挑选时，你只需要出人意料地少挑几次，就很有可能选了两次。但“少得惊人”在这里是一个相对的说法。

维基百科上有一个关于生日悖论碰撞概率的表格。没有40字符散列的条目。但是对32个字符和48个字符的条目进行插值，结果是5*1022个git提交，碰撞概率为0.1%。这是五万亿亿亿个不同的提交，或者五十个zettcommit，在你达到哪怕0.1%的碰撞几率之前。

仅这些提交的哈希值的字节和就比地球上一年产生的所有数据还要多，也就是说，您需要以比YouTube流媒体视频更快的速度大量生成代码。祝你好运。: D

这里的重点是，除非有人故意造成碰撞，否则随机发生碰撞的概率是如此之小，以至于你可以忽略这个问题

“但当碰撞发生时，实际会发生什么?”

好吧，假设不可能发生的事情确实发生了，或者假设有人设法定制了一个刻意的SHA-1哈希碰撞。然后会发生什么?

在这种情况下，有一个很好的答案，有人用它做了实验。我将引用他的回答:

If a blob already exists with the same hash, you will not get any warnings at all. Everything seems to be ok, but when you push, someone clones, or you revert, you will lose the latest version (in line with what is explained above). If a tree object already exists and you make a blob with the same hash: Everything will seem normal, until you either try to push or someone clones your repository. Then you will see that the repo is corrupt. If a commit object already exists and you make a blob with the same hash: same as #2 - corrupt If a blob already exists and you make a commit object with the same hash, it will fail when updating the "ref". If a blob already exists and you make a tree object with the same hash. It will fail when creating the commit. If a tree object already exists and you make a commit object with the same hash, it will fail when updating the "ref". If a tree object already exists and you make a tree object with the same hash, everything will seem ok. But when you commit, all of the repository will reference the wrong tree. If a commit object already exists and you make a commit object with the same hash, everything will seem ok. But when you commit, the commit will never be created, and the HEAD pointer will be moved to an old commit. If a commit object already exists and you make a tree object with the same hash, it will fail when creating the commit.

如你所见，有些情况并不好。特别是情况#2和#3会使您的存储库混乱。但是，错误似乎停留在该存储库中，而攻击或奇怪的不可能性不会传播到其他存储库。

此外，蓄意碰撞的问题似乎被认为是一个真正的威胁，因此，例如GitHub正在采取措施防止它。

2014-04-23 19:09:30

好吧，我想我们现在知道会发生什么了——你应该预料到你的存储库会被损坏(源代码)。

2017-02-25 10:38:27

谷歌现在声称在某些前提条件下SHA-1碰撞是可能的: https://security.googleblog.com/2017/02/announcing-first-sha1-collision.html

由于git使用SHA-1来检查文件完整性，这意味着git中的文件完整性受到了损害。

在我看来，git应该使用更好的哈希算法，因为故意碰撞现在是可能的。

2017-02-24 03:22:58

git中的哈希冲突

推荐文章

最新文章

标签