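This may never have happened in the real world, and may never happen, but let's consider it: suppose you have a git repository, make a commit, and get very, very unlucky: one of the blobs ends up with the same SHA-1 as another blob that is already in your repository. The question is, how would Git handle this? Simply fail? Find a way to link the two blobs and check which one is needed according to the context?

More a brain-teaser than an actual problem, but I found the issue interesting.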


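According to Pro Git:

If you do happen to commit an object that hashes to the same SHA-1 value as a previous object in your repository, Git will see the previous object already in your Git database and assume it was already written. If you try to check out that object again at some point, you'll always get the data of the first object.

So it wouldn't fail, but it wouldn't save your new object either. I don't know how that would look on the command line, but it would certainly be confusing.

A bit further down, that same reference attempts to illustrate the likelihood of such a collision: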

Here’s an example to give you an idea of what it would take to get a SHA-1 collision. If all 6.5 billion humans on Earth were programming, and every second, each one was producing code that was the equivalent of the entire Linux kernel history (1 million Git objects) and pushing it into one enormous Git repository, it would take 5 years until that repository contained enough objects to have a 50% probability of a single SHA-1 object collision. A higher probability exists that every member of your programming team will be attacked and killed by wolves in unrelated incidents on the same night.
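
As a rough sanity check of the quoted estimate, here is a minimal sketch using the standard birthday-bound approximation n ≈ sqrt(2 · 2^bits · ln(1/(1-p))). The rate constant is taken from the quote; the function name is my own, and the approximation assumes SHA-1 outputs behave like uniformly random 160-bit values.

```python
import math

HASH_BITS = 160                       # SHA-1 output size in bits
OBJECTS_PER_SECOND = 6.5e9 * 1e6      # 6.5 billion people x 1 million objects each (from the quote)
SECONDS_PER_YEAR = 365.25 * 24 * 3600

def objects_for_collision_probability(p: float, bits: int) -> float:
    """Approximate number of random hashes needed to see a collision with probability p."""
    return math.sqrt(2 * (2.0 ** bits) * math.log(1 / (1 - p)))

n_half = objects_for_collision_probability(0.5, HASH_BITS)
years = n_half / OBJECTS_PER_SECOND / SECONDS_PER_YEAR

print(f"objects needed for a 50% collision chance: {n_half:.3e}")
print(f"years at the quoted rate: {years:.1f}")
```

With these assumptions the result comes out at a handful of years, the same order of magnitude as the figure in the quote; the exact number depends on which approximation and rounding you use.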


Original answer (2012) (see the 2017 shattered.io SHA1 collision below)

That old (2006) answer from Linus might still be relevant:

Nope. If it has the same SHA1, it means that when we receive the object from the other end, we will not overwrite the object we already have.

So what happens is that if we ever see a collision, the "earlier" object in any particular repository will always end up overriding. But note that "earlier" is obviously per-repository, in the sense that the git object network generates a DAG that is not fully ordered, so while different repositories will agree about what is "earlier" in the case of direct ancestry, if the object came through separate and not directly related branches, two different repos may obviously have gotten the two objects in different order.

However, the "earlier will override" is very much what you want from a security standpoint: remember that the git model is that you should primarily trust only your own repository. So if you do a "git pull", the new incoming objects are by definition less trustworthy than the objects you already have, and as such it would be wrong to allow a new object to replace an old one.

So you have two cases of collision:

the inadvertent kind, where you somehow are very very unlucky, and two files end up having the same SHA1. At that point, what happens is that when you commit that file (or do a "git-update-index" to move it into the index, but not committed yet), the SHA1 of the new contents will be computed, but since it matches an old object, a new object won't be created, and the commit-or-index ends up pointing to the old object. You won't notice immediately (since the index will match the old object SHA1, and that means that something like "git diff" will use the checked-out copy), but if you ever do a tree-level diff (or you do a clone or pull, or force a checkout) you'll suddenly notice that that file has changed to something completely different than what you expected. So you would generally notice this kind of collision fairly quickly.

In related news, the question is what to do about the inadvertent collision.. First off, let me remind people that the inadvertent kind of collision is really really really damn unlikely, so we'll quite likely never ever see it in the full history of the universe. But if it happens, it's not the end of the world: what you'd most likely have to do is just change the file that collided slightly, and just force a new commit with the changed contents (add a comment saying "/* This line added to avoid collision */") and then teach git about the magic SHA1 that has been shown to be dangerous. So over a couple of million years, maybe we'll have to add one or two "poisoned" SHA1 values to git. It's very unlikely to be a maintenance problem ;)

The attacker kind of collision because somebody broke (or brute-forced) SHA1. This one is clearly a lot more likely than the inadvertent kind, but by definition it's always a "remote" repository. If the attacker had access to the local repository, he'd have much easier ways to screw you up. So in this case, the collision is entirely a non-issue: you'll get a "bad" repository that is different from what the attacker intended, but since you'll never actually use his colliding object, it's literally no different from the attacker just not having found a collision at all, but just using the object you already had (ie it's 100% equivalent to the "trivial" collision of the identical file generating the same SHA1).
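
To make the "earlier object wins" behaviour concrete, here is a toy sketch of a content-addressed store. It is not Git's code: `toy_id`, `ToyObjectStore` and `find_collision` are hypothetical names, and the id is deliberately cut down to a single hex digit of SHA-1 so that a collision can be found instantly.

```python
import hashlib

def toy_id(data: bytes) -> str:
    """Deliberately weak 4-bit object id (one hex digit) so collisions are easy to find."""
    return hashlib.sha1(data).hexdigest()[0]

class ToyObjectStore:
    """Toy content-addressed store: an existing id is never overwritten."""

    def __init__(self):
        self._objects = {}

    def write(self, data: bytes) -> str:
        oid = toy_id(data)
        # If the id is already present, assume the object was written before
        # and keep the old bytes (first writer wins).
        self._objects.setdefault(oid, data)
        return oid

    def read(self, oid: str) -> bytes:
        return self._objects[oid]

def find_collision(first: bytes) -> bytes:
    """Search for different content that has the same toy id as `first`."""
    target = toy_id(first)
    counter = 0
    while True:
        candidate = b"file version %d\n" % counter
        if candidate != first and toy_id(candidate) == target:
            return candidate
        counter += 1

store = ToyObjectStore()
old = b"original contents\n"
new = find_collision(old)

old_id = store.write(old)
new_id = store.write(new)            # silently skipped: the id already exists

print(old_id == new_id)              # True
print(store.read(new_id) == old)     # True: reading back returns the earlier object
```

No error is raised anywhere, which mirrors the "you won't notice immediately" part of the quote: the newer content is simply never stored.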

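The question of using SHA-256 instead comes up regularly, but had not been acted upon at the time of writing (2012). Note: starting in 2018 with Git 2.19, the code is being refactored to use SHA-256.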


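Note (humor): you can force a commit to have a particular SHA1 prefix with the project gitbrute from Brad Fitzpatrick (bradfitz).

gitbrute brute-forces a pair of author+committer timestamps such that the resulting git commit has your desired prefix.

Example: https://github.com/bradfitz/deadbeef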
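
The idea can be sketched in a few lines. This is not gitbrute's actual code, only an illustration under assumptions: the commit has no parent, points at Git's empty tree, uses a made-up identity, and the search only looks for a two-character prefix so it finishes quickly.

```python
import hashlib

EMPTY_TREE = "4b825dc642cb6eb9a060e54bf8d69288fbee4904"  # id of Git's empty tree
AUTHOR = "A U Thor <author@example.com>"                 # hypothetical identity

def commit_id(author_ts: int, committer_ts: int, message: str = "toy commit") -> str:
    """SHA-1 of a parentless commit object: 'commit <size>\\0' header plus the body."""
    body = (
        f"tree {EMPTY_TREE}\n"
        f"author {AUTHOR} {author_ts} +0000\n"
        f"committer {AUTHOR} {committer_ts} +0000\n"
        f"\n{message}\n"
    ).encode()
    return hashlib.sha1(b"commit %d\x00" % len(body) + body).hexdigest()

def brute_force_prefix(prefix: str, base_ts: int, window: int = 100):
    """Try author/committer timestamp pairs near base_ts until the id starts with prefix."""
    for author_offset in range(window):
        for committer_offset in range(window):
            a, c = base_ts + author_offset, base_ts + committer_offset
            oid = commit_id(a, c)
            if oid.startswith(prefix):
                return a, c, oid
    return None  # no match inside the search window

result = brute_force_prefix("be", base_ts=1_500_000_000)
print(result)
```

Each extra hex digit in the prefix multiplies the expected number of attempts by 16, which is why a vanity prefix is a fun trick while a full 40-digit match is a very different problem.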


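Daniel Dinnyes points out in the comments a passage from 7.1 Git Tools - Revision Selection, which includes:

A higher probability exists that every member of your programming team will be attacked and killed by wolves in unrelated incidents on the same night.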


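Even the more recent (February 2017) shattered.io demonstrated the possibility of forging a SHA1 collision (see much more in my separate answer, including Linus Torvalds' Google+ post):

a/ It still requires over 9,223,372,036,854,775,808 SHA1 computations. This took the equivalent processing power of 6,500 years of single-CPU computations and 110 years of single-GPU computations.
b/ It would forge one file (with the same SHA1), but with the additional constraint that its content and size together must produce the identical SHA1 (a collision on the content alone is not enough): see "How is the git hash calculated?": a blob SHA1 is computed based on the content and size.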
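
For point b/, here is a minimal sketch of how a blob id is derived: Git hashes a header of the form "blob <size>" plus a NUL byte, followed by the content, so the id differs from the plain SHA-1 of the file. The function name `git_blob_id` is my own.

```python
import hashlib

def git_blob_id(content: bytes) -> str:
    """SHA-1 over 'blob <size>\\0' followed by the raw content."""
    header = b"blob %d\x00" % len(content)
    return hashlib.sha1(header + content).hexdigest()

print(git_blob_id(b""))               # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
print(hashlib.sha1(b"").hexdigest())  # da39a3ee5e6b4b0d3255bfef95601890afd80709
```

The first value is the id Git reports for an empty file; the second is the plain SHA-1 of zero bytes. Two files that collide under plain SHA-1 therefore do not automatically collide as Git blobs.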

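See "Lifetimes of cryptographic hash functions" from Valerie Anita Aurora for more. In that page, she notes:

Google spent 6500 CPU years and 110 GPU years to convince everyone we need to stop using SHA-1 for security critical applications. Also because it was cool

See more in my separate answer below.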


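I guess cryptographers would celebrate.

Quote from the Wikipedia article on SHA-1:

In February 2005, an attack by Xiaoyun Wang, Yiqun Lisa Yin, and Hongbo Yu was announced. The attacks can find collisions in the full version of SHA-1, requiring fewer than 2^69 operations. (A brute-force search would require 2^80 operations.)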
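
To put those exponents side by side, here is a small arithmetic sketch. The constant names are mine; 2^69 is the upper bound from the quote, and the last figure is the computation count cited elsewhere in this thread for the 2017 shattered.io collision.

```python
BRUTE_FORCE = 2 ** 80                        # generic birthday search on a 160-bit hash
WANG_2005 = 2 ** 69                          # upper bound quoted for the 2005 attack
SHATTERED_2017 = 9_223_372_036_854_775_808   # figure cited for the 2017 collision

print(SHATTERED_2017 == 2 ** 63)             # True
print(BRUTE_FORCE // WANG_2005)              # 2048
print(BRUTE_FORCE // SHATTERED_2017)         # 131072
```

So the 2005 result cut the work by a factor of about two thousand, and the 2017 collision by a factor of roughly a hundred thousand relative to brute force.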


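There are several different attack models for hashes like SHA-1, but the one usually discussed is collision search, including Marc Stevens' HashClash tool.

"As of 2012, the most efficient attack against SHA-1 is considered to be the one by Marc Stevens with an estimated cost of $2.77M to break a single hash value by renting CPU power from cloud servers."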

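As folks pointed out, you could force a hash collision with git, but doing so won't overwrite the existing objects in another repository. I'd imagine even git push -f --no-thin won't overwrite the existing objects, but I'm not 100% sure.

That said, if you hack into a remote repository, you could make your false object the older one there, possibly embedding hacked code into an open source project on github or similar. If you were careful, maybe you could introduce a hacked version that new users downloaded.

I suspect, however, that many things the project's developers might do could either expose or accidentally destroy your multi-million dollar hack. In particular, that's a lot of money down the drain if some developer, whom you didn't hack, ever runs the aforementioned git push --no-thin after modifying the affected files, sometimes even without the --no-thin.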


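I did an experiment to find out exactly how Git would behave in this case. This is with version 2.7.0~rc0+next.20151210 (Debian version). I basically just reduced the hash size from 160-bit to 4-bit by applying the following diff and rebuilding git: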

--- git-2.7.0~rc0+next.20151210.orig/block-sha1/sha1.c
+++ git-2.7.0~rc0+next.20151210/block-sha1/sha1.c
@@ -246,6 +246,8 @@ void blk_SHA1_Final(unsigned char hashou
    blk_SHA1_Update(ctx, padlen, 8);

    /* Output hash */
-   for (i = 0; i < 5; i++)
-       put_be32(hashout + i * 4, ctx->H[i]);
+   for (i = 0; i < 1; i++)
+       put_be32(hashout + i * 4, (ctx->H[i] & 0xf000000));
+   for (i = 1; i < 5; i++)
+       put_be32(hashout + i * 4, 0);
 }

Then I did a few commits and noticed the following.

1. If a blob already exists with the same hash, you will not get any warnings at all. Everything seems to be ok, but when you push, someone clones, or you revert, you will lose the latest version (in line with what is explained above).
2. If a tree object already exists and you make a blob with the same hash: everything will seem normal, until you either try to push or someone clones your repository. Then you will see that the repo is corrupt.
3. If a commit object already exists and you make a blob with the same hash: same as #2 - corrupt.
4. If a blob already exists and you make a commit object with the same hash, it will fail when updating the "ref".
5. If a blob already exists and you make a tree object with the same hash, it will fail when creating the commit.
6. If a tree object already exists and you make a commit object with the same hash, it will fail when updating the "ref".
7. If a tree object already exists and you make a tree object with the same hash, everything will seem ok. But when you commit, all of the repository will reference the wrong tree.
8. If a commit object already exists and you make a commit object with the same hash, everything will seem ok. But when you commit, the commit will never be created, and the HEAD pointer will be moved to an old commit.
9. If a commit object already exists and you make a tree object with the same hash, it will fail when creating the commit.

For #2 you will typically get an error like this when you run "git push":

error: object 0400000000000000000000000000000000000000 is a tree, not a blob
fatal: bad blob object
error: failed to push some refs to origin

or:

error: unable to read sha1 file of file.txt (0400000000000000000000000000000000000000)

if you delete the file and then run "git checkout file.txt".

For #4 and #6, you will typically get an error like this:

error: Trying to write non-commit object
f000000000000000000000000000000000000000 to branch refs/heads/master
fatal: cannot update HEAD ref

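when running "git commit". In this case you can typically just type "git commit" again, since this will create a new hash (because of the changed timestamp).

For #5 and #9, you will typically get an error like this: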

fatal: 1000000000000000000000000000000000000000 is not a valid 'tree' object

when running "git commit".

If someone tries to clone your corrupt repository, they will typically see something like:

git clone (one repo with collided blob,
d000000000000000000000000000000000000000 is commit,
f000000000000000000000000000000000000000 is tree)

Cloning into 'clonedversion'...
done.
error: unable to read sha1 file of s (d000000000000000000000000000000000000000)
error: unable to read sha1 file of tullebukk
(f000000000000000000000000000000000000000)
fatal: unable to checkout working tree
warning: Clone succeeded, but checkout failed.
You can inspect what was checked out with 'git status'
and retry the checkout with 'git checkout -f HEAD'

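What "worries" me is that in two cases (2, 3) the repository becomes corrupt without any warnings, and in three cases (1, 7, 8) everything seems ok, but the repository content is different from what you expect it to be. People cloning or pulling will have different content than what you have. Cases 4, 5, 6 and 9 are ok, since they stop with an error. I suppose it would be better if it failed with an error in all cases.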
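
One way to picture why the cross-type cases (2 to 6 and 9) end in errors is a toy model in which every stored object remembers its type. This is only a sketch, not Git's implementation: `truncated_id` and `TypedToyStore` are hypothetical names, the payloads are placeholders and not real tree encodings, and the mask mirrors the one in the diff above.

```python
import hashlib

def truncated_id(obj_type: str, payload: bytes) -> str:
    """SHA-1 over 'type size\\0payload', then keep only 4 bits, as in the patched build."""
    header = b"%s %d\x00" % (obj_type.encode(), len(payload))
    full = hashlib.sha1(header + payload).digest()
    first_word = int.from_bytes(full[:4], "big") & 0x0F000000  # same mask as the diff
    return first_word.to_bytes(4, "big").hex() + "00" * 16

class TypedToyStore:
    def __init__(self):
        self._objects = {}

    def write(self, obj_type: str, payload: bytes) -> str:
        oid = truncated_id(obj_type, payload)
        self._objects.setdefault(oid, (obj_type, payload))  # existing object is kept
        return oid

    def read(self, oid: str, expected_type: str) -> bytes:
        obj_type, payload = self._objects[oid]
        if obj_type != expected_type:
            raise TypeError(f"object {oid} is a {obj_type}, not a {expected_type}")
        return payload

store = TypedToyStore()
tree_id = store.write("tree", b"placeholder tree payload")

# Find a blob whose truncated id collides with the tree that is already stored.
counter = 0
while truncated_id("blob", b"blob %d" % counter) != tree_id:
    counter += 1
blob_id = store.write("blob", b"blob %d" % counter)  # not stored: the id is taken

try:
    store.read(blob_id, "blob")
except TypeError as err:
    print("error:", err)
```

The write succeeds silently, and the problem only surfaces later when something reads the id expecting a blob, which is the shape of case #2. Same-type collisions (cases 1, 7, 8) would pass this check, which matches the observation that those go unnoticed.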


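To add to my previous answer from 2012, there is now (Feb. 2017, five years later) an example of an actual SHA-1 collision with shattered.io, where you can craft two colliding PDF files: that is, obtain a SHA-1 digital signature on the first PDF file which can also be abused as a valid signature on the second PDF file. See also "At death's door for years, widely used SHA1 function is now dead", and this illustration.

Update 26th of February: Linus confirmed the following points in a Google+ post: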

(1) First off - the sky isn't falling. There's a big difference between using a cryptographic hash for things like security signing, and using one for generating a "content identifier" for a content-addressable system like git. (2) Secondly, the nature of this particular SHA1 attack means that it's actually pretty easy to mitigate against, and there's already been two sets of patches posted for that mitigation. (3) And finally, there's actually a reasonably straightforward transition to some other hash that won't break the world - or even old git repositories.

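Regarding that transition, see Git 2.16 (Q1 2018), which adds a structure representing the hash algorithm. The implementation of that transition has started.

Starting with Git 2.19 (Q3 2018), Git has picked SHA-256 as NewHash and is in the process of integrating it into the code (meaning SHA1 is still the default as of Q2 2019, Git 2.21, but SHA2 will be the successor).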


Original answer (25th of February). But:

This allows forging a blob; however, the SHA-1 of the tree would still change, since the size of the forged blob might not be the same as the original one: see "How is the git hash calculated?"; a blob SHA1 is computed based on the content and size.

It does have some issue for git-svn though. Or rather with svn itself, as seen here.

As I mentioned in my original answer, the cost of such an attempt is still prohibitive for now (6,500 CPU years and 110 GPU years). See also Valerie Anita Aurora in "Lifetimes of cryptographic hash functions".

As commented before, this isn't about security or trust, but data integrity (de-duplication and error detection), and a problem there can be easily detected by a git fsck, as mentioned by Linus Torvalds today. git fsck would warn about a commit message with opaque data hidden after a NUL (although NUL isn't always present in a fraudulent file). Not everybody turns on transfer.fsckObjects, but GitHub does: any push would be aborted in the case of a malformed object or a broken link. Although... there is a reason this is not activated by default.

A pdf file can have arbitrary binary data that you can change to generate a colliding SHA-1, as opposed to forged source code. The actual issue is in creating two Git repositories with the same head commit hash and different contents. And even then, the attack remains convoluted.

Linus adds:

The whole point of an SCM is that it isn't about a one-time event, but about continuous history. That also fundamentally means that a successful attack needs to work over time, and not be detectable. If you can fool a SCM one time, insert your code, and it gets detected next week, you didn't actually do anything useful. You only burned yourself.

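Joey Hess tried those PDFs in a Git repo and he found:

That includes two files with the same SHA and size, which do get different blobs thanks to the way git prepends the header to the content.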

joey@darkstar:~/tmp/supercollider>sha1sum  bad.pdf good.pdf 
d00bbe65d80f6d53d5c15da7c6b4f0a655c5a86a  bad.pdf
d00bbe65d80f6d53d5c15da7c6b4f0a655c5a86a  good.pdf
joey@darkstar:~/tmp/supercollider>git ls-tree HEAD
100644 blob ca44e9913faf08d625346205e228e2265dd12b65    bad.pdf
100644 blob 5f90b67523865ad5b1391cb4a1c010d541c816c1    good.pdf

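While appending identical data to these colliding files does generate other collisions, prepending data does not.

So the main vector of attack (forging a commit) would be:

Generate a regular commit object;
use the entire commit object + NUL as the chosen prefix, and
use the identical-prefix collision attack to generate the colliding good/bad objects.
... and this is useless because the good and bad commit objects still point to the same tree!

Plus, you can already detect cryptanalytic collision attacks against SHA-1 present in each file with cr-marcstevens/sha1collisiondetection.

Adding a similar check in Git itself would have some computation cost.

On changing the hash, Linus comments: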

The size of the hash and the choice of the hash algorithm are independent issues. What you'd probably do is switch to a 256-bit hash, use that internally and in the native git database, and then by default only show the hash as a 40-character hex string (kind of like how we already abbreviate things in many situations). That way tools around git don't even see the change unless passed in some special "--full-hash" argument (or "--abbrev=64" or whatever - the default being that we abbreviate to 40).
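
The idea in that quote (compute a 256-bit name internally, display 40 hex characters by default) can be sketched as follows. This only illustrates the quote and is not a description of what Git ended up doing; `object_name` and its `abbrev` parameter are hypothetical, loosely modelled on the "--abbrev=64" mentioned above.

```python
import hashlib

def object_name(content: bytes, abbrev: int = 40) -> str:
    """Compute a 256-bit blob-style id, then show only the first `abbrev` hex characters."""
    full = hashlib.sha256(b"blob %d\x00" % len(content) + content).hexdigest()
    return full[:abbrev]

short = object_name(b"hello\n")             # default: looks like a 40-character SHA-1 name
full = object_name(b"hello\n", abbrev=64)   # the complete 256-bit name

print(len(short), len(full))                # 40 64
print(full.startswith(short))               # True
```

Tools that only ever handle the 40-character form would not see a difference in shape, although the values themselves would no longer match the old SHA-1 names.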

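Still, a transition plan (from SHA1 to another hash function) would be complex, but it is being actively studied. A convert-to-object_id campaign is in progress.


Update 20th of March: GitHub details a possible attack and its protection: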

SHA-1 names can be assigned trust through various mechanisms. For instance, Git allows you to cryptographically sign a commit or tag. Doing so signs only the commit or tag object itself, which in turn points to other objects containing the actual file data by using their SHA-1 names. A collision in those objects could produce a signature which appears valid, but which points to different data than the signer intended. In such an attack the signer only sees one half of the collision, and the victim sees the other half.

Protection:

The recent attack uses special techniques to exploit weaknesses in the SHA-1 algorithm that find a collision in much less time. These techniques leave a pattern in the bytes which can be detected when computing the SHA-1 of either half of a colliding pair. GitHub.com now performs this detection for each SHA-1 it computes, and aborts the operation if there is evidence that the object is half of a colliding pair. That prevents attackers from using GitHub to convince a project to accept the "innocent" half of their collision, as well as preventing them from hosting the malicious half.

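See also "sha1collisiondetection" by Marc Stevens.


Again, with Git 2.16 (Q1 2018) adding a structure representing the hash algorithm, the implementation of a transition to a new hash has started. As mentioned above, the new supported hash will be SHA-256.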