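Managing large binary files with Git

I am looking for opinions on how to handle large binary files on which my source code (a web application) depends. We are currently discussing several alternatives: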

1. Copy the binary files by hand.
   Pro: Not sure.
   Contra: I am strongly against this, as it increases the likelihood of errors when setting up a new site or migrating the old one. It adds another hurdle to clear.
2. Manage them all with Git.
   Pro: Removes the possibility of 'forgetting' to copy an important file.
   Contra: Bloats the repository and decreases flexibility in managing the code base; checkouts, clones, etc. will take quite a while.
3. Separate repositories.
   Pro: Checking out/cloning the source code is as fast as ever, and the images are properly archived in their own repository.
   Contra: Removes the simplicity of having the one and only Git repository for the project. It surely introduces some other things I haven't thought about.

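What are your experiences/thoughts regarding this?

Also: does anybody have experience with using multiple Git repositories in one project and managing them?

The files are images for a program which generates PDFs containing those files. The files will not change very often (as in years), but they are very relevant to the program. The program will not work without the files.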

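Current answer

SVN seems to handle binary deltas more efficiently than Git.

I had to decide on a versioning system for documentation (JPEG files, PDF files, and .odt files). I just tested adding a JPEG file and rotating it 90 degrees four times (to check the effectiveness of binary deltas). Git's repository grew by 400%. SVN's repository grew by only 11%.

So it looks like SVN is more efficient with binary files.

So my choice is Git for source code and SVN for binary files such as documentation.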

2010-10-03 03:11:41

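Other answers

I would use submodules (as Pat Notz suggests) or two distinct repositories. If you modify your binary files too often, then I would try to minimize the impact of the huge repository by cleaning the history:

I had a very similar problem several months ago: ~21 GB of MP3 files, unclassified (bad names, bad id3 tags, no idea whether I liked a given MP3 file or not...), and replicated on three computers.

I used an external hard disk drive holding the main Git repository, and I cloned it onto each computer. Then I started to classify the files in the habitual way (pushing, pulling, merging... deleting and renaming many times).

At the end, I had only ~6 GB of MP3 files and ~83 GB in the .git directory. I used git-write-tree and git-commit-tree to create a new commit without commit ancestors, and started a new branch pointing to that commit. The "git log" for that branch showed only one commit.

Then I deleted the old branch, kept only the new branch, deleted the ref-logs, and ran "git prune": after that, my .git folders weighed only ~6 GB...

You could "purge" the huge repository from time to time in the same way: your "git clone"s will be faster.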
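The purge described above can be sketched roughly as follows. This is a toy reconstruction in a throwaway repository, not the exact commands the author ran; the file name track.mp3, the branch name purged, and the demo identity are all made up for illustration.

```shell
set -e
# Throwaway identity so the commits work on a machine without Git configured.
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.invalid
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.invalid

work=$(mktemp -d)
cd "$work"
git init -q .
git symbolic-ref HEAD refs/heads/master

# Build a small history standing in for months of renames and deletions.
for i in 1 2 3; do
  echo "version $i" > track.mp3
  git add track.mp3
  git commit -q -m "edit $i"
done
old_tip=$(git rev-parse HEAD)

# Write the current index as a tree and wrap it in a commit with no parent.
tree=$(git write-tree)
new_root=$(echo "Snapshot without history" | git commit-tree "$tree")

# Start a new branch at that commit and drop the old one.
git checkout -q -b purged "$new_root"
git branch -q -D master

# Delete the ref-logs, then prune the objects nothing refers to any more.
git reflog expire --expire=now --all
git prune

if git cat-file -e "$old_tip" 2>/dev/null; then old_present=yes; else old_present=no; fi
commit_count=$(git rev-list --count HEAD)
```

Every clone that still has the old history keeps it until the same cleanup is repeated there, so this only pays off if all copies are purged or re-cloned.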

2009-02-12 14:52:57

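You can also use git-fat. I like that it depends only on stock Python and rsync. It also supports the usual Git workflow, with the following self-explanatory commands: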

git fat init
git fat push
git fat pull

In addition, you need to check a .gitfat file into your repository and modify your .gitattributes to specify the file extensions you want git fat to manage.
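As a rough illustration of those two files, the sketch below writes a minimal pair into a scratch directory. The layout follows the git-fat README as far as I recall it, so treat it as an assumption and check the README of the version you install; the host storage.example.invalid, the path /srv/fat-store, and the chosen extensions are hypothetical.

```shell
set -e
cd "$(mktemp -d)"

# .gitfat: where the real binaries live (hypothetical rsync target).
cat > .gitfat <<'EOF'
[rsync]
remote = storage.example.invalid:/srv/fat-store
EOF

# .gitattributes: route these extensions through the "fat" filter.
cat > .gitattributes <<'EOF'
*.png filter=fat -crlf
*.jpg filter=fat -crlf
EOF
```

In a real repository both files would then be added and committed like any other tracked file.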

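You add a binary using the normal git add, which in turn invokes git fat based on your gitattributes rules.

Finally, it has the advantage that the location where your binaries are actually stored can be shared across repositories and users, and it supports anything rsync does.

UPDATE: Do not use git-fat if you're using a Git-SVN bridge. It will end up removing the binary files from your Subversion repository. However, if you're using a pure Git repository, it works beautifully.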

2013-09-26 04:51:26

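If the program won't work without the files, it seems like splitting them into a separate repo is a bad idea. We have large test suites that we break out into a separate repo, but those are truly "auxiliary" files.

However, you may be able to manage the files in a separate repo and then use git-submodule to pull them into your project in a sane way. You'd still have the full history of all your source but, as I understand it, you'd only have the one relevant revision of your images submodule. The git-submodule facility should help you keep the correct version of the code in line with the correct version of the images.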
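A minimal sketch of that arrangement, using two throwaway local repositories. The names images, app, and assets/images are invented for the example, and the protocol.file.allow override is only there because the demo uses a local path (recent Git versions restrict the file transport for submodules); with a normal remote URL it should not be needed.

```shell
set -e
# Throwaway identity so the commits work on a machine without Git configured.
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.invalid
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.invalid

tmp=$(mktemp -d)

# Repository holding only the images (hypothetical content).
git init -q "$tmp/images"
cd "$tmp/images"
echo "fake image bytes" > logo.png
git add logo.png
git commit -q -m "Add logo"
images_head=$(git rev-parse HEAD)

# Repository holding the source code.
git init -q "$tmp/app"
cd "$tmp/app"
echo "render_pdf()" > app.txt
git add app.txt
git commit -q -m "Add application"

# Pull the images in as a submodule and record the pinned revision.
git -c protocol.file.allow=always submodule --quiet add "$tmp/images" assets/images
git commit -q -m "Pin images submodule"

# The superproject stores exactly one commit id for the submodule path.
pinned=$(git rev-parse HEAD:assets/images)
```

Someone cloning the application would then typically use git clone --recurse-submodules <url> so the pinned image revision is checked out alongside the code.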

Here's a good introduction to submodules from the Git Book.

2009-02-12 14:29:01

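The solution I'd like to propose is based on orphan branches and a slight abuse of the tag mechanism, henceforth referred to as Orphan Tags Binary Storage (OTABS).

If you can use github's LFS or some other 3rd party, by all means you should. If you can't, then read on. Be warned: this solution is a hack and should be treated as such.

Desirable properties of OTABS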

- it is a pure git and git only solution -- it gets the job done without any 3rd party software (like git-annex) or 3rd party infrastructure (like github's LFS).
- it stores the binary files efficiently, i.e. it doesn't bloat the history of your repository.
- git pull and git fetch, including git fetch --all are still bandwidth efficient, i.e. not all large binaries are pulled from the remote by default.
- it works on Windows.
- it stores everything in a single git repository.
- it allows for deletion of outdated binaries (unlike bup).

Undesirable properties of OTABS

- it makes git clone potentially inefficient (but not necessarily, depending on your usage). If you deploy this solution you might have to advise your colleagues to use git clone -b master --single-branch <url> instead of git clone. This is because git clone by default literally clones the entire repository, including things you wouldn't normally want to waste your bandwidth on, like unreferenced commits. Taken from SO 4811434.
- it makes git fetch <remote> --tags bandwidth inefficient, but not necessarily storage inefficient. You can always advise your colleagues not to use it.
- you'll have to periodically use a git gc trick to clean your repository of any files you don't want any more.
- it is not as efficient as bup or git-bigfiles. But it's respectively more suitable for what you're trying to do and more off-the-shelf. You are likely to run into trouble with hundreds of thousands of small files or with files in the range of gigabytes, but read on for workarounds.

Adding the Binary Files

Before you start, make sure that you've committed all your changes, your working tree is up to date, and your index doesn't contain any uncommitted changes. It might be a good idea to push all your local branches to your remote (github etc.) in case any disaster should happen.

1. Create a new orphan branch. git checkout --orphan binaryStuff will do the trick. This produces a branch that is entirely disconnected from any other branch, and the first commit you'll make in this branch will have no parent, which will make it a root commit.
2. Clean your index using git rm --cached * .gitignore.
3. Take a deep breath and delete the entire working tree using rm -fr * .gitignore. The internal .git directory will stay untouched, because the * wildcard doesn't match it.
4. Copy in your VeryBigBinary.exe, or your VeryHeavyDirectory/.
5. Add it && commit it.
6. Now it becomes tricky -- if you push it into the remote as a branch all your developers will download it the next time they invoke git fetch, clogging their connection. You can avoid this by pushing a tag instead of a branch. This can still impact your colleagues' bandwidth and filesystem storage if they have a habit of typing git fetch <remote> --tags, but read on for a workaround. Go ahead and git tag 1.0.0bin
7. Push your orphan tag: git push <remote> 1.0.0bin.
8. Just so you never push your binary branch by accident, you can delete it with git branch -D binaryStuff (switch back to another branch first, since Git refuses to delete the branch you are currently on). Your commit will not be marked for garbage collection, because an orphan tag pointing to it (1.0.0bin) is enough to keep it alive.
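The steps above can be tried end to end against a throwaway bare repository, roughly as sketched below. The temporary paths, the one-file "source" history, and the demo identity are invented for the example. The sketch also deviates from the text in two small ways: it clears the index with git rm -r --cached . rather than the glob form (which fails when no .gitignore exists), and it switches back to master before deleting binaryStuff.

```shell
set -e
# Throwaway identity so the commits work on a machine without Git configured.
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.invalid
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.invalid

tmp=$(mktemp -d)
git init -q --bare "$tmp/remote.git"      # stands in for <remote>
git init -q "$tmp/work"
cd "$tmp/work"
git symbolic-ref HEAD refs/heads/master
git remote add origin "$tmp/remote.git"
echo "source" > main.c
git add main.c
git commit -q -m "Source code"
git push -q origin master

# Steps 1-3: orphan branch, empty index, empty working tree.
git checkout -q --orphan binaryStuff
git rm -q -r --cached .
rm -rf ./*

# Steps 4-5: copy in the binary and commit it as a root commit.
echo "pretend this is huge" > VeryBigBinary.exe
git add VeryBigBinary.exe
git commit -q -m "Binary drop 1.0.0"

# Steps 6-7: publish it as a tag, never as a branch.
git tag 1.0.0bin
git push -q origin 1.0.0bin

# Step 8: leave the orphan branch, then delete it; the tag keeps the commit alive.
git checkout -q master
git branch -q -D binaryStuff

tag_on_remote=$(git ls-remote --tags origin 1.0.0bin)
branch_on_remote=$(git ls-remote --heads origin binaryStuff)
```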

Checking out the Binary Files

How do I (or my colleagues) get the VeryBigBinary.exe checked out into the current working tree? If your current working branch is for example master you can simply git checkout 1.0.0bin -- VeryBigBinary.exe. This will fail if you don't have the orphan tag 1.0.0bin downloaded, in which case you'll have to git fetch <remote> 1.0.0bin beforehand. You can add the VeryBigBinary.exe into your master's .gitignore, so that no-one on your team will pollute the main history of the project with the binary by accident.
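Seen from a colleague's machine, the checkout might look like the sketch below. The publisher half only exists to give the demo something to fetch, and all paths and names are made up. One deliberate deviation: the sketch fetches with the tag <name> form, which is documented to store the tag locally; as far as I can tell the plain git fetch <remote> 1.0.0bin form may only populate FETCH_HEAD, so verify which behaviour your Git version gives you.

```shell
set -e
# Throwaway identity so the commits work on a machine without Git configured.
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.invalid
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.invalid

tmp=$(mktemp -d)
git init -q --bare "$tmp/remote.git"

# Publisher: source on master, binary behind the orphan tag 1.0.0bin.
git init -q "$tmp/publisher"
cd "$tmp/publisher"
git symbolic-ref HEAD refs/heads/master
echo "source" > main.c
git add main.c
git commit -q -m "Source code"
git checkout -q --orphan binaryStuff
git rm -q -r --cached .
rm -f main.c
echo "pretend this is huge" > VeryBigBinary.exe
git add VeryBigBinary.exe
git commit -q -m "Binary drop"
git tag 1.0.0bin
git push -q "$tmp/remote.git" master 1.0.0bin

# Colleague: clone only master (file:// forces the normal transport
# instead of a local object copy).
git clone -q -b master --single-branch "file://$tmp/remote.git" "$tmp/colleague"
cd "$tmp/colleague"

# Fetch the orphan tag explicitly, then take just the one file from it.
git fetch -q origin tag 1.0.0bin
git checkout 1.0.0bin -- VeryBigBinary.exe

# checkout also stages the file; unstage it and ignore it so it
# cannot slip into master's history.
git reset -q -- VeryBigBinary.exe
echo "VeryBigBinary.exe" >> .gitignore
```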

Completely Deleting the Binary Files

If you decide to completely purge VeryBigBinary.exe from your local repository, your remote repository, and your colleagues' repositories, you can just:

1. Delete the orphan tag on the remote: git push <remote> :refs/tags/1.0.0bin
2. Delete the orphan tag locally (deletes all other unreferenced tags): git tag -l | xargs git tag -d && git fetch --tags. Taken from SO 1841341 with slight modification.
3. Use a git gc trick to delete your now unreferenced commit locally: git -c gc.reflogExpire=0 -c gc.reflogExpireUnreachable=0 -c gc.rerereresolved=0 -c gc.rerereunresolved=0 -c gc.pruneExpire=now gc "$@". It will also delete all other unreferenced commits. Taken from SO 1904860.
4. If possible, repeat the git gc trick on the remote. It is possible if you're self-hosting your repository and might not be possible with some git providers, like github, or in some corporate environments. If you're hosting with a provider that doesn't give you ssh access to the remote, just let it be. It is possible that your provider's infrastructure will clean your unreferenced commit in their own sweet time. If you're in a corporate environment you can advise your IT to run a cron job garbage collecting your remote once per week or so. Whether they do or don't will not have any impact on your team in terms of bandwidth and storage, as long as you advise your colleagues to always git clone -b master --single-branch <url> instead of git clone.
5. All your colleagues who want to get rid of outdated orphan tags need only to apply steps 2-3.
6. You can then repeat steps 1-8 of Adding the Binary Files to create a new orphan tag 2.0.0bin. If you're worried about your colleagues typing git fetch <remote> --tags you can actually name it 1.0.0bin again. This will make sure that the next time they fetch all the tags the old 1.0.0bin will be unreferenced and marked for subsequent garbage collection (using step 3). When you try to overwrite a tag on the remote you have to use -f like this: git push -f <remote> <tagname>
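Steps 1-3 can be exercised on a throwaway repository as sketched below. For brevity the orphan tag is built with plumbing commands instead of the orphan-branch procedure, which is a shortcut for the demo only; the paths and identity are made up, and the "$@" from the original alias is dropped because nothing is passed through here.

```shell
set -e
# Throwaway identity so the commits work on a machine without Git configured.
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.invalid
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.invalid

tmp=$(mktemp -d)
git init -q --bare "$tmp/remote.git"
git init -q "$tmp/work"
cd "$tmp/work"
git symbolic-ref HEAD refs/heads/master
git remote add origin "$tmp/remote.git"
echo "source" > main.c
git add main.c
git commit -q -m "Source code"
git push -q origin master

# Demo setup: a parentless commit holding the binary, reachable only via the tag.
blob=$(echo "pretend this is huge" | git hash-object -w --stdin)
tree=$(printf '100644 blob %s\tVeryBigBinary.exe\n' "$blob" | git mktree)
bin_commit=$(echo "Binary drop" | git commit-tree "$tree")
git tag 1.0.0bin "$bin_commit"
git push -q origin 1.0.0bin

# Step 1: delete the orphan tag on the remote.
git push -q origin :refs/tags/1.0.0bin

# Step 2: drop all local tags, then re-fetch whatever the remote still has.
git tag -l | xargs git tag -d > /dev/null
git fetch -q --tags origin

# Step 3: the gc trick, expiring reflogs and pruning unreachable objects now.
git -c gc.reflogExpire=0 -c gc.reflogExpireUnreachable=0 \
    -c gc.rerereresolved=0 -c gc.rerereunresolved=0 \
    -c gc.pruneExpire=now gc --quiet

if git cat-file -e "$bin_commit" 2>/dev/null; then commit_present=yes; else commit_present=no; fi
```

Note that the bare repository standing in for the remote still holds the objects until step 4 is run there.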

Afterword

- OTABS doesn't touch your master or any other source code/development branches. The commit hashes, all of the history, and the small size of these branches are unaffected. If you've already bloated your source code history with binary files you'll have to clean it up as a separate piece of work. This script might be useful.
- Confirmed to work on Windows with git-bash.
- It is a good idea to apply a set of standard tricks to make storage of binary files more efficient. Frequent running of git gc (without any additional arguments) makes git optimise the underlying storage of your files by using binary deltas. However, if your files are unlikely to stay similar from commit to commit you can switch off binary deltas altogether. Additionally, because it makes no sense to compress already compressed or encrypted files, like .zip, .jpg or .crypt, git allows you to switch off compression of the underlying storage. Unfortunately it's an all-or-nothing setting affecting your source code as well.
- You might want to script up parts of OTABS to allow for quicker usage. In particular, scripting steps 2-3 from Completely Deleting Binary Files into an update git hook could give a compelling but perhaps dangerous semantics to git fetch ("fetch and delete everything that is out of date").
- You might want to skip step 4 of Completely Deleting Binary Files to keep a full history of all binary changes on the remote at the cost of central repository bloat. Local repositories will stay lean over time.
- In the Java world it is possible to combine this solution with maven --offline to create a reproducible offline build stored entirely in your version control (it's easier with maven than with gradle). In the Golang world it is feasible to build on this solution to manage your GOPATH instead of go get. In the Python world it is possible to combine this with virtualenv to produce a self-contained development environment without relying on PyPi servers for every build from scratch.
- If your binary files change very often, like build artifacts, it might be a good idea to script a solution which stores the 5 most recent versions of the artifacts in the orphan tags monday_bin, tuesday_bin, ..., friday_bin, and also an orphan tag for each release 1.7.8bin, 2.0.0bin, etc. You can rotate the weekday_bin tags and delete old binaries daily. This way you get the best of both worlds: you keep the entire history of your source code but only the relevant history of your binary dependencies.
- It is also very easy to get the binary files for a given tag without getting the entire source code with all its history: git init && git remote add <name> <url> && git fetch <name> <tag> should do it for you.
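The last recipe (init, add a remote, fetch a single tag) can be checked on a toy setup like the one below. The repository paths are invented, and the orphan tag is again built with plumbing commands purely to keep the setup short. Since a fetch without a destination leaves its result in FETCH_HEAD, the file is taken from there.

```shell
set -e
# Throwaway identity so the commits work on a machine without Git configured.
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.invalid
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.invalid

tmp=$(mktemp -d)

# Demo "remote": some source history plus an orphan tag holding a binary.
git init -q "$tmp/origin"
cd "$tmp/origin"
echo "source" > main.c
git add main.c
git commit -q -m "Source code"
source_commit=$(git rev-parse HEAD)
blob=$(echo "pretend this is huge" | git hash-object -w --stdin)
tree=$(printf '100644 blob %s\tVeryBigBinary.exe\n' "$blob" | git mktree)
git tag 1.0.0bin "$(echo "Binary drop" | git commit-tree "$tree")"

# The recipe: an empty repository that fetches nothing but the tag.
mkdir "$tmp/binaries-only"
cd "$tmp/binaries-only"
git init -q
git remote add origin "file://$tmp/origin"
git fetch -q origin 1.0.0bin
git checkout FETCH_HEAD -- VeryBigBinary.exe

# The source history was never transferred.
if git cat-file -e "$source_commit" 2>/dev/null; then source_fetched=yes; else source_fetched=no; fi
```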

2015-07-13 18:32:39

Have a look at camlistore. It is not really Git-based, but I find it more appropriate for what you have to do.

2014-10-03 10:36:05
