Completely Removing a File's History from a Git Repository

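Sometimes files get committed to a Git repository by accident. Even if they are deleted later, they can still be found in the repository's history. The file may be so large that it wastes space, or it may contain sensitive content; either way, every trace of it has to be wiped out. Git itself provides commands for this job, but they are cumbersome to use and slow. I recommend a tool instead: bfg-repo-cleaner, which is fast and cleans thoroughly.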

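The tool is written in Scala, so you need to install a JDK before you can run it. Then download the released jar file, and the preparation is done.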

git clone --mirror git://example.com/some-big-repo.git
cd some-big-repo.git
java -jar bfg.jar --delete-files name-of-file-to-delete
java -jar bfg.jar --delete-folders name-of-folder-to-delete
git reflog expire --expire=now --all && git gc --prune=now --aggressive
git push

That completes the removal. In a fresh clone of the repository, the deleted content can no longer be found.
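One way to double-check this is to list every path still reachable from any ref and look for the deleted name. The sketch below is only an illustration and not part of BFG: the helper name paths_in_history is made up here, and it assumes you run it inside the repository (the mirror or a fresh clone).

```shell
# Hypothetical helper (not part of BFG or Git): print every path, reachable
# from any ref, whose last component equals the name given as $1.
# Run it inside the repository; no output means the name is gone.
paths_in_history() {
  git rev-list --objects --all |
    awk -v name="$1" '
      {
        path = $0
        sub(/^[0-9a-f]+ ?/, "", path)   # drop the leading object id
        n = split(path, parts, "/")
        if (path != "" && parts[n] == name) print path
      }' | sort -u
}
```

For example, paths_in_history id_rsa should print nothing once the cleanup has been pushed. Directories with that name are reported as well, which is convenient after a --delete-folders run.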

Below is the tool's original introduction:

BFG Repo-Cleaner

Removes large or troublesome blobs like git-filter-branch does, but faster. And written in Scala

View project on GitHub

$ bfg --strip-blobs-bigger-than 1M --replace-text banned.txt repo.git

an alternative to git-filter-branch

The BFG is a simpler, faster alternative to git-filter-branch for cleansing bad data out of your Git repository history:

  • Removing Crazy Big Files
  • Removing Passwords, Credentials & other Private data

The git-filter-branch command is enormously powerful and can do things that the BFG can’t – but the BFG is much better for the tasks above, because:

  • Faster : 10 – 720x faster
  • Simpler : The BFG isn’t particularly clever, but is focused on making the above tasks easy
  • Beautiful : If you need to, you can use the beautiful Scala language to customise the BFG. Which has got to be better than Bash scripting at least some of the time.

Usage

First clone a fresh copy of your repo, using the --mirror flag:

$ git clone --mirror git://example.com/some-big-repo.git

This is a bare repo, which means your normal files won’t be visible, but it is a full copy of the Git database of your repository, and at this point you should make a backup of it to ensure you don’t lose anything.
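The backup can be as simple as copying the mirror directory. Here is a minimal sketch; the helper name backup_mirror and the ".backup-<timestamp>" naming are assumptions made for this example, not something BFG prescribes.

```shell
# Hypothetical helper: copy a mirror clone to a timestamped sibling directory
# and print the path of the backup.
backup_mirror() {
  repo=${1%/}                                   # tolerate a trailing slash
  backup="${repo}.backup-$(date +%Y%m%d%H%M%S)"
  cp -R "$repo" "$backup" && printf '%s\n' "$backup"
}
```

Calling backup_mirror some-big-repo.git before running the BFG leaves an untouched copy to fall back on.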

Now you can run the BFG to clean your repository up:

$ java -jar bfg.jar --strip-biggest-blobs 500 some-big-repo.git

The BFG will update your commits and all branches and tags so they are clean, but it doesn’t physically delete the unwanted stuff. Examine the repo to make sure your history has been updated, and then use the standard git gc command to strip out the unwanted dirty data, which Git will now recognise as surplus to requirements:

$ cd some-big-repo.git
$ git reflog expire --expire=now --all && git gc --prune=now --aggressive
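To see whether the gc step actually reclaimed space, you can compare the packed size before and after it. The helper below is a sketch (its name is made up); it reads the size-pack field that git count-objects -v reports in KiB.

```shell
# Hypothetical helper: print the repository's packed size in KiB, taken from
# the "size-pack" field of `git count-objects -v`.
pack_size_kib() {
  git count-objects -v | awk '$1 == "size-pack:" { print $2 }'
}
```

Run pack_size_kib inside some-big-repo.git once before and once after the reflog expire / gc command and compare the two numbers.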

Finally, once you’re happy with the updated state of your repo, push it back up (note that because your clone command used the --mirror flag, this push will update all refs on your remote server):

$ git push

At this point, you’re ready for everyone to ditch their old copies of the repo and do fresh clones of the nice, new pristine data. It’s best to delete all old clones, as they’ll have dirty history that you don’t want to risk pushing back into your newly cleaned repo.

Examples

In all these examples bfg is an alias for java -jar bfg.jar.
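A shell function works just as well as an alias. In this sketch the BFG_JAR variable is an assumption of the example: it lets you point at the jar wherever you saved it, and falls back to bfg.jar in the current directory.

```shell
# Wrapper so that `bfg ...` runs `java -jar <jar> ...`.
# BFG_JAR is a variable name assumed for this sketch; it defaults to ./bfg.jar.
bfg() {
  java -jar "${BFG_JAR:-bfg.jar}" "$@"
}
```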

Delete all files named ‘id_rsa’ or ‘id_dsa’ :

$ bfg --delete-files id_{dsa,rsa}  my-repo.git

Remove all blobs bigger than 1 megabyte :

$ bfg --strip-blobs-bigger-than 1M  my-repo.git

Replace all passwords listed in a file (prefix lines ‘regex:’ or ‘glob:’ if required) with ***REMOVED*** wherever they occur in your repository :

$ bfg --replace-text passwords.txt  my-repo.git
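As a rough illustration of what such a file might look like, the sketch below writes a passwords.txt with one expression per line: a plain line to be matched literally and a line using the ‘regex:’ prefix mentioned above. Both entries are invented for this example.

```shell
# Write an example expression file for --replace-text.
# Both entries are invented for illustration.
cat > passwords.txt <<'EOF'
hunter2
regex:api_key=[0-9a-f]{32}
EOF
```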

Remove all folders or files named ‘.git’ – a reserved filename in Git. These often become a problem when migrating to Git from other source-control systems like Mercurial :

$ bfg --delete-folders .git --delete-files .git  --no-blob-protection  my-repo.git

Your current files are sacred…

The BFG treats you like a reformed alcoholic: you’ve made some mistakes in the past, but now you’ve cleaned up your act. Thus the BFG assumes that your latest commit is a good one, with none of the dirty files you want removing from your history still in it. This assumption by the BFG protects your work, and gives you peace of mind knowing that the BFG is only changing your repo history, not meddling with the current files of your project.

By default the HEAD branch is protected, and while its history will be cleaned, the very latest commit (the ‘tip’) is a protected commit and its file-hierarchy won’t be changed at all.

If you want to protect the tips of several branches or tags (not just HEAD), just name them for the BFG:

$ bfg --strip-biggest-blobs 100 --protect-blobs-from master,maint,next repo.git

Note:

  • Cleaning Git repos is about completely eradicating bad stuff from history. If something ‘bad’ (like a 10MB file, when you’re specifying --strip-blobs-bigger-than 5M) is in a protected commit, it won’t be deleted – it’ll persist in your repository, even if the BFG deletes it from earlier commits. If you want the BFG to delete something you need to make sure your current commits are clean.
  • Note that although the files in those protected commits won’t be changed, when those commits follow on from earlier dirty commits, their commit ids will change, to reflect the changed history – only the SHA-1 id of the filesystem-tree will remain the same.
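The second note can be checked by hand: record the commit id and the tree id of a protected tip before running the BFG, then compare afterwards. A small sketch, with a made-up helper name, using the standard ^{tree} revision syntax:

```shell
# Print the id of the tree (the file hierarchy) that a commit points to.
# Defaults to HEAD; the helper name is made up for this sketch.
tip_tree() {
  git rev-parse "${1:-HEAD}^{tree}"
}
```

If the history behind the tip was rewritten, git rev-parse HEAD gives a different value than before, while tip_tree prints the same id as it did before the run.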

Faster…

The BFG is 10 – 720x faster than git-filter-branch, turning an overnight job into one that takes less than ten minutes.

BFG’s performance advantage is due to these factors:

  • The approach of git-filter-branch is to step through every commit in your repository, examining the complete file-hierarchy of each one. For the intended use-cases of The BFG this is wasteful, as we don’t care where in a file structure a ‘bad’ file exists – we just want it dealt with. Inherent in the nature of Git is that every file and folder is represented precisely once (and given a unique SHA-1 hash-id). The BFG takes advantage of this to process each and every file & folder exactly once – no need for extra work.
  • Taking advantage of the great support for parallelism in Scala and the JVM, the BFG does multi-core processing by default – the work of cleaning your Git repository is spread over every single core in your machine and typically consumes 100% of capacity for a substantial portion of the run.
  • All action takes place in a single process (the process of the JVM), so doesn’t require the frequent fork-and-exec-ing needed by git-filter-branch’s mix of Bash and C code.
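The first point relies on Git addressing objects by content: the id of a file's blob depends on its bytes, not on where the file sits in the tree. This is easy to see with the standard git hash-object command; the helper name below is just for this sketch.

```shell
# Print the blob id Git would assign to the content of the file given as $1.
# The path itself plays no part in the id.
blob_id() {
  git hash-object "$1"
}
```

Copy a file to a different directory under a different name and blob_id prints the same id for both copies, which is why a tool can deal with each distinct blob once no matter how many commits or paths refer to it.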
