Thursday, January 31, 2013

Removing Large Files from Git

I've used Git much as I used CVS, SVN and every other version control system before it: I check in binary files.  Whether they're DLLs, JARs or gems, I've checked them in.  Pretty much everywhere I've worked, people have said this is a problem, and I've tended to argue that it solves a lot of problems.  One of the main ones is what happens when the repositories, and the management software that goes along with them, fail: I still have a working system from a checkout/clone/etc.

The price of this is that sometimes you need to clean up old binary files.  Git makes this complicated, but once you've found a couple of tools it's relatively straightforward.

Stack Overflow has Perl and Ruby scripts that wrap a few git commands to list all files in a repository that are above a certain size: "Find files in git repo over x megabytes, that don't exist in HEAD".  The main gist of it is (in Ruby):

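# Usage: ruby script.rb [revision] [threshold in MB] (the defaults below are illustrative)
head = ARGV[0] || 'HEAD'
Megabyte = 1000 ** 2
threshold = (ARGV[1] || 0.1).to_f * Megabyte
big_files = {}
IO.popen("git rev-list #{head}", 'r') do |rev_list|
  rev_list.each_line do |commit|
    commit.chomp!  # drop the trailing newline before interpolating into commands
    for object in `git ls-tree -zrl #{commit}`.split("\0")
      bits, type, sha, size, path = object.split(/\s+/, 5)
      size = size.to_i
      big_files[sha] = [path, size, commit] if size >= threshold
    end
  end
end

big_files.each do |sha, (path, size, commit)|
  where = `git show -s #{commit} --format='%h: %cr'`.chomp
  puts "%4.1fM\t%s\t(%s)" % [size.to_f / Megabyte, path, where]
end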
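
The parsing step is the part of the script that is easiest to get wrong, so here is a minimal sketch of it in isolation. It assumes the input is what `git ls-tree -zrl <commit>` prints, that is, NUL-separated entries of the form "<mode> <type> <sha> <size>, then a tab, then <path>". The helper names parse_ls_tree_entry and blobs_over are hypothetical, and splitting on the tab first is a small variation on the script above rather than what the Stack Overflow version does.

```ruby
# Hypothetical helpers for illustration; not part of the Stack Overflow script.

# Parse one entry of `git ls-tree -zrl` output into a hash.
def parse_ls_tree_entry(entry)
  meta, path = entry.split("\t", 2)     # the path is everything after the first tab
  mode, type, sha, size = meta.split    # size is space-padded, so split on whitespace runs
  { mode: mode, type: type, sha: sha, size: size.to_i, path: path }
end

# Return the blob entries whose size is at least threshold_bytes.
def blobs_over(ls_tree_output, threshold_bytes)
  ls_tree_output.split("\0")
                .map { |entry| parse_ls_tree_entry(entry) }
                .select { |e| e[:type] == 'blob' && e[:size] >= threshold_bytes }
end
```

Feeding it the output for a single commit, for example blobs_over(`git ls-tree -zrl HEAD`, 1_000_000), should give the blobs of roughly a megabyte or more in that commit, including paths that contain spaces.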

Then to remove the old files from the repository:

git filter-branch --force --index-filter 'git rm --cached --ignore-unmatch [full file path]' -- --all
git push --force
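
One thing to watch with the command above: git filter-branch evaluates the --index-filter string with the shell, so a path containing spaces or other special characters needs quoting. A minimal sketch of building the command from Ruby, assuming the standard Shellwords module and a hypothetical helper name filter_branch_argv:

```ruby
require 'shellwords'

# Hypothetical helper: build the argument list for the filter-branch call above.
# The path is shell-escaped because filter-branch evals the index filter in a shell.
def filter_branch_argv(path)
  index_filter = "git rm --cached --ignore-unmatch #{Shellwords.escape(path)}"
  ['git', 'filter-branch', '--force', '--index-filter', index_filter, '--', '--all']
end
```

Calling system(*filter_branch_argv('lib/old tool.dll')) would then run it. Passing the arguments as a list means there is no outer shell to quote for, only the one that filter-branch itself uses.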

Then to clean up the space they used in the git repository:

rm -rf .git/refs/original/
rm -rf .git/logs/
git reflog expire --expire=now --all
git gc --aggressive --prune=now
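
To check that the cleanup actually freed something, you can compare the output of `git count-objects -v` before and after. A rough sketch of reading that output in Ruby, assuming the usual "key: value" lines with sizes reported in KiB. The helper names parse_count_objects and repo_size_kib are hypothetical.

```ruby
# Hypothetical helpers for illustration.

# Turn the "key: value" lines printed by `git count-objects -v` into a hash of integers.
def parse_count_objects(output)
  output.each_line
        .map { |line| line.chomp.split(': ', 2) }
        .select { |pair| pair.length == 2 }
        .map { |key, value| [key, value.to_i] }
        .to_h
end

# Loose objects ("size") plus packed objects ("size-pack"), both in KiB.
def repo_size_kib(stats)
  stats.fetch('size', 0) + stats.fetch('size-pack', 0)
end
```

Run repo_size_kib(parse_count_objects(`git count-objects -v`)) once before the filter-branch step and once after the gc, and the difference is roughly the space reclaimed.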