How To Retroactively Annex Files Already in a Git Repo
UPDATE: With current versions of git, I no longer recommend git annex or git LFS unless you really need to store your large files on a separate server from your git repository. Just add your large files to git like any other file, and when you clone, you can avoid downloading every historical version of every file with git clone --filter=blob:none and use git as normal.
In my last post I talked about how surprisingly easy it is to use git annex to manage your large binary files (or even small ones). In this post, I'm going to show how hard it is to go back and fix the mistake you made when you decided not to learn and use git annex at the start of your project. Learn from my mistake!
When I started developing the website for my business, I figured that editing history in git is easy, and I could just check in binary files (like the images) for now and fix it later. Well, it was starting to get a little sluggish, and I had some bigger binary files that I wanted to start keeping with the website code, so I figured the time had come. Once I decided on git annex, it was time to go edit that history.
First Tries: filter-branch, filter-repo
There is a very old page of instructions for doing this using git filter-branch. The first thing I noticed when I tried that was this message from git:
WARNING: git-filter-branch has a glut of gotchas generating mangled history rewrites. Hit Ctrl-C before proceeding to abort, then use an alternative filtering tool such as 'git filter-repo' (https://github.com/newren/git-filter-repo/) instead. See the filter-branch manual page for more details; to squelch this warning, set FILTER_BRANCH_SQUELCH_WARNING=1.
Yikes! A warning like that from a tool (git) that is already known for its gotchas is one I decided to take seriously. Besides, I'm always down to try the new hotness, so I started reading about git-filter-repo. The more I read and experimented, and even dug into the source code, the more I came to understand that, sadly, it could not do what I needed. Maybe someone will read this and correct me.
Success with git rebase --interactive
Not seeing a nice pre-built tool or command that could do this for me, I set out to manually edit the repository history using good ol' git rebase --interactive. First, I had to find all the binary files that are in the repo (not just the ones in the current revision). Here's how I did it:
    # The --stat=1000 is so it doesn't truncate anything
    git log --stat=1000 | grep Bin | sort | uniq > binary-files
Note the comment. Isn't it cute that git log truncates long lines even when stdout is not connected to your terminal? There are lots of little annoying gotchas like that throughout this process. Makes me miss mercurial, but don't worry, I will try not to mention mercurial again.
Now, you'll still have duplicates in binary-files because of the other stuff that git log --stat spits out on each line. I personally used some emacs commands to remove everything but the filename from each line of the binary-files file, and then did a sort and uniq again.
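If you'd rather not do that cleanup by hand in an editor, here is a minimal Python sketch of the same idea. It assumes the stat lines look like " path | Bin 0 -> 1234 bytes", and the function names binary_paths and binary_paths_in_repo are made up for illustration. Renames, which git prints with "=>" in the path, are not handled.

```python
import subprocess
from typing import List


def binary_paths(stat_output: str) -> List[str]:
    """Return the sorted, de-duplicated paths that `git log --stat` marks as Bin."""
    paths = set()
    for line in stat_output.splitlines():
        # Stat lines look like " path/to/file.jpg | Bin 0 -> 1234 bytes".
        # rpartition splits on the last " | ", so the path keeps any earlier ones.
        name, sep, change = line.rpartition(" | ")
        if sep and change.lstrip().startswith("Bin"):
            paths.add(name.strip())
    return sorted(paths)


def binary_paths_in_repo() -> List[str]:
    """Run `git log --stat=1000` in the current directory and parse its output."""
    result = subprocess.run(
        ["git", "log", "--stat=1000"],
        capture_output=True, text=True, check=True,
    )
    return binary_paths(result.stdout)
```

Calling binary_paths_in_repo() from the top of the repository and writing one path per line to binary-files should give roughly the same list as the grep, sort, uniq, and editor steps above.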
Next, I had to find each commit that modified any of these binary files. Here's how I did that:
    # Read one path per line and quote it, so paths containing spaces survive
    while IFS= read -r file; do git log --pretty=oneline --follow -- "$file" >> commits; done < binary-files
Then I did another sort and uniq on that. Luckily there were only about 15 commits. Phew.
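The loop and the sort/uniq pass can also be folded into one step. The following is a rough sketch, not what I actually ran: the helper names unique_commit_hashes and commits_touching are hypothetical, and it assumes the full-length hashes that --pretty=oneline prints.

```python
import subprocess
from typing import Iterable, List


def unique_commit_hashes(oneline_log: str) -> List[str]:
    """De-duplicate `git log --pretty=oneline` lines, keeping first-seen order."""
    seen = {}
    for line in oneline_log.splitlines():
        if line.strip():
            # Each line is "<hash> <subject>"; only the hash matters here.
            commit_hash = line.split(" ", 1)[0]
            seen.setdefault(commit_hash, None)
    return list(seen)


def commits_touching(paths: Iterable[str]) -> List[str]:
    """Collect every commit that touched any of the given paths."""
    log = ""
    for path in paths:
        result = subprocess.run(
            ["git", "log", "--pretty=oneline", "--follow", "--", path],
            capture_output=True, text=True, check=True,
        )
        log += result.stdout
    return unique_commit_hashes(log)
```

Passing the paths from binary-files to commits_touching() should return each affected commit exactly once, which is the list you then mark for editing in the rebase.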
Next I tried to find the earliest commit in the list I had, but that was a pain (don't…mention…mercurial…), so I just ran git rebase --interactive and gave it one of the first commits I made in the repository. I actually used emacs magit to start the rebase, but the surgery required throughout the process made me drop to the command line for most of it. magit did make it really easy to mark the 15 commits from my commits file with an e though.
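For what it's worth, finding the earliest commit in the list doesn't have to be a pain. One possible approach, sketched here with hypothetical helper names, is to ask git rev-list for the history oldest-first and take the first hash that also appears in the commits file. It assumes both lists use full-length hashes.

```python
import subprocess
from typing import Iterable, List, Optional


def earliest_commit(oldest_first: List[str], wanted: Iterable[str]) -> Optional[str]:
    """Return the first hash in oldest-first history that is also in `wanted`."""
    wanted_set = set(wanted)
    for commit_hash in oldest_first:
        if commit_hash in wanted_set:
            return commit_hash
    return None


def history_oldest_first() -> List[str]:
    """List every commit reachable from HEAD, parents before children."""
    result = subprocess.run(
        ["git", "rev-list", "--topo-order", "--reverse", "HEAD"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.split()
```

The rebase then needs to start from the parent of that commit (git rebase --interactive <hash>~1), or from git rebase --interactive --root if it happens to be the very first commit in the repository.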
OK, once the rebase got rolling I ran into a few different scenarios: commits that added a new binary file, commits that deleted binary files, commits that modified binary files, and a commit that moved binary files.
Added binary files
When a binary file was added, git would act the way I have always seen an interactive rebase work and show the normal message:
    Stopped at 53fc550... some commit message here
    You can amend the commit now, with
    git commit --amend
    Once you are satisfied with your changes, run
    git rebase --continue
In that case I did this:
    git show --stat=1000 # to see binary (Bin) files
    git rm --cached <the-binary-files>
    git add <the-binary-files> # git annex will annex them
    git commit --amend
    git rebase --continue
Easy peasy, as long as you have set up annex like my previous post explains so that annexing happens automatically.
Deleted binary files
When a binary file was deleted, git would throw up a message like this:
    $ git rebase --continue
    [detached HEAD 130bcc4] banner on each page now
     21 files changed, 190 insertions(+), 42 deletions(-)
     create mode 100644 msd/webshop/static/webshop/img/common/adi-goldstein-EUsVwEOsblE-unsplash.jpg
     create mode 100644 msd/webshop/static/webshop/img/common/alexandre-debieve-FO7JIlwjOtU-unsplash.jpg
     delete mode 100644 msd/webshop/static/webshop/img/common/file-icons.png
     create mode 100644 msd/webshop/static/webshop/img/common/kevin-ku-w7ZyuGYNpRQ-unsplash.jpg
     create mode 100644 msd/webshop/static/webshop/img/common/levi-saunders-1nz-KjRdg-s-unsplash.jpg
     create mode 100644 msd/webshop/static/webshop/img/common/max-duzij-qAjJk-un3BI-unsplash.jpg
     create mode 100644 msd/webshop/static/webshop/img/common/nick-fewings-ZJAnGFg-rM4-unsplash.jpg
     create mode 100644 msd/webshop/static/webshop/img/common/umberto-jXd2FSvcRr8-unsplash.jpg
     create mode 100644 msd/webshop/static/webshop/img/common/yogesh-phuyal-mjwGKmwkDDA-unsplash.jpg
    CONFLICT (modify/delete): msd/webshop/static/webshop/img/common/nick-fewings-ZJAnGFg-rM4-unsplash.jpg deleted in 90d71fb... refactored banners in pricing.css to reduce code duplication and modified in HEAD. Version HEAD of msd/webshop/static/webshop/img/common/nick-fewings-ZJAnGFg-rM4-unsplash.jpg left in tree.
    error: could not apply 90d71fb... refactored banners in pricing.css to reduce code duplication
    Resolve all conflicts manually, mark them as resolved with
    "git add/rm <conflicted_files>", then run "git rebase --continue".
    You can instead skip this commit: run "git rebase --skip".
    To abort and get back to the state before "git rebase", run "git rebase --abort".
    Could not apply 90d71fb... refactored banners in pricing.css to reduce code duplication
I guess the message was extra verbose in this case because I had added some new files too. The key part in all that was: "msd/webshop/static/webshop/img/common/nick-fewings-ZJAnGFg-rM4-unsplash.jpg deleted…" Here's what you do in this case:
    git rm msd/webshop/static/webshop/img/common/nick-fewings-ZJAnGFg-rM4-unsplash.jpg
    git diff --stat=1000 --staged # to find full paths for any Bin files
    git restore --staged <binary-files>
    git add <binary-files>
    git diff --stat --staged # just to double check there are no Bin files now
    git rebase --continue
Looks so simple (heh), but it took me a decent amount of web searching and experimentation to figure it out. All for you, dear reader, all for you.
Modified binary files
Here's one where I resized several images. Git helpfully uttered:
    $ git rebase --continue
    [detached HEAD 7dfb28c] refactored banners in pricing.css to reduce code duplication
     4 files changed, 28 insertions(+), 75 deletions(-)
     create mode 100644 msd/webshop/static/webshop/img/common/connor-betts-QK6Iwzd5MhE-unsplash.jpg
     delete mode 100644 msd/webshop/static/webshop/img/common/nick-fewings-ZJAnGFg-rM4-unsplash.jpg
    warning: Cannot merge binary files: msd/webshop/static/webshop/img/common/yogesh-phuyal-mjwGKmwkDDA-unsplash.jpg (HEAD vs. a90710f... scaled images down to max width of 1920 pixels)
    warning: Cannot merge binary files: msd/webshop/static/webshop/img/common/umberto-jXd2FSvcRr8-unsplash.jpg (HEAD vs. a90710f... scaled images down to max width of 1920 pixels)
    warning: Cannot merge binary files: msd/webshop/static/webshop/img/common/max-duzij-qAjJk-un3BI-unsplash.jpg (HEAD vs. a90710f... scaled images down to max width of 1920 pixels)
    warning: Cannot merge binary files: msd/webshop/static/webshop/img/common/levi-saunders-1nz-KjRdg-s-unsplash.jpg (HEAD vs. a90710f... scaled images down to max width of 1920 pixels)
    warning: Cannot merge binary files: msd/webshop/static/webshop/img/common/kevin-ku-w7ZyuGYNpRQ-unsplash.jpg (HEAD vs. a90710f... scaled images down to max width of 1920 pixels)
    warning: Cannot merge binary files: msd/webshop/static/webshop/img/common/connor-betts-QK6Iwzd5MhE-unsplash.jpg (HEAD vs. a90710f... scaled images down to max width of 1920 pixels)
    warning: Cannot merge binary files: msd/webshop/static/webshop/img/common/alexandre-debieve-FO7JIlwjOtU-unsplash.jpg (HEAD vs. a90710f... scaled images down to max width of 1920 pixels)
    warning: Cannot merge binary files: msd/webshop/static/webshop/img/common/adi-goldstein-EUsVwEOsblE-unsplash.jpg (HEAD vs. a90710f... scaled images down to max width of 1920 pixels)
    Auto-merging msd/webshop/static/webshop/img/common/yogesh-phuyal-mjwGKmwkDDA-unsplash.jpg
    CONFLICT (content): Merge conflict in msd/webshop/static/webshop/img/common/yogesh-phuyal-mjwGKmwkDDA-unsplash.jpg
    Auto-merging msd/webshop/static/webshop/img/common/umberto-jXd2FSvcRr8-unsplash.jpg
    CONFLICT (content): Merge conflict in msd/webshop/static/webshop/img/common/umberto-jXd2FSvcRr8-unsplash.jpg
    Auto-merging msd/webshop/static/webshop/img/common/max-duzij-qAjJk-un3BI-unsplash.jpg
    CONFLICT (content): Merge conflict in msd/webshop/static/webshop/img/common/max-duzij-qAjJk-un3BI-unsplash.jpg
    Auto-merging msd/webshop/static/webshop/img/common/levi-saunders-1nz-KjRdg-s-unsplash.jpg
    CONFLICT (content): Merge conflict in msd/webshop/static/webshop/img/common/levi-saunders-1nz-KjRdg-s-unsplash.jpg
    Auto-merging msd/webshop/static/webshop/img/common/kevin-ku-w7ZyuGYNpRQ-unsplash.jpg
    CONFLICT (content): Merge conflict in msd/webshop/static/webshop/img/common/kevin-ku-w7ZyuGYNpRQ-unsplash.jpg
    Auto-merging msd/webshop/static/webshop/img/common/connor-betts-QK6Iwzd5MhE-unsplash.jpg
    CONFLICT (content): Merge conflict in msd/webshop/static/webshop/img/common/connor-betts-QK6Iwzd5MhE-unsplash.jpg
    Auto-merging msd/webshop/static/webshop/img/common/alexandre-debieve-FO7JIlwjOtU-unsplash.jpg
    CONFLICT (content): Merge conflict in msd/webshop/static/webshop/img/common/alexandre-debieve-FO7JIlwjOtU-unsplash.jpg
    Auto-merging msd/webshop/static/webshop/img/common/adi-goldstein-EUsVwEOsblE-unsplash.jpg
    CONFLICT (content): Merge conflict in msd/webshop/static/webshop/img/common/adi-goldstein-EUsVwEOsblE-unsplash.jpg
    error: could not apply a90710f... scaled images down to max width of 1920 pixels
    Resolve all conflicts manually, mark them as resolved with
    "git add/rm <conflicted_files>", then run "git rebase --continue".
    You can instead skip this commit: run "git rebase --skip".
    To abort and get back to the state before "git rebase", run "git rebase --abort".
    Could not apply a90710f... scaled images down to max width of 1920 pixels
The trick to fixing this is to notice which commit it's trying to let you edit (its short hash is in the last line of that message) and then check out that commit's version of each unmerged binary file it mentions, like so:
    git status # to get the names of the unmerged binary files
    git checkout a90710f <filenames>
Now you can do the same thing you did for the deleted file:
    git restore --staged <filenames>
    git add <filenames>
    git diff --stat --staged # just to double check there are no Bin files now
    git rebase --continue
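With eight conflicted images, copying the names out of git status by hand gets old fast. Here is a small sketch of how that step could be scripted. It reads the short format printed by git status --porcelain, where unmerged entries carry one of a handful of two-letter codes; the helper names are invented for this example, and paths that git quotes (for example, ones with unusual characters) are not unquoted here.

```python
import subprocess
from typing import List

# Two-letter status codes that `git status --porcelain` uses for unmerged paths.
UNMERGED_CODES = {"DD", "AU", "UD", "UA", "DU", "AA", "UU"}


def unmerged_paths(porcelain_output: str) -> List[str]:
    """Pick the unmerged paths out of `git status --porcelain` output."""
    return [
        line[3:]  # format is "XY <path>", so the path starts at column 3
        for line in porcelain_output.splitlines()
        if line[:2] in UNMERGED_CODES
    ]


def checkout_from(commit: str, paths: List[str]) -> None:
    """Take the given commit's version of each path (git checkout <commit> -- ...)."""
    if paths:
        subprocess.run(["git", "checkout", commit, "--", *paths], check=True)
```

In the example above, you would feed the output of git status --porcelain to unmerged_paths() and pass the result to checkout_from("a90710f", ...), then carry on with the restore, add, and rebase --continue steps.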
Moved binary files
When I ran git log --follow to find all the commits that modified binary files, it flagged one where I had moved them. I'm not sure I actually had to edit that commit and I wonder if I would not have had this weird situation if I had not edited it. But for completeness, here's what I saw. Git rebase stopped to let me edit the commit and git annex printed out this message for every file that was moved:
    git-annex: git status will show <filename> to be modified, since content availability has changed and git-annex was unable to update the index. This is only a cosmetic problem affecting git status; git add, git commit, etc won't be affected. To fix the git status display, you can run:
    git update-index -q --refresh <filename>
Sounds…quite weird. But git rebase would not continue until I did run the suggested command:
    git update-index -q --refresh <filenames>
    git rebase --continue
Dealing with Tags
Once the rebase was done I noticed that the tags I had all still pointed to the original commits. Oops. A quick internet search led me to this post about rebasing and moving tags to the new commits (written by a former co-worker, it just so happens). Too bad I didn't look for that before I rebased. I thought about redoing the whole rebase, but in the end I just wrote my own quick Python script (using snippets from Nacho's) to take care of my specific situation. It matches each tag to a rewritten commit by comparing commit subject lines. Here it is:
    #! /usr/bin/env python
    from subprocess import run, PIPE

    # Each line of show-ref output is "<hash> refs/tags/<name>"
    tags = run(['git', 'show-ref', '--tags'], stdout=PIPE).stdout.decode('utf-8').splitlines()
    tags_with_comments = {}
    for tag in tags:
        tag_hash, tag_name = tag.split(' ')
        tag_name = tag_name.split('/')[-1]
        # The last line of the output is the subject of the tagged commit
        comment = run(['git', '--no-pager', 'show', '-s', '--format=%s', tag_hash], stdout=PIPE).stdout.decode('utf-8').splitlines()[-1]
        print(f'{tag_name}: {comment}')
        tags_with_comments[tag_name] = comment

    # Find the rewritten commit with the same subject and move the tag to it
    commits = run(['git', 'log', '--oneline'], stdout=PIPE).stdout.decode('utf-8').splitlines()
    for tag_name in tags_with_comments:
        for c in commits:
            commit_hash = c.split(' ')[0]
            comment = c.split(' ')[1:]
            comment = ' '.join(comment)
            if comment == tags_with_comments[tag_name]:
                run(['git', 'tag', '--force', tag_name, commit_hash])
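One caveat with matching on subject lines: if two commits share the same subject, the script above moves the tag to each match in turn, so the last match in git log order (the oldest one) wins. That was fine for my repo, but a more cautious variant could work out the moves first and refuse to guess. Here is a hedged sketch of that idea; plan_tag_moves is a made-up name, and it expects the tag subjects and the (hash, subject) pairs to have been collected already, for example the way the script above does.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def plan_tag_moves(
    tag_subjects: Dict[str, str],
    commits: List[Tuple[str, str]],
) -> Tuple[Dict[str, str], Dict[str, List[str]]]:
    """Match tags to commits by subject line.

    Returns (moves, skipped): `moves` maps a tag to its single matching
    commit hash, and `skipped` maps a tag to its candidate hashes when
    there were zero or several matches.
    """
    by_subject = defaultdict(list)
    for commit_hash, subject in commits:
        by_subject[subject].append(commit_hash)

    moves, skipped = {}, {}
    for tag_name, subject in tag_subjects.items():
        matches = by_subject.get(subject, [])
        if len(matches) == 1:
            moves[tag_name] = matches[0]
        else:
            skipped[tag_name] = matches
    return moves, skipped
```

Each entry in moves can then be applied with git tag --force <tag> <hash> as before, and anything in skipped gets sorted out by hand.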
Clean Up and Results
Well, with all that done, it was time to see how it all turned out. My original git repo was sitting at about 1.4 GB. This new repo was…3 GB!? Something wasn't right. Here are some steps I took to clean it up after making sure there weren't any old branches or remotes lying around:
    git clean -fdx
    git annex fsck
    git fsck
    git reflog expire --verbose --expire=0 --all
    git gc --prune=0
The git clean command showed that I had a weird leftover .git directory in another directory somehow, so I deleted that. I don't think the fsck commands really did anything, but the gc definitely did. Size was now down to 985 MB. Much better. Wait a minute, what if I did a git gc on the original repo? Its size went down to 984 MB. Oh shoot. I guess it makes sense though: if both git and git annex are storing full versions of each binary file, they would end up the same size. The real win is the faster git operations, especially clones.
A local git clone now happens in the blink of an eye, and its size is only 153 MB. Now, that's a little unfair because it doesn't have any of the binary files. After a git annex get to get the binary files for the current checkout it goes up to 943 MB. Not a huge savings, but it only gets better as time goes on and more edits happen. Right? This was all worth it, wasn't it?!
Let me know in the comments if this is helpful, hurtful, or if I did this totally wrong.