How To Retroactively Annex Files Already in a Git Repo


UPDATE: With current versions of git, I no longer recommend git annex or git LFS unless you really need to store your large files on a separate server from your git repository. Just add your large files to git like any other file and when you clone, you can avoid downloading the full repository history with git clone --filter=blob:none and use git as normal.

In my last post I talked about how surprisingly easy it is to use git annex to manage your large binary files (or even small ones). In this post, I'm going to show how hard it is to go back and fix the mistake you made when you decided not to learn and use git annex at the start of your project. Learn from my mistake!

When I started developing the website for my business, I figured that editing history in git is easy, and I could just check in binary files (like the images) for now and fix it later. Well, it was starting to get a little sluggish, and I had some bigger binary files that I wanted to start keeping with the website code, so I figured the time had come. Once I decided on git annex, it was time to go edit that history.

First Tries: filter-branch, filter-repo

There is a very old page of instructions for doing this using git filter-branch. The first thing I noticed when I tried that was this message from git:

WARNING: git-filter-branch has a glut of gotchas generating mangled history
         rewrites.  Hit Ctrl-C before proceeding to abort, then use an
         alternative filtering tool such as 'git filter-repo'
         (https://github.com/newren/git-filter-repo/) instead.  See the
         filter-branch manual page for more details; to squelch this warning,
         set FILTER_BRANCH_SQUELCH_WARNING=1.

Yikes! A warning like that from a tool (git) that is already known for its gotchas is one I decided to take seriously. Besides, I'm always down to try the new hotness, so I started reading about git-filter-repo. The more I read and experimented, even dug into the source code, the more I came to understand that it could not do what I needed, sadly. Maybe someone will read this and correct me.

Success with git rebase --interactive

Not seeing a nice pre-built tool or command that could do this for me, I set out to manually edit the repository history using good ol' git rebase --interactive. First, I had to find all the binary files that were ever in the repo (not just the ones in the current revision). Here's how I did it:

# The --stat=1000 is so it doesn't truncate anything
git log --stat=1000 | grep Bin | sort | uniq > binary-files

Note the comment. Isn't it cute that git log truncates long lines even when stdout is not connected to your terminal? There are lots of little annoying gotchas like that throughout this process. Makes me miss mercurial, but don't worry, I will try not to mention mercurial again.

Now, you'll still have duplicates in binary-files because of the other stuff that git log --stat spits out on each line (the size change after the filename differs from commit to commit). I personally used some emacs commands to remove everything but the filename from each line of the binary-files file, and then did a sort and uniq again.
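If you'd rather not do that cleanup by hand, here is a minimal Python sketch of the same idea. It assumes the binary rows of git log --stat look like " path | Bin 0 -> 5120 bytes"; the function name binary_paths and the sample paths are made up for illustration, and renamed files (which show up as "old => new") are not handled.

```python
import re

# Matches binary-file rows of `git log --stat`, for example:
#   " static/img/logo.png | Bin 0 -> 5120 bytes"
BIN_ROW = re.compile(r"^\s*(?P<path>.+?)\s+\|\s+Bin\b")


def binary_paths(stat_output: str) -> list[str]:
    """Return the sorted, de-duplicated paths that git reported as Bin."""
    paths = set()
    for line in stat_output.splitlines():
        match = BIN_ROW.match(line)
        if match:
            paths.add(match.group("path"))
    return sorted(paths)


# Made-up sample output; in practice this would be the text of
# `git log --stat=1000`.
sample = """\
 static/img/logo.png | Bin 0 -> 5120 bytes
 README.md           |   4 ++--
 static/img/logo.png | Bin 5120 -> 4096 bytes
 static/img/hero.jpg | Bin 0 -> 20480 bytes
"""
print(binary_paths(sample))
```

Writing one path per line from that list gives you a clean binary-files file without the second sort and uniq pass.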

Next, I had to find each commit that modified any of these binary files. Here's how I did that:

# Read one path per line so filenames containing spaces survive intact
while IFS= read -r file; do
    git log --pretty=oneline --follow -- "$file" >> commits
done < binary-files

Then I did another sort and uniq on that. Luckily there were only about 15 commits. Phew.

Next I tried to find the earliest commit in the list I had, but that was a pain (don't…mention…mercurial…), so I just ran git rebase --interactive and gave it one of the first commits I made in the repository. I actually used emacs magit to start the rebase, but the surgery required throughout the process made me drop to the command-line for most of it. magit did make it really easy to mark the 15 commits from my commits file with an e though.
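For what it's worth, finding the earliest commit does not have to be painful. A sketch, assuming you have the full hashes from the commits file and the output of git rev-list --reverse HEAD (which lists history oldest first): walk the history and stop at the first hash that is in your list. The function names and the short hashes below are hypothetical.

```python
def unique_hashes(oneline_lines):
    """Take `git log --pretty=oneline` lines and return each hash once, in order."""
    seen = []
    for line in oneline_lines:
        if not line.strip():
            continue
        commit_hash = line.split(" ", 1)[0]
        if commit_hash not in seen:
            seen.append(commit_hash)
    return seen


def earliest_commit(history_oldest_first, wanted):
    """Return the first commit in the history that is also in `wanted`, or None."""
    wanted = set(wanted)
    for commit_hash in history_oldest_first:
        if commit_hash in wanted:
            return commit_hash
    return None


# Made-up data: `history` stands in for `git rev-list --reverse HEAD`,
# `commits_file` for the contents of the commits file.
history = ["a1", "b2", "c3", "d4"]
commits_file = ["c3 scaled images", "b2 added banner", "c3 scaled images"]
print(earliest_commit(history, unique_hashes(commits_file)))
```

The rebase then starts from the parent of that commit (git rebase --interactive <hash>^), or from --root if it happens to be the very first commit.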

OK, once the rebase got rolling I ran into a few different scenarios: commits that added a new binary file, commits that deleted binary files, commits that modified binary files, and a commit that moved binary files.

Added binary files

When a binary file was added, git would act the way I have always seen rebase interactive work and show the normal thing:

Stopped at 53fc550...  some commit message here
You can amend the commit now, with

  git commit --amend 

Once you are satisfied with your changes, run

  git rebase --continue

In that case I did this:

git show --stat=1000 # to see binary (Bin) files
git rm --cached <the-binary-files>
git add <the-binary-files> # git annex will annex them
git commit --amend
git rebase --continue

Easy peasy, as long as you have set up annex like my previous post explains so that annexing happens automatically.
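If you want to double check that a file really got annexed before you continue, one option is to look at what git actually staged. This is only a sketch and rests on an assumption about git annex: that an annexed file is stored in git as a tiny blob (a pointer file or symlink target) whose first line contains "/annex/objects/". The function name and the key below are made up.

```python
def looks_like_annex_pointer(blob: bytes, max_size: int = 32 * 1024) -> bool:
    """Guess whether a staged blob is a git-annex pointer rather than real content.

    Assumption: annexed files are stored in git as a small blob whose first
    line contains "/annex/objects/".
    """
    if len(blob) > max_size:
        return False
    first_line = blob.split(b"\n", 1)[0]
    return b"/annex/objects/" in first_line


# Made-up examples: a pointer with a fake key, and the start of a real PNG.
pointer = b"/annex/objects/SHA256E-s5120--" + b"0" * 64 + b".png\n"
real_png = b"\x89PNG\r\n\x1a\n" + b"\x00" * 100

print(looks_like_annex_pointer(pointer))   # True
print(looks_like_annex_pointer(real_png))  # False
```

The staged bytes for a path can be read with git cat-file blob :<path> and passed to the function.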

Deleted binary files

When a binary file was deleted, git would throw up a message like this:

$ git rebase --continue
[detached HEAD 130bcc4] banner on each page now
 21 files changed, 190 insertions(+), 42 deletions(-)
 create mode 100644 msd/webshop/static/webshop/img/common/adi-goldstein-EUsVwEOsblE-unsplash.jpg
 create mode 100644 msd/webshop/static/webshop/img/common/alexandre-debieve-FO7JIlwjOtU-unsplash.jpg
 delete mode 100644 msd/webshop/static/webshop/img/common/file-icons.png
 create mode 100644 msd/webshop/static/webshop/img/common/kevin-ku-w7ZyuGYNpRQ-unsplash.jpg
 create mode 100644 msd/webshop/static/webshop/img/common/levi-saunders-1nz-KjRdg-s-unsplash.jpg
 create mode 100644 msd/webshop/static/webshop/img/common/max-duzij-qAjJk-un3BI-unsplash.jpg
 create mode 100644 msd/webshop/static/webshop/img/common/nick-fewings-ZJAnGFg-rM4-unsplash.jpg
 create mode 100644 msd/webshop/static/webshop/img/common/umberto-jXd2FSvcRr8-unsplash.jpg
 create mode 100644 msd/webshop/static/webshop/img/common/yogesh-phuyal-mjwGKmwkDDA-unsplash.jpg
CONFLICT (modify/delete): msd/webshop/static/webshop/img/common/nick-fewings-ZJAnGFg-rM4-unsplash.jpg deleted in 90d71fb... refactored banners in pricing.css to reduce code duplication and modified in HEAD. Version HEAD of msd/webshop/static/webshop/img/common/nick-fewings-ZJAnGFg-rM4-unsplash.jpg left in tree.
error: could not apply 90d71fb... refactored banners in pricing.css to reduce code duplication
Resolve all conflicts manually, mark them as resolved with
"git add/rm <conflicted_files>", then run "git rebase --continue".
You can instead skip this commit: run "git rebase --skip".
To abort and get back to the state before "git rebase", run "git rebase --abort".
Could not apply 90d71fb... refactored banners in pricing.css to reduce code duplication

I guess in this case it was that I had added some new files too, so the message was extra verbose. The key message in all that was: "msd/webshop/static/webshop/img/common/nick-fewings-ZJAnGFg-rM4-unsplash.jpg deleted…" Here's what you do in this case:

git rm msd/webshop/static/webshop/img/common/nick-fewings-ZJAnGFg-rM4-unsplash.jpg
git diff --stat=1000 --staged # to find full paths for any Bin files
git restore --staged <binary-files>
git add <binary-files>
git diff --stat --staged # just to double check there are no Bin files now
git rebase --continue

Looks so simple (heh), but it took me a decent amount of web searching and experimentation to figure it out. All for you, dear reader, all for you.

Modified binary files

Here's one where I had resized several images. Git helpfully uttered:

$ git rebase --continue
[detached HEAD 7dfb28c] refactored banners in pricing.css to reduce code duplication
 4 files changed, 28 insertions(+), 75 deletions(-)
 create mode 100644 msd/webshop/static/webshop/img/common/connor-betts-QK6Iwzd5MhE-unsplash.jpg
 delete mode 100644 msd/webshop/static/webshop/img/common/nick-fewings-ZJAnGFg-rM4-unsplash.jpg
warning: Cannot merge binary files: msd/webshop/static/webshop/img/common/yogesh-phuyal-mjwGKmwkDDA-unsplash.jpg (HEAD vs. a90710f... scaled images down to max width of 1920 pixels)
warning: Cannot merge binary files: msd/webshop/static/webshop/img/common/umberto-jXd2FSvcRr8-unsplash.jpg (HEAD vs. a90710f... scaled images down to max width of 1920 pixels)
warning: Cannot merge binary files: msd/webshop/static/webshop/img/common/max-duzij-qAjJk-un3BI-unsplash.jpg (HEAD vs. a90710f... scaled images down to max width of 1920 pixels)
warning: Cannot merge binary files: msd/webshop/static/webshop/img/common/levi-saunders-1nz-KjRdg-s-unsplash.jpg (HEAD vs. a90710f... scaled images down to max width of 1920 pixels)
warning: Cannot merge binary files: msd/webshop/static/webshop/img/common/kevin-ku-w7ZyuGYNpRQ-unsplash.jpg (HEAD vs. a90710f... scaled images down to max width of 1920 pixels)
warning: Cannot merge binary files: msd/webshop/static/webshop/img/common/connor-betts-QK6Iwzd5MhE-unsplash.jpg (HEAD vs. a90710f... scaled images down to max width of 1920 pixels)
warning: Cannot merge binary files: msd/webshop/static/webshop/img/common/alexandre-debieve-FO7JIlwjOtU-unsplash.jpg (HEAD vs. a90710f... scaled images down to max width of 1920 pixels)
warning: Cannot merge binary files: msd/webshop/static/webshop/img/common/adi-goldstein-EUsVwEOsblE-unsplash.jpg (HEAD vs. a90710f... scaled images down to max width of 1920 pixels)
Auto-merging msd/webshop/static/webshop/img/common/yogesh-phuyal-mjwGKmwkDDA-unsplash.jpg
CONFLICT (content): Merge conflict in msd/webshop/static/webshop/img/common/yogesh-phuyal-mjwGKmwkDDA-unsplash.jpg
Auto-merging msd/webshop/static/webshop/img/common/umberto-jXd2FSvcRr8-unsplash.jpg
CONFLICT (content): Merge conflict in msd/webshop/static/webshop/img/common/umberto-jXd2FSvcRr8-unsplash.jpg
Auto-merging msd/webshop/static/webshop/img/common/max-duzij-qAjJk-un3BI-unsplash.jpg
CONFLICT (content): Merge conflict in msd/webshop/static/webshop/img/common/max-duzij-qAjJk-un3BI-unsplash.jpg
Auto-merging msd/webshop/static/webshop/img/common/levi-saunders-1nz-KjRdg-s-unsplash.jpg
CONFLICT (content): Merge conflict in msd/webshop/static/webshop/img/common/levi-saunders-1nz-KjRdg-s-unsplash.jpg
Auto-merging msd/webshop/static/webshop/img/common/kevin-ku-w7ZyuGYNpRQ-unsplash.jpg
CONFLICT (content): Merge conflict in msd/webshop/static/webshop/img/common/kevin-ku-w7ZyuGYNpRQ-unsplash.jpg
Auto-merging msd/webshop/static/webshop/img/common/connor-betts-QK6Iwzd5MhE-unsplash.jpg
CONFLICT (content): Merge conflict in msd/webshop/static/webshop/img/common/connor-betts-QK6Iwzd5MhE-unsplash.jpg
Auto-merging msd/webshop/static/webshop/img/common/alexandre-debieve-FO7JIlwjOtU-unsplash.jpg
CONFLICT (content): Merge conflict in msd/webshop/static/webshop/img/common/alexandre-debieve-FO7JIlwjOtU-unsplash.jpg
Auto-merging msd/webshop/static/webshop/img/common/adi-goldstein-EUsVwEOsblE-unsplash.jpg
CONFLICT (content): Merge conflict in msd/webshop/static/webshop/img/common/adi-goldstein-EUsVwEOsblE-unsplash.jpg
error: could not apply a90710f... scaled images down to max width of 1920 pixels
Resolve all conflicts manually, mark them as resolved with
"git add/rm <conflicted_files>", then run "git rebase --continue".
You can instead skip this commit: run "git rebase --skip".
To abort and get back to the state before "git rebase", run "git rebase --abort".
Could not apply a90710f... scaled images down to max width of 1920 pixels

The trick to fixing this is to notice which commit it's trying to let you edit, which is in the last line of that message, and then check out that commit's version of each of the unmerged binary files it mentions, like so:

git status # to get the names of the unmerged binary files
git checkout a90710f <filenames>

Now you can do the same thing you did for the deleted file:

git restore --staged <filenames>
git add <filenames>
git diff --stat --staged # just to double check there are no Bin files now
git rebase --continue
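With eight conflicted images, picking the hash and the filenames out of that wall of text gets tedious. Here is a rough sketch that pulls both out of the rebase output, assuming the message format shown above ("Could not apply <hash>..." and "CONFLICT (content): Merge conflict in <path>"). The function names are hypothetical.

```python
import re

# The final "Could not apply <hash>..." line names the commit being replayed.
APPLY_RE = re.compile(r"^Could not apply ([0-9a-f]{7,40})\.\.\.", re.MULTILINE)
# One line per file that git could not merge.
CONFLICT_RE = re.compile(
    r"^CONFLICT \(content\): Merge conflict in (.+)$", re.MULTILINE
)


def parse_rebase_stop(output: str):
    """Return (commit_hash, conflicted_paths) from git rebase's output."""
    match = APPLY_RE.search(output)
    commit_hash = match.group(1) if match else None
    return commit_hash, CONFLICT_RE.findall(output)


def checkout_command(commit_hash, paths):
    """Build the argv that takes each path's version from the stopped commit."""
    return ["git", "checkout", commit_hash, "--", *paths]


# Shortened copy of the output shown above.
sample = """\
CONFLICT (content): Merge conflict in msd/webshop/static/webshop/img/common/umberto-jXd2FSvcRr8-unsplash.jpg
Auto-merging msd/webshop/static/webshop/img/common/max-duzij-qAjJk-un3BI-unsplash.jpg
CONFLICT (content): Merge conflict in msd/webshop/static/webshop/img/common/max-duzij-qAjJk-un3BI-unsplash.jpg
error: could not apply a90710f... scaled images down to max width of 1920 pixels
Could not apply a90710f... scaled images down to max width of 1920 pixels
"""
commit, paths = parse_rebase_stop(sample)
print(checkout_command(commit, paths))
```

The resulting list can be handed to subprocess.run, and the same paths can be reused for the git restore --staged and git add steps.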

Moved binary files

When I ran git log --follow to find all the commits that modified binary files, it flagged one where I had moved them. I'm not sure I actually had to edit that commit and I wonder if I would not have had this weird situation if I had not edited it. But for completeness, here's what I saw. Git rebase stopped to let me edit the commit and git annex printed out this message for every file that was moved:

git-annex: git status will show <filename> to be modified, since content availability has changed and git-annex was unable to update the index. This is only a cosmetic problem affecting git status; git add, git commit, etc won't be affected. To fix the git status display, you can run: git update-index -q --refresh <filename>

Sounds…quite weird. But git rebase would not continue until I did run the suggested command:

git update-index -q --refresh <filenames>
git rebase --continue

Dealing with Tags

Once the rebase was done I noticed that the tags I had all still pointed to the original commits. Oops. A quick internet search led me to this post about rebasing and moving tags to the new commits (written by a former co-worker, it just so happens). Too bad I didn't look for that before I rebased. I thought about redoing the whole rebase, but in the end I just wrote my own quick python script (using snippets from Nacho's) to take care of my specific situation. Here it is:

#!/usr/bin/env python3
"""Move each tag to the rebased commit whose subject line matches."""
from subprocess import run, PIPE

# Every tag as a "<hash> refs/tags/<name>" line
tags = run(['git', 'show-ref', '--tags'],
           stdout=PIPE).stdout.decode('utf-8').splitlines()

tags_with_comments = {}
for tag in tags:
    tag_hash, tag_ref = tag.split(' ', 1)
    # Strip only the ref prefix so tag names containing '/' stay intact
    tag_name = tag_ref[len('refs/tags/'):]
    # Take the last line of the output: that is the commit subject, even
    # if git prints tag details before it
    comment = run(['git', '--no-pager', 'show', '-s',
                   '--format=%s', tag_hash],
                  stdout=PIPE).stdout.decode('utf-8').splitlines()[-1]
    print(f'{tag_name}: {comment}')
    tags_with_comments[tag_name] = comment

# "<short hash> <subject>" for every commit reachable from HEAD
commits = run(['git', 'log', '--oneline'],
              stdout=PIPE).stdout.decode('utf-8').splitlines()

for tag_name in tags_with_comments:
    for c in commits:
        commit_hash, _, comment = c.partition(' ')
        if comment == tags_with_comments[tag_name]:
            # --force lets git move a tag that already exists
            run(['git', 'tag', '--force', tag_name, commit_hash])
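One caveat with matching on subject lines: if two commits share the same subject, the script above cannot tell which one a tag belongs to. A sketch of a dry-run check you could do first, with made-up tag names and hashes (plan_retag is a hypothetical helper, not part of the script above): only tags with exactly one matching commit are planned for a move, everything else is reported.

```python
from collections import defaultdict


def plan_retag(tag_subjects, commits):
    """Split tags into safe moves and ones that need a human.

    tag_subjects: {tag_name: subject of the commit the tag pointed at}
    commits: [(commit_hash, subject), ...] from the rebased branch
    Returns (moves, skipped): moves maps tag -> new hash; skipped maps
    tag -> list of candidate hashes (empty or more than one).
    """
    by_subject = defaultdict(list)
    for commit_hash, subject in commits:
        by_subject[subject].append(commit_hash)

    moves, skipped = {}, {}
    for tag_name, subject in tag_subjects.items():
        candidates = by_subject.get(subject, [])
        if len(candidates) == 1:
            moves[tag_name] = candidates[0]
        else:
            skipped[tag_name] = candidates
    return moves, skipped


# Made-up data for illustration.
tag_subjects = {"v1.0": "first release", "v1.1": "fix typo", "v2.0": "gone"}
commits = [("aaa1111", "fix typo"), ("bbb2222", "first release"),
           ("ccc3333", "fix typo")]
moves, skipped = plan_retag(tag_subjects, commits)
print(moves)
print(skipped)
```

Anything that lands in skipped is a tag to move by hand.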

Clean Up and Results

Well, with all that done, it was time to see how it all turned out. My original git repo was sitting at about 1.4 GB. This new repo was…3 GB!? Something wasn't right. Here are some steps I took to clean it up after making sure there weren't any old branches or remotes lying around:

git clean -fdx
git annex fsck
git fsck
git reflog expire --verbose --expire=0 --all
git gc --prune=0

The git clean command showed that I had a weird leftover .git directory in another directory somehow, so I deleted that. I don't think the fsck commands really did anything, but the gc definitely did. Size was now down to 985 MB. Much better. Wait a minute, what if I did a git gc on the original repo? Its size went down to 984 MB. Oh shoot. I guess it makes sense though: if both git and git annex are storing full versions of each binary file, they would end up the same size. The real win is the faster git operations, especially clones.
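If you want numbers for a before and after comparison like this, git count-objects -v reports how much space git's own object store takes, in KiB. Below is a small sketch that parses that output; the function names and the sample figures are made up, and note that it only covers git's objects, not the content git annex keeps under .git/annex, so you would still want something like du for the full picture.

```python
def parse_count_objects(output: str) -> dict:
    """Parse `git count-objects -v` output into {field: integer}."""
    stats = {}
    for line in output.splitlines():
        key, sep, value = line.partition(":")
        if sep and value.strip().isdigit():
            stats[key.strip()] = int(value)
    return stats


def object_store_kib(stats: dict) -> int:
    """Loose objects plus packs, both reported by git in KiB."""
    return stats.get("size", 0) + stats.get("size-pack", 0)


# Made-up sample in the shape `git count-objects -v` prints.
sample = """\
count: 12
size: 48
in-pack: 3400
packs: 1
size-pack: 156000
prune-packable: 0
garbage: 0
size-garbage: 0
"""
stats = parse_count_objects(sample)
print(object_store_kib(stats), "KiB")
```

Running it before and after git gc makes it easy to see how much of the size was just unpacked or unreachable objects.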

A local git clone now happens in the blink of an eye, and its size is only 153 MB. Now, that's a little unfair because it doesn't have any of the binary files. After a git annex get to get the binary files for the current checkout it goes up to 943 MB. Not a huge savings, but it only gets better as time goes on and more edits happen. Right? This was all worth it, wasn't it?!

Let me know in the comments if this is helpful, hurtful, or if I did this totally wrong.