Big Git Energy: Headaches with large files and GitHub, common pitfalls, and purging files from the course of history

Large projects often have large data files. Recently, I tried to push the repository of my master's thesis to GitHub to share my project with others.

Sure enough, there were problems. Lots of them.

Since you might hit the same bumps in the road when sharing big data files via GitHub, I decided to collect them here.

Limitations of GitHub

Git itself does not limit file sizes. However, most hosting services, such as GitHub, do.

GitHub blocks pushes that contain any file larger than 100 MB. In my case, I was dealing with several files exceeding 400 MB, and a few files larger than 1 GB.

The service is meant for sharing code, not for backing up data. This distinction matters: I caught myself thinking of GitHub as a fancier Google Drive where you can simply dump everything you want.

An error like this reminds you that it is not.

error: File bigfile.csv is 563 MB; this exceeds GitHub's file size limit of 100 MB

Easiest to solve before it happens

In general, only consider pushing files that your code cannot reproduce.
For everything else, your code serves as the necessary documentation.

Add large objects (>100 MB) to your .gitignore file before you commit them. Ignoring them from the start is key, because then they never enter version control in the first place and cannot clog up any pushes later!
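
To find candidates for your .gitignore before the first commit, you can scan the working directory for big files. This is only a minimal sketch: the function name list_big_files is made up for this example, and the default threshold of 100 MiB simply mirrors the limit mentioned above.

```shell
# Hypothetical helper: list files above a size threshold (in MiB),
# skipping Git's own data in .git.
list_big_files() {
    dir="${1:-.}"
    limit_mib="${2:-100}"
    # "-size +Nc" matches files larger than N bytes.
    find "$dir" -name .git -prune -o \
        -type f -size +"$((limit_mib * 1048576))"c -print
}

# Usage: review the list, then add the paths you do not want to track
# to your .gitignore file.
#   list_big_files . 100
```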

How do I know what is too much?

Obviously, your giant bigfile.csv causes the problem. But what if there are more objects above the size limit, hiding somewhere in your repository?

A helpful tool for this question is git-sizer.
In your repository you can get an assessment of your repo’s size like this:

brew install git-sizer
git-sizer --verbose

The output gives you an overview of the size of your biggest files and the overall size of your repository. Asterisks indicate the level of concern.

A row of exclamation marks indicates an entity that breaks GitHub limitations and will cause the service to block push requests. Read more about it in the git-sizer repo.
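
If you cannot install git-sizer, plain Git can give a rougher version of the same answer. The following sketch lists every blob in the history above a given size, largest first. The function name and the default of 104857600 bytes (100 MiB) are assumptions for illustration.

```shell
# Hypothetical helper: print "size-in-bytes path" for every blob in the
# history that is larger than the given limit (default: 100 MiB).
big_blobs_in_history() {
    limit_bytes="${1:-104857600}"
    git rev-list --objects --all |
        git cat-file --batch-check='%(objecttype) %(objectsize) %(rest)' |
        awk -v limit="$limit_bytes" \
            '$1 == "blob" && $2 > limit { sub(/^blob /, ""); print }' |
        sort -rn
}

# Usage (run inside the repository):
#   big_blobs_in_history
```

The list also includes files that were deleted in later commits, which is exactly what matters for a push.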

It's too late – My file is already in version control

If Git already tracks a file that is too large for your remote repository, you need to remove it from version control.

The catch: simply deleting the file from the repository and making a new commit won’t cut it. Pushing to a remote repository includes pushing the history of that repo, and that means every past commit in which your giant data file was still present.
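
You can check this yourself: Git still lists commits for a path long after the file is gone from the working directory. A small sketch, with path_in_history as a made-up helper name:

```shell
# Hypothetical helper: succeeds if any commit on any branch touched the path.
path_in_history() {
    [ -n "$(git log --all --oneline -- "$1")" ]
}

# Usage:
#   path_in_history your/giant/data/file.csv && echo "still in history"
```

As long as this check succeeds, the file can still end up in a push.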

Now, things get dramatic: removing your file is not enough. You need to purge 🔥 it from your project's history.

For this, there are different methods. I used git-filter-repo:

brew install git-filter-repo
git filter-repo --invert-paths --path your/giant/data/file.csv

Then add the respective file to your .gitignore file and commit that change.

Have a backup of the respective file in case something goes wrong! Rewriting history this way typically also deletes the file from your working directory.

I cannot do without a large file: Use Git LFS

Your repo really cannot do without that big data file? GitHub itself offers help with Git Large File Storage (Git LFS). The service is free to use, as every user gets an allotment of 1 GB of storage.

Essentially, with Git LFS your main Git repository only holds small pointers to the large data files. This keeps the project itself slim while still letting you share those large data sets.
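
To make the idea of a pointer concrete: instead of the data itself, the repository stores a tiny text file with three lines (a spec version, a SHA-256 hash of the content, and the size in bytes). The sketch below only imitates that layout for illustration; Git LFS creates the real pointers for you. The function name is made up, and sha256sum is assumed to be available (on macOS, shasum -a 256 is the usual substitute).

```shell
# Illustration only: print text in the shape of a Git LFS pointer file.
lfs_pointer_sketch() {
    file="$1"
    oid="$(sha256sum "$file" | cut -d ' ' -f 1)"
    size="$(wc -c < "$file" | tr -d ' ')"
    printf 'version https://git-lfs.github.com/spec/v1\n'
    printf 'oid sha256:%s\n' "$oid"
    printf 'size %s\n' "$size"
}

# Usage:
#   lfs_pointer_sketch bigfile.csv
```

However large the data file is, the pointer stays at a few hundred bytes, which is why the repository itself remains small.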

To add files to Git LFS, follow their instructions: install the Git LFS extension, tell it which files (or file types) it should track, and commit the resulting .gitattributes file.

git lfs install

git lfs track "*.csv"
git lfs track "big/data/subfolder/**"

git add .gitattributes
git commit -m "made git lfs track csv files and the subfolder big data"
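
After committing, you may want to confirm which paths the patterns actually cover. git lfs track writes lines such as "*.csv filter=lfs diff=lfs merge=lfs -text" into .gitattributes, so asking Git for the filter attribute of a path works as a quick check. A sketch, with is_lfs_tracked as a made-up helper name and results/table.csv as a made-up path:

```shell
# Hypothetical helper: succeeds if .gitattributes assigns the LFS filter
# to the given path. The file does not have to exist yet.
is_lfs_tracked() {
    [ "$(git check-attr filter -- "$1")" = "$1: filter: lfs" ]
}

# Usage:
#   is_lfs_tracked results/table.csv && echo "handled by Git LFS"
```

Note that for a whole folder the pattern has to end in /** to cover the files inside it; a pattern that ends in a plain slash does not match them.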

Side-note: Internet trouble

It can also happen that the HTTP protocol is the bottleneck of your push.

A Stack Overflow discussion on this problem mentions that it can make sense to downgrade the HTTP protocol version or to increase the HTTP post buffer size. For the sake of completeness, some code for this:

git config --global http.postBuffer 157286400
git config --global http.version HTTP/1.1
git push 
git config --global http.version HTTP/2
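
If you would rather not touch your global configuration, Git also accepts settings for a single command via -c. A sketch that applies the same two options to one push only; the function name push_http11 is made up for this example:

```shell
# Hypothetical helper: push once with HTTP/1.1 and a larger post buffer,
# without changing any config file.
push_http11() {
    git -c http.version=HTTP/1.1 -c http.postBuffer=157286400 push "$@"
}

# Usage:
#   push_http11 origin HEAD
```

This way there is nothing to switch back afterwards.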

📝 To summarise:

  1. Pay attention to big files from the start, .gitignore them, make your own backups of these files, and find workarounds, e.g. by hosting them on Google Drive and adding a link to your repo’s README
  2. Identify files that are too large with git-sizer
  3. Purge 🔥 these files from the repo with git-filter-repo
  4. Add big files which you do not need to share to .gitignore
  5. Add the other big files to Git LFS via git lfs track
  6. Commit both the .gitignore and .gitattributes files
  7. Push your repo successfully! (If not, try out the HTTP version and buffer-size options)

Jamie Larson