The Git project recently released Git 2.53.0. Let's look at a few notable highlights from this release, which includes contributions from the Git team at GitLab.
Geometric repacking support with promisor remotes
Newly written objects in a Git repository are often stored as individual loose files. To ensure good performance and optimal use of disk space, these loose objects are regularly compressed into so-called packfiles. The number of packfiles in a repository grows over time as a result of the user's activities, like writing new commits or fetching from a remote. As the number of packfiles in a repository increases, Git has to do more work to look up individual objects. Therefore, to preserve optimal repository performance, packfiles are periodically repacked via git-repack(1) to consolidate the objects into fewer packfiles. When repacking there are two strategies: "all-into-one" and "geometric".
The all-into-one strategy is fairly straightforward and the current default. As its name implies, all objects in the repository are packed into a single packfile. From a performance perspective this is great for the repository as Git only has to scan through a single packfile when looking up objects. The main downside of such a repacking strategy is that computing a single packfile for a repository can take a significant amount of time for large repositories.
The geometric strategy helps mitigate this concern by maintaining a geometric progression of packfiles based on their size instead of always repacking into a single packfile. To explain more plainly, when repacking Git maintains a set of packfiles ordered by size where each packfile in the sequence is expected to be at least twice the size of the preceding packfile. If a packfile in the sequence violates this property, packfiles are combined as needed until the progression is restored. This strategy has the advantage of still minimizing the number of packfiles in a repository while also minimizing the amount of work that must be done for most repacking operations.
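The progression rule can be sketched in a few lines of Python. This is a simplified model, not Git's implementation: the function name `geometric_repack` is hypothetical, pack sizes are plain integers (Git's own documentation describes the measure as the number of objects in each pack), and a merged pack is assumed to be exactly as large as the sum of its inputs.

```python
def geometric_repack(pack_sizes, factor=2):
    """Simulate one geometric repack over a list of pack sizes.

    Simplified model: sizes are plain integers, and a merged pack is
    assumed to be exactly as large as the sum of its inputs.
    """
    packs = sorted(pack_sizes)
    n = len(packs)

    # Walk down from the largest pack while each pack is at least
    # `factor` times as large as the one before it.
    i = n - 1
    while i > 0 and packs[i] >= factor * packs[i - 1]:
        i -= 1

    # If the walk stopped early, packs[0..i] break the progression and
    # must be combined; otherwise nothing needs to be rewritten.
    split = i + 1 if i > 0 else 0

    # The combined pack may be too large relative to its next neighbour,
    # so keep absorbing packs until the progression holds again.
    merged = sum(packs[:split])
    while split < n and packs[split] < factor * merged:
        merged += packs[split]
        split += 1

    return ([merged] if split else []) + packs[split:]
```

In this model, `geometric_repack([1, 2, 4, 8])` leaves the packs untouched because the progression already holds, while `geometric_repack([1, 1, 3, 16])` combines the three small packs into one of size 5 and never rewrites the large pack of size 16. That is the property that keeps most repacks cheap.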
One problem with the geometric repacking strategy was that it was not compatible with partial clones. Partial clones allow the user to clone only parts of a repository by, for example, skipping all blobs larger than 1 megabyte. This can significantly reduce the size of a repository, and Git knows how to backfill missing objects that it needs to access at a later point in time.
The result is a repository that is missing some objects, and any object that may not be fully connected is stored in a "promisor" packfile. When repacking, this promisor property must be retained for packfiles containing promisor objects, so that Git knows whether a missing object is expected and can be backfilled from the promisor remote. With an all-into-one repack, Git knows how to handle promisor objects properly and stores them in a separate promisor packfile. Unfortunately, the geometric repacking strategy did not give special treatment to promisor packfiles and would instead merge them with normal packfiles without considering whether they reference promisor objects. Luckily, due to a bug, the underlying git-pack-objects(1) died when geometric repacking was used in a partial clone repository. This meant that repositories in this configuration could not be repacked at all, which isn't great, but is better than repository corruption.
With the release of Git 2.53, geometric repacking now works with partial clone repositories. When performing a geometric repack, promisor packfiles are handled separately to preserve the promisor marker, and they are repacked following their own geometric progression. With this fix, the geometric strategy moves closer to becoming the default repacking strategy. For more information, check out the corresponding mailing list thread.
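The idea of keeping promisor packfiles apart can be illustrated with a small Python sketch. Everything here is hypothetical and only models the bookkeeping: the `Pack` type, the `repack_separately` helper, and the `strategy` callable (any function that maps a list of pack sizes to a new list of pack sizes) are assumptions for illustration and do not correspond to Git internals.

```python
from typing import Callable, List, NamedTuple


class Pack(NamedTuple):
    size: int
    promisor: bool  # True if the pack carries the promisor marker


def repack_separately(
    packs: List[Pack],
    strategy: Callable[[List[int]], List[int]],
) -> List[Pack]:
    """Repack normal and promisor packs as two independent groups.

    Because the groups never mix, every resulting pack is either fully
    promisor or fully normal, so the marker is preserved.
    """
    result = []
    for flag in (False, True):
        sizes = [p.size for p in packs if p.promisor == flag]
        result.extend(Pack(size, flag) for size in strategy(sizes))
    return result


# Stand-in strategy for the example: merge every pack of a group into one.
def all_into_one(sizes: List[int]) -> List[int]:
    return [sum(sizes)] if sizes else []


packs = [Pack(1, False), Pack(1, True), Pack(3, False), Pack(2, True)]
print(repack_separately(packs, all_into_one))
```

Passing a geometric strategy instead of `all_into_one` would give each group its own progression, which matches the behavior described above at a conceptual level.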
This project was led by Patrick Steinhardt.
git-fast-import(1) learned to preserve only valid signatures
In our Git 2.52 release article, we covered signature-related improvements to git-fast-import(1) and git-fast-export(1). Be sure to check out that post for a more detailed explanation of these commands, how they are used, and the changes being made with regard to signatures.
To quickly recap, git-fast-import(1) provides a backend to efficiently import data into a repository and is used by tools such as git-filter-repo(1) to help rewrite the history of a repository in bulk. In the Git 2.52 release, git-fast-import(1) learned the --signed-commits=<mode> option, mirroring the option of the same name in git-fast-export(1). With this option, it became possible to unconditionally retain or strip signatures from commits/tags.
In situations where only part of the repository history has been rewritten, any signature for rewritten commits/tags becomes invalid. Until now, git-fast-import(1) was limited to either stripping all signatures or keeping all signatures, even if they had become invalid. Retaining invalid signatures doesn't make much sense, so rewriting history with git-filter-repo(1) results in all signatures being stripped, even if the underlying commit/tag is not rewritten. This is unfortunate because if the commit/tag is unchanged, its signature is still valid and thus there is no real reason to strip it. What is really needed is a means to preserve signatures for unchanged objects, but strip invalid ones.
With the release of Git 2.53, the git-fast-import(1) --signed-commits=<mode> option has learned a new strip-if-invalid mode which, when specified, only strips signatures from commits that become invalid due to being rewritten. Thus, with this option it becomes possible to preserve some commit signatures when using git-fast-import(1). This is a critical step towards providing the foundation for tools like git-filter-repo(1) to preserve valid signatures and eventually re-sign commits whose signatures have become invalid.
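To make the difference between the modes concrete, here is a toy Python model. It is only a sketch under stated assumptions: the `Commit` type and `import_commits` function are hypothetical, and a boolean `rewritten` flag stands in for the question of whether a signature is still valid, which the real command has to determine itself. The mode names `verbatim`, `strip`, and `strip-if-invalid` are the values accepted by --signed-commits=<mode>.

```python
from dataclasses import dataclass, replace
from typing import List, Optional


@dataclass(frozen=True)
class Commit:
    name: str
    signature: Optional[str]  # None means the commit is unsigned
    rewritten: bool           # assumption: rewritten => signature invalid


def import_commits(commits: List[Commit], mode: str) -> List[Commit]:
    """Model how each mode treats commit signatures during an import."""
    if mode == "verbatim":
        # Keep every signature, even those that no longer verify.
        return list(commits)
    if mode == "strip":
        # Drop every signature, even those that are still valid.
        return [replace(c, signature=None) for c in commits]
    if mode == "strip-if-invalid":
        # Drop only the signatures invalidated by the rewrite.
        return [replace(c, signature=None) if c.rewritten else c
                for c in commits]
    raise ValueError(f"unsupported mode in this sketch: {mode}")


history = [
    Commit("A", "sig-A", rewritten=False),
    Commit("B", "sig-B", rewritten=True),
    Commit("C", None, rewritten=True),
]
kept = import_commits(history, "strip-if-invalid")
print([(c.name, c.signature) for c in kept])
```

In this example only commit A keeps its signature under `strip-if-invalid`, whereas `strip` would remove it as well and `verbatim` would leave the now-invalid signature on B in place.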
This project was led by Christian Couder.
More data collected in git-repo-structure
In the Git 2.52 release, the "structure" subcommand was introduced to git-repo(1). The intent of this command was to collect information about the repository and eventually become a native replacement for tools such as git-sizer(1). At GitLab, we host some extremely large repositories, and having insight into the general structure of a repository is critical to understand its performance characteristics. In this release, the command now also collects total size information for reachable objects in a repository to help understand the overall size of the repository. In the output below, you can see the command now collects both the total inflated and disk sizes of reachable objects by object type.
$ git repo structure
| Repository structure | Value      |
| -------------------- | ---------- |
| * References         |            |
|   * Count            |     1.78 k |
|     * Branches       |          5 |
|     * Tags           |     1.03 k |
|     * Remotes        |        749 |
|     * Others         |          0 |
|                      |            |
| * Reachable objects  |            |
|   * Count            |   421.37 k |
|     * Commits        |    88.03 k |
|     * Trees          |   169.95 k |
|     * Blobs          |   162.40 k |
|     * Tags           |        994 |
|   * Inflated size    |   7.61 GiB |
|     * Commits        |  60.95 MiB |
|     * Trees          |   2.44 GiB |
|     * Blobs          |   5.11 GiB |
|     * Tags           | 731.73 KiB |
|   * Disk size        | 301.50 MiB |
|     * Commits        |  33.57 MiB |
|     * Trees          |  77.92 MiB |
|     * Blobs          | 189.44 MiB |
|     * Tags           | 578.13 KiB |
The keen-eyed among you may have noticed that the size values in the table output are now listed in a more human-friendly manner, with units appended. In subsequent releases, we hope to further expand this command's output to provide additional data points such as the largest individual objects in the repository.
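Because the values carry units, the two totals can be compared with a little arithmetic. The Python sketch below is only an illustration: `parse_size` and the `UNITS` table are hypothetical helpers that assume binary (1024-based) units written as in the table above, and the two input strings are copied from that output.

```python
# Assumption: units are binary multiples, spelled as in the table above.
UNITS = {"B": 1, "KiB": 1024, "MiB": 1024**2, "GiB": 1024**3}


def parse_size(text):
    """Convert a string such as '7.61 GiB' into a number of bytes."""
    number, unit = text.split()
    return float(number) * UNITS[unit]


inflated = parse_size("7.61 GiB")   # total inflated size of reachable objects
on_disk = parse_size("301.50 MiB")  # total disk size of reachable objects
ratio = on_disk / inflated
print(f"on-disk size is {ratio:.1%} of the inflated size")
```

For this repository, the reachable objects occupy just under 4% of their inflated size on disk, which gives a rough sense of how much compression and deltification in packfiles save.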
This project was led by Justin Tobler.
Read more
This article highlighted just a few of the contributions made by GitLab and the wider Git community for this latest release. You can learn more about these and other changes from the official release announcement of the Git project. Also, check out our previous Git release blog posts to see past highlights of contributions from GitLab team members.