How to minimize the memory and data flow consumption of Git cloning?

# Background

Jean-Cloud is a small association providing hosting services on second-hand hardware. It is currently launching the Shlagernetes project, software that enables services to be distributed and managed across several second-hand servers. Git is used in certain cases to install a service on a server or to update it.

# Objective

The objective is to obtain the latest version (or a specific version) of a Git repository using as few resources as possible. By resources, we mean the data transferred from the remote to the local folder, as well as the disk space occupied by the repository on the local server.

The created Git repository will not send any data to the remote. It has access to tags but not to history. It can keep some local untracked files in addition to its Git clones. It includes submodules if present. It can either download the latest commit of main (default) or the commit of a given reference, i.e. a branch or a tag.

# Procedure

Tests on various commands were carried out on a dummy repository. The test file is transportable and can be downloaded here. Note that to run the tests locally, you need to allow the protocol for local files: `git config --global protocol.file.allow always`. This is not the default configuration, as it may represent a security vulnerability.

The tests consist in measuring the disk space taken up by the local repository with the `du` command, and in analyzing the text produced by Git during cloning.

# Final results

The final chosen combination is:

## To clone:

`git clone --depth=1 --recurse-submodules --remote-submodules`

- depth=1 clones only the last commit along with the objects it needs. It implies single-branch by default.
- recurse-submodules ensures that the contents of submodules are cloned
- remote-submodules ensures that submodule content is cloned from the original remote submodule
- shallow-submodules ensures that only the latest submodule commit is imported (for this to work locally, prefix the submodule path with `file://`)

## To update:

```
git fetch --tags --depth=1 --prune --prune-tags origin "$ref"
git reset --hard --recurse-submodules FETCH_HEAD
git submodule update --init --recursive --force --depth=1 --remote
git reflog expire --expire=now --all
git gc --aggressive --prune=now
# optional: git clean -qfdx
```

- `git fetch --tags --depth=1 --prune --prune-tags origin $ref`
  - tags fetches the tags; it must be specified even if the reference being fetched is itself a tag
  - depth=1 limits the fetch to the last commit
  - prune deletes remote-tracking references that no longer exist on the remote
  - prune-tags additionally deletes local tags that do not exist on the remote
- `git reset --hard --recurse-submodules FETCH_HEAD` (FETCH_HEAD is the commit that was just fetched)
- `git submodule update --init --recursive --force --depth=1 --remote`
  - init initializes the submodules that are not yet initialized, based on the .gitmodules file
  - recursive applies the command to submodules of submodules, and so on
  - force ignores local changes to submodules and automatically checks out the new version
  - depth=1 limits the fetch to the last submodule commit
  - remote updates from the original remote submodule
  - CAREFUL: order matters here. Running this command before `git reset` would make it ineffective because of the `--recurse-submodules` option of `git reset`. That option is nevertheless kept on `git reset` to handle the deletion of a submodule.
- `git reflog expire --expire=now --all` marks all reflog entries as expired immediately instead of 90 days later. This lets `git gc` clean up more. `git rev-list` allows you to check which objects are still linked and will not be marked as expired.
- `git gc --aggressive --prune=now` removes unreachable objects and reorganizes the repository to optimize it. aggressive invokes repack and takes longer. repack unpacks and rebuilds packs, which are the compression units.
- `git clean -qfdx` (optional): if this command is omitted, files created without being committed are retained.

This combination does not save any changes made to our repository, apart from the creation of non-committed files if `git clean` is omitted, which is the case in git_update.sh.

# Details

Here is a summary of the different solutions we have explored to reduce the footprint of our Git repository.

## Partial vs. shallow cloning

A shallow clone does not contain the entire repository history. A partial clone does not contain all the files and/or folders of the repository, according to a filter. Filters may concern Binary Large Objects (blobs) or trees. If the filter concerns age, then a partial clone can also be a shallow clone.

Partial clones can be created using the `git clone --filter` command. During checkout or switch operations, objects initially excluded by the `--filter` can be imported. In our case, we only want to keep one precise commit, which `git clone --filter` would let through in any case, so the filter is irrelevant.

Partial working trees can also be obtained with sparse checkout. Some files and/or folders then do not appear at all in the local folder and are not affected by Git porcelain (surface) operations. Nevertheless, the objects associated with these files and folders are still stored in the .git repository.

A shallow clone can be created using the `--depth=` option, which specifies the number of commits to be kept. This option is available for both the clone and fetch commands.

## Large file storage

LFS is a Git extension that lets you manipulate selected files (by name, expression or size) using a local cache.
In practice, files are replaced by references in the Git repository and a local folder outside the repository is created to store the files. They are downloaded lazily, i.e. only when checked out. All older versions are stored on an online server. This is a very interesting mechanism, which we will not use for the same reason as the `--filter` clone: we only want to keep one specific version of the files, which would in any case be downloaded by LFS.

## Delete history

The `git filter-branch` command can be used to rewrite branch history using filters, but it is not recommended by the Git documentation: it has several safety and performance flaws. The JVM-based BFG Repo-Cleaner tool works, but the Git documentation considers the Python-based `git filter-repo` tool to be faster and safer. We do not wish to install either Python or Java, hence we will not dig any deeper into these two possibilities here.

We want to delete the entire history without filtering, so `git fetch --depth=1` followed by a git checkout, reset or merge works for us.

## checkout? merge? reset?

Once the changes have been fetched into the local remote-tracking references, what is the best way to apply them to our index and working directory? Let us compare 4 possibilities: `git merge -X theirs`, `git merge -s ours`, `git reset --hard`, `git checkout --force -B`. The final results are identical, except for `git merge -X theirs`.

In the case of git merge, we do not wish to resolve conflicts manually. The remote must always take precedence over local differences.

### `git merge -X theirs`

This command applies the ort strategy which, in the event of a conflict, gives precedence to theirs. However, since we are working with `--depth=1`, the two branches have no common ancestor, and the `--allow-unrelated-histories` option must be supplied. The absence of a common ancestor prevents Git from recognizing similarities within the same file.
Any modification to a tracked file on ours, even on a new line, will thus cause a conflict and be overwritten. This command does, however, preserve newly created and committed files on ours. Newly created uncommitted files are kept unless `git clean` is run.

Advantage: committed files created on ours are preserved.

Disadvantage: a file deleted on theirs that already existed on ours will not be deleted on ours.

### `git merge -s ours`

[Caution: the notions of theirs and ours are reversed here, as `git merge -s theirs` does not exist.]

This command applies the ours strategy, which gives precedence to ours whether there is a conflict or not. It will ignore all changes and file creations committed to theirs. It will also ignore uncommitted modifications. Uncommitted file creations are retained unless `git clean` is run. This is the same result as with the `git reset --hard` command. As the `git merge -s theirs` option does not exist, we need to do a little manipulation:

```
# we want to merge origin/main into main, giving precedence to origin/main
# create a temporary branch temp from origin/main and check it out
git switch -c temp origin/main
# merge main into temp, giving precedence to temp, which is identical to origin/main
git merge -s ours --allow-unrelated-histories main
# return to main
git checkout main
# merge temp into main
git merge --allow-unrelated-histories temp
# delete temp
git branch -D temp
```

Advantage:

Disadvantage: creation of a temporary branch.

### `git checkout --force -B main origin/main`

This command is equivalent to `git merge -s ours` and `git reset --hard`, with the difference that you end up in detached HEAD state, which does not cause any problem in our case since we do not want to push any changes from our repository.

Advantage:

Disadvantage: detached HEAD state.

### `git reset --hard`

`git reset --hard` is equivalent to `git merge -s ours` and `git checkout --force -B`.
Advantage:

Disadvantage:

Tests show that the most memory-efficient options are `git checkout --force -B`, `git merge -s ours` and `git reset --hard`, which all do the same thing. However, `git reset --hard` does not involve the creation of a temporary branch and does not end in detached HEAD state, hence it is the one we choose.

### Submodule management

Submodules are initially cloned using `git clone --recurse-submodules --remote-submodules`. They are updated using `git submodule update --init --recursive --force --depth=1 --remote`. `git reset --hard` must be supplied with the `--recurse-submodules` option in order to delete submodules from the working directory.

The same rules apply to submodules as to the rest of the repository. In the .gitmodules file, it is possible to specify rules for importing submodules, such as a certain tag or branch. If `--remote-submodules` is removed from `git clone` and `--remote` from `git submodule update`, submodules will match the commit recorded in the repository being cloned rather than the latest state of the original submodule repository.

## Tests

### Script description

### README extract

```
The script consists of twenty-nine tests (listed in the results below), based on three functions: generate_random_file, get_storage_used and get_bandwidth.

generate_random_file uses the bash command dd and /dev/random.
get_storage_used uses the bash command du.
get_bandwidth retrieves the output of Git commands and extracts the traffic displayed. This does not take submodule traffic into account.

The first five tests concern cloning. The following tests involve updating the repository using different commands, with three cases for each command: after adding a file, after deleting a file, after adding then deleting a file.
```

### Help extract

```
NAME
    performance_tests.sh
SYNOPSIS
    performance_tests.sh [-a] [-h] [-n number]
OPTIONS
    -a executes all the tests.
    -n number executes test number.
    -c cleans.
    -h prints the help.
```

### Results

```
**Tests on the initial populating of the repository**
=============================================================
TEST0
TEST 0: classic cloning.
memory usage: 22668
bandwidth usage (submodule excluded): 8.49 MiB
=============================================================
TEST1
TEST 1: --single-branch cloning.
memory usage: 22168
bandwidth usage (submodule excluded): 8.00 MiB
=============================================================
TEST2
TEST 2: --depth=1 --no-single-branch
memory usage: 17552
bandwidth usage (submodule excluded): 3.49 MiB
=============================================================
TEST3
TEST 3: --depth=1 with single-branch (default)
memory usage: 17052
bandwidth usage (submodule excluded): 3.00 MiB
=============================================================
TEST4
TEST 4: --depth=1 with single-branch (default) and reflog and gc
HEAD is now at 23700cf adding submodule_for_performance_testing module
memory usage: 17056
bandwidth usage (submodule excluded): 3.00 MiB
=============================================================
TEST5
TEST 5 : sparse-checking only sample0 with depth=1
memory usage: 10060
bandwidth usage (submodule excluded): unknown

**Tests on the updating of the repository**

**classic fetching+checking out**
=============================================================
TEST6
TEST 6: after addition of a 1M file
memory usage: +2108
=============================================================
TEST7
TEST 7: after removal of a 1M file
memory usage: -972
=============================================================
TEST8
TEST 8: after addition then removal of a 1M file
memory usage: 1088

**fetching+checking out with --depth=1**
=============================================================
TEST9
TEST 9: after addition of a 1M file
memory usage: +2112
=============================================================
TEST10
TEST 10: after removal of a 1M file
memory usage: -968
=============================================================
TEST11
TEST 11: after addition then removal of a 1M file
memory usage: 48

**--depth=1 fetching+checking out reflog and gc**
=============================================================
TEST12
TEST 12: after addition of a 1M file
memory usage: +2052
=============================================================
TEST13
TEST 13: after removal of a 1M file
memory usage: -1020
=============================================================
TEST14
TEST 14: after addition then removal of a 1M file
memory usage: 4

**--depth=1 fetching+ reset --hard**
=============================================================
TEST15
TEST 15: after addition of a 1M file
memory usage: +2116
=============================================================
TEST16
TEST 16: after removal of a 1M file
memory usage: -964
=============================================================
TEST17
TEST 17: after addition then removal of a 1M file
memory usage: 52

**--depth=1 fetching+ reset --hard and reflog and gc**
=============================================================
TEST18
TEST 18: after addition of a 1M file
memory usage: 2056
=============================================================
TEST19
TEST 19: after removal of a 1M file
memory usage: -1016
=============================================================
TEST20
TEST 20: after addition then removal of a 1M file
memory usage: 8

**--depth=1 fetching+checking out after modification applied in submodule**
=============================================================
TEST21
TEST 21: after addition of a 1M file
memory usage: 2112
=============================================================
TEST22
TEST 22: after removal of a 1M file
memory usage: -976
=============================================================
TEST23
TEST 23: after addition then removal of a 1M file
memory usage: 48

**--depth=1 fetching+merging -X theirs with reflog and gc**
=============================================================
TEST24
TEST 24: after addition of a 1M file
memory usage: +2056
=============================================================
TEST25
TEST 25: after removal of a 1M file
memory usage: 8
=============================================================
TEST26
TEST 26: after addition then removal of a 1M file
memory usage: 8

**--depth=1 fetching+merging -s ours with reflog and gc**
=============================================================
TEST27
TEST 27: after addition of a 1M file
memory usage: +2056
=============================================================
TEST28
TEST 28: after removal of a 1M file
memory usage: -1016
=============================================================
TEST29
TEST 29: after addition then removal of a 1M file
memory usage: 8
```
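
The bandwidth figures in these results are read from the text Git prints while transferring objects. As a rough illustration of that idea, here is a minimal sketch of how such a value might be extracted. It is not the actual `get_bandwidth` function of performance_tests.sh, which is not reproduced here: the function name `extract_traffic` is hypothetical, and the sketch assumes the progress text contains a final line of the form `Receiving objects: 100% (n/n), <size> | <speed>, done.`

```shell
#!/usr/bin/env bash

# Hypothetical helper (an assumption, not the original get_bandwidth):
# reads Git progress text on stdin and prints the transferred size
# reported on the completed "Receiving objects" line, e.g. "3.00 MiB".
extract_traffic() {
    # Git rewrites its progress line using carriage returns, so turn
    # them into newlines before matching.
    tr '\r' '\n' |
        # Keep only the size field of the 100% line.
        sed -n 's/^Receiving objects: 100% ([0-9]*\/[0-9]*), \([0-9.]* [A-Za-z]*\) |.*/\1/p' |
        # If several transfers appear, report the last one.
        tail -n 1
}
```

A possible use, assuming `$url` and `$dir` are set by the caller, would be `git clone --progress --depth=1 "$url" "$dir" 2>&1 | extract_traffic`. The `--progress` flag is needed because Git only prints progress by default when its error stream is a terminal. As with the figures above, submodule traffic is not captured by this sketch.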