git_update/doc/development_explanations.md
2024-07-25 12:02:56 +02:00


How to minimize the memory and data flow consumption of Git cloning?

Background

Jean-Cloud is a small association providing hosting services on second-hand hardware. It is currently launching the Shlagernetes project, a piece of software that distributes and manages services across several second-hand servers. Git is used in certain cases to install a service on a server or to update it.

Objective

The objective is to obtain the latest version (or a specific version) of a git repository, using as few resources as possible. By resources, we mean the data flow from the remote to the local folder, as well as the memory space occupied by the repository on the local server. 

The created Git repository will not send any data to the remote. It has access to tags but not history. It can keep some local untracked files in addition to its Git clones. It includes submodules if present. It can either download the last main commit (default) or a commit from a certain reference, i.e. branch or tag. 

Procedure

Tests on various commands were carried out on a dummy repository. The test file is transportable and can be downloaded here. Note that to run locally, you need to authorize the protocol for local files: git config --global protocol.file.allow always. This is not the default configuration, as it may represent a security vulnerability.
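As a possible alternative to changing the global configuration, the setting can be scoped to a single invocation with git -c. The wrapper below is a minimal sketch; the function name git_local is hypothetical and not part of the test script.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper: allow the file protocol for one Git invocation only,
# leaving the global configuration untouched.
git_local() {
    git -c protocol.file.allow=always "$@"
}

# The setting is visible to the wrapped command only.
git_local config --get protocol.file.allow
```

A local clone with submodules could then be run as git_local clone --recurse-submodules followed by the usual arguments.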

The tests consist in analyzing the memory space taken up by the local repository using the bash command "du", as well as analyzing the text produced by Git during cloning.

Final results

The final chosen combination is :

To clone :

git clone --depth=1 --recurse-submodules --remote-submodules

depth=1 allows you to clone only the last commit along with the necessary objects. By default, it is single-branch.
recurse-submodules ensures that the contents of submodules are cloned
remote-submodules ensures submodule content is cloned from the original remote submodule
shallow-submodules, if added, ensures that only the latest submodule commit is imported (for this to work locally, prefix the submodule path with file://)
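The effect of --depth=1 can be checked on a throwaway repository. The following is a sketch only: the sandbox layout, the file name and the demo identity are assumptions made for the example, and the "remote" is a local repository reached through a file:// URL (with a plain path, Git ignores --depth).

```shell
#!/usr/bin/env bash
set -eu

# Identity used only for the throwaway commits below (assumption for the demo).
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.invalid
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.invalid

sandbox=$(mktemp -d)

# Build a hypothetical "remote" holding three commits on main.
git init -q "$sandbox/remote"
git -C "$sandbox/remote" symbolic-ref HEAD refs/heads/main
for i in 1 2 3; do
    echo "version $i" > "$sandbox/remote/file.txt"
    git -C "$sandbox/remote" add file.txt
    git -C "$sandbox/remote" commit -q -m "commit $i"
done

# Clone with the options chosen above.
git clone -q --depth=1 --recurse-submodules --remote-submodules \
    "file://$sandbox/remote" "$sandbox/local"

# Only one commit should be visible, and the repository is marked shallow.
commit_count=$(git -C "$sandbox/local" rev-list --count HEAD)
is_shallow=$(git -C "$sandbox/local" rev-parse --is-shallow-repository)
echo "commits: $commit_count, shallow: $is_shallow"
```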

To update :

git fetch --tags --depth=1 --prune --prune-tags origin $ref  
git reset --hard --recurse-submodules FETCH_HEAD  
git submodule update --init --recursive --force --depth=1 --remote  
git reflog expire --expire=now --all  
git gc --aggressive --prune=now  
[git clean -qfdx]

git fetch --tags --depth=1 --prune --prune-tags origin

tags is used to fetch tags, and must be specified even if a tag is fetched by reference
depth=1 allows only the last commit to be considered
prune deletes local remote-tracking references that no longer exist on the remote
prune-tags not only deletes references in the local remote repository that are no longer accessible, but also deletes local tags that do not exist on the remote.

git reset --hard --recurse-submodules origin/main

git submodule update --init --recursive --force --depth=1 --remote 

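init initializes any submodule that is not yet initialized, by registering its settings from the .gitmodules file in the local configuration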
recursive applies the command to submodules of submodules etc.
force ignores local changes to submodules and automatically checks out the new version
depth=1 allows you to consider only the last submodule commit
remote updates each submodule to the tip of its remote-tracking branch instead of the commit recorded in the parent repository
CAREFUL: order matters here. Running this instruction before git reset would make it ineffective, because of the --recurse-submodules option of git reset. The --recurse-submodules option is nevertheless kept to handle the deletion of a submodule.

git reflog expire --expire=now --all

This command expires all reflog entries immediately, instead of after the default 90 days (30 days for unreachable entries). This lets git gc clean up more. git rev-list lets you check which objects are still reachable and will therefore not be pruned.

git gc --aggressive --prune=now

This command removes unreachable objects and reorganizes the repository to optimize it.
aggressive invokes repack and takes longer. repack undoes and redoes packs, which are compression units.

[git clean -qfdx] if this command is omitted, files created without committing are retained.

This combination does not save any changes made to our repository, apart from the creation of non-committed files if git clean is omitted.
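The update sequence above can be wrapped in a function. This is a minimal sketch, not the project's actual script: the function name update_repo, its arguments and the "clean" keyword are hypothetical.

```shell
#!/usr/bin/env bash
# update_repo <repo_dir> <ref> [clean]
# Hypothetical wrapper around the update sequence; stops at the first failure.
update_repo() {
    local repo=$1 ref=$2 clean=${3:-}

    # Fetch only the tip of the requested ref, then move the work tree to it.
    git -C "$repo" fetch -q --tags --depth=1 --prune --prune-tags origin "$ref" &&
    git -C "$repo" reset -q --hard --recurse-submodules FETCH_HEAD &&
    git -C "$repo" submodule update --init --recursive --force --depth=1 --remote &&
    # Drop everything that is no longer reachable.
    git -C "$repo" reflog expire --expire=now --all &&
    git -C "$repo" gc -q --aggressive --prune=now || return 1

    # Optional step: also remove untracked and ignored files.
    if [ "$clean" = "clean" ]; then
        git -C "$repo" clean -qfdx
    fi
}

# Usage sketch (paths are placeholders):
#   update_repo /path/to/local main          # keeps untracked files
#   update_repo /path/to/local main clean    # removes them
```

Chaining the commands with && keeps the garbage collection from running on a repository whose fetch or reset failed.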

Details

Here is a summary of the different solutions we have explored to reduce the footprint of our Git repository.

Partial vs. shallow cloning

Shallow cloning means not cloning the entire repository history.

A partial clone means not cloning all the files and/or folders in the repository, according to a filter. Filters may concern Binary Large Objects (blobs) or trees. If the filter concerns age, then a partial clone can also be a shallow clone. Partial clones can be created using the git clone --filter command. During checkout or switch operations, objects initially skipped by the --filter clone can be downloaded. In our case, we only want to keep one precise commit, which git clone --filter would let through in any case, so the filter is irrelevant. A reduced working directory can also be obtained with sparse checkout. Some files and/or folders then do not appear at all in the local folder and are not affected by Git porcelain (high-level) operations. Nevertheless, the objects associated with these files and folders are still stored in the .git repository.
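The last point, that sparse checkout hides files without removing their objects, can be observed on a throwaway repository. This is a sketch under assumptions: the directory names kept and hidden, the sandbox and the demo identity are invented for the example, and the git sparse-checkout command requires a reasonably recent Git.

```shell
#!/usr/bin/env bash
set -eu

# Identity used only for the throwaway commit below (assumption for the demo).
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.invalid
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.invalid

sandbox=$(mktemp -d)

# Hypothetical remote with two directories.
git init -q "$sandbox/remote"
git -C "$sandbox/remote" symbolic-ref HEAD refs/heads/main
mkdir "$sandbox/remote/kept" "$sandbox/remote/hidden"
echo "kept" > "$sandbox/remote/kept/file.txt"
echo "hidden" > "$sandbox/remote/hidden/file.txt"
git -C "$sandbox/remote" add .
git -C "$sandbox/remote" commit -q -m "two directories"

# Clone everything, then restrict the working directory to kept/.
git clone -q "file://$sandbox/remote" "$sandbox/local"
git -C "$sandbox/local" sparse-checkout init --cone
git -C "$sandbox/local" sparse-checkout set kept

# hidden/ is gone from the working directory...
ls "$sandbox/local"
# ...but its blob is still readable from the object store.
git -C "$sandbox/local" cat-file -p HEAD:hidden/file.txt
```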

A shallow clone can be created using the --depth= option, which specifies the number of commits to be kept. This option is available for both the clone and fetch commands.

Large file storage

LFS is a Git extension that lets you manipulate selected files (by name, expression or size) using a local cache. In practice, files are replaced by references in the Git repository and a local folder outside the repository is created to store the files. They are downloaded lazily, i.e. only when checked out. All older versions are stored on an online server. This is a very interesting mechanism, which we will not use for the same reason as the --filter clone: we only want to keep one specific version of the files, which would in any case be downloaded by LFS.

Delete history

The git filter-branch command is not recommended by the Git documentation. It has several security and performance flaws. It can be used to rewrite branch history using filters.

The Java-based repo-cleaner tool works, but the Git documentation considers the Python-based git filter-repo tool to be faster and more secure. We do not wish to install either Python or Java, hence we will not dig any deeper into these two possibilities here.

We want to delete the entire history without filtering, so git fetch --depth=1 followed by a git checkout, reset or merge works for us.

checkout ? merge ? reset ?

Once we have fetched the changes into our local remote-tracking references, what is the best way to apply them to our index and working directory? Let us compare four possibilities: git merge -X theirs, git merge -s ours, git reset --hard, and git checkout --force -B. The final results are identical, except for git merge -X theirs.

In the case of git merge, we do not wish to resolve conflicts manually. Remote must always take precedence over local differences.

git merge -X theirs

This command applies the ort strategy which, in the event of a conflict, gives precedence to theirs. However, since we are working with --depth=1, the two branches have no common ancestor, and the --allow-unrelated-histories option must be supplied. The absence of a common ancestor prevents Git from recognizing similarities within the same file. Any modification to a tracked file on ours, even on a new line, will thus cause a conflict and be overwritten. This command does, however, keep newly created and committed files on ours. Newly created uncommitted files are kept unless git clean is run.

Advantage: committed files created on ours are kept.

Disadvantage: a file deleted on theirs that already existed on ours will not be deleted on ours.
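The behaviour described for git merge -X theirs can be reproduced on a throwaway repository. This is a sketch under assumptions: the sandbox, the file names shared.txt and ours.txt and the demo identity are invented for the example.

```shell
#!/usr/bin/env bash
set -eu

# Identity used only for the throwaway commits below (assumption for the demo).
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.invalid
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.invalid

sandbox=$(mktemp -d)
remote="$sandbox/remote"
local_repo="$sandbox/local"

git init -q "$remote"
git -C "$remote" symbolic-ref HEAD refs/heads/main
echo "v1" > "$remote/shared.txt"
git -C "$remote" add shared.txt
git -C "$remote" commit -q -m "initial"

git clone -q --depth=1 "file://$remote" "$local_repo"

# ours: edit the shared file and commit a new file.
echo "local edit" >> "$local_repo/shared.txt"
echo "local only" > "$local_repo/ours.txt"
git -C "$local_repo" add .
git -C "$local_repo" commit -q -m "local work"

# theirs: a new version of the shared file.
echo "v2" > "$remote/shared.txt"
git -C "$remote" commit -q -am "remote update"

git -C "$local_repo" fetch -q --depth=1 origin main

# Without --allow-unrelated-histories, the merge is refused.
if git -C "$local_repo" merge -q --no-edit -X theirs FETCH_HEAD 2>/dev/null; then
    refused=no
else
    refused=yes
fi

# With the option, conflicts are resolved in favour of theirs.
git -C "$local_repo" merge -q --no-edit -X theirs \
    --allow-unrelated-histories FETCH_HEAD

echo "merge refused without the option: $refused"
cat "$local_repo/shared.txt"
ls "$local_repo"
```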

git merge -s ours

[Caution: the notions of theirs and ours are reversed here, as git merge -s theirs does not exist.] This command applies the ours strategy, which gives precedence to ours whether there is a conflict or not. It will ignore all changes and file creations committed to theirs. It will also ignore uncommitted modifications. Uncommitted file creations are retained unless git clean is run. This is the same result as with the git reset --hard command.

As the git merge -s theirs option does not exist, a small workaround is needed:

#we want to merge origin/main into main, giving precedence to origin/main  
#create a temporary branch temp, based on origin/main, and check it out  
git switch -c temp origin/main  
#merge main into temp, giving precedence to temp, which is identical to origin/main  
git merge -s ours --allow-unrelated-histories main  
#return to main  
git checkout main  
#merge temp into main  
git merge --allow-unrelated-histories temp  
#delete temp  
git branch -D temp

Advantage:

Disadvantage: creation of a temporary branch.
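For scripted use, the temporary-branch manipulation can be sketched as a function that stops at the first failing step. The function name merge_prefer_remote and its arguments are hypothetical; --no-edit is added so that no editor is opened for the merge message.

```shell
#!/usr/bin/env bash
# merge_prefer_remote <repo_dir> <remote_ref> <local_branch>
# Hypothetical helper: makes <local_branch> end up with the content of
# <remote_ref>, going through a temporary branch named temp.
merge_prefer_remote() {
    local repo=$1 remote_ref=$2 local_branch=$3 tmp=temp

    # Temporary branch starting at the remote ref.
    git -C "$repo" switch -q -c "$tmp" "$remote_ref" &&
    # Record the local branch as merged while keeping temp's content.
    git -C "$repo" merge -q --no-edit -s ours \
        --allow-unrelated-histories "$local_branch" &&
    # Back on the local branch, take the result of that merge.
    git -C "$repo" checkout -q "$local_branch" &&
    git -C "$repo" merge -q --no-edit --allow-unrelated-histories "$tmp" &&
    git -C "$repo" branch -q -D "$tmp"
}

# Usage sketch (path is a placeholder), after a git fetch --depth=1 origin main:
#   merge_prefer_remote /path/to/local origin/main main
```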

git checkout --force -B main origin/main

This command is equivalent to git merge -s ours and git reset --hard, with the difference that you end up in detached HEAD state, which does not cause any problem in our case since we do not want to push any changes from our repository. Advantage: Disadvantage: detached HEAD state.

git reset --hard

git reset --hard is equivalent to git merge -s ours and git checkout --force -B. Advantage: Disadvantage:

Tests show that the most memory-efficient options are git checkout --force -B, git merge -s ours and git reset --hard, which all do the same thing. However, git reset --hard does not involve the creation of a temporary branch and does not end in detached HEAD state, hence it is the one we choose.

Submodule management

Submodules are initially cloned using git clone --recurse-submodules --remote-submodules. They are updated using git submodule update --init --recursive --force --depth=1 --remote. git reset --hard must be supplied with the --recurse-submodules option in order to delete submodules from the working directory. The same rules apply to submodules as to the rest of the repository. In the .gitmodules file, it is possible to specify rules for importing submodules, such as a certain tag or branch. By removing --remote-submodules from git clone and --remote from git submodule update, submodules will be identical to those recorded in the repository being cloned, and no longer follow the original submodule repository.
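As an illustration of a rule stored in .gitmodules, the branch followed by git submodule update --remote can be set with git config. This is a minimal sketch: the function name set_submodule_branch and the submodule name used in the usage line are hypothetical.

```shell
#!/usr/bin/env bash
# set_submodule_branch <gitmodules_file> <submodule_name> <branch>
# Hypothetical helper: records which remote branch a submodule should follow.
set_submodule_branch() {
    git config -f "$1" "submodule.$2.branch" "$3"
}

# Usage sketch, run from the top of the parent repository:
#   set_submodule_branch .gitmodules my_submodule stable
#   git config -f .gitmodules --get submodule.my_submodule.branch
```

The change to .gitmodules still has to be committed in the parent repository for other clones to pick it up.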

Tests

Script description

README extract

The script consists of thirty tests, numbered 0 to 29 (listed in the results below), based on three functions: generate_random_file, get_storage_used and get_bandwidth.

generate_random_file uses the bash command dd and /dev/random. get_storage_used uses the bash command du. get_bandwidth retrieves the output of Git commands and extracts the traffic displayed. This does not take submodule traffic into account.
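The three helpers could look like the sketch below. These are not the functions from performance_tests.sh: the argument conventions are assumptions, /dev/urandom is used in place of /dev/random so that dd never blocks, and get_bandwidth assumes it receives on standard input the progress text Git prints when called with --progress.

```shell
#!/usr/bin/env bash
# generate_random_file <path> <size_in_KiB>
generate_random_file() {
    dd if=/dev/urandom of="$1" bs=1024 count="$2" 2>/dev/null
}

# get_storage_used <directory>: prints the size in KiB, as reported by du.
get_storage_used() {
    du -sk "$1" | cut -f1
}

# get_bandwidth: reads Git progress output on stdin and prints the volume
# shown on the final "Receiving objects" line, for example "8.49 MiB".
# Prints nothing when Git did not report a volume.
get_bandwidth() {
    tr '\r' '\n' |
        sed -n 's#.*Receiving objects: 100% ([0-9]*/[0-9]*), \([0-9.][0-9.]* [A-Za-z][A-Za-z]*\) |.*#\1#p' |
        tail -n 1
}

# Usage sketch (URL and path are placeholders):
#   git clone --progress --depth=1 "$url" "$dir" 2>&1 | get_bandwidth
```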

The first six tests (0 to 5) concern cloning. The following tests involve updating the repository using different commands, with three cases for each command: after adding a file, after deleting a file, after adding then deleting a file.

Help extract

NAME  
performance_tests.sh  
SYNOPSIS  
performance_tests.sh [-a] [-h] [-n number]  
OPTIONS  
-a executes all the tests.  
-n number executes test number.  
-c cleans.  
-h prints the help.
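Options of this kind are commonly parsed with the getopts builtin. The following is a sketch of how that could be done, not the actual code of performance_tests.sh; the function and variable names are hypothetical.

```shell
#!/usr/bin/env bash
usage() {
    echo "performance_tests.sh [-a] [-c] [-h] [-n number]"
}

# parse_options "$@": sets run_all, clean and test_number.
# Returns 1 on an unknown option or a missing argument to -n.
parse_options() {
    run_all=false
    clean=false
    test_number=""
    local opt OPTIND=1
    while getopts "achn:" opt; do
        case $opt in
            a) run_all=true ;;
            c) clean=true ;;
            n) test_number=$OPTARG ;;
            h) usage; return 0 ;;
            *) usage >&2; return 1 ;;
        esac
    done
}

# Usage sketch:
#   parse_options "$@" || exit 1
```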

Results

**Tests on the initial populating of the repository**  
============================================================= TEST0  
TEST 0: classic cloning.  
memory usage: 22668  
bandwidth usage (submodule excluded):  8.49 MiB   
============================================================= TEST1   
TEST 1: --single-branch cloning.  
memory usage: 22168  
bandwidth usage (submodule excluded):  8.00 MiB   
============================================================= TEST2  
TEST 2: --depth=1 --no-single-branch  
memory usage: 17552  
bandwidth usage (submodule excluded):  3.49 MiB   
============================================================= TEST3  
TEST 3: --depth=1 with single-branch (default)  
memory usage: 17052  
bandwidth usage (submodule excluded):  3.00 MiB   
============================================================= TEST4   
TEST 4: --depth=1 with single-branch (default) and reflog and gc  
HEAD is now at 23700cf adding submodule_for_performance_testing module  
memory usage: 17056  
bandwidth usage (submodule excluded):  3.00 MiB   
============================================================= TEST5   
TEST 5 : sparse-checking only sample0 with depth=1  
memory usage: 10060  
bandwidth usage (submodule excluded): unknown  
**Tests on the updating of the repository**  
**classic fetching+checking out** 
============================================================= TEST6   
TEST 6: after addition of a 1M file  
memory usage: +2108  
============================================================= TEST7   
TEST 7: after removal of a 1M file  
memory usage: -972  
============================================================= TEST8   
TEST 8: after addition then removal of a 1M file  
memory usage: 1088  
**fetching+checking out with --depth=1**  
============================================================= TEST9   
TEST 9: after addition of a 1M file  
memory usage: +2112  
============================================================= TEST10   
TEST 10: after removal of a 1M file  
memory usage: -968  
============================================================= TEST11   
TEST 11: after addition then removal of a 1M file  
memory usage: 48  
**--depth=1 fetching+checking out reflog and gc**  
============================================================= TEST12   
TEST 12: after addition of a 1M file  
memory usage: +2052  
============================================================= TEST13   
TEST 13: after removal of a 1M file  
memory usage: -1020  
============================================================= TEST14   
TEST 14: after addition then removal of a 1M file  
memory usage: 4  
**--depth=1 fetching+ reset --hard**  
============================================================= TEST15   
TEST 15: after addition of a 1M file  
memory usage: +2116  
============================================================= TEST16   
TEST 16: after removal of a 1M file  
memory usage: -964  
============================================================= TEST17   
TEST 17: after addition then removal of a 1M file  
memory usage: 52  
**--depth=1 fetching+ reset --hard and reflog and gc**  
============================================================= TEST18   
TEST 18: after addition of a 1M file  
memory usage: 2056  
============================================================= TEST19   
TEST 19: after removal of a 1M file  
memory usage: -1016  
============================================================= TEST20   
TEST 20: after addition then removal of a 1M file  
memory usage: 8  
**--depth=1 fetching+checking out after modification applied in submodule**  
============================================================= TEST21   
TEST 21: after addition of a 1M file  
memory usage: 2112  
============================================================= TEST22   
TEST 22: after removal of a 1M file  
memory usage: -976  
============================================================= TEST23   
TEST 23: after addition then removal of a 1M file  
memory usage: 48  
**--depth=1 fetching+merging -X theirs with reflog and gc**  
============================================================= TEST24   
TEST 24: after addition of a 1M file  
memory usage: +2056  
============================================================= TEST25   
TEST 25: after removal of a 1M file  
memory usage: 8  
============================================================= TEST26   
TEST 26: after addition then removal of a 1M file  
memory usage: 8  
**--depth=1 fetching+merging -s ours with reflog and gc**  
============================================================= TEST27   
TEST 27: after addition of a 1M file  
memory usage: +2056  
============================================================= TEST28   
TEST 28: after removal of a 1M file  
memory usage: -1016  
============================================================= TEST29    
TEST 29: after addition then removal of a 1M file  
memory usage: 8
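
To put the cloning figures in perspective, the relative saving between TEST 0 (classic cloning) and TEST 3 (--depth=1 with single-branch) can be computed as below. This is a small sketch: the function name percent_saved is hypothetical, and it only assumes that the two values compared share the same unit.

```shell
#!/usr/bin/env bash
# percent_saved <full> <reduced>: relative saving, in percent, one decimal.
percent_saved() {
    LC_ALL=C awk -v full="$1" -v reduced="$2" \
        'BEGIN { printf "%.1f\n", (full - reduced) / full * 100 }'
}

# Storage: TEST 0 (22668) against TEST 3 (17052)
percent_saved 22668 17052   # prints 24.8
# Bandwidth: TEST 0 (8.49 MiB) against TEST 3 (3.00 MiB)
percent_saved 8.49 3.00     # prints 64.7
```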