# How to minimize the memory and data flow consumption of Git cloning?
# Background
Jean-Cloud is a small association providing hosting services on second-hand hardware. It is currently launching the Shlagernetes project, a piece of software that distributes and manages services across several second-hand servers. In some cases, Git is used to install or update a service on a server.

# Objective
The objective is to obtain the latest version (or a specific version) of a Git repository while using as few resources as possible. By resources, we mean both the data transferred from the remote to the local folder and the disk space occupied by the repository on the local server.

The resulting Git repository will never send any data to the remote. It has access to tags but not to history. It may keep some local untracked files in addition to its Git clones. It includes submodules if present. It can download either the latest commit on the main branch (the default) or the commit pointed to by a given reference, i.e. a branch or a tag.

# Procedure
Tests of various commands were carried out on a dummy repository. The test file is portable and can be downloaded here. Note that to run the tests locally, you need to authorize the protocol for local files: `git config --global protocol.file.allow always`. This is not the default configuration, as it may represent a security vulnerability.

The tests consist of analyzing the disk space taken up by the local repository using the bash command `du`, as well as analyzing the text produced by Git during cloning.

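For instance, the footprint of a clone can be inspected as follows; the folder name is only an illustration:

```
# Total size of the working tree plus Git metadata
du -sh my-service/
# Size of the Git metadata alone
du -sh my-service/.git/
```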
# Final results
The final chosen combination is:
## To clone:
`git clone --depth=1 --recurse-submodules --shallow-submodules --remote-submodules`
- `--depth=1` clones only the last commit along with the objects it needs. By default, a shallow clone is also single-branch.
- `--recurse-submodules` ensures that the contents of submodules are cloned as well.
- `--remote-submodules` ensures that submodule content is cloned from the submodule's own remote.
- `--shallow-submodules` ensures that only the latest commit of each submodule is imported (for this to work locally, prefix the submodule path with `file://`).

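For example, to deploy a specific tag instead of the tip of the main branch, the same options can be combined with `--branch`; the URL and tag below are placeholders:

```
git clone --depth=1 --branch v1.2.0 \
    --recurse-submodules --shallow-submodules --remote-submodules \
    https://example.org/my-service.git my-service
```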
## To update:
```
git fetch --tags --depth=1 --prune --prune-tags origin $ref
git reset --hard --recurse-submodules FETCH_HEAD
git submodule update --init --recursive --force --depth=1 --remote
git reflog expire --expire=now --all
git gc --aggressive --prune=now
[git clean -qfdx]
```
- `git fetch --tags --depth=1 --prune --prune-tags origin $ref`
  - `--tags` fetches tags, and must be specified even if a tag is fetched by reference.
  - `--depth=1` keeps only the last commit.
  - `--prune` deletes remote-tracking references in the local repository that no longer exist on the remote.
  - `--prune-tags` additionally deletes local tags that no longer exist on the remote.
- `git reset --hard --recurse-submodules FETCH_HEAD`
  - resets the index and working directory, including submodules, to the commit that has just been fetched.
- `git submodule update --init --recursive --force --depth=1 --remote`
  - `--init` initializes any submodules registered in .gitmodules that are not yet initialized.
  - `--recursive` applies the command to nested submodules (submodules of submodules, and so on).
  - `--force` discards local changes in submodules and automatically checks out the new version.
  - `--depth=1` keeps only the last commit of each submodule.
  - `--remote` updates each submodule from its own remote.
  - CAREFUL: order matters here. Running this command before the `git reset` would make it ineffective, because of the `--recurse-submodules` option of `git reset`. That option is nevertheless kept, as it handles the case where a submodule has been deleted.
- `git reflog expire --expire=now --all`
  - marks all reflog entries as expired immediately instead of after the default 90 days, which allows `git gc` to clean up more objects. `git rev-list` can be used to check which objects are still reachable and will therefore not be pruned.
- `git gc --aggressive --prune=now`
  - removes unreachable objects and reorganizes the repository to optimize it. `--aggressive` invokes a more thorough repack and takes longer; a repack rebuilds the pack files, which are Git's compression units.
- `[git clean -qfdx]`
  - if this command is omitted, untracked files (files created but never committed) are retained.

Apart from untracked files when `git clean` is omitted, which is the case in git_update.sh, this combination discards any changes made to our repository.

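A minimal sketch of how these commands can be chained, in the spirit of git_update.sh (the repository path argument and the default reference are assumptions made for illustration):

```
#!/usr/bin/env bash
# Update an existing shallow clone to the requested reference.
set -euo pipefail

repo_dir="$1"      # path to the local clone (placeholder)
ref="${2:-main}"   # branch or tag to deploy (placeholder default)

cd "$repo_dir"
git fetch --tags --depth=1 --prune --prune-tags origin "$ref"
git reset --hard --recurse-submodules FETCH_HEAD
git submodule update --init --recursive --force --depth=1 --remote
git reflog expire --expire=now --all
git gc --aggressive --prune=now
# git clean -qfdx   # optional: also remove untracked files
```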
# Details
Here is a summary of the different solutions we have explored to reduce the footprint of our Git repository.
## Partial vs. shallow cloning
Shallow cloning means not cloning the entire repository history.
A partial clone means not cloning all the files and/or folders in the repository, according to a filter. Filters may concern Binary Large Objects (blobs) or trees. If the filter concerns age, then a partial clone can also be a shallow clone.

Partial clones can be created using the `git clone --filter` command.
During checkout or switch operations, objects initially excluded by the filter can still be downloaded. In our case, we only want to keep one specific commit, whose objects would in any case be let through by `git clone --filter`, which is therefore of no use to us.
Partial clones can also be created through sparse checkout (`git sparse-checkout`). Some files and/or folders then do not appear at all in the local folder and are not affected by Git porcelain (user-facing) operations. Nevertheless, the objects associated with these files and folders are still stored in the .git directory.

A shallow clone can be created using the `--depth=<number>` option, which specifies the number of commits to keep. This option is available for both the clone and fetch commands.

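For illustration, here is how the three approaches compare on the command line; the repository URL, folder name and `sample0` path are placeholders:

```
# Partial clone: blobs are downloaded only when needed
git clone --filter=blob:none https://example.org/repo.git
# Shallow clone: only the latest commit is downloaded
git clone --depth=1 https://example.org/repo.git
# Sparse checkout: only the sample0 folder appears in the working tree
git clone --no-checkout https://example.org/repo.git && cd repo
git sparse-checkout set sample0
git checkout main
```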
## Large file storage
LFS (Large File Storage) is a Git extension that lets you handle selected files (chosen by name, pattern or size) through a local cache. In practice, the files are replaced by pointer references in the Git repository, and their actual contents are stored in a local cache and downloaded lazily, i.e. only when checked out. All older versions are stored on an online server.

This is a very interesting mechanism, but we will not use it, for the same reason as the `--filter` clone: we only want to keep one specific version of the files, and that version would in any case be downloaded by LFS.

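For context, tracking files with LFS typically looks like the following sketch; the pattern and file name are placeholders:

```
# Install the LFS hooks once per machine
git lfs install
# Track ISO images through LFS; the rule is recorded in .gitattributes
git lfs track "*.iso"
git add .gitattributes big-image.iso
git commit -m "Store big-image.iso through LFS"
```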
## Delete history
The `git filter-branch` command is not recommended by the Git documentation: it has several safety and performance pitfalls. It can be used to rewrite branch history using filters.
The Java repo-cleaner tool works, but the Git documentation considers the Python `git filter-repo` tool to be faster and safer. We do not wish to install either Python or Java, so we will not dig any deeper into these two options here.
We want to delete the entire history without any filtering, so `git fetch --depth=1` followed by a `git checkout`, `git reset` or `git merge` works for us.

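A minimal sketch of this history-dropping update, assuming the branch of interest is main:

```
# Fetch only the latest commit of main, then discard the local history
git fetch --depth=1 origin main
git reset --hard FETCH_HEAD
# Expire the reflog and garbage-collect so that the old objects are actually freed
git reflog expire --expire=now --all
git gc --prune=now
```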
## checkout? merge? reset?
Once we have fetched the changes into our local remote-tracking references (the `remotes/` folder), what is the best way to apply them to our index and working directory?

Let us compare 4 possibilities: `git merge -X`, `git merge -s`, `git reset --hard`, `git checkout -f -B`. The final results are identical, except for `git merge -X`.

In the case of `git merge`, we do not wish to resolve conflicts manually: the remote must always take precedence over local differences.
### `git merge -X theirs`
This command applies the ort strategy which, in the event of a conflict, gives precedence to theirs.

However, since we are working in `--depth=1`, the two branches have no common ancestor, and the `--allow-unrelated-histories` option must be supplied. The absence of a common ancestor prevents Git from recognizing similarities within the same file. Any modification to a tracked file on ours, even on a new line, will thus cause a conflict and be overwritten. This command does, however, save newly created and committed files on ours.
Newly created uncommitted files are kept unless `git clean` is run.

Advantage: committed files created on ours are saved.
Disadvantage: if a file that already existed on ours is deleted on theirs, it will not be deleted on ours.

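Under `--depth=1`, this variant would look roughly as follows, assuming main is the branch being tracked:

```
git fetch --depth=1 origin main
# Unrelated histories: every conflict is resolved in favour of the remote side
git merge -X theirs --allow-unrelated-histories origin/main
```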
### `git merge -s ours`
[Caution: the notions of theirs and ours are reversed here, as `git merge -s theirs` does not exist.]
This command applies the ours strategy, which gives precedence to ours whether there is a conflict or not. It will ignore all changes and file creations committed on theirs. It will also ignore uncommitted modifications. Uncommitted file creations are retained unless `git clean` is run. This is the same result as with the `git reset --hard` command.
As `git merge -s theirs` does not exist, we need to do a little manipulation:
```
# we want to merge origin/main into main, giving precedence to origin/main
# create a temporary branch temp, based on origin/main, and check it out
git switch -c temp origin/main
# merge main into temp, giving precedence to temp, which is identical to origin/main
git merge -s ours --allow-unrelated-histories main
# return to main
git checkout main
# merge temp into main
git merge --allow-unrelated-histories temp
# delete temp
git branch -D temp
```
Advantage:
Disadvantage: creation of a temporary branch.
### `git checkout --force -B main origin/main`
This command is equivalent to `git merge -s ours` and `git reset --hard`, with the difference that you end up in a detached HEAD state, which does not cause any problem in our case since we do not want to push any changes from our repository.

Advantage:
Disadvantage: detached HEAD state.
### `git reset --hard`
`git reset --hard` is equivalent to `git merge -s ours` and `git checkout --force -B`.

Advantage:
Disadvantage:

Tests show that the most memory-efficient options are `git checkout --force -B`, `git merge -s ours` and `git reset --hard`, which all do the same thing. However, `git reset --hard` does not involve the creation of a temporary branch and does not end in a detached HEAD state, hence it is the one we choose.
## Submodule management
Submodules are initially cloned using `git clone --recurse-submodules --remote-submodules`.
They are updated using `git submodule update --init --recursive --force --depth=1 --remote`.
`git reset --hard` must be supplied with the `--recurse-submodules` option in order to remove deleted submodules from the working directory.
The same rules apply to submodules as to the rest of the repository. In the .gitmodules file, it is possible to specify rules for importing submodules, such as a branch to track, as shown in the sketch below. By removing `--remote-submodules` from `git clone` and `--remote` from `git submodule update`, submodules will match the commit recorded in the repository being cloned rather than the latest commit of the original submodule repository.

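As an illustrative sketch, a submodule can be told to track a branch so that the `--remote` option follows it; the submodule name and branch are assumptions here:

```
# Record the branch to track in .gitmodules
git config -f .gitmodules submodule.submodule_for_performance_testing.branch main
# Update the submodule from its own remote, following that branch
git submodule update --init --recursive --force --depth=1 --remote
```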
## Tests
### Script description
### README extract
```
The script consists of twenty-nine tests (listed in the results below), based on three functions: generate_random_file, get_storage_used and get_bandwidth.
generate_random_file uses the bash command dd and /dev/random.
get_storage_used uses the bash command du.
get_bandwidth retrieves the output of Git commands and extracts the traffic displayed. This does not take submodule traffic into account.
The first five tests concern cloning.
The following tests involve updating the repository using different commands, with three cases for each command: after adding a file, after deleting a file, after adding then deleting a file.
```
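As a rough sketch, the first two helper functions could look like this (the exact options are assumptions, since only the underlying commands are described above):

```
# generate_random_file <path> <size_in_MiB>: create a file filled with random data
generate_random_file() {
    dd if=/dev/random of="$1" bs=1M count="$2" status=none
}

# get_storage_used <dir>: total disk usage of a directory, in KiB
get_storage_used() {
    du -sk "$1" | cut -f1
}
```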
### Help extract
```
NAME
    performance_tests.sh
SYNOPSIS
    performance_tests.sh [-a] [-h] [-n number]
OPTIONS
    -a          executes all the tests.
    -n number   executes test number.
    -c          cleans.
    -h          prints the help.
```
### Results

Memory usage figures are the raw output of the `du` command; bandwidth figures are the traffic reported by Git, submodules excluded.
```
**Tests on the initial populating of the repository**
============================================================= TEST0
TEST 0: classic cloning.
memory usage: 22668
bandwidth usage (submodule excluded): 8.49 MiB
============================================================= TEST1
TEST 1: --single-branch cloning.
memory usage: 22168
bandwidth usage (submodule excluded): 8.00 MiB
============================================================= TEST2
TEST 2: --depth=1 --no-single-branch
memory usage: 17552
bandwidth usage (submodule excluded): 3.49 MiB
============================================================= TEST3
TEST 3: --depth=1 with single-branch (default)
memory usage: 17052
bandwidth usage (submodule excluded): 3.00 MiB
============================================================= TEST4
TEST 4: --depth=1 with single-branch (default) and reflog and gc
HEAD is now at 23700cf adding submodule_for_performance_testing module
memory usage: 17056
bandwidth usage (submodule excluded): 3.00 MiB
============================================================= TEST5
TEST 5 : sparse-checking only sample0 with depth=1
memory usage: 10060
bandwidth usage (submodule excluded): unknown
**Tests on the updating of the repository**
**classic fetching+checking out**
============================================================= TEST6
TEST 6: after addition of a 1M file
memory usage: +2108
============================================================= TEST7
TEST 7: after removal of a 1M file
memory usage: -972
============================================================= TEST8
TEST 8: after addition then removal of a 1M file
memory usage: 1088
**fetching+checking out with --depth=1**
============================================================= TEST9
TEST 9: after addition of a 1M file
memory usage: +2112
============================================================= TEST10
TEST 10: after removal of a 1M file
memory usage: -968
============================================================= TEST11
TEST 11: after addition then removal of a 1M file
memory usage: 48
**--depth=1 fetching+checking out reflog and gc**
============================================================= TEST12
TEST 12: after addition of a 1M file
memory usage: +2052
============================================================= TEST13
TEST 13: after removal of a 1M file
memory usage: -1020
============================================================= TEST14
TEST 14: after addition then removal of a 1M file
memory usage: 4
**--depth=1 fetching+ reset --hard**
============================================================= TEST15
TEST 15: after addition of a 1M file
memory usage: +2116
============================================================= TEST16
TEST 16: after removal of a 1M file
memory usage: -964
============================================================= TEST17
TEST 17: after addition then removal of a 1M file
memory usage: 52
**--depth=1 fetching+ reset --hard and reflog and gc**
============================================================= TEST18
TEST 18: after addition of a 1M file
memory usage: 2056
============================================================= TEST19
TEST 19: after removal of a 1M file
memory usage: -1016
============================================================= TEST20
TEST 20: after addition then removal of a 1M file
memory usage: 8
**--depth=1 fetching+checking out after modification applied in submodule**
============================================================= TEST21
TEST 21: after addition of a 1M file
memory usage: 2112
============================================================= TEST22
TEST 22: after removal of a 1M file
memory usage: -976
============================================================= TEST23
TEST 23: after addition then removal of a 1M file
memory usage: 48
**--depth=1 fetching+merging -X theirs with reflog and gc**
============================================================= TEST24
TEST 24: after addition of a 1M file
memory usage: +2056
============================================================= TEST25
TEST 25: after removal of a 1M file
memory usage: 8
============================================================= TEST26
TEST 26: after addition then removal of a 1M file
memory usage: 8
**--depth=1 fetching+merging -s ours with reflog and gc**
============================================================= TEST27
TEST 27: after addition of a 1M file
memory usage: +2056
============================================================= TEST28
TEST 28: after removal of a 1M file
memory usage: -1016
============================================================= TEST29
TEST 29: after addition then removal of a 1M file
memory usage: 8
```