diff --git a/doc/development_explanations.md b/doc/development_explanations.md
index a5b5c79..812da4d 100644
--- a/doc/development_explanations.md
+++ b/doc/development_explanations.md
@@ -2,87 +2,96 @@ How to minimize the memory and data flow consumption of Git cloning?
 # Background
- Jean-Cloud is a small association providing hosting services on second-hand hardware. It is currently launching the Shlagernetes project, a software that enables services to be distributed and managed across several second-hand servers. Git is used in certain cases to install a service on a server or update it.
+Jean-Cloud is a small association providing hosting services on second-hand hardware. It is currently launching the Shlagernetes project, a software that enables services to be distributed and managed across several second-hand servers. Git is used in certain cases to install a service on a server or update it.
 # Objective
- The objective is to obtain the latest version (or a specific version) of a git repository, using as few resources as possible. By resources, we mean the data flow from the remote to the local folder, as well as the memory space occupied by the repository on the local server.
+The objective is to obtain the latest version (or a specific version) of a git repository, using as few resources as possible. By resources, we mean the data flow from the remote to the local folder, as well as the memory space occupied by the repository on the local server.
- The created Git repository will not send any data to the remote. It has access to tags but not history. It can keep some local untracked files in addition to its Git clones. It includes submodules if present. It can either download the last main commit (default) or a commit from a certain reference, i.e. branch or tag.
+The created Git repository will not send any data to the remote. It has access to tags but not history. It can keep some local untracked files in addition to its Git clones. It includes submodules if present. It can either download the last main commit (default) or a commit from a certain reference, i.e. branch or tag.
 # Procedure
- Tests on various commands were carried out on a dummy repository. The test file is transportable and can be downloaded here. Note that to run locally, you need to authorize the protocol for local files: git config --global protocol.file.allow always. This is not the default configuration, as it may represent a security vulnerability.
+Tests on various commands were carried out on a dummy repository. The test file is transportable and can be downloaded here. Note that to run locally, you need to authorize the protocol for local files: `git config --global protocol.file.allow always`. This is not the default configuration, as it may represent a security vulnerability.
- The tests consist in analyzing the memory space taken up by the local repository using the bash command "du", as well as analyzing the text produced by Git during cloning.
+The tests consist in analyzing the memory space taken up by the local repository using the bash command `du`, as well as analyzing the text produced by Git during cloning.
 # Final results
-The final chosen combination is :
-## To clone :
-git clone --depth=1 --recurse-submodules --remote-submodules
- depth=1 allows you to clone only the last commit along with the necessary objects. By default, it is single-branch.
- recurse-submodules ensures that the contents of submodules are cloned
- remote-submodules ensures submodule content is cloned from the original remote submodule
- shallow-submodules ensures that only the latest submodule commit is imported (for this to work locally, specify ://file/ before the submodule path)
+The final chosen combination is :
+
+## To clone :
+
+`git clone --depth=1 --recurse-submodules --remote-submodules`
+
+- depth=1 allows you to clone only the last commit along with the necessary objects. By default, it is single-branch.
+- recurse-submodules ensures that the contents of submodules are cloned
+- remote-submodules ensures submodule content is cloned from the original remote submodule
+- shallow-submodules ensures that only the latest submodule commit is imported (for this to work locally, specify ://file/ before the submodule path)
 ## To update :
+
+```
 git fetch --tags --depth=1 --prune --prune-tags origin $ref
 git reset --hard --recurse-submodules FETCH_HEAD
 git submodule update --init --recursive --force --depth=1 --remote
 git reflog expire --expire=now --all
 git gc --aggressive --prune=now
 [git clean -qfdx]
+```
+- git fetch --tags --depth=1 --prune --prune-tags origin
- git fetch --tags --depth=1 --prune --prune-tags origin
+tags is used to fetch tags, and must be specified even if a tag is fetched by reference
+depth=1 allows only the last commit to be considered
+prune deletes references that are no longer accessible from the local remote folder
+prune-tags not only deletes references in the local remote repository that are no longer accessible, but also deletes local tags that do not exist on the remote.
- tags is used to fetch tags, and must be specified even if a tag is fetched by reference
- depth=1 allows only the last commit to be considered
- prune deletes references that are no longer accessible from the local remote folder
- prune-tags not only deletes references in the local remote repository that are no longer accessible, but also deletes local tags that do not exist on the remote.
-
- git reset --hard --recurse-submodules origin/main
+- `git reset --hard --recurse-submodules origin/main`
- git submodule update --init --recursive --force --depth=1 --remote
+- `git submodule update --init --recursive --force --depth=1 --remote`
- init updates the .gitmodules file
- recursive applies the command to submodules of submodules etc.
- force ignores local changes to submodules and automatically checks out the new version
- depth=1 allows you to consider only the last submodule commit
- remote updates from the original remote submodule
- CAREFUL: order does matter here. Using this instruction first would make it ineffective because of the --recurse-submodules of the git reset. This option is yet kept to deal with the case of deletion of a submodule.
+init updates the .gitmodules file
+recursive applies the command to submodules of submodules etc.
+force ignores local changes to submodules and automatically checks out the new version
+depth=1 allows you to consider only the last submodule commit
+remote updates from the original remote submodule
+CAREFUL: order does matter here. Using this instruction first would make it ineffective because of the `--recurse-submodules` of the `git reset`. This option is yet kept to deal with the case of deletion of a submodule.
- git reflog expire --expire=now --all
+- `git reflog expire --expire=now --all`
- this command marks all isolated reflogs as expired immediately instead of 90 days later. This makes for a bigger git gc clean up. git rev-list allows you to check which objects are linked and will not be marked as expired.
+this command marks all isolated reflogs as expired immediately instead of 90 days later. This makes for a bigger git gc clean up. git rev-list allows you to check which objects are linked and will not be marked as expired.
- git gc --aggressive --prune=now
+- `git gc --aggressive --prune=now`
- this command removes unrelated references and reorganizes the repository to optimize it.
- aggressive invokes repack and takes longer. repack undoes and redoes packs, which are compression units.
-
- [git clean -qfdx] if this command is omitted, files created without committing are retained.
+this command removes unrelated references and reorganizes the repository to optimize it.
+aggressive invokes repack and takes longer. repack undoes and redoes packs, which are compression units.
-This combination does not save any changes made to our repository, apart from the creation of non-committed files if git clean is omitted.
+- `[git clean -qfdx]`
+if this command is omitted, files created without committing are retained.
+
+This combination does not save any changes made to our repository, apart from the creation of non-committed files if git clean is omitted, which is the case in git_update.sh.
 # Details
+
 Here is a summary of the different solutions we have explored to reduce the footprint of our Git repository.
+
 ## Partial vs. shallow cloning
+
 Shallow cloning means not cloning the entire repository history. A partial clone means not cloning all the files and/or folders in the repository, according to a filter. Filters may concern Binary Large Objects (blobs) or trees. If the filter concerns age, then a partial clone can also be a shallow clone.
-Partial clones can be created using the git clone --filter command.
-During check-out or switch operations, objects initially ignored by the --filter clone can be imported. In our case, we only want to keep one precise commit, which will in any case be let through by git clone --filter which is therefore irrelevant.
-Partial clones can also be created by sparse-checking. Some files and/or folders then do not appear at all in the local folder and are not affected by git porcelain (surface) operations. Nevertheless, the objects associated with these files and folders are still stored in the .git repository.
+Partial clones can be created using the `git clone --filter command`.
+During check-out or switch operations, objects initially ignored by the `--filter` can be imported. In our case, we only want to keep one precise commit, which will in any case be let through by `git clone --filter` which is therefore irrelevant.
+Partial clones can also be created by `sparse-checking`. Some files and/or folders then do not appear at all in the local folder and are not affected by git porcelain (surface) operations. Nevertheless, the objects associated with these files and folders are still stored in the .git repository.
-A surface clone can be created using the depth= option, which specifies the number of commits to be kept. This option is available for both the clone and fetch commands.
+A surface clone can be created using the `depth=` option, which specifies the number of commits to be kept. This option is available for both the clone and fetch commands.
 ## Large file storage
 LFS is a Git extension that lets you manipulate selected files (by name, expression or size) using a local cache. In practice, files are replaced by references in the Git repository and a local folder outside the repository is created to store the files. They are downloaded lazily, i.e. only when checked out. All older versions are stored on an online server.
-This is a very interesting mechanism, which we will not use for the same reason as the --filter clone: we only want to keep one specific version of the files, which would in any case be downloaded by LFS.
+This is a very interesting mechanism, which we will not use for the same reason as the `--filter` clone: we only want to keep one specific version of the files, which would in any case be downloaded by LFS.
 ## Delete history
 The git filter-branch command is not recommended by the Git documentation. It has several security and performance flaws. It can be used to rewrite branch history using filters.
@@ -93,21 +102,22 @@ We want to delete the entire history without filtering, so the git command fetch
 ## checkout ? merge ? reset ?
 Once we have fetched the changes to our local remote/ folder, what is the best way to apply them to our index and working directory?
-Let us compare 4 possibilities: git merge -X, git merge -s, git reset --hard, git checkout -f -B. The final results are identical, except for git merge -X.
+Let us compare 4 possibilities: `git merge -X`, `git merge -s`, `git reset --hard`, `git checkout -f -B`. The final results are identical, except for `git merge -X`.
 In the case of git merge, we do not wish to resolve conflicts manually. Remote must always take precedence over local differences.
-### git merge -X theirs
+### `git merge -X theirs`
 This command applies an ort strategy which, in the event of a conflict, gives precedence to theirs.
-However, since we are working in --depth=1, the two branches have no common ancestor, and the --allow-unrelated-histories option must be supplied. The absence of a common ancestor prevents Git from recognizing similarities within the same file. Any modification to a tracked file on ours, even on a new line, will thus cause a conflict and be overwritten. This command does, however, save newly created and committed files on ours.
-Newly created uncommitted files are kept unless git clean is run.
+However, since we are working in `--depth=1`, the two branches have no common ancestor, and the `--allow-unrelated-histories` option must be supplied. The absence of a common ancestor prevents Git from recognizing similarities within the same file. Any modification to a tracked file on ours, even on a new line, will thus cause a conflict and be overwritten. This command does, however, save newly created and committed files on ours.
+Newly created uncommitted files are kept unless `git clean` is run.
 Advantage: committed files created on ours are saved.
 Disadvantage: in the event of deletion of a file on theirs that already existed on ours: it will not be deleted on ours.
-### git merge -s ours
-[caution: the notions of theirs and ours are reversed here, as git merge -s theirs does not exist].
-This command applies a ours strategy that gives prevalence to ours, whether there is a conflict or not. It will ignore all changes and file creations committed to theirs. It will also ignore uncommitted modifications. Uncommitted file creations are retained unless git clean is run. This is the same result as with the git reset --hard command.
-As the git merge -s theirs option does not exist, we need to do a little manipulation:
+### `git merge -s ours`
+[caution: the notions of theirs and ours are reversed here, as `git merge -s theirs` does not exist].
+This command applies a ours strategy that gives prevalence to ours, whether there is a conflict or not. It will ignore all changes and file creations committed to theirs. It will also ignore uncommitted modifications. Uncommitted file creations are retained unless git clean is run. This is the same result as with the `git reset --hard` command.
+As the `git merge -s theirs` option does not exist, we need to do a little manipulation:
+```
 #we want to merge origin/main on main, giving prevalence to origin/main
 #create a new temp temporary branch that we check out, sourced on origin/main
 git switch -c temp origin/main
@@ -119,30 +129,36 @@ git checkout main
 git merge --allow-unrelated-histories temp
 #delete temp
 git branch -D temp
+```
+
 Advantage:
 Disadvantage: creation of a temporary branch.
-### git checkout -force -B main origin/main
-This command is equivalent to git merge -s ours and git reset --hard, with the difference that you end up in detached HEAD state, which does nos cause any problem in our case since we do not want to push any changes from our repository.
+### `git checkout -force -B main origin/main`
+This command is equivalent to `git merge -s ours` and `git reset --hard`, with the difference that you end up in detached HEAD state, which does nos cause any problem in our case since we do not want to push any changes from our repository.
 Advantage :
 Disadvantage: detached HEAD state.
-### git reset --hard
-git reset --hard is equivalent to git merge -s ours and git checkout --force -B.
+### `git reset --hard`
+`git reset --hard` is equivalent to `git merge -s ours` and `git checkout --force -B`.
 Advantage:
 Disadvantage:
-Tests show that the most memory-efficient options are git checkout -force -B, git merge -s ours and git --reset hard, which all do the same thing. However, git reset --hard does not involve the creation of a temporary branch and does not end in detached HEAD state, hence it is the one we choose.
+Tests show that the most memory-efficient options are `git checkout -force -B`, `git merge -s ours` and `git --reset hard`, which all do the same thing. However, `git reset --hard` does not involve the creation of a temporary branch and does not end in detached HEAD state, hence it is the one we choose.
 ### Submodule management
-Submodules are initially cloned using git clone --recurse-submodules --remote-submodules.
-They are updated using git submodule update --init --recursive --force --depth=1 –remote.
-Git reset --hard must be supplied with the --recurse-submodules option in order to delete submodules from the working directory.
-The same rules apply to submodules as to the rest of the repository. In the .gitmodules file, it is possible to specify rules for importing submodules, such as a certain tag or branch. By removing --remote-submodules from git clone and --remote from git submodule update, submodules will be identical to the repository being cloned and no longer to the original submodule repository.
+Submodules are initially cloned using `git clone --recurse-submodules --remote-submodules`.
+They are updated using `git submodule update --init --recursive --force --depth=1 –remote`.
+`git reset --hard` must be supplied with the `--recurse-submodules` option in order to delete submodules from the working directory.
+The same rules apply to submodules as to the rest of the repository. In the .gitmodules file, it is possible to specify rules for importing submodules, such as a certain tag or branch. By removing `--remote-submodules` from `git clone` and `--remote` from `git submodule update`, submodules will be identical to the repository being cloned and no longer to the original submodule repository.
+
+## Tests
-##Tests
 ### Script description
+
 ### README extract
+
+```
 The script consists of twenty-nine tests (listed in the results below), based on three functions: generate_random_file, get_storage_used and get_bandwidth. generate_random_file uses the bash command dd and /dev/random.
@@ -151,7 +167,11 @@ get_bandwidth retrieves the output of Git commands and extracts the traffic disp
 The first five tests concern cloning. The following tests involve updating the repository using different commands, with three cases for each command: after adding a file, after deleting a file, after adding then deleting a file.
+```
+
 ### Help extract
+
+```
 NAME
 performance_tests.sh
 SYNOPSIS
@@ -161,6 +181,7 @@ OPTIONS
 -n number executes test number.
 - c cleans.
 -h prints the help.
+```
 ### Results