A mining tool to collect merge scenarios from Git repositories. In three-way merging, each merge scenario contains the two versions to be merged (called ours and theirs respectively), and their nearest common ancestor in the commit history (called base).
How to tell which version is ours and which is theirs?
See the following example:
$ git branch
develop
* master
$ git merge develop
$ git log
commit 5aa63defd7d552544348deaad88a22d212c43038 (HEAD -> master)
Merge: 011eeae bdd631b
Author: Symbolk <symbolk@163.com>
Date: Sat Jul 6 17:12:25 2019 +0800
Merge branch 'develop'
The current branch master is called ours; the merge commit is created on it. The branch develop named in the git merge command is called theirs, and it is not affected by the merge.
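Git records this distinction in the parent order of the merge commit: the first parent is ours and the second is theirs (the "Merge: 011eeae bdd631b" line above lists them in that order), and the base can be obtained with git merge-base. A minimal sketch of recovering a merge scenario this way, assuming git is on the PATH; the names MergeScenario, parse_parents, git and merge_scenario are illustrative and are not part of this tool:

```python
import subprocess
from typing import NamedTuple


class MergeScenario(NamedTuple):
    merge: str
    ours: str
    theirs: str


def parse_parents(rev_list_line: str) -> MergeScenario:
    """Parse one line of `git rev-list --parents -n 1 <commit>`.

    The line has the form "<commit> <parent1> <parent2>"; for a two-parent
    merge, parent1 is ours and parent2 is theirs.
    """
    ids = rev_list_line.split()
    if len(ids) != 3:
        raise ValueError("not a two-parent merge commit: %r" % rev_list_line)
    return MergeScenario(*ids)


def git(repo_dir: str, *args: str) -> str:
    """Run a git command inside repo_dir and return its trimmed stdout."""
    result = subprocess.run(
        ["git", "-C", repo_dir, *args],
        check=True, capture_output=True, text=True,
    )
    return result.stdout.strip()


def merge_scenario(repo_dir: str, merge_commit: str) -> dict:
    """Return the merge, ours, theirs and base commit ids of a merge commit."""
    scenario = parse_parents(
        git(repo_dir, "rev-list", "--parents", "-n", "1", merge_commit)
    )
    base = git(repo_dir, "merge-base", scenario.ours, scenario.theirs)
    return {**scenario._asdict(), "base": base}
```

Only merge_scenario touches a repository; parse_parents is a pure function, so the parent-order rule can be checked without running git.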
- Windows/Linux/macOS
- Python 3.7
- Git 2.18.0
- PyCharm
- Open the cloned repository as a project in PyCharm;
- Under the root directory of the cloned repository, run the following command in the terminal:
pip install -r requirements.txt
Collect Java files involved in merge scenarios that contain merge conflict(s) from the whole commit history.
A Git repository with the name of the target branch (usually master).
Edit main.py to set the necessary variables, then run it:
# os, home and GitService are assumed to be imported/defined earlier in main.py
if __name__ == "__main__":
    repo_name = "cassandra"
    # name of the default branch of the repository
    branch_name = "trunk"
    # if the repo is not present in repo_dir it will be cloned, but it is better to clone it in advance
    git_url = "https://github.com/apache/cassandra"
    repo_dir = os.path.join(home, "coding/data/repos", repo_name)
    result_dir = os.path.join(home, "coding/data/merges", repo_name)
    # Usage1: Collect Java files involved in merge scenarios that contain merge conflict(s) from the whole commit history
    statistic_path = os.path.join(result_dir, "statistics.csv")
    git_service = GitService(repo_name, git_url, repo_dir, branch_name, result_dir)
    git_service.collect_from_repo(statistic_path)
During mining, a brief summary of each merge scenario that contains merge conflict(s) is printed in the Run console of PyCharm:
Cloning into 'D:\github\rep\javaparser'...
POST git-upload-pack (gzip 7425 to 3775 bytes)
remote: Enumerating objects: 47, done.
remote: Counting objects: 100% (47/47), done.
remote: Compressing objects: 100% (21/21), done.
remote: Total 92708 (delta 10), reused 43 (delta 10), pack-reused 92661
Receiving objects: 100% (92708/92708), 21.11 MiB | 830.00 KiB/s, done.
Resolving deltas: 100% (49120/49120), done.
Checking out files: 100% (2028/2028), done.
Ready to process repo: javaparser at branch: master
Commit: e6063bb10d6d41cb2b258540bb47edbd18b4646b, #Unmerged_blobs: 4, #Conflict java files: 2, #Conflict blocks: 2
Commit: 0258e273bfd2dca550a27d3204cf22227a41e772, #Unmerged_blobs: 5, #Conflict java files: 4, #Conflict blocks: 4
Commit: 25c4bbf796034c987e7517d4d3c596026a0142e3, #Unmerged_blobs: 16, #Conflict java files: 2, #Conflict blocks: 3
...
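If the console output is captured to a string or a file, the summary lines above can be turned into structured records. A small sketch, assuming the line format is exactly as printed above; the names SUMMARY_PATTERN and parse_summary are hypothetical and not part of the tool:

```python
import re

# Matches lines such as:
# Commit: <sha>, #Unmerged_blobs: 4, #Conflict java files: 2, #Conflict blocks: 2
SUMMARY_PATTERN = re.compile(
    r"Commit: (?P<commit>[0-9a-f]+), "
    r"#Unmerged_blobs: (?P<unmerged_blobs>\d+), "
    r"#Conflict java files: (?P<conflict_files>\d+), "
    r"#Conflict blocks: (?P<conflict_blocks>\d+)"
)


def parse_summary(console_output: str) -> list:
    """Return one dict per summary line; all other lines are ignored."""
    rows = []
    for line in console_output.splitlines():
        match = SUMMARY_PATTERN.search(line)
        if match is None:
            continue  # clone progress, "Ready to process repo", etc.
        row = match.groupdict()
        for key in ("unmerged_blobs", "conflict_files", "conflict_blocks"):
            row[key] = int(row[key])
        rows.append(row)
    return rows
```

Lines that do not match the pattern, such as the clone progress messages, are skipped rather than treated as errors.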
The collected data is saved in result_dir, which contains:
- Sub-folders named after merge commit ids, each containing the conflicting Java files of that merge scenario.
- A csv file that provides a statistical summary of each merge scenario, consisting of 4 commit ids (merge commit, HEAD of the ours/theirs branches, base commit), the number of conflicting Java files and their paths, and the number of conflict blocks.
In the column #conflict blocks, each number denotes the number of conflict blocks in the corresponding conflicting Java file. For example, in the first row there are 5 conflicting Java files, and the first file, javaparser-core-serialization/src/main/java/com/github/javaparser/serialization/JavaParserJsonSerializer.java, has 1 conflict block.
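To make the notion of a conflict block concrete: in a file that git left in a conflicted state, each block is delimited by a line starting with <<<<<<< and a later line starting with >>>>>>>. The sketch below counts such blocks in a text; it assumes git's standard markers, and whether the tool counts blocks in exactly this way is an assumption. The function name count_conflict_blocks is illustrative.

```python
def count_conflict_blocks(text: str) -> int:
    """Count conflict blocks delimited by git's standard conflict markers."""
    blocks = 0
    inside = False
    for line in text.splitlines():
        if line.startswith("<<<<<<<"):
            inside = True  # start of a block (the ours side follows)
        elif line.startswith(">>>>>>>") and inside:
            blocks += 1  # end of the theirs side closes the block
            inside = False
    return blocks
```

A block that is opened but never closed is not counted, so a truncated file does not inflate the number.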
Collect Java files involved in merge scenarios that contain refactoring-related merge conflict(s), using the csv file generated by https://github.com/Symbolk/RefConfMiner.git (Python scripts that analyze the MySQL data generated by https://github.com/ualberta-smr/RefactoringsInMergeCommits).
- A Git repository with the name of the target branch (usually the main branch, like master).
- The csv file that records refactoring-related merge commit ids, generated from the tool RefactoringsInMergeCommits (https://github.com/ualberta-smr/RefactoringsInMergeCommits).
Edit main.py to set the necessary variables, then run it:
# os, home and GitService are assumed to be imported/defined earlier in main.py
if __name__ == "__main__":
    repo_name = "cassandra"
    # name of the default branch of the repository
    branch_name = "trunk"
    # if the repo is not present in repo_dir it will be cloned, but it is better to clone it in advance
    git_url = "https://github.com/apache/cassandra"
    repo_dir = os.path.join(home, "coding/data/repos", repo_name)
    # Usage2: Collect Java files involved in merge scenarios that contain refactoring-related merge conflict(s)
    # from the csv file generated by https://github.com/Symbolk/RefConfMiner.git
    result_dir = os.path.join(home, "coding/data/ref_conflicts", repo_name)
    # csv_file = "merge_scenarios_involved_refactorings_test.csv"
    csv_file = os.path.join(home, "coding/data/merge_scenarios_involved_refactorings", repo_name + ".csv")
    git_service = GitService(repo_name, git_url, repo_dir, branch_name, result_dir)
    git_service.collect_from_csv(csv_file)
The output is essentially the same as in Usage 1.
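Because both usages write one sub-folder per merge commit into result_dir, the collected files can be enumerated in the same way afterwards. A minimal sketch, assuming only that layout (the internal structure of each sub-folder is not assumed, so Java files are searched recursively); the function name list_collected_files is hypothetical:

```python
from pathlib import Path


def list_collected_files(result_dir: str) -> dict:
    """Map each merge commit sub-folder to the Java files found beneath it."""
    collected = {}
    for sub_folder in sorted(Path(result_dir).iterdir()):
        if not sub_folder.is_dir():
            continue  # skip the csv summary and any other top-level files
        java_files = sorted(
            str(path.relative_to(sub_folder))
            for path in sub_folder.rglob("*.java")
        )
        collected[sub_folder.name] = java_files
    return collected
```

The returned paths are relative to each merge commit's sub-folder and use the separator of the operating system the script runs on.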