MergeScenarioMiner

A mining tool to collect merge scenarios from Git repositories. In three-way merging, each merge scenario contains the two versions to be merged (called ours and theirs respectively), and their nearest common ancestor in the commit history (called base).

How to tell which version is ours and which is theirs?

See the following example:

$ git branch
  develop
* master
$ git merge develop
$ git log
commit 5aa63defd7d552544348deaad88a22d212c43038 (HEAD -> master)
Merge: 011eeae bdd631b
Author: Symbolk <symbolk@163.com>
Date:   Sat Jul 6 17:12:25 2019 +0800

    Merge branch 'develop'

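The current branch master is called ours; the merge commit is created on it. The branch develop named in the git merge command is called theirs and is not changed by the merge. In the Merge: line of the log, the first abbreviated hash (011eeae) is the ours parent and the second (bdd631b) is the theirs parent.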
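
The ours/theirs distinction can also be read programmatically. The following is a minimal sketch, not part of MergeScenarioMiner itself: the helper names parse_merge_parents and find_base are hypothetical, and it assumes a two-parent merge (octopus merges list more parents). It reads the parent order from the Merge: line of git log and asks git merge-base for the base commit.

```python
import re
import subprocess
from typing import Optional, Tuple

# Matches the "Merge: <ours> <theirs>" line that `git log` prints for merge commits.
MERGE_LINE = re.compile(r"^Merge:[ \t]+([0-9a-f]+)[ \t]+([0-9a-f]+)[ \t]*$", re.MULTILINE)


def parse_merge_parents(log_text: str) -> Optional[Tuple[str, str]]:
    """Return (ours, theirs) from the first 'Merge:' line, or None if there is none."""
    match = MERGE_LINE.search(log_text)
    if match is None:
        return None
    # Git lists the first parent (the branch that was checked out) first.
    return match.group(1), match.group(2)


def find_base(repo_dir: str, ours: str, theirs: str) -> str:
    """Ask Git for a best common ancestor of the two parents (hypothetical helper)."""
    result = subprocess.run(
        ["git", "merge-base", ours, theirs],
        cwd=repo_dir, check=True, capture_output=True, text=True,
    )
    return result.stdout.strip()


log_text = """commit 5aa63defd7d552544348deaad88a22d212c43038 (HEAD -> master)
Merge: 011eeae bdd631b
Author: Symbolk <symbolk@163.com>
Date:   Sat Jul 6 17:12:25 2019 +0800
"""

ours, theirs = parse_merge_parents(log_text)
print(ours, theirs)  # prints "011eeae bdd631b"
```

Calling find_base(repo_dir, ours, theirs) inside a local clone would then return the base commit id; it is left uncalled here because it needs a real repository.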

Getting Started

Requirements

Windows/Linux/macOS
Python 3.7
Git 2.18.0
PyCharm

Installation

Open the cloned repository as a project in PyCharm;
Under the root directory of the cloned repository, run the following command in the terminal:

pip install -r requirements.txt

Usage

Usage 1: Collect all merge scenarios with merge conflict(s)

Collect Java files involved in merge scenarios that contain merge conflict(s) from the whole commit history.

Input

A Git repository with the name of the target branch (usually master).

Edit main.py to set the necessary variables, then run it:

if __name__ == "__main__":
    repo_name = "cassandra"
    # get the default branch
    branch_name = "trunk"
    # if the repo is not present in the repo_dir, it will be cloned, but better to clone in advance
    git_url = "https://github.com/apache/cassandra"
    repo_dir = os.path.join(home, "coding/data/repos", repo_name)
    result_dir = os.path.join(home, "coding/data/merges", repo_name)

    # Usage1: Collect Java files involved in merge scenarios that contain merge conflict(s) from the whole commit history
    statistic_path = os.path.join(result_dir, "statistics.csv")
    git_service = GitService(repo_name, git_url, repo_dir, branch_name, result_dir)
    git_service.collect_from_repo(statistic_path)
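
The excerpt above uses os and a variable named home without showing where they come from. A minimal sketch of the path setup, assuming home is the current user's home directory (an assumption; check main.py in the repository for the actual definition):

```python
import os

# Assumption: `home` is the current user's home directory.
home = os.path.expanduser("~")

repo_name = "cassandra"
# Where the repository is (or will be) cloned.
repo_dir = os.path.join(home, "coding/data/repos", repo_name)
# Where the collected merge scenarios are written.
result_dir = os.path.join(home, "coding/data/merges", repo_name)
# The statistical summary lives inside result_dir.
statistic_path = os.path.join(result_dir, "statistics.csv")

print(repo_dir)
print(statistic_path)
```

Building the paths with os.path.join keeps the script portable across the supported platforms.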

Output

During the mining process, a brief summary of each merge scenario with merge conflict(s) is printed in the Run console of PyCharm:

Cloning into 'D:\github\rep\javaparser'...
POST git-upload-pack (gzip 7425 to 3775 bytes)
remote: Enumerating objects: 47, done.        
remote: Counting objects: 100% (47/47), done.        
remote: Compressing objects: 100% (21/21), done.        
remote: Total 92708 (delta 10), reused 43 (delta 10), pack-reused 92661        
Receiving objects: 100% (92708/92708), 21.11 MiB | 830.00 KiB/s, done.
Resolving deltas: 100% (49120/49120), done.
Checking out files: 100% (2028/2028), done.
Ready to process repo: javaparser at branch: master
Commit: e6063bb10d6d41cb2b258540bb47edbd18b4646b, #Unmerged_blobs: 4, #Conflict java files: 2, #Conflict blocks: 2
Commit: 0258e273bfd2dca550a27d3204cf22227a41e772, #Unmerged_blobs: 5, #Conflict java files: 4, #Conflict blocks: 4
Commit: 25c4bbf796034c987e7517d4d3c596026a0142e3, #Unmerged_blobs: 16, #Conflict java files: 2, #Conflict blocks: 3
...
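
If the console output is saved to a file, the per-commit summary lines can be turned into structured records. This is a hedged sketch rather than a feature of the tool: parse_summary is a hypothetical helper, and the pattern assumes the exact line layout shown in the sample output above.

```python
import re
from typing import Dict, List, Union

# Pattern for lines such as:
# Commit: <sha>, #Unmerged_blobs: 4, #Conflict java files: 2, #Conflict blocks: 2
SUMMARY_LINE = re.compile(
    r"^Commit: (?P<commit>[0-9a-f]+), "
    r"#Unmerged_blobs: (?P<unmerged_blobs>\d+), "
    r"#Conflict java files: (?P<conflict_files>\d+), "
    r"#Conflict blocks: (?P<conflict_blocks>\d+)$"
)


def parse_summary(console_text: str) -> List[Dict[str, Union[str, int]]]:
    """Extract one record per summary line; all other lines are skipped."""
    rows = []
    for line in console_text.splitlines():
        match = SUMMARY_LINE.match(line.strip())
        if match is None:
            continue
        row = {"commit": match.group("commit")}
        for key in ("unmerged_blobs", "conflict_files", "conflict_blocks"):
            row[key] = int(match.group(key))
        rows.append(row)
    return rows


console_text = """Ready to process repo: javaparser at branch: master
Commit: e6063bb10d6d41cb2b258540bb47edbd18b4646b, #Unmerged_blobs: 4, #Conflict java files: 2, #Conflict blocks: 2
Commit: 0258e273bfd2dca550a27d3204cf22227a41e772, #Unmerged_blobs: 5, #Conflict java files: 4, #Conflict blocks: 4
Commit: 25c4bbf796034c987e7517d4d3c596026a0142e3, #Unmerged_blobs: 16, #Conflict java files: 2, #Conflict blocks: 3
"""

rows = parse_summary(console_text)
print(len(rows))                                  # prints 3
print(sum(row["conflict_blocks"] for row in rows))  # prints 9
```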

Collected data will be saved in the result_dir, which contains:

Sub-folders named after merge commit ids, each containing the conflicting Java files of that merge scenario.
A csv file that provides a statistical summary of each merge scenario, which consists of 4 commit ids (merge commit, HEAD of the ours and theirs branches, base commit), the number of conflicting Java files and their paths, and the number of conflict blocks.

In the column #conflict blocks, the numbers denote the number of conflict blocks in each conflicting Java file. For example, in the first row there are 5 conflicting Java files, and the first file, javaparser-core-serialization/src/main/java/com/github/javaparser/serialization/JavaParserJsonSerializer.java, contains 1 conflict block.
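
To make the notion of a conflict block concrete, here is a minimal sketch that counts them in a file's text. It assumes the standard Git conflict markers (<<<<<<< to open, >>>>>>> to close) and is not necessarily how MergeScenarioMiner counts them; count_conflict_blocks is a hypothetical name.

```python
def count_conflict_blocks(text: str) -> int:
    """Count regions delimited by '<<<<<<<' and '>>>>>>>' marker lines."""
    blocks = 0
    in_block = False
    for line in text.splitlines():
        if line.startswith("<<<<<<<"):
            in_block = True
        elif line.startswith(">>>>>>>") and in_block:
            # A block only counts once its closing marker has been seen.
            blocks += 1
            in_block = False
    return blocks


conflicting_java = """public class Greeter {
<<<<<<< HEAD
    String greet() { return "hello"; }
=======
    String greet() { return "hi"; }
>>>>>>> develop
    int count() {
<<<<<<< HEAD
        return 1;
=======
        return 2;
>>>>>>> develop
    }
}
"""

print(count_conflict_blocks(conflicting_java))  # prints 2
```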

Usage 2: Collect only merge scenarios with refactoring-related conflict(s)

Collect Java files involved in merge scenarios that contain refactoring-related merge conflict(s), using the csv file generated by https://github.com/Symbolk/RefConfMiner.git (Python scripts that analyze the MySQL data generated by https://github.com/ualberta-smr/RefactoringsInMergeCommits).

Input

A Git repository with the name of the target branch (usually the main branch, like master).
The csv file that records refactoring-related merge commit ids, generated from the tool RefactoringsInMergeCommits (https://github.com/ualberta-smr/RefactoringsInMergeCommits).

Edit main.py to set the necessary variables, then run it:

if __name__ == "__main__":
    repo_name = "cassandra"
    # get the default branch
    branch_name = "trunk"
    # if the repo is not present in the repo_dir, it will be cloned, but better to clone in advance
    git_url = "https://github.com/apache/cassandra"
    repo_dir = os.path.join(home, "coding/data/repos", repo_name)
    result_dir = os.path.join(home, "coding/data/merges", repo_name)

    # Usage2: Collect Java files involved in merge scenarios that contain refactoring-related merge conflict(s) 
    # from the csv file generated by https://github.com/Symbolk/RefConfMiner.git
    result_dir = os.path.join(home, "coding/data/ref_conflicts", repo_name)
    # csv_file = "merge_scenarios_involved_refactorings_test.csv"
    csv_file = os.path.join(home, "coding/data/merge_scenarios_involved_refactorings", repo_name + ".csv")
    # git_url is passed as in Usage 1 so that a missing repo can be cloned
    git_service = GitService(repo_name, git_url, repo_dir, branch_name, result_dir)
    git_service.collect_from_csv(csv_file)

Output

The output is essentially the same as in Usage 1.
