SAGA

SAGA is a large-scale code clone detection tool. The name comes from "Suffix-Array based clone detection with GPU Acceleration". SAGA can detect Type-1/2/3 clones in 100 million lines of code within 11 minutes, with precision and recall comparable to those of other state-of-the-art tools.

How to use SAGA

Requirements

Software

Supported OS: Linux/Windows/Mac. Ensure you use the corresponding executable file for your operating system.
Environment:
- Git
- JDK 1.8 or higher
- NVCC (recommended version: release 12.4, V12.4.99)
- Maven (recommended version: 3.6.3)

We have tested SAGA with JDK 1.8.0_181, NVCC V12.4.99 and Maven 3.6.3 on Ubuntu 20.04.6 LTS.
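
Before building, it can help to confirm that the required tools are on the PATH. The following is a minimal sketch, assuming the usual command names `git`, `java`, `nvcc` and `mvn`; it only checks presence and does not verify versions.

```python
import shutil

# Command names are assumptions: the usual executables for Git, the JDK,
# NVCC and Maven. Adjust them if your installation uses different names.
REQUIRED_TOOLS = ["git", "java", "nvcc", "mvn"]


def find_tools(names=REQUIRED_TOOLS):
    """Map each tool name to its resolved path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}


def missing_tools(names=REQUIRED_TOOLS):
    """Return the tool names that could not be found on PATH."""
    return [name for name, path in find_tools(names).items() if path is None]
```

Calling `missing_tools()` returns an empty list when all four commands are available.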

Hardware

An Nvidia GPU with at least 4 GB of graphics memory is recommended when using the GPU acceleration feature.

Running SAGA

If you encounter issues with path configurations, use absolute paths for testing.

Get the source code of SAGA

git clone git@github.com:FudanSELab/SAGACloneDetector.git

Compile the detection core

cd /path/to/SAGACloneDetector
nvcc -o executable/sa_gpu scripts/suffix-construct.cu --expt-extended-lambda

This generates the executable file sa_gpu in the SAGACloneDetector/executable directory (the path given to -o); the source file stays in scripts. Part of the directory structure should appear as follows:

executable
└── sa_gpu
scripts
└── suffix-construct.cu

Note that this core executable uses the GPU for code clone detection, so an Nvidia GPU is required.

Package `SAGACloneDetector.jar`

cd /path/to/SAGACloneDetector
mvn package -Dmaven.test.skip=true
mv target/SAGACloneDetector.jar .

Generate config.properties

cd /path/to/SAGACloneDetector
# This will generate config.properties in the SAGACloneDetector directory.
java -jar SAGACloneDetector.jar testcase/code/java

❗ The config.properties file must be in the working directory.

# config.properties
process-build=1
sep-num=100000000
min-line=2
mlcc=20
language=java
threshold=0.7
use-long-type=0
granularity=method
extensions=java
process-parse=1
tokenize=1
thread-num=8
open-string-hash=1
mlc=50
exe=
process-tokenize=1

The parameters have the following meanings:

#process-build: Whether to run the build process (0 for off, 1 for on)
#sep-num: The separator number of a token piece
#min-line: The minimum number of lines in a method
#mlcc: The minimum number of tokens in a snippet
#language: The language of the source files (java, c, cpp, py, js, go, common)
#threshold: The similarity threshold for clone detection (0 to 1)
#use-long-type: The data type used to store intermediate data (0 for int, 1 for long)
#granularity: The detection granularity (file, method, or snippet)
#extensions: The comma-separated file suffixes
#process-parse: Whether to run the parse process (0 for off, 1 for on)
#tokenize: Whether to tokenize source files (0 for off, 1 for on)
#thread-num: The number of threads for parallel suffix array construction
#open-string-hash: Whether to enable string hashing during tokenization (0 for off, 1 for on)
#mlc: The minimum number of tokens in a method
#exe: The path to the executable file
#process-tokenize: Whether to run the tokenize process (0 for off, 1 for on)
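
The value ranges listed above can be checked before a run. The sketch below is not part of SAGA: the function names `parse_properties` and `check_config` are hypothetical, and the parser only handles plain `key=value` lines like those shown above, not the full Java properties syntax.

```python
# Keys documented above as 0/1 switches.
FLAG_KEYS = ["process-build", "process-parse", "process-tokenize",
             "tokenize", "open-string-hash", "use-long-type"]
LANGUAGES = {"java", "c", "cpp", "py", "js", "go", "common"}
GRANULARITIES = {"file", "method", "snippet"}


def parse_properties(text):
    """Parse simple key=value lines, ignoring blank lines and '#' comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        props[key.strip()] = value.strip()
    return props


def check_config(props):
    """Return a list of problems found; an empty list means none were found."""
    problems = []
    for key in FLAG_KEYS:
        if key in props and props[key] not in ("0", "1"):
            problems.append(f"{key} must be 0 or 1, got {props[key]!r}")
    if "threshold" in props:
        try:
            in_range = 0.0 <= float(props["threshold"]) <= 1.0
        except ValueError:
            in_range = False
        if not in_range:
            problems.append("threshold must be a number between 0 and 1")
    if "language" in props and props["language"] not in LANGUAGES:
        problems.append(f"unknown language {props['language']!r}")
    if "granularity" in props and props["granularity"] not in GRANULARITIES:
        problems.append(f"unknown granularity {props['granularity']!r}")
    if not props.get("exe"):
        problems.append("exe is empty or missing; set it to the compiled executable")
    return problems
```

For example, `check_config(parse_properties(open("config.properties").read()))` on a freshly generated file reports only the empty `exe` entry, which the next step fills in.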

Update the `exe` Parameter in `config.properties`

Set exe to the path of the executable built in the "Compile the detection core" step:

# config.properties
...
exe=executable/sa_gpu
...

Run SAGA

cd /path/to/SAGACloneDetector
java -jar SAGACloneDetector.jar /path/to/DetectedRepoDirectory

Check the outputs (description based on the default paths).

logs/: Stores log files, which are automatically backed up by date.
tokenData/: Intermediate files, with no direct value.
result/: Directory for detection result files.
- files.txt: Contains paths of the detected files, filtered based on the configured extensions.
- MeasureIndex.csv (see details below)
- Method-level detection result files (see details below):
  - type123_method_result.csv
  - type123_method_group_result.csv
- Snippet-level detection result files (see details below):
  - type12_snippet_result.csv
  - type3_snippet_result.csv

MeasureIndex.csv

Stores information about all the detected methods.
Data format: MethodID, File Path, Start Line, End Line.
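
Because the other result files refer to methods only by ID, a lookup table built from MeasureIndex.csv is useful when reading them. The following is a minimal sketch, assuming the file is comma-separated with the four columns in the order listed above; `MethodInfo` and `load_measure_index` are hypothetical names, and rows whose line numbers are not integers (for example a header row, if one exists) are skipped.

```python
import csv
from typing import NamedTuple


class MethodInfo(NamedTuple):
    method_id: str
    file_path: str
    start_line: int
    end_line: int


def load_measure_index(lines):
    """Build {MethodID: MethodInfo} from an iterable of CSV lines."""
    index = {}
    for row in csv.reader(lines):
        if len(row) < 4:
            continue  # blank or malformed line
        method_id, file_path, start, end = (field.strip() for field in row[:4])
        if not (start.isdigit() and end.isdigit()):
            continue  # e.g. a header row, if the file has one
        index[method_id] = MethodInfo(method_id, file_path, int(start), int(end))
    return index
```

Usage: `with open("result/MeasureIndex.csv", newline="") as f: index = load_measure_index(f)`.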

Method-Level Detection Result Files

- type123_method_result.csv: Contains the detected clone pairs.
  - Data format: Method1_ID, Method2_ID, Similarity.
- type123_method_group_result.csv: Contains the clone groups, which are merged from the clone pairs.
  - Data format: Method1_ID, Method2_ID, Method3_ID, ....
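
To illustrate how clone pairs can be merged into groups, the sketch below treats pairs as edges and groups as connected components, using a union-find structure. This is an assumption about what "merged" means, not a description of SAGA's actual grouping code; `read_pairs` and `group_pairs` are hypothetical helpers.

```python
import csv


def read_pairs(lines):
    """Yield (Method1_ID, Method2_ID, Similarity) from CSV lines, skipping bad rows."""
    for row in csv.reader(lines):
        if len(row) < 3:
            continue
        try:
            yield row[0].strip(), row[1].strip(), float(row[2])
        except ValueError:
            continue  # e.g. a header row, if the file has one


def find(parent, x):
    """Return the representative of x's set, compressing the path on the way."""
    parent.setdefault(x, x)
    root = x
    while parent[root] != root:
        root = parent[root]
    while parent[x] != root:
        parent[x], x = root, parent[x]
    return root


def group_pairs(pairs, threshold=0.0):
    """Merge pairs with similarity >= threshold into sorted groups of method IDs."""
    parent = {}
    for m1, m2, similarity in pairs:
        if similarity >= threshold:
            parent[find(parent, m1)] = find(parent, m2)
    groups = {}
    for method_id in parent:
        groups.setdefault(find(parent, method_id), []).append(method_id)
    return sorted(sorted(group) for group in groups.values())
```

With pairs (1, 2), (2, 3) and (4, 5), this yields the groups [1, 2, 3] and [4, 5]. Note that transitive merging can place two methods in the same group even if they were never reported as a pair.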

Snippet-Level Detection Result Files

- type12_snippet_result.csv: Contains Type-1 and Type-2 clone detection results (for the definition of clone types, see the notes below).
- type3_snippet_result.csv: Contains Type-3 clone detection results.
- The data format for both files is identical:
  CloneGroupID, MethodID, File Path, Method Start Line, Method End Line, Snippet Start Position in Method Token Sequence, Snippet End Position in Method Token Sequence, Snippet Start Line, Snippet End Line.
- Snippet-level detection results are presented as clone pairs. Because snippets lack clear boundaries, combining them into clone groups is not recommended. If you need groups, you can combine the pairs manually.
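
The nine-column snippet format can be read into typed records and collected by CloneGroupID. This is a sketch under assumptions: the files are taken to be comma-separated with the columns in the order listed above, file paths are assumed to contain no unquoted commas, and `SnippetRow` and `read_snippets` are hypothetical names.

```python
import csv
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class SnippetRow:
    clone_group_id: str
    method_id: str
    file_path: str
    method_start_line: int
    method_end_line: int
    token_start: int   # snippet start position in the method token sequence
    token_end: int     # snippet end position in the method token sequence
    snippet_start_line: int
    snippet_end_line: int


def read_snippets(lines):
    """Return {CloneGroupID: [SnippetRow, ...]} from an iterable of CSV lines."""
    groups = defaultdict(list)
    for row in csv.reader(lines):
        if len(row) < 9:
            continue  # blank or malformed line
        fields = [field.strip() for field in row[:9]]
        try:
            numbers = [int(value) for value in fields[3:]]
        except ValueError:
            continue  # e.g. a header row, if the file has one
        snippet = SnippetRow(fields[0], fields[1], fields[2], *numbers)
        groups[snippet.clone_group_id].append(snippet)
    return dict(groups)
```

Usage: `with open("result/type3_snippet_result.csv", newline="") as f: pairs = read_snippets(f)`. Each value then holds the rows that share one CloneGroupID.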

Clone Type Definitions

Type-1: Identical code fragments, differing only in whitespace, layout, and comments.
Type-2: Syntactically identical fragments, differing in identifiers, literals, types, whitespace, layout, and comments.
Type-3: Copied fragments with additional modifications, such as added, changed, or removed statements, along with differences in identifiers, literals, types, whitespace, layout, and comments.
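
The three definitions can be illustrated with a toy classifier. This is a sketch on Python snippets using the standard `tokenize` module, not SAGA's suffix-array algorithm; the `ID`/`LIT` placeholders, the `difflib` similarity measure, and the 0.7 cutoff (borrowed from the default `threshold` value above) are assumptions made for illustration only.

```python
import difflib
import io
import keyword
import tokenize

# Token kinds that only reflect comments, whitespace or layout.
SKIP = {tokenize.COMMENT, tokenize.NL, tokenize.NEWLINE,
        tokenize.INDENT, tokenize.DEDENT, tokenize.ENDMARKER}


def tokens(source, abstract=False):
    """Tokenize Python source; if abstract, replace identifiers and literals."""
    result = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type in SKIP:
            continue
        if abstract and tok.type == tokenize.NAME and not keyword.iskeyword(tok.string):
            result.append("ID")
        elif abstract and tok.type in (tokenize.NUMBER, tokenize.STRING):
            result.append("LIT")
        else:
            result.append(tok.string)
    return result


def classify(a, b, threshold=0.7):
    """Label a pair of snippets as Type-1, Type-2, Type-3 or not a clone."""
    if tokens(a) == tokens(b):
        return "Type-1"  # differs only in whitespace, layout and comments
    abstract_a, abstract_b = tokens(a, True), tokens(b, True)
    if abstract_a == abstract_b:
        return "Type-2"  # also differs in identifiers and literals
    ratio = difflib.SequenceMatcher(None, abstract_a, abstract_b).ratio()
    return "Type-3" if ratio >= threshold else "not a clone"


ORIGINAL = "def total(xs):\n    s = 0\n    for x in xs:\n        s += x\n    return s\n"
COMMENTED = "def total(xs):\n    # sum\n    s = 0\n    for x in xs:\n        s += x\n    return s\n"
RENAMED = "def add_all(items):\n    acc = 1\n    for it in items:\n        acc += it\n    return acc\n"
EXTENDED = ("def add_all(items):\n    acc = 1\n    for it in items:\n"
            "        print(it)\n        acc += it\n    return acc\n")
```

Here `classify(ORIGINAL, COMMENTED)` gives Type-1, `classify(ORIGINAL, RENAMED)` gives Type-2, and `classify(ORIGINAL, EXTENDED)` gives Type-3, because the added statement breaks exact token equality while most of the abstracted token sequence still matches.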

Reproducing the results presented in the paper

Clone BigCloneEval

git clone git@github.com:jeffsvajlenko/BigCloneEval.git

Refer to the BigCloneEval documentation for steps such as initializing the database and registering tools.

Note: Before Step 4 of the BigCloneEval setup, modify the code in src/cloneMatchingAlgorithms/CoverageMatcher.java around line 100.

The original code is:

		if(tolerence != null) {
			stmt.setInt(12, f1.getStartline() - tolerence);
			stmt.setInt(13, f1.getEndline() + tolerence);
			stmt.setInt(14, f2.getStartline() - tolerence);
			stmt.setInt(15, f2.getEndline() + tolerence);
		} else if(dtolerence != null) {
			stmt.setInt(12, f1.getEndline());
			stmt.setInt(13, f1.getStartline());
			stmt.setInt(14, f2.getEndline());
			stmt.setInt(15, f2.getStartline());
		}

It should be changed to (each parameter index is reduced by one):

		if(tolerence != null) {
			stmt.setInt(11, f1.getStartline() - tolerence);
			stmt.setInt(12, f1.getEndline() + tolerence);
			stmt.setInt(13, f2.getStartline() - tolerence);
			stmt.setInt(14, f2.getEndline() + tolerence);
		} else if(dtolerence != null) {
			stmt.setInt(11, f1.getEndline());
			stmt.setInt(12, f1.getStartline());
			stmt.setInt(13, f2.getEndline());
			stmt.setInt(14, f2.getStartline());
		}

Copy Scripts to BigCloneEval

cp /path/to/SAGACloneDetector/scripts/import /path/to/BigCloneEval/commands

Detect Clone Data

# scan_dir: directory of the repositories to be scanned
# base_dir: directory where SAGACloneDetector.jar is located
cd /path/to/SAGACloneDetector/scripts
python detect_merge.py --scan_dir=/path/to/BigCloneEval/ijadataset/bcb_reduced --base_dir=/path/to/SAGACloneDetector

Import Clone Data

cd /path/to/BigCloneEval/commands
# Ensure that the directories Result_2, Result_3 ... exist in /path/to/SAGACloneDetector/verify_result.
./import <YourToolID> /path/to/SAGACloneDetector/verify_result

Export Report

Refer to the original BigCloneEval repository for parameter details.

./evaluateTool -t <YourToolID> -o <ReportPath> --st BOTH -m "CoverageMatcher 0.7 line 4" --mit 50 --mip 6

Verify data

less <ReportPath>

About

This repository is maintained by the CodeWisdom Team at Fudan University.

You may report bugs by submitting an issue to the GitHub repository or sending an email to ([email protected]).
