SAGA is a large-scale code clone detection tool. The name comes from "Suffix-Array based clone detection with GPU Acceleration". SAGA is able to detect Type-1/2/3 clones in 100 million lines of code within 11 minutes, with comparable precision and recall to other state-of-the-art tools.
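To give a feel for why a suffix array suits clone detection, here is a toy sketch of the general idea. It is not SAGA's implementation, which builds the array with GPU acceleration over far larger inputs. Sorting all suffixes of a token sequence places repeated runs next to each other, so the longest common prefix of neighbouring suffixes reveals duplicated token runs. The function name and the naive sort are assumptions made for illustration only.

```python
from typing import List, Sequence, Tuple


def longest_repeated_run(tokens: Sequence[str]) -> Tuple[int, int, int]:
    """Return (length, pos_a, pos_b) of the longest token run that occurs twice.

    Naive construction (sorting whole suffixes); fine for a toy, unusable at scale.
    The two occurrences are allowed to overlap.
    """
    n = len(tokens)
    # Suffix array: start positions ordered by the suffix that begins there.
    suffix_array: List[int] = sorted(range(n), key=lambda i: list(tokens[i:]))
    best = (0, -1, -1)
    for a, b in zip(suffix_array, suffix_array[1:]):
        # Longest common prefix of two neighbouring suffixes.
        lcp = 0
        while a + lcp < n and b + lcp < n and tokens[a + lcp] == tokens[b + lcp]:
            lcp += 1
        if lcp > best[0]:
            best = (lcp, min(a, b), max(a, b))
    return best


demo_tokens = "int a = b + c ; int x = b + c ;".split()
print(longest_repeated_run(demo_tokens))  # prints (5, 2, 9): "= b + c ;" occurs twice
```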
- Supported OS: Linux/Windows/Mac. Ensure you use the corresponding executable file for your operating system.
- Environment:
  - Git
  - JDK 1.8 or higher
  - NVCC (recommended version: release 12.4, V12.4.99)
  - Maven (recommended version: 3.6.3)
We have tested SAGA with JDK 1.8.0_181, NVCC V12.4.99 and Maven 3.6.3 on Ubuntu 20.04.6 LTS.
- An Nvidia GPU with at least 4 GB of graphics memory is recommended if you use the GPU acceleration feature.
If you encounter issues with path configurations, use absolute paths for testing.
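Before building, it can help to confirm that the prerequisite tools are reachable. The following is a minimal sketch, assuming the tools are invoked as `git`, `java`, `nvcc` and `mvn`; those command names and the helper `locate_tools` are assumptions, not part of SAGA.

```python
import shutil
from typing import Dict, Optional, Sequence

# Command names assumed from the prerequisites listed above.
REQUIRED_TOOLS = ("git", "java", "nvcc", "mvn")


def locate_tools(tools: Sequence[str] = REQUIRED_TOOLS) -> Dict[str, Optional[str]]:
    """Map each tool name to the path found on PATH, or None if it is missing."""
    return {tool: shutil.which(tool) for tool in tools}


if __name__ == "__main__":
    for tool, path in locate_tools().items():
        print(f"{tool:5s} {path or 'NOT FOUND'}")
```

The printed paths can also be used wherever a path is needed, which fits the advice to test with absolute paths.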
git clone git@github.com:FudanSELab/SAGACloneDetector.git
cd /path/to/SAGACloneDetector
nvcc -o executable/sa_gpu scripts/suffix-construct.cu --expt-extended-lambda
This will generate an executable file sa_gpu in the SAGACloneDetector/executable directory. Part of the directory structure should appear as follows:
executable
└── sa_gpu
scripts
└── suffix-construct.cu
Note that this core executable uses the GPU for code clone detection, so an Nvidia GPU is required.
cd /path/to/SAGACloneDetector
mvn package -Dmaven.test.skip=true
mv target/SAGACloneDetector.jar .
cd /path/to/SAGACloneDetector
# This will generate config.properties in the SAGACloneDetector directory.
java -jar SAGACloneDetector.jar testcase/code/java
❗ The config.properties file must be in the working directory.
# config.properties
process-build=1
sep-num=100000000
min-line=2
mlcc=20
language=java
threshold=0.7
use-long-type=0
granularity=method
extensions=java
process-parse=1
tokenize=1
thread-num=8
open-string-hash=1
mlc=50
exe=
process-tokenize=1
Functions of the parameters are as follows:
#process-build: Whether to execute the build process, 0 for off, 1 for on
#sep-num: The separator number of token piece
#min-line: The minimum number of lines in a method
#mlcc: The minimum number of tokens in a snippet
#language: The language of the source files (java, c, cpp, py, js, go, common)
#threshold: The threshold of clone detection (0 ~ 1)
#use-long-type: The data type used to store intermediate data, 0 for int, 1 for long
#granularity: The detection granularity (file, method, snippet)
#extensions: The comma-separated file suffixes
#process-parse: Whether to execute the parse process, 0 for off, 1 for on
#tokenize: Whether to tokenize source files, 0 for off, 1 for on
#thread-num: The number of threads for parallel suffix array construction
#open-string-hash: Whether to enable string hashing during tokenization, 0 for off, 1 for on
#mlc: The minimum number of tokens in a method
#exe: The path of the executable file
#process-tokenize: Whether to execute the tokenize process, 0 for off, 1 for on
Set exe to the path of the executable compiled in the Compile the Executable step:
# config.properties
...
exe=executable/sa_gpu
...
cd /path/to/SAGACloneDetector
java -jar SAGACloneDetector.jar /path/to/DetectedRepoDirectory
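Since a run depends on config.properties being filled in correctly, a quick sanity check before launching can save a failed run. The sketch below checks only what the parameter descriptions state (0/1 switches, documented languages and granularities, a threshold between 0 and 1, a non-empty exe). SAGA itself may accept or reject values differently, and the helper names `parse_properties` and `check_config` are hypothetical.

```python
from typing import Dict, List

LANGUAGES = {"java", "c", "cpp", "py", "js", "go", "common"}
GRANULARITIES = {"file", "method", "snippet"}
SWITCHES = ("process-build", "process-parse", "process-tokenize",
            "tokenize", "use-long-type", "open-string-hash")


def parse_properties(text: str) -> Dict[str, str]:
    """Parse simple key=value lines, skipping blanks and '#' comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        props[key.strip()] = value.strip()
    return props


def check_config(props: Dict[str, str]) -> List[str]:
    """Return a list of human-readable problems; an empty list means none found."""
    problems = []
    for key in SWITCHES:
        if props.get(key) not in ("0", "1"):
            problems.append(f"{key} should be 0 or 1")
    if props.get("language") not in LANGUAGES:
        problems.append("language is not one of the documented values")
    if props.get("granularity") not in GRANULARITIES:
        problems.append("granularity should be file, method or snippet")
    try:
        if not 0.0 <= float(props.get("threshold", "")) <= 1.0:
            problems.append("threshold should lie between 0 and 1")
    except ValueError:
        problems.append("threshold is not a number")
    if not props.get("exe"):
        problems.append("exe is empty; set it to the compiled executable")
    return problems


# The generated defaults, copied from the listing above (exe is still empty).
sample = """
process-build=1
sep-num=100000000
min-line=2
mlcc=20
language=java
threshold=0.7
use-long-type=0
granularity=method
extensions=java
process-parse=1
tokenize=1
thread-num=8
open-string-hash=1
mlc=50
exe=
process-tokenize=1
"""
print(check_config(parse_properties(sample)))
# prints ['exe is empty; set it to the compiled executable']
```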
- logs/: Stores log files, which are automatically backed up by date.
- tokenData/: Intermediate files, with no direct value.
- result/: Directory for detection result files.
- files.txt: Contains paths of the detected files, filtered based on the configured extensions.
- MeasureIndex.csv (see details below)
- Method-level detection result files (see details below):
  - type123_method_result.csv
  - type123_method_group_result.csv
- Snippet-level detection result files (see details below):
  - type12_snippet_result.csv
  - type3_snippet_result.csv
MeasureIndex.csv
- Stores information about all the detected methods.
- Data format: MethodID, File Path, Start Line, End Line.
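The method index is what turns the numeric IDs in the result files back into source locations. A minimal loading sketch follows, assuming the file is comma-separated, has no header row, and uses integer IDs; none of these is confirmed by the format description, and `MethodInfo` and `load_measure_index` are hypothetical names.

```python
import csv
import io
from typing import Dict, NamedTuple


class MethodInfo(NamedTuple):
    path: str
    start_line: int
    end_line: int


def load_measure_index(handle) -> Dict[int, MethodInfo]:
    """Read rows of MethodID, File Path, Start Line, End Line into a dict."""
    index = {}
    for row in csv.reader(handle):
        if len(row) < 4:
            continue  # skip blank or malformed rows
        method_id, path, start, end = (field.strip() for field in row[:4])
        index[int(method_id)] = MethodInfo(path, int(start), int(end))
    return index


# Made-up rows in the documented column order.
demo = io.StringIO("0,src/A.java,10,25\n1,src/B.java,3,18\n")
index = load_measure_index(demo)
print(index[1].path, index[1].end_line - index[1].start_line + 1)  # prints src/B.java 16
```

For a real run, pass an open file handle for MeasureIndex.csv instead of the in-memory demo.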
Method-Level Detection Result Files
- type123_method_result.csv: Contains the detected clone pairs.
  - Data format: Method1_ID, Method2_ID, Similarity.
- type123_method_group_result.csv: Contains the clone groups, which are merged based on clone pairs.
  - Data format: Method1_ID, Method2_ID, Method3_ID, ....
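One plausible way to get from clone pairs to clone groups is to treat each pair as an edge and take connected components. The sketch below does this with a small union-find. Whether SAGA merges pairs transitively in exactly this way is an assumption, as are the helper names and the absence of a header row in the pair file.

```python
import csv
import io
from typing import Dict, Iterable, List, Tuple


def read_pairs(handle, min_similarity: float = 0.0) -> List[Tuple[int, int]]:
    """Read Method1_ID, Method2_ID, Similarity rows, keeping similar-enough pairs."""
    pairs = []
    for row in csv.reader(handle):
        if len(row) < 3:
            continue  # skip blank or malformed rows
        if float(row[2]) >= min_similarity:
            pairs.append((int(row[0]), int(row[1])))
    return pairs


def merge_into_groups(pairs: Iterable[Tuple[int, int]]) -> List[List[int]]:
    """Group method IDs that are connected through any chain of clone pairs."""
    parent: Dict[int, int] = {}

    def find(x: int) -> int:
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups: Dict[int, List[int]] = {}
    for method_id in parent:
        groups.setdefault(find(method_id), []).append(method_id)
    return sorted(sorted(members) for members in groups.values())


# Made-up pairs in the documented column order.
demo = io.StringIO("1,2,0.91\n2,3,0.84\n7,8,0.75\n")
print(merge_into_groups(read_pairs(demo)))  # prints [[1, 2, 3], [7, 8]]
```

Raising `min_similarity` before merging gives tighter groups, at the cost of splitting chains that are held together by weaker pairs.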
Snippet-Level Detection Result Files
- type12_snippet_result.csv: Contains Type-1 and Type-2 clone detection results (for the definition of clone types, see the notes below).
- type3_snippet_result.csv: Contains Type-3 clone detection results.
- The data format for both files is identical:
CloneGroupID, MethodID, File Path, Method Start Line, Method End Line, Snippet Start Position in Method Token Sequence, Snippet End Position in Method Token Sequence, Snippet Start Line, Snippet End Line.
- Snippet-level detection results are presented as clone pairs. Due to the lack of clear boundaries in snippets, it is not recommended to combine them into clone groups. If needed, you can manually perform the combination.
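Reading the nine-column snippet rows into named fields makes the results easier to inspect. The sketch below collects rows by CloneGroupID, assuming the two sides of a clone pair share the same CloneGroupID, that the file is comma-separated, and that it has no header row. These points are assumptions, and `SnippetRow` and `read_snippet_rows` are hypothetical names.

```python
import csv
import io
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class SnippetRow:
    group_id: int
    method_id: int
    path: str
    method_start_line: int
    method_end_line: int
    token_start: int  # snippet start position in the method token sequence
    token_end: int    # snippet end position in the method token sequence
    snippet_start_line: int
    snippet_end_line: int


def read_snippet_rows(handle) -> Dict[int, List[SnippetRow]]:
    """Collect rows by CloneGroupID, using the nine documented columns."""
    by_group: Dict[int, List[SnippetRow]] = defaultdict(list)
    for row in csv.reader(handle):
        if len(row) < 9:
            continue  # skip blank or malformed rows
        f = [field.strip() for field in row[:9]]
        record = SnippetRow(int(f[0]), int(f[1]), f[2], *(int(v) for v in f[3:]))
        by_group[record.group_id].append(record)
    return dict(by_group)


# Made-up rows in the documented column order.
demo = io.StringIO(
    "0,12,src/A.java,10,40,5,60,12,20\n"
    "0,31,src/B.java,7,33,9,64,9,17\n"
)
for group_id, rows in read_snippet_rows(demo).items():
    print(group_id, [(r.path, r.snippet_start_line, r.snippet_end_line) for r in rows])
# prints 0 [('src/A.java', 12, 20), ('src/B.java', 9, 17)]
```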
- Type-1: Identical code fragments, differing only in whitespace, layout, and comments.
- Type-2: Syntactically identical fragments, differing in identifiers, literals, types, whitespace, layout, and comments.
- Type-3: Copied fragments with additional modifications, such as added, changed, or removed statements, along with differences in identifiers, literals, types, whitespace, layout, and comments.
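The difference between Type-1 and Type-2 can be made concrete with a toy normalizer. This is only an illustration of the definitions, not how SAGA tokenizes code: it uses a crude regular-expression tokenizer, a tiny made-up keyword set, and ignores string literals.

```python
import re
from typing import List

KEYWORDS = {"int", "return", "if", "else", "for", "while"}  # tiny illustrative subset
TOKEN = re.compile(r"[A-Za-z_]\w*|\d+|\S")


def tokens(code: str) -> List[str]:
    """Drop // and /* */ comments, then split into tokens (whitespace vanishes)."""
    code = re.sub(r"//[^\n]*|/\*.*?\*/", " ", code, flags=re.S)
    return TOKEN.findall(code)


def abstracted(code: str) -> List[str]:
    """Replace identifiers and number literals with placeholders."""
    out = []
    for tok in tokens(code):
        if tok.isdigit():
            out.append("LIT")
        elif re.match(r"[A-Za-z_]", tok) and tok not in KEYWORDS:
            out.append("ID")
        else:
            out.append(tok)
    return out


a = "int sum = a + 1; // add"
b = "int  sum = a+1;"
c = "int total = x + 42;"
print(tokens(a) == tokens(b))          # prints True: Type-1, only layout and comments differ
print(tokens(a) == tokens(c))          # prints False: identifiers and literals differ
print(abstracted(a) == abstracted(c))  # prints True: Type-2, equal once names and literals are abstracted
```

A Type-3 pair would still differ after this abstraction, for example by an added or removed statement, which is why such clones are judged by a similarity score rather than exact equality.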
git clone git@github.com:jeffsvajlenko/BigCloneEval.git
Refer to the BigCloneEval documentation for steps such as initializing the database and registering tools.
PS: Before Step 4 of BigCloneEval, modify the code in src/cloneMatchingAlgorithms/CoverageMatcher.java around line 100.
The original code is
if(tolerence != null) {
    stmt.setInt(12, f1.getStartline() - tolerence);
    stmt.setInt(13, f1.getEndline() + tolerence);
    stmt.setInt(14, f2.getStartline() - tolerence);
    stmt.setInt(15, f2.getEndline() + tolerence);
} else if(dtolerence != null) {
    stmt.setInt(12, f1.getEndline());
    stmt.setInt(13, f1.getStartline());
    stmt.setInt(14, f2.getEndline());
    stmt.setInt(15, f2.getStartline());
}
It should be
if(tolerence != null) {
    stmt.setInt(11, f1.getStartline() - tolerence);
    stmt.setInt(12, f1.getEndline() + tolerence);
    stmt.setInt(13, f2.getStartline() - tolerence);
    stmt.setInt(14, f2.getEndline() + tolerence);
} else if(dtolerence != null) {
    stmt.setInt(11, f1.getEndline());
    stmt.setInt(12, f1.getStartline());
    stmt.setInt(13, f2.getEndline());
    stmt.setInt(14, f2.getStartline());
}
cp /path/to/SAGACloneDetector/scripts/import /path/to/BigCloneEval/commands
# scan_dir: directory of repos to be scanned
# base_dir: directory where SAGACloneDetector.jar is located
cd /path/to/SAGACloneDetector/scripts
python detect_merge.py --scan_dir=/path/to/BigCloneEval/ijadataset/bcb_reduced --base_dir=/path/to/SAGACloneDetector
cd /path/to/BigCloneEval/commands
# Ensure that the directories Result_2, Result_3 ... exist in /path/to/SAGACloneDetector/verify_result.
./import <YourToolID> /path/to/SAGACloneDetector/verify_result
Refer to the original BigCloneEval repository for parameter details.
./evaluateTool -t <YourToolID> -o <ReportPath> --st BOTH -m "CoverageMatcher 0.7 line 4" --mit 50 --mip 6
less <ReportPath>
This repository is maintained by the CodeWisdom Team of Fudan University.
You may report bugs by submitting an issue to the GitHub repository or sending an email to ([email protected]).



