
Conversation

@bangqipropel
Collaborator

Reason for Change:

Requirements

  • Added unit tests and e2e tests (if applicable).

Issue Fixed:

Notes for Reviewers:

@bangqipropel changed the title "feat: code benchmarking support" → ": code benchmarking support" on Dec 8, 2025
@kaito-pr-agent

kaito-pr-agent bot commented Dec 8, 2025

Title

Add RAG-based code issue resolver with benchmark suite


Description

  • Added RAG-based issue resolver for code modifications

  • Implemented TOP-4 relevance filtering for token efficiency (see the sketch after this list)

  • Added comprehensive benchmark documentation and guides

  • Integrated test validation with Go/Python unit tests
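
A minimal sketch of the TOP-4 idea (illustrative only: the chunk structure and the `score` field are assumptions, not the PR's actual API) is to rank retrieved context by relevance score and keep the four best chunks before building the prompt:

    from typing import Dict, List

    def top_k_chunks(retrieved_chunks: List[Dict], k: int = 4) -> List[Dict]:
        """Keep only the k highest-scoring chunks to bound prompt size."""
        ranked = sorted(retrieved_chunks, key=lambda c: c.get("score", 0.0), reverse=True)
        return ranked[:k]

    chunks = [
        {"path": "pkg/a.go", "score": 0.91}, {"path": "pkg/b.go", "score": 0.42},
        {"path": "pkg/c.go", "score": 0.77}, {"path": "pkg/d.go", "score": 0.65},
        {"path": "pkg/e.go", "score": 0.12},
    ]
    print([c["path"] for c in top_k_chunks(chunks)])  # the 4 most relevant paths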


Changes walkthrough 📝

Relevant files

Enhancement (1 file)
  • rag_solution.py: Implement RAG-based code modification tool with TOP-4 filtering (+1277/-0)

Documentation (2 files)
  • GETTING_STARTED.md: Add quick start guide for code benchmark suite (+125/-0)
  • README.md: Add overview documentation for code benchmark suite (+52/-0)

Additional files (6 files)
  • CODE_BENCHMARK_ARCHITECTURE.md (+664/-0)
  • CODE_BENCHMARK_GUIDE.md (+562/-0)
  • code_benchmark.py (+544/-0)
  • generate_issues.py (+529/-0)
  • make_presentation.py (+538/-0)
  • resolve_issues_baseline.py (+1199/-0)

Need help?
  • Type /help how to ... in the comments thread for any questions about PR-Agent usage.
  • Check out the documentation for more information.

    @kaito-pr-agent

    kaito-pr-agent bot commented Dec 8, 2025

    PR Reviewer Guide 🔍

    Here are some key observations to aid the review process:

    ⏱️ Estimated effort to review: 4 🔵🔵🔵🔵⚪
    🧪 No relevant tests
    🔒 Security concerns

    Path traversal risk:
    The code writes files based on RAG-provided paths without sanitization (lines 729-734). Malicious path values could lead to arbitrary file writes.

    Command injection risk:
    The test execution builds subprocess commands from dynamic arguments (lines 761, 832) without input sanitization.
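
    On the command-injection concern, one possible mitigation (a sketch only; the helper name and the allowlist pattern are assumptions) is to validate dynamic values such as package paths before they reach subprocess:

    import re
    import subprocess

    # Hypothetical allowlist: characters expected in Go package paths.
    _PKG_RE = re.compile(r"^[A-Za-z0-9._/\-]+$")

    def run_go_test_safely(pkg: str, repo_path: str) -> subprocess.CompletedProcess:
        """Reject suspicious package strings before invoking `go test`."""
        if not _PKG_RE.fullmatch(pkg) or pkg.startswith("-"):
            raise ValueError(f"refusing suspicious package argument: {pkg!r}")
        # List-form argv without shell=True already avoids shell interpolation;
        # the allowlist additionally blocks flag injection such as "-exec".
        return subprocess.run(
            ["go", "test", "-v", pkg],
            cwd=repo_path, capture_output=True, text=True, timeout=300,
        )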

    ⚡ Recommended focus areas for review

    Path Validation

    The file modification logic writes to paths without validating that they fall within the repository directory, which could lead to unintended file system modifications if RAG returns absolute paths or paths containing "..".

    file_path = self.repo_path / file_info['path']
    new_content = file_info['content']
    
    if not file_path.exists():
        print(f"  ⚠️  File not found: {file_path}")
        continue
    Error Handling

    The RAG API error handling only retries on HTTP 500. Other non-200 status codes, including transient ones such as 429/502/503 as well as client errors like 401/403/404, are not retried and immediately fail the request (a broader policy is sketched after the snippet below).

    elif response.status_code == 500 and retry < 2:
        print(f"  ⚠️  RAG API HTTP 500 error, retrying in {2 ** retry} seconds... (attempt {retry + 1}/3)")
        time.sleep(2 ** retry)  # exponential backoff: 1s, 2s
        continue
    else:
        print(f"  ✗ RAG API request failed: HTTP {response.status_code}")
        print(f"    Response: {response.text}")
        return None
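
    A broader retry policy might look like the following (a sketch, not the PR's code; the transient status-code set and attempt count are assumptions): back off on transient codes, fail fast on client errors.

    import time
    import requests

    TRANSIENT = {429, 500, 502, 503, 504}  # worth retrying with backoff

    def post_with_retries(url: str, payload: dict, attempts: int = 3):
        """Retry transient HTTP failures; fail fast on 4xx client errors."""
        for attempt in range(attempts):
            response = requests.post(url, json=payload, timeout=60)
            if response.status_code == 200:
                return response
            if response.status_code in TRANSIENT and attempt < attempts - 1:
                delay = 2 ** attempt  # exponential backoff: 1s, 2s, ...
                print(f"  ⚠️  HTTP {response.status_code}, retrying in {delay}s (attempt {attempt + 1}/{attempts})")
                time.sleep(delay)
                continue
            print(f"  ✗ RAG API request failed: HTTP {response.status_code}")
            return None
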
    Test Reliability

    The test runner executes commands directly without sandboxing. Changes applied during tests could persist if the revert mechanism fails, affecting subsequent test runs (a defensive pattern is sketched after the excerpt below).

    def run_go_tests(self, packages: List[str]) -> Dict:
        """Run Go tests for specified packages."""
        print(f"  🧪 Running Go tests for packages: {', '.join(packages)}")
    
        all_output = []
        all_passed = True
        tested_packages = []
    
        for pkg in packages:
            print(f"    Testing Go package {pkg}...")
            try:
                result = subprocess.run(
                    ['go', 'test', '-v', pkg],
                    cwd=self.repo_path,
                    capture_output=True,
                    text=True,
                    timeout=300
                )
    
                output = result.stdout + result.stderr
                all_output.append(f"=== Go Package: {pkg} ===\n{output}\n")
    
                if result.returncode != 0:
                    all_passed = False
                    print(f"    ✗ Go tests failed for {pkg}")
                else:
                    print(f"    ✓ Go tests passed for {pkg}")
    
                tested_packages.append(pkg)
    
            except subprocess.TimeoutExpired:
                print(f"    ⚠️  Go test timeout for {pkg}")
                all_output.append(f"=== Go Package: {pkg} ===\nTIMEOUT\n")
                all_passed = False
            except Exception as e:
                print(f"    ⚠️  Go test error for {pkg}: {e}")
                all_output.append(f"=== Go Package: {pkg} ===\nERROR: {e}\n")
                all_passed = False
    
        return {
            "status": "passed" if all_passed else "failed",
            "packages": tested_packages,
            "output": "\n".join(all_output)
        }
    
    def run_python_tests(self, targets: Set[Tuple[str, str]]) -> Dict:
        """Run Python tests for specified targets."""
        print(f"  🐍 Running Python tests...")
    
        all_output = []
        all_passed = True
        tested_files = []
    
        for target, test_type in targets:
            if test_type == 'syntax_only':
                print(f"    Checking Python syntax: {target}...")
                try:
                    result = subprocess.run(
                        ['python3', '-m', 'py_compile', target],
                        cwd=self.repo_path,
                        capture_output=True,
                        text=True,
                        timeout=30
                    )
    
                    if result.returncode != 0:
                        all_passed = False
                        print(f"    ✗ Syntax error in {target}")
                        all_output.append(f"=== Python Syntax: {target} ===\n{result.stderr}\n")
                    else:
                        print(f"    ✓ Python syntax OK: {target}")
                        all_output.append(f"=== Python Syntax: {target} ===\nOK\n")
    
                except Exception as e:
                    print(f"    ⚠️  Syntax check error: {e}")
                    all_output.append(f"=== Python Syntax: {target} ===\nERROR: {e}\n")
                    all_passed = False
    
            elif test_type == 'file':
                print(f"    Testing Python file {target}...")
                try:
                    result = subprocess.run(
                        ['python3', '-m', 'pytest', target, '-v', '--tb=short'],
                        cwd=self.repo_path,
                        capture_output=True,
                        text=True,
                        timeout=300
                    )
    
                    output = result.stdout + result.stderr
                    all_output.append(f"=== Python Test: {target} ===\n{output}\n")
    
                    if result.returncode != 0:
                        all_passed = False
                        print(f"    ✗ Python tests failed for {target}")
                    else:
                        print(f"    ✓ Python tests passed for {target}")
    
                    tested_files.append(target)
    
                except FileNotFoundError:
                    print(f"    ⚠️  pytest not found, checking syntax only...")
                    result = subprocess.run(
                        ['python3', '-m', 'py_compile', target],
                        cwd=self.repo_path,
                        capture_output=True,
                        text=True
                    )
                    if result.returncode != 0:
                        all_passed = False
                        all_output.append(f"=== Python: {target} ===\nSyntax Error\n{result.stderr}\n")
                    else:
                        all_output.append(f"=== Python: {target} ===\nSyntax OK (pytest not available)\n")
    
                except subprocess.TimeoutExpired:
                    print(f"    ⚠️  Python test timeout for {target}")
                    all_output.append(f"=== Python Test: {target} ===\nTIMEOUT\n")
                    all_passed = False
                except Exception as e:
                    print(f"    ⚠️  Python test error: {e}")
                    all_output.append(f"=== Python Test: {target} ===\nERROR: {e}\n")
                    all_passed = False
    
        return {
            "status": "passed" if all_passed else "failed",
            "files": list(tested_files),
            "output": "\n".join(all_output)
        }
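
    On the revert concern noted above, one defensive pattern (a sketch that assumes the benchmark repo is a git checkout; the context-manager name is hypothetical and this is not the PR's actual revert mechanism) is to restore the working tree around each apply-and-test cycle:

    import subprocess
    from contextlib import contextmanager

    @contextmanager
    def clean_worktree(repo_path: str):
        """Restore the repo's working tree even if a test run fails midway."""
        try:
            yield
        finally:
            # Discard modifications to tracked files and delete untracked files,
            # so a failed revert in the harness cannot leak into the next run.
            subprocess.run(["git", "checkout", "--", "."], cwd=repo_path, check=False)
            subprocess.run(["git", "clean", "-fd"], cwd=repo_path, check=False)

    # Usage sketch: with clean_worktree(str(repo_path)): apply changes, then run tests.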

    @kaito-pr-agent

    kaito-pr-agent bot commented Dec 8, 2025

    PR Code Suggestions ✨

    Explore these optional code suggestions:

    Security
    Prevent path traversal vulnerabilities

    Add path traversal protection before writing files. Currently, the code writes files
    based on RAG-provided paths without verifying they're within the repository. This
    could allow malicious path manipulation. Resolve each path and check that it stays
    inside the repo directory (the diff below uses Path.resolve() with Path.is_relative_to()).

    code_benchmark/rag_solution.py [728-734]

     for file_info in modifications['files']:
    -    file_path = self.repo_path / file_info['path']
    -    new_content = file_info['content']
    -    
    -    if not file_path.exists():
    -        print(f"  ⚠️  File not found: {file_path}")
    +    try:
    +        # Resolve and validate path stays within repo
    +        file_path = (self.repo_path / file_info['path']).resolve()
    +        if not file_path.is_relative_to(self.repo_path.resolve()):
    +            print(f"  ⚠️  Path traversal attempt blocked: {file_info['path']}")
    +            continue
    +    except Exception as e:
    +        print(f"  ⚠️  Invalid path: {file_info['path']}: {e}")
             continue
    Suggestion importance [1-10]: 9

    Why: Adds critical security protection against path traversal attacks by ensuring files are written only within the repository directory.

    Impact: High
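
    For reference, a self-contained version of the guard above (a sketch: the `modifications` structure is taken from the excerpt, the write step is assumed, and Path.is_relative_to requires Python 3.9+):

    from pathlib import Path

    def apply_modifications(repo_path: Path, modifications: dict) -> None:
        """Write RAG-proposed file contents, refusing paths outside the repo."""
        repo_root = repo_path.resolve()
        for file_info in modifications["files"]:
            candidate = (repo_root / file_info["path"]).resolve()
            if not candidate.is_relative_to(repo_root):
                print(f"  ⚠️  Path traversal attempt blocked: {file_info['path']}")
                continue
            if not candidate.exists():
                print(f"  ⚠️  File not found: {candidate}")
                continue
            candidate.write_text(file_info["content"], encoding="utf-8")
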
    Possible issue
    Add input size limits

    Add a maximum character limit to prevent denial-of-service from extremely large
    inputs. The current loop could process unlimited content, consuming excessive
    resources. Set a reasonable limit like 100,000 characters.

    code_benchmark/rag_solution.py [353-391]

     def _extract_string_content(self, text: str, start_pos: int) -> Optional[str]:
         content_chars = []
         i = start_pos
         escape_next = False
    +    MAX_CHARS = 100000  # Prevent DoS from huge inputs
         
    -    while i < len(text):
    -        char = text[i]
    +    while i < len(text) and len(content_chars) < MAX_CHARS:
             ...
    -        i += 1
    +    if len(content_chars) >= MAX_CHARS:
    +        return None  # Or handle truncation
    Suggestion importance [1-10]: 8

    Why: Prevents potential DoS attacks by limiting input size, addressing a critical security vulnerability in string extraction.

    Impact: Medium
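
    Spelled out in full, a bounded extractor could look like this (a sketch; the original function's escape handling is elided in the diff above, so this reconstruction is an assumption, and the same pattern applies to the near-identical suggestion for resolve_issues_baseline.py below):

    from typing import Optional

    def extract_string_content(text: str, start_pos: int,
                               max_chars: int = 100_000) -> Optional[str]:
        """Scan a JSON-style string from start_pos to its closing quote,
        bailing out once max_chars characters have been collected."""
        content_chars = []
        i = start_pos
        escape_next = False
        while i < len(text) and len(content_chars) < max_chars:
            char = text[i]
            if escape_next:
                content_chars.append(char)
                escape_next = False
            elif char == "\\":
                content_chars.append(char)
                escape_next = True
            elif char == '"':
                return "".join(content_chars)  # unescaped quote ends the string
            else:
                content_chars.append(char)
            i += 1
        return None  # unterminated string or size limit exceeded
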
    Avoid injecting truncation notes into code

    Adding a truncation note directly in code files could cause syntax errors in some
    languages. Instead, add this as a separate comment outside the file content to avoid
    corrupting the actual code.

    code_benchmark/resolve_issues_baseline.py [153-155]

     if self.head_lines and self.head_lines > 0:
    -    # Stream only first N lines
         collected_lines = []
         with open(full_path, 'r', encoding='utf-8') as f:
             for i, line in enumerate(f, 1):
                 collected_lines.append(line)
                 if i >= self.head_lines:
                     break
    -    truncated_note = f"\n/* Truncated to first {self.head_lines} lines for brevity */\n"
    -    contents[file_path] = "".join(collected_lines) + truncated_note
    +    contents[file_path] = "".join(collected_lines)
    +    print(f"  ✓ Loaded (truncated to {self.head_lines} lines): {file_path}")
    Suggestion importance [1-10]: 8

    Why: Prevents potential syntax corruption by removing injected comments in code files. This is critical as the truncation note could break valid code syntax.

    Impact: Medium
    Normalize path comparisons

    Normalize paths consistently before comparison. The current approach uses different
    normalization methods (lstrip vs full resolution), which could cause mismatches. Use
    os.path.normpath() for reliable path comparisons.

    code_benchmark/rag_solution.py [494-498]

    -rag_path = rag_path.lstrip('./')
    +rag_path = os.path.normpath(rag_path.lstrip('./'))
     ...
    -if rag_path in real_paths:
    +if os.path.normpath(rag_path) in [os.path.normpath(p) for p in real_paths]:
         return rag_path
    Suggestion importance [1-10]: 7

    Why: Improves path matching reliability by normalizing paths, fixing potential mismatches in file path handling.

    Impact: Medium
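
    A quick illustration of why normalization matters (paths are hypothetical):

    import os.path

    real_paths = {"pkg/utils/helpers.py"}
    rag_path = "./pkg/utils/../utils/helpers.py"

    # Note: str.lstrip("./") strips a *set of characters*, not a prefix,
    # so it cannot resolve "..", and "../x" would even collapse to "x".
    print(rag_path.lstrip("./") in real_paths)        # False: raw string mismatch
    print(os.path.normpath(rag_path) in real_paths)   # True: "pkg/utils/helpers.py"
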
    Add parser safety limits

    If the string never terminates with a quote, the current loop scans all remaining
    input. Add a maximum length check to bound the work done when parsing malformed
    responses.

    code_benchmark/resolve_issues_baseline.py [446-476]

     def _extract_string_content(self, text: str, start_pos: int) -> Optional[str]:
         content_chars = []
         i = start_pos
         escape_next = False
    +    max_length = 100000  # Prevent infinite loops
         
    -    while i < len(text):
    -        char = text[i]
    +    while i < len(text) and len(content_chars) < max_length:
             ...
    -        i += 1
    +    return None  # If we exceed max length
    Suggestion importance [1-10]: 7

    Why: Adds an important safeguard against runaway scans during JSON parsing. While the existing 50k character limit provides some protection, the explicit iteration limit is more robust.

    Impact: Medium
    Add robust subprocess error handling

    The subprocess call only catches CalledProcessError but misses other exceptions like
    FileNotFoundError if the script is missing. Add a broader exception handler to
    prevent benchmark crashes from unexpected errors.

    code_benchmark/code_benchmark.py [83-93]

     try:
         result = subprocess.run(
             cmd,
             cwd=str(self.repo_path),
             check=True,
             capture_output=False  # Show output in real-time
         )
         print("\n✅ Baseline completed successfully\n")
         return True
     except subprocess.CalledProcessError as e:
         print(f"\n❌ Baseline failed with exit code {e.returncode}\n")
         return False
    +except Exception as e:
    +    print(f"\n❌ Unexpected error during baseline run: {e}")
    +    return False
    Suggestion importance [1-10]: 7

    Why: Prevents benchmark crashes from unexpected errors, but only adds basic error handling without specific recovery.

    Impact: Medium
    Handle file write errors

    Add error handling for file write operations. Currently, a failed write (permissions,
    disk space) raises an unhandled exception and aborts the run. Wrap the write in
    try/except and emit a meaningful error message.

    code_benchmark/rag_solution.py [962-964]

    -test_file = self.repo_path / file_info['path']
    -...
    -with open(test_file, 'w', encoding='utf-8') as f:
    -    f.write(test_results['output'])
    +try:
    +    with open(test_file, 'w', encoding='utf-8') as f:
    +        f.write(test_results['output'])
    +except OSError as e:
    +    print(f"  ❌ Failed to write test output: {e}")
    Suggestion importance [1-10]: 6

    Why: Adds basic error handling for file operations to prevent silent failures during test output saving.

    Impact: Low
    Improve file error handling specificity

    The current error handling catches all exceptions but doesn't differentiate between
    file not found and other errors. This could mask critical issues like permission
    errors. Use more specific exception handling to distinguish between file access
    errors and other issues.

    code_benchmark/resolve_issues_baseline.py [170-171]

     try:
         with open(full_path, 'r', encoding='utf-8') as f:
             contents[file_path] = f.read()
    +except FileNotFoundError:
    +    print(f"  ⚠️  File not found: {file_path}")
    +except PermissionError:
    +    print(f"  ⚠️  Permission denied: {file_path}")
     except Exception as e:
         print(f"  ⚠️  Error reading {file_path}: {e}")
    Suggestion importance [1-10]: 6

    Why: Adds specific exception handling for common file errors, which improves debuggability. However, the current FileNotFoundError is already handled earlier at line 141.

    Impact: Low
    Handle corrupted JSON reports

    Missing JSON decoding error handling could crash the benchmark if reports are
    corrupted. Add try-except blocks around json.load() calls to handle malformed files
    gracefully.

    code_benchmark/code_benchmark.py [140-142]

     if baseline_report.exists():
    -    with open(baseline_report, 'r', encoding='utf-8') as f:
    -        baseline_data = json.load(f)
    +    try:
    +        with open(baseline_report, 'r', encoding='utf-8') as f:
    +            baseline_data = json.load(f)
    +    except json.JSONDecodeError:
    +        print(f"⚠️  Corrupted baseline report: {baseline_report}")
    +        baseline_data = None
     ...
     if rag_report.exists():
    -    with open(rag_report, 'r', encoding='utf-8') as f:
    -        rag_data = json.load(f)
    +    try:
    +        with open(rag_report, 'r', encoding='utf-8') as f:
    +            rag_data = json.load(f)
    +    except json.JSONDecodeError:
    +        print(f"⚠️  Corrupted RAG report: {rag_report}")
    +        rag_data = None
    Suggestion importance [1-10]: 6

    Why: Adds resilience against malformed reports, though corruption is unlikely in normal operation.

    Impact: Low
    General
    Improve number formatting

    Large token values are hard to read in console output. Format token counts with
    commas for thousands separators to improve readability of large numbers.

    code_benchmark/code_benchmark.py [377]

     print(f"📄 Summary report saved: {summary_file}\n")
     ...
    -print(f"📊 Token Ratio: {improvements['token_ratio']:.2f}x")
    +print(f"📊 Token Ratio: {improvements['token_ratio']:,.2f}x")  # Added comma formatting
    Suggestion importance [1-10]: 4

    Why: Token ratios are floats (typically 0.0-2.0) where comma formatting is unnecessary and would reduce readability.

    Impact: Low

    Signed-off-by: Bangqi Zhu <[email protected]>
