664 changes: 664 additions & 0 deletions code_benchmark/CODE_BENCHMARK_ARCHITECTURE.md

Large diffs are not rendered by default.

565 changes: 565 additions & 0 deletions code_benchmark/CODE_BENCHMARK_GUIDE.md

Large diffs are not rendered by default.

125 changes: 125 additions & 0 deletions code_benchmark/GETTING_STARTED.md
@@ -0,0 +1,125 @@
# Getting Started with Code Benchmark

Quick start guide for running your first benchmark.

## Prerequisites

```bash
# Install dependencies
pip install openai anthropic requests

# Set API key
export OPENAI_API_KEY="your-api-key-here"

# Start RAG service (for RAG solution)
# cd presets/ragengine && python main.py
```

## 5-Minute Quickstart

### 1. Generate Test Issues (2 minutes)

```bash
python generate_issues.py \
--repo /path/to/your/repo \
--index kaito_code_benchmark \
--count 5 \
--output test_issues.txt
```

### 2. Run Baseline (10-15 minutes)

```bash
python resolve_issues_baseline.py \
--repo /path/to/your/repo \
--issues test_issues.txt \
--output baseline_results \
--api-key $OPENAI_API_KEY
```

### 3. Run RAG Solution (8-12 minutes)

```bash
# Ensure RAG service is running on http://localhost:5000
python rag_solution.py \
--issues test_issues.txt \
--index your_repo_index \
--output rag_results
```

### 4. Compare Results (instant)

```bash
python code_benchmark.py \
--baseline baseline_results/baseline_summary_report.json \
--rag rag_results/rag_summary_report.json \
--output comparison.json

# View results
cat comparison.json | python -m json.tool
```
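
If you want a quick programmatic summary instead of scanning the raw JSON, a few lines of Python will do. This is only a sketch: the field names below are assumptions about the comparison report's layout, so adjust them to match your actual `comparison.json`.

```python
# Sketch: print headline numbers from comparison.json.
# Field names are assumptions -- check the real report structure first.
import json

with open("comparison.json") as f:
    report = json.load(f)

for solution in ("baseline", "rag"):
    stats = report.get(solution, {})
    print(f"{solution:>8}: success rate = {stats.get('success_rate', 'n/a')}")
```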

## What You'll See

**Issue Generation**:
```
📁 Scanning repository structure...
Found 324 Go files
🎯 Identified 15 components
✅ Generated 5 issues
```

**Baseline Execution**:
```
📝 Issue #1: Add error handling...
🤖 Calling LLM...
✓ Modified: workspace_validation.go
🧪 Tests passed

Success Rate: 40% (2/5)
```

**RAG Execution**:
```
📝 Issue #1: Add error handling...
📊 RAG returned 16 source nodes
✓ TOP1: 0.5205 | workspace_validation.go
✓ TOP2: 0.5193 | workspace_validation_test.go
✓ TOP3: 0.5192 | workspace_types.go
✓ TOP4: 0.5177 | workspace_controller.go
✗ 12 files filtered out
🧪 Tests passed

Success Rate: 60% (3/5)
```

## Next Steps

- 📚 Read [CODE_BENCHMARK_GUIDE.md](CODE_BENCHMARK_GUIDE.md) for detailed usage
- 🏗️ Read [CODE_BENCHMARK_ARCHITECTURE.md](CODE_BENCHMARK_ARCHITECTURE.md) for technical details
- 📊 Read [CODE_BENCHMARK_PRESENTATION.md](CODE_BENCHMARK_PRESENTATION.md) for overview slides

## Troubleshooting

**"RAG service connection refused"**:
```bash
curl http://localhost:5000/health
# Start RAG service if needed
```

**"No files modified"**:
- Check if RAG index is loaded
- Review relevance scores in logs
- Verify source_nodes in RAG response
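
To dig into the last point, you can dump the retrieval scores from a saved response. The snippet below is a minimal sketch: the file path and the exact field names inside `source_nodes` are assumptions, so adapt them to whatever `rag_solution.py` actually writes to your output directory.

```python
# Sketch: inspect retrieval scores in a saved RAG response.
# The path and field names are assumptions -- adapt to your rag_results output.
import json

with open("rag_results/issue_1_rag_response.json") as f:  # hypothetical file name
    response = json.load(f)

for i, node in enumerate(response.get("source_nodes", []), start=1):
    # Uniformly low scores usually point to a stale or empty index.
    print(f"TOP{i}: {node.get('score', 0.0):.4f} | {node.get('file', 'unknown')}")
```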

**"Tests failing"**:
- Check that copyright headers are preserved
- Verify that package declarations are intact
- Review the system prompt configuration

## Support

For issues or questions:
- 📧 Contact: [email protected]
- 📂 Repository: github.com/kaito-project/kaito
- 📚 Docs: See documentation files in this directory
52 changes: 52 additions & 0 deletions code_benchmark/README.md
@@ -0,0 +1,52 @@
# Code Benchmark Suite

This folder contains tools to benchmark RAG performance on **code modification** tasks.

> **Note**: This suite is specifically for testing RAG on code issue resolution (bug fixes, feature additions). For document-based RAG benchmarking, see `rag_benchmark_docs`.

## 📁 Files

**Core Scripts (4)**:
- **`generate_issues.py`** - Generate realistic test issues from code analysis
- **`resolve_issues_baseline.py`** - Baseline solution (direct LLM with manual context)
- **`rag_solution.py`** - RAG solution (automatic retrieval with TOP-4 filtering)
- **`code_benchmark.py`** - Compare baseline vs RAG results

**Documentation (5)**:
- **`GETTING_STARTED.md`** - Quick start guide (5 minutes)
- **`CODE_BENCHMARK_GUIDE.md`** - Complete usage guide
- **`CODE_BENCHMARK_ARCHITECTURE.md`** - System architecture & design decisions
- **`CODE_BENCHMARK_PRESENTATION.md`** - 32-slide presentation for stakeholders

## 🚀 Quick Start

Read `GETTING_STARTED.md` to run your first benchmark in 5 minutes.

## 📊 What This Tests

- **Code modification accuracy**: How well RAG fixes bugs vs baseline LLM
- **Test validation**: All changes validated through actual unit tests
- **Token efficiency**: Cost comparison (RAG with TOP-4 filtering saves 21.6%)
- **File selection**: RAG automatic retrieval vs manual context

## 🎯 Key Innovation

**TOP-4 Relevance Filtering**: RAG retrieves 100+ documents internally, but we filter to the top 4 most relevant files based on cosine similarity scores. This balances context quality with token efficiency.
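
The filtering step itself is simple. Here is a minimal Python sketch under the assumption that each retrieved node carries a cosine-similarity score and a file path; the node structure and the example values are illustrative, not the service's exact response shape:

```python
# Minimal sketch of TOP-4 relevance filtering (node structure is illustrative).
def top_k_files(source_nodes, k=4):
    """Keep the k highest-scoring files, dropping duplicate paths."""
    ranked = sorted(source_nodes, key=lambda n: n["score"], reverse=True)
    seen, selected = set(), []
    for node in ranked:
        if node["file"] not in seen:
            seen.add(node["file"])
            selected.append(node)
        if len(selected) == k:
            break
    return selected

nodes = [
    {"file": "workspace_validation.go", "score": 0.5205},
    {"file": "workspace_validation_test.go", "score": 0.5193},
    {"file": "workspace_types.go", "score": 0.5192},
    {"file": "workspace_controller.go", "score": 0.5177},
    {"file": "unrelated_helper.go", "score": 0.3100},  # illustrative low-score node
]
print([n["file"] for n in top_k_files(nodes)])  # -> the 4 most relevant files
```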

Results are saved to `baseline_outputs/` and `rag_outputs/` directories.

## 📈 Typical Results

```
Baseline LLM: 20% success rate (1/5 issues)
RAG Solution: 60% success rate (3/5 issues)
Winner: RAG (automatic retrieval with better context)
```

> **Note**: RAG shows 40-60% success rate with TOP-4 filtering, while Baseline achieves 0-40%. RAG's automatic context retrieval provides more comprehensive coverage than manual selection.

## 🔗 See Also

- **Architecture Details**: See `CODE_BENCHMARK_ARCHITECTURE.md` for flow diagrams
- **Complete Guide**: See `CODE_BENCHMARK_GUIDE.md` for detailed usage
- **Quick Tutorial**: See `GETTING_STARTED.md` for 5-minute walkthrough