This repository contains tools for testing Ollama's performance under concurrent load conditions.
This project benchmarks Ollama's performance when handling multiple concurrent requests to 7B-parameter models, measuring response times, throughput, and resource utilization at different concurrency levels.
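Under the hood, a concurrency benchmark of this kind boils down to firing N generate requests at the Ollama HTTP API at the same time and timing each one. The sketch below illustrates the idea; it is not the repository's `concurrency_test.py`, and it assumes Ollama's default local endpoint (`http://localhost:11434/api/generate`), a pulled `mistral:7b` model, and the `aiohttp` package.

```python
import asyncio
import time

import aiohttp

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint


async def one_request(session: aiohttp.ClientSession, prompt: str) -> float:
    """Send one non-streaming generate request and return its latency in seconds."""
    payload = {"model": "mistral:7b", "prompt": prompt, "stream": False}
    start = time.perf_counter()
    async with session.post(OLLAMA_URL, json=payload) as resp:
        await resp.json()  # blocks until the full completion has been generated
    return time.perf_counter() - start


async def run(concurrent: int = 10) -> None:
    timeout = aiohttp.ClientTimeout(total=600)  # generations can be slow under load
    async with aiohttp.ClientSession(timeout=timeout) as session:
        t0 = time.perf_counter()
        latencies = await asyncio.gather(
            *(one_request(session, "Explain what a mutex is.") for _ in range(concurrent))
        )
        wall = time.perf_counter() - t0
    print(f"avg response time: {sum(latencies) / len(latencies):.2f}s")
    print(f"throughput:        {concurrent / wall:.2f} req/s")


if __name__ == "__main__":
    asyncio.run(run())
```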
Running 100 concurrent requests to 7B parameter models requires significant hardware resources:
| Resource | Minimum Requirement | Recommended |
|---|---|---|
| GPU VRAM | 24GB+ | 48GB+ (A100, H100, or multiple GPUs) |
| RAM | 64GB | 128GB+ |
| CPU | 16+ cores | 32+ cores |
| Storage | 100GB SSD | 1TB NVMe SSD |
Cost Analysis:
- Cloud GPU instance (A100): $3-5 per hour (~$2,500-$3,600/month)
- On-premise server with A100: $15,000-$20,000 initial investment
- Multi-GPU consumer configuration: $5,000-$10,000 initial investment
- Install Ollama:

  curl -fsSL https://ollama.com/install.sh | sh

- Pull the model:

  ollama pull mistral:7b   # or any other 7B model you want to test

- Set up the Python environment:

  python -m venv venv
  source venv/bin/activate   # On Windows: venv\Scripts\activate
  pip install -r requirements.txt
- Open Terminal and set the concurrency level, e.g.:

  export OLLAMA_NUM_PARALLEL=4

- Optionally, set the queue size to see when requests start being rejected, e.g.:

  export OLLAMA_MAX_QUEUE=1

- If using the macOS app, set the variables with launchctl and restart the app:

  launchctl setenv OLLAMA_NUM_PARALLEL 4

Run ollama serve in Terminal to start the server with your settings:

  ollama serve
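Before launching a large run, it can save time to confirm that the server is reachable and that the target model has actually been pulled. Below is a minimal check (an illustrative sketch, not part of the repository's test scripts) using Ollama's /api/tags endpoint and the `requests` package:

```python
import requests

# List the models available on the local Ollama server.
resp = requests.get("http://localhost:11434/api/tags", timeout=5)
resp.raise_for_status()
models = [m["name"] for m in resp.json().get("models", [])]

if "mistral:7b" in models:
    print("Server is up and mistral:7b is available")
else:
    print("mistral:7b not found locally - run `ollama pull mistral:7b` first")
```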
# Run basic test with 10 concurrent requests
python concurrency_test.py --model mistral:7b --concurrent 10
# Test with 100 concurrent requests (requires significant hardware)
python concurrency_test.py --model mistral:7b --concurrent 100 --max-tokens 100
The testing tools support various configurations:

- `--concurrent`: Number of concurrent requests (1-100+)
- `--model`: Model to test (e.g., mistral:7b, llama2:7b)
- `--max-tokens`: Maximum tokens to generate per request
- `--timeout`: Request timeout in seconds
- `--prompt-file`: File containing prompts to use
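For example, a longer run combining these options might look like the following (prompts.txt stands in for whatever prompt file you provide; the specific values are only illustrative):

python concurrency_test.py --model llama2:7b --concurrent 50 --max-tokens 128 --timeout 120 --prompt-file prompts.txt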
| Concurrency | Avg. Response Time | Throughput (req/s) | GPU Memory | Hardware Setup |
|---|---|---|---|---|
| 1 | 1-2s | ~0.5-1 | ~7GB | RTX 3090 |
| 10 | 4-8s | ~1.5-3 | ~12GB | RTX 3090 |
| 50 | 15-25s | ~2-4 | ~20GB | A100 40GB |
| 100 | 30-50s | ~2-5 | ~40GB | A100 80GB / multiple GPUs |
Note: Actual performance varies based on prompt length, token generation settings, and specific hardware configurations.
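As a rough sanity check, steady-state throughput is approximately the concurrency level divided by the average response time: for example, 100 concurrent requests at a ~40s average response time works out to about 100 / 40 ≈ 2.5 req/s, which is consistent with the range in the table above.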
Feel free to adapt the code to your own use case.