This repository contains tools for testing Ollama's performance under concurrent load conditions.
This project benchmarks Ollama's performance when handling multiple concurrent requests to 7B-parameter models, measuring response times, throughput, and resource utilization at different concurrency levels.
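Under the hood, a concurrency benchmark of this kind boils down to firing N generate requests at the Ollama HTTP API at the same time and timing each one. The sketch below illustrates the idea; it is not the repository's `concurrency_test.py`, and it assumes Ollama's default local endpoint (`http://localhost:11434/api/generate`), a pulled `mistral:7b` model, and the `aiohttp` package.

```python
import asyncio
import time

import aiohttp

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint


async def one_request(session: aiohttp.ClientSession, prompt: str) -> float:
    """Send one non-streaming generate request and return its latency in seconds."""
    payload = {"model": "mistral:7b", "prompt": prompt, "stream": False}
    start = time.perf_counter()
    async with session.post(OLLAMA_URL, json=payload) as resp:
        await resp.json()  # blocks until the full completion has been generated
    return time.perf_counter() - start


async def run(concurrent: int = 10) -> None:
    timeout = aiohttp.ClientTimeout(total=600)  # generations can be slow under load
    async with aiohttp.ClientSession(timeout=timeout) as session:
        t0 = time.perf_counter()
        latencies = await asyncio.gather(
            *(one_request(session, "Explain what a mutex is.") for _ in range(concurrent))
        )
        wall = time.perf_counter() - t0
    print(f"avg response time: {sum(latencies) / len(latencies):.2f}s")
    print(f"throughput:        {concurrent / wall:.2f} req/s")


if __name__ == "__main__":
    asyncio.run(run())
```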
Running 100 concurrent requests to 7B parameter models requires significant hardware resources:
| Resource | Minimum Requirement | Recommended |
|---|---|---|
| GPU VRAM | 24GB+ | 48GB+ (A100, H100, or multiple GPUs) |
| RAM | 64GB | 128GB+ |
| CPU | 16+ cores | 32+ cores |
| Storage | 100GB SSD | 1TB NVMe SSD |
Cost Analysis:
- Cloud GPU instance (A100): $3-5 per hour (~$2,500-$3,600/month)
- On-premise server with A100: $15,000-$20,000 initial investment
- Multi-GPU consumer configuration: $5,000-$10,000 initial investment
- Install Ollama:

  curl -fsSL https://ollama.com/install.sh | sh

- Pull the model:

  ollama pull mistral:7b   # or any other 7B model you want to test

- Set up the Python environment:

  python -m venv venv
  source venv/bin/activate   # On Windows: venv\Scripts\activate
  pip install -r requirements.txt
- Open Terminal and set the concurrency level, e.g.:

  export OLLAMA_NUM_PARALLEL=4

- Optionally, set the queue size to see when requests start being rejected, e.g.:

  export OLLAMA_MAX_QUEUE=1

- If using the macOS app, set the variables with launchctl and restart the app:

  launchctl setenv OLLAMA_NUM_PARALLEL 4

Run ollama serve in Terminal to start the server with your settings:

  ollama serve
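Before launching a large run, it can save time to confirm that the server is reachable and that the target model has actually been pulled. Below is a minimal check (an illustrative sketch, not part of the repository's test scripts) using Ollama's /api/tags endpoint and the `requests` package:

```python
import requests

# List the models available on the local Ollama server.
resp = requests.get("http://localhost:11434/api/tags", timeout=5)
resp.raise_for_status()
models = [m["name"] for m in resp.json().get("models", [])]

if "mistral:7b" in models:
    print("Server is up and mistral:7b is available")
else:
    print("mistral:7b not found locally - run `ollama pull mistral:7b` first")
```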
# Run basic test with 10 concurrent requests
python concurrency_test.py --model mistral:7b --concurrent 10
# Test with 100 concurrent requests (requires significant hardware)
python concurrency_test.py --model mistral:7b --concurrent 100 --max-tokens 100
The testing tools support various configurations:

- `--concurrent`: Number of concurrent requests (1-100+)
- `--model`: Model to test (e.g., mistral:7b, llama2:7b)
- `--max-tokens`: Maximum tokens to generate per request
- `--timeout`: Request timeout in seconds
- `--prompt-file`: File containing prompts to use
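For example, a longer run combining these options might look like the following (prompts.txt stands in for whatever prompt file you provide; the specific values are only illustrative):

python concurrency_test.py --model llama2:7b --concurrent 50 --max-tokens 128 --timeout 120 --prompt-file prompts.txt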
| Concurrency | Avg. Response Time | Throughput (req/s) | GPU Memory | Hardware Setup |
|---|---|---|---|---|
| 1 | 1-2s | ~0.5-1 | ~7GB | RTX 3090 |
| 10 | 4-8s | ~1.5-3 | ~12GB | RTX 3090 |
| 50 | 15-25s | ~2-4 | ~20GB | A100 40GB |
| 100 | 30-50s | ~2-5 | ~40GB | A100 80GB / multiple GPUs |
Note: Actual performance varies based on prompt length, token generation settings, and specific hardware configurations.
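As a rough sanity check, steady-state throughput is approximately the concurrency level divided by the average response time: for example, 100 concurrent requests at a ~40s average response time works out to about 100 / 40 ≈ 2.5 req/s, which is consistent with the range in the table above.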
Feel free to adapt the code to your own use case.