Ollama Concurrency Testing

This repository contains tools for testing Ollama's performance under concurrent load conditions.

Overview

This project benchmarks Ollama's performance when handling multiple concurrent requests with 7B parameter models. It measures response times, throughput, and resource utilization under different concurrency levels.

Hardware Requirements for 100 Concurrent Requests (7B Models)

Running 100 concurrent requests to 7B parameter models requires significant hardware resources:

Resource   | Minimum Requirement | Recommended
GPU VRAM   | 24GB+               | 48GB+ (A100, H100, or multiple GPUs)
RAM        | 64GB                | 128GB+
CPU        | 16+ cores           | 32+ cores
Storage    | 100GB SSD           | 1TB NVMe SSD

Cost Analysis:

  • Cloud GPU instance (A100): $3-5 per hour (~$2,500-$3,600/month)
  • On-premise server with A100: $15,000-$20,000 initial investment
  • Multi-GPU consumer setup: $5,000-$10,000 initial investment

Running Concurrency Tests

Setup

  1. Install Ollama:

    curl -fsSL https://ollama.com/install.sh | sh
  2. Pull the model:

    ollama pull mistral:7b
    # or any other 7B model you want to test
  3. Set up Python environment:

    python -m venv venv
    source venv/bin/activate  # On Windows: venv\Scripts\activate
    pip install -r requirements.txt

Configure Environment Variables

  1. Open Terminal and set the concurrency level, e.g.,
    export OLLAMA_NUM_PARALLEL=4
  2. Optionally, set a small maximum queue size, e.g.,
    export OLLAMA_MAX_QUEUE=1
    to see when excess requests are rejected.
  3. If using the macOS app, set variables with
    launchctl setenv OLLAMA_NUM_PARALLEL 4
    and restart the app.

Start the Server

Run ollama serve in the same terminal session (so the exported variables apply) to start the server with your settings:

ollama serve
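
Before running the benchmark, it helps to confirm the server is reachable. The snippet below is a minimal check, assuming Ollama's default address (http://localhost:11434) and its standard /api/tags endpoint, which lists locally available models:

# check_server.py - quick reachability check for a local Ollama server.
# Assumes the default address http://localhost:11434 and the /api/tags endpoint.
import json
import urllib.request

OLLAMA_TAGS_URL = "http://localhost:11434/api/tags"

try:
    with urllib.request.urlopen(OLLAMA_TAGS_URL, timeout=5) as resp:
        models = json.load(resp).get("models", [])
except OSError as exc:
    raise SystemExit(f"Could not reach Ollama at {OLLAMA_TAGS_URL}: {exc}")

print("Ollama is up. Locally available models:")
for model in models:
    print(f"  - {model.get('name')}")

If this check fails, verify that ollama serve is still running and that nothing else is bound to port 11434.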

Running Tests

# Run basic test with 10 concurrent requests
python concurrency_test.py --model mistral:7b --concurrent 10

# Test with 100 concurrent requests (requires significant hardware)
python concurrency_test.py --model mistral:7b --concurrent 100 --max-tokens 100

Configuration Options

The testing tools support various configurations (a minimal client sketch follows the list below):

  • --concurrent: Number of concurrent requests (1-100+)
  • --model: Model to test (e.g., mistral:7b, llama2:7b)
  • --max-tokens: Maximum tokens to generate per request
  • --timeout: Request timeout in seconds
  • --prompt-file: File containing prompts to use
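
For illustration, the sketch below shows one way a client can issue concurrent requests. It is a simplified, hypothetical example rather than the repository's actual concurrency_test.py, and it assumes Ollama's standard /api/generate endpoint with the num_predict option as the per-request token limit:

# concurrency_sketch.py - hypothetical, simplified illustration of concurrent requests.
# Not the repository's concurrency_test.py; assumes the default Ollama endpoint
# http://localhost:11434/api/generate and the num_predict option for the token limit.
import json
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor

OLLAMA_GENERATE_URL = "http://localhost:11434/api/generate"
MODEL = "mistral:7b"
CONCURRENT = 10     # corresponds to --concurrent
MAX_TOKENS = 100    # corresponds to --max-tokens
TIMEOUT = 120       # corresponds to --timeout (seconds)
PROMPT = "Explain the difference between a process and a thread in one paragraph."

def one_request(_index: int) -> float:
    """Send one non-streaming generate request and return its latency in seconds."""
    payload = json.dumps({
        "model": MODEL,
        "prompt": PROMPT,
        "stream": False,
        "options": {"num_predict": MAX_TOKENS},
    }).encode()
    request = urllib.request.Request(
        OLLAMA_GENERATE_URL, data=payload,
        headers={"Content-Type": "application/json"},
    )
    start = time.perf_counter()
    with urllib.request.urlopen(request, timeout=TIMEOUT) as resp:
        json.load(resp)  # drain and discard the generated text
    return time.perf_counter() - start

if __name__ == "__main__":
    wall_start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=CONCURRENT) as pool:
        latencies = list(pool.map(one_request, range(CONCURRENT)))
    wall = time.perf_counter() - wall_start
    print(f"avg response time: {sum(latencies) / len(latencies):.2f}s")
    print(f"throughput: {len(latencies) / wall:.2f} req/s")

Here max_workers caps the number of in-flight requests on the client side; on the server side, requests beyond OLLAMA_NUM_PARALLEL are queued until OLLAMA_MAX_QUEUE is exceeded, at which point they are rejected.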

Performance Comparison

Concurrency | Avg. Response Time | Throughput (req/s) | GPU Memory | Hardware Setup
1           | 1-2s               | ~0.5-1             | ~7GB       | RTX 3090
10          | 4-8s               | ~1.5-3             | ~12GB      | RTX 3090
50          | 15-25s             | ~2-4               | ~20GB      | A100 40GB
100         | 30-50s             | ~2-5               | ~40GB      | A100 80GB / multiple GPUs

Note: Actual performance varies based on prompt length, token generation settings, and specific hardware configurations.
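
As a rough sanity check, steady-state throughput should sit near concurrency divided by average response time (Little's law). Using approximate midpoints of the response-time ranges above:

# Sanity check: throughput ≈ concurrency / average response time (Little's law).
# Response times are approximate midpoints of the ranges in the table above.
for concurrency, avg_response_s in [(1, 1.5), (10, 6), (50, 20), (100, 40)]:
    print(f"concurrency {concurrency:>3}: ~{concurrency / avg_response_s:.1f} req/s")

This yields roughly 0.7, 1.7, 2.5, and 2.5 req/s, consistent with the measured throughput ranges.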

Feel free to adapt the code to your own use case.
