Stillwater Knowledge Processing Unit (KPU) Simulator

A high-performance C++ functional simulator for the Stillwater KPU with Python bindings for easy testing and education.

Architecture Overview

The KPU simulator models a specialized hardware accelerator with the following components:

Core Components

Main Memory: Large capacity memory (default 1GB) with thread-safe access
Scratchpad Memory: Fast, software-managed memory (default 1MB) optimized for compute operations
DMA Engine: Asynchronous data movement between main memory and scratchpad
Compute Engine: Matrix multiplication accelerator with single-precision floating-point support

Key Features

Modern C++17 Implementation: Uses RAII, smart pointers, and modern STL features
Thread-Safe Design: All components support concurrent access with appropriate locking
Zero-Copy Operations: Direct memory access where possible for maximum performance
Comprehensive Error Handling: Robust error detection and reporting
Python Integration: Easy-to-use Python API with NumPy integration

Building the Simulator

Prerequisites

# Ubuntu/Debian
sudo apt install build-essential cmake libomp-dev python3-dev python3-pip

# macOS
brew install cmake libomp python3
xcode-select --install

# Python dependencies
pip3 install numpy matplotlib pybind11

Build Steps

Generate build files:

python3 kpu_test_examples.py --build-files

Build the C++ library:

chmod +x build_kpu.sh
./build_kpu.sh

Install Python package (optional):

pip3 install -e .

Manual Build (Alternative)

# Create build directory
mkdir build && cd build

# Configure
cmake .. -DCMAKE_BUILD_TYPE=Release -DBUILD_SHARED_LIBS=ON

# Build
make -j$(nproc)

# The shared library will be in build/libkpu_simulator.so (Linux) or build/libkpu_simulator.dylib (macOS)

Usage Examples

C++ API

#include "kpu_simulator.hpp"
using namespace stillwater::kpu;

int main() {
    // Create simulator
    KPUSimulator kpu(1ULL << 30, 1ULL << 20);  // 1GB main, 1MB scratchpad
    
    // Generate test matrices
    std::vector<float> A(100 * 200), B(200 * 150);
    // ... fill matrices ...
    
    // Write to scratchpad
    kpu.scratchpad().write(0, A.data(), A.size() * sizeof(float));
    kpu.scratchpad().write(A.size() * sizeof(float), B.data(), B.size() * sizeof(float));
    
    // Perform matrix multiplication
    MatrixDim dimA{100, 200}, dimB{200, 150};
    kpu.compute_engine().matmul_f32(0, A.size() * sizeof(float), 
                                    (A.size() + B.size()) * sizeof(float), 
                                    dimA, dimB);
    
    return 0;
}

Python API

import numpy as np
from kpu_simulator import KPUSimulator

# Create simulator
with KPUSimulator() as kpu:
    # Generate test matrices
    A = np.random.randn(100, 200).astype(np.float32)
    B = np.random.randn(200, 150).astype(np.float32)
    
    # Perform matrix multiplication
    C = kpu.matmul(A, B)
    
    # Verify against NumPy
    C_numpy = A @ B
    print(f"Results match: {np.allclose(C, C_numpy)}")

Advanced Examples

Neural Network Layer

def neural_layer(kpu, inputs, weights, bias):
    """Compute neural network layer: output = inputs @ weights + bias"""
    # Matrix multiplication
    output = kpu.matmul(inputs, weights)
    
    # Add bias using broadcasting via matrix multiplication
    batch_size = inputs.shape[0]
    ones = np.ones((batch_size, 1), dtype=np.float32)
    bias_reshaped = bias.reshape(1, -1)
    
    # Accumulate bias
    return kpu.matmul(ones, bias_reshaped, addr_C=output_addr, accumulate=True)

Memory Hierarchy Management

def efficient_large_matmul(kpu, A_main_addr, B_main_addr, A_shape, B_shape):
    """Efficiently multiply large matrices using DMA"""
    M, K = A_shape
    K2, N = B_shape
    
    # DMA transfer to scratchpad
    A_size = M * K * 4  # float32
    B_size = K * N * 4
    
    kpu.dma_transfer(A_main_addr, 0, A_size, 
                     src_main_memory=True, dst_main_memory=False)
    kpu.dma_transfer(B_main_addr, A_size, B_size, 
                     src_main_memory=True, dst_main_memory=False)
    
    # Compute in scratchpad
    A = kpu.read_scratchpad(0, np.float32, (M, K))
    B = kpu.read_scratchpad(A_size, np.float32, (K, N))
    C = kpu.matmul(A, B)
    
    return C

Testing and Validation

Running Tests

# Run comprehensive test suite
python3 kpu_test_examples.py --test

# Run educational demonstrations
python3 kpu_test_examples.py --demo

# Run everything (tests + demos)
python3 kpu_test_examples.py --all

Test Coverage

The test suite covers:

Memory Operations: Read/write operations for various data types and sizes
DMA Operations: Transfer between main memory and scratchpad
Compute Operations: Matrix multiplication with various dimensions
Error Handling: Boundary conditions and error cases
Performance: Benchmarking against NumPy reference implementation

Educational Examples

Neural Network Layer: Demonstrates weight-input multiplication and bias addition
Matrix Chain Multiplication: Shows optimization of operation ordering
Memory Hierarchy Demo: Illustrates efficient use of DMA for data movement
Performance Analysis: Benchmarks across different matrix sizes
Blocked Matrix Multiplication: Demonstrates tiling for large matrices

API Reference

C++ Classes

`KPUSimulator`

Main simulator class that orchestrates all components.

Constructor:

KPUSimulator(Size main_memory_size = 1GB, Size scratchpad_size = 1MB)

Methods:

MainMemory& main_memory(): Access to main memory
Scratchpad& scratchpad(): Access to scratchpad memory
DMAEngine& dma_engine(): Access to DMA engine
ComputeEngine& compute_engine(): Access to compute engine

`MainMemory` / `Scratchpad`

Memory interfaces with thread-safe read/write operations.

Methods:

void read(Address addr, void* data, Size size): Read data
void write(Address addr, const void* data, Size size): Write data
Size size() const: Get memory size

`DMAEngine`

Handles data movement between memory spaces.

Methods:

void transfer_sync(...): Synchronous transfer
void transfer_async(...): Asynchronous transfer with callback

`ComputeEngine`

Matrix multiplication accelerator.

Methods:

void matmul_f32(...): Matrix multiplication
void matmul_accumulate_f32(...): Matrix multiplication with accumulation

Python Classes

`KPUSimulator`

Python wrapper with NumPy integration.

Constructor:

KPUSimulator(main_memory_size=1<<30, scratchpad_size=1<<20)

Methods:

write_main_memory(addr, data): Write NumPy array to main memory
read_main_memory(addr, dtype, shape): Read NumPy array from main memory
write_scratchpad(addr, data): Write NumPy array to scratchpad
read_scratchpad(addr, dtype, shape): Read NumPy array from scratchpad
dma_transfer(...): DMA data movement
matmul(A, B, ...): Matrix multiplication with automatic memory management
benchmark_matmul(M, N, K, iterations): Performance benchmarking

Performance Characteristics

Theoretical Performance

The simulator implements a basic matrix multiplication algorithm with these characteristics:

Algorithm: Standard three-loop GEMM (General Matrix Multiply)
Parallelization: OpenMP parallel loops for matrices larger than 1024 elements
Memory Access: Optimized for cache-friendly access patterns
Precision: Single-precision floating-point (float32)

Benchmarking Results

Performance varies by matrix size and system configuration. Typical results on modern hardware:

Matrix Size	KPU Time	NumPy Time	KPU GFLOPS	NumPy GFLOPS
64x64	0.12ms	0.08ms	4.37	6.55
128x128	0.89ms	0.31ms	4.69	13.48
256x256	6.78ms	1.24ms	4.97	27.19

Note: NumPy uses highly optimized BLAS libraries (Intel MKL/OpenBLAS) which typically outperform basic implementations

Memory Usage

Main Memory: Configurable, default 1GB
Scratchpad: Configurable, default 1MB
Overhead: Minimal C++ object overhead (~1KB)
Python Wrapper: Additional NumPy array copies as needed

Error Handling

The simulator provides comprehensive error handling:

C++ Exceptions

std::out_of_range: Memory access violations
std::invalid_argument: Invalid parameters (matrix dimensions, etc.)
std::runtime_error: Resource allocation failures

Python Exceptions

KPUMemoryError: Memory access errors
KPUDimensionError: Matrix dimension mismatches
KPUError: General simulator errors

Error Codes (C API)

typedef enum {
    KPU_SUCCESS = 0,
    KPU_ERROR_INVALID_HANDLE = -1,
    KPU_ERROR_OUT_OF_BOUNDS = -2,
    KPU_ERROR_INVALID_DIMENSIONS = -3,
    KPU_ERROR_NULL_POINTER = -4,
    KPU_ERROR_UNKNOWN = -5
} KPUError;

Design Patterns and Best Practices

Memory Management

Use RAII for automatic resource cleanup
Prefer stack allocation for small objects
Use smart pointers for dynamic allocation
Implement proper move semantics

Thread Safety

Reader-writer locks for shared data structures
Atomic operations for simple state variables
Lock-free queues for producer-consumer patterns

Performance Optimization

Memory alignment for SIMD operations
Cache-friendly data structures
Minimize memory allocations in hot paths
Use compiler intrinsics where appropriate

Future Enhancements

Potential areas for expansion:

Additional Data Types: Support for int8, int16, bfloat16
Advanced Operations: Convolution, activation functions, normalization
Memory Hierarchy: Multi-level cache simulation
Pipeline Modeling: Instruction-level simulation
Power Modeling: Energy consumption estimation
Visualization: Real-time performance monitoring
Distributed Computing: Multi-KPU simulation

Contributing

To contribute to the KPU simulator:

Follow modern C++ best practices (C++17/20)
Maintain thread safety for all shared components
Add comprehensive tests for new features
Update documentation for API changes
Benchmark performance impact of modifications

License

This project is released under the MIT License. See LICENSE file for details.

Stillwater Computing, Inc.
Advancing the state of the art in knowledge processing

Name		Name	Last commit message	Last commit date
Latest commit History 51 Commits
.github/workflows		.github/workflows
benchmarks		benchmarks
cmake		cmake
components		components
docs		docs
examples		examples
include/sw		include/sw
packaging		packaging
scripts		scripts
src		src
tests		tests
tools		tools
.gitignore		.gitignore
CMakeLists.txt		CMakeLists.txt
CMakePresets.json		CMakePresets.json
LICENSE		LICENSE
README.md		README.md
README_dev.md		README_dev.md
cmake_build_with_venv.bat		cmake_build_with_venv.bat
cmake_gui_setup.bat		cmake_gui_setup.bat
conda_setup.bat		conda_setup.bat
dev_setup_unix.sh		dev_setup_unix.sh
dev_setup_windows.bat		dev_setup_windows.bat
how-to-run-python-application.md		how-to-run-python-application.md
how_to_setup_pybind_venv.txt		how_to_setup_pybind_venv.txt
loading_python_bindings.md		loading_python_bindings.md
pyproject.toml		pyproject.toml
python_path.ps		python_path.ps
python_path.sh		python_path.sh
requirements.txt		requirements.txt

License

stillwater-sc/KPU-Simulator

Folders and files

Latest commit

History

Repository files navigation

Stillwater Knowledge Processing Unit (KPU) Simulator

Architecture Overview

Core Components

Key Features

Building the Simulator

Prerequisites

Build Steps

Manual Build (Alternative)

Usage Examples

C++ API

Python API

Advanced Examples

Neural Network Layer

Memory Hierarchy Management

Testing and Validation

Running Tests

Test Coverage

Educational Examples

API Reference

C++ Classes

KPUSimulator

MainMemory / Scratchpad

DMAEngine

ComputeEngine

Python Classes

KPUSimulator

Performance Characteristics

Theoretical Performance

Benchmarking Results

Memory Usage

Error Handling

C++ Exceptions

Python Exceptions

Error Codes (C API)

Design Patterns and Best Practices

Memory Management

Thread Safety

Performance Optimization

Future Enhancements

Contributing

License

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

`KPUSimulator`

`MainMemory` / `Scratchpad`

`DMAEngine`

`ComputeEngine`

`KPUSimulator`

Packages