A high-performance C++ functional simulator for the Stillwater KPU with Python bindings for easy testing and education.
The KPU simulator models a specialized hardware accelerator with the following components:
- Main Memory: Large capacity memory (default 1GB) with thread-safe access
- Scratchpad Memory: Fast, software-managed memory (default 1MB) optimized for compute operations
- DMA Engine: Asynchronous data movement between main memory and scratchpad
- Compute Engine: Matrix multiplication accelerator with single-precision floating-point support
- Modern C++17 Implementation: Uses RAII, smart pointers, and modern STL features
- Thread-Safe Design: All components support concurrent access with appropriate locking
- Zero-Copy Operations: Direct memory access where possible for maximum performance
- Comprehensive Error Handling: Robust error detection and reporting
- Python Integration: Easy-to-use Python API with NumPy integration
# Ubuntu/Debian
sudo apt install build-essential cmake libomp-dev python3-dev python3-pip
# macOS
brew install cmake libomp python3
xcode-select --install
# Python dependencies
pip3 install numpy matplotlib pybind11
- Generate build files:
python3 kpu_test_examples.py --build-files
- Build the C++ library:
chmod +x build_kpu.sh
./build_kpu.sh
- Install Python package (optional):
pip3 install -e .
# Create build directory
mkdir build && cd build
# Configure
cmake .. -DCMAKE_BUILD_TYPE=Release -DBUILD_SHARED_LIBS=ON
# Build
make -j$(nproc)
# The shared library will be in build/libkpu_simulator.so (Linux) or build/libkpu_simulator.dylib (macOS)
#include "kpu_simulator.hpp"
using namespace stillwater::kpu;
int main() {
// Create simulator
KPUSimulator kpu(1ULL << 30, 1ULL << 20); // 1GB main, 1MB scratchpad
// Generate test matrices
std::vector<float> A(100 * 200), B(200 * 150);
// ... fill matrices ...
// Write to scratchpad
kpu.scratchpad().write(0, A.data(), A.size() * sizeof(float));
kpu.scratchpad().write(A.size() * sizeof(float), B.data(), B.size() * sizeof(float));
// Perform matrix multiplication
MatrixDim dimA{100, 200}, dimB{200, 150};
kpu.compute_engine().matmul_f32(0, A.size() * sizeof(float),
(A.size() + B.size()) * sizeof(float),
dimA, dimB);
return 0;
}
import numpy as np
from kpu_simulator import KPUSimulator
# Create simulator
with KPUSimulator() as kpu:
# Generate test matrices
A = np.random.randn(100, 200).astype(np.float32)
B = np.random.randn(200, 150).astype(np.float32)
# Perform matrix multiplication
C = kpu.matmul(A, B)
# Verify against NumPy
C_numpy = A @ B
print(f"Results match: {np.allclose(C, C_numpy)}")
def neural_layer(kpu, inputs, weights, bias):
"""Compute neural network layer: output = inputs @ weights + bias"""
# Matrix multiplication
output = kpu.matmul(inputs, weights)
# Add bias using broadcasting via matrix multiplication
batch_size = inputs.shape[0]
ones = np.ones((batch_size, 1), dtype=np.float32)
bias_reshaped = bias.reshape(1, -1)
# Accumulate bias
return kpu.matmul(ones, bias_reshaped, addr_C=output_addr, accumulate=True)
def efficient_large_matmul(kpu, A_main_addr, B_main_addr, A_shape, B_shape):
"""Efficiently multiply large matrices using DMA"""
M, K = A_shape
K2, N = B_shape
# DMA transfer to scratchpad
A_size = M * K * 4 # float32
B_size = K * N * 4
kpu.dma_transfer(A_main_addr, 0, A_size,
src_main_memory=True, dst_main_memory=False)
kpu.dma_transfer(B_main_addr, A_size, B_size,
src_main_memory=True, dst_main_memory=False)
# Compute in scratchpad
A = kpu.read_scratchpad(0, np.float32, (M, K))
B = kpu.read_scratchpad(A_size, np.float32, (K, N))
C = kpu.matmul(A, B)
return C
# Run comprehensive test suite
python3 kpu_test_examples.py --test
# Run educational demonstrations
python3 kpu_test_examples.py --demo
# Run everything (tests + demos)
python3 kpu_test_examples.py --all
The test suite covers:
- Memory Operations: Read/write operations for various data types and sizes
- DMA Operations: Transfer between main memory and scratchpad
- Compute Operations: Matrix multiplication with various dimensions
- Error Handling: Boundary conditions and error cases
- Performance: Benchmarking against NumPy reference implementation
- Neural Network Layer: Demonstrates weight-input multiplication and bias addition
- Matrix Chain Multiplication: Shows optimization of operation ordering
- Memory Hierarchy Demo: Illustrates efficient use of DMA for data movement
- Performance Analysis: Benchmarks across different matrix sizes
- Blocked Matrix Multiplication: Demonstrates tiling for large matrices
Main simulator class that orchestrates all components.
Constructor:
KPUSimulator(Size main_memory_size = 1GB, Size scratchpad_size = 1MB)
Methods:
MainMemory& main_memory()
: Access to main memoryScratchpad& scratchpad()
: Access to scratchpad memoryDMAEngine& dma_engine()
: Access to DMA engineComputeEngine& compute_engine()
: Access to compute engine
Memory interfaces with thread-safe read/write operations.
Methods:
void read(Address addr, void* data, Size size)
: Read datavoid write(Address addr, const void* data, Size size)
: Write dataSize size() const
: Get memory size
Handles data movement between memory spaces.
Methods:
void transfer_sync(...)
: Synchronous transfervoid transfer_async(...)
: Asynchronous transfer with callback
Matrix multiplication accelerator.
Methods:
void matmul_f32(...)
: Matrix multiplicationvoid matmul_accumulate_f32(...)
: Matrix multiplication with accumulation
Python wrapper with NumPy integration.
Constructor:
KPUSimulator(main_memory_size=1<<30, scratchpad_size=1<<20)
Methods:
write_main_memory(addr, data)
: Write NumPy array to main memoryread_main_memory(addr, dtype, shape)
: Read NumPy array from main memorywrite_scratchpad(addr, data)
: Write NumPy array to scratchpadread_scratchpad(addr, dtype, shape)
: Read NumPy array from scratchpaddma_transfer(...)
: DMA data movementmatmul(A, B, ...)
: Matrix multiplication with automatic memory managementbenchmark_matmul(M, N, K, iterations)
: Performance benchmarking
The simulator implements a basic matrix multiplication algorithm with these characteristics:
- Algorithm: Standard three-loop GEMM (General Matrix Multiply)
- Parallelization: OpenMP parallel loops for matrices larger than 1024 elements
- Memory Access: Optimized for cache-friendly access patterns
- Precision: Single-precision floating-point (float32)
Performance varies by matrix size and system configuration. Typical results on modern hardware:
Matrix Size | KPU Time | NumPy Time | KPU GFLOPS | NumPy GFLOPS |
---|---|---|---|---|
64x64 | 0.12ms | 0.08ms | 4.37 | 6.55 |
128x128 | 0.89ms | 0.31ms | 4.69 | 13.48 |
256x256 | 6.78ms | 1.24ms | 4.97 | 27.19 |
Note: NumPy uses highly optimized BLAS libraries (Intel MKL/OpenBLAS) which typically outperform basic implementations
- Main Memory: Configurable, default 1GB
- Scratchpad: Configurable, default 1MB
- Overhead: Minimal C++ object overhead (~1KB)
- Python Wrapper: Additional NumPy array copies as needed
The simulator provides comprehensive error handling:
std::out_of_range
: Memory access violationsstd::invalid_argument
: Invalid parameters (matrix dimensions, etc.)std::runtime_error
: Resource allocation failures
KPUMemoryError
: Memory access errorsKPUDimensionError
: Matrix dimension mismatchesKPUError
: General simulator errors
typedef enum {
KPU_SUCCESS = 0,
KPU_ERROR_INVALID_HANDLE = -1,
KPU_ERROR_OUT_OF_BOUNDS = -2,
KPU_ERROR_INVALID_DIMENSIONS = -3,
KPU_ERROR_NULL_POINTER = -4,
KPU_ERROR_UNKNOWN = -5
} KPUError;
- Use RAII for automatic resource cleanup
- Prefer stack allocation for small objects
- Use smart pointers for dynamic allocation
- Implement proper move semantics
- Reader-writer locks for shared data structures
- Atomic operations for simple state variables
- Lock-free queues for producer-consumer patterns
- Memory alignment for SIMD operations
- Cache-friendly data structures
- Minimize memory allocations in hot paths
- Use compiler intrinsics where appropriate
Potential areas for expansion:
- Additional Data Types: Support for int8, int16, bfloat16
- Advanced Operations: Convolution, activation functions, normalization
- Memory Hierarchy: Multi-level cache simulation
- Pipeline Modeling: Instruction-level simulation
- Power Modeling: Energy consumption estimation
- Visualization: Real-time performance monitoring
- Distributed Computing: Multi-KPU simulation
To contribute to the KPU simulator:
- Follow modern C++ best practices (C++17/20)
- Maintain thread safety for all shared components
- Add comprehensive tests for new features
- Update documentation for API changes
- Benchmark performance impact of modifications
This project is released under the MIT License. See LICENSE file for details.
Stillwater Computing, Inc.
Advancing the state of the art in knowledge processing