Skip to content

stillwater-sc/KPU-Simulator

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

51 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Stillwater Knowledge Processing Unit (KPU) Simulator

A high-performance C++ functional simulator for the Stillwater KPU with Python bindings for easy testing and education.

Architecture Overview

The KPU simulator models a specialized hardware accelerator with the following components:

Core Components

  1. Main Memory: Large capacity memory (default 1GB) with thread-safe access
  2. Scratchpad Memory: Fast, software-managed memory (default 1MB) optimized for compute operations
  3. DMA Engine: Asynchronous data movement between main memory and scratchpad
  4. Compute Engine: Matrix multiplication accelerator with single-precision floating-point support

Key Features

  • Modern C++17 Implementation: Uses RAII, smart pointers, and modern STL features
  • Thread-Safe Design: All components support concurrent access with appropriate locking
  • Zero-Copy Operations: Direct memory access where possible for maximum performance
  • Comprehensive Error Handling: Robust error detection and reporting
  • Python Integration: Easy-to-use Python API with NumPy integration

Building the Simulator

Prerequisites

# Ubuntu/Debian
sudo apt install build-essential cmake libomp-dev python3-dev python3-pip

# macOS
brew install cmake libomp python3
xcode-select --install

# Python dependencies
pip3 install numpy matplotlib pybind11

Build Steps

  1. Generate build files:
python3 kpu_test_examples.py --build-files
  1. Build the C++ library:
chmod +x build_kpu.sh
./build_kpu.sh
  1. Install Python package (optional):
pip3 install -e .

Manual Build (Alternative)

# Create build directory
mkdir build && cd build

# Configure
cmake .. -DCMAKE_BUILD_TYPE=Release -DBUILD_SHARED_LIBS=ON

# Build
make -j$(nproc)

# The shared library will be in build/libkpu_simulator.so (Linux) or build/libkpu_simulator.dylib (macOS)

Usage Examples

C++ API

#include "kpu_simulator.hpp"
using namespace stillwater::kpu;

int main() {
    // Create simulator
    KPUSimulator kpu(1ULL << 30, 1ULL << 20);  // 1GB main, 1MB scratchpad
    
    // Generate test matrices
    std::vector<float> A(100 * 200), B(200 * 150);
    // ... fill matrices ...
    
    // Write to scratchpad
    kpu.scratchpad().write(0, A.data(), A.size() * sizeof(float));
    kpu.scratchpad().write(A.size() * sizeof(float), B.data(), B.size() * sizeof(float));
    
    // Perform matrix multiplication
    MatrixDim dimA{100, 200}, dimB{200, 150};
    kpu.compute_engine().matmul_f32(0, A.size() * sizeof(float), 
                                    (A.size() + B.size()) * sizeof(float), 
                                    dimA, dimB);
    
    return 0;
}

Python API

import numpy as np
from kpu_simulator import KPUSimulator

# Create simulator
with KPUSimulator() as kpu:
    # Generate test matrices
    A = np.random.randn(100, 200).astype(np.float32)
    B = np.random.randn(200, 150).astype(np.float32)
    
    # Perform matrix multiplication
    C = kpu.matmul(A, B)
    
    # Verify against NumPy
    C_numpy = A @ B
    print(f"Results match: {np.allclose(C, C_numpy)}")

Advanced Examples

Neural Network Layer

def neural_layer(kpu, inputs, weights, bias):
    """Compute neural network layer: output = inputs @ weights + bias"""
    # Matrix multiplication
    output = kpu.matmul(inputs, weights)
    
    # Add bias using broadcasting via matrix multiplication
    batch_size = inputs.shape[0]
    ones = np.ones((batch_size, 1), dtype=np.float32)
    bias_reshaped = bias.reshape(1, -1)
    
    # Accumulate bias
    return kpu.matmul(ones, bias_reshaped, addr_C=output_addr, accumulate=True)

Memory Hierarchy Management

def efficient_large_matmul(kpu, A_main_addr, B_main_addr, A_shape, B_shape):
    """Efficiently multiply large matrices using DMA"""
    M, K = A_shape
    K2, N = B_shape
    
    # DMA transfer to scratchpad
    A_size = M * K * 4  # float32
    B_size = K * N * 4
    
    kpu.dma_transfer(A_main_addr, 0, A_size, 
                     src_main_memory=True, dst_main_memory=False)
    kpu.dma_transfer(B_main_addr, A_size, B_size, 
                     src_main_memory=True, dst_main_memory=False)
    
    # Compute in scratchpad
    A = kpu.read_scratchpad(0, np.float32, (M, K))
    B = kpu.read_scratchpad(A_size, np.float32, (K, N))
    C = kpu.matmul(A, B)
    
    return C

Testing and Validation

Running Tests

# Run comprehensive test suite
python3 kpu_test_examples.py --test

# Run educational demonstrations
python3 kpu_test_examples.py --demo

# Run everything (tests + demos)
python3 kpu_test_examples.py --all

Test Coverage

The test suite covers:

  • Memory Operations: Read/write operations for various data types and sizes
  • DMA Operations: Transfer between main memory and scratchpad
  • Compute Operations: Matrix multiplication with various dimensions
  • Error Handling: Boundary conditions and error cases
  • Performance: Benchmarking against NumPy reference implementation

Educational Examples

  1. Neural Network Layer: Demonstrates weight-input multiplication and bias addition
  2. Matrix Chain Multiplication: Shows optimization of operation ordering
  3. Memory Hierarchy Demo: Illustrates efficient use of DMA for data movement
  4. Performance Analysis: Benchmarks across different matrix sizes
  5. Blocked Matrix Multiplication: Demonstrates tiling for large matrices

API Reference

C++ Classes

KPUSimulator

Main simulator class that orchestrates all components.

Constructor:

KPUSimulator(Size main_memory_size = 1GB, Size scratchpad_size = 1MB)

Methods:

  • MainMemory& main_memory(): Access to main memory
  • Scratchpad& scratchpad(): Access to scratchpad memory
  • DMAEngine& dma_engine(): Access to DMA engine
  • ComputeEngine& compute_engine(): Access to compute engine

MainMemory / Scratchpad

Memory interfaces with thread-safe read/write operations.

Methods:

  • void read(Address addr, void* data, Size size): Read data
  • void write(Address addr, const void* data, Size size): Write data
  • Size size() const: Get memory size

DMAEngine

Handles data movement between memory spaces.

Methods:

  • void transfer_sync(...): Synchronous transfer
  • void transfer_async(...): Asynchronous transfer with callback

ComputeEngine

Matrix multiplication accelerator.

Methods:

  • void matmul_f32(...): Matrix multiplication
  • void matmul_accumulate_f32(...): Matrix multiplication with accumulation

Python Classes

KPUSimulator

Python wrapper with NumPy integration.

Constructor:

KPUSimulator(main_memory_size=1<<30, scratchpad_size=1<<20)

Methods:

  • write_main_memory(addr, data): Write NumPy array to main memory
  • read_main_memory(addr, dtype, shape): Read NumPy array from main memory
  • write_scratchpad(addr, data): Write NumPy array to scratchpad
  • read_scratchpad(addr, dtype, shape): Read NumPy array from scratchpad
  • dma_transfer(...): DMA data movement
  • matmul(A, B, ...): Matrix multiplication with automatic memory management
  • benchmark_matmul(M, N, K, iterations): Performance benchmarking

Performance Characteristics

Theoretical Performance

The simulator implements a basic matrix multiplication algorithm with these characteristics:

  • Algorithm: Standard three-loop GEMM (General Matrix Multiply)
  • Parallelization: OpenMP parallel loops for matrices larger than 1024 elements
  • Memory Access: Optimized for cache-friendly access patterns
  • Precision: Single-precision floating-point (float32)

Benchmarking Results

Performance varies by matrix size and system configuration. Typical results on modern hardware:

Matrix Size KPU Time NumPy Time KPU GFLOPS NumPy GFLOPS
64x64 0.12ms 0.08ms 4.37 6.55
128x128 0.89ms 0.31ms 4.69 13.48
256x256 6.78ms 1.24ms 4.97 27.19

Note: NumPy uses highly optimized BLAS libraries (Intel MKL/OpenBLAS) which typically outperform basic implementations

Memory Usage

  • Main Memory: Configurable, default 1GB
  • Scratchpad: Configurable, default 1MB
  • Overhead: Minimal C++ object overhead (~1KB)
  • Python Wrapper: Additional NumPy array copies as needed

Error Handling

The simulator provides comprehensive error handling:

C++ Exceptions

  • std::out_of_range: Memory access violations
  • std::invalid_argument: Invalid parameters (matrix dimensions, etc.)
  • std::runtime_error: Resource allocation failures

Python Exceptions

  • KPUMemoryError: Memory access errors
  • KPUDimensionError: Matrix dimension mismatches
  • KPUError: General simulator errors

Error Codes (C API)

typedef enum {
    KPU_SUCCESS = 0,
    KPU_ERROR_INVALID_HANDLE = -1,
    KPU_ERROR_OUT_OF_BOUNDS = -2,
    KPU_ERROR_INVALID_DIMENSIONS = -3,
    KPU_ERROR_NULL_POINTER = -4,
    KPU_ERROR_UNKNOWN = -5
} KPUError;

Design Patterns and Best Practices

Memory Management

  • Use RAII for automatic resource cleanup
  • Prefer stack allocation for small objects
  • Use smart pointers for dynamic allocation
  • Implement proper move semantics

Thread Safety

  • Reader-writer locks for shared data structures
  • Atomic operations for simple state variables
  • Lock-free queues for producer-consumer patterns

Performance Optimization

  • Memory alignment for SIMD operations
  • Cache-friendly data structures
  • Minimize memory allocations in hot paths
  • Use compiler intrinsics where appropriate

Future Enhancements

Potential areas for expansion:

  1. Additional Data Types: Support for int8, int16, bfloat16
  2. Advanced Operations: Convolution, activation functions, normalization
  3. Memory Hierarchy: Multi-level cache simulation
  4. Pipeline Modeling: Instruction-level simulation
  5. Power Modeling: Energy consumption estimation
  6. Visualization: Real-time performance monitoring
  7. Distributed Computing: Multi-KPU simulation

Contributing

To contribute to the KPU simulator:

  1. Follow modern C++ best practices (C++17/20)
  2. Maintain thread safety for all shared components
  3. Add comprehensive tests for new features
  4. Update documentation for API changes
  5. Benchmark performance impact of modifications

License

This project is released under the MIT License. See LICENSE file for details.


Stillwater Computing, Inc.
Advancing the state of the art in knowledge processing

About

Simulator for a KPU-based accelerator

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published