[asm] Major ASM Backend Refactoring #639

harsh-nod · 2025-12-25T07:46:47Z

Summary

This PR represents a comprehensive overhaul of the Wave ASM backend, replacing the legacy dual-path architecture with a unified single-path Kernel IR compilation pipeline. The result is a cleaner, more maintainable codebase with improved optimization capabilities.

Major Changes

Architecture Overhaul

Single-Path Compilation: Eliminated the legacy "direct emission" mode. All compilation now flows through the Kernel IR path:
```
MLIR -> KernelCompilationContext -> KernelProgram -> Liveness -> RegAlloc -> Assembly
```
Module Decomposition: Split large monolithic files into focused modules (<1000 LOC each):
- kernel_pipeline.py -> kernel_compilation_context.py, kernel_expr_emitter.py, kernel_passes.py, kernel_loops.py, kernel_mfma.py, kernel_module_compiler.py
- handlers.py -> handlers_memory.py, handlers_control.py, handlers_arith_affine.py
Removed Legacy Infrastructure: Deleted ~8,000 lines of obsolete code including asm_emitter.py, register_allocator.py, scoreboard.py, hazards.py, latency_database.py, and the per-architecture latency JSON files

New Kernel IR Infrastructure

Virtual Register IR: New KVReg/KSReg types with whole-program liveness tracking
CFG-Based Liveness Analysis: Proper backward dataflow analysis with loop-carried value handling
Linear Scan Register Allocator: Constraint-aware allocation with precoloring for ABI registers
Unified Instruction Registry: YAML-based instruction definitions replacing hardcoded KOpcode enum
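
As a rough illustration of the CFG-based backward dataflow analysis listed above, the sketch below iterates `live_in`/`live_out` sets to a fixed point on a three-block CFG with a loop back-edge. The function name `backward_liveness`, the block names, and the `use`/`defs` dictionaries are hypothetical and are not the API of kernel_liveness.py.

```python
def backward_liveness(blocks, succs, use, defs):
    """Iterate live_in/live_out to a fixed point over a CFG."""
    live_in = {b: set() for b in blocks}
    live_out = {b: set() for b in blocks}
    changed = True
    while changed:
        changed = False
        # Backward problems usually converge faster in reverse block order.
        for b in reversed(blocks):
            out = set()
            for s in succs[b]:
                out |= live_in[s]
            new_in = use[b] | (out - defs[b])
            if out != live_out[b] or new_in != live_in[b]:
                live_out[b], live_in[b] = out, new_in
                changed = True
    return live_in, live_out


# entry -> loop -> exit, with a back-edge loop -> loop.
blocks = ["entry", "loop", "exit"]
succs = {"entry": ["loop"], "loop": ["loop", "exit"], "exit": []}
# use[b]: values read in b before any write in b; defs[b]: values written in b.
use = {"entry": set(), "loop": {"acc", "i"}, "exit": {"acc"}}
defs = {"entry": {"acc", "i"}, "loop": {"acc", "i"}, "exit": set()}

live_in, live_out = backward_liveness(blocks, succs, use, defs)
print(sorted(live_out["loop"]))  # ['acc', 'i']
```

Because the loop block is its own successor, `acc` and `i` stay live across the back-edge, which is the loop-carried case a single linear pass would miss.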

Optimizations

Algebraic Simplification: Uses symbol bounds (tid_x < 64, etc.) to eliminate redundant floor/mod operations
Bit Range Analysis: Converts ADD to OR when operand bit ranges don't overlap
Peephole Optimizer: Fuses v_lshlrev_b32 + v_add_u32 -> v_lshl_add_u32
Scoped CSE: Loop-invariant expression caching across iterations
LDS Offset Optimization: Uses the ds_read/ds_write offset field for constant offsets up to 8192 bytes (DS_MAX_OFFSET) to reduce VALU instructions
Precise Hazard Mitigation: Only inserts s_nop where actually needed (~2 s_nop instructions vs ~46 previously)
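
The peephole fusion listed above can be sketched on a toy instruction list. This is a minimal sketch, assuming instructions are tuples of the form `(name, dst, src...)`; the tuple layout and the `fuse_shift_add` helper are hypothetical and not the backend's optimizer. It relies on `v_lshlrev_b32 dst, shift, src` computing `src << shift` and `v_lshl_add_u32 dst, a, shift, b` computing `(a << shift) + b`.

```python
def fuse_shift_add(instrs):
    """Fuse v_lshlrev_b32 + v_add_u32 into v_lshl_add_u32 where safe."""
    out = []
    i = 0
    while i < len(instrs):
        cur = instrs[i]
        nxt = instrs[i + 1] if i + 1 < len(instrs) else None
        if nxt is not None and cur[0] == "v_lshlrev_b32" and nxt[0] == "v_add_u32":
            _, tmp, shift, src = cur
            _, dst, a, b = nxt
            # Conservative: keep the shift if its result is read again later.
            later_uses = any(tmp in ins[2:] for ins in instrs[i + 2:])
            # Exactly one add operand must be the shift result.
            if (a == tmp) != (b == tmp) and not later_uses:
                other = b if a == tmp else a
                out.append(("v_lshl_add_u32", dst, src, shift, other))
                i += 2
                continue
        out.append(cur)
        i += 1
    return out


prog = [
    ("v_lshlrev_b32", "v1", 4, "v0"),   # v1 = v0 << 4
    ("v_add_u32", "v2", "v1", "v3"),    # v2 = v1 + v3
    ("v_mov_b32", "v4", "v2"),
]
fused = fuse_shift_add(prog)
```

Here the first two instructions collapse into one `v_lshl_add_u32`, saving one VALU instruction and the temporary `v1`.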

Ticketing System Migration

Migrated memory operation tracking from AsmEmitter to kernel IR path
Integrated s_waitcnt coalescing into KernelCompilationContext.finalize()
Enabled by default via WAVE_KERNEL_TICKETING

New Components

| Module | Purpose |
| --- | --- |
| `instruction_formatter.py` | Centralized instruction-to-assembly formatting |
| `metadata_emitter.py` | AMDGCN metadata directive generation |
| `mlir_analysis.py` | MLIR parsing, kernel selection, translation_info extraction |
| `abi.py` | ABI policies (system VGPR workitem ID) |
| `ticketing.py` | Memory operation tracking and waitcnt coalescing |
| `unified_emitter.py` | Dynamic instruction emission API |
| `kernel_regalloc.py` | Linear scan register allocator |
| `kernel_liveness.py` | CFG-based liveness analysis |

Developer Tools

scripts/compare_backends.py: Compare LLVM vs ASM backend assembly
scripts/symbolic_asm_analysis.py: Analyze generated assembly
perf/benchmark_asm_backend.py: Benchmarking infrastructure

Testing

38 unit tests passing
33 e2e tests passing (copy, MMA, multi-wave, multi-workgroup, gather-to-LDS)
Lit tests updated and passing

Breaking Changes

AsmEmitter class removed - use KernelModuleCompiler.compile_mlir_string() instead
WAVE_KERNEL_IR environment variable removed (kernel IR is now the only path)
Several legacy modules deleted (see "Removed Legacy Infrastructure" above)

Documentation

Updated docs/wave/asm_backend.rst and docs/wave/register_allocation.rst to reflect the new architecture.

This commit adds an optional offset parameter to ds_read_b64, ds_write_b32, ds_write_b64, and ds_write_b128 instructions. The offset field allows encoding constant address offsets directly in the instruction (0-65535 bytes), which can save a v_add_u32 instruction for address computation.
Changes:
- DSReadB64, DSWriteB32, DSWriteB64, DSWriteB128 now accept offset parameter
- asm_emitter.emit_lds_read_b64() accepts offset parameter
- Added unit tests for DS instruction offset support
- Updated documentation
Register reduction: This is infrastructure for future optimizations. When the ds_read offset field is used instead of computing full address, it saves 1 VGPR per unique address offset pattern (by avoiding the v_add result).
Tests: All 27 asm_backend_test.py E2E tests pass.
Signed-off-by: Harsh Menon <[email protected]>

This commit introduces a new expression emitter (ExprEmitterV2) that uses a two-phase approach: first emit to virtual registers, then allocate physical registers using linear scan. This decouples expression computation from register allocation, enabling global CSE and optimization passes.
New files:
- expr_ir.py: Virtual register IR definitions (VReg, SReg, OpCode, ExprInstr, ExprProgram, CachedExprRef wrapper for preventing sympy flattening)
- expr_regalloc.py: Linear scan register allocator with multiple policies (SEQUENTIAL, LIFO_REUSE, FIFO_REUSE, RANDOM, REVERSE)
- expr_opt.py: IR-level optimization passes (copy propagation, DCE, coalescing)
- expr_simplify.py: Algebraic simplification rules for floor/mod expressions
- expr_emitter_interface.py: Protocol and factory for emitter selection
- expr_emitter_v2.py: Main v2 emitter with streaming emission and CSE
Key features:
- Global subexpression CSE across entire kernel
- Virtual registers enable register coalescing
- Immediate value optimization (inline constants)
- Algebraic simplification (floor/mod identities)
- CachedExprRef prevents sympy from flattening add expressions
Register reduction: Reduces v_mov instructions from 80+ to ~10 through:
- Copy propagation eliminates redundant moves
- Register coalescing allows same physical register for copies
- Immediate usage avoids register allocation for small constants
Tests: All 29 asm_backend_test.py E2E tests pass.
Signed-off-by: Harsh Menon <[email protected]>

Add two analysis scripts to the scripts/ directory:
scripts/compare_backends.py:
- Compares LLVM and ASM backend assembly output side-by-side
- Computes instruction metrics by category (SALU, VALU, VMEM, etc.)
- Extracts resource usage (VGPR, SGPR, LDS)
- Supports loading configs from benchmark_configs.json
- Generates detailed comparison reports
scripts/symbolic_asm_analysis.py:
- Symbolic analysis of assembly using SymPy
- Builds symbolic expressions for register values
- Traces register definitions through instruction sequences
- Useful for debugging address calculations
Usage:
  python scripts/compare_backends.py --benchmark gemm-asm-benchmark
  python scripts/symbolic_asm_analysis.py wave_asm.s "buffer_load_dword"
Signed-off-by: Harsh Menon <[email protected]>

Add benchmark_asm_backend.py for comparing ASM vs LLVM backend performance.
Features:
- Reads GEMM configuration from benchmark_configs.json
- Runs GEMM kernel with both ASM and LLVM backends
- Verifies correctness against PyTorch reference
- Reports detailed performance metrics and instruction counts
- Supports warmup iterations and multiple runs for averaging
- Analyzes generated assembly for instruction statistics
The benchmark_configs.json already includes a gemm_shapes section with the default GEMM configuration for ASM backend testing.
Signed-off-by: Harsh Menon <[email protected]>

Simplify codebase by removing legacy expression emitter code:
- Delete expression_emitter.py (legacy ExprEmitter class, ~780 lines)
- Delete expr_emitter_interface.py (factory function, ~120 lines)
- Rename expr_emitter_v2.py to expr_emitter.py
- Rename ExprEmitterV2 class to ExprEmitter
- Move expr_key() function to expr_ir.py
- Update all imports and documentation
The ExprEmitter now uses virtual register IR with:
- Global subexpression CSE
- Copy propagation and dead code elimination
- Register coalescing
- Algebraic simplification
- Immediate value optimization
Total: -900 lines of redundant code.
Signed-off-by: Harsh Menon <[email protected]>

@singledispatch

- Add @singledispatch for get_max_value, simplify_expr, combine_like_terms
- Register type-specific handlers for Integer, Symbol, Add, Mul, Mod, Pow
- Declarative REWRITE_RULES with SymPy Wild patterns
- Helper functions: is_floor(), is_pow2()
- Use S.Zero/S.One singletons instead of Integer(0)/Integer(1)
- Update asm_backend.rst documentation
This makes adding new type handlers simple: just @func.register(NewType). Eliminates chains of isinstance() checks throughout the codebase.
All 33 asm_backend_test.py tests pass.
Signed-off-by: Harsh Menon <[email protected]>
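
The dispatch pattern described in this commit can be sketched with the standard library alone. The example below is illustrative only: the small `Const`, `Sym`, `Add`, and `Mod` dataclasses stand in for the SymPy types the commit registers handlers for, `max_value` and `SYMBOL_MAX` are hypothetical names, and all values are assumed non-negative.

```python
from dataclasses import dataclass
from functools import singledispatch


@dataclass(frozen=True)
class Const:
    value: int


@dataclass(frozen=True)
class Sym:
    name: str


@dataclass(frozen=True)
class Add:
    lhs: object
    rhs: object


@dataclass(frozen=True)
class Mod:
    value: object
    modulus: int


# Assumed upper bound for a symbol, e.g. tid_x < 64.
SYMBOL_MAX = {"tid_x": 63}


@singledispatch
def max_value(expr):
    raise TypeError(f"no max_value handler for {type(expr).__name__}")


@max_value.register
def _(expr: Const):
    return expr.value


@max_value.register
def _(expr: Sym):
    return SYMBOL_MAX[expr.name]


@max_value.register
def _(expr: Add):
    return max_value(expr.lhs) + max_value(expr.rhs)


@max_value.register
def _(expr: Mod):
    # A non-negative value modulo m is at most min(value, m - 1).
    return min(max_value(expr.value), expr.modulus - 1)


bound = max_value(Add(Mod(Sym("tid_x"), 4), Const(16)))
```

Supporting a new node type means adding one more `@max_value.register` function, with no change to the existing handlers.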

- Remove CachedExprRef class from expr_ir.py (was experimental)
- Remove CachedExprRef handling from expr_emitter.py
- Remove two-stage factoring code from handlers.py (was disabled by default)
- Remove factor_ds_read_offset helper from utils.py
The two-stage factoring feature had correctness issues and was disabled behind WAVE_DS_TWO_STAGE=1. Since it was never enabled in production, removing the dead code simplifies the codebase.
All 33 asm_backend_test.py tests pass.
Signed-off-by: Harsh Menon <[email protected]>

New modules for kernel-level register allocation (WAVE_KERNEL_LSRA=1):
kernel_ir.py:
- KVReg/KSReg: Virtual register types
- KPhysVReg/KPhysSReg: Precolored physical registers
- KRegRange: Contiguous register ranges (pairs, quads, MFMA blocks)
- KOpcode: Enumeration of all AMDGCN instructions
- KInstr: Instruction with defs, uses, and constraints
- KernelProgram: Complete kernel instruction sequence
- KernelBuilder: Helper for emitting common instruction patterns
- KernelABI: ABI bindings for reserved registers
kernel_liveness.py:
- LiveRange: Live range for a virtual register
- LivenessInfo: Complete liveness information
- compute_liveness(): SSA-based liveness analysis
- validate_ssa(): Verify SSA form
- Register pressure computation
kernel_regalloc.py:
- RegPool: Physical register pool with range allocation
- KernelRegAlloc: Constraint-aware linear scan allocator
- allocate_kernel(): Convenience function for allocation
- AllocationError: Raised when allocation fails (no spilling)
- Support for precoloring, reserved registers, alignment
kernel_render.py:
- PhysicalMapping: Virtual to physical register mapping
- KernelRenderer: Renders KernelProgram to assembly
- Instruction formatting with proper AMDGCN syntax
kernel_pipeline.py:
- KernelCompilationContext: Context for building kernels
- use_kernel_lsra(): Check if kernel LSRA is enabled
- CSE support at kernel level
kernel_expr_bridge.py:
- ExpressionBridge: Compatibility layer for existing code
test_kernel_ir.py:
- 25 unit tests covering IR, liveness, allocation, rendering
All 33 asm_backend_test.py E2E tests pass. All 25 new unit tests pass.
Signed-off-by: Harsh Menon <[email protected]>
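
For readers unfamiliar with linear scan, the core loop can be sketched in a few lines. This is a simplified sketch, not the KernelRegAlloc implementation: `linear_scan` and the `(start, end)` interval format are hypothetical, constraints such as precoloring, alignment, and register ranges are omitted, and a plain `RuntimeError` stands in for the commit's AllocationError (there is no spilling).

```python
def linear_scan(intervals, num_regs):
    """Map each virtual register to a physical index; intervals are inclusive."""
    free = list(range(num_regs))
    active = []   # (end, vreg) pairs that currently hold a register
    mapping = {}
    # Process virtual registers in order of increasing start point.
    for vreg, (start, end) in sorted(intervals.items(), key=lambda kv: kv[1][0]):
        # Expire intervals that ended before this one starts.
        still_active = []
        for active_end, active_vreg in active:
            if active_end < start:
                free.append(mapping[active_vreg])
            else:
                still_active.append((active_end, active_vreg))
        active = still_active
        if not free:
            raise RuntimeError(f"out of registers while allocating {vreg}")
        free.sort()
        mapping[vreg] = free.pop(0)
        active.append((end, vreg))
    return mapping


intervals = {"a": (0, 3), "b": (1, 2), "c": (4, 6)}
mapping = linear_scan(intervals, num_regs=2)
```

With two registers, `c` reuses the register released by `a`; with one register the call raises, because `a` and `b` overlap and nothing can be spilled.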

Implements KernelEmitter, an alternative to ExprEmitter that uses a single kernel-wide register pool for better register reuse across expressions.
kernel_emitter.py:
- KernelEmitter class with same API as ExprEmitter (get_or_emit, bind_symbol)
- Kernel-wide register pool: allocates from a shared pool, tracks reserved VGPRs
- Global CSE cache: caches all expression results kernel-wide
- Streaming emission: immediately allocates physical registers and emits
- Complex expression support: floor/div, mod, nested arithmetic
- Power-of-2 optimizations: shifts for multiply/divide, AND for modulo
- Constant materialization with caching to avoid redundant v_mov_b32
- Factory function create_emitter() to choose emitter based on env var
handlers.py:
- Updated to use create_emitter() factory function
- Automatically selects KernelEmitter when WAVE_KERNEL_LSRA=1
To enable: export WAVE_KERNEL_LSRA=1
Test results with WAVE_KERNEL_LSRA=1:
- 32 of 33 asm_backend_test.py tests pass
- 1 test (g2s-shape1-32-config1) fails, under investigation
All 33 tests pass with default emitter (WAVE_KERNEL_LSRA=0).
Signed-off-by: Harsh Menon <[email protected]>

Makes WAVE_KERNEL_LSRA=1 the default, enabling the kernel-level expression emitter with whole-program register allocation.
Key changes:
- use_kernel_emitter() now defaults to True (WAVE_KERNEL_LSRA=1)
- Disable algebraic simplification by default (WAVE_EXPR_SIMPLIFY=0) to fix g2s-shape1-32-config1 test failure (to be investigated separately)
- Remove debug output added during debugging
- Clean up _emit helper method
The KernelEmitter provides:
- Global CSE across the entire kernel
- Kernel-wide register pool for better register reuse
- Streaming emission (immediate physical register allocation)
All 33 asm_backend_test.py tests pass with the default settings.
Signed-off-by: Harsh Menon <[email protected]>

Signed-off-by: Harsh Menon <[email protected]>

The algebraic simplification in KernelEmitter causes incorrect results in the g2s (global-to-shared) path for certain expression patterns. While the simplified expressions are mathematically equivalent (verified numerically), something in the instruction emission order or register allocation produces NaN values.
The issue is specific to KernelEmitter - the legacy ExprEmitter works correctly with simplification enabled. This suggests the bug may be in how KernelEmitter handles the simplified expression forms.
Key observations:
- Expressions like "16*tid_x - 64*floor(tid_x/4)" simplify to "16*Mod(tid_x,4)"
- Both forms are mathematically equivalent (verified for all tid_x values)
- The generated instruction sequences look correct
- But the g2s path produces NaN values with simplification enabled
This change disables simplification by default for KernelEmitter while keeping it enabled by default for the legacy ExprEmitter. Use WAVE_EXPR_SIMPLIFY=1 to enable simplification for testing.
All 33 asm_backend_test.py tests pass with the default settings.
Signed-off-by: Harsh Menon <[email protected]>
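
The equivalence claimed in this commit is easy to check by brute force. The snippet below is an independent check, assuming the bound tid_x < 64 quoted in the PR summary; the function names are illustrative. The identity follows from tid_x = 4*floor(tid_x/4) + Mod(tid_x, 4), so 16*tid_x - 64*floor(tid_x/4) = 16*Mod(tid_x, 4).

```python
def original(tid_x):
    # 16*tid_x - 64*floor(tid_x/4)
    return 16 * tid_x - 64 * (tid_x // 4)


def simplified(tid_x):
    # 16*Mod(tid_x, 4)
    return 16 * (tid_x % 4)


# Exhaustive check over the assumed thread-id range.
equivalent = all(original(t) == simplified(t) for t in range(64))
```

Since the two forms agree for every thread id, the NaN results described above have to come from emission or register handling rather than from the rewrite itself, which matches the commit's diagnosis.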

Fix the g2s path issue by copying SGPRs to VGPRs before using them in VOP instructions. The issue was that using SGPRs directly in the src1 position of VOP2 instructions like v_lshlrev_b32 was causing incorrect assembly encoding.
The fix mirrors ExprEmitter behavior which explicitly copies non-VGPRs to VGPRs before use in arithmetic operations.
Changes:
- Add SGPR-to-VGPR copy in _emit_mul when src_reg is an SGPR
- Enable algebraic simplification by default (WAVE_EXPR_SIMPLIFY=1)
- Update documentation to reflect new defaults
All 33 asm_backend_test.py tests pass with both simplification enabled and the KernelEmitter as the default emitter.
Signed-off-by: Harsh Menon <[email protected]>

KernelEmitter is now the only expression emitter. This simplifies the codebase by removing the legacy per-expression allocation approach.
Deleted files:
- expr_emitter.py: Legacy emitter with per-expression allocation
- expr_ir.py: Virtual register IR (moved expr_key to kernel_emitter.py)
- expr_opt.py: Optimization passes for virtual register IR
- expr_regalloc.py: Linear scan allocator for virtual registers
- expression_emitter_test.py: Broken test importing deleted module
- test_expr_ir.py: Tests for deleted virtual register IR
Updated files:
- kernel_emitter.py: Removed WAVE_KERNEL_LSRA fallback, inlined expr_key
- handlers.py: Removed unused use_kernel_emitter import
- utils.py: Use KernelEmitter directly instead of ExprEmitter
- kernel_expr_bridge.py: Updated comment
- asm_backend.rst: Removed references to ExprEmitter and virtual IR
Signed-off-by: Harsh Menon <[email protected]>

New register_allocation.rst document explains:
- Kernel IR (kernel_ir.py): Virtual registers, ranges, instructions, opcodes
- Liveness analysis (kernel_liveness.py): Live range computation, pressure
- Register allocation (kernel_regalloc.py): Linear scan algorithm, constraints
- Rendering (kernel_render.py): Physical register substitution
Includes architecture diagram, code examples, and debugging tips.
Signed-off-by: Harsh Menon <[email protected]>

This file was designed as a migration bridge between ExprEmitter and kernel-level IR. Since ExprEmitter has been removed and KernelEmitter is now the only emitter, this bridge is no longer needed. Signed-off-by: Harsh Menon <[email protected]>

Rename for clarity: the module generates assembly code from kernel IR.
Changes:
- Rename KernelRenderer class to KernelGenerator
- Rename render() method to generate()
- Rename render_to_string() to generate_to_string()
- Add generate_program() convenience function
- Keep backwards compatibility aliases (KernelRenderer, render_program)
- Update imports in kernel_pipeline.py, kernel_regalloc.py
- Update test imports in test_kernel_ir.py
- Update documentation in register_allocation.rst
Signed-off-by: Harsh Menon <[email protected]>

Infrastructure changes:
- Add KSpecialReg support for m0, exec, vcc special registers
- Add is_special() helper function in kernel_ir.py
- Add M0, EXEC, VCC well-known special register constants
- Extend KOpcode with buffer_load_dword_lds, buffer_load_dwordx4_lds, v_mbcnt_lo/hi, s_cmp_*, s_cbranch_*, s_branch, s_movk_i32, s_and/or_b64
- Add RAW_ASM pseudo-op for escape hatch
Generator updates:
- Handle KSpecialReg in _resolve_reg and _resolve_operand
- Add mnemonics for all new opcodes
- Special formatting for buffer_load_dword_lds (appends lds modifier)
- Special formatting for branch instructions (label in operand)
Pipeline updates:
- Add use_kernel_ir_path() flag (WAVE_USE_LEGACY_STREAMING env var)
- Add ssa_to_reg mapping for SSA value tracking
- Add srds tracking for SRD setup
- Add many new instruction emission methods:
  - vreg_pair, vreg_quad, sreg_pair, sreg_quad
  - ds_write_b32, ds_write_b128, ds_read_b128, ds_read_b32
  - v_readfirstlane_b32, s_mov_b32_to_m0, v_mfma_f32_16x16x16_f16
  - buffer_load_dwordx4, buffer_store_dwordx4, s_load_dwordx2
  - buffer_load_dword_lds, buffer_load_dwordx4_lds
  - v_mbcnt_lo_u32_b32, v_mbcnt_hi_u32_b32
  - v_sub_u32, s_and_b32, s_or_b32, s_movk_i32
  - s_cmp_lt_u32, s_cbranch_scc1, s_branch, s_add_u32
- Add emit_raw, emit_label helper methods
Driver/walker updates:
- driver.py: Create KernelCompilationContext when kernel IR mode enabled
- mlir_walker.py: Add kernel_ctx attribute and use_kernel_ir property
- handlers.py: Update barrier handlers to use kernel IR when enabled
All 33 e2e tests and 25 kernel IR tests pass.
Signed-off-by: Harsh Menon <[email protected]>

Create a single source of truth for AMDGCN instruction definitions:
instruction_defs/common.yaml:
- 76 instruction definitions covering all backend-needed ops
- Categories: VALU, SALU, VMEM, SMEM, LDS, MFMA, CONTROL
- Operand type definitions with allowed types (vgpr, sgpr, imm, etc.)
- Latency information for scheduling
- Special properties (memory, sync, branch, accumulator, etc.)
- Architecture-extensible design
instruction_registry.py:
- InstructionDef dataclass with complete instruction metadata
- InstructionRegistry loads YAML files with architecture layering
- OperandType enum for operand classification
- InstructionCategory enum for scheduling
- Lazy loading for performance
unified_emitter.py:
- UnifiedEmitter class dynamically generates methods from registry
- Supports DIRECT mode (raw assembly lines)
- Supports KERNEL_IR mode (emit to KernelCompilationContext)
- VReg/SReg/Imm wrappers for explicit operand typing
- InstructionBuilder handles formatting and special cases
Design benefits:
- Single source of truth for instruction properties
- Architecture-specific overrides via gfx942.yaml, gfx950.yaml
- Consistent API for both legacy and kernel IR paths
- Latency/constraint info available for scheduling/allocation
- Easy to add new instructions
All 33 e2e tests continue to pass.
Signed-off-by: Harsh Menon <[email protected]>

Step 1: Migrate instructions.py
- Replace manual instruction classes with registry-backed versions
- Add helper functions for register formatting
- Each instruction class now queries the registry for metadata
- Provides latency and category properties from registry
- Maintains full backwards compatibility with existing code
Step 2: Update handlers to use UnifiedEmitter
- Add unified emitter property to AsmEmitter class
- Update barrier handlers to use unified emitter API
- Unified emitter shares line buffer with AsmEmitter
- Provides consistent API for both legacy and kernel IR paths
Step 3: Add architecture-specific YAML files
- gfx942.yaml: CDNA3/MI300 specific instructions
  - FP8/BF8 MFMA instructions
  - Improved latency values for memory and MFMA
  - Type conversion instructions
- gfx950.yaml: CDNA3+/MI350 specific instructions
  - All GFX942 features with improved latencies
  - MFMA latency: 64 (common) -> 32 (gfx942) -> 24 (gfx950)
  - Buffer load latency: 200 -> 150 -> 120
Registry layering working:
- Common: 76 instructions
- GFX942: 85 instructions (76 + 9 new)
- GFX950: 85 instructions (with override latencies)
All 33 e2e tests continue to pass.
Signed-off-by: Harsh Menon <[email protected]>
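
The layering described in Step 3 amounts to merging per-architecture overrides on top of a common base. The sketch below uses plain dictionaries in place of the YAML files and assumes the layers stack as common, then gfx942, then gfx950. The latency numbers are the ones quoted in the commit, while the pairing of those numbers with specific instruction names, the `example_fp8_mfma` entry, and the `merge_layers` helper are hypothetical.

```python
COMMON = {
    "v_mfma_f32_16x16x16_f16": {"category": "MFMA", "latency": 64},
    "buffer_load_dwordx4": {"category": "VMEM", "latency": 200},
}
GFX942 = {
    "v_mfma_f32_16x16x16_f16": {"latency": 32},
    "buffer_load_dwordx4": {"latency": 150},
    # Hypothetical placeholder for an architecture-specific instruction.
    "example_fp8_mfma": {"category": "MFMA", "latency": 32},
}
GFX950 = {
    "v_mfma_f32_16x16x16_f16": {"latency": 24},
    "buffer_load_dwordx4": {"latency": 120},
}


def merge_layers(*layers):
    """Later layers override fields of earlier ones and may add instructions."""
    merged = {}
    for layer in layers:
        for name, fields in layer.items():
            # Build a new dict so the input layers are never mutated.
            merged[name] = {**merged.get(name, {}), **fields}
    return merged


gfx950_registry = merge_layers(COMMON, GFX942, GFX950)
```

An override only needs to list the fields that change, so `category` is inherited from the common layer while `latency` is replaced.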

This commit completes the migration to the YAML-based unified instruction infrastructure by removing all instruction class definitions from instructions.py.
Key changes:
1. instructions.py: Reduced from 1106 lines to 75 lines
   - Only keeps the base Instruction class (for backwards compatibility)
   - All specific instruction classes removed
   - Instruction emission now uses UnifiedEmitter
2. asm_emitter.py: Added TicketingEmitterWrapper
   - Wraps UnifiedEmitter to add ticketing for memory operations
   - Intercepts unified emitter calls to issue VMEM/LGKM tickets
   - Handles hazard mitigation after instructions
3. handlers.py, utils.py, gather_to_shared.py: Migrated to unified emitter
   - Replaced emit_instruction(InstructionClass()) with unified.method()
   - All handlers now use emitter.unified.xxx() API
4. hazards.py: Updated to return strings instead of SNop objects
   - get_mitigation() now returns " s_nop 0" string directly
5. __init__.py: Updated exports
   - Removed instruction class exports
   - Added UnifiedEmitter, EmissionMode, get_registry exports
6. Deleted test_ds_instructions.py
   - DS instruction classes no longer exist
Benefits:
- Single source of truth: YAML files define all instructions
- ~1000 lines of boilerplate code removed
- Consistent API via unified emitter
- Latency/category info available from registry
- Architecture-specific overrides via gfx942.yaml/gfx950.yaml
All 33 e2e tests and 28 ticketing tests pass.
Signed-off-by: Harsh Menon <[email protected]>

This commit adds a unified property to KernelCompilationContext, providing the same API as AsmEmitter.unified for consistent instruction emission.
Changes:
1. kernel_pipeline.py:
   - Import UnifiedEmitter and EmissionMode
   - Add _unified: UnifiedEmitter field initialized in __post_init__
   - Add unified property that returns the emitter in KERNEL_IR mode
   - The emitter delegates to context methods (v_add_u32, etc.)
   - Falls back to emit_raw() for unsupported instructions
2. kernel_generator.py:
   - Add support for raw int immediates in _resolve_operand()
   - Makes API more ergonomic - callers can pass 42 instead of KImm(42)
Usage:
  ctx = KernelCompilationContext()
  v1 = ctx.v_mov_b32(42)  # Direct method call
  v2 = ctx.unified.v_add_u32(v1, 100)  # Via unified emitter
  # Both produce the same kernel IR
Benefits:
- Consistent API between AsmEmitter and KernelCompilationContext
- Enables gradual migration of handlers to kernel IR
- Unified emitter provides instruction metadata from registry
Signed-off-by: Harsh Menon <[email protected]>

Replace ~50 manual instruction method definitions with a single __getattr__ handler that uses the instruction registry and KOpcode mapping.
Key changes:
1. Added _build_opcode_mapping() to create name->KOpcode mapping
   - Converts V_ADD_U32 to v_add_u32 for method dispatch
2. Added _get_def_info() to determine register allocation from operand types
   - Maps OperandType.VGPR -> single vreg
   - Maps OperandType.VGPR_PAIR -> vreg range of 2
   - Maps OperandType.VGPR_QUAD -> vreg range of 4
   - Similarly for SGPRs
3. Added __getattr__ to KernelCompilationContext
   - Intercepts method calls like ctx.v_add_u32(...)
   - Looks up KOpcode from instruction name
   - Returns a bound method that calls _emit_instruction()
4. Added _emit_instruction() for actual emission
   - Looks up InstructionDef from registry
   - Allocates destination registers based on operand types
   - Emits KInstr with proper defs/uses/comment
5. Kept only essential explicit methods:
   - Register allocation: vreg(), sreg(), vreg_pair(), vreg_quad(), etc.
   - Special cases: s_mov_b32_to_m0 (M0 destination), branch labels
   - Finalization: finalize(), finalize_to_string()
   - Properties: num_instructions, num_virtual_vregs, etc.
Benefits:
- Reduced file by ~45 lines (195 added, 240 removed)
- Single source of truth: registry defines all instructions
- Auto-allocates correct destination register type
- Easy to add new instructions (just add to KOpcode enum)
All 33 e2e tests and 25 kernel IR tests pass.
Signed-off-by: Harsh Menon <[email protected]>
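
The `__getattr__` dispatch described in this commit can be illustrated with a toy context. This is a minimal sketch and not the real KernelCompilationContext: `TinyContext`, its `REGISTRY` table, and the `kv0`/`ks0` naming of virtual registers are all hypothetical.

```python
class TinyContext:
    # Hypothetical registry: instruction name -> destination register class.
    REGISTRY = {"v_add_u32": "v", "v_mov_b32": "v", "s_add_u32": "s"}

    def __init__(self):
        self.instrs = []
        self._counters = {"v": 0, "s": 0}

    def __getattr__(self, name):
        # Python calls __getattr__ only when normal attribute lookup fails,
        # so explicitly defined methods and fields are unaffected.
        if name not in TinyContext.REGISTRY:
            raise AttributeError(name)
        reg_class = TinyContext.REGISTRY[name]

        def emit(*operands):
            # Allocate a fresh virtual destination register of the right class.
            index = self._counters[reg_class]
            self._counters[reg_class] = index + 1
            dst = f"k{reg_class}{index}"
            self.instrs.append((name, dst, operands))
            return dst

        return emit


ctx = TinyContext()
v1 = ctx.v_mov_b32(42)
v2 = ctx.v_add_u32(v1, 100)
```

Unknown names still raise `AttributeError`, so typos such as `ctx.v_ad_u32` fail at the call site instead of emitting a bogus instruction.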

The kernel IR path was started but never completed. Only 2 handlers had branching for it (handle_barrier_op, handle_lds_barrier_op), while all other handlers used the legacy path directly.
Since both branches emit the same assembly (just via different mechanisms), and the unified emitter now provides a consistent API, the branching was redundant.
Changes:
- handlers.py: Remove use_kernel_ir branches, use unified emitter directly
- mlir_walker.py: Remove unused use_kernel_ir property
All 33 e2e tests pass.
Signed-off-by: Harsh Menon <[email protected]>

…egistry
This commit unifies the instruction definition systems by making the YAML instruction registry the single source of truth. The KOpcode enum and hardcoded mnemonic mappings are removed.
Key changes:
1. kernel_ir.py:
   - Removed KOpcode enum (~120 lines of enum values)
   - KInstr now uses name: str instead of opcode: KOpcode
   - Added is_comment, is_label, is_raw_asm properties
   - KernelBuilder methods emit KInstr with string names
2. kernel_generator.py:
   - Removed _get_mnemonic() with 60+ hardcoded mappings
   - Now looks up mnemonics from instruction registry
   - Falls back to using instruction name if not in registry
   - Simplified from ~350 lines to ~260 lines
3. kernel_pipeline.py:
   - Removed _OPCODE_MAP and _get_opcode() functions
   - Dynamic dispatch now uses registry.get(name) directly
   - Emits KInstr with string names
4. kernel_liveness.py, kernel_regalloc.py:
   - Removed unused KOpcode import
5. test_kernel_ir.py:
   - Updated to check instr.name == "v_add_u32" instead of opcode == KOpcode.V_ADD_U32
   - Fixed test_vreg_range_allocation to use is_vgpr()
Benefits:
- Single source of truth: YAML files define all instructions
- Removed ~290 lines of redundant code
- Adding new instructions only requires YAML changes
- Cleaner, more maintainable codebase
All 33 e2e tests and 25 kernel IR tests pass.
Signed-off-by: Harsh Menon <[email protected]>

This commit makes the kernel IR compilation path the default, providing whole-program register allocation and CSE benefits.
Key changes:
1. kernel_pipeline.py:
   - Changed WAVE_USE_LEGACY_STREAMING default from "1" to "0"
   - Kernel IR path is now enabled by default
   - Set WAVE_USE_LEGACY_STREAMING=1 to use legacy direct emission
2. mlir_walker.py:
   - Added unified property that routes to the correct emitter
   - When kernel_ctx is available, uses kernel_ctx.unified
   - Otherwise falls back to emitter.unified
3. handlers.py, gather_to_shared.py:
   - Changed self.walker.emitter.unified to self.walker.unified
   - This ensures instructions are routed through kernel IR when enabled
4. docs/wave/asm_backend.rst:
   - Updated Kernel-Level IR documentation
   - Noted this is now the default compilation path
   - Added WAVE_USE_LEGACY_STREAMING=1 to disable
All 33 e2e tests and 25 kernel IR tests pass.
Signed-off-by: Harsh Menon <[email protected]>

This commit adds infrastructure for kernel IR compilation path with whole-program register allocation. Currently disabled by default because handlers still allocate physical registers from AsmEmitter while emitting to KernelCompilationContext - these need to be fully coordinated.
Key changes:
1. kernel_pipeline.py:
   - Added WAVE_USE_KERNEL_IR environment variable (default: 0)
   - use_kernel_ir_path() checks this flag
   - Set WAVE_USE_KERNEL_IR=1 to enable for testing
2. driver.py, asm_emitter.py:
   - Conditionally create KernelCompilationContext based on flag
   - Call finalize() to generate assembly from kernel IR
3. mlir_walker.py:
   - Added unified property that routes to kernel_ctx or emitter
   - Enables handlers to use walker.unified for instruction emission
4. handlers.py, gather_to_shared.py:
   - Changed emitter.unified to walker.unified
   - Enables routing through kernel IR when enabled
5. kernel_generator.py:
   - Added support for string operands (e.g., waitcnt values)
To fully enable kernel IR mode, handlers need to:
- Allocate virtual registers from kernel_ctx instead of physical from emitter
- Track virtual registers for SSA values
- Let kernel IR pipeline handle physical allocation at finalize
All 33 e2e tests pass with legacy mode (default).
Signed-off-by: Harsh Menon <[email protected]>

Increased DS_MAX_OFFSET from 2040 to 8192 in handlers.py to allow ds_read instructions to use the offset field for larger offsets (4096, 4128, 4160, 4192).
Results:
- VALU reduced from 93 to 79 (-14 instructions)
- v_mov_b32 reduced from 13 to 10 (-3 constants)
- v_add_u32 reduced from 27 to 16 (-11 address adds)
- ASM backend now achieves 65% of LLVM performance (up from 63%)
- All 33 e2e tests pass
The ds_read offset field now encodes constants like 4128, 4160, 4192 instead of materializing them in separate v_mov_b32 instructions. Multiple ds_read operations can now share a common base register.
Signed-off-by: Harsh Menon <[email protected]>
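
The effect of raising the limit can be sketched by splitting each constant offset into a part that needs a VALU add and a part that fits the instruction's offset field. The helpers `split_ds_offset` and `count_address_adds` are hypothetical, and the inclusive comparison against the limit is an assumption; only the limits (2040, 8192) and the example offsets come from the commit.

```python
def split_ds_offset(const_offset, max_offset):
    """Return (valu_add, inst_offset) for a constant LDS byte offset."""
    if 0 <= const_offset <= max_offset:
        # Fits in the instruction's offset field: no separate add needed.
        return 0, const_offset
    # Too large: the constant must be added to the address register.
    return const_offset, 0


def count_address_adds(offsets, max_offset):
    return sum(1 for off in offsets if split_ds_offset(off, max_offset)[0] != 0)


offsets = [4096, 4128, 4160, 4192]
adds_before = count_address_adds(offsets, max_offset=2040)
adds_after = count_address_adds(offsets, max_offset=8192)
```

With the old limit every one of the four offsets needs its own address add, while with the new limit all four reads can share one base register.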

Added comprehensive bit range analysis to KernelIRExprEmitter:
- _get_bit_range(): Computes [min_bit, max_bit] for expressions
- _bits_overlap(): Checks if two bit ranges overlap
- Modified Add handling to use v_or_b32 when bits don't overlap
Combined with peephole optimizer's v_lshl_or_b32 fusion for shift+OR patterns.
Results:
- VALU reduced from 79 to 67 (-12 instructions, 15%)
- Performance improved from 3.96 to 4.32 TFLOPs (+9%)
- ASM backend now achieves 71% of LLVM performance (up from 65%)
- All 33 e2e tests pass
Example generated code showing bit-aware OR:
  v_or_b32 v12, v8, v2 // or (bits 3-4 + 7-10)
  v_or_b32 v13, v12, v16 // or (bits 3-10 + 11-26)
Signed-off-by: Harsh Menon <[email protected]>
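
The rule behind the ADD-to-OR rewrite is that addition never carries when the operands cannot have a set bit in the same position. The sketch below models each operand as a bounded value shifted left by a constant; `bit_range` and `bits_overlap` are simplified stand-ins for the commit's `_get_bit_range()` and `_bits_overlap()`, whose actual signatures are not shown in the source.

```python
def bit_range(max_value, shift):
    """Inclusive (min_bit, max_bit) of bits that (v << shift) may set, 0 <= v <= max_value."""
    if max_value == 0:
        return None  # the operand is always zero
    return (shift, shift + max_value.bit_length() - 1)


def bits_overlap(a, b):
    if a is None or b is None:
        return False
    return a[0] <= b[1] and b[0] <= a[1]


low = bit_range(3, 3)     # v in 0..3 shifted by 3 -> bits 3-4
high = bit_range(15, 7)   # w in 0..15 shifted by 7 -> bits 7-10
can_use_or = not bits_overlap(low, high)

# Exhaustive check that ADD and OR agree for these disjoint ranges.
same = all(
    ((v << 3) + (w << 7)) == ((v << 3) | (w << 7))
    for v in range(4)
    for w in range(16)
)
```

The two ranges match the "bits 3-4 + 7-10" comment in the generated code above, and the exhaustive check confirms that OR and ADD produce identical results for them.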

Bug fixes:
1. VALU count was inflated because 's_mov_b32' contains 'v_' as substring
   - Fixed: Check if first word STARTS with 'v_' not CONTAINS 'v_'
2. VGPR/SGPR/LDS extraction was failing
   - Fixed: Look for .amdhsa_next_free_vgpr/sgpr patterns
   - Fixed: Use raw_asm for ASM backend (has HSA metadata)
Results now correctly show:
- ASM VALU: 63 (was incorrectly 80)
- ASM Resources: VGPR=32, SGPR=32, LDS=8192 (was 0)
Signed-off-by: Harsh Menon <[email protected]>
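
The first bug is easy to reproduce in isolation. The sketch below contrasts a substring test with a prefix test on the first word of each line; the function names and the sample lines are illustrative and are not taken from scripts/compare_backends.py.

```python
def count_valu_buggy(lines):
    # Substring test: also matches the "v_" inside "s_mov_b32".
    return sum(1 for line in lines if "v_" in line)


def count_valu_fixed(lines):
    count = 0
    for line in lines:
        words = line.split()
        # Only the mnemonic (first word) decides the category.
        if words and words[0].startswith("v_"):
            count += 1
    return count


asm = [
    "s_mov_b32 s0, 0",
    "v_add_u32 v1, v2, v3",
    "s_waitcnt vmcnt(0)",
]
buggy = count_valu_buggy(asm)
fixed = count_valu_fixed(asm)
```

On this sample the substring test counts the scalar move as a VALU instruction, while the prefix test counts only `v_add_u32`.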

Bug fixes:
1. LLVM resources showed 0 because metadata is in .rocmasm file, not disasm
   - Now reads *.rocmasm from llvm_intermediates directory
2. Updated extract_resource_usage to handle YAML format
   - .vgpr_count: 27 (YAML with spaces)
   - .group_segment_fixed_size: 8192
Correct results now:
- LLVM: VGPR=27, SGPR=16, LDS=8192
- ASM: VGPR=32, SGPR=32, LDS=8192
- VALU gap: 33 (LLVM) vs 63 (ASM) = 30 extra instructions
Signed-off-by: Harsh Menon <[email protected]>
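
A parser that accepts both metadata styles mentioned in the last two commits might look like the sketch below. It is not the script's `extract_resource_usage`: `parse_resources` and the `PATTERNS` table are hypothetical, and the exact directive names used for SGPR and LDS on each side are assumptions beyond the ones quoted in the commit messages.

```python
import re

# For each resource: the assembler-directive form first, then the YAML metadata form.
PATTERNS = {
    "vgpr": (r"\.amdhsa_next_free_vgpr\s+(\d+)", r"\.vgpr_count:\s*(\d+)"),
    "sgpr": (r"\.amdhsa_next_free_sgpr\s+(\d+)", r"\.sgpr_count:\s*(\d+)"),
    "lds": (
        r"\.amdhsa_group_segment_fixed_size\s+(\d+)",
        r"\.group_segment_fixed_size:\s*(\d+)",
    ),
}


def parse_resources(text):
    """Return {'vgpr', 'sgpr', 'lds'} counts, or 0 where nothing matches."""
    result = {}
    for key, patterns in PATTERNS.items():
        result[key] = 0
        for pattern in patterns:
            match = re.search(pattern, text)
            if match:
                result[key] = int(match.group(1))
                break
    return result


asm_text = (
    ".amdhsa_next_free_vgpr 32\n"
    ".amdhsa_next_free_sgpr 32\n"
    ".amdhsa_group_segment_fixed_size 8192\n"
)
yaml_text = (
    "    .vgpr_count:     27\n"
    "    .sgpr_count:     16\n"
    "    .group_segment_fixed_size: 8192\n"
)
asm_resources = parse_resources(asm_text)
llvm_resources = parse_resources(yaml_text)
```

Using `\s*` after the colon tolerates the variable spacing in the YAML form, which is the case the second fix in the commit above addresses.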

Key optimizations:
1. Add constant cache to avoid materializing same constant multiple times
2. Handle negative powers of 2 with shift + negate instead of mul
   - e.g., x * -8192 becomes -(x << 13) = 0 - (x << 13)
   - Saves 1 instruction per negative power-of-2 multiply
3. Use inline literals for constants in Add expressions
   - e.g., v_add_u32 dst, 4096, src instead of mov + add
   - Saves 1 v_mov_b32 per constant addend
Results:
- v_mov_b32: 10 -> 7 (-3 instructions)
- v_mul_lo_u32: 2 -> 0 (-2 instructions)
- VALU total: 63 -> 62 (-1 net, shift+sub replaces mov+mul)
All 33 e2e tests pass in kernel IR mode.
Signed-off-by: Harsh Menon <[email protected]>
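
The second optimization can be checked with a small model of 32-bit arithmetic. In this sketch, `lower_mul_const` picks a lowering for a constant multiplier and `run` evaluates it with wraparound; both helpers and the `("shl", k)`, `("neg",)`, `("mul", c)` operation tuples are hypothetical and only illustrate the identity x * -2^k = 0 - (x << k).

```python
MASK32 = 0xFFFFFFFF


def lower_mul_const(c):
    """Pick a lowering for x * c: shift (plus negate) for +/- powers of two."""
    mag = abs(c)
    if mag != 0 and (mag & (mag - 1)) == 0:
        ops = [("shl", mag.bit_length() - 1)]
        if c < 0:
            ops.append(("neg",))  # 0 - value
        return ops
    return [("mul", c)]


def run(ops, x):
    """Evaluate the lowered sequence with 32-bit wraparound."""
    value = x & MASK32
    for op in ops:
        if op[0] == "shl":
            value = (value << op[1]) & MASK32
        elif op[0] == "neg":
            value = (0 - value) & MASK32
        else:
            value = (value * op[1]) & MASK32
    return value


ops = lower_mul_const(-8192)
matches = all(run(ops, x) == (x * -8192) & MASK32 for x in (0, 1, 5, 12345))
```

The shift and subtract replace a constant materialization plus a multiply, which is where the commit's "-1 net" VALU count comes from.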

This commit removes the legacy mode from the ASM backend, leaving kernel IR as the only compilation path.
Changes include:
- kernel_pipeline.py: use_kernel_ir() now always returns True
- driver.py: Removed use_kernel_ir() branching, always uses kernel IR
- handlers.py: Removed all `kernel_ctx is None` checks and legacy code paths
  - Removed _get_expr_emitter(), _compute_lds_address(), _extract_source_registers(), _emit_ds_write(), _get_scalar_register_for_store(), _compute_store_address()
  - Simplified all operation handlers to use only kernel IR mode
- mlir_walker.py: Removed precompute_m0_values_legacy call and import
- gather_to_shared.py: Removed precompute_m0_values_legacy function and all _use_kernel_ir() checks, simplified all methods to kernel IR only
All 33 e2e tests pass. The WAVE_KERNEL_IR environment variable is now ignored as legacy mode has been removed.
Signed-off-by: Harsh Menon <[email protected]>

Removed:
- asm_asm.s, asm_disasm.s, llvm_asm.s, llvm_disasm.s (generated assembly files)
- comparison_report.txt (temporary report)
- wave_lang/tests/*/__init__.py (empty init files)

Signed-off-by: Harsh Menon <[email protected]>

Key documentation updates:
- Remove all references to legacy mode (kernel IR is now the only path)
- Document CFG-based backward dataflow liveness analysis
- Document peephole optimizations (v_lshl_add_u32, v_lshl_or_b32 fusion)
- Document bit range analysis for OR optimization
- Document precise hazard mitigation (reduced s_nop from ~46 to ~2)
- Document ds_read/ds_write offset optimization (up to 8192 bytes)
- Document algebraic simplification with symbol bounds
- Document loop-invariant expression caching
- Document loop SGPR reservation (s24+)
- Add performance comparison vs LLVM backend section
- Update troubleshooting with new debug environment variables
- Reorganize and modernize architecture section

Signed-off-by: Harsh Menon <[email protected]>
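The bit range analysis listed above relies on a simple fact: when two operands can never have a set bit in the same position, no carry can occur, so ADD and OR give the same result. The following sketch illustrates the check. The helper names and the example ranges (a thread id below 64, a row index below 16 shifted left by 6) are assumptions for illustration, not the backend's actual analysis.

```python
def possible_bits(max_value: int, shift: int = 0) -> int:
    """Mask of bit positions that may be set in (v << shift) for 0 <= v <= max_value."""
    return ((1 << max_value.bit_length()) - 1) << shift


def can_replace_add_with_or(mask_a: int, mask_b: int) -> bool:
    """ADD equals OR when the operands' possible bit ranges do not overlap."""
    return (mask_a & mask_b) == 0


# tid_x < 64 may occupy bits 0..5; a row index < 16 shifted left by 6 may occupy bits 6..9.
tid_mask = possible_bits(63)
row_mask = possible_bits(15, shift=6)
print(can_replace_add_with_or(tid_mask, row_mask))

# Exhaustive check of the claim over the assumed ranges.
print(all((t + (r << 6)) == (t | (r << 6)) for t in range(64) for r in range(16)))
```

With a shift of 4 instead of 6 the ranges overlap in bits 4 and 5, the check fails, and the ADD would have to stay.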

Key documentation updates:
- Add KernelCompilationContext and KernelIRExprEmitter sections
- Document scoped CSE and loop-invariant detection
- Document algebraic simplification with symbol bounds
- Document bit range analysis for OR optimization
- Update liveness analysis to describe CFG-based backward dataflow
- Add loop handling and back-edge detection documentation
- Document loop SGPR reservation (s24+)
- Add peephole optimization section (instruction fusion)
- Document precise hazard mitigation
- Add new debugging environment variables
- Update future work section with three-operand fusion
- Remove references to legacy streaming allocation

Signed-off-by: Harsh Menon <[email protected]>
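For readers unfamiliar with the CFG-based backward dataflow mentioned above, here is a minimal textbook sketch of liveness analysis iterated to a fixed point. It is not the backend's implementation; the block names, the use/def sets, and the function name are made up for illustration. It shows why values carried around a loop back edge stay live at the end of the loop body.

```python
def compute_liveness(blocks, successors):
    """Classic backward liveness.

    blocks: name -> (use, defs), where use holds values read before any
            definition in the block and defs holds values defined in it.
    successors: name -> list of successor block names.
    """
    live_in = {b: set() for b in blocks}
    live_out = {b: set() for b in blocks}
    changed = True
    while changed:  # iterate until nothing changes (fixed point)
        changed = False
        for b in reversed(list(blocks)):
            use, defs = blocks[b]
            out = set()
            for s in successors.get(b, []):
                out |= live_in[s]          # live-out is the union of successor live-ins
            new_in = use | (out - defs)    # live-in = use + (live-out - def)
            if out != live_out[b] or new_in != live_in[b]:
                live_out[b], live_in[b] = out, new_in
                changed = True
    return live_in, live_out


# Hypothetical three-block kernel: entry -> loop -> exit, with a back edge loop -> loop.
blocks = {
    "entry": (set(), {"acc", "i"}),
    "loop": ({"acc", "i", "n"}, {"acc", "i"}),
    "exit": ({"acc"}, set()),
}
successors = {"entry": ["loop"], "loop": ["loop", "exit"], "exit": []}

live_in, live_out = compute_liveness(blocks, successors)
print(sorted(live_out["loop"]))
print(sorted(live_in["entry"]))
```

Because the loop block is its own successor, acc, i, and n remain in its live-out set, so an allocator consuming these sets would not reuse their registers inside the loop.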

Dead code removed:
- kernel_emitter.py (710 lines) - entire file deleted as it was replaced by KernelIRExprEmitter in kernel_pipeline.py
- emit_expression_asm and _emit_thread_id_expression from utils.py (~82 lines)
- compute_voffset_and_instoffset_* methods from asm_emitter.py (~117 lines)

These were legacy code paths that are no longer called since the kernel IR compilation path is now the only path.

- Before (kernel_*.py total): 5933 lines
- After (kernel_*.py total): 5223 lines
- Net reduction: 710 lines from kernel_*.py files
- Additional reduction in utils.py and asm_emitter.py: ~199 lines
- Total reduction: ~909 lines of dead code removed

All 33 e2e tests pass after this cleanup.

Signed-off-by: Harsh Menon <[email protected]>

This commit implements Option B (Single-Path Architecture) for instruction emission:

New architecture:
- MetadataEmitter.emit_prologue() -> assembler directives
- KernelCompilationContext -> kernel IR (virtual registers)
- KernelGenerator -> assembly text (physical registers)
- MetadataEmitter.emit_epilogue() -> YAML metadata

Key changes:
1. InstructionFormatter (NEW):
   - Single point for ALL physical instruction formatting
   - Uses InstructionRegistry (YAML) for instruction definitions
   - Handles special formatting (buffer ops, waitcnt, DS offsets)
   - Used by both UnifiedEmitter and KernelGenerator
2. MetadataEmitter (NEW):
   - Handles kernel prologue/epilogue metadata directives
   - .amdgcn_target, .amdhsa_kernel, .amdgpu_metadata
   - Resource patching (VGPR/SGPR/LDS after allocation)
3. KernelGenerator refactored:
   - Now uses InstructionFormatter instead of inline f-strings
   - Single point for physical instruction rendering
   - Pseudo-instruction handlers use formatter API
4. KernelCompilationContext enhanced:
   - emit_kernargs() - emits s_load_dwordx2 at kernel start
   - finalize() - emits s_endpgm at kernel end
   - Kernarg pair tracking for SRD setup

All 33 e2e tests pass.

Signed-off-by: Harsh Menon <[email protected]>

This commit implements Option B (Single-Path Architecture) for instruction emission:

New architecture:
- MetadataEmitter.emit_prologue() -> assembler directives
- KernelCompilationContext -> kernel IR (virtual registers)
- KernelGenerator -> assembly text (physical registers)
- MetadataEmitter.emit_epilogue() -> YAML metadata

Key changes:
1. InstructionFormatter (instruction_formatter.py):
   - Single point for ALL physical instruction formatting
   - Uses InstructionRegistry (YAML) for instruction definitions
   - Handles special formatting (buffer ops, waitcnt, DS offsets)
   - Used by both UnifiedEmitter and KernelGenerator
2. MetadataEmitter (metadata_emitter.py):
   - Handles kernel prologue/epilogue metadata directives
   - .amdgcn_target, .amdhsa_kernel, .amdgpu_metadata
   - Resource patching (VGPR/SGPR/LDS after allocation)
3. KernelGenerator refactored:
   - Now uses InstructionFormatter instead of inline f-strings
   - Single point for physical instruction rendering
   - Pseudo-instruction handlers use formatter API
4. KernelCompilationContext enhanced:
   - emit_kernargs() - emits s_load_dwordx2 at kernel start
   - finalize() - emits s_endpgm at kernel end (if not already there)
   - Kernarg pair tracking for SRD setup
5. driver.py updated:
   - Uses MetadataEmitter for prologue/epilogue
   - Uses KernelCompilationContext for kernel body
   - Single-path flow: metadata -> kernel IR -> metadata

All 25 asm unit tests pass.

Signed-off-by: Harsh Menon <[email protected]>

- Refactored KernelModuleCompiler to use MetadataEmitter for prologue/epilogue
- Removed dead code from AsmEmitter (~190 lines):
  - Loop methods: begin_loop, emit_loop_header, emit_loop_latch, end_loop
  - Kernarg/SRD: emit_kernargs, ensure_srd_for_subspan
  - Load/Store: chunk_offsets, emit_load, emit_store_* methods
- Added _NoOpTicketing class and ticketing property to KernelCompilationContext
- Updated handlers.py to use kernel_ctx.ticketing when available
- Added KernelModuleCompiler to package exports

All 25 unit tests and 33 e2e tests pass.

Signed-off-by: Harsh Menon <[email protected]>

…icy

- Add mlir_analysis.py and reuse it across driver/KernelModuleCompiler/AsmEmitter
- Make MetadataEmitter the source of prologue+epilogue metadata; AsmEmitter delegates
- Unify normalize_wg_size and register granularity rules
- Add abi.py for system_vgpr_workitem_id policy and tests for helpers

Tests: unit(asm/) + e2e(asm_backend_test.py)

Signed-off-by: Harsh Menon <[email protected]>

Signed-off-by: Harsh Menon <[email protected]>

…ker kernel_ctx-only Signed-off-by: Harsh Menon <[email protected]>

…tigation; shrink AsmEmitter Signed-off-by: Harsh Menon <[email protected]>

Signed-off-by: Harsh Menon <[email protected]>

…plication

High priority fixes:
- Fix regalloc peak VGPR/SGPR accounting bug (vreg_map keys are ints, not KVReg)
- Make kernel selection policy explicit with documented constants

Medium priority fixes:
- Add strict operand validation to InstructionFormatter (behind WAVE_STRICT_FORMATTER)
- Unify duplicate InstructionCategory enums (single definition in instruction_categories.py)
- Replace star imports with explicit imports in kernel pipeline modules
- Clean up untracked legacy files (linear_scan_allocator.py, liveness.py, sympy_fused_ops.py)
- Improve kernel_pipeline_shared.py __all__ to only export public symbols

All 30 unit tests and 33 e2e tests pass.

Signed-off-by: Harsh Menon <[email protected]>

Testing improvements:
- Add regression test for regalloc peak VGPR/SGPR accounting with ranges
- Add unit tests for should_skip_function kernel selection policy
- Enable strict formatter validation by default in ASM tests (conftest.py)

Script cleanup (compare_backends.py):
- Move os.environ mutation from import-time to main()
- Replace wildcard import with explicit imports from global_symbols
- Add documentation for external tool requirements

All 38 unit tests and 33 e2e tests pass.

Signed-off-by: Harsh Menon <[email protected]>
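The script cleanup item about os.environ follows a common pattern: a module that mutates the environment at import time affects every process that imports it, while doing so inside main() confines the side effect to direct execution. A minimal sketch of that pattern follows. It is not the contents of compare_backends.py; the function name configure_environment and the choice of WAVE_STRICT_FORMATTER as the variable being set are assumptions for illustration.

```python
import os


def configure_environment(environ=os.environ) -> None:
    """Apply script defaults; called from main(), never at import time."""
    # setdefault keeps any value the user already exported.
    environ.setdefault("WAVE_STRICT_FORMATTER", "1")  # assumed variable choice


def main() -> None:
    configure_environment()
    print("WAVE_STRICT_FORMATTER =", os.environ["WAVE_STRICT_FORMATTER"])


if __name__ == "__main__":
    main()
```

Passing the mapping as a parameter also lets a test exercise the defaults against a plain dict without touching the real process environment.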

Changes:
- Update Srd127_96 constant from hex 0x20000 to decimal 131072
- Simplify s_load_dwordx2 patterns to accept decimal offsets
- Remove overly specific SRD setup comments and patterns
- Simplify MFMA and buffer store patterns to use flexible matchers
- Update test_gemm_gather_to_lds to focus on loop structure

The CHECK patterns now match the actual assembly output from the kernel IR compilation path.

Signed-off-by: Harsh Menon <[email protected]>

- Remove asm_emitter.py reference (file was deleted)
- Add new components: instruction_formatter.py, metadata_emitter.py, kernel_module_compiler.py, kernel_compilation_context.py, kernel_passes.py, mlir_analysis.py, ticketing.py, abi.py
- Document handler file split (handlers_memory.py, handlers_control.py, etc.)
- Update architecture diagram in register_allocation.rst
- Update module references to point to correct split files

Signed-off-by: Harsh Menon <[email protected]>

harsh-nod requested review from panditsa and xintin December 25, 2025 07:47

harsh-nod force-pushed the reduce_regs branch from ee43673 to d14cb60 Compare December 25, 2025 07:48

harsh-nod added 27 commits December 24, 2025 23:54

Update docs: KernelEmitter is now the default

60b9bd8

Signed-off-by: Harsh Menon <[email protected]>

Remove unused kernel_expr_bridge.py

abffcb1

This file was designed as a migration bridge between ExprEmitter and kernel-level IR. Since ExprEmitter has been removed and KernelEmitter is now the only emitter, this bridge is no longer needed.

Signed-off-by: Harsh Menon <[email protected]>

harsh-nod added 29 commits December 24, 2025 23:54

Remove temporary/generated files from git tracking

131d743

Removed:
- asm_asm.s, asm_disasm.s, llvm_asm.s, llvm_disasm.s (generated assembly files)
- comparison_report.txt (temporary report)
- wave_lang/tests/*/__init__.py (empty init files)

Signed-off-by: Harsh Menon <[email protected]>

Clean up unused imports in kernel IR tests

dfb021c

Signed-off-by: Harsh Menon <[email protected]>

Kernel IR: migrate ticketing waitcnt placement and simplify AsmEmitter

c54b925

Signed-off-by: Harsh Menon <[email protected]>

Enable kernel ticketing waitcnt placement by default

03114c1

Signed-off-by: Harsh Menon <[email protected]>

Deduplicate MLIR->ASM pipeline: canonical KernelModuleCompiler; IRWalker kernel_ctx-only

1df703e

Signed-off-by: Harsh Menon <[email protected]>

Remove translation_info parsing duplication; keep kernel IR hazard mitigation; shrink AsmEmitter

1bfaf57

Signed-off-by: Harsh Menon <[email protected]>

Remove AsmEmitter; use KernelModuleCompiler everywhere

5f1703a

Signed-off-by: Harsh Menon <[email protected]>

Remove unused legacy ASM hazard/allocator modules

5cd1485

Signed-off-by: Harsh Menon <[email protected]>

Remove ASM debug-dump hooks from compile.py

1952527

Signed-off-by: Harsh Menon <[email protected]>

Split kernel_pipeline and handlers into <1000 LOC modules

5030f0e

Signed-off-by: Harsh Menon <[email protected]>

Rename split modules: use descriptive names instead of mixin

e6432bf

Signed-off-by: Harsh Menon <[email protected]>

Replace hardcoded defaults with explicit errors for required config

f4e64f6

Signed-off-by: Harsh Menon <[email protected]>

harsh-nod force-pushed the reduce_regs branch from d14cb60 to c51ab43 Compare December 25, 2025 07:54


