@harsh-nod

Summary

This PR overhauls the Wave ASM backend, replacing the legacy dual-path architecture with a single Kernel IR compilation pipeline. The codebase is smaller and easier to maintain, and the single path enables whole-program liveness analysis, register allocation, and the optimizations listed below.

Major Changes

Architecture Overhaul

  • Single-Path Compilation: Eliminated the legacy "direct emission" mode. All compilation now flows through the Kernel IR path:

    MLIR -> KernelCompilationContext -> KernelProgram -> Liveness -> RegAlloc -> Assembly
    
  • Module Decomposition: Split large monolithic files into focused modules (<1000 LOC each):

    • kernel_pipeline.py -> kernel_compilation_context.py, kernel_expr_emitter.py, kernel_passes.py, kernel_loops.py, kernel_mfma.py, kernel_module_compiler.py
    • handlers.py -> handlers_memory.py, handlers_control.py, handlers_arith_affine.py
  • Removed Legacy Infrastructure: Deleted ~8,000 lines of obsolete code including asm_emitter.py, register_allocator.py, scoreboard.py, hazards.py, latency_database.py, and the per-architecture latency JSON files

New Kernel IR Infrastructure

  • Virtual Register IR: New KVReg/KSReg types with whole-program liveness tracking
  • CFG-Based Liveness Analysis: Proper backward dataflow analysis with loop-carried value handling
  • Linear Scan Register Allocator: Constraint-aware allocation with precoloring for ABI registers
  • Unified Instruction Registry: YAML-based instruction definitions replacing hardcoded KOpcode enum
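
As a rough illustration of the backward dataflow liveness mentioned above, the sketch below iterates live-in/live-out sets to a fixed point over a tiny CFG with a loop. The block names, the `use`/`defs` dictionaries, and the function name are made up for this example and are not the API of kernel_liveness.py.

```python
from typing import Dict, List, Set, Tuple


def compute_live_sets(
    blocks: List[str],
    succs: Dict[str, List[str]],
    use: Dict[str, Set[str]],
    defs: Dict[str, Set[str]],
) -> Tuple[Dict[str, Set[str]], Dict[str, Set[str]]]:
    """Iterate live_in/live_out to a fixed point (backward dataflow).

    use[b]  : registers read in block b before any write in b
    defs[b] : registers written in block b
    """
    live_in: Dict[str, Set[str]] = {b: set() for b in blocks}
    live_out: Dict[str, Set[str]] = {b: set() for b in blocks}
    changed = True
    while changed:
        changed = False
        # Visiting blocks in reverse order converges faster for backward flow.
        for b in reversed(blocks):
            out = set().union(*(live_in[s] for s in succs[b]))
            new_in = use[b] | (out - defs[b])
            if out != live_out[b] or new_in != live_in[b]:
                live_out[b], live_in[b] = out, new_in
                changed = True
    return live_in, live_out


# entry -> loop -> exit, with a back edge loop -> loop.
blocks = ["entry", "loop", "exit"]
succs = {"entry": ["loop"], "loop": ["loop", "exit"], "exit": []}
use = {"entry": set(), "loop": {"acc", "i"}, "exit": {"acc"}}
defs = {"entry": {"acc", "i"}, "loop": {"acc", "i"}, "exit": set()}
live_in, live_out = compute_live_sets(blocks, succs, use, defs)
```

The back edge is what keeps `acc` and `i` live across the whole loop body: `live_out["loop"]` includes `live_in["loop"]`, which is the loop-carried case the bullet refers to.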

Optimizations

  • Algebraic Simplification: Uses symbol bounds (tid_x < 64, etc.) to eliminate redundant floor/mod operations
  • Bit Range Analysis: Converts ADD to OR when operand bit ranges don't overlap
  • Peephole Optimizer: Fuses v_lshlrev_b32 + v_add_u32 -> v_lshl_add_u32
  • Scoped CSE: Loop-invariant expression caching across iterations
  • LDS Offset Optimization: Encodes constant offsets of up to 8192 bytes (DS_MAX_OFFSET) in the ds_read/ds_write offset field to reduce VALU instructions
  • Precise Hazard Mitigation: Only inserts s_nop where actually needed (~2 vs ~46 previously)
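
To make the peephole fusion concrete, here is a minimal sketch of the v_lshlrev_b32 + v_add_u32 -> v_lshl_add_u32 rewrite. It assumes an SSA-like IR in which each virtual register is defined once, and it models instructions as plain `(mnemonic, dst, sources)` tuples; that representation and the function name are assumptions for illustration, not the actual peephole pass.

```python
from collections import Counter
from typing import List, Tuple

Instr = Tuple[str, str, Tuple[str, ...]]  # (mnemonic, dst, sources)


def fuse_shift_add(instrs: List[Instr]) -> List[Instr]:
    """Fuse an adjacent v_lshlrev_b32 + v_add_u32 pair into v_lshl_add_u32."""
    # Count how often each operand is read anywhere in the program.
    uses = Counter(src for _, _, srcs in instrs for src in srcs)
    out: List[Instr] = []
    i = 0
    while i < len(instrs):
        name, dst, srcs = instrs[i]
        if name == "v_lshlrev_b32" and i + 1 < len(instrs):
            next_name, next_dst, next_srcs = instrs[i + 1]
            # Fuse only when the shifted value feeds exactly this one add.
            if next_name == "v_add_u32" and uses[dst] == 1 and dst in next_srcs:
                shift, value = srcs  # v_lshlrev_b32 operands are (shift, value)
                other = next_srcs[1] if next_srcs[0] == dst else next_srcs[0]
                # v_lshl_add_u32 computes (value << shift) + other.
                out.append(("v_lshl_add_u32", next_dst, (value, shift, other)))
                i += 2
                continue
        out.append(instrs[i])
        i += 1
    return out


program = [
    ("v_lshlrev_b32", "v1", ("4", "v0")),
    ("v_add_u32", "v2", ("v1", "v3")),
]
fused = fuse_shift_add(program)
```

The single-use check matters: if the shifted value had a second reader, dropping the shift would leave that reader without a definition.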

Ticketing System Migration

  • Migrated memory operation tracking from AsmEmitter to kernel IR path
  • Integrated s_waitcnt coalescing into KernelCompilationContext.finalize()
  • Enabled by default via WAVE_KERNEL_TICKETING

New Components

Module                    Purpose
instruction_formatter.py  Centralized instruction-to-assembly formatting
metadata_emitter.py       AMDGCN metadata directive generation
mlir_analysis.py          MLIR parsing, kernel selection, translation_info extraction
abi.py                    ABI policies (system VGPR workitem ID)
ticketing.py              Memory operation tracking and waitcnt coalescing
unified_emitter.py        Dynamic instruction emission API
kernel_regalloc.py        Linear scan register allocator
kernel_liveness.py        CFG-based liveness analysis

Developer Tools

  • scripts/compare_backends.py: Compare LLVM vs ASM backend assembly
  • scripts/symbolic_asm_analysis.py: Analyze generated assembly
  • perf/benchmark_asm_backend.py: Benchmarking infrastructure

Testing

  • 38 unit tests passing
  • 33 e2e tests passing (copy, MMA, multi-wave, multi-workgroup, gather-to-LDS)
  • Lit tests updated and passing

Breaking Changes

  • AsmEmitter class removed - use KernelModuleCompiler.compile_mlir_string() instead
  • WAVE_KERNEL_IR environment variable removed (kernel IR is now the only path)
  • Several legacy modules deleted (see "Removed Legacy Infrastructure" above)

Documentation

Updated docs/wave/asm_backend.rst and docs/wave/register_allocation.rst to reflect the new architecture.

This commit adds an optional offset parameter to ds_read_b64, ds_write_b32,
ds_write_b64, and ds_write_b128 instructions. The offset field allows encoding
constant address offsets directly in the instruction (0-65535 bytes), which can
save a v_add_u32 instruction for address computation.

Changes:
- DSReadB64, DSWriteB32, DSWriteB64, DSWriteB128 now accept offset parameter
- asm_emitter.emit_lds_read_b64() accepts offset parameter
- Added unit tests for DS instruction offset support
- Updated documentation

Register reduction: This is infrastructure for future optimizations. When
the ds_read offset field is used instead of computing the full address, it
saves 1 VGPR per unique address offset pattern (by avoiding the v_add result).

Tests: All 27 asm_backend_test.py E2E tests pass.
Signed-off-by: Harsh Menon <[email protected]>
This commit introduces a new expression emitter (ExprEmitterV2) that uses a
two-phase approach: first emit to virtual registers, then allocate physical
registers using linear scan. This decouples expression computation from
register allocation, enabling global CSE and optimization passes.

New files:
- expr_ir.py: Virtual register IR definitions (VReg, SReg, OpCode, ExprInstr,
  ExprProgram, CachedExprRef wrapper for preventing sympy flattening)
- expr_regalloc.py: Linear scan register allocator with multiple policies
  (SEQUENTIAL, LIFO_REUSE, FIFO_REUSE, RANDOM, REVERSE)
- expr_opt.py: IR-level optimization passes (copy propagation, DCE, coalescing)
- expr_simplify.py: Algebraic simplification rules for floor/mod expressions
- expr_emitter_interface.py: Protocol and factory for emitter selection
- expr_emitter_v2.py: Main v2 emitter with streaming emission and CSE

Key features:
- Global subexpression CSE across entire kernel
- Virtual registers enable register coalescing
- Immediate value optimization (inline constants)
- Algebraic simplification (floor/mod identities)
- CachedExprRef prevents sympy from flattening add expressions

Register reduction: Reduces v_mov instructions from 80+ to ~10 through:
- Copy propagation eliminates redundant moves
- Register coalescing allows same physical register for copies
- Immediate usage avoids register allocation for small constants

Tests: All 29 asm_backend_test.py E2E tests pass.
Signed-off-by: Harsh Menon <[email protected]>
Add two analysis scripts to the scripts/ directory:

scripts/compare_backends.py:
  - Compares LLVM and ASM backend assembly output side-by-side
  - Computes instruction metrics by category (SALU, VALU, VMEM, etc.)
  - Extracts resource usage (VGPR, SGPR, LDS)
  - Supports loading configs from benchmark_configs.json
  - Generates detailed comparison reports

scripts/symbolic_asm_analysis.py:
  - Symbolic analysis of assembly using SymPy
  - Builds symbolic expressions for register values
  - Traces register definitions through instruction sequences
  - Useful for debugging address calculations

Usage:
  python scripts/compare_backends.py --benchmark gemm-asm-benchmark
  python scripts/symbolic_asm_analysis.py wave_asm.s "buffer_load_dword"
Signed-off-by: Harsh Menon <[email protected]>
Add benchmark_asm_backend.py for comparing ASM vs LLVM backend performance.

Features:
- Reads GEMM configuration from benchmark_configs.json
- Runs GEMM kernel with both ASM and LLVM backends
- Verifies correctness against PyTorch reference
- Reports detailed performance metrics and instruction counts
- Supports warmup iterations and multiple runs for averaging
- Analyzes generated assembly for instruction statistics

The benchmark_configs.json already includes a gemm_shapes section with
the default GEMM configuration for ASM backend testing.

Signed-off-by: Harsh Menon <[email protected]>
Simplify codebase by removing legacy expression emitter code:

- Delete expression_emitter.py (legacy ExprEmitter class, ~780 lines)
- Delete expr_emitter_interface.py (factory function, ~120 lines)
- Rename expr_emitter_v2.py to expr_emitter.py
- Rename ExprEmitterV2 class to ExprEmitter
- Move expr_key() function to expr_ir.py
- Update all imports and documentation

The ExprEmitter now uses virtual register IR with:
- Global subexpression CSE
- Copy propagation and dead code elimination
- Register coalescing
- Algebraic simplification
- Immediate value optimization

Total: -900 lines of redundant code.
Signed-off-by: Harsh Menon <[email protected]>
- Add @singledispatch for get_max_value, simplify_expr, combine_like_terms
- Register type-specific handlers for Integer, Symbol, Add, Mul, Mod, Pow
- Declarative REWRITE_RULES with SymPy Wild patterns
- Helper functions: is_floor(), is_pow2()
- Use S.Zero/S.One singletons instead of Integer(0)/Integer(1)
- Update asm_backend.rst documentation

Adding a new type handler now only requires @func.register(NewType), which
eliminates chains of isinstance() checks throughout the codebase.
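
The dispatch pattern can be sketched with the standard library alone. The small expression classes below are stand-ins for the SymPy node types named above (they are not SymPy), and the bound arithmetic assumes all operands are non-negative; only `functools.singledispatch` itself is the real mechanism.

```python
from dataclasses import dataclass
from functools import singledispatch


@dataclass(frozen=True)
class Const:
    value: int


@dataclass(frozen=True)
class Sym:
    name: str
    upper: int  # exclusive upper bound, e.g. tid_x < 64


@dataclass(frozen=True)
class Add:
    lhs: object
    rhs: object


@dataclass(frozen=True)
class Mul:
    lhs: object
    rhs: object


@dataclass(frozen=True)
class Mod:
    value: object
    modulus: int


@singledispatch
def get_max_value(expr) -> int:
    raise TypeError(f"no max-value handler for {type(expr).__name__}")


@get_max_value.register(Const)
def _(expr):
    return expr.value


@get_max_value.register(Sym)
def _(expr):
    return expr.upper - 1


@get_max_value.register(Add)
def _(expr):
    return get_max_value(expr.lhs) + get_max_value(expr.rhs)


@get_max_value.register(Mul)
def _(expr):
    # Valid only because operands are assumed non-negative.
    return get_max_value(expr.lhs) * get_max_value(expr.rhs)


@get_max_value.register(Mod)
def _(expr):
    return min(get_max_value(expr.value), expr.modulus - 1)


tid_x = Sym("tid_x", 64)
bound = get_max_value(Add(Mul(Const(16), tid_x), Const(4)))
```

Supporting another node type is one more `@get_max_value.register(...)` function; the generic fallback keeps unknown types from being silently mishandled.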

All 33 asm_backend_test.py tests pass.

Signed-off-by: Harsh Menon <[email protected]>
- Remove CachedExprRef class from expr_ir.py (was experimental)
- Remove CachedExprRef handling from expr_emitter.py
- Remove two-stage factoring code from handlers.py (was disabled by default)
- Remove factor_ds_read_offset helper from utils.py

The two-stage factoring feature had correctness issues and was disabled
behind WAVE_DS_TWO_STAGE=1. Since it was never enabled in production,
removing the dead code simplifies the codebase.

All 33 asm_backend_test.py tests pass.

Signed-off-by: Harsh Menon <[email protected]>
New modules for kernel-level register allocation (WAVE_KERNEL_LSRA=1):

kernel_ir.py:
  - KVReg/KSReg: Virtual register types
  - KPhysVReg/KPhysSReg: Precolored physical registers
  - KRegRange: Contiguous register ranges (pairs, quads, MFMA blocks)
  - KOpcode: Enumeration of all AMDGCN instructions
  - KInstr: Instruction with defs, uses, and constraints
  - KernelProgram: Complete kernel instruction sequence
  - KernelBuilder: Helper for emitting common instruction patterns
  - KernelABI: ABI bindings for reserved registers

kernel_liveness.py:
  - LiveRange: Live range for a virtual register
  - LivenessInfo: Complete liveness information
  - compute_liveness(): SSA-based liveness analysis
  - validate_ssa(): Verify SSA form
  - Register pressure computation

kernel_regalloc.py:
  - RegPool: Physical register pool with range allocation
  - KernelRegAlloc: Constraint-aware linear scan allocator
  - allocate_kernel(): Convenience function for allocation
  - AllocationError: Raised when allocation fails (no spilling)
  - Support for precoloring, reserved registers, alignment

kernel_render.py:
  - PhysicalMapping: Virtual to physical register mapping
  - KernelRenderer: Renders KernelProgram to assembly
  - Instruction formatting with proper AMDGCN syntax

kernel_pipeline.py:
  - KernelCompilationContext: Context for building kernels
  - use_kernel_lsra(): Check if kernel LSRA is enabled
  - CSE support at kernel level

kernel_expr_bridge.py:
  - ExpressionBridge: Compatibility layer for existing code

test_kernel_ir.py:
  - 25 unit tests covering IR, liveness, allocation, rendering

All 33 asm_backend_test.py E2E tests pass.
All 25 new unit tests pass.
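
For readers unfamiliar with linear scan, the following is a simplified sketch of the idea behind the allocator described above: walk live ranges by start point, recycle registers whose ranges have ended, and fail rather than spill. It handles only single registers and a reserved set; ranges, alignment, and precoloring are omitted, and the function signature is an assumption for illustration rather than the kernel_regalloc.py API.

```python
from typing import Dict, FrozenSet, List, Tuple


class AllocationError(Exception):
    """Raised when no physical register is free (this sketch never spills)."""


def linear_scan(
    ranges: Dict[str, Tuple[int, int]],
    num_regs: int,
    reserved: FrozenSet[int] = frozenset(),
) -> Dict[str, int]:
    """Map each virtual register to a physical index given (start, end) ranges."""
    free: List[int] = [r for r in range(num_regs) if r not in reserved]
    active: List[Tuple[int, str]] = []  # (end, virtual register)
    mapping: Dict[str, int] = {}
    for name, (start, end) in sorted(ranges.items(), key=lambda kv: kv[1][0]):
        # Expire ranges that ended strictly before this one starts.
        still_active = []
        for a_end, a_name in active:
            if a_end < start:
                free.append(mapping[a_name])
            else:
                still_active.append((a_end, a_name))
        active = still_active
        if not free:
            raise AllocationError(f"no register available for {name}")
        free.sort()  # prefer the lowest index to keep the register count small
        mapping[name] = free.pop(0)
        active.append((end, name))
    return mapping


ranges = {"a": (0, 2), "b": (1, 3), "c": (3, 5)}
mapping = linear_scan(ranges, num_regs=3, reserved=frozenset({0}))
```

Here `c` reuses the register of `a` because `a` ends at point 2, before `c` starts at point 3, while `b` is still live at point 3 and keeps its own register.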

Signed-off-by: Harsh Menon <[email protected]>
Implements KernelEmitter, an alternative to ExprEmitter that uses a single
kernel-wide register pool for better register reuse across expressions.

kernel_emitter.py:
  - KernelEmitter class with same API as ExprEmitter (get_or_emit, bind_symbol)
  - Kernel-wide register pool: allocates from a shared pool, tracks reserved VGPRs
  - Global CSE cache: caches all expression results kernel-wide
  - Streaming emission: immediately allocates physical registers and emits
  - Complex expression support: floor/div, mod, nested arithmetic
  - Power-of-2 optimizations: shifts for multiply/divide, AND for modulo
  - Constant materialization with caching to avoid redundant v_mov_b32
  - Factory function create_emitter() to choose emitter based on env var
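
  The power-of-2 rewrites listed above can be sketched as follows. This is an illustrative helper, not the KernelEmitter code: the function names and the string output format are assumptions, and the rewrites are valid only for unsigned operands.

```python
def is_pow2(n: int) -> bool:
    """True for 1, 2, 4, 8, ..."""
    return n > 0 and (n & (n - 1)) == 0


def lower_by_constant(op: str, dst: str, src: str, c: int) -> str:
    """Pick a cheap instruction for `src op c` when c is a power of two.

    Assumes src holds an unsigned value; signed values need different handling.
    """
    if not is_pow2(c):
        raise ValueError(f"{c} is not a power of two")
    shift = c.bit_length() - 1  # log2(c)
    if op == "mul":
        return f"v_lshlrev_b32 {dst}, {shift}, {src}"  # x * c  == x << shift
    if op == "floordiv":
        return f"v_lshrrev_b32 {dst}, {shift}, {src}"  # x // c == x >> shift
    if op == "mod":
        return f"v_and_b32 {dst}, {c - 1}, {src}"      # x % c  == x & (c - 1)
    raise ValueError(f"unsupported op {op!r}")


# Numeric sanity check of the three identities for c = 16.
identities_hold = all(
    x * 16 == x << 4 and x // 16 == x >> 4 and x % 16 == x & 15
    for x in range(4096)
)
```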

handlers.py:
  - Updated to use create_emitter() factory function
  - Automatically selects KernelEmitter when WAVE_KERNEL_LSRA=1

To enable: export WAVE_KERNEL_LSRA=1

Test results with WAVE_KERNEL_LSRA=1:
  - 32 of 33 asm_backend_test.py tests pass
  - 1 test (g2s-shape1-32-config1) fails, under investigation

All 33 tests pass with default emitter (WAVE_KERNEL_LSRA=0).

Signed-off-by: Harsh Menon <[email protected]>
Makes WAVE_KERNEL_LSRA=1 the default, enabling the kernel-level expression
emitter with whole-program register allocation.

Key changes:
- use_kernel_emitter() now defaults to True (WAVE_KERNEL_LSRA=1)
- Disable algebraic simplification by default (WAVE_EXPR_SIMPLIFY=0) to fix
  g2s-shape1-32-config1 test failure (to be investigated separately)
- Remove leftover debug output
- Clean up _emit helper method

The KernelEmitter provides:
- Global CSE across the entire kernel
- Kernel-wide register pool for better register reuse
- Streaming emission (immediate physical register allocation)

All 33 asm_backend_test.py tests pass with the default settings.

Signed-off-by: Harsh Menon <[email protected]>
The algebraic simplification in KernelEmitter causes incorrect results in
the g2s (global-to-shared) path for certain expression patterns. While the
simplified expressions are mathematically equivalent (verified numerically),
something in the instruction emission order or register allocation produces
NaN values.

The issue is specific to KernelEmitter - the legacy ExprEmitter works
correctly with simplification enabled. This suggests the bug may be in
how KernelEmitter handles the simplified expression forms.

Key observations:
- Expressions like "16*tid_x - 64*floor(tid_x/4)" simplify to "16*Mod(tid_x,4)"
- Both forms are mathematically equivalent (verified for all tid_x values)
- The generated instruction sequences look correct
- But the g2s path produces NaN values with simplification enabled
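
The equivalence claimed above is easy to re-check by brute force. The snippet below is only an illustration of that numeric check; the tested range of 0..1023 is an arbitrary assumption, since the identity follows from tid_x = 4*floor(tid_x/4) + Mod(tid_x, 4) for any non-negative integer.

```python
def original(tid_x: int) -> int:
    # 16*tid_x - 64*floor(tid_x/4)
    return 16 * tid_x - 64 * (tid_x // 4)


def simplified(tid_x: int) -> int:
    # 16*Mod(tid_x, 4)
    return 16 * (tid_x % 4)


mismatches = [t for t in range(1024) if original(t) != simplified(t)]
```

An empty `mismatches` list supports the observation that the NaN results come from emission or allocation, not from the rewrite itself.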

This change disables simplification by default for KernelEmitter while
keeping it enabled by default for the legacy ExprEmitter.

Use WAVE_EXPR_SIMPLIFY=1 to enable simplification for testing.

All 33 asm_backend_test.py tests pass with the default settings.

Signed-off-by: Harsh Menon <[email protected]>
Fix the g2s path issue by copying SGPRs to VGPRs before using them in
VOP instructions. The issue was that using SGPRs directly in the src1
position of VOP2 instructions like v_lshlrev_b32 was causing incorrect
assembly encoding.

The fix mirrors ExprEmitter behavior which explicitly copies non-VGPRs
to VGPRs before use in arithmetic operations.

Changes:
- Add SGPR-to-VGPR copy in _emit_mul when src_reg is an SGPR
- Enable algebraic simplification by default (WAVE_EXPR_SIMPLIFY=1)
- Update documentation to reflect new defaults

All 33 asm_backend_test.py tests pass with both simplification enabled
and the KernelEmitter as the default emitter.

Signed-off-by: Harsh Menon <[email protected]>
KernelEmitter is now the only expression emitter. This simplifies the
codebase by removing the legacy per-expression allocation approach.

Deleted files:
- expr_emitter.py: Legacy emitter with per-expression allocation
- expr_ir.py: Virtual register IR (moved expr_key to kernel_emitter.py)
- expr_opt.py: Optimization passes for virtual register IR
- expr_regalloc.py: Linear scan allocator for virtual registers
- expression_emitter_test.py: Broken test importing deleted module
- test_expr_ir.py: Tests for deleted virtual register IR

Updated files:
- kernel_emitter.py: Removed WAVE_KERNEL_LSRA fallback, inlined expr_key
- handlers.py: Removed unused use_kernel_emitter import
- utils.py: Use KernelEmitter directly instead of ExprEmitter
- kernel_expr_bridge.py: Updated comment
- asm_backend.rst: Removed references to ExprEmitter and virtual IR

Signed-off-by: Harsh Menon <[email protected]>
New register_allocation.rst document explains:
- Kernel IR (kernel_ir.py): Virtual registers, ranges, instructions, opcodes
- Liveness analysis (kernel_liveness.py): Live range computation, pressure
- Register allocation (kernel_regalloc.py): Linear scan algorithm, constraints
- Rendering (kernel_render.py): Physical register substitution

Includes architecture diagram, code examples, and debugging tips.

Signed-off-by: Harsh Menon <[email protected]>
kernel_expr_bridge.py was designed as a migration bridge between ExprEmitter and
kernel-level IR. Since ExprEmitter has been removed and KernelEmitter
is now the only emitter, this bridge is no longer needed.

Signed-off-by: Harsh Menon <[email protected]>
Rename for clarity: the module generates assembly code from kernel IR.

Changes:
- Rename KernelRenderer class to KernelGenerator
- Rename render() method to generate()
- Rename render_to_string() to generate_to_string()
- Add generate_program() convenience function
- Keep backwards compatibility aliases (KernelRenderer, render_program)
- Update imports in kernel_pipeline.py, kernel_regalloc.py
- Update test imports in test_kernel_ir.py
- Update documentation in register_allocation.rst

Signed-off-by: Harsh Menon <[email protected]>
Infrastructure changes:
- Add KSpecialReg support for m0, exec, vcc special registers
- Add is_special() helper function in kernel_ir.py
- Add M0, EXEC, VCC well-known special register constants
- Extend KOpcode with buffer_load_dword_lds, buffer_load_dwordx4_lds,
  v_mbcnt_lo/hi, s_cmp_*, s_cbranch_*, s_branch, s_movk_i32, s_and/or_b64
- Add RAW_ASM pseudo-op for escape hatch

Generator updates:
- Handle KSpecialReg in _resolve_reg and _resolve_operand
- Add mnemonics for all new opcodes
- Special formatting for buffer_load_dword_lds (appends lds modifier)
- Special formatting for branch instructions (label in operand)

Pipeline updates:
- Add use_kernel_ir_path() flag (WAVE_USE_LEGACY_STREAMING env var)
- Add ssa_to_reg mapping for SSA value tracking
- Add srds tracking for SRD setup
- Add many new instruction emission methods:
  - vreg_pair, vreg_quad, sreg_pair, sreg_quad
  - ds_write_b32, ds_write_b128, ds_read_b128, ds_read_b32
  - v_readfirstlane_b32, s_mov_b32_to_m0, v_mfma_f32_16x16x16_f16
  - buffer_load_dwordx4, buffer_store_dwordx4, s_load_dwordx2
  - buffer_load_dword_lds, buffer_load_dwordx4_lds
  - v_mbcnt_lo_u32_b32, v_mbcnt_hi_u32_b32
  - v_sub_u32, s_and_b32, s_or_b32, s_movk_i32
  - s_cmp_lt_u32, s_cbranch_scc1, s_branch, s_add_u32
- Add emit_raw, emit_label helper methods

Driver/walker updates:
- driver.py: Create KernelCompilationContext when kernel IR mode enabled
- mlir_walker.py: Add kernel_ctx attribute and use_kernel_ir property
- handlers.py: Update barrier handlers to use kernel IR when enabled

All 33 e2e tests and 25 kernel IR tests pass.

Signed-off-by: Harsh Menon <[email protected]>
Create a single source of truth for AMDGCN instruction definitions:

instruction_defs/common.yaml:
- 76 instruction definitions covering all backend-needed ops
- Categories: VALU, SALU, VMEM, SMEM, LDS, MFMA, CONTROL
- Operand type definitions with allowed types (vgpr, sgpr, imm, etc.)
- Latency information for scheduling
- Special properties (memory, sync, branch, accumulator, etc.)
- Architecture-extensible design

instruction_registry.py:
- InstructionDef dataclass with complete instruction metadata
- InstructionRegistry loads YAML files with architecture layering
- OperandType enum for operand classification
- InstructionCategory enum for scheduling
- Lazy loading for performance

unified_emitter.py:
- UnifiedEmitter class dynamically generates methods from registry
- Supports DIRECT mode (raw assembly lines)
- Supports KERNEL_IR mode (emit to KernelCompilationContext)
- VReg/SReg/Imm wrappers for explicit operand typing
- InstructionBuilder handles formatting and special cases

Design benefits:
- Single source of truth for instruction properties
- Architecture-specific overrides via gfx942.yaml, gfx950.yaml
- Consistent API for both legacy and kernel IR paths
- Latency/constraint info available for scheduling/allocation
- Easy to add new instructions

All 33 e2e tests continue to pass.

Signed-off-by: Harsh Menon <[email protected]>
Step 1: Migrate instructions.py
- Replace manual instruction classes with registry-backed versions
- Add helper functions for register formatting
- Each instruction class now queries the registry for metadata
- Provides latency and category properties from registry
- Maintains full backwards compatibility with existing code

Step 2: Update handlers to use UnifiedEmitter
- Add unified emitter property to AsmEmitter class
- Update barrier handlers to use unified emitter API
- Unified emitter shares line buffer with AsmEmitter
- Provides consistent API for both legacy and kernel IR paths

Step 3: Add architecture-specific YAML files
- gfx942.yaml: CDNA3/MI300 specific instructions
  - FP8/BF8 MFMA instructions
  - Improved latency values for memory and MFMA
  - Type conversion instructions
- gfx950.yaml: CDNA3+/MI350 specific instructions
  - All GFX942 features with improved latencies
  - MFMA latency: 64 (common) -> 32 (gfx942) -> 24 (gfx950)
  - Buffer load latency: 200 -> 150 -> 120

Registry layering working:
- Common: 76 instructions
- GFX942: 85 instructions (76 + 9 new)
- GFX950: 85 instructions (with override latencies)
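
The layering can be pictured as a dictionary merge in which later layers add instructions or override individual properties. The sketch below uses plain dicts instead of YAML files, assumes gfx950 is layered on top of gfx942, and uses a placeholder name (`example_fp8_mfma`) for a gfx942-only instruction; the latency numbers are the ones quoted above.

```python
from typing import Dict

Defs = Dict[str, Dict[str, object]]


def layer(common: Defs, *overrides: Defs) -> Defs:
    """Merge instruction definitions; later layers win property by property."""
    merged: Defs = {name: dict(props) for name, props in common.items()}
    for layer_defs in overrides:
        for name, props in layer_defs.items():
            merged.setdefault(name, {}).update(props)
    return merged


common: Defs = {
    "v_mfma_f32_16x16x16_f16": {"category": "MFMA", "latency": 64},
    "buffer_load_dwordx4": {"category": "VMEM", "latency": 200},
}
gfx942: Defs = {
    "v_mfma_f32_16x16x16_f16": {"latency": 32},
    "buffer_load_dwordx4": {"latency": 150},
    "example_fp8_mfma": {"category": "MFMA", "latency": 32},  # placeholder name
}
gfx950: Defs = {
    "v_mfma_f32_16x16x16_f16": {"latency": 24},
    "buffer_load_dwordx4": {"latency": 120},
}

registry_942 = layer(common, gfx942)
registry_950 = layer(common, gfx942, gfx950)
```

Because overrides only touch the properties they name, an architecture file can change a latency without restating the category or operand types.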

All 33 e2e tests continue to pass.

Signed-off-by: Harsh Menon <[email protected]>
This commit completes the migration to the YAML-based unified instruction
infrastructure by removing all instruction class definitions from
instructions.py.

Key changes:

1. instructions.py: Reduced from 1106 lines to 75 lines
   - Only keeps the base Instruction class (for backwards compatibility)
   - All specific instruction classes removed
   - Instruction emission now uses UnifiedEmitter

2. asm_emitter.py: Added TicketingEmitterWrapper
   - Wraps UnifiedEmitter to add ticketing for memory operations
   - Intercepts unified emitter calls to issue VMEM/LGKM tickets
   - Handles hazard mitigation after instructions

3. handlers.py, utils.py, gather_to_shared.py: Migrated to unified emitter
   - Replaced emit_instruction(InstructionClass()) with unified.method()
   - All handlers now use emitter.unified.xxx() API

4. hazards.py: Updated to return strings instead of SNop objects
   - get_mitigation() now returns "    s_nop 0" string directly

5. __init__.py: Updated exports
   - Removed instruction class exports
   - Added UnifiedEmitter, EmissionMode, get_registry exports

6. Deleted test_ds_instructions.py
   - DS instruction classes no longer exist

Benefits:
- Single source of truth: YAML files define all instructions
- ~1000 lines of boilerplate code removed
- Consistent API via unified emitter
- Latency/category info available from registry
- Architecture-specific overrides via gfx942.yaml/gfx950.yaml

All 33 e2e tests and 28 ticketing tests pass.

Signed-off-by: Harsh Menon <[email protected]>
This commit adds a unified property to KernelCompilationContext, providing
the same API as AsmEmitter.unified for consistent instruction emission.

Changes:

1. kernel_pipeline.py:
   - Import UnifiedEmitter and EmissionMode
   - Add _unified: UnifiedEmitter field initialized in __post_init__
   - Add unified property that returns the emitter in KERNEL_IR mode
   - The emitter delegates to context methods (v_add_u32, etc.)
   - Falls back to emit_raw() for unsupported instructions

2. kernel_generator.py:
   - Add support for raw int immediates in _resolve_operand()
   - Makes API more ergonomic - callers can pass 42 instead of KImm(42)

Usage:
    ctx = KernelCompilationContext()
    v1 = ctx.v_mov_b32(42)              # Direct method call
    v2 = ctx.unified.v_add_u32(v1, 100) # Via unified emitter
    # Both produce the same kernel IR

Benefits:
- Consistent API between AsmEmitter and KernelCompilationContext
- Enables gradual migration of handlers to kernel IR
- Unified emitter provides instruction metadata from registry

Signed-off-by: Harsh Menon <[email protected]>
Replace ~50 manual instruction method definitions with a single __getattr__
handler that uses the instruction registry and KOpcode mapping.

Key changes:

1. Added _build_opcode_mapping() to create name->KOpcode mapping
   - Converts V_ADD_U32 to v_add_u32 for method dispatch

2. Added _get_def_info() to determine register allocation from operand types
   - Maps OperandType.VGPR -> single vreg
   - Maps OperandType.VGPR_PAIR -> vreg range of 2
   - Maps OperandType.VGPR_QUAD -> vreg range of 4
   - Similarly for SGPRs

3. Added __getattr__ to KernelCompilationContext
   - Intercepts method calls like ctx.v_add_u32(...)
   - Looks up KOpcode from instruction name
   - Returns a bound method that calls _emit_instruction()

4. Added _emit_instruction() for actual emission
   - Looks up InstructionDef from registry
   - Allocates destination registers based on operand types
   - Emits KInstr with proper defs/uses/comment

5. Kept only essential explicit methods:
   - Register allocation: vreg(), sreg(), vreg_pair(), vreg_quad(), etc.
   - Special cases: s_mov_b32_to_m0 (M0 destination), branch labels
   - Finalization: finalize(), finalize_to_string()
   - Properties: num_instructions, num_virtual_vregs, etc.

Benefits:
- Reduced file by ~45 lines (195 added, 240 removed)
- Single source of truth: registry defines all instructions
- Auto-allocates correct destination register type
- Easy to add new instructions (just add to KOpcode enum)
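
A stripped-down sketch of the `__getattr__` dispatch idea follows. The class name, the tiny registry dict, and the register naming scheme are invented for this example; the real KernelCompilationContext consults the instruction registry and KOpcode mapping described above.

```python
from typing import Callable, Dict, List, Tuple


class TinyContext:
    """Toy context: instruction methods are synthesized on first access."""

    # Hypothetical registry: instruction name -> destination register class.
    REGISTRY: Dict[str, str] = {
        "v_mov_b32": "v",
        "v_add_u32": "v",
        "s_add_u32": "s",
    }

    def __init__(self) -> None:
        self.program: List[Tuple[str, str, tuple]] = []
        self._next = {"v": 0, "s": 0}

    def _alloc(self, kind: str) -> str:
        """Hand out the next virtual register of the given class."""
        name = f"k{kind}{self._next[kind]}"
        self._next[kind] += 1
        return name

    def __getattr__(self, name: str) -> Callable[..., str]:
        # Called only when normal attribute lookup fails.
        if name not in self.REGISTRY:
            raise AttributeError(name)

        def emit(*operands) -> str:
            dst = self._alloc(self.REGISTRY[name])
            self.program.append((name, dst, operands))
            return dst

        return emit


ctx = TinyContext()
v1 = ctx.v_mov_b32(42)
v2 = ctx.v_add_u32(v1, 100)
```

Raising `AttributeError` for unknown names keeps `hasattr` and typos behaving normally instead of silently emitting a bogus instruction.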

All 33 e2e tests and 25 kernel IR tests pass.

Signed-off-by: Harsh Menon <[email protected]>
The kernel IR path was started but never completed. Only 2 handlers had
branching for it (handle_barrier_op, handle_lds_barrier_op), while all
other handlers used the legacy path directly.

Since both branches emit the same assembly (just via different mechanisms),
and the unified emitter now provides a consistent API, the branching was
redundant.

Changes:
- handlers.py: Remove use_kernel_ir branches, use unified emitter directly
- mlir_walker.py: Remove unused use_kernel_ir property

All 33 e2e tests pass.

Signed-off-by: Harsh Menon <[email protected]>
…egistry

This commit unifies the instruction definition systems by making the YAML
instruction registry the single source of truth. The KOpcode enum and
hardcoded mnemonic mappings are removed.

Key changes:

1. kernel_ir.py:
   - Removed KOpcode enum (~120 lines of enum values)
   - KInstr now uses name: str instead of opcode: KOpcode
   - Added is_comment, is_label, is_raw_asm properties
   - KernelBuilder methods emit KInstr with string names

2. kernel_generator.py:
   - Removed _get_mnemonic() with 60+ hardcoded mappings
   - Now looks up mnemonics from instruction registry
   - Falls back to using instruction name if not in registry
   - Simplified from ~350 lines to ~260 lines

3. kernel_pipeline.py:
   - Removed _OPCODE_MAP and _get_opcode() functions
   - Dynamic dispatch now uses registry.get(name) directly
   - Emits KInstr with string names

4. kernel_liveness.py, kernel_regalloc.py:
   - Removed unused KOpcode import

5. test_kernel_ir.py:
   - Updated to check instr.name == "v_add_u32" instead of opcode == KOpcode.V_ADD_U32
   - Fixed test_vreg_range_allocation to use is_vgpr()

Benefits:
- Single source of truth: YAML files define all instructions
- Removed ~290 lines of redundant code
- Adding new instructions only requires YAML changes
- Cleaner, more maintainable codebase

All 33 e2e tests and 25 kernel IR tests pass.

Signed-off-by: Harsh Menon <[email protected]>
This commit makes the kernel IR compilation path the default, providing
whole-program register allocation and CSE benefits.

Key changes:

1. kernel_pipeline.py:
   - Changed WAVE_USE_LEGACY_STREAMING default from "1" to "0"
   - Kernel IR path is now enabled by default
   - Set WAVE_USE_LEGACY_STREAMING=1 to use legacy direct emission

2. mlir_walker.py:
   - Added unified property that routes to the correct emitter
   - When kernel_ctx is available, uses kernel_ctx.unified
   - Otherwise falls back to emitter.unified

3. handlers.py, gather_to_shared.py:
   - Changed self.walker.emitter.unified to self.walker.unified
   - This ensures instructions are routed through kernel IR when enabled

4. docs/wave/asm_backend.rst:
   - Updated Kernel-Level IR documentation
   - Noted this is now the default compilation path
   - Added WAVE_USE_LEGACY_STREAMING=1 to disable

All 33 e2e tests and 25 kernel IR tests pass.

Signed-off-by: Harsh Menon <[email protected]>
This commit adds infrastructure for kernel IR compilation path with
whole-program register allocation. Currently disabled by default because
handlers still allocate physical registers from AsmEmitter while emitting
to KernelCompilationContext - these need to be fully coordinated.

Key changes:

1. kernel_pipeline.py:
   - Added WAVE_USE_KERNEL_IR environment variable (default: 0)
   - use_kernel_ir_path() checks this flag
   - Set WAVE_USE_KERNEL_IR=1 to enable for testing

2. driver.py, asm_emitter.py:
   - Conditionally create KernelCompilationContext based on flag
   - Call finalize() to generate assembly from kernel IR

3. mlir_walker.py:
   - Added unified property that routes to kernel_ctx or emitter
   - Enables handlers to use walker.unified for instruction emission

4. handlers.py, gather_to_shared.py:
   - Changed emitter.unified to walker.unified
   - Enables routing through kernel IR when enabled

5. kernel_generator.py:
   - Added support for string operands (e.g., waitcnt values)

To fully enable kernel IR mode, handlers need to:
- Allocate virtual registers from kernel_ctx instead of physical from emitter
- Track virtual registers for SSA values
- Let kernel IR pipeline handle physical allocation at finalize

All 33 e2e tests pass with legacy mode (default).

Signed-off-by: Harsh Menon <[email protected]>
Increased DS_MAX_OFFSET from 2040 to 8192 in handlers.py to allow
ds_read instructions to use the offset field for larger offsets
(4096, 4128, 4160, 4192).

Results:
- VALU reduced from 93 to 79 (-14 instructions)
- v_mov_b32 reduced from 13 to 10 (-3 constants)
- v_add_u32 reduced from 27 to 16 (-11 address adds)
- ASM backend now achieves 65% of LLVM performance (up from 63%)
- All 33 e2e tests pass

The ds_read offset field now encodes constants like 4128, 4160, 4192
instead of materializing them in separate v_mov_b32 instructions.
Multiple ds_read operations can now share a common base register.
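
The effect of the larger limit can be sketched as follows. This is an illustrative model, not the handlers.py code: the function name and register numbers are arbitrary, and treating 8192 as an inclusive limit is an assumption.

```python
from typing import List

DS_MAX_OFFSET = 8192  # new limit from this commit; the old value was 2040


def emit_ds_reads(base: str, const_offsets: List[int],
                  max_offset: int = DS_MAX_OFFSET) -> List[str]:
    """Emit ds_read_b64 for base + offset, folding the offset when it fits."""
    lines: List[str] = []
    for i, off in enumerate(const_offsets):
        dst = f"v[{10 + 2 * i}:{11 + 2 * i}]"  # arbitrary destination pair
        if 0 <= off <= max_offset:
            # The constant rides in the instruction; the base is shared.
            lines.append(f"ds_read_b64 {dst}, {base} offset:{off}")
        else:
            # Too large for the limit: materialize the address first.
            tmp = f"v{100 + i}"  # arbitrary temporary VGPR
            lines.append(f"v_add_u32 {tmp}, {off}, {base}")
            lines.append(f"ds_read_b64 {dst}, {tmp}")
    return lines


offsets = [4096, 4128, 4160, 4192]
before = emit_ds_reads("v2", offsets, max_offset=2040)
after = emit_ds_reads("v2", offsets)
```

With the old limit each read needs its own v_add_u32 and temporary; with the new limit all four reads use `v2` directly.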

Signed-off-by: Harsh Menon <[email protected]>
Added comprehensive bit range analysis to KernelIRExprEmitter:
- _get_bit_range(): Computes [min_bit, max_bit] for expressions
- _bits_overlap(): Checks if two bit ranges overlap
- Modified Add handling to use v_or_b32 when bits don't overlap

Combined with peephole optimizer's v_lshl_or_b32 fusion for shift+OR patterns.

Results:
- VALU reduced from 79 to 67 (-12 instructions, 15%)
- Performance improved from 3.96 to 4.32 TFLOPs (+9%)
- ASM backend now achieves 71% of LLVM performance (up from 65%)
- All 33 e2e tests pass

Example generated code showing bit-aware OR:
  v_or_b32 v12, v8, v2  // or (bits 3-4 + 7-10)
  v_or_b32 v13, v12, v16  // or (bits 3-10 + 11-26)
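
The reasoning behind the ADD-to-OR rewrite can be sketched in a few lines. The helper names below echo `_get_bit_range()` and `_bits_overlap()` but are simplified stand-ins that only handle a bounded non-negative value shifted left by a constant; the real analysis works on full expressions.

```python
from typing import Optional, Tuple

BitRange = Optional[Tuple[int, int]]  # (lowest bit, highest bit) or None for zero


def bit_range(max_value: int, shift: int = 0) -> BitRange:
    """Bits that (x << shift) can occupy when 0 <= x <= max_value."""
    if max_value == 0:
        return None
    return (shift, shift + max_value.bit_length() - 1)


def bits_overlap(a: BitRange, b: BitRange) -> bool:
    if a is None or b is None:
        return False
    return a[0] <= b[1] and b[0] <= a[1]


def select_add_opcode(a: BitRange, b: BitRange) -> str:
    """With disjoint bit ranges no carry can occur, so OR equals ADD."""
    return "v_add_u32" if bits_overlap(a, b) else "v_or_b32"


# Mirrors the first example above: bits 3-4 combined with bits 7-10.
lhs = bit_range(3, shift=3)   # x in [0, 3]  shifted left by 3
rhs = bit_range(15, shift=7)  # y in [0, 15] shifted left by 7
opcode = select_add_opcode(lhs, rhs)

# Exhaustive check that OR and ADD agree for every value in those ranges.
or_equals_add = all(
    (x << 3) + (y << 7) == (x << 3) | (y << 7)
    for x in range(4)
    for y in range(16)
)
```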

Signed-off-by: Harsh Menon <[email protected]>
Bug fixes:
1. VALU count was inflated because 's_mov_b32' contains 'v_' as substring
   - Fixed: Check if first word STARTS with 'v_' not CONTAINS 'v_'

2. VGPR/SGPR/LDS extraction was failing
   - Fixed: Look for .amdhsa_next_free_vgpr/sgpr patterns
   - Fixed: Use raw_asm for ASM backend (has HSA metadata)

Results now correctly show:
- ASM VALU: 63 (was incorrectly 80)
- ASM Resources: VGPR=32, SGPR=32, LDS=8192 (was 0)
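
The first bug is easy to reproduce in isolation. The sketch below contrasts a substring test with a check on the first word of each line; the function names are made up for this example and the sample assembly is synthetic.

```python
def count_valu_buggy(asm: str) -> int:
    """Substring test: also counts s_mov_b32, which contains 'v_'."""
    return sum(1 for line in asm.splitlines() if "v_" in line)


def count_valu(asm: str) -> int:
    """Count lines whose mnemonic (first word) starts with 'v_'."""
    count = 0
    for line in asm.splitlines():
        words = line.split()
        if words and words[0].startswith("v_"):
            count += 1
    return count


sample = """
  s_mov_b32 s0, 0
  v_add_u32 v1, v0, v2
  s_waitcnt vmcnt(0)
"""
buggy = count_valu_buggy(sample)
fixed = count_valu(sample)
```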

Signed-off-by: Harsh Menon <[email protected]>
Bug fixes:
1. LLVM resources showed 0 because metadata is in .rocmasm file, not disasm
   - Now reads *.rocmasm from llvm_intermediates directory

2. Updated extract_resource_usage to handle YAML format
   - .vgpr_count:     27 (YAML with spaces)
   - .group_segment_fixed_size: 8192

Correct results now:
- LLVM: VGPR=27, SGPR=16, LDS=8192
- ASM:  VGPR=32, SGPR=32, LDS=8192
- VALU gap: 33 (LLVM) vs 63 (ASM) = 30 extra instructions
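
A hedged sketch of an extractor that accepts both spellings is shown below. The patterns for `.amdhsa_next_free_vgpr/sgpr`, `.vgpr_count`, and `.group_segment_fixed_size` come from the notes above; `.sgpr_count` and `.amdhsa_group_segment_fixed_size` are assumed counterparts, and the function is an illustration rather than the script's actual `extract_resource_usage`.

```python
import re
from typing import Dict, List

# For each resource: the directive form first, then the YAML metadata form.
PATTERNS: Dict[str, List[str]] = {
    "vgpr": [r"\.amdhsa_next_free_vgpr\s+(\d+)", r"\.vgpr_count:\s*(\d+)"],
    "sgpr": [r"\.amdhsa_next_free_sgpr\s+(\d+)", r"\.sgpr_count:\s*(\d+)"],
    "lds": [
        r"\.amdhsa_group_segment_fixed_size\s+(\d+)",
        r"\.group_segment_fixed_size:\s*(\d+)",
    ],
}


def extract_resource_usage(text: str) -> Dict[str, int]:
    """Return VGPR/SGPR/LDS numbers, or 0 for anything not found."""
    usage: Dict[str, int] = {}
    for key, patterns in PATTERNS.items():
        usage[key] = 0
        for pattern in patterns:
            match = re.search(pattern, text)
            if match:
                usage[key] = int(match.group(1))
                break
    return usage


yaml_style = """
    .sgpr_count:     16
    .vgpr_count:     27
    .group_segment_fixed_size: 8192
"""
directive_style = """
    .amdhsa_next_free_vgpr 32
    .amdhsa_next_free_sgpr 32
    .amdhsa_group_segment_fixed_size 8192
"""
llvm_usage = extract_resource_usage(yaml_style)
asm_usage = extract_resource_usage(directive_style)
```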

Signed-off-by: Harsh Menon <[email protected]>
Key optimizations:
1. Add constant cache to avoid materializing same constant multiple times
2. Handle negative powers of 2 with shift + negate instead of mul
   - e.g., x * -8192 becomes -(x << 13) = 0 - (x << 13)
   - Saves 1 instruction per negative power-of-2 multiply
3. Use inline literals for constants in Add expressions
   - e.g., v_add_u32 dst, 4096, src instead of mov + add
   - Saves 1 v_mov_b32 per constant addend
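
The shift + negate rewrite in item 2 can be checked with plain integer arithmetic. This is a sketch under the assumption that the emitter detects constants of the form -(2^k); the helper names are hypothetical and the code models the arithmetic only, not the emitted instructions.

```python
from typing import Optional


def shift_amount_for_neg_pow2(constant: int) -> Optional[int]:
    """Return k if constant == -(2**k), otherwise None."""
    magnitude = -constant
    if constant >= 0 or magnitude & (magnitude - 1):
        return None
    return magnitude.bit_length() - 1


def mul_by_constant(x: int, constant: int) -> int:
    """Multiply via 0 - (x << k) when the constant is a negative power of two."""
    shift = shift_amount_for_neg_pow2(constant)
    if shift is None:
        return x * constant  # fall back to an ordinary multiply
    return 0 - (x << shift)


print(shift_amount_for_neg_pow2(-8192), mul_by_constant(5, -8192))
```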

Results:
- v_mov_b32: 10 -> 7 (-3 instructions)
- v_mul_lo_u32: 2 -> 0 (-2 instructions)
- VALU total: 63 -> 62 (-1 net, shift+sub replaces mov+mul)

All 33 e2e tests pass in kernel IR mode.

Signed-off-by: Harsh Menon <[email protected]>
This commit removes the legacy mode from the ASM backend, leaving kernel IR
as the only compilation path. Changes include:

- kernel_pipeline.py: use_kernel_ir() now always returns True
- driver.py: Removed use_kernel_ir() branching, always uses kernel IR
- handlers.py: Removed all `kernel_ctx is None` checks and legacy code paths
  - Removed _get_expr_emitter(), _compute_lds_address(), _extract_source_registers(),
    _emit_ds_write(), _get_scalar_register_for_store(), _compute_store_address()
  - Simplified all operation handlers to use only kernel IR mode
- mlir_walker.py: Removed precompute_m0_values_legacy call and import
- gather_to_shared.py: Removed precompute_m0_values_legacy function and
  all _use_kernel_ir() checks, simplified all methods to kernel IR only

All 33 e2e tests pass. The WAVE_KERNEL_IR environment variable is now ignored
as legacy mode has been removed.

Signed-off-by: Harsh Menon <[email protected]>
Removed:
- asm_asm.s, asm_disasm.s, llvm_asm.s, llvm_disasm.s (generated assembly files)
- comparison_report.txt (temporary report)
- wave_lang/tests/*/__init__.py (empty init files)

Signed-off-by: Harsh Menon <[email protected]>
Key documentation updates:
- Remove all references to legacy mode (kernel IR is now the only path)
- Document CFG-based backward dataflow liveness analysis
- Document peephole optimizations (v_lshl_add_u32, v_lshl_or_b32 fusion)
- Document bit range analysis for OR optimization
- Document precise hazard mitigation (reduced s_nop from ~46 to ~2)
- Document ds_read/ds_write offset optimization (up to 8192 bytes)
- Document algebraic simplification with symbol bounds
- Document loop-invariant expression caching
- Document loop SGPR reservation (s24+)
- Add performance comparison vs LLVM backend section
- Update troubleshooting with new debug environment variables
- Reorganize and modernize architecture section

Signed-off-by: Harsh Menon <[email protected]>
Key documentation updates:
- Add KernelCompilationContext and KernelIRExprEmitter sections
- Document scoped CSE and loop-invariant detection
- Document algebraic simplification with symbol bounds
- Document bit range analysis for OR optimization
- Update liveness analysis to describe CFG-based backward dataflow
- Add loop handling and back-edge detection documentation
- Document loop SGPR reservation (s24+)
- Add peephole optimization section (instruction fusion)
- Document precise hazard mitigation
- Add new debugging environment variables
- Update future work section with three-operand fusion
- Remove references to legacy streaming allocation

Signed-off-by: Harsh Menon <[email protected]>
Dead code removed:
- kernel_emitter.py (710 lines) - entire file deleted as it was replaced by
  KernelIRExprEmitter in kernel_pipeline.py
- emit_expression_asm and _emit_thread_id_expression from utils.py (~82 lines)
- compute_voffset_and_instoffset_* methods from asm_emitter.py (~117 lines)

These were legacy code paths that are no longer called since the kernel IR
compilation path is now the only path.

Before (kernel_*.py total): 5933 lines
After (kernel_*.py total): 5223 lines
Net reduction: 710 lines from kernel_*.py files

Additional reduction in utils.py and asm_emitter.py: ~199 lines

Total reduction: ~909 lines of dead code removed

All 33 e2e tests pass after this cleanup.

Signed-off-by: Harsh Menon <[email protected]>
This commit implements Option B (Single-Path Architecture) for instruction emission:

New architecture:
  MetadataEmitter.emit_prologue() -> assembler directives
  KernelCompilationContext      -> kernel IR (virtual registers)
  KernelGenerator              -> assembly text (physical registers)
  MetadataEmitter.emit_epilogue() -> YAML metadata

Key changes:

1. InstructionFormatter (NEW):
   - Single point for ALL physical instruction formatting
   - Uses InstructionRegistry (YAML) for instruction definitions
   - Handles special formatting (buffer ops, waitcnt, DS offsets)
   - Used by both UnifiedEmitter and KernelGenerator

2. MetadataEmitter (NEW):
   - Handles kernel prologue/epilogue metadata directives
   - .amdgcn_target, .amdhsa_kernel, .amdgpu_metadata
   - Resource patching (VGPR/SGPR/LDS after allocation)

3. KernelGenerator refactored:
   - Now uses InstructionFormatter instead of inline f-strings
   - Single point for physical instruction rendering
   - Pseudo-instruction handlers use formatter API

4. KernelCompilationContext enhanced:
   - emit_kernargs() - emits s_load_dwordx2 at kernel start
   - finalize() - emits s_endpgm at kernel end
   - Kernarg pair tracking for SRD setup

All 33 e2e tests pass.

Signed-off-by: Harsh Menon <[email protected]>
This commit implements Option B (Single-Path Architecture) for instruction emission:

New architecture:
  MetadataEmitter.emit_prologue() -> assembler directives
  KernelCompilationContext      -> kernel IR (virtual registers)
  KernelGenerator              -> assembly text (physical registers)
  MetadataEmitter.emit_epilogue() -> YAML metadata

Key changes:

1. InstructionFormatter (instruction_formatter.py):
   - Single point for ALL physical instruction formatting
   - Uses InstructionRegistry (YAML) for instruction definitions
   - Handles special formatting (buffer ops, waitcnt, DS offsets)
   - Used by both UnifiedEmitter and KernelGenerator

2. MetadataEmitter (metadata_emitter.py):
   - Handles kernel prologue/epilogue metadata directives
   - .amdgcn_target, .amdhsa_kernel, .amdgpu_metadata
   - Resource patching (VGPR/SGPR/LDS after allocation)

3. KernelGenerator refactored:
   - Now uses InstructionFormatter instead of inline f-strings
   - Single point for physical instruction rendering
   - Pseudo-instruction handlers use formatter API

4. KernelCompilationContext enhanced:
   - emit_kernargs() - emits s_load_dwordx2 at kernel start
   - finalize() - emits s_endpgm at kernel end (if not already there)
   - Kernarg pair tracking for SRD setup

5. driver.py updated:
   - Uses MetadataEmitter for prologue/epilogue
   - Uses KernelCompilationContext for kernel body
   - Single-path flow: metadata -> kernel IR -> metadata

All 25 asm unit tests pass.

Signed-off-by: Harsh Menon <[email protected]>
- Refactored KernelModuleCompiler to use MetadataEmitter for prologue/epilogue
- Removed dead code from AsmEmitter (~190 lines):
  - Loop methods: begin_loop, emit_loop_header, emit_loop_latch, end_loop
  - Kernarg/SRD: emit_kernargs, ensure_srd_for_subspan
  - Load/Store: chunk_offsets, emit_load, emit_store_* methods
- Added _NoOpTicketing class and ticketing property to KernelCompilationContext
- Updated handlers.py to use kernel_ctx.ticketing when available
- Added KernelModuleCompiler to package exports

All 25 unit tests and 33 e2e tests pass.

Signed-off-by: Harsh Menon <[email protected]>
…icy

- Add mlir_analysis.py and reuse it across driver/KernelModuleCompiler/AsmEmitter
- Make MetadataEmitter the source of prologue+epilogue metadata; AsmEmitter delegates
- Unify normalize_wg_size and register granularity rules
- Add abi.py for system_vgpr_workitem_id policy and tests for helpers

Tests: unit(asm/) + e2e(asm_backend_test.py)

Signed-off-by: Harsh Menon <[email protected]>
…tigation; shrink AsmEmitter

Signed-off-by: Harsh Menon <[email protected]>
…plication

High priority fixes:
- Fix regalloc peak VGPR/SGPR accounting bug (vreg_map keys are ints, not KVReg)
- Make kernel selection policy explicit with documented constants
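
One way a key-type mismatch like the regalloc accounting bug can silently undercount is sketched below. Everything here is an assumption for illustration: the `KVReg` stand-in with an `id` field, the shape of `vreg_map`, and the peak computation are hypothetical and are not taken from the allocator.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class KVReg:
    """Hypothetical stand-in for a virtual VGPR; `id` is an assumed field name."""
    id: int


# Assumed shape: virtual register id (int) -> assigned physical register index.
vreg_map: Dict[int, int] = {0: 4, 1: 7}
live: List[KVReg] = [KVReg(0), KVReg(1)]

# Buggy lookup: KVReg objects never match int keys, so nothing is counted.
buggy_peak = max((vreg_map[r] + 1 for r in live if r in vreg_map), default=0)

# Fixed lookup: index the map by the integer id.
fixed_peak = max((vreg_map[r.id] + 1 for r in live if r.id in vreg_map), default=0)

print(buggy_peak, fixed_peak)
```

The buggy variant raises no error, it simply reports a peak of zero, which is why a regression test on the reported counts is useful.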

Medium priority fixes:
- Add strict operand validation to InstructionFormatter (behind WAVE_STRICT_FORMATTER)
- Unify duplicate InstructionCategory enums (single definition in instruction_categories.py)
- Replace star imports with explicit imports in kernel pipeline modules
- Clean up untracked legacy files (linear_scan_allocator.py, liveness.py, sympy_fused_ops.py)
- Improve kernel_pipeline_shared.py __all__ to only export public symbols

All 30 unit tests and 33 e2e tests pass.

Signed-off-by: Harsh Menon <[email protected]>
Testing improvements:
- Add regression test for regalloc peak VGPR/SGPR accounting with ranges
- Add unit tests for should_skip_function kernel selection policy
- Enable strict formatter validation by default in ASM tests (conftest.py)

Script cleanup (compare_backends.py):
- Move os.environ mutation from import-time to main()
- Replace wildcard import with explicit imports from global_symbols
- Add documentation for external tool requirements

All 38 unit tests and 33 e2e tests pass.

Signed-off-by: Harsh Menon <[email protected]>
Changes:
- Update Srd127_96 constant from hex 0x20000 to decimal 131072
- Simplify s_load_dwordx2 patterns to accept decimal offsets
- Remove overly specific SRD setup comments and patterns
- Simplify MFMA and buffer store patterns to use flexible matchers
- Update test_gemm_gather_to_lds to focus on loop structure

The CHECK patterns now match the actual assembly output from the
kernel IR compilation path.

Signed-off-by: Harsh Menon <[email protected]>
- Remove asm_emitter.py reference (file was deleted)
- Add new components: instruction_formatter.py, metadata_emitter.py,
  kernel_module_compiler.py, kernel_compilation_context.py, kernel_passes.py,
  mlir_analysis.py, ticketing.py, abi.py
- Document handler file split (handlers_memory.py, handlers_control.py, etc.)
- Update architecture diagram in register_allocation.rst
- Update module references to point to correct split files

Signed-off-by: Harsh Menon <[email protected]>