Matpiler: Tiled Matmul Compilation via Fibers
We are computing a matmul over:
A = (2,3,1,5,6,4)
B = (1,1,3,5,4,5)
At depth 2, the active shapes are:
A₂ = (2,3,1) → flattened size 6
B₂ = (1,1,3) → flattened size 3
C₂ = (2,3,3) → flattened size 18
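As a quick check, here is a minimal Python sketch that derives these depth-2 shapes and flattened sizes from A and B (taking "depth 2" to mean the leading dimensions up to and including index 2, which is an assumption about the notation):

from math import prod

A = (2, 3, 1, 5, 6, 4)
B = (1, 1, 3, 5, 4, 5)
depth = 3  # number of leading batch dims active at this level (assumed from the example)

A2, B2 = A[:depth], B[:depth]
C2 = tuple(max(a, b) for a, b in zip(A2, B2))  # standard broadcasting of batch dims

print(A2, prod(A2))  # (2, 3, 1) 6
print(B2, prod(B2))  # (1, 1, 3) 3
print(C2, prod(C2))  # (2, 3, 3) 18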
Fiber Definition
A Fiber is a modular computational unit deployed at a particular tiling level (block, warp, thread, etc.).
It describes:
- Which part of the tensors it owns
- How it moves data between memory spaces
- What computation it performs locally
The system is organized around decomposing tensor operations into fibers at different depths, each responsible for a contiguous or logically strided region of work, with memory movement and computation explicitly described.
Pseudocode-like Fiber Interface
fiber(x, t, P, Q, load_pattern, compute_pattern):
    ownedA₂ = ownership(A₂, x)
    ownedB₂ = ownership(B₂, x)
    ownedC₂ = ownership(C₂, x)
    load_pattern(ownedA₂, x, t, P)
    load_pattern(ownedB₂, x, t, Q)
    compute_pattern(P, Q, x, t, ownedC₂)
Explanation of Each Concept
ownership(T, x)
Defines the region of tensor T assigned to fiber instance x.
At this tiling depth, this typically means:
- A contiguous slice in the logical 1D view of the depth’s shape.
- Example: ownership(A₂, x) might correspond to A₂[x:x+5].
Contiguous ownership is preferred because moving scattered elements from global memory into a contiguous shared memory region would cause non-coalesced loads and inefficient memory access patterns. In most cases, each fiber instance (block/warp/thread) should be assigned a logically contiguous region in memory.
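A minimal sketch of a contiguous ownership function, assuming the tensor is viewed as a flat 1D index space at this depth (the parameters flat_size and num_fibers are illustrative, not part of the interface above):

def ownership(flat_size: int, x: int, num_fibers: int) -> range:
    # Split the flat index space [0, flat_size) evenly among num_fibers
    # instances; fiber x owns one contiguous [start, end) slice.
    chunk = (flat_size + num_fibers - 1) // num_fibers  # ceiling division
    start = min(x * chunk, flat_size)
    end = min(start + chunk, flat_size)
    return range(start, end)

# e.g. ownership(18, 0, 4) -> range(0, 5); ownership(18, 3, 4) -> range(15, 18)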
load_pattern(ownedT, x, t, P)
Describes how data is loaded from an owned region ownedT into a local memory space P (for example, shared memory or registers).
The iterator t exists because:
- A fiber may own more data than the available compute resources (number of threads, size of shared memory, etc.)
- If there are more elements in ownedT than resources to process them, the work needs to be iterated over in multiple t steps.
The load pattern defines:
- How to map indices in ownedT to indices in P at each step t
- How to lay out these tiles in local memory, respecting hardware constraints like bank conflicts or coalesced accesses
At each t, it computes one or more index mappings between ownedT and P, potentially in 3D or ND depending on the depth and tile structure.
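One possible load_pattern, sketched over flat Python lists (global_mem stands in for global memory, P models a shared-memory tile, and num_threads bounds how many elements move per step; all of these names are illustrative assumptions):

def load_pattern(global_mem, owned: range, t: int, P: list, num_threads: int):
    # Step t copies the t-th num_threads-wide chunk of the owned region into P.
    base = owned.start + t * num_threads
    for lane in range(num_threads):        # one "thread" per lane
        src = base + lane
        if src < owned.stop:
            # Consecutive lanes read consecutive addresses, i.e. coalesced access.
            P[lane] = global_mem[src]

# A fiber owning 100 elements with 32 threads would call this for t = 0, 1, 2, 3.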
compute_pattern(P, Q, x, t, ownedC₂)
Describes the computation performed on local tiles P and Q at step t, affecting the corresponding region ownedC₂.
There are two important cases:
- Reduction along a shared K dimension (typical matmul)
  - Each iteration t processes a K-slice of P and Q.
  - The result is accumulated into ownedC₂.
  - This is the standard fused multiply-add behavior of matmul.
- Broadcasted matmul
  - If no reduction happens at this level, each iteration t maps a pair of Pₜ, Qₜ to a distinct part of ownedC₂.
  - In this case, the compute_pattern might be empty at this level.
  - The computation is deferred to deeper nested fibers (such as warp- or thread-level fibers) which have sufficient compute resources to perform the required operations.
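For the reduction case, a minimal sketch of a compute_pattern that accumulates one K-slice per step into the owned output region (a scalar Python toy over flat row-major tiles; M, N, K_tile are illustrative parameters, not part of the interface above):

def compute_pattern(P, Q, t, owned_C, M, N, K_tile):
    # P holds an M x K_tile slice of A, Q holds a K_tile x N slice of B.
    # Each step t contributes one partial product, accumulated into owned_C.
    for i in range(M):
        for j in range(N):
            acc = 0.0
            for k in range(K_tile):
                acc += P[i * K_tile + k] * Q[k * N + j]
            owned_C[i * N + j] += acc  # accumulate across steps t (FMA-style)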
Why the Iterator t is Needed
Fibers can be nested. A higher-level fiber (like a block fiber) may own a large region of work, for example 100 elements, but only have 32 threads available to execute it. In this case:
- The work is divided into multiple iterations, each indexed by t.
- Each iteration processes a subset of the fiber’s owned data.
- Shared memory buffers P and Q might also be smaller than the fiber’s owned region, further requiring iteration.
The iterator t parameterizes partial work within a fiber's owned region, making it possible to cover large data regions with limited compute resources.
Recap: Clean Annotated Form
fiber(x, t, P, Q, load_pattern, compute_pattern):
    // Determine the tensor slices this fiber instance 'x' owns at this tiling depth
    ownedA₂ = ownership(A₂, x)
    ownedB₂ = ownership(B₂, x)
    ownedC₂ = ownership(C₂, x)
    // Load from global memory slices owned by this fiber into local memory P and Q
    load_pattern(ownedA₂, x, t, P)
    load_pattern(ownedB₂, x, t, Q)
    // Perform compute (reduce/accumulate or deferred compute) on local tiles
    compute_pattern(P, Q, x, t, ownedC₂)
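To make the annotated form concrete, here is a small self-contained Python sketch that runs a single fiber instance over a plain (non-batched) matmul with a tiled K dimension and checks it against a naive reference. All sizes and names are illustrative assumptions, not part of the interface above:

import random

M, K, N = 4, 6, 5
K_TILE = 2                                   # K-slice processed per step t
A = [random.random() for _ in range(M * K)]  # M x K, row-major
B = [random.random() for _ in range(K * N)]  # K x N, row-major
C = [0.0] * (M * N)

def fiber(x, P, Q):
    # One fiber instance owning the whole output (x is unused in this toy).
    for t in range(K // K_TILE):             # iterate because P/Q hold only a K-slice
        # load_pattern: copy the t-th K-slice of A and of B into the local tiles
        for i in range(M):
            for k in range(K_TILE):
                P[i * K_TILE + k] = A[i * K + t * K_TILE + k]
        for k in range(K_TILE):
            for j in range(N):
                Q[k * N + j] = B[(t * K_TILE + k) * N + j]
        # compute_pattern: accumulate the partial product into the owned output
        for i in range(M):
            for j in range(N):
                for k in range(K_TILE):
                    C[i * N + j] += P[i * K_TILE + k] * Q[k * N + j]

fiber(0, [0.0] * (M * K_TILE), [0.0] * (K_TILE * N))

reference = [sum(A[i * K + k] * B[k * N + j] for k in range(K))
             for i in range(M) for j in range(N)]
assert all(abs(c - r) < 1e-9 for c, r in zip(C, reference))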
Summary of Constraints and Design Principles
- Ownership slices should ideally be contiguous in the 1D logical memory layout at each tiling depth for efficient shared memory loads.
- Load patterns define how to move these slices into local memory tiles and how to index between them.
- Iterators t allow fibers to progressively cover their ownership regions when the region size exceeds available resources.
- Compute patterns either perform a reduction along a shared dimension or defer the computation to nested fibers, depending on whether reductions happen at this level.
- In broadcasted matmuls, compute may be deferred to lower levels if tiles from A and B have not yet aligned at a reduction depth.
Final Notes
This abstraction effectively describes a composable fiber algebra for matmul operations. The system separates:
- Data ownership and partitioning at each depth
- Load scheduling into local memory
- Compute scheduling within local memory
It is modular, hierarchical, and suitable for decomposition into nested fibers at block, warp, thread, or instruction levels.
The abstraction generalizes to:
- Tensor Core operations
- Fused compute epilogues
- Mixed precision matmuls
- Auto-tuning kernel generators
- Batched and broadcasted GEMMs
The model is compatible with runtime and ahead-of-time code generation strategies, and can be extended to multi-device or multi-SM scheduling as well.
Matpiler: Nested Fiber Inheritance and Decomposition Rules
This document describes how nested fibers in a tiled matmul compilation system inherit and subdivide the work domains of their parent fibers.
Each fiber instance (block, warp, thread, etc.) operates on a subset of tensor data, loads it into local memory, and optionally performs computations.
When a fiber cannot process its entire owned region at once (due to compute or memory limits), it spawns nested fibers which inherit a subset of the parent’s region and iterate over it.
Nested Fiber Structure
A nested fiber inherits:
- Ownership slices for tensors A, B, C within its parent fiber’s domain.
- Iteration ranges within its parent’s region.
- Local memory allocations (which may be inherited or further subdivided).
- Load and compute patterns appropriate for its compute level.
Each nested fiber level typically has tighter resource constraints (e.g., fewer threads, fewer registers per thread, a fixed warp size).
Inheritance and Decomposition Rules
1. Ownership Inheritance
Each nested fiber receives a contiguous sub-region of its parent’s owned region.
The decomposition follows these rules:
- If the parent owns a contiguous slice [start:end] in tensor T, the child fibers split this range into sub-slices of equal or parameterized size.
- The number of child fibers depends on the available resources (threads per warp, registers per thread, etc.)
- The mapping strategy (contiguous or strided) should aim to preserve coalesced global and shared memory accesses.
Example:
Parent fiber owns A₂[0:64]
4 child fibers → each owns 16 elements:
- child 0: A₂[0:16]
- child 1: A₂[16:32]
- child 2: A₂[32:48]
- child 3: A₂[48:64]
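A minimal sketch of this split, using the subrange helper named in the formal recap below (the even, ceiling-divided chunking is an assumption; real decompositions may be parameterized differently):

def subrange(parent: range, child_index: int, child_count: int) -> range:
    # Carve the parent's contiguous range into child_count contiguous sub-slices.
    size = len(parent)
    chunk = (size + child_count - 1) // child_count  # ceiling division
    start = parent.start + min(child_index * chunk, size)
    end = min(start + chunk, parent.stop)
    return range(start, end)

# Matches the example above: parent owns A₂[0:64], four children
# subrange(range(0, 64), 0, 4) -> range(0, 16) ... subrange(range(0, 64), 3, 4) -> range(48, 64)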
2. Iterator Range Inheritance
Each nested fiber inherits a local iterator t, used to iterate over its owned sub-region.
If a nested fiber’s work still exceeds available compute resources, it can internally loop over t.
Example:
If a warp fiber owns 128 elements, but only has 32 threads:
- It will iterate t = 0, 1, 2, 3
- Each thread processes one element at each t value
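The number of t steps falls out of a simple ceiling division, as in this small sketch (function and parameter names are illustrative):

def t_range(owned_size: int, num_threads: int) -> range:
    # One t step per num_threads-wide chunk of the owned sub-region.
    return range((owned_size + num_threads - 1) // num_threads)

# Warp fiber from the example: 128 owned elements, 32 threads
assert list(t_range(128, 32)) == [0, 1, 2, 3]
# The element handled by thread `lane` at step t is index t * 32 + lane.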
3. Local Memory (P, Q) Allocation
- Local memory buffers (P, Q) can be inherited as:
  - A shared region partitioned among child fibers
  - Separate per-fiber allocations within shared memory
- Decomposition rules:
  - If shared memory is sufficient, each child fiber can map to a distinct contiguous segment.
  - If insufficient, nested iteration over t can tile over partial regions.
Example:
If the parent’s shared memory P has space for 128 elements and there are four child fibers:
- Each child can be assigned 32 elements of P
- Or, if only 64 elements are available, each child iterates in t over its portion
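A sketch of the first option, carving a parent buffer into per-child segments (sizes taken from the example; the helper name is illustrative):

def child_segment(parent_capacity: int, child_index: int, child_count: int) -> range:
    # Assumes the capacity divides evenly among children for simplicity.
    seg = parent_capacity // child_count
    return range(child_index * seg, (child_index + 1) * seg)

# 128-element parent buffer, four children -> 32 elements each
assert child_segment(128, 2, 4) == range(64, 96)
# With only 64 elements total, each child gets 16 and covers its owned
# region over multiple t steps instead.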
4. Load and Compute Pattern Inheritance
Each child fiber inherits:
- The parent’s load pattern restricted to its owned region
- The parent’s compute pattern, specialized for its sub-region
Child fibers may redefine:
- Tile shapes and strides
- Memory layouts (e.g., warps may load interleaved data for better bank conflict avoidance)
- Compute schedules (e.g., switching from reduction to FMA at a warp level)
5. Reduction and Broadcast Handling
At each level:
- If a reduction along a K dimension is active, the child fiber accumulates partial results locally or forwards them to a deeper fiber
- If operating on broadcasted operands:
- The fiber may choose to defer computation if operand alignment has not occurred
- Or, if alignment has occurred, perform independent matmuls in parallel
Nested fibers inherit knowledge of reduction dimensions and broadcast shapes from their parent.
Formal Recap of Inheritance
At each nested depth:
ownedA_child = subrange(ownedA_parent, child_index, child_count)
ownedB_child = subrange(ownedB_parent, child_index, child_count)
ownedC_child = subrange(ownedC_parent, child_index, child_count)
P_child and Q_child are either:
- Disjoint subregions of P_parent, or
- Shared buffers partitioned via index mapping
The t_child range is determined by:
- The t_parent range
- The size of ownedA_child and ownedB_child
- The number of threads in the child fiber
The compute pattern is inherited or specialized per child fiber as appropriate.
Summary
Nested fibers in Matpiler:
- Inherit ownership, iterators, memory, and compute strategies from their parent fiber
- Subdivide the parent’s owned region into smaller, contiguous or logically strided subregions
- Adjust iteration ranges and compute schedules to match available resources
- Propagate reduction and broadcast alignment rules to deeper levels
- Specialize load and compute patterns for each tiling depth and hardware level
This system enables efficient, hierarchical scheduling of large matmul operations across GPU block/warp/thread hierarchies while preserving clear, composable abstraction boundaries between levels.