Matpiler: Tiled Matmul Compilation via Fibers
We are computing a matmul over:
A = (2,3,1,5,6,4)
B = (1,1,3,5,4,5)
At depth 2, the active shapes are:
A₂ = (2,3,1) → flattened size 6
B₂ = (1,1,3) → flattened size 3
C₂ = (2,3,3) → flattened size 18
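As a quick check, here is a minimal Python sketch that derives these depth-2 shapes and flattened sizes from A and B (taking "depth 2" to mean the leading dimensions up to and including index 2, which is an assumption about the notation):

from math import prod

A = (2, 3, 1, 5, 6, 4)
B = (1, 1, 3, 5, 4, 5)
depth = 3  # number of leading batch dims active at this level (assumed from the example)

A2, B2 = A[:depth], B[:depth]
C2 = tuple(max(a, b) for a, b in zip(A2, B2))  # standard broadcasting of batch dims

print(A2, prod(A2))  # (2, 3, 1) 6
print(B2, prod(B2))  # (1, 1, 3) 3
print(C2, prod(C2))  # (2, 3, 3) 18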
Fiber Definition
A Fiber is a modular computational unit deployed at a particular tiling level (block, warp, thread, etc.).
It describes:
- Which part of the tensors it owns
- How it moves data between memory spaces
- What computation it performs locally
The system is organized around decomposing tensor operations into fibers at different depths, each responsible for a contiguous or logically strided region of work, with memory movement and computation explicitly described.
Pseudocode-like Fiber Interface
fiber(x, t, P, Q, load_pattern, compute_pattern):
    ownedA₂ = ownership(A₂, x)
    ownedB₂ = ownership(B₂, x)
    ownedC₂ = ownership(C₂, x)
    load_pattern(ownedA₂, x, t, P)
    load_pattern(ownedB₂, x, t, Q)
    compute_pattern(P, Q, x, t, ownedC₂)
Explanation of Each Concept
ownership(T, x)
Defines the region of tensor T assigned to fiber instance x.
At this tiling depth, this typically means:
- A contiguous slice in the logical 1D view of the depth’s shape.
- Example: ownership(A₂, x) might correspond to A₂[x:x+5].
Contiguous ownership is preferred because moving scattered elements from global memory into a contiguous shared memory region would cause non-coalesced loads and inefficient memory access patterns. In most cases, each fiber instance (block/warp/thread) should be assigned a logically contiguous region in memory.
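A minimal sketch of a contiguous ownership function, assuming the tensor is viewed as a flat 1D index space at this depth (the parameters flat_size and num_fibers are illustrative, not part of the interface above):

def ownership(flat_size: int, x: int, num_fibers: int) -> range:
    # Split the flat index space [0, flat_size) evenly among num_fibers
    # instances; fiber x owns one contiguous [start, end) slice.
    chunk = (flat_size + num_fibers - 1) // num_fibers  # ceiling division
    start = min(x * chunk, flat_size)
    end = min(start + chunk, flat_size)
    return range(start, end)

# e.g. ownership(18, 0, 4) -> range(0, 5); ownership(18, 3, 4) -> range(15, 18)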
load_pattern(ownedT, x, t, P)
Describes how data is loaded from an owned region ownedT into a local memory space P (for example, shared memory or registers).
The iterator t exists because:
- A fiber may own more data than the available compute resources (number of threads, size of shared memory, etc.)
- If there are more elements in ownedT than resources to process them, the work needs to be iterated over in multiple t steps.
The load pattern defines:
- How to map indices in ownedT to indices in P at each step t
- How to lay out these tiles in local memory, respecting hardware constraints like bank conflicts or coalesced accesses
At each t, it computes one or more index mappings between ownedT and P, potentially in 3D or ND depending on the depth and tile structure.
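One possible load_pattern, sketched over flat Python lists (global_mem stands in for global memory, P models a shared-memory tile, and num_threads bounds how many elements move per step; all of these names are illustrative assumptions):

def load_pattern(global_mem, owned: range, t: int, P: list, num_threads: int):
    # Step t copies the t-th num_threads-wide chunk of the owned region into P.
    base = owned.start + t * num_threads
    for lane in range(num_threads):        # one "thread" per lane
        src = base + lane
        if src < owned.stop:
            # Consecutive lanes read consecutive addresses, i.e. coalesced access.
            P[lane] = global_mem[src]

# A fiber owning 100 elements with 32 threads would call this for t = 0, 1, 2, 3.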
compute_pattern(P, Q, x, t, ownedC₂)
Describes the computation performed on local tiles P and Q at step t, affecting the corresponding region ownedC₂.
There are two important cases:
- Reduction along a shared K dimension (typical matmul)
  - Each iteration t processes a K-slice of P and Q.
  - The result is accumulated into ownedC₂.
  - This is the standard fused multiply-add behavior of matmul.
- Broadcasted matmul
  - If no reduction happens at this level, each iteration t maps a pair of Pₜ, Qₜ to a distinct part of ownedC₂.
  - In this case, the compute_pattern might be empty at this level.
  - The computation is deferred to deeper nested fibers (such as warp- or thread-level fibers) which have sufficient compute resources to perform the required operations.
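For the reduction case, a minimal sketch of a compute_pattern that accumulates one K-slice per step into the owned output region (a scalar Python toy over flat row-major tiles; M, N, K_tile are illustrative parameters, not part of the interface above):

def compute_pattern(P, Q, t, owned_C, M, N, K_tile):
    # P holds an M x K_tile slice of A, Q holds a K_tile x N slice of B.
    # Each step t contributes one partial product, accumulated into owned_C.
    for i in range(M):
        for j in range(N):
            acc = 0.0
            for k in range(K_tile):
                acc += P[i * K_tile + k] * Q[k * N + j]
            owned_C[i * N + j] += acc  # accumulate across steps t (FMA-style)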
Why the Iterator t is Needed
Fibers can be nested. A higher-level fiber (like a block fiber) may own a large region of work, for example 100 elements, but only have 32 threads available to execute it. In this case:
- The work is divided into multiple iterations, each indexed by t.
- Each iteration processes a subset of the fiber’s owned data.
- Shared memory buffers P and Q might also be smaller than the fiber’s owned region, further requiring iteration.
The iterator t parameterizes partial work within a fiber's owned region, making it possible to cover large data regions with limited compute resources.
Recap: Clean Annotated Form
fiber(x, t, P, Q, load_pattern, compute_pattern):
    // Determine the tensor slices this fiber instance 'x' owns at this tiling depth
    ownedA₂ = ownership(A₂, x)
    ownedB₂ = ownership(B₂, x)
    ownedC₂ = ownership(C₂, x)
    // Load from global memory slices owned by this fiber into local memory P and Q
    load_pattern(ownedA₂, x, t, P)
    load_pattern(ownedB₂, x, t, Q)
    // Perform compute (reduce/accumulate or deferred compute) on local tiles
    compute_pattern(P, Q, x, t, ownedC₂)
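To make the annotated form concrete, here is a small self-contained Python sketch that runs a single fiber instance over a plain (non-batched) matmul with a tiled K dimension and checks it against a naive reference. All sizes and names are illustrative assumptions, not part of the interface above:

import random

M, K, N = 4, 6, 5
K_TILE = 2                                   # K-slice processed per step t
A = [random.random() for _ in range(M * K)]  # M x K, row-major
B = [random.random() for _ in range(K * N)]  # K x N, row-major
C = [0.0] * (M * N)

def fiber(x, P, Q):
    # One fiber instance owning the whole output (x is unused in this toy).
    for t in range(K // K_TILE):             # iterate because P/Q hold only a K-slice
        # load_pattern: copy the t-th K-slice of A and of B into the local tiles
        for i in range(M):
            for k in range(K_TILE):
                P[i * K_TILE + k] = A[i * K + t * K_TILE + k]
        for k in range(K_TILE):
            for j in range(N):
                Q[k * N + j] = B[(t * K_TILE + k) * N + j]
        # compute_pattern: accumulate the partial product into the owned output
        for i in range(M):
            for j in range(N):
                for k in range(K_TILE):
                    C[i * N + j] += P[i * K_TILE + k] * Q[k * N + j]

fiber(0, [0.0] * (M * K_TILE), [0.0] * (K_TILE * N))

reference = [sum(A[i * K + k] * B[k * N + j] for k in range(K))
             for i in range(M) for j in range(N)]
assert all(abs(c - r) < 1e-9 for c, r in zip(C, reference))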
Summary of Constraints and Design Principles
- Ownership slices should ideally be contiguous in the 1D logical memory layout at each tiling depth for efficient shared memory loads.
- Load patterns define how to move these slices into local memory tiles and how to index between them.
- Iterators t allow fibers to progressively cover their ownership regions when the region size exceeds available resources.
- Compute patterns either perform a reduction along a shared dimension or defer the computation to nested fibers, depending on whether reductions happen at this level.
- In broadcasted matmuls, compute may be deferred to lower levels if tiles from A and B have not yet aligned at a reduction depth.
Final Notes
This abstraction effectively describes a composable fiber algebra for matmul operations. The system separates:
- Data ownership and partitioning at each depth
- Load scheduling into local memory
- Compute scheduling within local memory
It is modular, hierarchical, and suitable for decomposition into nested fibers at block, warp, thread, or instruction levels.
The abstraction generalizes to:
- Tensor Core operations
- Fused compute epilogues
- Mixed precision matmuls
- Auto-tuning kernel generators
- Batched and broadcasted GEMMs
The model is compatible with runtime and ahead-of-time code generation strategies, and can be extended to multi-device or multi-SM scheduling as well.
Matpiler: Nested Fiber Inheritance and Decomposition Rules
This document describes how nested fibers in a tiled matmul compilation system inherit and subdivide the work domains of their parent fibers.
Each fiber instance (block, warp, thread, etc.) operates on a subset of tensor data, loads it into local memory, and optionally performs computations.
When a fiber cannot process its entire owned region at once (due to compute or memory limits), it spawns nested fibers which inherit a subset of the parent’s region and iterate over it.
Nested Fiber Structure
A nested fiber inherits:
- Ownership slices for tensors A, B, C within its parent fiber’s domain.
- Iteration ranges within its parent’s region.
- Local memory allocations (which may be inherited or further subdivided).
- Load and compute patterns appropriate for its compute level.
Each nested fiber level typically has tighter resource constraints (e.g., fewer threads, fewer registers per thread, a fixed warp size).
Inheritance and Decomposition Rules
1. Ownership Inheritance
Each nested fiber receives a contiguous sub-region of its parent’s owned region.
The decomposition follows these rules:
- If the parent owns a contiguous slice [start:end] in tensor T, the child fibers split this range into sub-slices of equal or parameterized size.
- The number of child fibers depends on the available resources (threads per warp, registers per thread, etc.)
- The mapping strategy (contiguous or strided) should aim to preserve coalesced global and shared memory accesses.
Example:
Parent fiber owns A₂[0:64]
4 child fibers → each owns 16 elements:
- child 0: A₂[0:16]
- child 1: A₂[16:32]
- child 2: A₂[32:48]
- child 3: A₂[48:64]
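A minimal sketch of this split, using the subrange helper named in the formal recap below (the even, ceiling-divided chunking is an assumption; real decompositions may be parameterized differently):

def subrange(parent: range, child_index: int, child_count: int) -> range:
    # Carve the parent's contiguous range into child_count contiguous sub-slices.
    size = len(parent)
    chunk = (size + child_count - 1) // child_count  # ceiling division
    start = parent.start + min(child_index * chunk, size)
    end = min(start + chunk, parent.stop)
    return range(start, end)

# Matches the example above: parent owns A₂[0:64], four children
# subrange(range(0, 64), 0, 4) -> range(0, 16) ... subrange(range(0, 64), 3, 4) -> range(48, 64)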
2. Iterator Range Inheritance
Each nested fiber inherits a local iterator t, used to iterate over its owned sub-region.
If a nested fiber’s work still exceeds available compute resources, it can internally loop over t.
Example:
If a warp fiber owns 128 elements, but only has 32 threads:
- It will iterate t = 0, 1, 2, 3
- Each thread processes one element at each t value
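The number of t steps falls out of a simple ceiling division, as in this small sketch (function and parameter names are illustrative):

def t_range(owned_size: int, num_threads: int) -> range:
    # One t step per num_threads-wide chunk of the owned sub-region.
    return range((owned_size + num_threads - 1) // num_threads)

# Warp fiber from the example: 128 owned elements, 32 threads
assert list(t_range(128, 32)) == [0, 1, 2, 3]
# The element handled by thread `lane` at step t is index t * 32 + lane.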
3. Local Memory (P, Q) Allocation
- Local memory buffers (P, Q) can be inherited as:
  - A shared region partitioned among child fibers
  - Separate per-fiber allocations within shared memory
- Decomposition rules:
  - If shared memory is sufficient, each child fiber can map to a distinct contiguous segment.
  - If insufficient, nested iteration over t can tile over partial regions.
Example:
If the parent’s shared memory P has space for 128 elements and there are four child fibers:
- Each child can be assigned 32 elements of P
- Or, if only 64 elements are available, each child iterates in t over its portion
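A sketch of the first option, carving a parent buffer into per-child segments (sizes taken from the example; the helper name is illustrative):

def child_segment(parent_capacity: int, child_index: int, child_count: int) -> range:
    # Assumes the capacity divides evenly among children for simplicity.
    seg = parent_capacity // child_count
    return range(child_index * seg, (child_index + 1) * seg)

# 128-element parent buffer, four children -> 32 elements each
assert child_segment(128, 2, 4) == range(64, 96)
# With only 64 elements total, each child gets 16 and covers its owned
# region over multiple t steps instead.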
4. Load and Compute Pattern Inheritance
Each child fiber inherits:
- The parent’s load pattern restricted to its owned region
- The parent’s compute pattern, specialized for its sub-region
Child fibers may redefine:
- Tile shapes and strides
- Memory layouts (e.g., warps may load interleaved data for better bank conflict avoidance)
- Compute schedules (e.g., switching from reduction to FMA at a warp level)
5. Reduction and Broadcast Handling
At each level:
- If a reduction along a K dimension is active, the child fiber accumulates partial results locally or forwards them to a deeper fiber
- If operating on broadcasted operands:
- The fiber may choose to defer computation if operand alignment has not occurred
- Or, if alignment has occurred, perform independent matmuls in parallel
Nested fibers inherit knowledge of reduction dimensions and broadcast shapes from their parent.
Formal Recap of Inheritance
At each nested depth:
ownedA_child = subrange(ownedA_parent, child_index, child_count)
ownedB_child = subrange(ownedB_parent, child_index, child_count)
ownedC_child = subrange(ownedC_parent, child_index, child_count)
P_child and Q_child are either:
- Disjoint subregions of P_parent, or
- Shared buffers partitioned via index mapping
The t_child range is determined by:
- The t_parent range
- The size of ownedA_child and ownedB_child
- The number of threads in the child fiber
The compute pattern is inherited or specialized per child fiber as appropriate.
Summary
Nested fibers in Matpiler:
- Inherit ownership, iterators, memory, and compute strategies from their parent fiber
- Subdivide the parent’s owned region into smaller, contiguous or logically strided subregions
- Adjust iteration ranges and compute schedules to match available resources
- Propagate reduction and broadcast alignment rules to deeper levels
- Specialize load and compute patterns for each tiling depth and hardware level
This system enables efficient, hierarchical scheduling of large matmul operations across GPU block/warp/thread hierarchies while preserving clear, composable abstraction boundaries between levels.