Matpiler: Tiled Matmul Compilation via Fibers

We are computing a matmul over tensors A, B, and C, where C = A × B.

At depth 2, the active shapes are the depth-2 tile views A₂, B₂, and C₂ of these tensors.

Fiber Definition

A Fiber is a modular computational unit deployed at a particular tiling level (block, warp, thread, etc.).
It describes:

  • ownership: which region of each tensor the fiber instance is responsible for,
  • load_pattern: how owned data is moved into a local memory space,
  • compute_pattern: what computation is performed on the local tiles.

The system is organized around decomposing tensor operations into fibers at different depths, each responsible for a contiguous or logically strided region of work, with memory movement and computation explicitly described.


Pseudocode-like Fiber Interface

fiber(x, t, P, Q, load_pattern, compute_pattern):
    ownedA₂ = ownership(A₂, x)
    ownedB₂ = ownership(B₂, x)
    ownedC₂ = ownership(C₂, x)

    load_pattern(ownedA₂, x, t, P)
    load_pattern(ownedB₂, x, t, Q)

    compute_pattern(P, Q, x, t, ownedC₂)

Explanation of Each Concept

ownership(T, x)

Defines the region of tensor T assigned to fiber instance x.
At this tiling depth, this typically means a contiguous slice of T assigned to x (for example, a contiguous row-block of A₂).

Contiguous ownership is preferred because moving scattered elements from global memory into a contiguous shared memory region would cause non-coalesced loads and inefficient memory access patterns. In most cases, each fiber instance (block/warp/thread) should be assigned a logically contiguous region in memory.
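
As a minimal sketch of contiguous ownership (Python; the 1-D ownership helper and the num_fibers parameter are illustrative assumptions, not part of the interface above):

def ownership(tensor_len, x, num_fibers):
    # Assign fiber instance x a contiguous, non-overlapping slice of a
    # length-`tensor_len` axis. Even division is assumed for simplicity.
    chunk = tensor_len // num_fibers
    return range(x * chunk, (x + 1) * chunk)

# 4 fiber instances over a 64-element axis: instance 1 owns [16, 32)
assert ownership(64, 1, 4) == range(16, 32)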


load_pattern(ownedT, x, t, P)

Describes how data is loaded from an owned region ownedT into a local memory space P (for example, shared memory or registers).

The iterator t exists because the owned region may be larger than what the fiber can stage or process in a single step, so the load proceeds over multiple iterations (see "Why the Iterator t is Needed" below).

The load pattern defines:

  • the order in which elements of ownedT are visited across iterations of t,
  • the index mapping from positions in ownedT to positions in P.

At each t, it computes one or more index mappings between ownedT and P, potentially in 3D or ND depending on the depth and tile structure.
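
A minimal Python sketch of such a load pattern, assuming a 1-D owned region and a local buffer P smaller than that region, so t selects which chunk is staged:

def load_pattern(ownedT, x, t, P):
    # Stage chunk t of the owned region into local memory P.
    # Assumed 1-D index mapping: P[i] <- ownedT[t * len(P) + i].
    base = t * len(P)
    for i in range(len(P)):
        P[i] = ownedT[base + i]

owned = list(range(100, 164))   # a 64-element owned region
P = [0] * 16                    # local buffer holds 16 elements at a time
load_pattern(owned, x=0, t=2, P=P)
assert P[0] == owned[32]        # chunk 2 starts at owned index 2 * 16 = 32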


compute_pattern(P, Q, x, t, ownedC₂)

Describes the computation performed on local tiles P and Q at step t, affecting the corresponding region ownedC₂.

There are two important cases:

  1. Reduction along a shared K dimension (typical matmul)

    • Each iteration t processes a K-slice of P and Q.
    • The result is accumulated into ownedC₂.
    • This is the standard fused multiply-add behavior of matmul.
  2. Broadcasted matmul

    • If no reduction happens at this level, each iteration t maps a pair of Pₜ, Qₜ to a distinct part of ownedC₂.
    • In this case, the compute_pattern might be empty at this level.
    • The computation is deferred to deeper nested fibers (such as warp- or thread-level fibers) which have sufficient compute resources to perform the required operations.
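
A minimal Python sketch of case 1 above (reduction along K), assuming P holds an M × K_tile slice of A and Q a K_tile × N slice of B, with each step t accumulating one K-slice into ownedC:

import numpy as np

def compute_pattern(P, Q, x, t, ownedC):
    # Case 1: fused multiply-add of the K-slice processed at step t.
    ownedC += P @ Q

M, K, N, K_tile = 4, 8, 4, 2
A, B = np.random.rand(M, K), np.random.rand(K, N)
C = np.zeros((M, N))
for t in range(K // K_tile):    # each t processes one K-slice of P and Q
    compute_pattern(A[:, t*K_tile:(t+1)*K_tile],
                    B[t*K_tile:(t+1)*K_tile, :], x=0, t=t, ownedC=C)
assert np.allclose(C, A @ B)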

Why the Iterator t is Needed

Fibers can be nested. A higher-level fiber (like a block fiber) may own a large region of work, for example 100 elements, but only have 32 threads available to execute it. In this case, the fiber loops over t, processing 32 elements per iteration, so ⌈100/32⌉ = 4 iterations cover the region (the last one only partially active).

The iterator t parameterizes partial work within a fiber's owned region, making it possible to cover large data regions with limited compute resources.
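
Concretely, for the 100-element, 32-thread example above (a sketch; the sequential inner loop stands in for 32 parallel threads):

import math

region_size, num_threads = 100, 32
num_steps = math.ceil(region_size / num_threads)   # 4 iterations of t

for t in range(num_steps):
    for thread_id in range(num_threads):           # parallel lanes in hardware
        idx = t * num_threads + thread_id          # element within the owned region
        if idx < region_size:                      # final iteration is only partially active
            pass  # process element idx here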


Recap: Clean Annotated Form

fiber(x, t, P, Q, load_pattern, compute_pattern):
    // Determine the tensor slices this fiber instance 'x' owns at this tiling depth
    ownedA₂ = ownership(A₂, x)
    ownedB₂ = ownership(B₂, x)
    ownedC₂ = ownership(C₂, x)

    // Load from global memory slices owned by this fiber into local memory P and Q
    load_pattern(ownedA₂, x, t, P)
    load_pattern(ownedB₂, x, t, Q)

    // Perform compute (reduce/accumulate or deferred compute) on local tiles
    compute_pattern(P, Q, x, t, ownedC₂)
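
Putting the pieces together, a minimal runnable Python model of this interface (the 1-D row-block ownership, the num_fibers parameter, and the per-slice copies into P and Q are illustrative assumptions):

import numpy as np

def fiber(x, num_fibers, A, B, C, K_tile):
    # ownership: this instance owns a contiguous row-block of A and C;
    # B is shared in full at this depth.
    rows = A.shape[0] // num_fibers
    ownedA = A[x * rows:(x + 1) * rows, :]
    ownedB = B
    ownedC = C[x * rows:(x + 1) * rows, :]

    for t in range(A.shape[1] // K_tile):
        # load_pattern: stage the t-th K-slice into local tiles P and Q
        P = ownedA[:, t * K_tile:(t + 1) * K_tile].copy()
        Q = ownedB[t * K_tile:(t + 1) * K_tile, :].copy()
        # compute_pattern: accumulate (fused multiply-add) into ownedC
        ownedC += P @ Q

M, K, N, K_tile, num_fibers = 8, 16, 8, 4, 4
A, B = np.random.rand(M, K), np.random.rand(K, N)
C = np.zeros((M, N))
for x in range(num_fibers):     # fiber instances would run in parallel on a GPU
    fiber(x, num_fibers, A, B, C, K_tile)
assert np.allclose(C, A @ B)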

Summary of Constraints and Design Principles

  • Ownership at each tiling depth should be logically contiguous so that memory accesses coalesce.
  • Data movement (load_pattern) and computation (compute_pattern) are described separately and explicitly.
  • The iterator t covers owned regions that exceed a fiber's compute or memory resources.
  • Computation may happen at the current level (reduction/accumulation) or be deferred to deeper nested fibers.


Final Notes

This abstraction effectively describes a composable fiber algebra for matmul operations. The system separates:

  • ownership: which data each fiber instance is responsible for,
  • data movement: the load_pattern,
  • computation: the compute_pattern.

It is modular, hierarchical, and suitable for decomposition into nested fibers at block, warp, thread, or instruction levels.
The model is compatible with runtime and ahead-of-time code generation strategies, and the abstraction generalizes to multi-device or multi-SM scheduling as well.

Matpiler: Nested Fiber Inheritance and Decomposition Rules

This document describes how nested fibers in a tiled matmul compilation system inherit and subdivide the work domains of their parent fibers.
Each fiber instance (block, warp, thread, etc.) operates on a subset of tensor data, loads it into local memory, and optionally performs computations.
When a fiber cannot process its entire owned region at once (due to compute or memory limits), it spawns nested fibers which inherit a subset of the parent’s region and iterate over it.


Nested Fiber Structure

A nested fiber inherits:

  • a contiguous sub-region of its parent's owned tensor slices,
  • a local iterator range over that sub-region,
  • a partition of the parent's local memory (P, Q),
  • the parent's load and compute patterns (possibly specialized),
  • knowledge of the reduction dimensions and broadcast shapes.

Each nested fiber level typically has tighter resource constraints (e.g., fewer threads, smaller registers, fixed warp size).


Inheritance and Decomposition Rules

1. Ownership Inheritance

Each nested fiber receives a contiguous sub-region of its parent's owned region.
The decomposition follows these rules:

  • child regions are contiguous,
  • child regions are pairwise disjoint,
  • together, the child regions exactly cover the parent's region.

Example:

Parent fiber owns A₂[0:64]
4 child fibers → each owns 16 elements:
  - child 0: A₂[0:16]
  - child 1: A₂[16:32]
  - child 2: A₂[32:48]
  - child 3: A₂[48:64]
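
A Python sketch of this subdivision rule (even division assumed, matching the example):

def subdivide(parent, num_children):
    # Split a parent's owned region into contiguous, disjoint child regions
    # that together cover exactly the parent's region.
    chunk = len(parent) // num_children
    return [range(parent.start + k * chunk, parent.start + (k + 1) * chunk)
            for k in range(num_children)]

children = subdivide(range(0, 64), 4)
assert children[2] == range(32, 48)   # child 2 owns A₂[32:48], as above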

2. Iterator Range Inheritance

Each nested fiber inherits a local iterator t, used to iterate over its owned sub-region.
If a nested fiber’s work still exceeds available compute resources, it can internally loop over t.

Example:

If a warp fiber owns 128 elements but only has 32 threads:

  • it loops over t = 0, 1, 2, 3,
  • each iteration processes 32 elements (one element per thread), covering all 128 = 4 × 32 elements.


3. Local Memory (P, Q) Allocation

Local memory tiles P and Q are partitioned among child fibers, mirroring the ownership decomposition.

Example:

If the parent's shared memory P has space for 128 elements and there are four child fibers:

  • each child receives a 32-element slice of P,
  • child k uses P[32k : 32(k+1)], so child slices are disjoint windows into the parent's buffer.
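
A sketch of this partitioning using array views, so that children write into disjoint windows of a single parent buffer rather than private copies:

import numpy as np

parent_P = np.zeros(128)                      # parent's shared-memory tile
num_children = 4
slice_len = parent_P.size // num_children     # 32 elements per child

def child_view(k):
    # Child k's disjoint window into the parent's buffer (a view, not a copy).
    return parent_P[k * slice_len:(k + 1) * slice_len]

child_view(1)[:] = 1.0                        # child 1 writes only its own slice
assert parent_P[32:64].sum() == 32.0 and parent_P[:32].sum() == 0.0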


4. Load and Compute Pattern Inheritance

Each child fiber inherits:

  • the parent's load_pattern and compute_pattern by default.

Child fibers may redefine:

  • the index mappings used by load_pattern, adapted to the child's smaller sub-region,
  • the compute_pattern, specialized to the compute resources available at the child's level.


5. Reduction and Broadcast Handling

At each level:

  • a reduction dimension (such as K) is either accumulated at that level or deferred to deeper fibers,
  • broadcast dimensions map distinct (Pₜ, Qₜ) pairs onto distinct parts of the output.

Nested fibers inherit knowledge of reduction dimensions and broadcast shapes from their parent.
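
One way to model this inherited knowledge (the FiberSpec record is a hypothetical illustration, not part of the document's interface):

from dataclasses import dataclass

@dataclass(frozen=True)
class FiberSpec:
    reduce_dims: tuple      # dimensions accumulated at some level, e.g. ("K",)
    broadcast_dims: tuple   # dimensions mapped to distinct output regions

parent = FiberSpec(reduce_dims=("K",), broadcast_dims=())
child = FiberSpec(parent.reduce_dims, parent.broadcast_dims)   # inherited verbatim
assert child.reduce_dims == ("K",)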


Formal Recap of Inheritance

At each nested depth:

  • each child's owned region is a sub-region of its parent's owned region,
  • the child regions are pairwise disjoint and together cover the parent's region exactly,
  • each child's iterator range covers exactly its own sub-region.

The compute pattern is inherited or specialized per child fiber as appropriate.
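
These invariants can be written as executable checks (a sketch over 1-D index ranges, using the decomposition from the Rule 1 example):

def check_decomposition(parent, children):
    # Exact cover with no duplicate indices implies the children are pairwise
    # disjoint and that each child is a sub-region of the parent.
    covered = sorted(i for c in children for i in c)
    assert covered == list(parent), "children must exactly tile the parent region"

check_decomposition(range(0, 64),
                    [range(0, 16), range(16, 32), range(32, 48), range(48, 64)])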


Summary

Nested fibers in Matpiler:

  • inherit contiguous, disjoint sub-regions of their parent's ownership,
  • iterate over a local t whenever owned work exceeds available resources,
  • partition the parent's local memory (P, Q) among themselves,
  • inherit or specialize the parent's load and compute patterns.

This system enables efficient, hierarchical scheduling of large matmul operations across GPU block/warp/thread hierarchies while preserving clear, composable abstraction boundaries between levels.