KaryaMorph is in its Cocoon

KaryaMorph: Building Infra for Understanding GPUs

I’ve given a name to my north star problem. Karya means “work” in Sanskrit, and Morph is for morphisms, of course. (The heart of math imo)

Mostly because it’s kinda hard to keep repeating:

“A (mathematical then programming) language to model the reduced instruction set and simplified SIMT machine needed for the kernel of a single operation, where the human chooses and models the instruction set and the tiny SIMT machine, and as long as it is state-able in this language, a tractable search space will be exposed towards speed.”

I wrote this blog because I realized I won’t have an MVD (minimum viable demonstration) for a long time. This is theory land for now, though I’ll keep testing PTX instructions and trying to understand them. But if I ever hope to figure out the math parts of the modeling, I need to understand GPUs deeply, maybe even jump down to an abstraction lower than PTX.

For the next few months, I’ll read everything I can find about GPUs. Whenever I learn something new, I’ll try to mathematically model it and test it in code. I need to read, collect, understand, test, and model, really soak myself in. Iteration is key: read → model → code_test → understand → read.

One reason I gravitated to math is the extremely easy access: you only need pen and paper, or even just thought. But for my ADHD brain, which screams when there isn't enough stimulation, testing a PTX instruction involves way more friction.

Here’s what “zero infra” looks like:

That’s a lot, just to see how an instruction works. And I’ve procrastinated on PTX by doing Ring Theory instead, because of that friction. The only solution is to reduce and automate this process.


I need to build some infra

Until KaryaMorph emerges from its cocoon (a year, maybe two), I’m happy to use NVIDIA 50 series cards as my anchor. Even when the 60 series rolls out, data-center cards come first, then consumer, then cheap cloud availability. By then I’ll have training data on the 50 series and can treat the 60 series as test/dev.
I’d also like to test on other chips. AMD, SERIOUSLY, JUST PUT A CRAPLOAD OF VRAM INTO YOUR CONSUMER CARDS OR SOMETHING, IDK, I'LL BUY!

Reasons:

So what infra do I want? The CUDA Driver API and PyCUDA expose device info, kernel launch parameters, compiler arguments, compute-sanitizer output, error messages, etc. There’s so much to understand if I can just move past the friction and neatly dump it all.
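
To make "neatly dump it all" concrete, here's a minimal sketch of the idea: take whatever attribute mapping the driver hands back (PyCUDA's `Device.get_attributes()` returns one) and render it as one aligned, sorted dump instead of hand-picking fields. The function name and the sample values below are hypothetical stand-ins, not real query results or a real API.

```python
# Sketch: organize a device-attribute mapping into one readable dump.
# In the real infra the dict would come from something like
# pycuda.driver.Device(0).get_attributes(); here we use made-up
# sample values so the sketch runs without a GPU.

def dump_device_info(attrs: dict) -> str:
    """Render an attribute mapping as an aligned, alphabetically sorted dump."""
    width = max(len(str(k)) for k in attrs)
    return "\n".join(f"{str(k):<{width}}  {v}" for k, v in sorted(attrs.items(), key=lambda kv: str(kv[0])))

if __name__ == "__main__":
    sample = {  # hypothetical values, for illustration only
        "MULTIPROCESSOR_COUNT": 84,
        "MAX_THREADS_PER_BLOCK": 1024,
        "WARP_SIZE": 32,
        "MAX_SHARED_MEMORY_PER_BLOCK": 49152,
    }
    print(dump_device_info(sample))
```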

What I’ll probably build is a GUI thingy (mouse go clicky, color pretty). A main tile to write PTX (syntax highlighting is weak for new PTX, sadge). Another tile for kernel launch options, tensor allocs, and arguments; maybe not even GUI, just a slim Python driver wrapping PyCUDA/C++ bindings with much less boilerplate.
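
A rough sketch of what that slim driver's launch-options piece could look like: a small dataclass that holds the launch config and validates it before anything touches the driver. The class and field names are my own invention, not a PyCUDA API; the 1024-threads-per-block cap is the usual limit on recent NVIDIA cards.

```python
# Sketch of the slim-driver idea: launch options as plain data with
# upfront validation, so mistakes surface before a kernel launch.
# Names here are hypothetical, not PyCUDA's.

from dataclasses import dataclass

@dataclass
class LaunchConfig:
    grid: tuple = (1, 1, 1)
    block: tuple = (128, 1, 1)
    shared_bytes: int = 0

    def threads_per_block(self) -> int:
        x, y, z = self.block
        return x * y * z

    def validate(self, max_threads: int = 1024) -> None:
        # 1024 is the common per-block thread cap on recent NVIDIA GPUs.
        if self.threads_per_block() > max_threads:
            raise ValueError(
                f"{self.threads_per_block()} threads per block exceeds {max_threads}"
            )

if __name__ == "__main__":
    cfg = LaunchConfig(grid=(64, 1, 1), block=(256, 1, 1))
    cfg.validate()  # 256 <= 1024, passes
```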

The most important piece is the dashboard tile. It should show:

The idea: write one kernel, set launch params, then compile and run multiple times in order:

  1. check syntax/compiler correctness
  2. check out-of-bounds, races, sync issues
  3. check correctness against a torch equivalent
  4. enable optimizations and do timed run
  5. collect as many kernel metrics as possible
  6. dump SASS
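
The six-pass flow above could be sketched as a tiny pipeline runner: each pass is a named stage, run in order, with every result recorded so the dashboard can show all of it. The stage bodies below are placeholders; in the real thing they'd shell out to the compiler, compute-sanitizer, a torch reference run, and so on.

```python
# Sketch: the compile-and-run-multiple-times flow as an ordered
# pipeline whose results all land in one record for the dashboard.
# Stage names and bodies are hypothetical placeholders.

from typing import Callable

def run_pipeline(stages: list) -> dict:
    """Run (name, fn) stages in order, collecting every result."""
    results = {}
    for name, fn in stages:
        results[name] = fn()
    return results

stages = [
    ("compile_check", lambda: "ok"),       # 1. syntax/compiler correctness
    ("sanitize", lambda: "no races"),      # 2. out-of-bounds, races, sync
    ("torch_compare", lambda: "match"),    # 3. correctness vs torch
    ("timed_run", lambda: "1.23 ms"),      # 4. optimized, timed
    ("metrics", lambda: "collected"),      # 5. kernel metrics
    ("sass_dump", lambda: "sass.txt"),     # 6. SASS
]
report = run_pipeline(stages)
```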

All of this should be collected automatically. I’ll still choose launch params and allocate tensors, of course, but I don’t want to hand-pick every piece of info to dump and write code for it. I want everything, neatly organized in a dashboard. Even if multiple compiles/runs are slower, the reduced friction will keep me moving, and it’ll be a lot faster than my impatient ass trying to write boilerplate every single time.

And I’d rather not prompt an expensive-ass LLM over and over until it builds this infra for me in its entirety. I’ll use it for reference, corrections, small code bits, and styling. But I want to understand exactly what info I can dump and what it means, and build it myself to some extent. Maybe I’ll even become a real programmer if I do enough of these.


Reading

Before (and/or during) any of that, I’ve got a ton of reading. Things are aligning in surprising ways. I’m quitting certain stimulants and nicotine gums, tired and in withdrawal, but the slower pace might actually help. Reading feels nice, relaxing, and necessary.

And relevant resources keep popping up. For example, my good friend Suraj, one of the goated programmers, hackers, and human beings I know, reposted evanlinn’s tiny TPU project. I saw it the same midnight I was crying about not finding transparent stuff on GPU-like chips.

Then Suraj sent me this. TLA+, created by Leslie Lamport, lets you mathematically model abstractions of distributed systems. Perfect for me to soak up methods of modeling “computers” with math, from one of the absolute legends in this area.

There will be no more blogs for a long time. I’ll retreat into my internal world, rest, and read.

I don't care whether I succeed or not; I'll take this to the end. Even the reason why I couldn't make it work would be just as fascinating, and I'll come out having learnt a lot. Hesitation is defeat.