As we can see, the CPU has fewer cores, but each core is much more capable, with its own cache and control units.
The memory hierarchy is also deeper in a CPU.
L1 cache is closest and local to each core.
L2 cache is one level lower, and L3 is closest to RAM (each level of the hierarchy reducing the latency of memory access).
In the GPU, the yellow blocks are the thread groups, which share cores. There is no L3 cache; instead, each row of cores (a "streaming multiprocessor") has its own L2 cache.
The concept of a kernel grid is as follows:
It is a thread-indexing abstraction in CUDA that allows shapes of up to three dimensions.
So there is gridDim, the number of blocks along each axis; let us say it has shape (X_blocks, Y_blocks, Z_blocks).
And then there is blockDim, the number of threads per block along each axis, of shape (x_threads, y_threads, z_threads).
So if we have a TENSOR of shape (X,Y,Z)
then the X dimension is going to be split across X_blocks blocks, each with ceil(X / X_blocks) threads.
So if I want to get the TRUE x-index,
then true_x = blockIdx.x * blockDim.x + threadIdx.x, where blockDim.x = ceil(X / X_blocks) is the number of threads per block along x.
The same goes for true_y and true_z.
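A minimal sketch under the conventions above (the kernel name, tensor shape, and launch numbers are illustrative assumptions, not from the text): each thread recovers its global index, and the launch picks the block counts first and derives threads per block by ceil division.

```cuda
__global__ void index_demo(int X, int Y, int Z) {
    // Recover the TRUE global indices from block and thread coordinates.
    int true_x = blockIdx.x * blockDim.x + threadIdx.x;
    int true_y = blockIdx.y * blockDim.y + threadIdx.y;
    int true_z = blockIdx.z * blockDim.z + threadIdx.z;
    // Guard: the grid may be padded past the tensor shape by the ceil division.
    if (true_x < X && true_y < Y && true_z < Z) {
        // ... work on element (true_x, true_y, true_z) of the tensor ...
    }
}

int main() {
    int X = 64, Y = 32, Z = 16;                    // tensor shape (X, Y, Z)
    dim3 blocks(8, 4, 2);                          // (X_blocks, Y_blocks, Z_blocks)
    dim3 threads((X + blocks.x - 1) / blocks.x,    // ceil(X / X_blocks) threads per block
                 (Y + blocks.y - 1) / blocks.y,
                 (Z + blocks.z - 1) / blocks.z);
    index_demo<<<blocks, threads>>>(X, Y, Z);
    cudaDeviceSynchronize();
    return 0;
}
```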
And so if A, B are n x n matrices and C is the n x n matmul result, each thread can use its (true_x, true_y) pair to compute one element of C.
But under the hood, CUDA or OpenCL will store everything as a 1-dimensional array.
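As a sketch of how this plays out (the kernel name and variable names are illustrative, not from the text): each thread computes one entry of C, and all three matrices are indexed as flat, row-major 1D arrays.

```cuda
// Naive n x n matmul: one thread per output element C[row][col],
// with A, B, C stored as flat 1D arrays in row-major order.
__global__ void matmul(const float* A, const float* B, float* C, int n) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;  // true_y
    int col = blockIdx.x * blockDim.x + threadIdx.x;  // true_x
    if (row < n && col < n) {
        float acc = 0.0f;
        for (int k = 0; k < n; ++k) {
            // A[row][k] lives at row * n + k, B[k][col] lives at k * n + col
            acc += A[row * n + k] * B[k * n + col];
        }
        C[row * n + col] = acc;  // C[row][col] -> row * n + col
    }
}
```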
Let us generalize the memory-access-pattern math.
Let D = (d_1, d_2, d_3, ..., d_n) be the shape of the tensor.
Then our memory pattern is row major,
that is, the first d_n elements in memory are (0, 0, ..., 0, 0) through (0, 0, ..., 0, d_n - 1): the last dimension varies fastest.
Notice that for each coordinate (dimension), we can find the n-dimensional index (x_1, x_2, x_3, ..., x_n) using the corresponding blockIdx, blockDim, and threadIdx.
How do we find the corresponding 1D index of the above tuple?
Very simple: take a 3D array of shape (d_1, d_2, d_3).
To go from index (0, 0, 0) to (1, 0, 0),
we need to step through a whole d_2 x d_3 matrix, i.e. d_2 * d_3 elements.
So in general, to increment x_1 by one we step through an (n-1)-dimensional tensor of d_2 * d_3 * ... * d_n elements;
to increment x_2 by one we step through d_3 * ... * d_n elements. Using the same logic,

flat_index = x_1 * (d_2 * d_3 * ... * d_n) + x_2 * (d_3 * ... * d_n) + ... + x_{n-1} * d_n + x_n
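A small host-side helper, as a sketch (the function name and use of std::vector are assumptions for illustration), that computes this row-major flat index by accumulating strides from the innermost dimension outward:

```cuda
#include <vector>
#include <cstddef>

// flat(x_1, ..., x_n) = x_1*(d_2*...*d_n) + x_2*(d_3*...*d_n) + ... + x_{n-1}*d_n + x_n
size_t flat_index(const std::vector<size_t>& shape,   // (d_1, d_2, ..., d_n)
                  const std::vector<size_t>& idx) {   // (x_1, x_2, ..., x_n)
    size_t flat = 0;
    size_t stride = 1;
    // Walk from the innermost dimension outward, accumulating strides.
    for (size_t i = shape.size(); i-- > 0; ) {
        flat += idx[i] * stride;
        stride *= shape[i];
    }
    return flat;
}
// Example: for shape (d_1, d_2, d_3), index (1, 0, 0) maps to 1 * d_2 * d_3,
// i.e. one full d_2 x d_3 matrix is stepped over, matching the 3D case above.
```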