Intro, GPU, CPU

![Support/Figures/Pasted image 20250525225113.png](/img/user/Support/Figures/Pasted%20image%2020250525225113.png)
As we can see, the CPU has fewer cores, but each core is much more capable, with its own cache and control units.

The memory hierarchy is also deeper in a CPU.
The L1 cache is closest and local to each core.
The L2 cache is one level lower, and the L3 is closest to RAM (all of them reducing the latency of memory access).

In the GPU, the yellow blocks are the thread groups, which share cores. There is no L3 cache; each row of cores (a streaming multiprocessor) has its own L1 cache, and a single L2 cache is shared by all of them in front of the DRAM.
The concept of a kernel grid is as follows:

It is an indexing abstraction in CUDA that organizes the launched threads into shapes of up to 3 dimensions.
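
As a concrete sketch of such a launch (the kernel name `whereAmI` and the chosen dimensions are made up for illustration), each thread can read off its own position in the grid:

```C
#include <stdio.h>
#include <cuda_runtime.h>

// Each thread prints which block it lives in and which thread it is inside that block.
__global__ void whereAmI(void)
{
    printf("block (%u,%u,%u) thread (%u,%u,%u)\n",
           blockIdx.x, blockIdx.y, blockIdx.z,
           threadIdx.x, threadIdx.y, threadIdx.z);
}

int main(void)
{
    dim3 gridDims(2, 2, 1);   // 2 x 2 x 1 blocks in the grid
    dim3 blockDims(4, 2, 2);  // 4 x 2 x 2 threads in each block
    whereAmI<<<gridDims, blockDims>>>();
    cudaDeviceSynchronize();  // wait so the device printf output is flushed
    return 0;
}
```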

So there is blockDim; let us say it has shape (X_block, Y_block, Z_block), the number of threads per block along each axis.
And then there is gridDim, of shape (X_grid, Y_grid, Z_grid), the number of blocks along each axis.
So if we have a TENSOR of shape (X,Y,Z)

then the X dimension is going to be split across X_grid = ceil(X / X_block) blocks, each with X_block threads.

So if I want to get the TRUE x-index,
then true_x = blockIdx.x * X_block + threadIdx.x (i.e. blockIdx.x * blockDim.x + threadIdx.x).

The same goes for true_y and true_z.
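
To make the ceil-division and the out-of-bounds guard concrete, here is a minimal self-contained sketch (the shape (10, 7, 5), the block shape, and the kernel name `countInBounds` are all made up for the example): every thread recovers its true (x, y, z), and only the threads that land inside the tensor are counted.

```C
#include <stdio.h>
#include <cuda_runtime.h>

// Each thread recovers its true (x, y, z) coordinate; threads that fall outside
// the tensor (because the block count was rounded up) are filtered by the guard.
__global__ void countInBounds(int X, int Y, int Z, int *count)
{
    int true_x = blockIdx.x * blockDim.x + threadIdx.x;
    int true_y = blockIdx.y * blockDim.y + threadIdx.y;
    int true_z = blockIdx.z * blockDim.z + threadIdx.z;
    if (true_x < X && true_y < Y && true_z < Z)
        atomicAdd(count, 1);   // only threads mapping to a real element get here
}

int main(void)
{
    int X = 10, Y = 7, Z = 5;                            // tensor shape (arbitrary)
    dim3 blockDims(4, 4, 4);                             // threads per block
    dim3 gridDims((X + blockDims.x - 1) / blockDims.x,   // ceil(X / X_block)
                  (Y + blockDims.y - 1) / blockDims.y,   // ceil(Y / Y_block)
                  (Z + blockDims.z - 1) / blockDims.z);  // ceil(Z / Z_block)

    int count = 0, *count_d;
    cudaMalloc(&count_d, sizeof(int));
    cudaMemcpy(count_d, &count, sizeof(int), cudaMemcpyHostToDevice);
    countInBounds<<<gridDims, blockDims>>>(X, Y, Z, count_d);
    cudaMemcpy(&count, count_d, sizeof(int), cudaMemcpyDeviceToHost);

    printf("in-bounds threads: %d (tensor has %d elements)\n", count, X * Y * Z);
    cudaFree(count_d);
    return 0;
}
```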

And so if A, B are n×n matrices and C is the n×n matmul result, we can hand each thread one entry C[true_x, true_y] to compute.

But under the hood, CUDA or OpenCL will store everything as a one-dimensional array.
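
A quick plain-C illustration of that flattening (the 3 × 3 values are made up): the "two-dimensional" view and the flat row-major buffer are the same memory, so (i, j) lives at offset i * n + j.

```C
#include <stdio.h>

int main(void)
{
    int n = 3;
    float flat[9] = {0, 1, 2, 3, 4, 5, 6, 7, 8};   // row-major storage
    float (*as2d)[3] = (float (*)[3])flat;          // view the same bytes as a 3 x 3 matrix

    int i = 1, j = 2;
    printf("flat[i * n + j] = %.0f, as2d[i][j] = %.0f\n",
           flat[i * n + j], as2d[i][j]);            // both print 5
    return 0;
}
```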

Let us generalize the memory access pattern math.

Let the shape be $D = (d_1, d_2, d_3, \dots, d_n)$.
Then our memory pattern is row-major,
that is, the first $d_n$ elements are going to be $A[0,0,\dots,0], A[0,0,\dots,1], \dots, A[0,0,\dots,d_n-1]$.

Notice that for each coordinate or dimension $x_i$ we have $x_i = \text{blockIdx}.x_i \cdot \text{blockDim}.x_i + \text{threadIdx}.x_i$,
so we can find all the n-dimensional indices $(x_1, x_2, x_3, \dots, x_n)$ using the corresponding blockIdx, blockDim, and threadIdx.
How do we find the corresponding 1d index of the above tuple?

Very simple: take a 3d array of shape $(d_1, d_2, d_3)$.
To go from $(0,0,0)$ to $(1,0,0)$ in index,
we need to step through a whole $d_2 \times d_3$ matrix,
so in general, to go from $(0,\dots,0)$ to $(1,0,\dots,0)$ we need to step through $\prod_{i=2}^{n} d_i$ elements, i.e. one $(n-1)$-dimensional slice.

To get to $(x_1, 0, 0, \dots, 0)$ we need to step through $x_1 \prod_{i=2}^{n} d_i$ elements. Using the same logic,

$$\mathrm{1d}(x_{1},x_{2},\dots,x_{n}) = \sum_{j=1}^{n-1}x_{j}\prod_{i=j+1}^{n}d_{i} + x_{n}$$
(Sanity check: for shape $(2,3,4)$, the index $(1,2,3)$ maps to $1 \cdot 12 + 2 \cdot 4 + 3 = 23$, the last of the $24$ elements.)

Here is a nice matmul kernel:

```C
#include <stdio.h>
#include <cuda_runtime.h>

__global__ void matmul(int n, float *a, float *b, float *c)
{
    // True (row, column) coordinate of this thread.
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;

    if (x < n && y < n) {
        int x_y_1d = x * n + y;          // 1d index of C[x, y]
        float dot_prod = 0.0f;
        for (int k = 0; k < n; k++) {
            int x_k_1d = x * n + k;      // 1d index of A[x, k]
            int k_y_1d = k * n + y;      // 1d index of B[k, y]
            dot_prod += a[x_k_1d] * b[k_y_1d];
        }
        c[x_y_1d] = dot_prod;
    }
}

void printMatrix(float *mat, int n, const char *name)
{
    printf("Matrix %s:\n", name);
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
            int i_j_1d = i * n + j;
            printf("%6.1f ", mat[i_j_1d]);
        }
        printf("\n");
    }
    printf("\n");
}

int main()
{
    int N = 8;
    int size = N * N * sizeof(float);

    float *A, *B, *C;
    float *A_d, *B_d, *C_d;

    A = (float*)malloc(size);
    B = (float*)malloc(size);
    C = (float*)malloc(size);
    for (int i = 0; i < N * N; ++i) {
        A[i] = 1.0f;
        B[i] = 2.0f;
    }

    cudaMalloc(&A_d, size);
    cudaMalloc(&B_d, size);
    cudaMalloc(&C_d, size);
    cudaMemcpy(A_d, A, size, cudaMemcpyHostToDevice);
    cudaMemcpy(B_d, B, size, cudaMemcpyHostToDevice);

    dim3 threadsPerBlock(2, 2);
    dim3 numBlocks((N + threadsPerBlock.x - 1) / threadsPerBlock.x,
                   (N + threadsPerBlock.y - 1) / threadsPerBlock.y);
    matmul<<<numBlocks, threadsPerBlock>>>(N, A_d, B_d, C_d);
    cudaDeviceSynchronize();

    cudaMemcpy(C, C_d, size, cudaMemcpyDeviceToHost);

    printMatrix(A, N, "A");
    printMatrix(B, N, "B");
    printMatrix(C, N, "C = A x B");

    cudaFree(A_d);
    cudaFree(B_d);
    cudaFree(C_d);
    free(A);
    free(B);
    free(C);
    return 0;
}
```

In the kernel, at the end of the day, every operation must be one-dimensional; as long as you are accessing 1d indices, you're good.

A hint to tiling: you can see now that to compute the matmul entry $C[i,j]$ we have to access $A[i,k]$ and $B[k,j]$ for all $N$ values of $k$ to compute the dot product. We may imagine a more efficient memory hierarchy: get a set of "workers" that hold the row $A[i]$ in shared memory, and let each worker $j$ load up $B[:,j]$, so it accesses $A[i,k]$ from the shared worker memory and $B[k,j]$ from its local worker memory. Something like that.

![Support/Figures/Pasted image 20250526011814.png](/img/user/Support/Figures/Pasted%20image%2020250526011814.png)

How would we do a broadcasted compute like this one?

```
true_x = blockIdx.x * blockDim.x + threadIdx.x
true_y = blockIdx.y * blockDim.y + threadIdx.y
true_z = blockIdx.z * blockDim.z + threadIdx.z

so then,
1d(x,y,z) = true_x * (d_y * d_z) + true_y * d_z + true_z
1d(x,y)   = true_x * d_y + true_y
1d(x)     = true_x

OUT[1d(x,y,z)] = A[1d(x,y,z)] + B[1d(x,y)] + C[1d(x)]
```
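
To answer that question, here is a minimal sketch of such a broadcasted-add kernel. The shapes and names are assumptions for the example: A and OUT have shape (d_x, d_y, d_z), B has shape (d_x, d_y), C has shape (d_x), and the kernel is called broadcastAdd. Each thread owns exactly one output element; the broadcast happens purely in which 1d index each input is read with.

```C
#include <stdio.h>
#include <cuda_runtime.h>

// OUT[x,y,z] = A[x,y,z] + B[x,y] + C[x], all stored as flat row-major buffers.
__global__ void broadcastAdd(int dx, int dy, int dz,
                             const float *A, const float *B, const float *C,
                             float *OUT)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    int z = blockIdx.z * blockDim.z + threadIdx.z;
    if (x < dx && y < dy && z < dz) {
        int xyz = x * dy * dz + y * dz + z;   // 1d(x, y, z)
        int xy  = x * dy + y;                 // 1d(x, y)
        OUT[xyz] = A[xyz] + B[xy] + C[x];     // 1d(x) is just x
    }
}

int main(void)
{
    int dx = 4, dy = 3, dz = 2;
    int nA = dx * dy * dz, nB = dx * dy, nC = dx;
    float *A, *B, *C, *OUT, *A_d, *B_d, *C_d, *OUT_d;

    A   = (float*)malloc(nA * sizeof(float));
    B   = (float*)malloc(nB * sizeof(float));
    C   = (float*)malloc(nC * sizeof(float));
    OUT = (float*)malloc(nA * sizeof(float));
    for (int i = 0; i < nA; ++i) A[i] = 1.0f;
    for (int i = 0; i < nB; ++i) B[i] = 10.0f;
    for (int i = 0; i < nC; ++i) C[i] = 100.0f;

    cudaMalloc(&A_d, nA * sizeof(float));
    cudaMalloc(&B_d, nB * sizeof(float));
    cudaMalloc(&C_d, nC * sizeof(float));
    cudaMalloc(&OUT_d, nA * sizeof(float));
    cudaMemcpy(A_d, A, nA * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(B_d, B, nB * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(C_d, C, nC * sizeof(float), cudaMemcpyHostToDevice);

    dim3 block(2, 2, 2);
    dim3 grid((dx + block.x - 1) / block.x,
              (dy + block.y - 1) / block.y,
              (dz + block.z - 1) / block.z);
    broadcastAdd<<<grid, block>>>(dx, dy, dz, A_d, B_d, C_d, OUT_d);
    cudaMemcpy(OUT, OUT_d, nA * sizeof(float), cudaMemcpyDeviceToHost);

    printf("OUT[0] = %.0f (every entry should be 1 + 10 + 100 = 111)\n", OUT[0]);

    cudaFree(A_d); cudaFree(B_d); cudaFree(C_d); cudaFree(OUT_d);
    free(A); free(B); free(C); free(OUT);
    return 0;
}
```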