Notes: CUDA Matrix Multiplication

October 16, 2025 - 3 mins read

Write a program that multiplies two matrices of 32-bit floating point numbers on a GPU. Given matrix A of dimensions MxN and matrix B of dimensions NxK, compute the product matrix C, which will have dimensions MxK. All matrices are stored in row-major format.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13


__global__ void matrix_multiplication_kernel(const float* A, const float* B, float* C, int M, int N, int K) {

}

// A, B, C are device pointers (i.e. pointers to memory on the GPU)
extern "C" void solve(const float* A, const float* B, float* C, int M, int N, int K) {
    dim3 threadsPerBlock(16, 16);
    dim3 blocksPerGrid((K + threadsPerBlock.x - 1) / threadsPerBlock.x,
                       (M + threadsPerBlock.y - 1) / threadsPerBlock.y);

    matrix_multiplication_kernel<<<blocksPerGrid, threadsPerBlock>>>(A, B, C, M, N, K);
    cudaDeviceSynchronize();
}

Global Indexes, revisited

Unlike the previous problem, we have multiple dimensions now. We need two global indexes:

A global row index (i)
A global column index (j)

1
2


i = (blockIdx.y * blockDim.y) + threadIdx.y
j = (blockIdx.x * blockDim.x) + threadIdx.x

With these two indicies, each thread is assigned to calculate one unique element $C_{i,j}$.

Checks for integer division

The bounds check is written as follows:

1
2
3


if (i>=M || j>=K){
  return;
}

Core Matrix Multiplication

Now that a valid thread is assigned to calculate a unique element $C_{i,j}$, we need to perform the dot product calculation of the i-th row of matrix A and the j-th column of matrix B.

$$C_{i,j}=\sum_{k=0}^{N-1}{A_{i,k}\times B_{k,j}}$$

To do this calculations, we need a loop that iterates N times with a local accumulator variable to store the dot product sum.

2D to 1D Indexing

Since the matrices are stored in row-major format, we can find the 1D index for an element $X_{r,c}$ in a matrix with $C_{dim}$ columns:

$$Index = r \times C_{dim} + c$$

Thus, we can calculate the index for $$A_{i,k}, B_{k,j}, \text{and } C_{i,j} $$:

$$ idx_{A_{i,k}} = iN+k \newline idx_{B_{k,j}} = kK+j \newline idx_{C_{i,j}} = i*K+j \newline $$

Solving

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21


__global__ void matrix_multiplication_kernel(const float* A, const float* B, float* C, int M, int N, int K) {

    // 1. Calculate i and j
    int i = (blockIdx.y * blockDim.y) + threadIdx.y;
    int j = (blockIdx.x * blockDim.x) + threadIdx.x;
    // 2. Bounds check
    if (i>=M || j>=K){
      return;
    }
    // 3. Initialize accumulator
    float d = 0.0;
    // 4. Inner loop for dot product
    for (int k=0;k<N;k++){
      float A_i_k = A[i*N+k];
      float B_k_j = B[k*K+j];
      d += A_i_k*B_k_j;
    }
    // 5. Store result in C
    C[i*K+j] = d;

}