Notes: CUDA Matrix Multiplication
Write a program that multiplies two matrices of 32-bit floating point numbers on a GPU. Given matrix A of dimensions MxN and matrix B of dimensions NxK, compute the product matrix C, which will have dimensions MxK. All matrices are stored in row-major format.
Global Indexes, revisited
Unlike the previous problem, we have multiple dimensions now. We need two global indexes:
- A global row index (i)
- A global column index (j)
With these two indexes, each thread is assigned to calculate one unique element $C_{i,j}$.
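A minimal sketch of how the two indexes might be derived from CUDA's built-in block and thread IDs. The kernel name `record_indexes`, its output buffers, and the choice to map the y dimension to rows and the x dimension to columns are assumptions made for illustration; the notes do not fix any of them.

```cuda
#include <cuda_runtime.h>

// Each thread records its own global (row, column) pair.
// Assumed convention: the y dimension walks rows, the x dimension walks columns.
// There is no bounds check here, so both buffers must hold one int per launched thread.
__global__ void record_indexes(int* row_out, int* col_out) {
    int i = blockIdx.y * blockDim.y + threadIdx.y;  // global row index
    int j = blockIdx.x * blockDim.x + threadIdx.x;  // global column index

    // Threads per row of the launch grid; used only to give every thread its own output slot.
    int width = gridDim.x * blockDim.x;
    row_out[i * width + j] = i;
    col_out[i * width + j] = j;
}
```

Launched as `record_indexes<<<dim3(2, 2), dim3(2, 2)>>>(rows, cols)`, the grid covers a 4x4 square of threads, and every (i, j) pair from (0, 0) to (3, 3) is produced exactly once.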
Checks for integer division
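Because the number of blocks is rounded up with integer division, some threads can fall outside the MxK output. The bounds check lets a thread continue only if $i < M$ and $j < K$.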
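The sketch below shows one way the rounded-up grid and the matching guard could look. The helper names `ceil_div` and `grid_for`, the demonstration kernel `count_in_bounds`, and the x-to-columns, y-to-rows mapping are all assumptions rather than anything the notes prescribe.

```cuda
#include <cuda_runtime.h>

// Ceiling division: the smallest number of blocks of size `block` that covers `n` items.
inline unsigned int ceil_div(int n, int block) {
    return static_cast<unsigned int>((n + block - 1) / block);
}

// Grid for an M x K output. Assumes x covers the K columns and y covers the M rows.
inline dim3 grid_for(int M, int K, dim3 block) {
    return dim3(ceil_div(K, block.x), ceil_div(M, block.y));
}

// Counts how many threads pass the bounds check; the total should equal M * K.
__global__ void count_in_bounds(int* count, int M, int K) {
    int i = blockIdx.y * blockDim.y + threadIdx.y;  // global row index
    int j = blockIdx.x * blockDim.x + threadIdx.x;  // global column index

    if (i >= M || j >= K) return;  // surplus thread created by the rounded-up grid

    atomicAdd(count, 1);
}
```

For M = 33, K = 17 and 16x16 blocks, `grid_for` gives a grid of 2 blocks in x and 3 in y, which is 1536 threads in total, of which only 33 x 17 = 561 pass the check.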
Core Matrix Multiplication
Now that a valid thread is assigned to calculate a unique element $C_{i,j}$, we need to perform the dot product calculation of the i-th row of matrix A and the j-th column of matrix B.
$$C_{i,j}=\sum_{k=0}^{N-1}{A_{i,k}\times B_{k,j}}$$
To do this calculation, we need a loop that iterates N times, with a local accumulator variable that holds the running dot product sum.
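As a small worked check of the formula, take two hypothetical 2x2 inputs (the values are chosen only for illustration):
$$A=\begin{pmatrix}1 & 2\\ 3 & 4\end{pmatrix},\quad B=\begin{pmatrix}5 & 6\\ 7 & 8\end{pmatrix}$$
With $N=2$, the accumulator for $C_{0,1}$ starts at 0 and is updated twice, once per value of $k$:
$$C_{0,1}=A_{0,0}\times B_{0,1}+A_{0,1}\times B_{1,1}=1\times 6+2\times 8=22$$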
2D to 1D Indexing
Since the matrices are stored in row-major format, we can find the 1D index for an element $X_{r,c}$ in a matrix with $C_{dim}$ columns:
$$Index = r \times C_{dim} + c$$
Thus, we can calculate the index for $A_{i,k}$, $B_{k,j}$, and $C_{i,j}$:
$$\begin{aligned} idx_{A_{i,k}} &= i \times N + k \\ idx_{B_{k,j}} &= k \times K + j \\ idx_{C_{i,j}} &= i \times K + j \end{aligned}$$
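The row-major formula can be wrapped in a small helper. The name `flat_index` and its parameter names are hypothetical, and this is only one way to package the arithmetic.

```cuda
#include <cuda_runtime.h>

// Row-major flat index of element (r, c) in a matrix with `cols` columns.
// Marked __host__ __device__ so the same helper can be called from kernels and from host code.
__host__ __device__ inline int flat_index(int r, int c, int cols) {
    return r * cols + c;
}
```

With M = 2, N = 3, K = 4, element $A_{1,2}$ sits at `flat_index(1, 2, 3)` = 5, $B_{2,1}$ at `flat_index(2, 1, 4)` = 9, and $C_{1,3}$ at `flat_index(1, 3, 4)` = 7. Note that the column count passed in is N for A but K for both B and C.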
Solving
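Putting the pieces together, one possible solution is sketched below. It is not necessarily the intended one: the names `matmul_kernel` and `matmul`, the 16x16 block shape, and the requirement that all three pointers are already device-accessible are assumptions.

```cuda
#include <cuda_runtime.h>

// One thread per element of C. A is M x N, B is N x K, C is M x K, all row-major.
__global__ void matmul_kernel(const float* A, const float* B, float* C,
                              int M, int N, int K) {
    int i = blockIdx.y * blockDim.y + threadIdx.y;  // global row index
    int j = blockIdx.x * blockDim.x + threadIdx.x;  // global column index

    if (i >= M || j >= K) return;  // bounds check for the rounded-up grid

    float sum = 0.0f;  // local accumulator for the dot product
    for (int k = 0; k < N; ++k) {
        sum += A[i * N + k] * B[k * K + j];
    }
    C[i * K + j] = sum;
}

// Hypothetical host wrapper. A, B, and C must point to device-accessible memory.
void matmul(const float* A, const float* B, float* C, int M, int N, int K) {
    dim3 block(16, 16);  // assumed block shape: 256 threads per block
    dim3 grid((K + block.x - 1) / block.x,   // x covers the K columns of C
              (M + block.y - 1) / block.y);  // y covers the M rows of C
    matmul_kernel<<<grid, block>>>(A, B, C, M, N, K);
    cudaDeviceSynchronize();  // wait for the kernel before the caller reads C
}
```

Each thread reads one row of A and one column of B straight from global memory, which keeps the code close to the formula. Faster variants exist, but they are outside the scope of these notes.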