Notes: CUDA Matrix Transpose
-
2 mins read
Write a program that transposes a matrix of 32-bit floating point numbers on a GPU. The transpose of a matrix switches its rows and columns. Given a matrix A of dimensions $rows \times cols$, the transpose $A^T$ will have dimensions $cols \times rows$. All matrices are stored in row-major format.
|
|
Background
Performing a matrix transpose of Matrix A: $$A^t_{i,j}=A_{j,i}$$
Implementation
Global Indexing
This should be pretty simple to implement, we calculate the global indexes:
|
|
Index Calculations
we can perform the 2D to 1D index calculation as well:
$$ A_{j,i} = i*cols+j \newline $$
we want to store this variable into
$$ A^t_{j,i} = j*rows+i $$
Bounds Checks
i must not exceed N, and j must not exceed M.
|
|
This deviates from what we did for the earlier two problems, as the threads cover the output matrix of dims cols*rows.
Solve
|
|