Notes: CUDA Matrix Transpose

October 16, 2025 - 2 mins read

Write a program that transposes a matrix of 32-bit floating point numbers on a GPU. The transpose of a matrix switches its rows and columns. Given a matrix A of dimensions $rows \times cols$, the transpose $A^T$ will have dimensions $cols \times rows$. All matrices are stored in row-major format.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15


#include <cuda_runtime.h>

__global__ void matrix_transpose_kernel(const float* input, float* output, int rows, int cols) {

}

// input, output are device pointers (i.e. pointers to memory on the GPU)
extern "C" void solve(const float* input, float* output, int rows, int cols) {
    dim3 threadsPerBlock(16, 16);
    dim3 blocksPerGrid((cols + threadsPerBlock.x - 1) / threadsPerBlock.x,
                       (rows + threadsPerBlock.y - 1) / threadsPerBlock.y);

    matrix_transpose_kernel<<<blocksPerGrid, threadsPerBlock>>>(input, output, rows, cols);
    cudaDeviceSynchronize();
}

Background

Performing a matrix transpose of Matrix A: $$A^t_{i,j}=A_{j,i}$$

Implementation

Global Indexing

This should be pretty simple to implement, we calculate the global indexes:

1
2


i = (blockIdx.y * blockDim.y) + threadIdx.y
j = (blockIdx.x * blockDim.x) + threadIdx.x

Index Calculations

we can perform the 2D to 1D index calculation as well:

$$ A_{j,i} = i*cols+j \newline $$

we want to store this variable into

$$ A^t_{j,i} = j*rows+i $$

Bounds Checks

i must not exceed N, and j must not exceed M.

1
2
3


if (j>=cols || i>=rows){
  return;
}

This deviates from what we did for the earlier two problems, as the threads cover the output matrix of dims cols*rows.

Solve

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13


__global__ void matrix_transpose_kernel(const float* input, float* output, int rows, int cols) {
    // 1. calculate global indexes
    int j = blockIdx.x * blockDim.x + threadIdx.x;
    int i = blockIdx.y * blockDim.y + threadIdx.y;

    // 2. bound checks
    if (j>=cols || i>=rows){
        return;
    }

    // 3. perform transpose
    output[j*rows+i] = input[i*cols+j];
}