The kernel function, vector_add, is executed in parallel by many threads. The goal is for one thread to handle one element of the vector.
Grid, blocks, threads
A grid contains blocks, and each block contains threads.
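For example, a kernel launched with <<<4, 256>>> runs on a grid of 4 blocks, each containing 256 threads, for 1024 threads in total.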
Global Index
Each thread in the kernel has a unique global index. This index tells the thread which specific element of the vector it should process.
The global index can be calculated with built-in CUDA variables:
threadIdx.x (thread's index within its block)
blockIdx.x (block's index within its grid)
blockDim.x (number of threads per block)
idx = (blockIdx.x * blockDim.x) + threadIdx.x
We multiply blockIdx.x by blockDim.x because thread 0 of block X is offset by the X * blockDim.x threads preceding it.
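For example, with blockDim.x = 256, thread 3 of block 2 gets idx = 2 * 256 + 3 = 515.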
Integer division
Often, more threads are launched than there are elements (N), because the block count is rounded up with integer division ((N + threadsPerBlock - 1) / threadsPerBlock) so that every element is covered.
e.g. if 1024 threads are launched but N is smaller, the threads with idx >= N must exit immediately:
if (idx >= N) {
return;
}
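For example, with N = 1000 and 256 threads per block, blocksPerGrid = (1000 + 256 - 1) / 256 = 4 because integer division truncates, so 4 * 256 = 1024 threads are launched and the 24 surplus threads return at the bounds check without touching memory.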
Solving
#include <cuda_runtime.h>

__global__ void vector_add(const float* A, const float* B, float* C, int N) {
}

// A, B, C are device pointers (i.e. pointers to memory on the GPU)
extern "C" void solve(const float* A, const float* B, float* C, int N) {
    int threadsPerBlock = 256;
    int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;
    vector_add<<<blocksPerGrid, threadsPerBlock>>>(A, B, C, N);
    cudaDeviceSynchronize();
}
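The comment above says A, B, and C are device pointers, i.e. the caller has already moved the data onto the GPU. As a rough sketch (not part of the template; the host array names and the use of main are illustrative assumptions), that setup could look like this:

// Illustrative host-side sketch, not part of the exercise template
#include <cuda_runtime.h>
#include <cstdio>

extern "C" void solve(const float* A, const float* B, float* C, int N);

int main() {
    const int N = 4;
    float h_A[N] = {1.0f, 2.0f, 3.0f, 4.0f};   // host copies of the inputs
    float h_B[N] = {5.0f, 6.0f, 7.0f, 8.0f};
    float h_C[N];                               // host buffer for the result

    // Allocate device memory and copy the inputs onto the GPU
    float *d_A, *d_B, *d_C;
    cudaMalloc((void**)&d_A, N * sizeof(float));
    cudaMalloc((void**)&d_B, N * sizeof(float));
    cudaMalloc((void**)&d_C, N * sizeof(float));
    cudaMemcpy(d_A, h_A, N * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, N * sizeof(float), cudaMemcpyHostToDevice);

    // solve() launches vector_add on these device pointers
    solve(d_A, d_B, d_C, N);

    // Copy the result back to the host and print it
    cudaMemcpy(h_C, d_C, N * sizeof(float), cudaMemcpyDeviceToHost);
    for (int i = 0; i < N; i++) printf("%.1f ", h_C[i]);

    cudaFree(d_A);
    cudaFree(d_B);
    cudaFree(d_C);
    return 0;
}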
We need to write the vector_add function:
__global__ void vector_add(const float* A, const float* B, float* C, int N) {
    // 1. Calculate the global index (idx)
    int idx = blockIdx.x * blockDim.x + threadIdx.x;

    // 2. Perform the bounds check
    if (idx >= N) {
        return;
    }

    // 3. Execute the addition
    C[idx] = A[idx] + B[idx];
}
Spinning up NVIDIA TESLA T4...
Running...
A = [1.0, 2.0, 3.0, 4.0]
B = [5.0, 6.0, 7.0, 8.0]
N = 4
✓ Test passed!