Parallel Reduction with CUDA
Finding the minimum of very large arrays using GPUs
Note: This post is intended for readers with a considerable background in GPU hardware and CUDA programming.
Parallel reduction refers to algorithms that combine the elements of an array into a single result. Problems suited to this approach are those involving operators that are associative and commutative. Some examples include:
- Sum of an array
- Minimum/Maximum of an array
Let’s take the problem of finding the minimum of an array of 10⁸ floating-point elements. On a CPU this usually takes around 15–17 s for an array of this size, and the computation time only grows as the array gets larger.
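For reference, here is a minimal sketch of the kind of sequential baseline being timed; the function name cpuMin and the std::vector storage are illustrative assumptions, not code from the original benchmark:

```cuda
#include <algorithm>
#include <vector>

// Sequential baseline: a single pass over all elements.
// Assumes a is non-empty.
float cpuMin(const std::vector<float>& a) {
    float m = a[0];
    for (float x : a)
        m = std::min(m, x);
    return m;
}
```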
To reduce the computation time on a CPU we could parallelize the work across multiple threads on multiple cores, but the parallelism achievable on a CPU is fairly limited. The compute power of GPUs, on the other hand, keeps growing. Given how ubiquitous GPUs have become, and their ability to execute thousands of threads in parallel, it pays to exploit that compute power wherever possible.
What follows is an attempt to implement some of the techniques from Mark Harris’s deep dive into optimizing CUDA reduction kernels.
The idea is to have each of the many thread blocks on a GPU reduce a small portion of the array. A tree-based reduction is used inside each thread block, with shared memory providing both communication and synchronization among the threads of that block.
- Each thread block reads a small chunk Ai of the array from global memory into shared memory, one element per thread.
- The chunk is then reduced in shared memory, i.e., in our case, the minimum of Ai is computed within the thread block.
- Each block's minimum is then written back to global memory.
- The final global minimum is then extracted from these per-block results.
Although the algorithm looks simple, the difficulty lies in implementing it the right way. Three problems in particular can degrade performance and create bottlenecks if not handled properly:
- Branch divergence
- Memory bank conflicts
- The number of threads inactive after each reduction step
These can be overcome with an implementation that uses sequential addressing (a reversed loop with thread-ID-based indexing), followed by unrolling the last warp once 32 or fewer threads remain active. A high-level explanation of these concepts is provided below.
Step 0: Create a 2D sample matrix of size N×N with floating-point elements and fill it with random values.
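A minimal host-side sketch, assuming N = 10,000 so that N×N = 10⁸ elements, stored as a flat row-major vector (names are illustrative):

```cuda
#include <cstdlib>
#include <vector>

constexpr size_t N = 10000;  // N*N = 1e8 elements

// Flat row-major storage for the N x N matrix, filled with randoms in [0, 1].
std::vector<float> makeMatrix() {
    std::vector<float> h_mat(N * N);
    for (float& x : h_mat)
        x = static_cast<float>(rand()) / RAND_MAX;
    return h_mat;
}
```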
Step 1: Allocate memory for the matrix on the device (GPU) and copy the matrix from the host to the device.
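Continuing the sketch inside main() (error checking omitted for brevity):

```cuda
#include <cuda_runtime.h>

std::vector<float> h_mat = makeMatrix();
const size_t numElems = N * N;
const size_t bytes = numElems * sizeof(float);

// Allocate device memory and copy the matrix host -> device.
float* d_in = nullptr;
cudaMalloc(&d_in, bytes);
cudaMemcpy(d_in, h_mat.data(), bytes, cudaMemcpyHostToDevice);
```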
Step 2: Define the parallel reduction kernel
- Before reducing the array in each block, we first copy a small chunk of the original array into that block's shared memory.
- This is where the actual reduction takes place. Here we use sequential addressing, i.e., strided indexing with a reversed loop, to keep branch divergence and memory bank conflicts to a minimum (a full kernel sketch follows the walkthrough below).
What is strided indexing with a reversed loop?
Let’s assume we have an array of 16 elements.
- Iteration 1: Stride = 8. We compare indices 0 and 8 and store the minimum in index 0; likewise, indices 1–7 are compared pairwise with indices 9–15. The results end up in indices 0–7.
- Iteration 2: Stride = 4. We halve the stride and compare indices 0 and 4, storing the minimum in index 0, and likewise for indices 1–3 with 5–7.
…
- Iteration 4: Stride = 1. We compare indices 0 and 1 and store the result in index 0, which now holds the minimum of all 16 elements.
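Putting Step 2 together, here is a minimal sketch of such a kernel. The name minReduce, the FLT_MAX padding for out-of-range threads, and the assumption that the block size is a power of two of at least 64 (we use 256 below) are ours; the last-warp unrolling uses the classic volatile-pointer idiom from Harris's slides, which on recent architectures is typically replaced with __syncwarp() or warp shuffle intrinsics.

```cuda
#include <cfloat>

// Tree reduction with sequential addressing: the stride starts at half the
// block size and halves each iteration (reversed loop), so the active
// threads stay contiguous (less branch divergence) and shared-memory
// accesses avoid bank conflicts.
__global__ void minReduce(const float* g_in, float* g_out, size_t n) {
    extern __shared__ float sdata[];

    unsigned int tid = threadIdx.x;
    size_t i = (size_t)blockIdx.x * blockDim.x + tid;

    // Each thread loads one element; out-of-range threads load +inf
    // so they never win a min comparison.
    sdata[tid] = (i < n) ? g_in[i] : FLT_MAX;
    __syncthreads();

    // Reversed loop over strides; stops once a single warp remains.
    for (unsigned int s = blockDim.x / 2; s > 32; s >>= 1) {
        if (tid < s)
            sdata[tid] = fminf(sdata[tid], sdata[tid + s]);
        __syncthreads();
    }

    // Unroll the last warp (32 or fewer active threads). volatile stops
    // the compiler from caching shared-memory reads in registers.
    if (tid < 32) {
        volatile float* v = sdata;
        v[tid] = fminf(v[tid], v[tid + 32]);
        v[tid] = fminf(v[tid], v[tid + 16]);
        v[tid] = fminf(v[tid], v[tid + 8]);
        v[tid] = fminf(v[tid], v[tid + 4]);
        v[tid] = fminf(v[tid], v[tid + 2]);
        v[tid] = fminf(v[tid], v[tid + 1]);
    }

    // Thread 0 writes this block's minimum to global memory.
    if (tid == 0) g_out[blockIdx.x] = sdata[0];
}
```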
Step 3: Define the block size and grid size for the kernel launch. In our case we choose a block size of 256. Allocate an output array of size equal to the grid size to store each thread block's partial result.
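A sketch of the launch configuration, using the block size of 256 mentioned above:

```cuda
const unsigned int blockSize = 256;
// Round up so every element is covered; one partial result per block.
const unsigned int gridSize =
    static_cast<unsigned int>((numElems + blockSize - 1) / blockSize);

float* d_out = nullptr;
cudaMalloc(&d_out, gridSize * sizeof(float));
```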
Step 4: Invoke the kernel.
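The launch itself; the third launch parameter is the dynamic shared-memory size per block, which the kernel's extern __shared__ array relies on:

```cuda
minReduce<<<gridSize, blockSize, blockSize * sizeof(float)>>>(d_in, d_out, numElems);
cudaDeviceSynchronize();  // wait for the kernel and surface any launch errors
```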
Finally, copy the output array back to the host and extract the final minimum value. (This output array could itself be reduced in parallel if its size warrants it.)
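A sketch of this last host-side step; with roughly 4×10⁵ per-block partial results, a plain std::min_element is cheap enough here:

```cuda
#include <algorithm>

std::vector<float> h_out(gridSize);
cudaMemcpy(h_out.data(), d_out, gridSize * sizeof(float), cudaMemcpyDeviceToHost);

// Reduce the per-block minima on the host.
float globalMin = *std::min_element(h_out.begin(), h_out.end());

cudaFree(d_in);
cudaFree(d_out);
```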
By employing these parallel reduction techniques, it is possible to achieve an 18x performance improvement on a GPU.