
Optimizing CUDA: Warps, Threads and Blocks.

We are going to explain how to maximize the use of the processors in the GPU. In order to do so, we need to understand the architecture of NVIDIA CUDA GPUs. So, let's begin:

The main advantage of GPUs over CPUs is that a GPU has many more processor units than a CPU, which allows us to execute a far greater number of threads in parallel. Every time we create threads in CUDA, they are assigned to one of the execution resources that handle them. These resources are called Streaming Multiprocessors (SMs, following the naming convention in Programming Massively Parallel Processors by David B. Kirk and Wen-mei W. Hwu), and depending on the NVIDIA GPU, it will have a different number of SMs. Each SM can handle up to 8 blocks; this is a restriction imposed by the architecture. For the same reason, there is also a limit on the maximum number of threads that can run on each SM; as an example, let's say that this number is 1024 threads per SM (GT200). Since the GT200 has 30 SMs (240 cores according to the specs), basic maths tells us that the maximum number of threads that can reside in the GPU at once is 30 SMs x 1024 threads/SM = 30720 threads.
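These limits do not have to be memorized; the CUDA runtime can report them per device. The following is a minimal sketch (my own addition, not part of the original example) that queries the standard cudaDeviceProp fields used in the arithmetic above:

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);            // device 0

    printf("SMs (multiprocessors): %d\n", prop.multiProcessorCount);
    printf("Max threads per SM:    %d\n", prop.maxThreadsPerMultiProcessor);
    printf("Max threads per block: %d\n", prop.maxThreadsPerBlock);
    printf("Warp size:             %d\n", prop.warpSize);

    // Upper bound on resident threads for the whole GPU,
    // e.g. 30 SM x 1024 threads/SM = 30720 for the GT200 example.
    printf("Max resident threads:  %d\n",
           prop.multiProcessorCount * prop.maxThreadsPerMultiProcessor);
    return 0;
}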

(Figure: Automatic Scalability)

Another way of doing this maths, and getting a better understanding of the CUDA architecture, is to look at the number of warps. Each SM schedules its threads in warps, which are groups of threads belonging to the same block. The number of threads that belong to a warp depends on the Compute Capability version, but in our example it is 32 threads per warp. Thus we have (1024 threads/SM) / (32 threads/warp) = 32 warps per SM. According to the specs, the GT200 is limited to a maximum of 32 warps/SM and 8 blocks/SM. Thereby, it can handle a maximum of 30 SMs x 32 warps/SM x 32 threads/warp = 30720 threads.
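To make this bookkeeping explicit, here is a small host-side sketch (my own addition, with the GT200-style limits from the text hard-coded as assumptions) that works out how many blocks, warps and threads of a given block size can actually be resident on one SM:

#include <cstdio>
#include <algorithm>

// Limits taken from the GT200 example in the text (assumed, not queried).
const int kMaxThreadsPerSM = 1024;
const int kMaxBlocksPerSM  = 8;
const int kWarpSize        = 32;
const int kMaxWarpsPerSM   = kMaxThreadsPerSM / kWarpSize;   // 32 warps/SM

void reportOccupancy(int threadsPerBlock) {
    int warpsPerBlock   = (threadsPerBlock + kWarpSize - 1) / kWarpSize;
    // Resident blocks are bounded by the block, thread and warp limits of the SM.
    int residentBlocks  = std::min(kMaxBlocksPerSM,
                          std::min(kMaxThreadsPerSM / threadsPerBlock,
                                   kMaxWarpsPerSM / warpsPerBlock));
    int residentThreads = residentBlocks * threadsPerBlock;
    printf("%4d threads/block -> %d blocks/SM, %d threads/SM\n",
           threadsPerBlock, residentBlocks, residentThreads);
}

int main() {
    reportOccupancy(64);    // block limit bites first: 8 x 64  = 512 threads/SM
    reportOccupancy(128);   // full occupancy:          8 x 128 = 1024 threads/SM
    return 0;
}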

If we recall the example I gave in the first post of this series, the details were:

add<<<64,64>>>(a_d, b_d, c_d); // 64 blocks, 64 threads per block

Thus, if we wanted to use 64 threads per block, we would need (1024 threads/SM) / (64 threads/block) = 16 blocks to fully occupy an SM. But, as we said, we are limited to 8 blocks per SM, so we would end up using only 64 threads/block * 8 blocks/SM = 512 threads/SM, when the maximum for this GPU is 1024 threads/SM. If instead we use 128 threads per block, we need (1024 threads/SM) / (128 threads/block) = 8 blocks, which is exactly the maximum number of blocks per SM. So we end up using 128 threads/block * 8 blocks/SM = 1024 threads/SM, meaning that each SM runs at its maximum thread capacity and its maximum number of warps. As a result of this:

// before: 64 blocks, 64 threads per block
// add<<<64,64>>>(a_d, b_d, c_d);
// after: 8 blocks, 128 threads per block
add<<<8,128>>>(a_d, b_d, c_d);
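One caveat with this change: <<<64,64>>> launches 4096 threads in total, while <<<8,128>>> launches only 1024, so the kernel must not assume one thread per element. The sketch below is an assumed version of the add kernel (the original from the first post is not reproduced here, and the explicit element count n is my addition) using a grid-stride loop, so that any launch configuration covers the whole array:

// Assumed kernel sketch, operating on int arrays.
__global__ void add(const int *a, const int *b, int *c, int n) {
    // Grid-stride loop: each thread handles every (gridDim.x * blockDim.x)-th element.
    for (int i = blockIdx.x * blockDim.x + threadIdx.x;
         i < n;
         i += gridDim.x * blockDim.x) {
        c[i] = a[i] + b[i];
    }
}

It would then be launched as add<<<8,128>>>(a_d, b_d, c_d, n); with this pattern, the launch configuration can be chosen purely for occupancy (for example, 8 blocks/SM x 30 SMs = 240 blocks of 128 threads to keep every SM busy) without affecting correctness.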

For further information, I recommend reading the NVIDIA docs and the book Programming Massively Parallel Processors by David B. Kirk and Wen-mei W. Hwu.

Thanks for reading.


One thought on "Optimizing CUDA: Warps, Threads and Blocks."

  1. Yes, but there are 30 SMs. So wouldn't it be nice if we wrote add<<< >>>(a_d, b_d, c_d);
