Vector addition using 4096 threads

We are going to explain how to implement the simple vector addition operation using 4096 threads. In order to do this, we are going to use a new programming model created by NVIDIA that will run in the GPU. As we can see on the image, there are 4 steps involved in this process:

  1. Copying the data from the host to the device.
  2. Giving the instruction to run the program in the GPU.
  3. Executing the kernel function that works in parallel.
  4. Copying back the result from the device to host.

CUDA flow

In this case, let’s assume that we have three vectors of N size (16384), which are contained in variables:

 int a[N], b[N], c[N];

Let’s assume that a and b have values inside (otherwise we can just put some values), and we want to compute the sum of the vector in c (element-wise operation of course). Let’s create our device variables, the ones that get a copy of the host array, which are pointers to the device memory:

 int *a_d, *b_d, *c_d;

These pointers are the ones that access the device memory and do the necessary operations and, only after finishing to work on the device, we transfer the data from the device to host memory.

In order to reserve the necessary amount of memory in the device, we have to call an operation similar to malloc, but that reserves memory in the device. This function is called cudaMalloc, and receives as first parameter a pointer to the pointer you want to hold the address, and the size of the memory you want to allocate. Therefore, we use this method as follows:

cudaMalloc((void**) &a_d, N*sizeof(int));
cudaMalloc((void**) &b_d, N*sizeof(int));
cudaMalloc((void**) &c_d, N*sizeof(int));

where N is the total number of elements in the vector. Afterwards, we copy the data from the host memory to device memory using the following function:

cudaMemcpy(a_d, a, N * sizeof(int), cudaMemcpyHostToDevice);
cudaMemcpy(b_d, b, N * sizeof(int), cudaMemcpyHostToDevice);

This function is quite similar to memcpy, but copy the data from the host to the device (cudaMemcpyHostToDevice). We have already finished with the first step, easy-peasy.

The second step is calling the method that runs in the GPU, using again the CUDA API. We have the power to specify the number of blocks and threads that we want to use. In our case, we are going to use 64 blocks where each block is going to launch 64 threads, meaning a total number of 64*64 threads (4096). We can easily change these values and launch more threads or blocks, but this should be a quite simple configuration that most of the NVIDIA GPUs are able to handle.

add<<<64,64>>>(a_d, b_d, c_d);

The function add is going to take care of adding the elements. Since the function is running on the device,  the implementation must have the prefix __global__:

__global__ void add(int *a, int *b, int *c){
  int index = threadIdx.x + blockIdx.x * blockDim.x;
  while(index &lt; N){
    c[index] = a[index] + b[index];
    index += blockDim.x * gridDim.x;

There are many new things inside this function:

  • threadIdx.x holds the thread number in dimension x that is running.
  • blockIdx.x holds the block number in dimension x that is running.
  • blockDim.x holds the total number of threads per block in dimension x (max 3D).
  • gridDim.x holds the total number of blocks in dimension x (max 2D).

As you might have guessed, we can run two dimensional blocks with a three dimensional space per thread. In total we could use 5D, although that is not the normal case.

This function basically iterates through the elements in the array and does the sum of a[x] + b[x]. Nonetheless, we took care of incrementing the index in a shrewd way, making each thread to compute not only one value, but the values that are within their domains. This means that having:

  • threadIdx.x = 0
  • blockIdx.x = 0
  • gridDim.x = 64 (number of total blocks)
  • blockDim.x = 64 (number of threads per block)

the thread 0 within the block 0 will compute the index value 0, 4096, 8192, etc. This is a clever way of computing the sum of arbitrarily long vectors, having the amount of device memory as the only limitation.

Afterwards, we pass the data from the device to host using again cudaMemcpy:

cudaMemcpy(c, c_d, N * sizeof(int), cudaMemcpyDeviceToHost);

Finally, we free memory from the device memory:


This code can be found in my git CUDA playground where I am learning by doing. Thanks for reading.

This entry was published on January 3, 2013 at 5:06 pm. It’s filed under Uncategorized and tagged , , , . Bookmark the permalink. Follow any comments here with the RSS feed for this post.

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s