CUDA in 5 minutes!!

Introduction to CUDA

Pratiksha Sancheti
6 min read · May 2, 2021

CUDA (Compute Unified Device Architecture) is a parallel computing platform and application programming interface (API) model by NVIDIA. It provides straightforward APIs to manage devices and memory, and computations are performed in parallel on the GPU using a grid of blocks and threads. CUDA provides shared memory and synchronization among threads, and it follows a heterogeneous model, i.e., it uses both the CPU (host) and the GPU (device). CUDA supports the C/C++ languages, and programs are compiled using the nvcc compiler provided by NVIDIA.
Before going into CUDA programming, let’s talk about the CUDA logical hierarchy.

CUDA Logical Hierarchy

Thread:
A thread is a single instance of execution and the fundamental building block of a parallel program.

Thread block:
A group of threads that execute on the same multiprocessor is known as a thread block.
Threads within a block can share data through shared memory and synchronize their execution.

Grid:
A group of thread blocks is called a CUDA grid, where each block may execute on a different multiprocessor.
A grid is executed on the GPU, which is composed of many multiprocessors.
Each multiprocessor consists of many streaming processors, which are responsible for running one or more of the threads in a block.

Warp:
A warp is a group of threads within a block (32 on NVIDIA GPUs) that execute simultaneously on a multiprocessor.

Let’s step into CUDA programming now, but before that, let us look at the structure of a CUDA program that uses the GPU for computation:

Flow of CUDA programming

The flowchart above should help you understand the structure of CUDA programming.

The CUDA programming model is a combination of serial and parallel execution.

1. A CUDA program is divided into host code and device code.

2. The APIs provided by CUDA for handling device memory and launching kernels are:

  • To allocate memory on the GPU:
cudaMalloc(void** devPtr, size_t size);
  • To copy data from host to device or from device to host:
cudaMemcpy(void* dst, const void* src, size_t count, cudaMemcpyKind kind);
  • To free memory on the GPU:
cudaFree(void* devPtr);
  • To launch a kernel:
kernel_function<<<number_of_blocks, number_of_threads>>>();

3. The <<< >>> brackets mark a call from host code to device code; this is called a “kernel launch”.

By adding the __global__ qualifier to the function and calling it using this special angle-bracket syntax, we execute the function on our GPU.

4. A device function (kernel) is declared as follows:

__global__ void mykernel(argument_list)
{
}
  • __global__ indicates a function that runs on the device and is called from host code.
  • CUDA source code goes in .cu files.
  • The NVIDIA compiler (nvcc) is used to compile programs containing device code. nvcc separates the source code into host and device components:
  • Device functions (e.g. mykernel()) are processed by the NVIDIA compiler.
  • Host functions (e.g. main()) are processed by the standard host compiler (gcc, cl.exe).
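
To tie these pieces together, here is a minimal, self-contained sketch (the kernel name whereAmI is my own illustrative choice, not part of this article’s example) in which each thread simply prints which block and thread it is:

#include <stdio.h>

// Runs on the GPU: each thread prints its block index and its thread index.
__global__ void whereAmI()
{
    printf("block %d, thread %d\n", blockIdx.x, threadIdx.x);
}

int main(void)
{
    // Launch a grid of 2 blocks with 4 threads per block (8 threads in total).
    whereAmI<<<2, 4>>>();
    cudaDeviceSynchronize(); // wait for the kernel so its output is flushed
    return 0;
}

Compiled with nvcc, this should print eight lines, one per thread, in no guaranteed order.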

Enough theory for now; let’s understand CUDA by taking the example of adding two large vectors.
Imagine having two lists of numbers where we want to sum the corresponding elements of each list and store the result in a third list.

Addition of 2 vectors

First, let’s look at how this addition is traditionally performed on the CPU in the C language.

void add(int *a, int *b, int *c, int N)
{
    for (int i = 0; i < N; i++)
    {
        c[i] = a[i] + b[i];
    }
}

The time required to perform this addition is O(N).
Let’s see how we can perform the same operation in parallel!

Let us understand this by taking an example.
Consider vectors of size 8.

In the figure above, you can see the vector divided into 4 different colors, which means our vector is divided into 4 blocks and each block is divided into 2, i.e., 2 threads per block.
The 4 blocks are executed in parallel, and the results are stored in vector C.

The actual computation is being done by individual threads in each of the blocks. Here, we use 4 blocks and 2 threads per block (8 threads will run in parallel) and our total array size N is 8.
Now, let’s get our hands dirty and look at the code.

__global__ void add(int *a, int *b, int *c, int N)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < N)
        c[tid] = a[tid] + b[tid];
}

You might be confused by blockIdx and threadIdx, so let me clear that up.
The tid variable gives the thread’s global ID and is calculated from blockIdx and threadIdx. blockIdx contains the block’s position in the grid, ranging from 0 to gridDim.x - 1, and threadIdx is the thread’s index inside its block, ranging from 0 to blockDim.x - 1.

c[tid] = a[tid] + b[tid];

The thread ID is used to index the arrays that reside in global device memory. Each thread will load a value from a and b and write the sum to c.

The thread whose threadIdx.x is 0 within block 0 will compute c[0], because tid = (2 * 0) + 0 = 0.
The thread whose threadIdx.x is 0 within block 1 will compute c[2], because tid = (2 * 1) + 0 = 2.
The thread whose threadIdx.x is 1 within block 1 will compute c[3], because tid = (2 * 1) + 1 = 3.
c[0] through c[7] will be computed concurrently.
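
Here, 4 blocks × 2 threads per block exactly covers N = 8. For a general N that is not a multiple of the block size, a common pattern (shown below only as an illustrative sketch, not part of this article’s example) is to round the number of blocks up and let the if (tid < N) guard in the kernel skip the extra threads:

int threadsPerBlock = 2;
// Round up so that numBlocks * threadsPerBlock >= N even when N is not a multiple of 2.
int numBlocks = (N + threadsPerBlock - 1) / threadsPerBlock;
add<<<numBlocks, threadsPerBlock>>>(dev_a, dev_b, dev_c, N);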

Now that we are done with the explanation of the 8-element vector addition, here is the full CUDA code in C.

#include <stdio.h>
#define N 8
#define numThread 2 // 2 threads in a block
#define numBlock 4 // 4 blocks

__global__ void add( int *a, int *b, int *c )
{
    int tid = blockDim.x * blockIdx.x + threadIdx.x;
    if (tid < N)
    {
        c[tid] = a[tid] + b[tid];
    }
}

int main( void ) {
    int *a, *b, *c;             // The arrays on the host CPU machine
    int *dev_a, *dev_b, *dev_c; // The arrays for the GPU device

    // allocate the memory on the CPU
    a = (int*)malloc( N * sizeof(int) );
    b = (int*)malloc( N * sizeof(int) );
    c = (int*)malloc( N * sizeof(int) );

    // fill the arrays 'a' and 'b' on the CPU with dummy values
    for (int i = 0; i < N; i++) {
        a[i] = i;
        b[i] = i;
    }

    // allocate the memory on the GPU
    cudaMalloc( (void**)&dev_a, N * sizeof(int) );
    cudaMalloc( (void**)&dev_b, N * sizeof(int) );
    cudaMalloc( (void**)&dev_c, N * sizeof(int) );

    // copy the arrays 'a' and 'b' to the GPU
    cudaMemcpy( dev_a, a, N * sizeof(int), cudaMemcpyHostToDevice );
    cudaMemcpy( dev_b, b, N * sizeof(int), cudaMemcpyHostToDevice );

    // Execute the vector addition 'kernel function' on the GPU device,
    // declaring how many blocks and how many threads per block to use.
    add<<<numBlock, numThread>>>( dev_a, dev_b, dev_c );

    // copy the array 'c' back from the GPU to the CPU
    cudaMemcpy( c, dev_c, N * sizeof(int), cudaMemcpyDeviceToHost );

    // verify the GPU output
    bool success = true;
    for (int i = 0; i < N; i++) {
        if ((a[i] + b[i]) != c[i]) {
            printf( "Error: %d + %d != %d\n", a[i], b[i], c[i] );
            success = false;
        }
    }
    if (success) printf( "Success!!\n" );

    // free the memory we allocated on the CPU
    free( a );
    free( b );
    free( c );

    // free the memory we allocated on the GPU
    cudaFree( dev_a );
    cudaFree( dev_b );
    cudaFree( dev_c );

    return 0;
}
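
One note on the example: for brevity it does not check the return values of the CUDA API calls. Calls such as cudaMalloc and cudaMemcpy return a cudaError_t, and a minimal (purely illustrative) way to check it looks like this:

cudaError_t err = cudaMalloc( (void**)&dev_a, N * sizeof(int) );
if (err != cudaSuccess)
{
    printf( "cudaMalloc failed: %s\n", cudaGetErrorString( err ) );
    return 1;
}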

We are done with the coding now; let’s compile our code using the nvcc compiler.
Save the program with a .cu extension and then type the commands below in a terminal.

nvcc add_two_vectors.cu
./a.out
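
By default nvcc produces an executable called a.out. If you prefer a named binary, you can also pass the -o flag:

nvcc -o add_two_vectors add_two_vectors.cu
./add_two_vectors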

And the code compiles and runs successfully… we did it!
We have successfully performed the addition of two vectors in parallel with CUDA.

For more such CUDA programs, have a look at my GitHub repo:

https://github.com/pratiksha-sancheti/CUDA_programms

This is my first attempt at writing an article on Medium. I hope that you have gained some knowledge while reading it and that it has helped you in some way!

Would love to connect with my readers on LinkedIn.
Keep contributing to the community. Happy learning :)


Pratiksha Sancheti

I’m an undergrad currently pursuing my bachelor’s degree in Computer Engineering at Pune Institute of Computer Technology.