First steps on CUDA

In this post I’m going to use C for our first steps in CUDA. Remember that C++ works as well.

To start, we have to know the basic structure of any CUDA program. It is like a regular program, except that we also have to communicate with our NVIDIA GPU: allocating memory on it, transferring data to and from it, and calling the kernel itself.

For example, say we want to create an array of 10 float elements in C. It would be something like this:

array of 10 float elements in C
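As a minimal sketch, such a host array could be allocated and filled as follows. The name h_array matches the one used later in this post; the helper make_host_array is hypothetical and only there to keep the example self-contained.

```cuda
#include <stdio.h>
#include <stdlib.h>

// Hypothetical helper: allocates n floats in HOST (CPU) memory
// and fills them with the values 0, 1, ..., n-1.
float *make_host_array(int n) {
    float *h_array = (float *)malloc(sizeof(float) * n);
    if (h_array == NULL) return NULL;  // allocation failed
    for (int i = 0; i < n; i++) h_array[i] = (float)i;
    return h_array;
}

int main(void) {
    float *h_array = make_host_array(10);
    if (h_array == NULL) return 1;
    printf("h_array[9] = %f\n", h_array[9]);
    free(h_array);  // host memory is released with free, not cudaFree
    return 0;
}
```

For a small fixed size, a plain declaration such as float h_array[10]; would work just as well. malloc is used here because it mirrors the cudaMalloc call that comes next.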

This memory is allocated in our HOST memory and is not visible to our GPU. Now let’s do the same on the GPU.


cudaMalloc lets us allocate memory in the GPU’s global memory. Unlike the host array above, this memory is only accessible from the GPU, so we cannot read or write it directly from our HOST program.

The NVIDIA documentation describes how cudaMalloc works.

cudaError_t cudaMalloc(void** devPtr, size_t size): devPtr receives the address of the allocated device memory and size is the number of bytes to allocate. With this call we can allocate an array of 10 float elements in our GPU memory.

10 float elements allocated in the GPU global memory

For error handling we can check the returned cudaError_t value, which is cudaSuccess when the call succeeds. The NVIDIA documentation has more information.
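A minimal sketch of the allocation together with the error check might look like this. The helper alloc_device_floats is hypothetical; cudaMalloc, cudaGetErrorString and cudaFree are the CUDA runtime calls.

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

// Hypothetical helper: allocates n floats in GPU global memory.
// Returns the cudaError_t from cudaMalloc; *d_ptr is valid only on cudaSuccess.
cudaError_t alloc_device_floats(float **d_ptr, int n) {
    return cudaMalloc((void **)d_ptr, sizeof(float) * n);
}

int main(void) {
    float *d_array = NULL;
    cudaError_t err = alloc_device_floats(&d_array, 10);
    if (err != cudaSuccess) {
        // cudaGetErrorString turns the error code into a readable message
        fprintf(stderr, "cudaMalloc failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    printf("allocated 10 floats on the device\n");
    cudaFree(d_array);  // device memory is released with cudaFree
    return 0;
}
```

Note that d_array holds a device address: the host can pass it to CUDA calls and kernels, but must not dereference it.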

Now, let’s say we populate our array on the HOST and want to move it to the GPU for massively parallel computations. How can we do that?


cudaMemcpy lets us transfer memory from our host to our GPU and vice versa.

The NVIDIA documentation describes how cudaMemcpy works.

cudaError_t cudaMemcpy (void* dst, const void* src, size_t count, cudaMemcpyKind kind), where:

void* dst is our destination.

const void* src is our source.

size_t count is the number of bytes to transfer.

cudaMemcpyKind kind is the direction of the copy, for example cudaMemcpyHostToDevice (host to device) or cudaMemcpyDeviceToHost (device to host). The NVIDIA documentation lists all the possible values.

For our example it will go like this:

Allocate host memory and populate array h_array
Allocate device memory d_array and then use cudaMemcpy to copy the memory from h_array (HOST) to d_array (DEVICE) using cudaMemcpyHostToDevice
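The two steps above can be sketched as follows. To make the result observable, this sketch also copies the data back to the host and compares it; the helper round_trip and the buffer h_back are hypothetical names.

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

// Hypothetical helper: copies n floats host -> device -> host.
// Returns 1 if the data that came back matches what was sent, 0 otherwise.
int round_trip(const float *h_src, float *h_dst, int n) {
    float *d_array = NULL;
    size_t bytes = sizeof(float) * n;
    if (cudaMalloc((void **)&d_array, bytes) != cudaSuccess) return 0;

    // HOST -> DEVICE: dst is the device pointer, src is the host pointer
    cudaError_t err = cudaMemcpy(d_array, h_src, bytes, cudaMemcpyHostToDevice);

    // DEVICE -> HOST: the roles of dst and src are swapped
    if (err == cudaSuccess)
        err = cudaMemcpy(h_dst, d_array, bytes, cudaMemcpyDeviceToHost);

    cudaFree(d_array);
    if (err != cudaSuccess) return 0;

    for (int i = 0; i < n; i++)
        if (h_src[i] != h_dst[i]) return 0;
    return 1;
}

int main(void) {
    float h_array[10], h_back[10];
    for (int i = 0; i < 10; i++) h_array[i] = (float)i;
    printf("round trip %s\n", round_trip(h_array, h_back, 10) ? "ok" : "failed");
    return 0;
}
```

The count argument is always in bytes, which is why the sketch multiplies the number of elements by sizeof(float).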

Now our data is on the device, ready to be used.

So let’s recap a little. We know how to allocate and transfer memory, so now it’s time for the fun part, the kernel.

Basically the kernel is the procedure that is going to be executed in our GPU.

Kernel, grids, blocks and threads

So let’s say we want our device to add a number ‘p’ to each element in the array.

We add “p” to each element of the array

This seems like a problem that our GPU can easily take care of. For example, with “N” threads, each thread performs exactly 1 addition, so instead of “N” iterations in a loop, every element is handled by its own thread and the additions run in parallel.
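For comparison, the sequential version that the paragraph above alludes to might look like this on the CPU. The function name add_p_host is hypothetical; it contains host code only.

```cuda
#include <stdio.h>

// Hypothetical CPU baseline: one loop iteration per element.
// On the GPU, the body of this loop becomes the work of a single thread.
void add_p_host(float *array, int n, int p) {
    for (int i = 0; i < n; i++) array[i] += p;
}

int main(void) {
    float array[5] = {0, 1, 2, 3, 4};
    add_p_host(array, 5, 20);
    for (int i = 0; i < 5; i++) printf("array[%d] = %f\n", i, array[i]);
    return 0;
}
```

The loop counter i is what the GPU version replaces: each thread derives its own i from its position in the grid instead of iterating.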

How to create a kernel procedure

The kernel must be declared with the __global__ qualifier. This indicates that it is a kernel: it is called from the HOST and executed on the GPU.

Our kernel will look something like this:

__global__ void addPtoArray(float* d_array, int length, int p)


Implementation of the kernel here:


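To invoke our kernel we write its name followed by the triple chevron syntax <<<blocks, threads>>>, which defines how many blocks the grid has and how many threads each block has, and finally the kernel arguments in parentheses.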
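As a small illustration of the launch syntax, the sketch below starts 2 blocks of 4 threads and lets every thread record its own index. The kernel write_index and the helper launch_write_index are hypothetical names, not part of the example in this post.

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

// Hypothetical kernel: every thread writes its own global index into out.
__global__ void write_index(int *out) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = i;
}

// Launches write_index with `blocks` blocks of `threads` threads each and
// copies the blocks*threads results into h_out. Returns 1 on success.
int launch_write_index(int *h_out, int blocks, int threads) {
    int n = blocks * threads;
    int *d_out = NULL;
    if (cudaMalloc((void **)&d_out, sizeof(int) * n) != cudaSuccess) return 0;

    // kernel_name<<<number of blocks, threads per block>>>(arguments)
    write_index<<<blocks, threads>>>(d_out);

    cudaError_t err = cudaGetLastError();  // reports launch configuration errors
    if (err == cudaSuccess)                // the copy waits for the kernel to finish
        err = cudaMemcpy(h_out, d_out, sizeof(int) * n, cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    return err == cudaSuccess;
}

int main(void) {
    int h_out[8];
    if (!launch_write_index(h_out, 2, 4)) return 1;
    for (int i = 0; i < 8; i++) printf("%d ", h_out[i]);
    printf("\n");
    return 0;
}
```

Plain integers are accepted inside the chevrons for a 1d launch. Kernel launches do not return an error code themselves, which is why the sketch calls cudaGetLastError right after the launch.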

Grids, blocks and threads

It’s better explained with an image:

2d grid with 16 blocks and each block is 2d and has 4 threads

In order to call a kernel, we have to specify how many threads are going to be executed. Threads are organized into blocks, and blocks are organized into a grid, like in the previous image.

The grid and the blocks are specified using a structure called dim3. Each dim3 structure has 3 values (x, y and z), so we can create a 1d, 2d or 3d grid, and 1d, 2d or 3d blocks.

Inside the kernel we have information about the block and the thread that is being executed.

blockIdx is the index of the block within the grid. In the image example, the block highlighted in green is blockIdx.x=2 and blockIdx.y=1.

blockDim is the dimension of each block, that is, the number of threads per block along each axis. In the image example, blockDim.x=2 and blockDim.y=2. The dimension of the grid, measured in blocks, is available as gridDim; in the image example, gridDim.x=4 and gridDim.y=4.

threadIdx is the index of the thread within its block. In the image example, threadIdx.x=1 and threadIdx.y=1.

Some examples about grids, blocks and threads
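To make the built-in variables concrete, here is a sketch that launches a 2d grid of 2d blocks and lets every thread compute its global (x, y) position. It assumes the layout is a 4x4 grid of 2x2 blocks, which is one reading of the figure; the kernel write_block_id and the helper run_2d_example are hypothetical names.

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

#define GRID_SIDE 4   // blocks per grid axis (assumed: 4x4 = 16 blocks)
#define BLOCK_SIDE 2  // threads per block axis (2x2 = 4 threads per block)
#define WIDTH (GRID_SIDE * BLOCK_SIDE)  // threads per axis over the whole grid

// Hypothetical kernel: each thread finds its global (x, y) position and
// stores the linear id of the block it belongs to at that position.
__global__ void write_block_id(int *out, int width) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    out[y * width + x] = blockIdx.y * gridDim.x + blockIdx.x;
}

// Fills h_out (WIDTH * WIDTH ints) with the block id of every thread.
// Returns 1 on success, 0 on any CUDA error.
int run_2d_example(int *h_out) {
    int *d_out = NULL;
    size_t bytes = sizeof(int) * WIDTH * WIDTH;
    if (cudaMalloc((void **)&d_out, bytes) != cudaSuccess) return 0;

    dim3 grid(GRID_SIDE, GRID_SIDE);     // unspecified z defaults to 1
    dim3 block(BLOCK_SIDE, BLOCK_SIDE);
    write_block_id<<<grid, block>>>(d_out, WIDTH);

    cudaError_t err = cudaGetLastError();
    if (err == cudaSuccess)
        err = cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    return err == cudaSuccess;
}

int main(void) {
    int h_out[WIDTH * WIDTH];
    if (!run_2d_example(h_out)) return 1;
    for (int y = 0; y < WIDTH; y++) {
        for (int x = 0; x < WIDTH; x++) printf("%2d ", h_out[y * WIDTH + x]);
        printf("\n");
    }
    return 0;
}
```

Under these assumptions, the thread with blockIdx=(2,1) and threadIdx=(1,1) lands at x = 2*2+1 = 5 and y = 1*2+1 = 3, and its block has linear id 1*4+2 = 6.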


Going back to our example the kernel will look something like this:

Kernel implementation
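One detail of this kernel worth illustrating is the bounds check. The number of launched threads is blocks times threads per block, which is usually larger than the array, so threads past the end must do nothing. The sketch below launches 8 threads for 5 elements and checks that the 3 trailing floats of a larger buffer stay untouched; the helper add_p_guarded and its capacity parameter are hypothetical.

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

// Kernel from this post: each thread adds p to one element.
__global__ void addPtoArray(float *d_array, int length, int p) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < length)                                 // threads past the end do nothing
        d_array[i] += p;
}

// Hypothetical helper: h_data holds `capacity` floats, but only the first
// `length` of them are passed to the kernel as the array to modify.
// Returns 1 on success, 0 on any CUDA error.
int add_p_guarded(float *h_data, int capacity, int length, int p, int threads) {
    float *d_data = NULL;
    size_t bytes = sizeof(float) * capacity;
    if (cudaMalloc((void **)&d_data, bytes) != cudaSuccess) return 0;

    cudaError_t err = cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);
    if (err == cudaSuccess) {
        // Round up so that blocks * threads >= length
        int blocks = (length + threads - 1) / threads;
        addPtoArray<<<blocks, threads>>>(d_data, length, p);
        err = cudaGetLastError();
    }
    if (err == cudaSuccess)
        err = cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_data);
    return err == cudaSuccess;
}

int main(void) {
    // 5 real elements followed by 3 sentinel values that must not change
    float h_data[8] = {0, 1, 2, 3, 4, -1, -1, -1};
    if (!add_p_guarded(h_data, 8, 5, 20, 4)) return 1;
    for (int i = 0; i < 8; i++) printf("h_data[%d] = %f\n", i, h_data[i]);
    return 0;
}
```

With 4 threads per block, 5 elements need 2 blocks, so threads 5, 6 and 7 exist but fail the i < length test. Without that test they would write past the logical end of the array.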

Now we are set to work that GPU! 

Our code will look something like this:

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

// Kernel implementation: each thread adds p to one element of the array
__global__ void addPtoArray(float *d_array, int length, int p){
 int i;
 i = blockIdx.x * blockDim.x + threadIdx.x; // Global index of the current thread
 if (i < length) // Only threads that fall inside the array do any work
  d_array[i] += p;
}

int main(int argc, char *argv[]){
 float *h_array, *d_array, *h_hostResult;
 int p = 20;
 int length = 5;

 /* Mallocs CPU */
 h_array = (float *)malloc(sizeof(float)*length);
 for (int i=0; i<length; i++){ h_array[i] = i; }

 /* Mallocs GPU */
 cudaMalloc((void **) &d_array, sizeof(float)*length);

 /* CPU->GPU */
 cudaMemcpy(d_array, h_array, sizeof(float)*length, cudaMemcpyHostToDevice);

 /* Add Array GPU */
 dim3 threadsPerBlock(256);
 dim3 GridBlocks((length + threadsPerBlock.x - 1) / threadsPerBlock.x); // Enough blocks to cover the array
 addPtoArray<<<GridBlocks, threadsPerBlock>>>(d_array, length, p);

 /* GPU->CPU */
 h_hostResult = (float *)malloc(sizeof(float)*length);
 cudaMemcpy(h_hostResult, d_array, sizeof(float)*length, cudaMemcpyDeviceToHost);

 /* Results */
 for (int i=0; i<length; i++){
  printf("h_array[%i] = %f \n", i, h_array[i]);
  printf("h_hostResult[%i] = %f \n", i, h_hostResult[i]);
 }

 /* Free CPU */
 free(h_array);
 free(h_hostResult);

 /* Free GPU */
 cudaFree(d_array);

 return 0;
}


Save this code in a file with the .cu extension.

Now to compile we run nvcc -o addPtoArray followed by the name of that file.

This will generate an executable called addPtoArray

Now we just execute ./addPtoArray. It should look like this:

Compile and execute program
Result of the program. We add 20 to each element of the array
