# My First CUDA Program

In this post I want to show you the “Hello World” of CUDA. No, we are not going to `printf("Hello World")`; we are going to add vectors! Feeling the hype?

So let’s start.

First, if you are a complete newbie, I recommend reading my previous post, First steps on CUDA, to get a handle on how I work, and then coming back.

Since we are all good software developers, let’s define our problem and propose a simple solution in the programming language C.

# Our problem in C

We want to compute C = A + B, where A, B and C are vectors of the same length.

So:

Allocate memory for A, B and C (for simplicity, the length of the arrays is an input parameter, and the element type is float).

```
int length;
float *vA, *vB, *vC;

if (argc != 2) {
    printf("./exec n \n");
    exit(-1);
}
length = atoi(argv[1]);

// Allocating memory
vA = (float*)malloc(sizeof(float) * length);
vB = (float*)malloc(sizeof(float) * length);
vC = (float*)malloc(sizeof(float) * length);
```

Initialize A, B

```
/* Initialize A and B: A[i] holds its own index times 2, B[i] its own index times 3 */
for (int i = 0; i < length; i++) {
    vA[i] = i * 2;
    vB[i] = i * 3;
}
```

Make the computation C = A + B

```
for (int i = 0; i < length; i++)
    vC[i] = vA[i] + vB[i];
```

Show result on console

```
for (int i = 0; i < length; i++)
    printf("C[%i] = %f \n", i, vC[i]);
```

Free memory

```
free(vA);
free(vB);
free(vC);
```

Seems pretty straightforward, right? It’s easy to see that the computations of C[i] and C[i+1] are completely independent, which makes this a perfect job for our GPU.
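To make the independence concrete, here is a small sketch that fills C from the last element to the first. The function name `add_vectors_reversed` is made up for this illustration; the point is only that the visiting order does not change the result, because no C[i] reads any other element of C.

```c
#include <stddef.h>

/* Hypothetical helper: adds a and b into c, visiting the elements
   from last to first. The result matches a front-to-back loop
   because each c[i] depends only on a[i] and b[i]. */
void add_vectors_reversed(const float *a, const float *b, float *c, size_t n)
{
    for (size_t i = n; i > 0; i--) {
        c[i - 1] = a[i - 1] + b[i - 1];
    }
}
```

If any order works, then doing all the additions at the same time works too, and that is what the GPU will do for us.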

# Our problem in CUDA

Now the fun part. It is pretty similar to the solution in C, but now we have to add the GPU things (duh!).

To make the code easier to understand, all host variables will have the prefix h_ and all device variables will have the prefix d_.

We need to:

Allocate memory for h_A, h_B and h_C on our host. These variables are our vectors.

```
int length;
float *h_A, *h_B, *h_C, *d_A, *d_B, *d_C;

if (argc != 2) {
    printf("./exec n \n");
    exit(-1);
}
length = atoi(argv[1]);

// Allocating memory
h_A = (float*)malloc(sizeof(float) * length);
h_B = (float*)malloc(sizeof(float) * length);
h_C = (float*)malloc(sizeof(float) * length);
```

Allocate memory for d_A, d_B and d_C on our device. These variables are our vectors on the device.

```
cudaMalloc((void **)&d_A, sizeof(float) * length);
cudaMalloc((void **)&d_B, sizeof(float) * length);
cudaMalloc((void **)&d_C, sizeof(float) * length);
```

Initialize the host vectors h_A and h_B.

```
for (int i = 0; i < length; i++) {
    h_A[i] = i * 2;
    h_B[i] = i * 3;
}
```

Transfer the data from (h_A, h_B) to (d_A, d_B).

```
/* CPU to GPU */
cudaMemcpy(d_A, h_A, sizeof(float) * length, cudaMemcpyHostToDevice);
cudaMemcpy(d_B, h_B, sizeof(float) * length, cudaMemcpyHostToDevice);
```

Initialize the grid and the blocks. Since this is a 1D problem, we only work on the X axis, making Y = Z = 1.

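```
// Each block will have 512 threads, and we launch as many blocks as needed.
// Integer ceiling division: ceil(length/512) would truncate before ceil() runs.
dim3 dimGrid((length + 511) / 512, 1, 1); // enough blocks to cover length elements
dim3 dimBlock(512, 1, 1);                 // 512 threads for each block
```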
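The block count is a rounding-up division: we need enough blocks of 512 threads to cover every element, and not a whole extra block when the length is an exact multiple of 512. Here is a minimal sketch in plain C; `blocks_needed` is a hypothetical helper name, not part of CUDA.

```c
/* Hypothetical helper: number of blocks needed to cover `length`
   elements with `threads_per_block` threads in each block.
   Uses integer ceiling division, so no floating point is involved. */
int blocks_needed(int length, int threads_per_block)
{
    return (length + threads_per_block - 1) / threads_per_block;
}
```

With 512 threads per block, `blocks_needed(4096, 512)` gives 8 and `blocks_needed(4097, 512)` gives 9. The last block may have threads with nothing to do, which is why the kernel has to check its index against the length.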

Implement and call the kernel.

```
__global__ void addVector(float *d_A, float *d_B, float *d_C, int length)
{
    // Global index of this thread across all blocks
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    if (index < length) // we need to make sure our threads are within bounds
        d_C[index] = d_A[index] + d_B[index];
}

// Call to the kernel in the main
addVector<<<dimGrid, dimBlock>>>(d_A, d_B, d_C, length);
/* Wait here until all the threads finish. cudaDeviceSynchronize()
   replaces the deprecated cudaThreadSynchronize(). */
cudaDeviceSynchronize();
```
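If the index formula looks like magic, it can help to emulate it on the CPU. The sketch below is plain C and runs serially, so it is only an illustration of the mapping, not of how the GPU executes. The parameters `grid_dim` and `block_dim` and the loop variables `block` and `thread` are stand-ins I chose for CUDA's gridDim.x, blockDim.x, blockIdx.x and threadIdx.x.

```c
/* CPU emulation of the kernel's indexing (illustration only).
   Each (block, thread) pair plays the role of one GPU thread. */
void add_vector_emulated(const float *a, const float *b, float *c,
                         int length, int grid_dim, int block_dim)
{
    for (int block = 0; block < grid_dim; block++) {
        for (int thread = 0; thread < block_dim; thread++) {
            int index = block * block_dim + thread;
            if (index < length) { /* same bounds check as the kernel */
                c[index] = a[index] + b[index];
            }
        }
    }
}
```

With a length of 10, 3 blocks and 4 threads per block, the pairs produce indices 0 to 11; the last two fail the bounds check and write nothing.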

Transfer memory from d_C to h_C.

```
cudaMemcpy(h_C, d_C, sizeof(float) * length, cudaMemcpyDeviceToHost);
```

Show Result on Console.

```
for (int i = 0; i < length; i++)
    printf("C[%i] = %f \n", i, h_C[i]);
```

Free host and device memories.

```
free(h_A);
free(h_B);
free(h_C);
cudaFree(d_A);
cudaFree(d_B);
cudaFree(d_C);
```

Also pretty straightforward.

Now… this code prints the result to the console, which is not helpful at all for extremely large vectors.

Let’s workThatGPU! by measuring the time with very big vectors and comparing our CUDA solution to the C solution.
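The timing code itself is not shown in this post, so take the following as one possible way to time the CPU version, assuming the standard `clock()` function is good enough for the job. The function name `time_vector_add_ms` is made up for this sketch.

```c
#include <time.h>

/* Hypothetical helper: returns the CPU time, in milliseconds,
   spent adding a and b into c. */
double time_vector_add_ms(const float *a, const float *b, float *c, int length)
{
    clock_t start = clock();
    for (int i = 0; i < length; i++) {
        c[i] = a[i] + b[i];
    }
    clock_t end = clock();
    return 1000.0 * (double)(end - start) / CLOCKS_PER_SEC;
}
```

On the GPU side, a kernel launch returns before the kernel has finished, so a timer like this should only be stopped after a synchronization call or after the copy back to the host.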

## First, vectors of size 4096

CPU Execution time 0.000072 ms.
GPU Execution time 0.000270 ms.

The CPU time is way smaller than the GPU time. Why? Because of cudaMemcpy: transferring the data back and forth takes more time than the actual computation.
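A quick way to see how much data moves for so little work is to count the bytes. The sketch below uses a hypothetical helper, `bytes_transferred`, and only counts the three copies made by the program: two input vectors to the GPU and one result vector back.

```c
#include <stddef.h>

/* Hypothetical helper: bytes copied between host and device for one
   vector addition (h_A and h_B to the device, d_C back to the host). */
size_t bytes_transferred(size_t length)
{
    return 3 * length * sizeof(float);
}
```

Assuming 4-byte floats, that is 12 bytes crossing the bus for every single addition, so for small vectors the copies dominate.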

## Vectors of size 20000

CPU Execution time 0.000347 ms.
GPU Execution time 0.000471 ms.

Very similar times, but the CPU is still ahead.

## Vectors of size 1000000

CPU Execution time 0.015172 ms.
GPU Execution time 0.009702 ms.

Wow, nice! Our GPU is about 1.56 times faster than our CPU!

## Vectors of size 700000000!!

CPU Execution time 13.240476 ms.
GPU Execution time 0.035979 ms.

So for this vector size, our GPU is about 368 times faster than our CPU! It makes sense that our GPU works better with very large amounts of data, but when there is only a little, don’t even bother.
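The "times faster" figures are just the CPU time divided by the GPU time. As a tiny sketch, with `speedup` being a name I picked for the illustration:

```c
/* Hypothetical helper: how many times faster the GPU run was.
   Both times must be in the same unit; a value below 1 means
   the CPU was faster. */
double speedup(double cpu_time, double gpu_time)
{
    return cpu_time / gpu_time;
}
```

For example, `speedup(13.240476, 0.035979)` uses the timings of the last run, and `speedup(0.000072, 0.000270)` shows the first run, where the result is below 1.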