# Canny edge detector

On this post I’m going to implement the canny edge detector algorithm made by John Canny.

I will provide multiple implementations on CUDA:

• The canny edge detector CPU approach.
• The canny edge detector GPU 1D approach.
• The canny edge detector GPU 2D approach.

The Process of Canny edge detection algorithm can be broken down to 5 different steps:

• Apply Gaussian filter to smooth the image in order to remove the noise.
• Find the intensity gradients of the image.
• Apply non-maximum suppression to get rid of spurious response to edge detection.
• Apply double threshold to determine potential edges.
• Track edge by hysteresis: Finalise the detection of edges by suppressing all the other edges that are weak and not connected to strong edges.

The explanation in depth can be found here.

I’m not going to demonstrate or explain the equations, that is up to you my dear reader. But what I’m going to do is to explain to you my very own implementation of this algorithm step by step.

First we need to know where do we want to go, assuming the canny edge detector algorithm is a box I want something like this:

OK, the global process seems pretty straight forward, an input image, do some process of the image, get output image.

For the box is going to be something like this:

## Canny edge detector CPU approach

### Gaussian filter for noise reduction:

Supposing my input image has dimensions width x Height and applying a 5×5 Gaussian filter:

the noise reduction of an image can go like this:

```
for(i=2; i<height-2; i++)
for(j=2; j<width-2; j++)
{
// Noise reduction
NR[i*width+j] =
(2.0*im[(i-2)*width+(j-2)] +  4.0*im[(i-2)*width+(j-1)] +  5.0*im[(i-2)*width+(j)] +  4.0*im[(i-2)*width+(j+1)] + 2.0*im[(i-2)*width+(j+2)]
+ 4.0*im[(i-1)*width+(j-2)] +  9.0*im[(i-1)*width+(j-1)] + 12.0*im[(i-1)*width+(j)] +  9.0*im[(i-1)*width+(j+1)] + 4.0*im[(i-1)*width+(j+2)]
+ 5.0*im[(i  )*width+(j-2)] + 12.0*im[(i  )*width+(j-1)] + 15.0*im[(i  )*width+(j)] + 12.0*im[(i  )*width+(j+1)] + 5.0*im[(i  )*width+(j+2)]
+ 4.0*im[(i+1)*width+(j-2)] +  9.0*im[(i+1)*width+(j-1)] + 12.0*im[(i+1)*width+(j)] +  9.0*im[(i+1)*width+(j+1)] + 4.0*im[(i+1)*width+(j+2)]
+ 2.0*im[(i+2)*width+(j-2)] +  4.0*im[(i+2)*width+(j-1)] +  5.0*im[(i+2)*width+(j)] +  4.0*im[(i+2)*width+(j+1)] + 2.0*im[(i+2)*width+(j+2)])
/159.0;
}

```

In NR I’m storing the image after applying the Gaussian Filter.

### Finding the intensity gradient of the image and Non-maximum suppression.

To find the intensity gradient of the image.

and to apply the Non-maximum suppression we need to determine if the gradient magnitude (G) is a local maximum :

• if Θ = 0º (N or S) is border if gradient  G >  Ge  and Gw.
• if Θ = 90º (E or W) is border if gradient G > Gn and Gs.
• if Θ = 135º (NE or SW) is border if gradient G > Gnw and Gse.
• if Θ = 45º (NW or SW) is border if gradient G > Gne and Gsw.

For this step I’ll be needing NR

```
for(i=2; i<height-2; i++)
for(j=2; j<width-2; j++)
{
// Intensity gradient of the image
Gx[i*width+j] =
(1.0*NR[(i-2)*width+(j-2)] +  2.0*NR[(i-2)*width+(j-1)] +  (-2.0)*NR[(i-2)*width+(j+1)] + (-1.0)*NR[(i-2)*width+(j+2)]
+ 4.0*NR[(i-1)*width+(j-2)] +  8.0*NR[(i-1)*width+(j-1)] +  (-8.0)*NR[(i-1)*width+(j+1)] + (-4.0)*NR[(i-1)*width+(j+2)]
+ 6.0*NR[(i  )*width+(j-2)] + 12.0*NR[(i  )*width+(j-1)] + (-12.0)*NR[(i  )*width+(j+1)] + (-6.0)*NR[(i  )*width+(j+2)]
+ 4.0*NR[(i+1)*width+(j-2)] +  8.0*NR[(i+1)*width+(j-1)] +  (-8.0)*NR[(i+1)*width+(j+1)] + (-4.0)*NR[(i+1)*width+(j+2)]
+ 1.0*NR[(i+2)*width+(j-2)] +  2.0*NR[(i+2)*width+(j-1)] +  (-2.0)*NR[(i+2)*width+(j+1)] + (-1.0)*NR[(i+2)*width+(j+2)]);

Gy[i*width+j] =
((-1.0)*NR[(i-2)*width+(j-2)] + (-4.0)*NR[(i-2)*width+(j-1)] +  (-6.0)*NR[(i-2)*width+(j)] + (-4.0)*NR[(i-2)*width+(j+1)] + (-1.0)*NR[(i-2)*width+(j+2)]
+ (-2.0)*NR[(i-1)*width+(j-2)] + (-8.0)*NR[(i-1)*width+(j-1)] + (-12.0)*NR[(i-1)*width+(j)] + (-8.0)*NR[(i-1)*width+(j+1)] + (-2.0)*NR[(i-1)*width+(j+2)]
+    2.0*NR[(i+1)*width+(j-2)] +    8.0*NR[(i+1)*width+(j-1)] +    12.0*NR[(i+1)*width+(j)] +    8.0*NR[(i+1)*width+(j+1)] +    2.0*NR[(i+1)*width+(j+2)]
+    1.0*NR[(i+2)*width+(j-2)] +    4.0*NR[(i+2)*width+(j-1)] +     6.0*NR[(i+2)*width+(j)] +    4.0*NR[(i+2)*width+(j+1)] +    1.0*NR[(i+2)*width+(j+2)]);

G[i*width+j]   = sqrtf((Gx[i*width+j]*Gx[i*width+j])+(Gy[i*width+j]*Gy[i*width+j]));	//G = √Gx²+Gy²
phi[i*width+j] = atan2f(fabs(Gy[i*width+j]),fabs(Gx[i*width+j]));

if(fabs(phi[i*width+j])<=PI/8 )
phi[i*width+j] = 0;
else if (fabs(phi[i*width+j])<= 3*(PI/8))
phi[i*width+j] = 45;
else if (fabs(phi[i*width+j]) <= 5*(PI/8))
phi[i*width+j] = 90;
else if (fabs(phi[i*width+j]) <= 7*(PI/8))
phi[i*width+j] = 135;
else phi[i*width+j] = 0;
}

```

### Double threshold and Edge tracking by hysteresis.

```

// Edge
for(i=3; i<height-3; i++)
for(j=3; j<width-3; j++) 		{ 			pedge[i*width+j] = 0; 			if(phi[i*width+j] == 0){ 				if(G[i*width+j]>G[i*width+j+1] && G[i*width+j]>G[i*width+j-1]) //edge is in N-S
pedge[i*width+j] = 1;

} else if(phi[i*width+j] == 45) {
if(G[i*width+j]>G[(i+1)*width+j+1] && G[i*width+j]>G[(i-1)*width+j-1]) // edge is in NW-SE
pedge[i*width+j] = 1;

} else if(phi[i*width+j] == 90) {
if(G[i*width+j]>G[(i+1)*width+j] && G[i*width+j]>G[(i-1)*width+j]) //edge is in E-W
pedge[i*width+j] = 1;

} else if(phi[i*width+j] == 135) {
if(G[i*width+j]>G[(i+1)*width+j-1] && G[i*width+j]>G[(i-1)*width+j+1]) // edge is in NE-SW
pedge[i*width+j] = 1;
}
}

// Hysteresis Thresholding
lowthres = level/2;
hithres  = 2*(level);

for(i=3; i<height-3; i++)
for(j=3; j<width-3; j++) 		{ 			if(G[i*width+j]>hithres && pedge[i*width+j])
image_out[i*width+j] = 255;
else if(pedge[i*width+j] && G[i*width+j]>=lowthres && G[i*width+j]<hithres)
// check neighbours 3x3
for (ii=-1;ii<=1; ii++)
for (jj=-1;jj<=1; jj++) 						if (G[(i+ii)*width+j+jj]>hithres)
image_out[i*width+j] = 255;
}
```

Entire code implementing the CPU approach can be found on my git repository.

## The canny edge detector GPU 1D approach.

We already know all the theory so I’m gonna jump right in to the kernels.

I’ve coded four different kernels, one for each step of the algorithm.

### Noise reduction

```

__global__ void noiseReduction1d(float *d_NR, float *d_im, int height, int width)
{

int j;
j = blockIdx.x * blockDim.x + threadIdx.x;

if (j>=2 && j < width-2)
for (int i=2;i<height-2;i++){

d_NR[i*width+j] =  	(2.0*d_im[(i-2)*width+(j-2)] +  4.0*d_im[(i-2)*width+(j-1)] +  5.0*d_im[(i-2)*width+(j)] +  4.0*d_im[(i-2)*width+(j+1)] + 2.0*d_im[(i-2)*width+(j+2)]
+ 4.0*d_im[(i-1)*width+(j-2)] +  9.0*d_im[(i-1)*width+(j-1)] + 12.0*d_im[(i-1)*width+(j)] +  9.0*d_im[(i-1)*width+(j+1)] + 4.0*d_im[(i-1)*width+(j+2)]
+ 5.0*d_im[(i  )*width+(j-2)] + 12.0*d_im[(i  )*width+(j-1)] + 15.0*d_im[(i  )*width+(j)] + 12.0*d_im[(i  )*width+(j+1)] + 5.0*d_im[(i  )*width+(j+2)]
+ 4.0*d_im[(i+1)*width+(j-2)] +  9.0*d_im[(i+1)*width+(j-1)] + 12.0*d_im[(i+1)*width+(j)] +  9.0*d_im[(i+1)*width+(j+1)] + 4.0*d_im[(i+1)*width+(j+2)]
+ 2.0*d_im[(i+2)*width+(j-2)] +  4.0*d_im[(i+2)*width+(j-1)] +  5.0*d_im[(i+2)*width+(j)] +  4.0*d_im[(i+2)*width+(j+1)] + 2.0*d_im[(i+2)*width+(j+2)])
/159.0;
}

}

```

### Finding the intensity gradient of the image and Non-maximum suppression

```

__global__ void intensityGradienteIm1d(float *d_G, float *d_Gx, float *d_Gy ,float *d_phi ,float *d_NR ,int height, int width)
{
int j;
float PI = 3.141593;
j = blockIdx.x * blockDim.x + threadIdx.x;

if (j>=2 && j < width-2)
for (int i=2;i<height-2;i++){
// Intensity gradient of the image
d_Gx[i*width+j] =
(1.0*d_NR[(i-2)*width+(j-2)] +  2.0*d_NR[(i-2)*width+(j-1)] +  (-2.0)*d_NR[(i-2)*width+(j+1)] + (-1.0)*d_NR[(i-2)*width+(j+2)]
+ 4.0*d_NR[(i-1)*width+(j-2)] +  8.0*d_NR[(i-1)*width+(j-1)] +  (-8.0)*d_NR[(i-1)*width+(j+1)] + (-4.0)*d_NR[(i-1)*width+(j+2)]
+ 6.0*d_NR[(i  )*width+(j-2)] + 12.0*d_NR[(i  )*width+(j-1)] + (-12.0)*d_NR[(i  )*width+(j+1)] + (-6.0)*d_NR[(i  )*width+(j+2)]
+ 4.0*d_NR[(i+1)*width+(j-2)] +  8.0*d_NR[(i+1)*width+(j-1)] +  (-8.0)*d_NR[(i+1)*width+(j+1)] + (-4.0)*d_NR[(i+1)*width+(j+2)]
+ 1.0*d_NR[(i+2)*width+(j-2)] +  2.0*d_NR[(i+2)*width+(j-1)] +  (-2.0)*d_NR[(i+2)*width+(j+1)] + (-1.0)*d_NR[(i+2)*width+(j+2)]);

d_Gy[i*width+j] =
((-1.0)*d_NR[(i-2)*width+(j-2)] + (-4.0)*d_NR[(i-2)*width+(j-1)] +  (-6.0)*d_NR[(i-2)*width+(j)] + (-4.0)*d_NR[(i-2)*width+(j+1)] + (-1.0)*d_NR[(i-2)*width+(j+2)]
+ (-2.0)*d_NR[(i-1)*width+(j-2)] + (-8.0)*d_NR[(i-1)*width+(j-1)] + (-12.0)*d_NR[(i-1)*width+(j)] + (-8.0)*d_NR[(i-1)*width+(j+1)] + (-2.0)*d_NR[(i-1)*width+(j+2)]
+    2.0*d_NR[(i+1)*width+(j-2)] +    8.0*d_NR[(i+1)*width+(j-1)] +    12.0*d_NR[(i+1)*width+(j)] +    8.0*d_NR[(i+1)*width+(j+1)] +    2.0*d_NR[(i+1)*width+(j+2)]
+    1.0*d_NR[(i+2)*width+(j-2)] +    4.0*d_NR[(i+2)*width+(j-1)] +     6.0*d_NR[(i+2)*width+(j)] +    4.0*d_NR[(i+2)*width+(j+1)] +    1.0*d_NR[(i+2)*width+(j+2)]);

d_G[i*width+j]   = sqrtf((d_Gx[i*width+j]*d_Gx[i*width+j])+(d_Gy[i*width+j]*d_Gy[i*width+j]));	//G = √Gx²+Gy²
d_phi[i*width+j] = atan2f(fabs(d_Gy[i*width+j]),fabs(d_Gx[i*width+j]));
if(fabs(d_phi[i*width+j])<=PI/8 )
d_phi[i*width+j] = 0;
else if (fabs(d_phi[i*width+j])<= 3*(PI/8))
d_phi[i*width+j] = 45;
else if (fabs(d_phi[i*width+j]) <= 5*(PI/8))
d_phi[i*width+j] = 90;
else if (fabs(d_phi[i*width+j]) <= 7*(PI/8))
d_phi[i*width+j] = 135;
else d_phi[i*width+j] = 0;
}

}

```

### Double threshold

```

__global__ void edges1d(int *d_pedge, float *d_G,float *d_phi, int height, int width)
{
int j;
j = blockIdx.x * blockDim.x + threadIdx.x;

// Edge
if (j>=2 && j < width-2)
for (int i=2;i<height-2;i++){ 				d_pedge[i*width+j] = 0; 				if(d_phi[i*width+j] == 0){ 					if(d_G[i*width+j]>d_G[i*width+j+1] && d_G[i*width+j]>d_G[i*width+j-1]) //edge is in N-S
d_pedge[i*width+j] = 1;

} else if(d_phi[i*width+j] == 45) {
if(d_G[i*width+j]>d_G[(i+1)*width+j+1] && d_G[i*width+j]>d_G[(i-1)*width+j-1]) // edge is in NW-SE
d_pedge[i*width+j] = 1;

} else if(d_phi[i*width+j] == 90) {
if(d_G[i*width+j]>d_G[(i+1)*width+j] && d_G[i*width+j]>d_G[(i-1)*width+j]) //edge is in E-W
d_pedge[i*width+j] = 1;

} else if(d_phi[i*width+j] == 135) {
if(d_G[i*width+j]>d_G[(i+1)*width+j-1] && d_G[i*width+j]>d_G[(i-1)*width+j+1]) // edge is in NE-SW
d_pedge[i*width+j] = 1;
}
}

}

```

### Edge tracking by hysteresis

```

__global__ void hysteresisThresholding1d(float *d_image_out, int level ,int* d_pedge,float *d_G, int height, int width)
{

int j;
j = blockIdx.x * blockDim.x + threadIdx.x;

float lowthres,hithres;

// Hysteresis Thresholding
lowthres = level/2;
hithres  = 2*(level);
if (j<width)
for (int i=0;i<height;i++) 			d_image_out[i*width+j]=0.0; //Inicializamos a 0 	 __syncthreads(); 	 if (j>=3 && j < width-3)
for (int i=3;i<height-3;i++){ 			if(d_G[i*width+j]>hithres && d_pedge[i*width+j])
d_image_out[i*width+j] = 255;
else if(d_pedge[i*width+j] && d_G[i*width+j]>=lowthres && d_G[i*width+j]<hithres)
// check neighbours 3x3
for (int ii=-1;ii<=1; ii++)
for (int jj=-1;jj<=1; jj++) 						if (d_G[(i+ii)*width+j+jj]>hithres)
d_image_out[i*width+j] = 255;
}
}

```

so basically all I did was remove the inner loop and let the parallel magic do the work.

you can clone my entire project.

executing this code on an image (1920×1200)  .

speed Up of 5.01, pretty cool.

## The canny edge detector GPU 2D approach.

the same as the 1D approach, four kernels but now I paralleled both loops.

### Noise reduction

```

__global__ void noiseReduction2d(float *d_NR, float *d_im, int height, int width)
{

int j,i;

i = blockIdx.x * blockDim.x + threadIdx.x;
j = blockIdx.y * blockDim.y + threadIdx.y;

if (j>=2 && j<width-2 && i>=2 && i < height-2){
d_NR[i*width+j] =    (2.0*d_im[(i-2)*width+(j-2)] +  4.0*d_im[(i-2)*width+(j-1)] +  5.0*d_im[(i-2)*width+(j)] +  4.0*d_im[(i-2)*width+(j+1)] + 2.0*d_im[(i-2)*width+(j+2)]
+ 4.0*d_im[(i-1)*width+(j-2)] +  9.0*d_im[(i-1)*width+(j-1)] + 12.0*d_im[(i-1)*width+(j)] +  9.0*d_im[(i-1)*width+(j+1)] + 4.0*d_im[(i-1)*width+(j+2)]
+ 5.0*d_im[(i  )*width+(j-2)] + 12.0*d_im[(i  )*width+(j-1)] + 15.0*d_im[(i  )*width+(j)] + 12.0*d_im[(i  )*width+(j+1)] + 5.0*d_im[(i  )*width+(j+2)]
+ 4.0*d_im[(i+1)*width+(j-2)] +  9.0*d_im[(i+1)*width+(j-1)] + 12.0*d_im[(i+1)*width+(j)] +  9.0*d_im[(i+1)*width+(j+1)] + 4.0*d_im[(i+1)*width+(j+2)]
+ 2.0*d_im[(i+2)*width+(j-2)] +  4.0*d_im[(i+2)*width+(j-1)] +  5.0*d_im[(i+2)*width+(j)] +  4.0*d_im[(i+2)*width+(j+1)] + 2.0*d_im[(i+2)*width+(j+2)])
/159.0;

}

}
```

### Finding the intensity gradient of the image and Non-maximum suppression

```

{
int j,i;
float PI = 3.141593;
i = blockIdx.x * blockDim.x + threadIdx.x;
j = blockIdx.y * blockDim.y + threadIdx.y;

if (j>=2 && j<width-2 && i>=2 && i < height-2)
{
// Intensity gradient of the image
d_Gx[i*width+j] =
(1.0*d_NR[(i-2)*width+(j-2)] +  2.0*d_NR[(i-2)*width+(j-1)] +  (-2.0)*d_NR[(i-2)*width+(j+1)] + (-1.0)*d_NR[(i-2)*width+(j+2)]
+ 4.0*d_NR[(i-1)*width+(j-2)] +  8.0*d_NR[(i-1)*width+(j-1)] +  (-8.0)*d_NR[(i-1)*width+(j+1)] + (-4.0)*d_NR[(i-1)*width+(j+2)]
+ 6.0*d_NR[(i  )*width+(j-2)] + 12.0*d_NR[(i  )*width+(j-1)] + (-12.0)*d_NR[(i  )*width+(j+1)] + (-6.0)*d_NR[(i  )*width+(j+2)]
+ 4.0*d_NR[(i+1)*width+(j-2)] +  8.0*d_NR[(i+1)*width+(j-1)] +  (-8.0)*d_NR[(i+1)*width+(j+1)] + (-4.0)*d_NR[(i+1)*width+(j+2)]
+ 1.0*d_NR[(i+2)*width+(j-2)] +  2.0*d_NR[(i+2)*width+(j-1)] +  (-2.0)*d_NR[(i+2)*width+(j+1)] + (-1.0)*d_NR[(i+2)*width+(j+2)]);

d_Gy[i*width+j] =
((-1.0)*d_NR[(i-2)*width+(j-2)] + (-4.0)*d_NR[(i-2)*width+(j-1)] +  (-6.0)*d_NR[(i-2)*width+(j)] + (-4.0)*d_NR[(i-2)*width+(j+1)] + (-1.0)*d_NR[(i-2)*width+(j+2)]
+ (-2.0)*d_NR[(i-1)*width+(j-2)] + (-8.0)*d_NR[(i-1)*width+(j-1)] + (-12.0)*d_NR[(i-1)*width+(j)] + (-8.0)*d_NR[(i-1)*width+(j+1)] + (-2.0)*d_NR[(i-1)*width+(j+2)]
+    2.0*d_NR[(i+1)*width+(j-2)] +    8.0*d_NR[(i+1)*width+(j-1)] +    12.0*d_NR[(i+1)*width+(j)] +    8.0*d_NR[(i+1)*width+(j+1)] +    2.0*d_NR[(i+1)*width+(j+2)]
+    1.0*d_NR[(i+2)*width+(j-2)] +    4.0*d_NR[(i+2)*width+(j-1)] +     6.0*d_NR[(i+2)*width+(j)] +    4.0*d_NR[(i+2)*width+(j+1)] +    1.0*d_NR[(i+2)*width+(j+2)]);

d_G[i*width+j]   = sqrtf((d_Gx[i*width+j]*d_Gx[i*width+j])+(d_Gy[i*width+j]*d_Gy[i*width+j]));	//G = √Gx²+Gy²
d_phi[i*width+j] = atan2f(fabs(d_Gy[i*width+j]),fabs(d_Gx[i*width+j]));

if(fabs(d_phi[i*width+j])<=PI/8 )
d_phi[i*width+j] = 0;
else if (fabs(d_phi[i*width+j])<= 3*(PI/8))
d_phi[i*width+j] = 45;
else if (fabs(d_phi[i*width+j]) <= 5*(PI/8))
d_phi[i*width+j] = 90;
else if (fabs(d_phi[i*width+j]) <= 7*(PI/8))
d_phi[i*width+j] = 135;
else d_phi[i*width+j] = 0;
}

}
```

### Double threshold

```

__global__ void edges2d(int *d_pedge, float *d_G,float *d_phi, int height, int width)
{
int j,i;
i = blockIdx.x * blockDim.x + threadIdx.x;
j = blockIdx.y * blockDim.x + threadIdx.y;

// Edge
if (j>=2 && j<width-2 && i>=2 && i < height-2){ 				d_pedge[i*width+j] = 0; 				if(d_phi[i*width+j] == 0){ 					if(d_G[i*width+j]>d_G[i*width+j+1] && d_G[i*width+j]>d_G[i*width+j-1]) //edge is in N-S
d_pedge[i*width+j] = 1;

} else if(d_phi[i*width+j] == 45) {
if(d_G[i*width+j]>d_G[(i+1)*width+j+1] && d_G[i*width+j]>d_G[(i-1)*width+j-1]) // edge is in NW-SE
d_pedge[i*width+j] = 1;

} else if(d_phi[i*width+j] == 90) {
if(d_G[i*width+j]>d_G[(i+1)*width+j] && d_G[i*width+j]>d_G[(i-1)*width+j]) //edge is in E-W
d_pedge[i*width+j] = 1;

} else if(d_phi[i*width+j] == 135) {
if(d_G[i*width+j]>d_G[(i+1)*width+j-1] && d_G[i*width+j]>d_G[(i-1)*width+j+1]) // edge is in NE-SW
d_pedge[i*width+j] = 1;
}
}

}
```

### Edge tracking by hysteresis

```

__global__ void hysteresisThresholding2d(float *d_image_out, int level ,int* d_pedge,float *d_G, int height, int width)
{

int j,i;
i = blockIdx.x * blockDim.x + threadIdx.x;
j = blockIdx.y * blockDim.x + threadIdx.y;

float lowthres,hithres;

// Hysteresis Thresholding
lowthres = level/2;
hithres  = 2*(level);
if (j<width && i<height) 			d_image_out[i*width+j]=0.0; //Inicializamos a 0 	 __syncthreads(); 	if (j>=3 && j<width-3 && i>=3 && i<height-3) 		{ 			if(d_G[i*width+j]>hithres && d_pedge[i*width+j])
d_image_out[i*width+j] = 255;
else if(d_pedge[i*width+j] && d_G[i*width+j]>=lowthres && d_G[i*width+j]<hithres)
// check neighbours 3x3
for (int ii=-1;ii<=1; ii++)
for (int jj=-1;jj<=1; jj++) 						if (d_G[(i+ii)*width+j+jj]>hithres)
d_image_out[i*width+j] = 255;
}
}
```

Now with both loops executing on parallel things gets interesting.

you can clone the 2D approach.

executing this code on an image (1920×1200) .

Speed Up is even bigger now, 9.4X times faster than the CPU and 2X times faster than 1D version.

# First steps on CUDA

On this post I’m going to use the programming language “C” for our first steps in CUDA. Remember that “C++” is also compatible.

To start we have to know the basic structure of any CUDA program: is like a regular program but in this one we have to communicate with our Nvidia GPU, keeping in mind memory assignment, memory transactions and calling the kernel itself.

For example, we want to create an array of 10 float elements in “C”. It would be something like this:

This memory is allocated in our HOST memory, is not visible to our GPU. Now lets  do the same procedure in a GPU.

## cudaMalloc

This directive will let us allocate memory in our GPU global memory. On the contrary as before, it is only visible from our GPU, therefore we cannot use this in our HOST program.

On the Nvidia Documentation we can check how a cudaMalloc works.

cudaError_t cudaMalloc(void** devPtr, size_t size): applying this procedure we can allocate an array of 10 float elements in our GPU memory.

Now, lets say we populate our array in our HOST and we want to move that array to our GPU for massive parallel computations. How can we do that?

## cudaMemcpy

This directive will let us transfer memory from our host to our GPU and vice-versa.

On the Nvidia Documentation we can check how a cudaMemcpy works.

cudaError_t cudaMemcpy (void* dst, const void* src, size_t count, cudaMemcpyKind kind), where:

void* dst is our destination.

const void* src is our source.

size_t count is the total amount of memory to transfer.

cudaMemcpyKind kind is the kind of copy we want to make, it can be from our host to our device or from our device to our host. For a better explanation click here.

For our example it will go like this:

Now we have our memory in our device ready.

So let’s recap a little. We know how to allocate and transfer memory, so now it’s time for the fun part, the kernel.

Basically the kernel is the procedure that is going to be executed in our GPU.

## Kernel, grids, blocks and threads

So lets say we want our device to add a number ‘p’ to each element in the array.

This seems like a problem that our GPU can easily take care of. For example, “N” threads and each thread can make 1 addition, so instead of “N” iterations in a loop, it would only be 1 instruction.

### How to create a kernel procedure

The kernel procedure must be declared with the directive __global__. This indicates that is a kernel and it’s going to be executed in the GPU.

Our kernel will look something like this:

__global__ void addPtoArray(float* d_array,int lenght,int p)

{

Implementation of the kernel here:

}

And to invoke our kernel we have to use the triple chevron, define grids and threads and finally the name of the kernel.

It’s better explained with an image:

In order to call a kernel, we have to specify how many threads are going to be executed. We can manage the threads in a grid like the previous image.

The grid and blocks are specified using an structure called dim3. Each dim3 structure has 3 values (X,Y and Z), so we can create a 1d,2d or 3d grid.

Inside the kernel we have information about the grid and thread that is being executed.

blockIdx is the index of the block. In the image example, the block highlighted in green is blockIdx.x=2 and blockIdx.y=1

blockDim is the dimension of the grid, in the image example is blockDim.x=4 and blockDim.y=4

Going back to our example the kernel will look something like this:

Now we are set to work that GPU!

Our code will look something like this:

```#include<stdio.h>;
#include <stdlib.h>;
#include <sys/time.h>;
//CUDA
#include <cuda.h>;

//kernel implementation
__global__ void addPtoArray(float *d_array, int lenght,int p){
int i;
i = blockIdx.x * blockDim.x + threadIdx.x; // We calculate the current thread of our context
if (i<lenght)//if the thread is inside the lenght of our array
d_array[i]+=p;
}

int main(int argc, char *argv[])
{
float *h_array, *d_array,*h_hostResult;
int p = 20;
int lenght=5;

// Mallocs CPU
h_array = (float *)malloc(sizeof(float)*lenght);
for (int i=0; i<lenght; i++){ h_array[i] = i;}    /* Mallocs GPU */    cudaMalloc((void **) &amp;amp;amp;amp;amp;amp;amp;amp;amp;d_array,sizeof(float)*lenght);   /* CPU-&amp;amp;amp;amp;amp;amp;amp;amp;gt;GPU */
cudaMemcpy(d_array, h_array, sizeof(float)*lenght, cudaMemcpyHostToDevice);

/*****************/
/*****************/
dim3 GridBlocks(256);

/* GPU->CPU */
h_hostResult = (float *)malloc(sizeof(float)*lenght);
cudaMemcpy(h_hostResult,d_array,sizeof(float)*lenght,cudaMemcpyDeviceToHost);

/************/
/* Results */
/************/
for (int i=0; i<lenght; i++)
{
printf("h_array[%i] = %f \n",i, h_array[i]);
printf("h_hostResult[%i] = %f \n",i, h_hostResult[i]);

}

/* Free CPU */
free(h_hostResult);
free(h_array);
/* Free GPU */
cudaFree(d_array);
return(CUDA_SUCCESS);
}
```

Save this code on a file, let’s say addPtoArray.cu