# Canny edge detector

In this post I’m going to implement the Canny edge detector, an edge detection algorithm developed by John Canny.

I will provide three implementations:

• The Canny edge detector, CPU approach.
• The Canny edge detector, GPU 1D approach.
• The Canny edge detector, GPU 2D approach.

The process of the Canny edge detection algorithm can be broken down into 5 steps:

• Apply a Gaussian filter to smooth the image and remove noise.
• Find the intensity gradients of the image.
• Apply non-maximum suppression to get rid of spurious responses to edge detection.
• Apply a double threshold to determine potential edges.
• Track edges by hysteresis: finalise the detection by suppressing all weak edges that are not connected to strong edges.

An in-depth explanation can be found here.

I’m not going to derive or explain the equations; that is up to you, my dear reader. What I am going to do is explain my own implementation of this algorithm, step by step.

First we need to know where we want to go. Treating the Canny edge detector as a box, I want something like this:

OK, the global process seems pretty straightforward: an input image goes in, some processing happens, and an output image comes out.

Inside the box, it is going to look something like this:

## Canny edge detector CPU approach

### Gaussian filter for noise reduction:

Supposing my input image has dimensions width × height, and applying a 5×5 Gaussian filter,

the noise reduction of the image can go like this:

```
for(i=2; i<height-2; i++)
for(j=2; j<width-2; j++)
{
// Noise reduction
NR[i*width+j] =
(2.0*im[(i-2)*width+(j-2)] +  4.0*im[(i-2)*width+(j-1)] +  5.0*im[(i-2)*width+(j)] +  4.0*im[(i-2)*width+(j+1)] + 2.0*im[(i-2)*width+(j+2)]
+ 4.0*im[(i-1)*width+(j-2)] +  9.0*im[(i-1)*width+(j-1)] + 12.0*im[(i-1)*width+(j)] +  9.0*im[(i-1)*width+(j+1)] + 4.0*im[(i-1)*width+(j+2)]
+ 5.0*im[(i  )*width+(j-2)] + 12.0*im[(i  )*width+(j-1)] + 15.0*im[(i  )*width+(j)] + 12.0*im[(i  )*width+(j+1)] + 5.0*im[(i  )*width+(j+2)]
+ 4.0*im[(i+1)*width+(j-2)] +  9.0*im[(i+1)*width+(j-1)] + 12.0*im[(i+1)*width+(j)] +  9.0*im[(i+1)*width+(j+1)] + 4.0*im[(i+1)*width+(j+2)]
+ 2.0*im[(i+2)*width+(j-2)] +  4.0*im[(i+2)*width+(j-1)] +  5.0*im[(i+2)*width+(j)] +  4.0*im[(i+2)*width+(j+1)] + 2.0*im[(i+2)*width+(j+2)])
/159.0;
}

```

In NR I’m storing the image after applying the Gaussian filter.

### Finding the intensity gradient of the image and Non-maximum suppression.

To find the intensity gradient of the image we compute Gx and Gy (the convolutions in the code below),

and to apply non-maximum suppression we need to determine whether the gradient magnitude (G) is a local maximum:

• if Θ = 0º (N or S), it is a border if G > Ge and G > Gw.
• if Θ = 90º (E or W), it is a border if G > Gn and G > Gs.
• if Θ = 135º (NE or SW), it is a border if G > Gnw and G > Gse.
• if Θ = 45º (NW or SE), it is a border if G > Gne and G > Gsw.

For this step I’ll be needing NR:

```
for(i=2; i<height-2; i++)
for(j=2; j<width-2; j++)
{
// Intensity gradient of the image
Gx[i*width+j] =
(1.0*NR[(i-2)*width+(j-2)] +  2.0*NR[(i-2)*width+(j-1)] +  (-2.0)*NR[(i-2)*width+(j+1)] + (-1.0)*NR[(i-2)*width+(j+2)]
+ 4.0*NR[(i-1)*width+(j-2)] +  8.0*NR[(i-1)*width+(j-1)] +  (-8.0)*NR[(i-1)*width+(j+1)] + (-4.0)*NR[(i-1)*width+(j+2)]
+ 6.0*NR[(i  )*width+(j-2)] + 12.0*NR[(i  )*width+(j-1)] + (-12.0)*NR[(i  )*width+(j+1)] + (-6.0)*NR[(i  )*width+(j+2)]
+ 4.0*NR[(i+1)*width+(j-2)] +  8.0*NR[(i+1)*width+(j-1)] +  (-8.0)*NR[(i+1)*width+(j+1)] + (-4.0)*NR[(i+1)*width+(j+2)]
+ 1.0*NR[(i+2)*width+(j-2)] +  2.0*NR[(i+2)*width+(j-1)] +  (-2.0)*NR[(i+2)*width+(j+1)] + (-1.0)*NR[(i+2)*width+(j+2)]);

Gy[i*width+j] =
((-1.0)*NR[(i-2)*width+(j-2)] + (-4.0)*NR[(i-2)*width+(j-1)] +  (-6.0)*NR[(i-2)*width+(j)] + (-4.0)*NR[(i-2)*width+(j+1)] + (-1.0)*NR[(i-2)*width+(j+2)]
+ (-2.0)*NR[(i-1)*width+(j-2)] + (-8.0)*NR[(i-1)*width+(j-1)] + (-12.0)*NR[(i-1)*width+(j)] + (-8.0)*NR[(i-1)*width+(j+1)] + (-2.0)*NR[(i-1)*width+(j+2)]
+    2.0*NR[(i+1)*width+(j-2)] +    8.0*NR[(i+1)*width+(j-1)] +    12.0*NR[(i+1)*width+(j)] +    8.0*NR[(i+1)*width+(j+1)] +    2.0*NR[(i+1)*width+(j+2)]
+    1.0*NR[(i+2)*width+(j-2)] +    4.0*NR[(i+2)*width+(j-1)] +     6.0*NR[(i+2)*width+(j)] +    4.0*NR[(i+2)*width+(j+1)] +    1.0*NR[(i+2)*width+(j+2)]);

G[i*width+j]   = sqrtf((Gx[i*width+j]*Gx[i*width+j])+(Gy[i*width+j]*Gy[i*width+j]));	//G = √Gx²+Gy²
phi[i*width+j] = atan2f(fabs(Gy[i*width+j]),fabs(Gx[i*width+j]));

if(fabs(phi[i*width+j])<=PI/8 )
phi[i*width+j] = 0;
else if (fabs(phi[i*width+j])<= 3*(PI/8))
phi[i*width+j] = 45;
else if (fabs(phi[i*width+j]) <= 5*(PI/8))
phi[i*width+j] = 90;
else if (fabs(phi[i*width+j]) <= 7*(PI/8))
phi[i*width+j] = 135;
else phi[i*width+j] = 0;
}

```

### Double threshold and Edge tracking by hysteresis.

```

// Edge
for(i=3; i<height-3; i++)
	for(j=3; j<width-3; j++)
	{
		pedge[i*width+j] = 0;
		if(phi[i*width+j] == 0){
			if(G[i*width+j]>G[i*width+j+1] && G[i*width+j]>G[i*width+j-1]) //edge is in N-S
				pedge[i*width+j] = 1;

		} else if(phi[i*width+j] == 45) {
			if(G[i*width+j]>G[(i+1)*width+j+1] && G[i*width+j]>G[(i-1)*width+j-1]) // edge is in NW-SE
				pedge[i*width+j] = 1;

		} else if(phi[i*width+j] == 90) {
			if(G[i*width+j]>G[(i+1)*width+j] && G[i*width+j]>G[(i-1)*width+j]) //edge is in E-W
				pedge[i*width+j] = 1;

		} else if(phi[i*width+j] == 135) {
			if(G[i*width+j]>G[(i+1)*width+j-1] && G[i*width+j]>G[(i-1)*width+j+1]) // edge is in NE-SW
				pedge[i*width+j] = 1;
		}
	}

// Hysteresis Thresholding
lowthres = level/2;
hithres  = 2*(level);

for(i=3; i<height-3; i++)
	for(j=3; j<width-3; j++)
	{
		if(G[i*width+j]>hithres && pedge[i*width+j])
			image_out[i*width+j] = 255;
		else if(pedge[i*width+j] && G[i*width+j]>=lowthres && G[i*width+j]<hithres)
			// check 3x3 neighbours
			for (ii=-1;ii<=1; ii++)
				for (jj=-1;jj<=1; jj++)
					if (G[(i+ii)*width+j+jj]>hithres)
						image_out[i*width+j] = 255;
	}
```

The entire code implementing the CPU approach can be found in my git repository.

## The canny edge detector GPU 1D approach.

We already know all the theory, so I’m gonna jump right into the kernels.

I’ve coded four different kernels, one for each step of the algorithm.

### Noise reduction

```

__global__ void noiseReduction1d(float *d_NR, float *d_im, int height, int width)
{

int j;
j = blockIdx.x * blockDim.x + threadIdx.x;

if (j>=2 && j < width-2)
for (int i=2;i<height-2;i++){

d_NR[i*width+j] =  	(2.0*d_im[(i-2)*width+(j-2)] +  4.0*d_im[(i-2)*width+(j-1)] +  5.0*d_im[(i-2)*width+(j)] +  4.0*d_im[(i-2)*width+(j+1)] + 2.0*d_im[(i-2)*width+(j+2)]
+ 4.0*d_im[(i-1)*width+(j-2)] +  9.0*d_im[(i-1)*width+(j-1)] + 12.0*d_im[(i-1)*width+(j)] +  9.0*d_im[(i-1)*width+(j+1)] + 4.0*d_im[(i-1)*width+(j+2)]
+ 5.0*d_im[(i  )*width+(j-2)] + 12.0*d_im[(i  )*width+(j-1)] + 15.0*d_im[(i  )*width+(j)] + 12.0*d_im[(i  )*width+(j+1)] + 5.0*d_im[(i  )*width+(j+2)]
+ 4.0*d_im[(i+1)*width+(j-2)] +  9.0*d_im[(i+1)*width+(j-1)] + 12.0*d_im[(i+1)*width+(j)] +  9.0*d_im[(i+1)*width+(j+1)] + 4.0*d_im[(i+1)*width+(j+2)]
+ 2.0*d_im[(i+2)*width+(j-2)] +  4.0*d_im[(i+2)*width+(j-1)] +  5.0*d_im[(i+2)*width+(j)] +  4.0*d_im[(i+2)*width+(j+1)] + 2.0*d_im[(i+2)*width+(j+2)])
/159.0;
}

}

```

### Finding the intensity gradient of the image and Non-maximum suppression

```

__global__ void intensityGradienteIm1d(float *d_G, float *d_Gx, float *d_Gy ,float *d_phi ,float *d_NR ,int height, int width)
{
int j;
float PI = 3.141593;
j = blockIdx.x * blockDim.x + threadIdx.x;

if (j>=2 && j < width-2)
for (int i=2;i<height-2;i++){
// Intensity gradient of the image
d_Gx[i*width+j] =
(1.0*d_NR[(i-2)*width+(j-2)] +  2.0*d_NR[(i-2)*width+(j-1)] +  (-2.0)*d_NR[(i-2)*width+(j+1)] + (-1.0)*d_NR[(i-2)*width+(j+2)]
+ 4.0*d_NR[(i-1)*width+(j-2)] +  8.0*d_NR[(i-1)*width+(j-1)] +  (-8.0)*d_NR[(i-1)*width+(j+1)] + (-4.0)*d_NR[(i-1)*width+(j+2)]
+ 6.0*d_NR[(i  )*width+(j-2)] + 12.0*d_NR[(i  )*width+(j-1)] + (-12.0)*d_NR[(i  )*width+(j+1)] + (-6.0)*d_NR[(i  )*width+(j+2)]
+ 4.0*d_NR[(i+1)*width+(j-2)] +  8.0*d_NR[(i+1)*width+(j-1)] +  (-8.0)*d_NR[(i+1)*width+(j+1)] + (-4.0)*d_NR[(i+1)*width+(j+2)]
+ 1.0*d_NR[(i+2)*width+(j-2)] +  2.0*d_NR[(i+2)*width+(j-1)] +  (-2.0)*d_NR[(i+2)*width+(j+1)] + (-1.0)*d_NR[(i+2)*width+(j+2)]);

d_Gy[i*width+j] =
((-1.0)*d_NR[(i-2)*width+(j-2)] + (-4.0)*d_NR[(i-2)*width+(j-1)] +  (-6.0)*d_NR[(i-2)*width+(j)] + (-4.0)*d_NR[(i-2)*width+(j+1)] + (-1.0)*d_NR[(i-2)*width+(j+2)]
+ (-2.0)*d_NR[(i-1)*width+(j-2)] + (-8.0)*d_NR[(i-1)*width+(j-1)] + (-12.0)*d_NR[(i-1)*width+(j)] + (-8.0)*d_NR[(i-1)*width+(j+1)] + (-2.0)*d_NR[(i-1)*width+(j+2)]
+    2.0*d_NR[(i+1)*width+(j-2)] +    8.0*d_NR[(i+1)*width+(j-1)] +    12.0*d_NR[(i+1)*width+(j)] +    8.0*d_NR[(i+1)*width+(j+1)] +    2.0*d_NR[(i+1)*width+(j+2)]
+    1.0*d_NR[(i+2)*width+(j-2)] +    4.0*d_NR[(i+2)*width+(j-1)] +     6.0*d_NR[(i+2)*width+(j)] +    4.0*d_NR[(i+2)*width+(j+1)] +    1.0*d_NR[(i+2)*width+(j+2)]);

d_G[i*width+j]   = sqrtf((d_Gx[i*width+j]*d_Gx[i*width+j])+(d_Gy[i*width+j]*d_Gy[i*width+j]));	//G = √Gx²+Gy²
d_phi[i*width+j] = atan2f(fabs(d_Gy[i*width+j]),fabs(d_Gx[i*width+j]));
if(fabs(d_phi[i*width+j])<=PI/8 )
d_phi[i*width+j] = 0;
else if (fabs(d_phi[i*width+j])<= 3*(PI/8))
d_phi[i*width+j] = 45;
else if (fabs(d_phi[i*width+j]) <= 5*(PI/8))
d_phi[i*width+j] = 90;
else if (fabs(d_phi[i*width+j]) <= 7*(PI/8))
d_phi[i*width+j] = 135;
else d_phi[i*width+j] = 0;
}

}

```

### Double threshold

```

__global__ void edges1d(int *d_pedge, float *d_G,float *d_phi, int height, int width)
{
int j;
j = blockIdx.x * blockDim.x + threadIdx.x;

// Edge
if (j>=2 && j < width-2)
	for (int i=2;i<height-2;i++){
		d_pedge[i*width+j] = 0;
		if(d_phi[i*width+j] == 0){
			if(d_G[i*width+j]>d_G[i*width+j+1] && d_G[i*width+j]>d_G[i*width+j-1]) //edge is in N-S
				d_pedge[i*width+j] = 1;

		} else if(d_phi[i*width+j] == 45) {
			if(d_G[i*width+j]>d_G[(i+1)*width+j+1] && d_G[i*width+j]>d_G[(i-1)*width+j-1]) // edge is in NW-SE
				d_pedge[i*width+j] = 1;

		} else if(d_phi[i*width+j] == 90) {
			if(d_G[i*width+j]>d_G[(i+1)*width+j] && d_G[i*width+j]>d_G[(i-1)*width+j]) //edge is in E-W
				d_pedge[i*width+j] = 1;

		} else if(d_phi[i*width+j] == 135) {
			if(d_G[i*width+j]>d_G[(i+1)*width+j-1] && d_G[i*width+j]>d_G[(i-1)*width+j+1]) // edge is in NE-SW
				d_pedge[i*width+j] = 1;
		}
	}

}

```

### Edge tracking by hysteresis

```

__global__ void hysteresisThresholding1d(float *d_image_out, int level ,int* d_pedge,float *d_G, int height, int width)
{

int j;
j = blockIdx.x * blockDim.x + threadIdx.x;

float lowthres,hithres;

// Hysteresis Thresholding
lowthres = level/2;
hithres  = 2*(level);
if (j<width)
	for (int i=0;i<height;i++)
		d_image_out[i*width+j] = 0.0; // initialize to 0

__syncthreads();

if (j>=3 && j < width-3)
	for (int i=3;i<height-3;i++){
		if(d_G[i*width+j]>hithres && d_pedge[i*width+j])
			d_image_out[i*width+j] = 255;
		else if(d_pedge[i*width+j] && d_G[i*width+j]>=lowthres && d_G[i*width+j]<hithres)
			// check 3x3 neighbours
			for (int ii=-1;ii<=1; ii++)
				for (int jj=-1;jj<=1; jj++)
					if (d_G[(i+ii)*width+j+jj]>hithres)
						d_image_out[i*width+j] = 255;
	}
}

```

So basically all I did was remove the inner loop (each thread now handles one column) and let the parallel magic do the work.

You can clone my entire project.

Executing this code on a 1920×1200 image gives a speedup of 5.01×, pretty cool.

## The canny edge detector GPU 2D approach.

The same as the 1D approach, four kernels, but now I parallelized both loops.

### Noise reduction

```

__global__ void noiseReduction2d(float *d_NR, float *d_im, int height, int width)
{

int j,i;

i = blockIdx.x * blockDim.x + threadIdx.x;
j = blockIdx.y * blockDim.y + threadIdx.y;

if (j>=2 && j<width-2 && i>=2 && i < height-2){
d_NR[i*width+j] =    (2.0*d_im[(i-2)*width+(j-2)] +  4.0*d_im[(i-2)*width+(j-1)] +  5.0*d_im[(i-2)*width+(j)] +  4.0*d_im[(i-2)*width+(j+1)] + 2.0*d_im[(i-2)*width+(j+2)]
+ 4.0*d_im[(i-1)*width+(j-2)] +  9.0*d_im[(i-1)*width+(j-1)] + 12.0*d_im[(i-1)*width+(j)] +  9.0*d_im[(i-1)*width+(j+1)] + 4.0*d_im[(i-1)*width+(j+2)]
+ 5.0*d_im[(i  )*width+(j-2)] + 12.0*d_im[(i  )*width+(j-1)] + 15.0*d_im[(i  )*width+(j)] + 12.0*d_im[(i  )*width+(j+1)] + 5.0*d_im[(i  )*width+(j+2)]
+ 4.0*d_im[(i+1)*width+(j-2)] +  9.0*d_im[(i+1)*width+(j-1)] + 12.0*d_im[(i+1)*width+(j)] +  9.0*d_im[(i+1)*width+(j+1)] + 4.0*d_im[(i+1)*width+(j+2)]
+ 2.0*d_im[(i+2)*width+(j-2)] +  4.0*d_im[(i+2)*width+(j-1)] +  5.0*d_im[(i+2)*width+(j)] +  4.0*d_im[(i+2)*width+(j+1)] + 2.0*d_im[(i+2)*width+(j+2)])
/159.0;

}

}
```

### Finding the intensity gradient of the image and Non-maximum suppression

```

// Kernel signature missing in the original listing; reconstructed to mirror the 1D version
__global__ void intensityGradienteIm2d(float *d_G, float *d_Gx, float *d_Gy, float *d_phi, float *d_NR, int height, int width)
{
int j,i;
float PI = 3.141593;
i = blockIdx.x * blockDim.x + threadIdx.x;
j = blockIdx.y * blockDim.y + threadIdx.y;

if (j>=2 && j<width-2 && i>=2 && i < height-2)
{
// Intensity gradient of the image
d_Gx[i*width+j] =
(1.0*d_NR[(i-2)*width+(j-2)] +  2.0*d_NR[(i-2)*width+(j-1)] +  (-2.0)*d_NR[(i-2)*width+(j+1)] + (-1.0)*d_NR[(i-2)*width+(j+2)]
+ 4.0*d_NR[(i-1)*width+(j-2)] +  8.0*d_NR[(i-1)*width+(j-1)] +  (-8.0)*d_NR[(i-1)*width+(j+1)] + (-4.0)*d_NR[(i-1)*width+(j+2)]
+ 6.0*d_NR[(i  )*width+(j-2)] + 12.0*d_NR[(i  )*width+(j-1)] + (-12.0)*d_NR[(i  )*width+(j+1)] + (-6.0)*d_NR[(i  )*width+(j+2)]
+ 4.0*d_NR[(i+1)*width+(j-2)] +  8.0*d_NR[(i+1)*width+(j-1)] +  (-8.0)*d_NR[(i+1)*width+(j+1)] + (-4.0)*d_NR[(i+1)*width+(j+2)]
+ 1.0*d_NR[(i+2)*width+(j-2)] +  2.0*d_NR[(i+2)*width+(j-1)] +  (-2.0)*d_NR[(i+2)*width+(j+1)] + (-1.0)*d_NR[(i+2)*width+(j+2)]);

d_Gy[i*width+j] =
((-1.0)*d_NR[(i-2)*width+(j-2)] + (-4.0)*d_NR[(i-2)*width+(j-1)] +  (-6.0)*d_NR[(i-2)*width+(j)] + (-4.0)*d_NR[(i-2)*width+(j+1)] + (-1.0)*d_NR[(i-2)*width+(j+2)]
+ (-2.0)*d_NR[(i-1)*width+(j-2)] + (-8.0)*d_NR[(i-1)*width+(j-1)] + (-12.0)*d_NR[(i-1)*width+(j)] + (-8.0)*d_NR[(i-1)*width+(j+1)] + (-2.0)*d_NR[(i-1)*width+(j+2)]
+    2.0*d_NR[(i+1)*width+(j-2)] +    8.0*d_NR[(i+1)*width+(j-1)] +    12.0*d_NR[(i+1)*width+(j)] +    8.0*d_NR[(i+1)*width+(j+1)] +    2.0*d_NR[(i+1)*width+(j+2)]
+    1.0*d_NR[(i+2)*width+(j-2)] +    4.0*d_NR[(i+2)*width+(j-1)] +     6.0*d_NR[(i+2)*width+(j)] +    4.0*d_NR[(i+2)*width+(j+1)] +    1.0*d_NR[(i+2)*width+(j+2)]);

d_G[i*width+j]   = sqrtf((d_Gx[i*width+j]*d_Gx[i*width+j])+(d_Gy[i*width+j]*d_Gy[i*width+j]));	//G = √Gx²+Gy²
d_phi[i*width+j] = atan2f(fabs(d_Gy[i*width+j]),fabs(d_Gx[i*width+j]));

if(fabs(d_phi[i*width+j])<=PI/8 )
d_phi[i*width+j] = 0;
else if (fabs(d_phi[i*width+j])<= 3*(PI/8))
d_phi[i*width+j] = 45;
else if (fabs(d_phi[i*width+j]) <= 5*(PI/8))
d_phi[i*width+j] = 90;
else if (fabs(d_phi[i*width+j]) <= 7*(PI/8))
d_phi[i*width+j] = 135;
else d_phi[i*width+j] = 0;
}

}
```

### Double threshold

```

__global__ void edges2d(int *d_pedge, float *d_G,float *d_phi, int height, int width)
{
int j,i;
i = blockIdx.x * blockDim.x + threadIdx.x;
j = blockIdx.y * blockDim.y + threadIdx.y;

// Edge
if (j>=2 && j<width-2 && i>=2 && i < height-2){
	d_pedge[i*width+j] = 0;
	if(d_phi[i*width+j] == 0){
		if(d_G[i*width+j]>d_G[i*width+j+1] && d_G[i*width+j]>d_G[i*width+j-1]) //edge is in N-S
			d_pedge[i*width+j] = 1;

	} else if(d_phi[i*width+j] == 45) {
		if(d_G[i*width+j]>d_G[(i+1)*width+j+1] && d_G[i*width+j]>d_G[(i-1)*width+j-1]) // edge is in NW-SE
			d_pedge[i*width+j] = 1;

	} else if(d_phi[i*width+j] == 90) {
		if(d_G[i*width+j]>d_G[(i+1)*width+j] && d_G[i*width+j]>d_G[(i-1)*width+j]) //edge is in E-W
			d_pedge[i*width+j] = 1;

	} else if(d_phi[i*width+j] == 135) {
		if(d_G[i*width+j]>d_G[(i+1)*width+j-1] && d_G[i*width+j]>d_G[(i-1)*width+j+1]) // edge is in NE-SW
			d_pedge[i*width+j] = 1;
	}
}

}
```

### Edge tracking by hysteresis

```

__global__ void hysteresisThresholding2d(float *d_image_out, int level ,int* d_pedge,float *d_G, int height, int width)
{

int j,i;
i = blockIdx.x * blockDim.x + threadIdx.x;
j = blockIdx.y * blockDim.y + threadIdx.y;

float lowthres,hithres;

// Hysteresis Thresholding
lowthres = level/2;
hithres  = 2*(level);
if (j<width && i<height)
	d_image_out[i*width+j] = 0.0; // initialize to 0

__syncthreads();

if (j>=3 && j<width-3 && i>=3 && i<height-3)
{
	if(d_G[i*width+j]>hithres && d_pedge[i*width+j])
		d_image_out[i*width+j] = 255;
	else if(d_pedge[i*width+j] && d_G[i*width+j]>=lowthres && d_G[i*width+j]<hithres)
		// check 3x3 neighbours
		for (int ii=-1;ii<=1; ii++)
			for (int jj=-1;jj<=1; jj++)
				if (d_G[(i+ii)*width+j+jj]>hithres)
					d_image_out[i*width+j] = 255;
}
}
```

Now, with both loops executing in parallel, things get interesting.

You can clone the 2D approach.

Executing this code on a 1920×1200 image, the speedup is even bigger: 9.4× faster than the CPU and 2× faster than the 1D version.

# Grids, threads… THE ULTIMATE GUIDE

Most users struggle a lot with this topic; in CUDA it is important to make good use of grids and threads, so let’s start:

## What are grids and thread blocks?

In CUDA we have to define what our grid is going to look like and how many threads each block inside the grid is going to have.

Kernels are executed over grids of thread blocks, just like in the following image:

## Identify the dimension of the problem

There are multiple ways to do this, but let’s try to make it simple.

In CUDA, and GPU programming in general, we have from 1 to 3 dimensions; values outside that range will give runtime errors.

So for example, if we have an array, our problem is a one-dimensional problem, a matrix is a two-dimensional problem and a cube/parallelepiped is a three-dimensional problem.

### One dimensional problems

These are the easiest ones: we only need to work on the X axis, and our problem is basically reduced to the length of the array.

Let’s say we have an array of length 1 million:

We have to decide how many blocks we want and how many threads per block we want.

First, remember that on most devices the maximum number of threads per block is 512 or 1024; anything bigger will give an error.

After deciding the threads per block (I’m going to use 1024), the grid size is ceil(1M/1024).

Our declaration and the call to the kernel will look something like this:

```
dim3 gridBlocks(ceil(1000000/1024.0), 1, 1); // 1024.0: with integer division, ceil() would have nothing left to round
dim3 threadBlocks(1024, 1, 1);               // 1024 threads per block
```

This seems pretty straightforward; the Y and Z dimensions are 1 since we are only working on the X axis.

## Two dimensional problems

These might be the most common: we need to work on the X and Y axes.

Let us divide this into two examples, a square and a rectangle, the first being a special case of the second: when both sides of a rectangle have the same length, it is a square (duh!?).

### Square:

Let us assume we have a square image of A × A pixels, so we need to create a grid that wastes as few threads as possible in order to use them in the most efficient way.

First we need to set the size of our grid. This is a problem-specific matter, but we can look at a few examples.

For very small values of A (8 < A < 64):

We can divide the square into 4 sub-squares; for example, if A = 16 we have the following layout:

So our grid will be 2 × 2 and each block will have 8 × 8 threads.

We can code that as follows:

```
dim3 gridBlocks(2,2,1);   // 2 x 2 grid
dim3 threadBlocks(8,8,1); // 8 x 8 threads per block
```

For medium sizes (64 < A < 512):

We can divide the square into 16 sub-squares.

This one is 32 × 32; I’m using a small grid just for illustration purposes.

We can code this as follows:

```
dim3 gridBlocks(4,4,1); // 4 x 4 grid
```

These examples are just illustrations, not by any means a rule; remember, the number of blocks and threads is a matter specific to the problem itself.

### Rectangle

So let us assume we have an image shaped like a rectangle of A × B pixels, with A != B.

Assuming A and B are big enough, for illustration purposes I will consider two cases, A < B and A > B.

#### A < B:

We get the following layout

This can be coded as follow

```
dim3 gridBlocks(4,2,1); // 4 x 2 grid
```

#### A > B:

We get the following layout

This can be coded as follow

```
dim3 gridBlocks(2,4,1); // 2 x 4 grid
```

# My First CUDA Program

So for this post I wanted to show you guys the “Hello World” of CUDA. And no, we are not going to printf(“Hello World”); we are going to add vectors! Yeah, feeling the hype!?

so let’s start:

First, if you are a complete newbie, I recommend my previous post, First steps on CUDA; get a handle on how I work and then come back.

Since we are all good software developers, let’s define our problem and propose a simple solution in the C programming language.

# Our problem in C

We want to compute C = A + B where A, B, C are vectors; obviously these 3 have the same length.

So:

Allocate memory for A, B and C (for simplicity, the length of the array is an input parameter, and the type will be float):

```        int length;
float *vA,*vB,*vC;

if (argc!=2){
printf("./exec n \n");
exit(-1);
}
length = atoi(argv[1]);
//Allocating memory
vA = (float*)malloc(sizeof(float)*length);
vB = (float*)malloc(sizeof(float)*length);
vC = (float*)malloc(sizeof(float)*length);
```

Initialize A, B

```
/* Initialize A and B: A[i] = 2*i and B[i] = 3*i */
for (int i=0;i < length; i++ ){
vA[i]=i*2;
vB[i]=i*3;
}
```

Make the computation C = A + B

```        for (int i=0;i < length ; i++ )
vC[i] = vA[i]+ vB[i];
```

Show result on console

```         for (int i=0;i < length; i++ )
printf ("C[%i] = %f \n",i,vC[i]);
```

Free memory

```         free(vA);
free(vB);
free(vC);
```

Seems pretty straightforward, right? It’s easy to see that the computations of C[i] and C[i+1] are completely independent, making this a perfect job for our GPU.

# Our problem in CUDA

Now the fun part. It is pretty similar to the solution in C, but now we have to add the GPU things (duh!).

To make the code easier to understand, all host variables will have the prefix h_ and all device variables the prefix d_.

We need to:

Allocate memory for h_A, h_B and h_C on our host. These variables are our vectors:

```        int length;
float *h_A,*h_B,*h_C,*d_A,*d_B,*d_C;

if (argc!=2){
printf("./exec n \n");
exit(-1);
}
length = atoi(argv[1]);

//Allocating memory
h_A = (float*)malloc(sizeof(float)*length);
h_B = (float*)malloc(sizeof(float)*length);
h_C = (float*)malloc(sizeof(float)*length);
```

Allocate memory for d_A, d_B and d_C on our device. These variables are our vectors on the device:

```        cudaMalloc((void **)&d_A,sizeof(float)*length);
cudaMalloc((void **)&d_B,sizeof(float)*length);
cudaMalloc((void **)&d_C,sizeof(float)*length);
```

Initialize the host vectors h_A and h_B.

```         for (int i=0;i < length;i++){
h_A[i]=i*2;
h_B[i]=i*3;
}
```

Transfer the memory from  (h_A, h_B) to (d_A,d_B).

```      /* CPU to GPU */
cudaMemcpy(d_A, h_A, sizeof(float)*length, cudaMemcpyHostToDevice);
cudaMemcpy(d_B, h_B, sizeof(float)*length, cudaMemcpyHostToDevice);

```

Initialize grids and threads. Since this is a 1D problem we will only work on the X axis, making Y = Z = 1.

```
// Each block will have 512 threads and as many blocks as needed
dim3 dimGrid((length + 511) / 512, 1, 1); // integer ceil(length/512) blocks; ceil(length/512) alone would truncate first
dim3 dimBlock(512, 1, 1); // 512 threads for each block
```

Implement and call the kernel.

```__global__ void addVector(float *d_A, float *d_B, float *d_C,int length )
{
int index = blockIdx.x * blockDim.x + threadIdx.x;
if (index < length)//we need to make sure our threads are within bounds
d_C[index]= d_A[index]+d_B[index];
}

//call to the kernel in the main
addVector <<< dimGrid , dimBlock >>> (d_A, d_B,d_C,length);
cudaDeviceSynchronize(); /* let's make sure all the threads finish here to avoid race conditions */
```

Transfer memory from d_C to h_C.

```         cudaMemcpy(h_C, d_C, sizeof(float)*length, cudaMemcpyDeviceToHost);
```

Show Result on Console.

```         for (int i=0; i<length;i++)
printf ("C[%i] = %f \n",i,h_C[i]);
```

Free host and device memories.

```        free(h_A);
free(h_B);
free(h_C);
cudaFree(d_A);
cudaFree(d_B);
cudaFree(d_C);
```

Also pretty straightforward.

Now… this code shows the result on the console, but for extremely large vectors that is not helpful at all.

Let’s workThatGPU! and measure the time with very big vectors, comparing our CUDA solution to the C solution.

## First, vectors with size 4096

CPU Execution time 0.000072 ms.
GPU Execution time 0.000270 ms.

The CPU time is way smaller than the GPU’s. Why? Because of the cudaMemcpy: actually transferring the memory back and forth takes more time than the computation itself.

## Vectors size 20000

CPU Execution time 0.000347 ms.
GPU Execution time 0.000471 ms.

Very similar times, but the CPU is still ahead.

## Vectors size 1000000

CPU Execution time 0.015172 ms.
GPU Execution time 0.009702 ms.

Wow, nice! Our GPU is about 1.56 times faster than our CPU!

## Vectors size 700000000!!

CPU Execution time 13.240476 ms.
GPU Execution time 0.035979 ms.

So for this vector size, our GPU is about 368 times faster than our CPU!! It makes sense that our GPU works best with very large amounts of data; when the data is small, just don’t even bother.