In this post I’m going to implement the Canny edge detector, the edge detection algorithm developed by John Canny.
I will provide multiple implementations, two of them in CUDA:
- The Canny edge detector CPU approach.
- The Canny edge detector GPU 1D approach.
- The Canny edge detector GPU 2D approach.
The Canny edge detection algorithm can be broken down into 5 steps:
- Apply Gaussian filter to smooth the image in order to remove the noise.
- Find the intensity gradients of the image.
- Apply non-maximum suppression to get rid of spurious response to edge detection.
- Apply double threshold to determine potential edges.
- Track edge by hysteresis: Finalise the detection of edges by suppressing all the other edges that are weak and not connected to strong edges.
An in-depth explanation can be found here.
I’m not going to derive or explain the equations; that is up to you, my dear reader. What I am going to do is walk you through my own implementation of this algorithm, step by step.
First we need to know where we want to go. Treating the Canny edge detector as a black box, I want something like this:

OK, the global process seems pretty straightforward: take an input image, process it, and get an output image.
Inside the box, the process is going to look something like this:
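As a minimal sketch in plain C (the helper function names here are hypothetical placeholders for the steps I implement below; the actual code in the post writes the loops out directly), the box could be organised like this:

#include <stdlib.h>

/* Hypothetical outline of the "box": im is the input image, image_out the result,
   level the threshold parameter used later for hysteresis.
   Declarations of the placeholder step functions are omitted. */
void canny(float *im, float *image_out, int level, int height, int width)
{
    float *NR    = malloc(sizeof(float) * width * height);  /* smoothed image       */
    float *Gx    = malloc(sizeof(float) * width * height);  /* horizontal gradient  */
    float *Gy    = malloc(sizeof(float) * width * height);  /* vertical gradient    */
    float *G     = malloc(sizeof(float) * width * height);  /* gradient magnitude   */
    float *phi   = malloc(sizeof(float) * width * height);  /* quantised direction  */
    int   *pedge = malloc(sizeof(int)   * width * height);  /* non-maximum suppression mask */

    gaussian_noise_reduction(NR, im, height, width);                     /* step 1 */
    intensity_gradient(G, Gx, Gy, phi, NR, height, width);               /* step 2 */
    non_maximum_suppression(pedge, G, phi, height, width);               /* step 3 */
    hysteresis_thresholding(image_out, pedge, G, level, height, width);  /* steps 4 and 5 */

    free(NR); free(Gx); free(Gy); free(G); free(phi); free(pedge);
}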
Canny edge detector CPU approach
Gaussian filter for noise reduction:
Supposing my input image has dimensions width × height, applying a 5×5 Gaussian filter for noise reduction can go like this:
for(i=2; i<height-2; i++)
    for(j=2; j<width-2; j++) {
        // Noise reduction
        NR[i*width+j] =
             ( 2.0*im[(i-2)*width+(j-2)] +  4.0*im[(i-2)*width+(j-1)] +  5.0*im[(i-2)*width+(j)] +  4.0*im[(i-2)*width+(j+1)] + 2.0*im[(i-2)*width+(j+2)]
             + 4.0*im[(i-1)*width+(j-2)] +  9.0*im[(i-1)*width+(j-1)] + 12.0*im[(i-1)*width+(j)] +  9.0*im[(i-1)*width+(j+1)] + 4.0*im[(i-1)*width+(j+2)]
             + 5.0*im[(i  )*width+(j-2)] + 12.0*im[(i  )*width+(j-1)] + 15.0*im[(i  )*width+(j)] + 12.0*im[(i  )*width+(j+1)] + 5.0*im[(i  )*width+(j+2)]
             + 4.0*im[(i+1)*width+(j-2)] +  9.0*im[(i+1)*width+(j-1)] + 12.0*im[(i+1)*width+(j)] +  9.0*im[(i+1)*width+(j+1)] + 4.0*im[(i+1)*width+(j+2)]
             + 2.0*im[(i+2)*width+(j-2)] +  4.0*im[(i+2)*width+(j-1)] +  5.0*im[(i+2)*width+(j)] +  4.0*im[(i+2)*width+(j+1)] + 2.0*im[(i+2)*width+(j+2)])
             / 159.0;
    }
In NR I’m storing the image after applying the Gaussian Filter.
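Equivalently, the same smoothing step can be written more compactly by keeping the 5×5 kernel in an array and applying it in a loop. This is just a sketch of an equivalent form, not the code I benchmark below:

// Equivalent, more compact form of the smoothing step above: the 5x5 Gaussian
// kernel is kept in an array and applied in a loop (sum of the weights is 159).
static const float gauss[5][5] = {
    {2,  4,  5,  4, 2},
    {4,  9, 12,  9, 4},
    {5, 12, 15, 12, 5},
    {4,  9, 12,  9, 4},
    {2,  4,  5,  4, 2}
};

for (int i = 2; i < height - 2; i++)
    for (int j = 2; j < width - 2; j++) {
        float acc = 0.0f;
        for (int ki = -2; ki <= 2; ki++)
            for (int kj = -2; kj <= 2; kj++)
                acc += gauss[ki + 2][kj + 2] * im[(i + ki) * width + (j + kj)];
        NR[i * width + j] = acc / 159.0f;
    }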
Finding the intensity gradient of the image and Non-maximum suppression.
To find the intensity gradient of the image we compute Gx and Gy, and to apply non-maximum suppression we need to determine whether the gradient magnitude G is a local maximum along the gradient direction Θ:
- if Θ = 0º the edge runs N-S: the pixel is an edge if G > Ge and G > Gw.
- if Θ = 90º the edge runs E-W: the pixel is an edge if G > Gn and G > Gs.
- if Θ = 135º the edge runs NE-SW: the pixel is an edge if G > Gnw and G > Gse.
- if Θ = 45º the edge runs NW-SE: the pixel is an edge if G > Gne and G > Gsw.
For this step I’ll be needing NR, the smoothed image from the previous step:
for(i=2; i<height-2; i++)
    for(j=2; j<width-2; j++) {
        // Intensity gradient of the image
        Gx[i*width+j] =
             (   1.0 *NR[(i-2)*width+(j-2)] +   2.0 *NR[(i-2)*width+(j-1)] +  (-2.0)*NR[(i-2)*width+(j+1)] + (-1.0)*NR[(i-2)*width+(j+2)]
             +   4.0 *NR[(i-1)*width+(j-2)] +   8.0 *NR[(i-1)*width+(j-1)] +  (-8.0)*NR[(i-1)*width+(j+1)] + (-4.0)*NR[(i-1)*width+(j+2)]
             +   6.0 *NR[(i  )*width+(j-2)] +  12.0 *NR[(i  )*width+(j-1)] + (-12.0)*NR[(i  )*width+(j+1)] + (-6.0)*NR[(i  )*width+(j+2)]
             +   4.0 *NR[(i+1)*width+(j-2)] +   8.0 *NR[(i+1)*width+(j-1)] +  (-8.0)*NR[(i+1)*width+(j+1)] + (-4.0)*NR[(i+1)*width+(j+2)]
             +   1.0 *NR[(i+2)*width+(j-2)] +   2.0 *NR[(i+2)*width+(j-1)] +  (-2.0)*NR[(i+2)*width+(j+1)] + (-1.0)*NR[(i+2)*width+(j+2)]);

        Gy[i*width+j] =
             ( (-1.0)*NR[(i-2)*width+(j-2)] + (-4.0)*NR[(i-2)*width+(j-1)] +  (-6.0)*NR[(i-2)*width+(j)] + (-4.0)*NR[(i-2)*width+(j+1)] + (-1.0)*NR[(i-2)*width+(j+2)]
             + (-2.0)*NR[(i-1)*width+(j-2)] + (-8.0)*NR[(i-1)*width+(j-1)] + (-12.0)*NR[(i-1)*width+(j)] + (-8.0)*NR[(i-1)*width+(j+1)] + (-2.0)*NR[(i-1)*width+(j+2)]
             +   2.0 *NR[(i+1)*width+(j-2)] +   8.0 *NR[(i+1)*width+(j-1)] +  12.0 *NR[(i+1)*width+(j)] +   8.0 *NR[(i+1)*width+(j+1)] +   2.0 *NR[(i+1)*width+(j+2)]
             +   1.0 *NR[(i+2)*width+(j-2)] +   4.0 *NR[(i+2)*width+(j-1)] +   6.0 *NR[(i+2)*width+(j)] +   4.0 *NR[(i+2)*width+(j+1)] +   1.0 *NR[(i+2)*width+(j+2)]);

        G[i*width+j]   = sqrtf((Gx[i*width+j]*Gx[i*width+j])+(Gy[i*width+j]*Gy[i*width+j])); // G = √(Gx²+Gy²)
        phi[i*width+j] = atan2f(fabs(Gy[i*width+j]), fabs(Gx[i*width+j]));

        if(fabs(phi[i*width+j]) <= PI/8)
            phi[i*width+j] = 0;
        else if (fabs(phi[i*width+j]) <= 3*(PI/8))
            phi[i*width+j] = 45;
        else if (fabs(phi[i*width+j]) <= 5*(PI/8))
            phi[i*width+j] = 90;
        else if (fabs(phi[i*width+j]) <= 7*(PI/8))
            phi[i*width+j] = 135;
        else
            phi[i*width+j] = 0;
    }
Double threshold and Edge tracking by hysteresis.
// Edge
for(i=3; i<height-3; i++)
    for(j=3; j<width-3; j++) {
        pedge[i*width+j] = 0;
        if(phi[i*width+j] == 0){
            if(G[i*width+j]>G[i*width+j+1] && G[i*width+j]>G[i*width+j-1]) // edge is in N-S
                pedge[i*width+j] = 1;
        } else if(phi[i*width+j] == 45) {
            if(G[i*width+j]>G[(i+1)*width+j+1] && G[i*width+j]>G[(i-1)*width+j-1]) // edge is in NW-SE
                pedge[i*width+j] = 1;
        } else if(phi[i*width+j] == 90) {
            if(G[i*width+j]>G[(i+1)*width+j] && G[i*width+j]>G[(i-1)*width+j]) // edge is in E-W
                pedge[i*width+j] = 1;
        } else if(phi[i*width+j] == 135) {
            if(G[i*width+j]>G[(i+1)*width+j-1] && G[i*width+j]>G[(i-1)*width+j+1]) // edge is in NE-SW
                pedge[i*width+j] = 1;
        }
    }

// Hysteresis Thresholding
lowthres = level/2;
hithres  = 2*(level);

for(i=3; i<height-3; i++)
    for(j=3; j<width-3; j++) {
        if(G[i*width+j]>hithres && pedge[i*width+j])
            image_out[i*width+j] = 255;
        else if(pedge[i*width+j] && G[i*width+j]>=lowthres && G[i*width+j]<hithres)
            // check neighbours 3x3
            for (ii=-1; ii<=1; ii++)
                for (jj=-1; jj<=1; jj++)
                    if (G[(i+ii)*width+j+jj]>hithres)
                        image_out[i*width+j] = 255;
    }
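Note that both thresholds come from a single level parameter: lowthres = level/2 and hithres = 2*level. With level = 50, for example, a pixel with G > 100 that survived the suppression step becomes an edge immediately, while a pixel with 25 ≤ G < 100 is only kept if at least one of its 3×3 neighbours has a gradient magnitude above the high threshold.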
The entire code implementing the CPU approach can be found in my git repository:
git clone https://adri1988@bitbucket.org/adri1988/cannyedgedetectorcpu.git
The Canny edge detector GPU 1D approach.
We already know all the theory, so I’m going to jump right into the kernels.
I’ve coded four different kernels, one for each step of the algorithm.
Noise reduction
__global__ void noiseReduction1d(float *d_NR, float *d_im, int height, int width)
{
    int j = blockIdx.x * blockDim.x + threadIdx.x;

    if (j>=2 && j < width-2)
        for (int i=2; i<height-2; i++){
            d_NR[i*width+j] =
                 ( 2.0*d_im[(i-2)*width+(j-2)] +  4.0*d_im[(i-2)*width+(j-1)] +  5.0*d_im[(i-2)*width+(j)] +  4.0*d_im[(i-2)*width+(j+1)] + 2.0*d_im[(i-2)*width+(j+2)]
                 + 4.0*d_im[(i-1)*width+(j-2)] +  9.0*d_im[(i-1)*width+(j-1)] + 12.0*d_im[(i-1)*width+(j)] +  9.0*d_im[(i-1)*width+(j+1)] + 4.0*d_im[(i-1)*width+(j+2)]
                 + 5.0*d_im[(i  )*width+(j-2)] + 12.0*d_im[(i  )*width+(j-1)] + 15.0*d_im[(i  )*width+(j)] + 12.0*d_im[(i  )*width+(j+1)] + 5.0*d_im[(i  )*width+(j+2)]
                 + 4.0*d_im[(i+1)*width+(j-2)] +  9.0*d_im[(i+1)*width+(j-1)] + 12.0*d_im[(i+1)*width+(j)] +  9.0*d_im[(i+1)*width+(j+1)] + 4.0*d_im[(i+1)*width+(j+2)]
                 + 2.0*d_im[(i+2)*width+(j-2)] +  4.0*d_im[(i+2)*width+(j-1)] +  5.0*d_im[(i+2)*width+(j)] +  4.0*d_im[(i+2)*width+(j+1)] + 2.0*d_im[(i+2)*width+(j+2)])
                 / 159.0;
        }
}
Finding the intensity gradient of the image and Non-maximum suppression
__global__ void intensityGradienteIm1d(float *d_G, float *d_Gx, float *d_Gy, float *d_phi, float *d_NR, int height, int width)
{
    float PI = 3.141593;
    int j = blockIdx.x * blockDim.x + threadIdx.x;

    if (j>=2 && j < width-2)
        for (int i=2; i<height-2; i++){
            // Intensity gradient of the image
            d_Gx[i*width+j] =
                 (   1.0 *d_NR[(i-2)*width+(j-2)] +   2.0 *d_NR[(i-2)*width+(j-1)] +  (-2.0)*d_NR[(i-2)*width+(j+1)] + (-1.0)*d_NR[(i-2)*width+(j+2)]
                 +   4.0 *d_NR[(i-1)*width+(j-2)] +   8.0 *d_NR[(i-1)*width+(j-1)] +  (-8.0)*d_NR[(i-1)*width+(j+1)] + (-4.0)*d_NR[(i-1)*width+(j+2)]
                 +   6.0 *d_NR[(i  )*width+(j-2)] +  12.0 *d_NR[(i  )*width+(j-1)] + (-12.0)*d_NR[(i  )*width+(j+1)] + (-6.0)*d_NR[(i  )*width+(j+2)]
                 +   4.0 *d_NR[(i+1)*width+(j-2)] +   8.0 *d_NR[(i+1)*width+(j-1)] +  (-8.0)*d_NR[(i+1)*width+(j+1)] + (-4.0)*d_NR[(i+1)*width+(j+2)]
                 +   1.0 *d_NR[(i+2)*width+(j-2)] +   2.0 *d_NR[(i+2)*width+(j-1)] +  (-2.0)*d_NR[(i+2)*width+(j+1)] + (-1.0)*d_NR[(i+2)*width+(j+2)]);

            d_Gy[i*width+j] =
                 ( (-1.0)*d_NR[(i-2)*width+(j-2)] + (-4.0)*d_NR[(i-2)*width+(j-1)] +  (-6.0)*d_NR[(i-2)*width+(j)] + (-4.0)*d_NR[(i-2)*width+(j+1)] + (-1.0)*d_NR[(i-2)*width+(j+2)]
                 + (-2.0)*d_NR[(i-1)*width+(j-2)] + (-8.0)*d_NR[(i-1)*width+(j-1)] + (-12.0)*d_NR[(i-1)*width+(j)] + (-8.0)*d_NR[(i-1)*width+(j+1)] + (-2.0)*d_NR[(i-1)*width+(j+2)]
                 +   2.0 *d_NR[(i+1)*width+(j-2)] +   8.0 *d_NR[(i+1)*width+(j-1)] +  12.0 *d_NR[(i+1)*width+(j)] +   8.0 *d_NR[(i+1)*width+(j+1)] +   2.0 *d_NR[(i+1)*width+(j+2)]
                 +   1.0 *d_NR[(i+2)*width+(j-2)] +   4.0 *d_NR[(i+2)*width+(j-1)] +   6.0 *d_NR[(i+2)*width+(j)] +   4.0 *d_NR[(i+2)*width+(j+1)] +   1.0 *d_NR[(i+2)*width+(j+2)]);

            __syncthreads();

            d_G[i*width+j]   = sqrtf((d_Gx[i*width+j]*d_Gx[i*width+j])+(d_Gy[i*width+j]*d_Gy[i*width+j])); // G = √(Gx²+Gy²)
            d_phi[i*width+j] = atan2f(fabs(d_Gy[i*width+j]), fabs(d_Gx[i*width+j]));

            __syncthreads();

            if(fabs(d_phi[i*width+j]) <= PI/8)
                d_phi[i*width+j] = 0;
            else if (fabs(d_phi[i*width+j]) <= 3*(PI/8))
                d_phi[i*width+j] = 45;
            else if (fabs(d_phi[i*width+j]) <= 5*(PI/8))
                d_phi[i*width+j] = 90;
            else if (fabs(d_phi[i*width+j]) <= 7*(PI/8))
                d_phi[i*width+j] = 135;
            else
                d_phi[i*width+j] = 0;
        }
}
Double threshold
__global__ void edges1d(int *d_pedge, float *d_G, float *d_phi, int height, int width)
{
    int j = blockIdx.x * blockDim.x + threadIdx.x;

    // Edge
    if (j>=2 && j < width-2)
        for (int i=2; i<height-2; i++){
            d_pedge[i*width+j] = 0;
            if(d_phi[i*width+j] == 0){
                if(d_G[i*width+j]>d_G[i*width+j+1] && d_G[i*width+j]>d_G[i*width+j-1]) // edge is in N-S
                    d_pedge[i*width+j] = 1;
            } else if(d_phi[i*width+j] == 45) {
                if(d_G[i*width+j]>d_G[(i+1)*width+j+1] && d_G[i*width+j]>d_G[(i-1)*width+j-1]) // edge is in NW-SE
                    d_pedge[i*width+j] = 1;
            } else if(d_phi[i*width+j] == 90) {
                if(d_G[i*width+j]>d_G[(i+1)*width+j] && d_G[i*width+j]>d_G[(i-1)*width+j]) // edge is in E-W
                    d_pedge[i*width+j] = 1;
            } else if(d_phi[i*width+j] == 135) {
                if(d_G[i*width+j]>d_G[(i+1)*width+j-1] && d_G[i*width+j]>d_G[(i-1)*width+j+1]) // edge is in NE-SW
                    d_pedge[i*width+j] = 1;
            }
        }
}
Edge tracking by hysteresis
__global__ void hysteresisThresholding1d(float *d_image_out, int level, int *d_pedge, float *d_G, int height, int width)
{
    int j = blockIdx.x * blockDim.x + threadIdx.x;
    float lowthres, hithres;

    // Hysteresis Thresholding
    lowthres = level/2;
    hithres  = 2*(level);

    if (j<width)
        for (int i=0; i<height; i++)
            d_image_out[i*width+j] = 0.0; // initialise the output to 0

    __syncthreads();

    if (j>=3 && j < width-3)
        for (int i=3; i<height-3; i++){
            if(d_G[i*width+j]>hithres && d_pedge[i*width+j])
                d_image_out[i*width+j] = 255;
            else if(d_pedge[i*width+j] && d_G[i*width+j]>=lowthres && d_G[i*width+j]<hithres)
                // check neighbours 3x3
                for (int ii=-1; ii<=1; ii++)
                    for (int jj=-1; jj<=1; jj++)
                        if (d_G[(i+ii)*width+j+jj]>hithres)
                            d_image_out[i*width+j] = 255;
        }
}
So basically all I did was remove the inner loop over the columns: each thread now processes one column of the image, and the parallelism does the rest.
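For completeness, here is a sketch of how the host side could allocate the device buffers and launch these four kernels with one thread per column. The block size of 256 and the exact call sequence are my own assumptions, not necessarily what the repository does; im and image_out are assumed to be host buffers and level the threshold parameter:

// Sketch of a possible host-side driver for the 1D kernels (assumed, not the repo's code).
size_t size_f = width * height * sizeof(float);

float *d_im, *d_NR, *d_G, *d_Gx, *d_Gy, *d_phi, *d_image_out;
int   *d_pedge;
cudaMalloc((void**)&d_im, size_f);   cudaMalloc((void**)&d_NR, size_f);
cudaMalloc((void**)&d_G, size_f);    cudaMalloc((void**)&d_Gx, size_f);
cudaMalloc((void**)&d_Gy, size_f);   cudaMalloc((void**)&d_phi, size_f);
cudaMalloc((void**)&d_image_out, size_f);
cudaMalloc((void**)&d_pedge, width * height * sizeof(int));

cudaMemcpy(d_im, im, size_f, cudaMemcpyHostToDevice);

int threadsPerBlock = 256;                                    // assumed block size
int blocks = (width + threadsPerBlock - 1) / threadsPerBlock; // one thread per column

noiseReduction1d<<<blocks, threadsPerBlock>>>(d_NR, d_im, height, width);
intensityGradienteIm1d<<<blocks, threadsPerBlock>>>(d_G, d_Gx, d_Gy, d_phi, d_NR, height, width);
edges1d<<<blocks, threadsPerBlock>>>(d_pedge, d_G, d_phi, height, width);
hysteresisThresholding1d<<<blocks, threadsPerBlock>>>(d_image_out, level, d_pedge, d_G, height, width);

cudaMemcpy(image_out, d_image_out, size_f, cudaMemcpyDeviceToHost);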
You can clone my entire project:
git clone https://adri1988@bitbucket.org/adri1988/cannyedgedetectorcpuandgpu1d.git
Executing this code on a 1920×1200 image gives a speed-up of 5.01x over the CPU version, which is pretty cool.
The Canny edge detector GPU 2D approach.
The same as the 1D approach: four kernels, but now both loops are parallelised.
Noise reduction
__global__ void noiseReduction2d(float *d_NR, float *d_im, int height, int width)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;

    if (j>=2 && j<width-2 && i>=2 && i<height-2){
        d_NR[i*width+j] =
             ( 2.0*d_im[(i-2)*width+(j-2)] +  4.0*d_im[(i-2)*width+(j-1)] +  5.0*d_im[(i-2)*width+(j)] +  4.0*d_im[(i-2)*width+(j+1)] + 2.0*d_im[(i-2)*width+(j+2)]
             + 4.0*d_im[(i-1)*width+(j-2)] +  9.0*d_im[(i-1)*width+(j-1)] + 12.0*d_im[(i-1)*width+(j)] +  9.0*d_im[(i-1)*width+(j+1)] + 4.0*d_im[(i-1)*width+(j+2)]
             + 5.0*d_im[(i  )*width+(j-2)] + 12.0*d_im[(i  )*width+(j-1)] + 15.0*d_im[(i  )*width+(j)] + 12.0*d_im[(i  )*width+(j+1)] + 5.0*d_im[(i  )*width+(j+2)]
             + 4.0*d_im[(i+1)*width+(j-2)] +  9.0*d_im[(i+1)*width+(j-1)] + 12.0*d_im[(i+1)*width+(j)] +  9.0*d_im[(i+1)*width+(j+1)] + 4.0*d_im[(i+1)*width+(j+2)]
             + 2.0*d_im[(i+2)*width+(j-2)] +  4.0*d_im[(i+2)*width+(j-1)] +  5.0*d_im[(i+2)*width+(j)] +  4.0*d_im[(i+2)*width+(j+1)] + 2.0*d_im[(i+2)*width+(j+2)])
             / 159.0;
    }
}
Finding the intensity gradient of the image and Non-maximum suppression
// Note: the kernel name and parameter list below are assumed to mirror intensityGradienteIm1d,
// since the original snippet omitted the signature.
__global__ void intensityGradienteIm2d(float *d_G, float *d_Gx, float *d_Gy, float *d_phi, float *d_NR, int height, int width)
{
    float PI = 3.141593;
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;

    if (j>=2 && j<width-2 && i>=2 && i<height-2) {
        // Intensity gradient of the image
        d_Gx[i*width+j] =
             (   1.0 *d_NR[(i-2)*width+(j-2)] +   2.0 *d_NR[(i-2)*width+(j-1)] +  (-2.0)*d_NR[(i-2)*width+(j+1)] + (-1.0)*d_NR[(i-2)*width+(j+2)]
             +   4.0 *d_NR[(i-1)*width+(j-2)] +   8.0 *d_NR[(i-1)*width+(j-1)] +  (-8.0)*d_NR[(i-1)*width+(j+1)] + (-4.0)*d_NR[(i-1)*width+(j+2)]
             +   6.0 *d_NR[(i  )*width+(j-2)] +  12.0 *d_NR[(i  )*width+(j-1)] + (-12.0)*d_NR[(i  )*width+(j+1)] + (-6.0)*d_NR[(i  )*width+(j+2)]
             +   4.0 *d_NR[(i+1)*width+(j-2)] +   8.0 *d_NR[(i+1)*width+(j-1)] +  (-8.0)*d_NR[(i+1)*width+(j+1)] + (-4.0)*d_NR[(i+1)*width+(j+2)]
             +   1.0 *d_NR[(i+2)*width+(j-2)] +   2.0 *d_NR[(i+2)*width+(j-1)] +  (-2.0)*d_NR[(i+2)*width+(j+1)] + (-1.0)*d_NR[(i+2)*width+(j+2)]);

        d_Gy[i*width+j] =
             ( (-1.0)*d_NR[(i-2)*width+(j-2)] + (-4.0)*d_NR[(i-2)*width+(j-1)] +  (-6.0)*d_NR[(i-2)*width+(j)] + (-4.0)*d_NR[(i-2)*width+(j+1)] + (-1.0)*d_NR[(i-2)*width+(j+2)]
             + (-2.0)*d_NR[(i-1)*width+(j-2)] + (-8.0)*d_NR[(i-1)*width+(j-1)] + (-12.0)*d_NR[(i-1)*width+(j)] + (-8.0)*d_NR[(i-1)*width+(j+1)] + (-2.0)*d_NR[(i-1)*width+(j+2)]
             +   2.0 *d_NR[(i+1)*width+(j-2)] +   8.0 *d_NR[(i+1)*width+(j-1)] +  12.0 *d_NR[(i+1)*width+(j)] +   8.0 *d_NR[(i+1)*width+(j+1)] +   2.0 *d_NR[(i+1)*width+(j+2)]
             +   1.0 *d_NR[(i+2)*width+(j-2)] +   4.0 *d_NR[(i+2)*width+(j-1)] +   6.0 *d_NR[(i+2)*width+(j)] +   4.0 *d_NR[(i+2)*width+(j+1)] +   1.0 *d_NR[(i+2)*width+(j+2)]);

        d_G[i*width+j]   = sqrtf((d_Gx[i*width+j]*d_Gx[i*width+j])+(d_Gy[i*width+j]*d_Gy[i*width+j])); // G = √(Gx²+Gy²)
        d_phi[i*width+j] = atan2f(fabs(d_Gy[i*width+j]), fabs(d_Gx[i*width+j]));

        if(fabs(d_phi[i*width+j]) <= PI/8)
            d_phi[i*width+j] = 0;
        else if (fabs(d_phi[i*width+j]) <= 3*(PI/8))
            d_phi[i*width+j] = 45;
        else if (fabs(d_phi[i*width+j]) <= 5*(PI/8))
            d_phi[i*width+j] = 90;
        else if (fabs(d_phi[i*width+j]) <= 7*(PI/8))
            d_phi[i*width+j] = 135;
        else
            d_phi[i*width+j] = 0;
    }
}
Double threshold
__global__ void edges2d(int *d_pedge, float *d_G, float *d_phi, int height, int width)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;

    // Edge
    if (j>=2 && j<width-2 && i>=2 && i<height-2){
        d_pedge[i*width+j] = 0;
        if(d_phi[i*width+j] == 0){
            if(d_G[i*width+j]>d_G[i*width+j+1] && d_G[i*width+j]>d_G[i*width+j-1]) // edge is in N-S
                d_pedge[i*width+j] = 1;
        } else if(d_phi[i*width+j] == 45) {
            if(d_G[i*width+j]>d_G[(i+1)*width+j+1] && d_G[i*width+j]>d_G[(i-1)*width+j-1]) // edge is in NW-SE
                d_pedge[i*width+j] = 1;
        } else if(d_phi[i*width+j] == 90) {
            if(d_G[i*width+j]>d_G[(i+1)*width+j] && d_G[i*width+j]>d_G[(i-1)*width+j]) // edge is in E-W
                d_pedge[i*width+j] = 1;
        } else if(d_phi[i*width+j] == 135) {
            if(d_G[i*width+j]>d_G[(i+1)*width+j-1] && d_G[i*width+j]>d_G[(i-1)*width+j+1]) // edge is in NE-SW
                d_pedge[i*width+j] = 1;
        }
    }
}
Edge tracking by hysteresis
__global__ void hysteresisThresholding2d(float *d_image_out, int level, int *d_pedge, float *d_G, int height, int width)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    float lowthres, hithres;

    // Hysteresis Thresholding
    lowthres = level/2;
    hithres  = 2*(level);

    if (j<width && i<height)
        d_image_out[i*width+j] = 0.0; // initialise the output to 0

    __syncthreads();

    if (j>=3 && j<width-3 && i>=3 && i<height-3) {
        if(d_G[i*width+j]>hithres && d_pedge[i*width+j])
            d_image_out[i*width+j] = 255;
        else if(d_pedge[i*width+j] && d_G[i*width+j]>=lowthres && d_G[i*width+j]<hithres)
            // check neighbours 3x3
            for (int ii=-1; ii<=1; ii++)
                for (int jj=-1; jj<=1; jj++)
                    if (d_G[(i+ii)*width+j+jj]>hithres)
                        d_image_out[i*width+j] = 255;
    }
}
Now, with both loops executing in parallel, things get interesting.
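As a reference, here is one possible launch configuration for the 2D kernels, consistent with the index mapping used above (the x dimension walks the rows and the y dimension the columns). The 16×16 block size is my own assumption, and intensityGradienteIm2d is the kernel name I reconstructed above, since the original snippet omits its signature:

// Possible 2D launch configuration (assumed). Note the mapping used by the kernels:
// blockIdx.x/threadIdx.x index the rows (height), blockIdx.y/threadIdx.y the columns (width).
dim3 threadsPerBlock(16, 16);
dim3 blocks((height + threadsPerBlock.x - 1) / threadsPerBlock.x,
            (width  + threadsPerBlock.y - 1) / threadsPerBlock.y);

noiseReduction2d<<<blocks, threadsPerBlock>>>(d_NR, d_im, height, width);
intensityGradienteIm2d<<<blocks, threadsPerBlock>>>(d_G, d_Gx, d_Gy, d_phi, d_NR, height, width);
edges2d<<<blocks, threadsPerBlock>>>(d_pedge, d_G, d_phi, height, width);
hysteresisThresholding2d<<<blocks, threadsPerBlock>>>(d_image_out, level, d_pedge, d_G, height, width);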
You can clone the 2D approach:
git clone https://adri1988@bitbucket.org/adri1988/cannyedgedetectotgpu2d.git
Executing this code on a 1920×1200 image, the speed-up is even bigger: 9.4x faster than the CPU version and about 2x faster than the 1D version.
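If you want to reproduce these measurements, a common way to time a kernel is with CUDA events; this is only an illustration and not necessarily how the numbers above were obtained:

// Timing a single kernel with CUDA events (illustrative only).
cudaEvent_t start, stop;
float ms = 0.0f;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);
noiseReduction2d<<<blocks, threadsPerBlock>>>(d_NR, d_im, height, width);
cudaEventRecord(stop);
cudaEventSynchronize(stop);

cudaEventElapsedTime(&ms, start, stop);   // elapsed time in milliseconds
printf("noiseReduction2d: %.3f ms\n", ms);

cudaEventDestroy(start);
cudaEventDestroy(stop);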