Many users struggle with this topic. In CUDA it is important to choose well how many blocks and threads we launch, so let’s start:
What are grids and thread blocks?
In CUDA we have to define the shape of our grid, that is, how many blocks it contains, and how many threads go inside each block.
Kernels are executed on a grid of thread blocks, just like in the following image:

Identify the dimension of the problem
There are multiple ways to do this, but let’s try to make it simple.
In CUDA, and in GPU programming in general, a launch has from 1 to 3 dimensions (X, Y and Z); anything else is not a valid launch configuration.
So, for example, if we have an array, our problem is one-dimensional; a matrix is a two-dimensional problem, and a cube or parallelepiped is a three-dimensional problem.
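As a small sketch of how that choice shows up in code: CUDA describes both the grid and the block with dim3, and any component we leave out defaults to 1. The sizes below (a 1024-element array, a 64 x 32 matrix, a 16 x 16 x 16 volume) and the helper totalThreads are made up just for illustration.

```cuda
#include <cuda_runtime.h>

// Hypothetical helper: total number of threads in a launch
// (blocks in the grid times threads in each block).
unsigned long long totalThreads(dim3 grid, dim3 block) {
    unsigned long long blocks  = (unsigned long long)grid.x * grid.y * grid.z;
    unsigned long long threads = (unsigned long long)block.x * block.y * block.z;
    return blocks * threads;
}

// 1D: only X is given, Y and Z default to 1.
dim3 arrayGrid(4), arrayBlock(256);              // 4 * 256 = 1024 threads
// 2D: X and Y are given, Z defaults to 1.
dim3 matrixGrid(4, 2), matrixBlock(16, 16);      // 64 x 32 threads
// 3D: all three axes are used.
dim3 volumeGrid(2, 2, 2), volumeBlock(8, 8, 8);  // 16 x 16 x 16 threads
```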
One-dimensional problems
These are the easiest ones: we only need to work on the X axis, and our problem basically reduces to the length of the array.
Let’s say we have an array of length 1 million:
We have to decide how many blocks we want in the grid and how many threads we want per block.
First, remember that on most devices the maximum number of threads per block is 512 or 1024; anything bigger than this will make the kernel launch fail with an error.
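Instead of remembering the number, we can ask the device. This is only a sketch: cudaGetDeviceProperties and the maxThreadsPerBlock field come from the CUDA runtime, while the two helper names are mine. Note that fitsInBlock only checks the total thread count of a block, which is the limit discussed above.

```cuda
#include <cuda_runtime.h>

// Hypothetical helper: true when the total number of threads in a block
// (x * y * z) stays within the given per-block limit.
bool fitsInBlock(dim3 block, int maxThreadsPerBlock) {
    unsigned long long threads =
        (unsigned long long)block.x * block.y * block.z;
    return threads <= (unsigned long long)maxThreadsPerBlock;
}

// Hypothetical helper: asks the runtime for the limit of one device.
// Returns -1 if the query fails (for example, no CUDA device is present).
int queryMaxThreadsPerBlock(int device) {
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, device) != cudaSuccess) {
        return -1;
    }
    return prop.maxThreadsPerBlock;
}
```

For example, with a limit of 1024 a block of 32 x 32 threads fits, but a block of 64 x 32 threads does not, even though neither side is bigger than 1024.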
After deciding the threads per block (I’m going to use 1024), the number of blocks in the grid is ceil(1M / 1024) = 977.
Our declaration and call to the kernel will look something like this:
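// Integer ceiling division: 1000000/1024 alone truncates to 976 and would leave the end of the array uncovered.
dim3 gridBlocks((1000000 + 1024 - 1) / 1024, 1, 1); // 977 blocks
dim3 threadsPerBlock(1024, 1, 1);
kernel<<<gridBlocks, threadsPerBlock>>>(arg1, arg2...);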
This is pretty straightforward: the Y and Z dimensions are 1 since we are only working on the X axis.
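To make the one-dimensional case concrete, here is a minimal sketch of the kernel side as well. The kernel scaleByTwo and the helpers blocksFor and launchScaleByTwo are hypothetical names, not part of CUDA; the point is the index computation and the bounds check.

```cuda
#include <cuda_runtime.h>

// Integer ceiling division: smallest number of blocks that covers n elements.
int blocksFor(int n, int threadsPerBlock) {
    return (n + threadsPerBlock - 1) / threadsPerBlock;
}

// Hypothetical kernel: doubles every element of an array of length n.
__global__ void scaleByTwo(float *data, int n) {
    // Global index of this thread along the X axis.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // The last block may reach past the end of the array, so guard the access.
    if (i < n) {
        data[i] *= 2.0f;
    }
}

// Launches the kernel over a device array of length n.
void launchScaleByTwo(float *deviceData, int n) {
    dim3 threadsPerBlock(1024, 1, 1);
    dim3 gridBlocks(blocksFor(n, 1024), 1, 1);
    scaleByTwo<<<gridBlocks, threadsPerBlock>>>(deviceData, n);
}
```

With 1 million elements this launches 977 blocks of 1024 threads, that is 1,000,448 threads, so the last 448 threads fail the `i < n` check and do nothing.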
Two-dimensional problems
These might be the most common: we need to work on the X and Y axes.
Let us divide this into two examples,
a square and a rectangle. The first is a special case of the second: when both sides of the rectangle have the same length, it is a square (duh!).
Square:
Let us assume we have a square image with A x A pixels, so we need to create a grid that covers every pixel with as few threads as possible, in order to use them in the most efficient way.
First we need to set the size of our grid. This is problem-specific, but we can look at a few examples.
For very small values of A (A > 8 and A < 64):
We can divide the square into 4 sub-squares. For example, if A = 16 we have the following layout:

So our grid will be 2 x 2 and each block will have 8 x 8 threads.
We can code that as follows:
dim3 gridBlocks(2, 2, 1);      // 2 x 2 grid
dim3 threadsPerBlock(8, 8, 1); // each block has 8 x 8 threads
kernel<<<gridBlocks, threadsPerBlock>>>(arg1, arg2...);
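Inside the kernel, each thread has to turn its block and thread indices into a pixel position. The following is a sketch under my own assumptions: the image is 8-bit grayscale, stored row-major, with X mapped to columns and Y mapped to rows. The names pixelOffset, invert and launchInvert16 are hypothetical.

```cuda
#include <cuda_runtime.h>

// Hypothetical helper, usable on host and device: linear offset of pixel
// (row, col) in a row-major image that is `width` pixels wide.
__host__ __device__ inline int pixelOffset(int row, int col, int width) {
    return row * width + col;
}

// Hypothetical kernel: inverts an A x A 8-bit grayscale image.
__global__ void invert(unsigned char *image, int A) {
    int col = blockIdx.x * blockDim.x + threadIdx.x; // X axis -> column
    int row = blockIdx.y * blockDim.y + threadIdx.y; // Y axis -> row
    if (row < A && col < A) {
        int offset = pixelOffset(row, col, A);
        image[offset] = 255 - image[offset];
    }
}

// Launch for A = 16: a 2 x 2 grid of 8 x 8 blocks is exactly 16 x 16 threads.
void launchInvert16(unsigned char *deviceImage) {
    dim3 gridBlocks(2, 2, 1);
    dim3 threadsPerBlock(8, 8, 1);
    invert<<<gridBlocks, threadsPerBlock>>>(deviceImage, 16);
}
```

For instance, thread (3, 2) of block (1, 0) gets col = 1 * 8 + 3 = 11 and row = 0 * 8 + 2 = 2, so it works on offset 2 * 16 + 11 = 43.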
For medium sizes (A > 64 and A < 512):
We can divide the square into 16 sub-squares:

This one is 32 x 32; I’m using a small grid just for illustration purposes.
We can code this as follows:
dim3 gridBlocks(4, 4, 1);      // 4 x 4 grid
dim3 threadsPerBlock(8, 8, 1); // each block has 8 x 8 threads
kernel<<<gridBlocks, threadsPerBlock>>>(arg1, arg2...);
These examples are just illustrations, not rules by any means. Remember that the number of blocks and threads depends on the problem itself.
Rectangle
So let us assume we have an image that is shaped like a rectangle with A x B pixels and A != B.
Assuming A and B are big enough for illustration purposes, I will set two cases, A < B and A > B.
A < B:
We get the following layout:

This can be coded as follows:
dim3 gridBlocks(4, 2, 1);      // 4 x 2 grid
dim3 threadsPerBlock(8, 8, 1); // each block has 8 x 8 threads
kernel<<<gridBlocks, threadsPerBlock>>>(arg1, arg2...);
A > B:
We get the following layout:

This can be coded as follows:
dim3 gridBlocks(2, 4, 1);      // 2 x 4 grid
dim3 threadsPerBlock(8, 8, 1); // each block has 8 x 8 threads
kernel<<<gridBlocks, threadsPerBlock>>>(arg1, arg2...);
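Both rectangle cases can be handled by computing the grid from the image size instead of hard-coding it. This is a sketch, assuming the X axis runs along the width and the Y axis along the height of a row-major 8-bit image; gridFor, fill and launchFill are hypothetical names. Under that assumption, a 32 x 16 image gives the 4 x 2 grid and a 16 x 32 image gives the 2 x 4 grid.

```cuda
#include <cuda_runtime.h>

// Hypothetical helper: grid that covers a width x height image with the
// given block shape, rounding up on each axis independently.
dim3 gridFor(int width, int height, dim3 block) {
    return dim3((width  + block.x - 1) / block.x,
                (height + block.y - 1) / block.y,
                1);
}

// Hypothetical kernel: sets every pixel of a width x height image to `value`.
__global__ void fill(unsigned char *image, int width, int height,
                     unsigned char value) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    // When a side is not a multiple of the block size, the blocks on the
    // right and bottom edges contain threads that fall outside the image.
    if (col < width && row < height) {
        image[row * width + col] = value;
    }
}

// Launches the kernel with 8 x 8 blocks over a device image.
void launchFill(unsigned char *deviceImage, int width, int height,
                unsigned char value) {
    dim3 threadsPerBlock(8, 8, 1);
    dim3 gridBlocks = gridFor(width, height, threadsPerBlock);
    fill<<<gridBlocks, threadsPerBlock>>>(deviceImage, width, height, value);
}
```

When the sides are not multiples of 8, for example a 30 x 20 image, the same helper rounds up to a 4 x 3 grid and the bounds check in the kernel takes care of the extra threads.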