Multiplying two matrices in CUDA using BLAS…CUBLAS

In this post I'll write a simple program that multiplies two matrices using cuBLAS.

We already know the basics of CUDA; if not, you can check my previous post here. For the matrix operation we are going to use GEMM.

C = α·op(A)·op(B) + β·C, where α and β are scalars; here α is going to be 1 and β is going to be 0.
In the CUDA toolkit documentation, the GEMM operation is declared as the following:

cublasStatus_t cublasSgemm(cublasHandle_t handle,
                           cublasOperation_t transa,
                           cublasOperation_t transb,
                           int m,
                           int n,
                           int k,
                           const float *alpha,
                           const float *A,
                           int lda,
                           const float *B,
                           int ldb,
                           const float *beta,
                           float *C,
                           int ldc)


  • m is the number of rows of matrix op(A) and C.
  • n is the number of columns of matrix op(B) and C.
  • k is the number of columns of op(A) and rows of op(B).
  • transa is the operation op(A): non-transpose, transpose, or conjugate transpose.
  • transb is the operation op(B): non-transpose, transpose, or conjugate transpose.
  • A is an array of dimensions lda x k with lda >= max(1,m) if transa == CUBLAS_OP_N, and lda x m with lda >= max(1,k) otherwise.
  • B is an array of dimensions ldb x n with ldb >= max(1,k) if transb == CUBLAS_OP_N, and ldb x k with ldb >= max(1,n) otherwise.
  • lda is the leading dimension of the two-dimensional array used to store matrix A.
  • ldb is the leading dimension of the two-dimensional array used to store matrix B.
  • ldc is the leading dimension of the two-dimensional array used to store matrix C, with ldc >= max(1,m).

For more information, like return values and the many other operations, you can check the documentation here.

Now let us get to the code.
Before starting, let me declare some auxiliary functions just to make the code clearer.


void init_matrix(float *M, int hM, int wM, float k)
{
    int i, j;

    for (i = 0; i < hM; i++)
        for (j = 0; j < wM; j++)
            if (i == j)
                M[i*wM+j] = k*k*1.0f;
            else
                M[i*wM+j] = -k*1.0f;
}

With this first function I just initialize a matrix of dimensions hM × wM; k is an arbitrary value used to populate it (k² on the diagonal, −k everywhere else).

void print_matrix(float *M, int hM, int wM)
{
    int i, j;

    for (i = 0; i < hM; i++){
        for (j = 0; j < wM; j++)
            printf("%4.1f ", M[i*wM+j]);
        printf("\n");
    }
}

This function prints the matrix on screen.

For the multiplication, let us declare a function that accepts six parameters: two input matrices, one output matrix, the height and width of the first matrix, and the width of the second matrix.

void Mul(float* A, float* B, int hA, int wA, int wB, float* C){

//Now we have to allocate memory on the GPU for the 3 matrices
//and copy the two inputs over to the device

size_t size;

float* Ad;
size = hA * wA * sizeof(float);
cudaMalloc((void**)&Ad, size);
cudaMemcpy(Ad, A, size, cudaMemcpyHostToDevice);

float* Bd;
size = wA * wB * sizeof(float);
cudaMalloc((void**)&Bd, size);
cudaMemcpy(Bd, B, size, cudaMemcpyHostToDevice);

// Allocate C on the device
float* Cd;
size = hA * wB * sizeof(float);
cudaMalloc((void**)&Cd, size);

//declare alpha and beta
const float alpha = 1.0f;
const float beta  = 0.0f;
const float *alphac = &alpha;
const float *betac  = &beta;

Because our C arrays are stored row-major while cuBLAS assumes column-major storage, a row-major matrix reinterpreted as column-major is its transpose. Using the identity (A·B)ᵀ = Bᵀ·Aᵀ, we swap the operands so that GEMM computes Bᵀ·Aᵀ, which read back row-major is exactly A·B. The leading dimensions are then the row widths:

int lda = wA, ldb = wB, ldc = wB;

In order to use the cuBLAS methods we first need to create a handle, and when we no longer need cuBLAS we should destroy it.

For the handle creation:

cublasHandle_t handle;
cublasCreate(&handle);

The cuBLAS GEMM call (note the swapped operands, B first):

cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
            wB, hA, wA,
            alphac,
            Bd, ldb,
            Ad, lda,
            betac,
            Cd, ldc);

Remember that the cuBLAS methods WORK ON THE GPU, which means all input matrices must already be in device memory.

For the handle destruction:

cublasDestroy(handle);

Bring the output back to host memory:

cudaMemcpy(C, Cd, size, cudaMemcpyDeviceToHost);

Final touches, freeing the device memory and closing the function:

cudaFree(Ad);
cudaFree(Bd);
cudaFree(Cd);
}

So the core of the program is finished; now we need a main method to initialise some matrices and test.

int main(int argc, char *argv[])
{
	// Matrix variables
	float *A, *B, *C;   // matrix declarations
	int hA, wA, hB, wB; // declaration of dimensions

	setbuf(stdout, NULL);

	if (argc != 4){
		printf("./exec hA hB/wA wB\n");
		return 1;
	}

	// transform the input to integers
	hA = atoi(argv[1]);
	hB = wA = atoi(argv[2]);
	wB = atoi(argv[3]);

	// Init A and B, malloc C
	int size_A = wA * hA;
	A = (float*)malloc(size_A*sizeof(float));
	init_matrix(A, hA, wA, 2.0);

	int size_B = wB * hB;
	B = (float*)malloc(size_B*sizeof(float));
	init_matrix(B, hB, wB, 1.0);

	int size_C = wB * hA;
	C = (float*)malloc(size_C*sizeof(float));

	// matrix multiplication
	Mul(A, B, hA, wA, wB, C);

	// Print the 3 matrices on the console
	printf("\n\nMATRIX A\n"); print_matrix(A, hA, wA);
	printf("\n\nMATRIX B\n"); print_matrix(B, hB, wB);
	printf("\n\nMATRIX C\n"); print_matrix(C, hA, wB);

	// free the allocated memory
	free(A); free(B); free(C);

	return 0;
}

To compile, remember to add the flags -lcudart -lcublas; assuming the source file is named matmul.cu, the command would look something like: nvcc matmul.cu -o matmul -lcudart -lcublas

An example output of a 4×4 matrix multiplication is:


The full code can be found in my git repository; feel free to clone it:

git clone

