# Multiplying two matrices in CUDA using BLAS…CUBLAS

On this post I’ll be making a simple program, multiplying two matrices using BLAS.

We already know the basics of CUDA, if not you can check me previous post here. To the matrices operation we are going to use the GEMM operation.

C = αop ( A ) op ( B ) + βC, Were α and β are scalar numbers α is going to be 1 and β is going to be 0.
In the CUDA toolkit documentation, the GEMM operation is declared as the following:

```cublasStatus_t cublasSgemm(cublasHandle_t handle,
cublasOperation_t transa,
cublasOperation_t transb,
int m,
int n,
int k,
const float *alpha,
const float *A,
int lda,
const float *B,
int ldb,
const float *beta,
float *C,
int ldc)
```

where

• M is number of rows of matrix op(A) and C.
• N is number of columns of matrix op(B) and C.
• K is number of columns of op(A) and rows of op(B).
• transa operation op(A) that is non- or (conj.) transpose.
• transb operation op(B) that is non- or (conj.) transpose.
• A array of dimensions lda x k with lda>=max(1,m) if transa == CUBLAS_OP_N and lda x m with lda>=max(1,k) otherwise.
• B array of dimension ldb x n with ldb>=max(1,k) if transa == CUBLAS_OP_N and ldb x k with ldb>=max(1,n) otherwise.
• Lda leading dimension of two-dimensional array used to store the matrix A.
• ldb leading dimension of two-dimensional array used to store matrix B.

Now let’s us go to the code:
Before starting let me declare some auxiliary functions just to make my code clearer.

```
void init_matrix(float *M, int hM, int wM, float k)
{
int i,j;

for (i=0; i<hM; i++)
for (j=0; j<wM; j++)
if (i==j)
M[i*wM+j] = k*k*1.0f;
else
M[i*wM+j] = -k*1.0f;
}

```

This first function I just initialize a matrix with hM and wM dimensions the k value is just a random for the matrix to be populated with

```void print_matrix(float *M, int hM, int wM)
{
int i,j;

for (i=0; i<hM; i++){
for (j=0; j<wM; j++)
printf(&quot;%4.1f &quot;, M[i*wM+j]);
printf(&quot;\n&quot;);
}
}

```

this functions prints the matrix on screen

For the multiplication, let us declare a function, which accept 6 parameters, 2 inputs matrices, 1 output matrix, height and width of the first matrix and the width of the second matrix.

```void Mul(float* A, float* B, int hA, int wA, int wB, float* C){

//Now we have to allocate memory on GPU for the 3 matrices

size = hA * wA * sizeof(float);

float* Bd;
size = wA * wB * sizeof(float);
cudaMalloc((void**)&amp;Bd, size);
cudaMemcpy(Bd,B,size,cudaMemcpyHostToDevice);

// Allocate C on the device
float* Cd;
size = hA * wB * sizeof(float);
cudaMalloc((void**)&amp;Cd, size);

//declare the alfa and beta
const float alfa = 1;
const float beta = 0;
const float *alphac = &amp;alfa;
const float *betac = &amp;beta;

int lda=wA,ldb=hA,ldc=wA;

```

in order to use the CUBLAS methods first we need to create a handle and when we no loger need to use the CUBLAS method we should destroy it.

For the handle creation

```
cublasHandle_t handle;
cublasCreate(&amp;handle);

```

cublas gemm method

```
cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, wA, wB, hA, alphac, Ad,lda,Bd,ldb, betac, Cd, ldc);

```

Need to remember that CUBLAS method WORK ON GPU, that means, all input matrices must be on the device memory.

handle destruction

```cublasDestroy(handle);
```

bring output to host memory

```cudaMemcpy(C, Cd, size, cudaMemcpyDeviceToHost);
```

final touches

```cudaFree(Ad);
cudaFree(Bd);
cudaFree(Cd);
```

So the core of the program is finished, now we need a main method, initialise some matrices and test.

```
int main(int argc, char *argv[])
{
// Matrix variables
float *A, *B, *C;//Matrices declarations
int hA, wA, hB, wB;//declaration of dimensions

setbuf(stdout, NULL);

if (argc!=4){
printf(&quot;./exec hA hB/WA wB\n&quot;);
exit(-1);
}
//transform the input to integers
hA = atoi(argv);
hB = wA = atoi(argv);
wB = atoi(argv);

// Init A and B, malloc C
int size_A = wA * hA;
A = (float*)malloc(size_A*sizeof(float));
init_matrix(A, hA, wA,2.0);

int size_B = wB * hB;
B = (float*)malloc(size_B*sizeof(float));
init_matrix(B, hB, wB,1.0);

int size_C = wB * hA;
C = (float*)malloc(size_C*sizeof(float));

//matrix multiplication
Mul(A, B, hA, wA, wB, C);

//Print on console the 3 matrices
printf(&quot;\n\nMATRIX A\n&quot;);print_matrix(A, hA, wA);
printf(&quot;\n\nMATRIX B\n&quot;);print_matrix(B, hB, wB);
printf(&quot;\n\nMATRIX C\n&quot;);print_matrix(C, hA, wB);

//free the allocated memory
free(A)
free(B)
free(C)
return (1);
}

```

to compile remember to add the flags -lcudart -lcublas.

An output example of 4×4 matrices multiplication is the full code can be found on my git repository, feel free to clone