In this post I’ll be writing a simple program that multiplies two matrices using cuBLAS.
I’ll assume we already know the basics of CUDA; if not, you can check my previous post here. For the matrix multiplication we are going to use the GEMM operation.
C = α op(A) op(B) + βC, where α and β are scalars. In our case α is going to be 1 and β is going to be 0, so the operation reduces to a plain matrix product C = A·B.
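To make the formula concrete, here is what GEMM computes, written as a plain CPU function in C. This is a reference sketch of my own (the name cpu_sgemm and the row-major layout are my choices, not part of cuBLAS); it will also come in handy later to sanity-check the GPU result.

// Naive CPU reference for C = alpha * A * B + beta * C (row-major).
// A is m x k, B is k x n, C is m x n.
void cpu_sgemm(const float *A, const float *B, float *C,
               int m, int n, int k, float alpha, float beta)
{
    for (int i = 0; i < m; i++)
        for (int j = 0; j < n; j++) {
            float acc = 0.0f;
            for (int p = 0; p < k; p++)
                acc += A[i*k + p] * B[p*n + j];
            C[i*n + j] = alpha * acc + beta * C[i*n + j];
        }
}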
In the CUDA toolkit documentation, the GEMM operation is declared as the following:
cublasStatus_t cublasSgemm(cublasHandle_t handle,
                           cublasOperation_t transa, cublasOperation_t transb,
                           int m, int n, int k,
                           const float *alpha,
                           const float *A, int lda,
                           const float *B, int ldb,
                           const float *beta,
                           float *C, int ldc)
where
- m is the number of rows of matrix op(A) and C.
- n is the number of columns of matrix op(B) and C.
- k is the number of columns of op(A) and the number of rows of op(B).
- transa is the operation op(A): non-transpose, transpose, or conjugate transpose.
- transb is the operation op(B): non-transpose, transpose, or conjugate transpose.
- A is an array of dimensions lda x k with lda >= max(1,m) if transa == CUBLAS_OP_N, and lda x m with lda >= max(1,k) otherwise.
- B is an array of dimensions ldb x n with ldb >= max(1,k) if transb == CUBLAS_OP_N, and ldb x k with ldb >= max(1,n) otherwise.
- lda is the leading dimension of the two-dimensional array used to store matrix A.
- ldb is the leading dimension of the two-dimensional array used to store matrix B.
For more information, such as return values and the many other available operations, you can check the documentation here.
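To make the parameter mapping concrete, here is a minimal sketch of a straightforward call. This is my own illustration, assuming a cublasHandle_t handle has already been created, that M, N, K and the device pointers d_A, d_B, d_C exist, and that all matrices are stored column-major, which is what cuBLAS expects:

// C (M x N) = A (M x K) * B (K x N), everything column-major and already on the device
const float alpha = 1.0f, beta = 0.0f;
cublasSgemm(handle,
            CUBLAS_OP_N, CUBLAS_OP_N,   // no transposition of A or B
            M, N, K,                    // dimensions of op(A), op(B) and C
            &alpha,
            d_A, M,                     // lda = number of rows of A
            d_B, K,                     // ldb = number of rows of B
            &beta,
            d_C, M);                    // ldc = number of rows of C

Since the matrices in the code below are stored row-major, the actual call we make later will look a bit different; more on that when we get there.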
Now let’s go to the code.
Before starting, let me declare some auxiliary functions just to make the code clearer.
void init_matrix(float *M, int hM, int wM, float k)
{
    int i, j;
    for (i = 0; i < hM; i++)
        for (j = 0; j < wM; j++)
            if (i == j)
                M[i*wM + j] = k*k*1.0f;
            else
                M[i*wM + j] = -k*1.0f;
}
This first function initializes a matrix of dimensions hM x wM; the k value just gives the matrix some recognizable contents (k*k on the diagonal and -k everywhere else).
void print_matrix(float *M, int hM, int wM)
{
    int i, j;
    for (i = 0; i < hM; i++) {
        for (j = 0; j < wM; j++)
            printf("%4.1f ", M[i*wM + j]);
        printf("\n");
    }
}
This function prints a matrix on screen.
For the multiplication, let us declare a function that accepts six parameters: the two input matrices, the output matrix, the height and width of the first matrix, and the width of the second matrix.
void Mul(float* A, float* B, int hA, int wA, int wB, float* C)
{
    // Now we have to allocate memory on the GPU for the 3 matrices
    int size;

    float* Ad;
    size = hA * wA * sizeof(float);
    cudaMalloc((void**)&Ad, size);
    cudaMemcpy(Ad, A, size, cudaMemcpyHostToDevice);

    float* Bd;
    size = wA * wB * sizeof(float);
    cudaMalloc((void**)&Bd, size);
    cudaMemcpy(Bd, B, size, cudaMemcpyHostToDevice);

    // Allocate C on the device
    float* Cd;
    size = hA * wB * sizeof(float);
    cudaMalloc((void**)&Cd, size);

    // Declare alpha and beta
    const float alfa = 1;
    const float beta = 0;
    const float *alphac = &alfa;
    const float *betac = &beta;

    // Leading dimensions as cuBLAS (column-major) sees our row-major buffers:
    // each row-major matrix looks like its transpose, so the leading dimension
    // is the row-major width.
    int lda = wA, ldb = wB, ldc = wB;
In order to use the cuBLAS routines we first need to create a handle, and when we no longer need cuBLAS we should destroy it.
For the handle creation:
cublasHandle_t handle;
cublasCreate(&handle);
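Every cuBLAS call returns a cublasStatus_t, so it is good practice to check it. A minimal sketch (the error-handling style here is my own, not part of the original code):

cublasStatus_t status = cublasCreate(&handle);
if (status != CUBLAS_STATUS_SUCCESS) {
    fprintf(stderr, "cublasCreate failed with status %d\n", (int)status);
    exit(EXIT_FAILURE);
}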
Now for the cuBLAS GEMM call itself. One detail matters here: cuBLAS assumes column-major storage, while our matrices are laid out row-major. Instead of transposing anything, we can use the identity (A·B)^T = B^T·A^T: to cuBLAS our row-major buffers already look like A^T and B^T, so by swapping the two operands cuBLAS writes C^T in column-major order, which is exactly our C in row-major order.
cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
            wB, hA, wA,      // m, n, k of the column-major (transposed) view
            alphac,
            Bd, ldb,         // B^T is the first operand in the column-major view
            Ad, lda,
            betac,
            Cd, ldc);
Keep in mind that cuBLAS routines run on the GPU, which means all the input matrices must already be in device memory.
When we are done with cuBLAS, we destroy the handle:
cublasDestroy(handle);
Then we bring the output back to host memory:
cudaMemcpy(C, Cd, size, cudaMemcpyDeviceToHost);
Finally, we free the device memory:
cudaFree(Ad);
cudaFree(Bd);
cudaFree(Cd);
}
With that, the core of the program is finished; now we need a main function to initialise some matrices and test it.
int main(int argc, char *argv[])
{
    // Matrix variables
    float *A, *B, *C;        // matrix declarations
    int hA, wA, hB, wB;      // declaration of dimensions

    setbuf(stdout, NULL);

    if (argc != 4) {
        printf("./exec hA hB/wA wB\n");
        exit(-1);
    }

    // Transform the input to integers
    hA = atoi(argv[1]);
    hB = wA = atoi(argv[2]);
    wB = atoi(argv[3]);

    // Init A and B, malloc C
    int size_A = wA * hA;
    A = (float*)malloc(size_A * sizeof(float));
    init_matrix(A, hA, wA, 2.0);

    int size_B = wB * hB;
    B = (float*)malloc(size_B * sizeof(float));
    init_matrix(B, hB, wB, 1.0);

    int size_C = wB * hA;
    C = (float*)malloc(size_C * sizeof(float));

    // Matrix multiplication
    Mul(A, B, hA, wA, wB, C);

    // Print the 3 matrices on the console
    printf("\n\nMATRIX A\n"); print_matrix(A, hA, wA);
    printf("\n\nMATRIX B\n"); print_matrix(B, hB, wB);
    printf("\n\nMATRIX C\n"); print_matrix(C, hA, wB);

    // Free the allocated memory
    free(A);
    free(B);
    free(C);

    return 0;
}
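If you want to double-check the result, a quick sanity check can be dropped into main right after the Mul call, reusing the cpu_sgemm reference sketch from earlier (again, my own addition, not part of the original program; it assumes math.h is included for fabsf):

// Compare the cuBLAS result against the naive CPU reference
float *C_ref = (float*)calloc(hA * wB, sizeof(float));
cpu_sgemm(A, B, C_ref, hA, wB, wA, 1.0f, 0.0f);
int ok = 1;
for (int i = 0; i < hA * wB; i++)
    if (fabsf(C[i] - C_ref[i]) > 1e-3f) { ok = 0; break; }
printf("GPU result %s the CPU reference\n", ok ? "matches" : "does NOT match");
free(C_ref);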
To compile, remember to add the flags -lcudart and -lcublas.
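For example, assuming the source file is called cublas_multiplication.cu (a file name I am making up here), the compile line would look like:

nvcc cublas_multiplication.cu -o cublas_multiplication -lcudart -lcublas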
An example output of a 4×4 matrix multiplication is:
The full code can be found in my git repository; feel free to clone it:
git clone https://adri1988@bitbucket.org/adri1988/cublas_multiplication.git