Due: December 21
Submission Link:
https://canvas.umn.edu/courses/518528/assignments/4943255
This assignment is a group submission with the same groups as used for your final project. Only one member of your group needs to submit this assignment. This score will apply to all students in your group.
In this assignment, you will implement an optimized CUDA kernel for performing pair-wise batched matrix multiplication. Given two sets of matrices, \(A_0, \dots, A_{N_A-1}\) and \(B_0, \dots, B_{N_B-1}\),
each of size 4×4, your goal is to compute:
\[ C = \sum_{i=0}^{N_A - 1} \sum_{j=0}^{N_B - 1} A_i B_j \]
This requires multiplying every pair \(A_i, B_j\) and summing their results.
There are no scaling coefficients, just plain matrix multiplication with accumulation.
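For reference, the required result is simply the sum of all \(N_A \cdot N_B\) pairwise products. Below is a minimal host-side sketch of that computation, assuming each 4×4 matrix is stored row-major and packed contiguously in the input arrays (the starter code's actual layout and names may differ):

// Hypothetical CPU reference: C = sum over all pairs (i, j) of A_i * B_j.
// Assumes row-major 4x4 matrices packed back-to-back in h_A and h_B.
void reference_pairwise_sum(const float* h_A, const float* h_B, float* h_C,
                            int num_A, int num_B) {
    for (int e = 0; e < 16; ++e) h_C[e] = 0.0f;
    for (int i = 0; i < num_A; ++i) {
        for (int j = 0; j < num_B; ++j) {
            const float* A = h_A + 16 * i;
            const float* B = h_B + 16 * j;
            for (int r = 0; r < 4; ++r) {
                for (int c = 0; c < 4; ++c) {
                    float acc = 0.0f;
                    for (int k = 0; k < 4; ++k)
                        acc += A[r * 4 + k] * B[k * 4 + c];
                    h_C[r * 4 + c] += acc;  // accumulate the (i, j) product into C
                }
            }
        }
    }
}

Your optimized kernel must produce this same single 4×4 result; only the amount of memory traffic and redundant work should change.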
Your job is to dramatically reduce global memory traffic, increase arithmetic intensity, and build a high-performance CUDA kernel.
Your starter code is split across several .cu and
.h files.
optimized_kernel.cu
This is also the only file you will submit.
It contains:
sgemm_optimized_batched_kernel
batched_pairwise_optimized(..)
The remaining starter files are main.cu, common.h, utilities.[ch], and naive_kernel.[ch]. Changing these will likely break the autograder and result in a score of 0.
The provided version of optimized_kernel.cu initially
duplicates the naive functionality so the project runs immediately. You
must replace it with an optimized implementation. You can update this
file any way you like, so long as the autograder functions properly and
your function signature for batched_pairwise_optimized
remains the same as in the present file.
In order to run the autograder locally, be sure that you are on your assigned csel-cuda lab machine, which will have a T4 GPU. You can run the autograder with
chmod +x run.sh
./run.sh
You are given 512 matrices \(A_0, \dots, A_{511}\) and 512 matrices \(B_0, \dots, B_{511}\), each of size 4×4.
Your task is to compute:
\[ C = \sum_{i=0}^{511} \sum_{j=0}^{511} A_i B_j \]
Your kernel must compute this efficiently.
This assignment is not about writing a general GEMM implementation.
All matrices are 4×4, and the batch sizes are fixed:
M = 4, N = 4, K = 4, NUM_A = 512, NUM_B = 512
Anything that increases speed is allowed, as long as the results are correct and you do not use higher-level CUDA library features (cuBLAS, Tensor Cores) that abstract shared memory, thread coarsening, and vectorized loads away from you.
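Because every size is a known compile-time constant, you can hardcode them rather than passing them as runtime parameters. A small sketch (the names are illustrative; common.h may already define equivalent constants or macros, so check before redefining anything):

// Hypothetical compile-time constants for the fixed problem sizes.
constexpr int MAT_DIM    = 4;                   // M = N = K = 4
constexpr int MAT_ELEMS  = MAT_DIM * MAT_DIM;   // 16 floats per matrix
constexpr int NUM_A_MATS = 512;
constexpr int NUM_B_MATS = 512;

Compile-time constants let the compiler fully unroll the 4x4 inner loops and compute addresses with constant offsets.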
To maximize performance, focus on the following (a sketch combining these ideas follows the list):
Load data once, reuse it many times.
Make use of reductions both within and across thread blocks for improved speedups.
Use float4 to perform aligned, coalesced loads.
Be sure to make extensive use of shared memory.
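The fragment below is one illustrative way to combine these techniques: float4 vectorized loads, a shared-memory tree reduction within each block, and atomicAdd to combine results across blocks. It is a sketch under stated assumptions (the inputs can be viewed as arrays of float4 rows, blockDim.x is a power of two, and dynamic shared memory is sized at launch as blockDim.x * 4 * sizeof(float4)). It is not the provided starter kernel, and on its own it is not tuned enough to reach the target speedup.

// Illustrative sketch only: each thread computes one A_i * B_j product,
// the block reduces the per-thread 4x4 partial sums in shared memory,
// and one thread per block accumulates the block result into C.
__global__ void pairwise_sum_sketch(const float4* A, const float4* B, float* C,
                                    int num_A, int num_B) {
    extern __shared__ float4 partial[];          // blockDim.x * 4 float4 rows

    int pair = blockIdx.x * blockDim.x + threadIdx.x;
    float4 acc[4] = {make_float4(0.f, 0.f, 0.f, 0.f),
                     make_float4(0.f, 0.f, 0.f, 0.f),
                     make_float4(0.f, 0.f, 0.f, 0.f),
                     make_float4(0.f, 0.f, 0.f, 0.f)};

    if (pair < num_A * num_B) {
        int i = pair / num_B, j = pair % num_B;
        float4 a[4], b[4];
        for (int r = 0; r < 4; ++r) a[r] = A[4 * i + r];   // vectorized row loads
        for (int r = 0; r < 4; ++r) b[r] = B[4 * j + r];
        for (int r = 0; r < 4; ++r) {                      // acc[r] = row r of A_i times B_j
            acc[r].x = a[r].x * b[0].x + a[r].y * b[1].x + a[r].z * b[2].x + a[r].w * b[3].x;
            acc[r].y = a[r].x * b[0].y + a[r].y * b[1].y + a[r].z * b[2].y + a[r].w * b[3].y;
            acc[r].z = a[r].x * b[0].z + a[r].y * b[1].z + a[r].z * b[2].z + a[r].w * b[3].z;
            acc[r].w = a[r].x * b[0].w + a[r].y * b[1].w + a[r].z * b[2].w + a[r].w * b[3].w;
        }
    }

    // Stage this thread's 4x4 partial result in shared memory.
    for (int r = 0; r < 4; ++r) partial[4 * threadIdx.x + r] = acc[r];
    __syncthreads();

    // Tree reduction within the block (assumes blockDim.x is a power of two).
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (threadIdx.x < stride) {
            for (int r = 0; r < 4; ++r) {
                float4 other = partial[4 * (threadIdx.x + stride) + r];
                float4 mine  = partial[4 * threadIdx.x + r];
                mine.x += other.x; mine.y += other.y; mine.z += other.z; mine.w += other.w;
                partial[4 * threadIdx.x + r] = mine;
            }
        }
        __syncthreads();
    }

    // One atomic accumulation per block into the single 4x4 output C.
    if (threadIdx.x == 0) {
        const float* blk = reinterpret_cast<const float*>(partial);
        for (int k = 0; k < 16; ++k) atomicAdd(&C[k], blk[k]);
    }
}

A launch such as pairwise_sum_sketch<<<numBlocks, 256, 256 * 4 * sizeof(float4)>>>(...) would drive it. The larger wins come from loading each A_i and B_j once and reusing them across many pairs, rather than reloading them per pair as this sketch does.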
The autograder evaluates both correctness and performance.
The autograder runs the benchmark 5 times and uses the maximum speedup:
\[ \text{speedup} = \frac{T_\text{naive}}{T_\text{optimized}} \]
Let:
\(S\) = your measured speedup
\(S_0\) = the instructor-defined benchmark speedup (we are using 100x; if you achieve at least a 100x speedup over the naive kernel, you will get full credit)

Your performance score is:
\[ \text{score}_\text{perf} = \begin{cases} 0, & S \le 1 \\ 10 \cdot \frac{S}{S_0}, & 1 < S < S_0 + 1 \\ 10, & S \ge S_0 + 1 \end{cases} \]
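For example, with \(S_0 = 100\), a measured speedup of \(S = 50\) earns \(10 \cdot \frac{50}{100} = 5\) performance points, while any \(S \ge 101\) earns the full 10.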
Total assignment score:
\[ \text{score} = \text{correctness (0โ5)} + \text{performance (0โ10)} \]
The autograder prints all timing information and computed scores.
You must submit exactly one file:
optimized_kernel.cu
Do not compress it.
Do not rename it.
Do not modify other files.
The autograder compiles using:
nvcc -O3 -arch=sm_75 \
main.cu utilities.cu naive_kernel.cu optimized_kernel.cu \
-o batched_gemm && ./batched_gemm && rm -f batched_gemm

Your file must compile cleanly under this exact command.
The assignment is 100% autograded.
Only modify optimized_kernel.cu.
Hardcoding your optimizations to the fixed problem sizes is encouraged.
Shared memory, tiling, vectorized loads, and reduction patterns will be essential for speed.
Readability is appreciated, but performance matters most.