This project implements CUDA kernels for the convolution forward pass of a modified LeNet-5 neural network. The goal is to improve large-batch inference performance on the Fashion-MNIST dataset while understanding where GPU time is actually spent.
I applied GPU optimization techniques, including im2col input unrolling and kernel fusion, then profiled the kernels with Nsight Systems and Nsight Compute to locate memory-bandwidth and compute bottlenecks.
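
To make the two techniques concrete, here is a minimal CPU-side sketch in C++. It is not the project's actual kernel code. It assumes a single image, stride 1, no padding, and square K x K filters, and the function names (`im2col`, `conv_forward_gemm`, `conv_forward_fused`) are hypothetical. The first two functions show unrolling followed by a matrix multiply. The third shows the idea behind fusion: computing the unrolled indices on the fly so the intermediate matrix is never written to memory.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Unroll x[C][H][W] into a (C*K*K) x (H_out*W_out) row-major matrix.
// Each column holds the input patch that produces one output pixel.
std::vector<float> im2col(const std::vector<float>& x, int C, int H, int W, int K) {
    assert(K <= H && K <= W);
    assert(x.size() == static_cast<std::size_t>(C) * H * W);
    const int H_out = H - K + 1;  // stride 1, no padding (assumption)
    const int W_out = W - K + 1;
    const int cols = H_out * W_out;
    std::vector<float> unrolled(static_cast<std::size_t>(C) * K * K * cols);
    for (int c = 0; c < C; ++c)
        for (int p = 0; p < K; ++p)
            for (int q = 0; q < K; ++q) {
                const int row = (c * K + p) * K + q;
                for (int h = 0; h < H_out; ++h)
                    for (int w = 0; w < W_out; ++w)
                        unrolled[static_cast<std::size_t>(row) * cols + h * W_out + w] =
                            x[(static_cast<std::size_t>(c) * H + h + p) * W + (w + q)];
            }
    return unrolled;
}

// y[M][cols] = weights[M][inner] * unrolled[inner][cols], where inner = C*K*K.
std::vector<float> conv_forward_gemm(const std::vector<float>& weights,
                                     const std::vector<float>& unrolled,
                                     int M, int inner, int cols) {
    std::vector<float> y(static_cast<std::size_t>(M) * cols, 0.0f);
    for (int m = 0; m < M; ++m)
        for (int r = 0; r < inner; ++r)
            for (int j = 0; j < cols; ++j)
                y[static_cast<std::size_t>(m) * cols + j] +=
                    weights[static_cast<std::size_t>(m) * inner + r] *
                    unrolled[static_cast<std::size_t>(r) * cols + j];
    return y;
}

// Fused variant: same result, but the unrolled matrix is never materialized.
// Each (row, col) index is mapped back to an input coordinate when needed.
std::vector<float> conv_forward_fused(const std::vector<float>& weights,
                                      const std::vector<float>& x,
                                      int M, int C, int H, int W, int K) {
    assert(K <= H && K <= W);
    const int H_out = H - K + 1;
    const int W_out = W - K + 1;
    const int cols = H_out * W_out;
    const int inner = C * K * K;
    std::vector<float> y(static_cast<std::size_t>(M) * cols, 0.0f);
    for (int m = 0; m < M; ++m)
        for (int j = 0; j < cols; ++j) {
            const int h = j / W_out, w = j % W_out;
            float acc = 0.0f;
            for (int r = 0; r < inner; ++r) {
                const int c = r / (K * K), p = (r / K) % K, q = r % K;
                acc += weights[static_cast<std::size_t>(m) * inner + r] *
                       x[(static_cast<std::size_t>(c) * H + h + p) * W + (w + q)];
            }
            y[static_cast<std::size_t>(m) * cols + j] = acc;
        }
    return y;
}
```

In a CUDA version, the loops over output elements would typically map to threads, and the fused form trades extra index arithmetic for less global-memory traffic. Whether that trade pays off depends on the layer shape, which is the kind of question the profiling was meant to answer.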
Technologies
- CUDA
- C++
- Nsight Systems
- Nsight Compute