GPU-Accelerated CNN Inference

This project implements CUDA kernels for the convolution forward pass of a modified LeNet-5 neural network. The goal is to improve large-batch inference performance on the Fashion-MNIST dataset while understanding where GPU time is actually spent.

I applied GPU optimization techniques including im2col input unrolling and kernel fusion, then profiled the kernels with Nsight Systems and Nsight Compute to locate memory bandwidth and compute bottlenecks.
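To make the unrolling idea concrete, here is a minimal CPU reference sketch in C++, not the project's actual CUDA kernels. It assumes a single image, stride 1, no padding, and row-major layouts. The names im2col and conv_forward and their parameters are hypothetical, chosen only for illustration. Unrolling copies each K x K input patch into a column, so the convolution becomes a plain matrix multiply of the filter matrix with the unrolled matrix.

```cpp
#include <cstddef>
#include <vector>

// Unroll one C x H x W image (row-major) into a matrix with C*K*K rows and
// H_out*W_out columns. Assumes stride 1 and no padding (illustrative only).
std::vector<float> im2col(const std::vector<float>& x,
                          int C, int H, int W, int K) {
    const int H_out = H - K + 1;
    const int W_out = W - K + 1;
    const int n_cols = H_out * W_out;
    std::vector<float> cols(static_cast<std::size_t>(C) * K * K * n_cols);
    for (int c = 0; c < C; ++c) {
        for (int p = 0; p < K; ++p) {
            for (int q = 0; q < K; ++q) {
                // One row per (channel, filter row, filter column) triple.
                const int row = (c * K + p) * K + q;
                for (int h = 0; h < H_out; ++h) {
                    for (int w = 0; w < W_out; ++w) {
                        // One column per output pixel.
                        const int col = h * W_out + w;
                        cols[static_cast<std::size_t>(row) * n_cols + col] =
                            x[static_cast<std::size_t>(c * H + h + p) * W + (w + q)];
                    }
                }
            }
        }
    }
    return cols;
}

// Convolution (cross-correlation, as is usual in CNNs) as a matrix multiply:
// weights is M x (C*K*K), the unrolled input is (C*K*K) x (H_out*W_out).
std::vector<float> conv_forward(const std::vector<float>& weights,
                                const std::vector<float>& x,
                                int M, int C, int H, int W, int K) {
    const int n_rows = C * K * K;
    const int n_cols = (H - K + 1) * (W - K + 1);
    const std::vector<float> cols = im2col(x, C, H, W, K);
    std::vector<float> y(static_cast<std::size_t>(M) * n_cols, 0.0f);
    for (int m = 0; m < M; ++m) {
        for (int r = 0; r < n_rows; ++r) {
            const float wgt = weights[static_cast<std::size_t>(m) * n_rows + r];
            for (int j = 0; j < n_cols; ++j) {
                y[static_cast<std::size_t>(m) * n_cols + j] +=
                    wgt * cols[static_cast<std::size_t>(r) * n_cols + j];
            }
        }
    }
    return y;
}
```

A fused GPU version would typically compute the same unrolled indices on the fly inside the matrix-multiply kernel, so the unrolled matrix is never written to global memory. That describes the general technique rather than this project's exact kernels.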

Technologies

CUDA
C++
Nsight Systems
Nsight Compute