GPU-Accelerated CNN Inference

CUDA
C++
GPU

CUDA kernels and profiling work for accelerating the convolution forward pass of a modified LeNet-5 network.

This project implements CUDA kernels for the convolution forward pass of a modified LeNet-5 neural network. The goal is to improve large-batch inference performance on the Fashion-MNIST dataset while understanding where GPU time is actually spent.

I applied GPU optimization techniques, including im2col input unrolling and kernel fusion, then profiled the kernels with Nsight Systems and Nsight Compute to find where they were limited by memory bandwidth and where by compute.
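To make the im2col idea concrete, here is a minimal CPU-side C++ sketch, not the project's CUDA kernels. It assumes a single image, stride 1, no padding, and square filters; the names `ConvShape`, `conv_direct`, `im2col`, and `conv_im2col` are hypothetical and chosen only for illustration. It shows that unrolling the input into a matrix turns the convolution into one matrix multiply that produces the same output as the direct loop nest.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical layer shape for one image (illustrative names, not from the project).
struct ConvShape {
    int C;  // input channels
    int H;  // input height
    int W;  // input width
    int M;  // output feature maps
    int K;  // square filter size
    int outH() const { return H - K + 1; }  // stride 1, no padding (assumed)
    int outW() const { return W - K + 1; }
};

// Direct convolution: y[m][h][w] = sum over c, p, q of x[c][h+p][w+q] * k[m][c][p][q].
std::vector<float> conv_direct(const ConvShape& s, const std::vector<float>& x,
                               const std::vector<float>& k) {
    assert(x.size() == static_cast<std::size_t>(s.C) * s.H * s.W);
    assert(k.size() == static_cast<std::size_t>(s.M) * s.C * s.K * s.K);
    const int oh = s.outH(), ow = s.outW();
    std::vector<float> y(static_cast<std::size_t>(s.M) * oh * ow, 0.0f);
    for (int m = 0; m < s.M; ++m)
        for (int h = 0; h < oh; ++h)
            for (int w = 0; w < ow; ++w) {
                float acc = 0.0f;
                for (int c = 0; c < s.C; ++c)
                    for (int p = 0; p < s.K; ++p)
                        for (int q = 0; q < s.K; ++q)
                            acc += x[(c * s.H + h + p) * s.W + w + q] *
                                   k[((m * s.C + c) * s.K + p) * s.K + q];
                y[(m * oh + h) * ow + w] = acc;
            }
    return y;
}

// im2col: copy every K x K input patch into one column of a
// (C*K*K) x (outH*outW) row-major matrix.
std::vector<float> im2col(const ConvShape& s, const std::vector<float>& x) {
    assert(x.size() == static_cast<std::size_t>(s.C) * s.H * s.W);
    const int oh = s.outH(), ow = s.outW(), cols = oh * ow;
    std::vector<float> u(static_cast<std::size_t>(s.C) * s.K * s.K * cols);
    for (int c = 0; c < s.C; ++c)
        for (int p = 0; p < s.K; ++p)
            for (int q = 0; q < s.K; ++q) {
                const int row = (c * s.K + p) * s.K + q;
                for (int h = 0; h < oh; ++h)
                    for (int w = 0; w < ow; ++w)
                        u[row * cols + h * ow + w] = x[(c * s.H + h + p) * s.W + w + q];
            }
    return u;
}

// Convolution as a matrix multiply: Y (M x cols) = Kmat (M x C*K*K) * U (C*K*K x cols).
// The filter tensor k[m][c][p][q] is already laid out as the row-major matrix Kmat.
std::vector<float> conv_im2col(const ConvShape& s, const std::vector<float>& x,
                               const std::vector<float>& k) {
    assert(k.size() == static_cast<std::size_t>(s.M) * s.C * s.K * s.K);
    const std::vector<float> u = im2col(s, x);
    const int rows = s.C * s.K * s.K, cols = s.outH() * s.outW();
    std::vector<float> y(static_cast<std::size_t>(s.M) * cols, 0.0f);
    for (int m = 0; m < s.M; ++m)
        for (int r = 0; r < rows; ++r)
            for (int j = 0; j < cols; ++j)
                y[m * cols + j] += k[m * rows + r] * u[r * cols + j];
    return y;
}
```

Calling `conv_im2col(s, x, k)` and `conv_direct(s, x, k)` on the same inputs should give matching outputs. The unrolled matrix stores each input value up to K*K times, so a standalone unroll step costs extra memory traffic. Fusing the unroll into the multiply, so the matrix is never written out in full, is one common way to avoid that cost. The source does not say which kernels were fused in this project, so treat that reading as an assumption.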

Technologies

  • CUDA
  • C++
  • Nsight Systems
  • Nsight Compute