Inferencing Transformers in Real-Time
May 13, 2025 • 20 min read
A journal of writing CUDA kernels from scratch to run GPT-2 at almost 70 tokens per second on an A40 GPU, exploring optimization techniques from tensor cores to flash attention.
CUDA · transformers · GPU · optimization · inference · tensor-cores · flash-attention
Read the full article: Inferencing Transformers in Real-Time
Co-authored with Tianhao Chen and Vasunandan Dar.