
Inferencing Transformers in Real-Time

May 13, 2025 · 20 min read

A journal on writing CUDA kernels from scratch to run GPT-2 at nearly 70 tokens per second on an A40 GPU, covering optimization techniques from tensor cores to flash attention.

CUDA · transformers · GPU · optimization · inference · tensor-cores · flash-attention

Read the full article: Inferencing Transformers in Real-Time

Co-authored with Tianhao Chen and Vasunandan Dar.