Inferencing Transformers in Real-Time

May 13, 2025 · 20 min read

A journal on writing CUDA kernels from scratch to run GPT-2 at nearly 70 tokens per second on an A40 GPU, exploring optimization techniques from tensor cores to flash attention.