Why You Can't Serve LLMs Like Regular Models (And How to Fix It)

Varun Varia — Fri, 01 May 2026 00:00:00 GMT

If training an AI model is like sending a chef to culinary school for years, inference is putting that chef in a live Michelin-star kitchen at 8pm on a Friday night, taking real orders from paying customers with expectations, dietary restrictions, and varying levels of patience.

In generative AI, the single most important serving number is tokens per second (TPS). Traditional ML inference has predictable characteristics: fixed-size inputs, fixed computation per example, and fixed-size outputs. Large Language Models break all three assumptions at once, which creates five deep technical problems that do not exist in classical model serving.
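
To make the broken assumptions concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not a serving implementation: the `Request` type, the function names, and the token counts are all hypothetical. It assumes plain autoregressive decoding (one output token per forward pass) and a naive static batch that runs until its longest request finishes.

```python
from dataclasses import dataclass


@dataclass
class Request:
    # Hypothetical request record; both lengths vary per request.
    prompt_tokens: int
    output_tokens: int  # in practice unknown until generation stops


def decode_steps(req: Request) -> int:
    # Assumption: plain autoregressive decoding, one token per forward pass,
    # so compute per request scales with its output length.
    return req.output_tokens


def tokens_per_second(total_output_tokens: int, elapsed_seconds: float) -> float:
    # TPS is simply generated tokens divided by wall-clock time.
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return total_output_tokens / elapsed_seconds


def static_batch_utilization(batch: list[Request]) -> float:
    # Assumption: a naive static batch runs until its longest request is done,
    # so slots held by shorter requests sit idle for the remaining steps.
    longest = max(decode_steps(r) for r in batch)
    useful = sum(decode_steps(r) for r in batch)
    return useful / (longest * len(batch))


# Illustrative numbers only, not measurements.
batch = [Request(10, 5), Request(200, 50), Request(30, 20)]
utilization = static_batch_utilization(batch)
tps = tokens_per_second(sum(r.output_tokens for r in batch), 1.5)
```

With these made-up lengths, half of the batch's decode slots do no useful work, which is the kind of waste a fixed-shape classical model never produces.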

Continue reading at thevarunvaria.github.io.
