<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
     xmlns:atom="http://www.w3.org/2005/Atom"
     xmlns:dc="http://purl.org/dc/elements/1.1/"
     xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <title>Varun Varia — Essays</title>
    <link>https://thevarunvaria.github.io/</link>
    <atom:link href="https://thevarunvaria.github.io/feed.xml" rel="self" type="application/rss+xml" />
    <description>Long-form essays and notes on building, serving, and governing modern AI systems.</description>
    <language>en-us</language>
    <managingEditor>thevarunvaria@gmail.com (Varun Varia)</managingEditor>
    <webMaster>thevarunvaria@gmail.com (Varun Varia)</webMaster>
    <lastBuildDate>Wed, 27 May 2026 00:00:00 GMT</lastBuildDate>
    <generator>hand-rolled HTML</generator>

    <item>
      <title>Why You Can't Serve LLMs Like Regular Models (And How to Fix It)</title>
      <link>https://thevarunvaria.github.io/llm-inference.html</link>
      <guid isPermaLink="true">https://thevarunvaria.github.io/llm-inference.html</guid>
      <pubDate>Fri, 01 May 2026 00:00:00 GMT</pubDate>
      <dc:creator>Varun Varia</dc:creator>
      <category>AI Infrastructure</category>
      <description><![CDATA[A practical guide to the five fundamental differences between traditional ML inference and modern LLM serving — continuous batching, prefill/decode disaggregation, PagedAttention, prefix-aware routing, and MoE sharding.]]></description>
      <content:encoded><![CDATA[
        <p>If <strong>training</strong> an AI model is like sending a chef to culinary school for years, <strong>inference</strong> is putting that chef in a live Michelin-star kitchen on a Friday night at 8pm, taking real orders from paying customers who have expectations, dietary restrictions, and varying levels of patience.</p>
        <p>In the world of Generative AI, the single most important number is <strong>Tokens Per Second (TPS)</strong>. Traditional ML inference has predictable characteristics: fixed-size inputs, fixed computation per example, and fixed-size outputs. Large Language Models break all three assumptions at once. Prompts vary in length, the work per request depends on how long both the prompt and the response turn out to be, and outputs are generated one token at a time until the model decides to stop. Together, these create five deep technical problems that simply do not exist in classical model serving.</p>
        <p>Continue reading at <a href="https://thevarunvaria.github.io/llm-inference.html">thevarunvaria.github.io</a>.</p>
      ]]></content:encoded>
    </item>
  </channel>
</rss>
