Posts

Design Uber / a Ride-Hailing System - System Design

“Design Uber” sounds like a CRUD app with a map on top. A rider taps a pin, a driver shows up, money moves. The interviewer lets you believe that for about thirty seconds, then asks the question that breaks the toy version: there are 5 million drivers on the road right now, each one broadcasting its GPS position every few seconds, and a rider standing on a corner wants the nearest available car in under a second. How do you find “the closest driver” out of millions of constantly-moving points, hundreds of thousands of times per second, without scanning the whole planet on every request? ...

Prepay the Home Loan or Invest the Surplus? The Breakeven Math

The standard advice is a one-liner: your home loan is at 8.5%, equity returns 11-12% over the long run, so never prepay - invest the surplus and let the gap compound in your favour. It sounds airtight and it is repeated everywhere. It is also often wrong, because it quietly assumes the entire interest you pay is tax-deductible. On a large home loan in the early years, most of it is not. The Section 24 cap of Rs. 2 lakh is where the clean story falls apart. ...

Spec-Driven Development - Writing the Spec Is Writing the Code Now

The most productive engineers I know stopped bragging about how fast they type. When a coding agent can produce 400 lines of correct code from a paragraph, typing speed is not the bottleneck anymore. The bottleneck is the paragraph. If the paragraph is vague, you get 400 lines of confidently wrong code, fast. If the paragraph is precise, you get something you can ship. The spec is now the leverage point, and most people are still treating it like a throwaway comment. ...

Choosing an Embedding Model in 2026 - It's Not the Leaderboard

Most teams pick an embedding model the same way they pick a sorting algorithm: look up the benchmark, take the top result, ship it. For sorting algorithms that works fine. For embeddings it reliably produces a retrieval system that looks great on paper and underperforms on the actual product. The MTEB leaderboard is not useless. But it is measuring retrieval on academic corpora with clean, well-formed queries against documents that look nothing like your internal docs, your customer support tickets, or your codebase. Ranking third on MTEB while being the worst model for your domain is entirely possible, and it happens constantly. ...

Design a URL Shortener (TinyURL) - System Design

Everyone thinks the URL shortener is a trivial problem. “It’s a hash map. Store long URL, return short URL, done.” Then the interviewer asks: how do you generate the key, how do you avoid collisions, what happens when one popular link gets 50,000 redirects a second, and how do you serve that redirect in under 10ms across the globe. Now it’s a real system. The whole problem is deceptively read-heavy and deceptively about one decision: how you mint short keys. Get the key generation wrong and everything downstream (collisions, hot shards, wasted storage) gets worse. Get it right and the rest is caching and sharding you already know. ...
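To make the key-minting decision concrete, here is a minimal sketch of one common option: encoding a monotonically increasing counter in base 62. The excerpt does not say which scheme the post settles on, so the approach itself is an assumption, and the names `ALPHABET`, `encode_base62`, and `decode_base62` are hypothetical.

```python
import string

# Assumed alphabet: 0-9, a-z, A-Z (62 symbols). Illustrative, not the post's scheme.
ALPHABET = string.digits + string.ascii_lowercase + string.ascii_uppercase
BASE = len(ALPHABET)


def encode_base62(n: int) -> str:
    """Turn a non-negative counter value into a short key."""
    if n < 0:
        raise ValueError("counter must be non-negative")
    if n == 0:
        return ALPHABET[0]
    chars = []
    while n:
        n, rem = divmod(n, BASE)
        chars.append(ALPHABET[rem])
    # Digits come out least-significant first, so reverse them.
    return "".join(reversed(chars))


def decode_base62(key: str) -> int:
    """Recover the counter value from a key."""
    n = 0
    for ch in key:
        n = n * BASE + ALPHABET.index(ch)
    return n
```

Because every counter value maps to exactly one key, this scheme cannot collide, and seven characters cover 62^7, roughly 3.5 trillion, keys. The cost is that keys are guessable and the counter itself has to be coordinated across servers.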

Design Twitter's News Feed - System Design

The Twitter timeline looks trivial until you say the numbers out loud. “Show me a list of tweets from people I follow, newest first.” It is a join. SELECT * FROM tweets WHERE author IN (my followees) ORDER BY time DESC LIMIT 50. Done. Then the interviewer points out that some users follow 5,000 accounts, some accounts have 100 million followers, the timeline must load in under 200ms, and 300 million people refresh it all day. The join is now the most expensive query on the internet. ...
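The naive read-time join from the excerpt can be sketched with Python's built-in `sqlite3` module. The table and column names (`follows`, `tweets`, `follower`, `followee`) and the sample rows are assumptions made for illustration only.

```python
import sqlite3


def naive_timeline(conn, user_id, limit=50):
    """Read-time join: recent tweets from everyone user_id follows, newest first."""
    return conn.execute(
        """
        SELECT t.id, t.author, t.body, t.time
        FROM tweets AS t
        JOIN follows AS f ON f.followee = t.author
        WHERE f.follower = ?
        ORDER BY t.time DESC
        LIMIT ?
        """,
        (user_id, limit),
    ).fetchall()


# Hypothetical schema and data, just enough to run the query.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE follows (follower TEXT, followee TEXT);
    CREATE TABLE tweets (id INTEGER PRIMARY KEY, author TEXT, body TEXT, time INTEGER);
    """
)
conn.executemany(
    "INSERT INTO follows VALUES (?, ?)",
    [("me", "alice"), ("me", "bob")],
)
conn.executemany(
    "INSERT INTO tweets (author, body, time) VALUES (?, ?, ?)",
    [("alice", "hi", 1), ("carol", "not followed", 2), ("bob", "yo", 3)],
)

timeline = naive_timeline(conn, "me")
```

This is fine at toy scale. The work done per request grows with the number of accounts followed and the volume of their tweets, which is where the numbers in the excerpt start to hurt.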

Design WhatsApp / a Chat Messaging System - System Design

A chat app sounds like the easiest system you will ever build. “User A sends a message, user B receives it.” One INSERT, one SELECT. The interviewer lets you say that, then asks the questions that turn it into one of the hardest real-time systems in the building: how does B receive it instantly when B might be offline, on a train, or logged in on three devices at once? How do you show the second grey tick the moment it lands on B’s phone, and the two blue ticks the moment B actually opens the chat? How do hundreds of millions of phones hold an open connection to your servers at the same time without melting? ...

The Real Rupee Cost of a 1% Expense Ratio Over 25 Years

A 1% expense ratio sounds harmless. It is less than what Zomato charges in delivery fees, less than a bank locker’s annual rent. Surely it is not worth losing sleep over. Run that 1% through 25 years of compounding and it does not cost you 1% of your corpus. It costs you 16% of your final wealth - roughly Rs 27 lakh per Rs 10 lakh invested, or Rs 24 lakh on a Rs 10,000/month SIP. The money does not disappear in a single charge you can see on your statement. It is silently extracted each year from a growing base, accelerating in rupee terms every single year. ...

Multi-Agent Systems - When Splitting the Work Actually Helps

The instinct when a task is complex is to throw more agents at it. Spin up a researcher, a writer, a critic, a planner - a whole crew. It feels like good engineering. It is usually not. Most multi-agent systems in production are slower, more expensive, and less reliable than a single well-designed agent. The coordination overhead is real: more LLM calls, more context to manage, more failure points, and latency that multiplies rather than shrinks. The only reason to use multiple agents is if the task structure genuinely requires it. Most tasks do not. ...

Observability for LLM Apps - You Can't Fix What You Can't Trace

When your web service throws a 500, you have a stack trace. When your LLM app returns a bad answer, the status code is 200, the latency looks normal, and you have no idea what happened. That is the problem. Standard observability - error rate, latency percentiles, throughput - tells you nothing about the most common failure mode in LLM applications: the model returned something plausible but wrong. You need a different class of instrumentation, one that captures the full context of every inference call - what you sent, what the model returned, how many tokens it used, and which tools it called along the way. ...

What a ₹10,000 SIP Became in Every 10-Year Window Since 2005

“Nifty 50 SIP has historically returned 12-15% over 10 years.” You will see this claim in every mutual fund advertisement. It is not wrong, but it hides the most important part of the story. That 12-15% is an average across all possible 10-year windows. The actual outcome for any real investor depended entirely on which specific decade they happened to invest in. Two investors who each put in exactly ₹10,000 per month for 10 years in a Nifty 50 index fund - starting just five years apart - could have ended up with a difference of over ₹14 lakh on identical total contributions of ₹12 lakh. ...

Semantic Caching - The Cheapest 40% Off Your LLM Bill

The cheapest optimization most teams skip is not routing to smaller models or trimming the context window. It is not calling the model at all. When ten users ask “how do I reset my password?” your app pays to generate that answer ten times. Every one of those generations after the first is pure waste. Caching LLM responses is not a new idea, but most implementations are either too naive (exact-match string hashing that misses 90% of cacheable requests) or too aggressive (semantic matching that returns wrong answers). The gap between those two extremes is where production systems live, and it is worth understanding the mechanics before you build. ...
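A minimal sketch of the "too naive" end of that spectrum: an exact-match cache keyed on a hash of the prompt string. `ExactMatchCache` and `fake_model` are hypothetical stand-ins, not an API from any library, and the canned answer is invented for the example.

```python
import hashlib


class ExactMatchCache:
    """Caches responses keyed on the exact prompt text."""

    def __init__(self):
        self._store = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(prompt: str) -> str:
        return hashlib.sha256(prompt.encode("utf-8")).hexdigest()

    def get_or_generate(self, prompt: str, generate):
        key = self._key(prompt)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        # Cache miss: pay for a generation and remember the result.
        self.misses += 1
        answer = generate(prompt)
        self._store[key] = answer
        return answer


def fake_model(prompt: str) -> str:
    # Stand-in for a real LLM call.
    return "Use the 'Forgot password' link on the sign-in page."


cache = ExactMatchCache()
questions = [
    "how do I reset my password?",
    "how do I reset my password?",   # identical string: cache hit
    "How can I reset my password",   # same intent, different string: miss
]
for q in questions:
    cache.get_or_generate(q, fake_model)
```

Only the byte-for-byte repeat is served from the cache. The rephrased question pays for a fresh generation even though the answer is the same, which is the gap semantic matching tries to close.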

Structured Outputs - Stop Parsing LLM JSON by Hand

If you have a regex somewhere that strips markdown fences to pull JSON out of an LLM response, you have a time bomb. It works 95% of the time in development. It breaks on the 5% of production traffic that has slightly different phrasing, a model version bump, or a user input the model has never seen. You fix it, it breaks again three weeks later in a different way. ...

Context Engineering - The Discipline That Replaced Prompt Engineering

The question engineers ask most often when an LLM pipeline underperforms is: “how should I reword this prompt?” That is almost never the right question. The right question is: “what am I putting in the context window, and is it the right information in the right order?” This shift - from prompt wording to context construction - is what context engineering means. It is not a rebranding. The two disciplines require different skills, different tooling, and produce different categories of wins. A 10% improvement from rewording a prompt is about as good as it gets. A 40-60% quality improvement from restructuring how you build context is routine. ...

Small Language Models Are Eating the Easy 80%

Most production AI costs are paid to frontier models for tasks that a 3-billion-parameter model running locally could handle just as well. Not the hard reasoning, not the creative synthesis - the classification, the extraction, the summarization of short content, the fill-in-the-template work that makes up the majority of real inference load. Small language models (SLMs) are not a compromise you accept when you cannot afford the real thing. In 2026, they are the deliberate choice for the 70-80% of tasks where frontier models are overkill, and the routing layer that separates them is the actual engineering problem worth solving. ...

Diffusion LLMs - The Text Models That Don't Predict Left to Right

Every LLM you have used in production generates text the same way: one token at a time, left to right, each token depending on everything that came before it. That is autoregressive decoding, and it has a hard constraint baked in. The sequential nature is not an implementation detail you can optimize away - it is the mathematical structure of the model. Diffusion LLMs take a different path. They generate an entire sequence in parallel across multiple denoising steps, rather than one token per step. The practical result is lower latency on long outputs. The catch is that the tradeoffs are subtle enough that most coverage of this topic either oversells the speed claims or undersells the real limitations. ...

What 'Thinking' Actually Costs - Reasoning Models and Test-Time Compute

When you enable extended thinking on Claude or switch to an o-series model, the price per request jumps 3 to 10x. That is not because you are getting a smarter model. You are getting the same model spending more computation on your specific question at inference time. Understanding the difference between training-time compute and inference-time compute changes how you decide when these models are worth using. How a Standard Model Generates Output: A standard LLM takes all the input tokens (system prompt, history, user message), runs them through the transformer layers in a single forward pass to predict the next token, then samples that token and appends it to the sequence. It repeats this process until it generates an end token. The cost of each forward pass scales with model size. Total cost scales linearly with output token count. ...
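The generation loop described above can be sketched in a few lines. This is a toy under stated assumptions: `toy_forward_pass` is a hypothetical stand-in that picks a random token instead of running a transformer, and `END_TOKEN`, `vocab_size`, and `seed` are invented for the example. The point is only the shape of the loop: one forward pass per output token.

```python
import random

END_TOKEN = 0  # assumed id for the end-of-sequence token


def toy_forward_pass(tokens, vocab_size, rng):
    """Stand-in for a real model: returns the next token id."""
    return rng.randrange(vocab_size)


def generate(prompt_tokens, max_new_tokens, vocab_size=50, seed=0):
    rng = random.Random(seed)
    tokens = list(prompt_tokens)
    forward_passes = 0
    for _ in range(max_new_tokens):
        next_token = toy_forward_pass(tokens, vocab_size, rng)
        forward_passes += 1
        tokens.append(next_token)       # the new token joins the sequence
        if next_token == END_TOKEN:     # stop once the end token appears
            break
    new_tokens = tokens[len(prompt_tokens):]
    return new_tokens, forward_passes


new_tokens, forward_passes = generate([5, 7, 9], max_new_tokens=20)
```

The number of forward passes always equals the number of tokens produced, which is why cost grows linearly with output length.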

Your AI Agent Is a Security Hole - Prompt Injection in 2026

Most engineers building agents spend time worrying about hallucinations. The more immediate risk is that your agent will faithfully execute instructions planted by an attacker in a web page, a retrieved document, or an MCP server response. No jailbreak required. The model follows the injected instructions because to the model, they look exactly like legitimate instructions. This is not a theoretical risk. Researchers have demonstrated successful attacks against agents with web search tools, email agents that read malicious messages, and code agents that load poisoned documentation. As agents gain more autonomy - file access, email, database writes, API calls - the blast radius of a successful injection grows proportionally. ...

You're Shipping AI Features Blind - Eval-Driven Development in 2026

The dirty secret of most AI product teams in 2026: when someone asks “how do you know the new prompt is better?” the honest answer is “we ran it on a few examples and it felt better.” That is vibes-based QA. It works for a demo and collapses in production. LLM features break in ways that unit tests do not catch. You change the prompt to fix one user’s complaint, and it silently regresses the output for 30 other inputs you never checked. You upgrade the model version and the tone shifts. You add a retrieved document to the context and the model stops following the output schema. None of these have stack traces. The failure arrives as a support ticket three days later. ...

How AI Agents Actually Remember - Memory Architectures in 2026

Ask anyone building agents what their biggest problem is and “memory” comes up within the first two sentences. The agent solved the bug yesterday and has no idea today. It learned the user prefers TypeScript, then suggested JavaScript an hour later. It re-read the same 4,000-line file three times in one session because nothing told it that it already knew the answer. The confusing part is that everyone thinks they already have memory. “I have a 1M token context window.” That is not memory. That is a desk. Memory is the filing cabinet you reach for when the desk is full and the meeting ended. This post is about how agents actually remember in 2026 - starting from the naive version everyone builds first, showing exactly where it breaks, and ending at the hybrid architecture that production agents converge on. ...