What Makes AI Features Different in Production

intermediate487 reads

Why AI Features Need a Different Mindset

Traditional software is deterministic. Given the same input, it produces the same output. You can write a test, run it, and know the feature works.

AI features are not like that. An LLM-powered feature produces different outputs for the same input across runs. Its quality degrades silently when a model provider updates their model. A change to your prompt can break behavior in ways that are hard to catch with a unit test. The cost of the feature is tied to usage in ways that are hard to predict.

This does not mean AI features are unshippable. It means they need a different set of practices.

The Production Failure Modes Unique to AI

When you ship a traditional feature, you worry about bugs, performance, and availability. When you ship an AI feature, you have all of those concerns plus several new ones.

Non-determinism. The same prompt can produce different outputs. Your tests need to account for this. A test that checks for an exact string match will fail on good output and pass on bad output, depending on the run.

Hallucination. The model may generate plausible-sounding content that is factually wrong. In a customer-facing feature, this can cause real harm: wrong advice, false information, made-up citations.

Silent quality degradation. When a model provider updates their model (even a patch release), the behavior of your prompts can change. This happens without warning. You will not see an error. Responses will just be subtly different, sometimes worse.

Prompt injection. Users can craft inputs that hijack your prompt and make the model behave in ways you did not intend. This is a security concern, not just a quality concern.

Latency variability. LLM inference is slow compared to a database query, and the latency is not fixed. A prompt that takes 500ms on average might take 5 seconds on a bad day. Your users will feel this.

Cost unpredictability. Token usage scales with input and output length. A user who figures out how to get your feature to generate very long outputs can significantly increase your costs. Multiply this by thousands of users.

What This Means for Your Engineering Practices

You need to build evaluation before you ship, not after. You need logging and observability from day one, not as an afterthought. You need fallback behavior for when the model is slow, expensive, or wrong.

You also need to treat prompts like code. They should be versioned, reviewed, and tested. A casual change to a system prompt is the equivalent of a casual change to a core library: the blast radius is large and often invisible.

This course covers the practices that make the difference between an AI feature that is a liability and one that is a reliable part of your product.

What You Need Before This Course

This course assumes you have:

Experience building software features and shipping them to production
Basic familiarity with calling an LLM API (OpenAI, Anthropic, or similar)
Some understanding of how prompts and responses work

You do not need experience with ML training, fine-tuning, or model deployment. This course is about the software engineering layer: prompts, evals, observability, and deployment.

Next Steps

In the next tutorial, you will get a framework for thinking about production readiness for AI features, before you write a single line of evaluation code.

Discussion

Loading…

← Back to course