Designing Experiments for AI Features
Three Experimental Approaches
There are three common ways to test an AI feature before full launch: Wizard of Oz testing, shadow mode, and limited rollout.
Wizard of Oz Testing
Wizard of Oz means you fake the AI. A human generates the output that users think came from AI.
Why use this? You learn if users want the feature without building it yet.
Example: You want to test if users would find AI-generated task suggestions helpful. You manually write suggestions for 20 beta users. They interact with them like they are AI-generated. You collect feedback. Do users like the suggestions?
If feedback is good, you build the real AI. If feedback is bad, you saved engineering time.
Cost: Low. One person manually creating outputs.
Learnings: You validate the core idea before investing in AI engineering.
Shadow Mode
Shadow mode means the AI runs behind the scenes. Users do not see the output. You measure it anyway.
Why use this? You test AI quality without affecting user experience.
Example: You build the AI task suggestion feature. You roll it out in shadow mode. Users do not see suggestions. But the AI is running, generating suggestions, and logging them.
You measure how often the suggestions are actually good by comparing what the AI suggested against what users went on to do.
After two weeks, if AI quality is high, you show suggestions to users. If quality is low, you improve the AI privately.
Cost: Moderate. Requires engineering to wire up logging.
Learnings: Real user data tells you if the AI is good enough for real use.
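The shadow-mode pattern above can be sketched in a few lines: serve the normal experience, run the model silently, and log its output next to a slot for what the user actually did. This is a minimal sketch, not a production design; `StubSuggestionModel`, `handle_task_view`, and the in-memory log are all hypothetical names for illustration.

```python
import json
from datetime import datetime, timezone

class StubSuggestionModel:
    """Stand-in for the real model; suggests finishing the first task."""
    def suggest(self, tasks):
        return [f"Finish '{tasks[0]}' first"] if tasks else []

def handle_task_view(user_id, tasks, model, log):
    """Serve the normal task list; run the AI silently and log its output."""
    record = {
        "ts": datetime.now(timezone.utc).isoformat(),
        "user_id": user_id,
        "suggestions": model.suggest(tasks),
        "actual_next_task": None,  # filled in later from observed behavior
    }
    log.append(json.dumps(record))
    return tasks  # users see only their normal list -- no AI output

# Usage: the user sees their unchanged list; suggestions go only to the log.
log = []
shown = handle_task_view("u1", ["file taxes", "buy milk"], StubSuggestionModel(), log)
```

Later, a batch job can fill in `actual_next_task` from real behavior and score how often the top suggestion matched it.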
Limited Rollout
Limited rollout means you release the feature to a subset of users.
Why use this? You get real usage data from real users in real conditions.
Example: 10% of users get the AI task suggestion feature. 90% do not. After one week, you measure: Do the 10% of users with the feature complete more tasks? Are they happier? Do they use the product more?
If yes, roll out to 50%. If no, fix the feature and try again.
Cost: High. Requires full engineering and rollout infrastructure.
Learnings: You understand real user impact before full launch.
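A common way to implement the 10%/50% split is deterministic hash-based bucketing: the same user always lands in the same bucket, and raising the percentage only adds users, never removes them. A minimal sketch, with the function name and `feature` key chosen for illustration:

```python
import hashlib

def in_rollout(user_id: str, feature: str, percent: int) -> bool:
    """Deterministically assign a user to a rollout bucket in [0, 100).

    Hashing feature + user_id keeps assignment stable across sessions,
    and independent across features.
    """
    digest = hashlib.sha256(f"{feature}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100
    return bucket < percent
```

With this scheme, moving from a 10% to a 50% rollout is just a config change: every user in the 10% group stays in, and new users join.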
Writing an Experiment Brief
Before you run any experiment, write a brief.
Example brief:
"Experiment: AI Task Suggestions
Hypothesis: If we show users AI-generated task suggestions, they will complete more tasks and feel more productive.
Method: A/B test with 30% of users. Treatment gets suggestions. Control does not. Run for 2 weeks.
Success metrics:
- Treatment group completes 20% more tasks than control
- Treatment group retention is higher at day 7
- Survey: 70% of treatment group finds suggestions helpful
Decision rule:
- If all metrics hit targets, roll out to 100%
- If metrics miss, disable feature and iterate
- If suggestions are wrong for a specific user cohort, add guardrails
Timeline: 2 weeks
Owner: [PM name]
Risks: Users may distrust AI suggestions if they are wrong. We will monitor for negative feedback daily."
This brief keeps everyone aligned.
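The three success metrics in the example brief can be checked mechanically once the data is in. A minimal sketch, assuming you have per-user task counts for each group, day-7 retention rates, and the survey's "helpful" ratio (function and argument names are hypothetical):

```python
from statistics import mean

def evaluate_brief(treatment_tasks, control_tasks,
                   treatment_d7, control_d7, helpful_ratio):
    """Check the example brief's three success metrics.

    treatment_tasks / control_tasks: tasks completed per user in each group.
    treatment_d7 / control_d7: day-7 retention rates.
    helpful_ratio: share of surveyed treatment users who found suggestions helpful.
    """
    lift = mean(treatment_tasks) / mean(control_tasks) - 1
    return {
        "tasks_20pct_lift": lift >= 0.20,
        "d7_retention_higher": treatment_d7 > control_d7,
        "70pct_find_helpful": helpful_ratio >= 0.70,
    }
```

The decision rule in the brief then reads directly off this dict: expand only if every value is True. (A real analysis would also test whether the lift is statistically significant, not just whether the point estimate clears 20%.)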
Hypothesis Design
Your hypothesis should be specific, not vague.
Bad hypothesis: "AI features are good."
Good hypothesis: "Users who see AI task suggestions will spend 30% less time planning their day."
Bad hypothesis: "Users will like the feature."
Good hypothesis: "75% of active users will use the feature at least once per week."
Specific hypotheses let you measure success clearly.
Success Criteria vs Guardrails
Success criteria are what you are trying to achieve.
Guardrails are what you must protect against.
Example:
Success criteria: 20% improvement in task completion.
Guardrails: If the feature breaks for users with 200+ tasks, disable it. If accuracy drops below 70%, notify users.
Build guardrails so you can experiment safely.
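Guardrails are easiest to enforce when they are encoded as explicit checks rather than left to daily eyeballing. A minimal sketch of the two example guardrails above; the function name, arguments, and action strings are hypothetical:

```python
def check_guardrails(user_task_count: int, feature_broken: bool,
                     accuracy: float) -> list:
    """Return the actions the two example guardrails require right now."""
    actions = []
    # Guardrail 1: disable for heavy users the feature breaks for.
    if feature_broken and user_task_count >= 200:
        actions.append("disable_feature")
    # Guardrail 2: notify users when accuracy dips below 70%.
    if accuracy < 0.70:
        actions.append("notify_users")
    return actions
```

Running a check like this on every monitoring cycle turns the guardrails into an automatic tripwire instead of a judgment call made under pressure.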
When to Invest More vs Kill the Experiment
After an experiment, you have three choices:
Expand: Metrics hit targets. Roll out to more users.
Iterate: Metrics are close but miss targets. Change something (tweak the AI output, target different users, adjust messaging) and try again.
Kill: Metrics are far from targets or reveal fundamental problems. Cancel the feature and move on.
Here is a decision framework:
Expand if:
- All success metrics hit targets
- No safety issues or negative feedback
- User demand is clear
Iterate if:
- Metrics miss targets but are close (within 10%)
- Feedback shows clear use case but execution is off
- Engineering can quickly improve
Kill if:
- Metrics miss by 30% or more
- Feature breaks for important user segments
- Technical limitations prevent fixing it
- User feedback is negative despite good metrics
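The metric side of this framework can be written down as a small function: a metric "hits" when its result meets its target, and misses are measured as a fraction of the target. This is a sketch of one reasonable encoding, not the only one; it leaves the qualitative signals (user demand, fixability, safety) as explicit inputs, since no metric captures them.

```python
def decide(results: dict, targets: dict,
           safety_issue: bool = False, fixable: bool = True) -> str:
    """Apply the expand/iterate/kill rules to metric results.

    results / targets: metric name -> value; a metric hits when
    result >= target. Misses are measured relative to the target.
    """
    if safety_issue or not fixable:
        return "kill"
    misses = {
        m: (targets[m] - results.get(m, 0)) / targets[m]
        for m in targets
        if results.get(m, 0) < targets[m]
    }
    if not misses:
        return "expand"
    worst = max(misses.values())
    return "kill" if worst >= 0.30 else "iterate"
```

For example, a 20% task-completion target met at 19% is a 5% relative miss (iterate), while 10% against a 20% target is a 50% relative miss (kill).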
Sample Experiment Brief Template
"[Feature Name] Experiment
Hypothesis: [If we do X, then Y will happen]
Method: [A/B test / Shadow mode / Wizard of Oz]. [Sample size]. [Duration].
Primary metric: [Main thing you are trying to improve]
Secondary metrics: [Supporting metrics]
Success criteria: [Exact targets for each metric]
Guardrails: [What could go wrong and when to stop]
Decision rule: [Expand / iterate / kill based on results]
Timeline: [Start date, end date, decision date]
Owner: [Person running experiment]
Risks: [What could go wrong]."
Use this template for every AI experiment. It keeps you honest and your team aligned.