Your AI product experiment reaches statistical significance on day 14 of a planned 30-day run. The experiment asks a causal question: did the LLM-based feature genuinely improve outcomes? Every product manager in the room wants to ship. Your statistician says to wait the full 30 days, or the p-value is invalid.

You wait. On day 30, the effect is still there. But you spent 16 more days running an experiment whose result had already cleared your significance threshold, delaying the next experiment and burning opportunity cost.

The statistician is technically right, if you’re running a classical fixed-sample test. The p-value in a standard t-test is valid only when you commit to a sample size in advance and look at the results exactly once. Peek every day and stop the first time p < 0.05, and your false positive rate climbs toward 30%.

The p-value was designed for a single pre-committed look at a static experiment with a fixed endpoint. Applying it to a live stream where you can check at any point requires a different mathematical object entirely.

Sequential testing was designed for exactly this situation. The mixture Sequential Probability Ratio Test (mSPRT) (Johari et al.) produces always-valid inference using a mathematical object called an e-value: you can check results every day, stop when the evidence is strong enough, and your false positive rate stays at or below 5%.

Netflix has documented the production use of always-valid sequential testing frameworks (Lindon et al.), and the underlying ideas trace back to Wald’s 1945 work on sequential analysis and Ville’s 1939 inequality.

This tutorial makes the connection explicit. You’ll simulate the peeking problem to see the inflated error rate directly, implement a working mSPRT from scratch in Python, apply it to the shared synthetic LLM product dataset, and understand exactly when sequential testing fails.

Companion notebook: every code block in this article runs end-to-end in msprt_demo.ipynb in the companion repo.

Why Optional Stopping Breaks Classical Tests

Peeking at running p-values inflates your false positive rate toward 30%. That’s the number that should give you pause, and you’ll reproduce it in Step 1 below.

The p-value in a classical hypothesis test answers a specific question: given the null is true, what’s the probability of seeing data this extreme when you run the experiment exactly as planned with the sample size you committed to upfront?

The “exactly as planned” clause is the problem. When you check results on day 5, day 10, and day 14, then stop on day 14 because p < 0.05, you haven’t run the experiment you planned. You’ve run three overlapping experiments, looked at the results of each, and stopped at the first one that passed your threshold. The p-value formula doesn’t know that.

Here’s the intuition. Under the null hypothesis (no effect), your p-value bounces around randomly between 0 and 1. It doesn’t stay parked at 0.5. Over a 30-day run with daily looks, a null experiment dips below 0.05 at some point far more often than 5% of the time: in roughly 30% of runs. If you’re watching every day and ready to stop the moment you see p < 0.05, you’ll catch every one of those dips. You’ll declare a winner. But the effect isn’t real.

Looking less often shrinks the problem without removing it, and it isn’t a practical fix anyway. You need to look often: products move fast, and running an experiment 16 days longer than necessary costs real money, delays launches, and burns opportunity cost. You need a test statistic that stays valid regardless of when you stop.

What a Sequential Test Actually Does

The sequential tests in this tutorial handle optional stopping by replacing the p-value with an alternative statistic called an e-value.

An e-value is a nonnegative statistic whose expected value under the null is at most 1, so large values count as evidence against the null. The process formed by e-values over time satisfies a supermartingale property under the null: conditional on the history, the expected next e-value is at most the current one.

This path-level supermartingale condition is what makes optional stopping safe. Having a marginal mean below 1 at each step is necessary but not sufficient: the supermartingale condition is strictly stronger, holding the bound uniformly across all stopping times.

Here’s why. If the e-value process is a nonnegative supermartingale under H0 with starting value at most 1 (E[e_0] ≤ 1), then a classical result called Ville’s inequality gives: the probability that the process ever reaches 1/α is at most α. With α = 0.05 and stopping threshold 1/α = 20, the probability that a null e-value process ever reaches 20 is at most 5%.

That Type I error bound holds no matter when you stop or how many times you check. The guarantee is time-uniform: it covers all possible stopping times simultaneously.
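Stated formally, with $(E_t)_{t \ge 0}$ denoting the e-value process (this notation is introduced here for illustration and is not taken from the original text), Ville’s inequality reads:

```latex
\[
\Pr_{H_0}\!\left( \sup_{t \ge 0} E_t \ge \frac{1}{\alpha} \right)
\;\le\; \alpha \, \mathbb{E}_{H_0}[E_0] \;\le\; \alpha ,
\qquad \text{for any nonnegative supermartingale } (E_t) \text{ with } \mathbb{E}_{H_0}[E_0] \le 1 .
\]
```

The supremum over all $t$ is what makes the bound time-uniform: it controls the whole path, not a single pre-chosen time.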

A classical p-value’s guarantee applies only at the pre-committed sample size. Check repeatedly and the bound dissolves: the fixed-sample p-value carries no time-uniform guarantee.
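To see the time-uniform bound at work, here is a minimal simulation sketch. It deliberately uses the simplest possible e-value process, a simple-vs-simple likelihood ratio for a single Bernoulli stream (null rate 0.5 against a fixed alternative of 0.6), rather than the mSPRT itself. The function name, rates, path count, and horizon are all illustrative assumptions.

```python
import numpy as np


def ville_crossing_rate(n_paths=4_000, n_steps=1_000, p0=0.5, p1=0.6,
                        alpha=0.05, seed=0):
    """Fraction of null paths whose likelihood ratio ever reaches 1/alpha."""
    rng = np.random.default_rng(seed)
    # Draw Bernoulli(p0) outcomes, so the null hypothesis is true on every path.
    x = rng.random((n_paths, n_steps)) < p0
    # Per-observation log likelihood ratio of p1 against p0.
    step = np.where(x, np.log(p1 / p0), np.log((1 - p1) / (1 - p0)))
    # Running log likelihood ratio: a martingale under the null on the raw scale.
    log_lr = np.cumsum(step, axis=1)
    # Monitor continuously: did the path cross the threshold at ANY time?
    return float((log_lr.max(axis=1) >= np.log(1 / alpha)).mean())
```

Calling `ville_crossing_rate()` checks every path after every single observation, which is the most aggressive peeking possible, and the crossing fraction should still come out at or below roughly 5%, up to simulation noise.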

The mSPRT computes the e-value as a Bayes factor: the ratio of the marginal likelihood of the observed data under the alternative to that under the null.

The “mixture” part means you don’t specify a single effect size under H1. You average the likelihood ratio over a prior distribution on effect sizes.

For Bernoulli outcomes (did the task complete: yes or no), placing a Beta(1,1) prior on each arm’s completion rate makes the Bayes factor tractable in closed form using the log-beta function. The math is less intimidating than it looks: the entire computation reduces to four calls to betaln, as Step 2 shows.
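One way to write this closed form, assuming the null model is a single completion rate shared by both arms under the same Beta(a, b) prior (an assumption made here to match the four-term description, so the companion notebook may set it up differently). The symbols $s_T, f_T$ and $s_C, f_C$ are introduced here for the success and failure counts in the treatment and control arms:

```latex
\[
\log \mathrm{BF}
= \ln B(a + s_T,\; b + f_T)
+ \ln B(a + s_C,\; b + f_C)
- \ln B(a + s_T + s_C,\; b + f_T + f_C)
- \ln B(a, b)
\]
```

With the Beta(1,1) prior, $a = b = 1$ and the last term is $\ln B(1,1) = 0$. Each $\ln B(\cdot,\cdot)$ term corresponds to one call to betaln.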

The practical consequence is concrete: accumulate data, compute the running e-value each day, and stop when it crosses 20. If it stays below 20 through your maximum sample size, you fail to reject the null. Check every day, every hour, or every minute. The Type I error rate stays at or below 5%.

Identification Assumptions

mSPRT’s always-valid guarantee rests on four conditions. Each can break, and the failure modes section below maps each failure mode to the condition it violates.

  1. Nonnegative supermartingale property under H0. The e-value process must satisfy E[e_{t+1} | e_1, …, e_t] ≤ e_t under H0. For the Beta-Binomial Bayes factor used here, this holds as long as the prior is proper (Beta(1,1) qualifies) and the observations are i.i.d. within each arm.

  2. Stationarity. The data-generating process must be stationary across the experiment window. If the underlying completion rate shifts mid-experiment due to an unrelated change — a model update, a cohort shift from a marketing campaign, or a day-of-week effect — the e-value picks up noise that your experiment can’t separate from the treatment effect.

  3. Independent observations within each arm. Each user’s outcome must be independent of other users’. Network effects, shared workspaces, or spillover from recommendation systems can violate this.

  4. Prior specification. The Beta(1,1) prior is a modeling assumption. The mSPRT’s power depends on whether the prior places reasonable mass on the true effect size. A badly misspecified prior won’t break the Type I error guarantee, but it can make the e-value grow so slowly that you exhaust your sample budget without crossing the threshold.

Prerequisites

  • Python 3.11+
  • pandas 2.x (pip install pandas)
  • numpy 1.26+ (pip install numpy)
  • scipy 1.12+ (pip install scipy)
  • matplotlib 3.8+ (pip install matplotlib)

Clone the companion repo to get the synthetic dataset:

git clone https://github.com/RudrenduPaul/product-experimentation-causal-inference-genai-llm.git
cd product-experimentation-causal-inference-genai-llm
python data/generate_data.py --seed 42 --n-users 50000 --out data/synthetic_llm_logs.csv

Here’s what’s happening: this clones the repo that contains all 13 companion notebooks for this series, generates the shared 50,000-user synthetic dataset, and saves it to data/synthetic_llm_logs.csv. Every article in the series runs against this same CSV so the methods are directly comparable. The data generator bakes in a +5 percentage-point causal effect on task completion for wave 1 users.

Setting Up the Working Example

The synthetic dataset simulates a SaaS AI assistant product with 50,000 users. The task_completed column records whether the AI successfully completed the user’s task (1) or not (0). The wave column assigns users to groups: wave 1 receives the new AI feature, wave 2 is the holdout control.
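A minimal sketch of splitting the data into the two arms. It assumes the wave column is integer-coded as 1 and 2 (the article names the waves but not their encoding), and `split_arms` is a hypothetical helper name, not something from the companion repo.

```python
import pandas as pd


def split_arms(df: pd.DataFrame):
    """Return (treatment, control) arrays of task_completed outcomes.

    Assumes wave == 1 marks treatment and wave == 2 marks control.
    """
    treatment = df.loc[df["wave"] == 1, "task_completed"].to_numpy(dtype=int)
    control = df.loc[df["wave"] == 2, "task_completed"].to_numpy(dtype=int)
    return treatment, control


# Usage, once the CSV has been generated:
# df = pd.read_csv("data/synthetic_llm_logs.csv")
# treatment, control = split_arms(df)
```

If the wave column turns out to be stored as strings, adjust the two comparisons accordingly.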

Step 1: Simulate the Peeking Problem

This step reproduces the inflated false positive rate empirically. You run 10,000 null experiments — where the treatment and control arms are drawn from the same distribution — and apply repeated peeking with a classical p-value at each step. The result demonstrates that peeking pushes the false positive rate from the nominal 5% to roughly 30%, matching the theoretical prediction.
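The simulation can be sketched as follows. The 10,000 experiments come from the text. The 30 daily looks, 100 users per arm per look, and the 0.5 base rate are assumptions chosen for illustration, and the pooled two-proportion z-test stands in for whichever classical test the notebook uses.

```python
import numpy as np
from scipy import stats


def peeking_false_positive_rate(n_experiments=10_000, n_looks=30,
                                users_per_look=100, base_rate=0.5,
                                alpha=0.05, seed=0):
    """Return (FPR when stopping at the first p < alpha, FPR at the final look)."""
    rng = np.random.default_rng(seed)
    shape = (n_experiments, n_looks)
    # Cumulative successes per arm; both arms share base_rate, so H0 is true.
    a = rng.binomial(users_per_look, base_rate, size=shape).cumsum(axis=1)
    b = rng.binomial(users_per_look, base_rate, size=shape).cumsum(axis=1)
    # Cumulative users per arm at each look.
    n = users_per_look * np.arange(1, n_looks + 1)
    # Pooled two-proportion z-test evaluated at every look.
    pooled = (a + b) / (2 * n)
    se = np.sqrt(pooled * (1 - pooled) * 2 / n)
    z = (a - b) / n / se
    p = 2 * stats.norm.sf(np.abs(z))
    peeking = float((p < alpha).any(axis=1).mean())   # reject at ANY look
    final_only = float((p[:, -1] < alpha).mean())     # reject at the last look only
    return peeking, final_only
```

The second return value is the control: looking once at the end should sit near the nominal 5%, while the first should land far above it.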

Step 2: Implement the mSPRT E-Value

The mSPRT e-value for Bernoulli data with a Beta(1,1) prior reduces to a ratio of beta functions evaluated at the cumulative successes and failures in each arm. As each observation arrives, you update the counts and recompute the Bayes factor, which is equivalent to multiplying the running e-value by the one-step predictive likelihood ratio. The computation uses scipy.special.betaln to work in log space for numerical stability, then exponentiates at the end.
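A minimal sketch of that computation, assuming the null model is one completion rate shared by both arms under the same Beta(a, b) prior, and the alternative gives each arm its own independent rate. That setup is an assumption made here to produce a four-term formula, and `log_e_value` is a hypothetical name.

```python
import numpy as np
from scipy.special import betaln


def log_e_value(s_t, f_t, s_c, f_c, a=1.0, b=1.0):
    """Log Bayes factor from cumulative successes (s) and failures (f) per arm.

    Alternative: independent Beta(a, b) priors on each arm's rate.
    Null (assumed here): one shared rate with a Beta(a, b) prior.
    """
    return (betaln(a + s_t, b + f_t)                    # treatment arm marginal
            + betaln(a + s_c, b + f_c)                  # control arm marginal
            - betaln(a + s_t + s_c, b + f_t + f_c)      # pooled null marginal
            - betaln(a, b))                             # prior normalizer


# Exponentiate only at the end, and compare against 1/alpha = 20:
# e_value = np.exp(log_e_value(s_t, f_t, s_c, f_c))
```

Because betaln is vectorized, the same function accepts arrays of cumulative counts and returns the whole running log e-value path in one call.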

Step 3: Apply mSPRT to the Synthetic Dataset

Load synthetic_llm_logs.csv, filter to wave 1 (treatment) and wave 2 (control), and stream the task_completed outcomes through the mSPRT update function. Plot the running e-value over users processed. When the e-value crosses the threshold of 20, the test stops and rejects the null. With a true +5 percentage-point effect baked into the data, the e-value should cross the threshold well before the full sample is exhausted.
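The streaming step can be sketched as below. It assumes users arrive in matched pairs, one per arm, so both arms have the same count at every step. That pairing is a simplification made here, not something the dataset guarantees. The log Bayes factor uses the shared-rate null assumed for illustration, and `msprt_stopping_point` is a hypothetical name.

```python
import numpy as np
from scipy.special import betaln


def msprt_stopping_point(treatment, control, alpha=0.05, a=1.0, b=1.0):
    """Return (stop, log_e): first per-arm sample size where e >= 1/alpha, or None."""
    t = np.asarray(treatment)
    c = np.asarray(control)
    n = min(t.size, c.size)                 # process matched pairs only
    s_t = np.cumsum(t[:n])                  # cumulative treatment successes
    s_c = np.cumsum(c[:n])                  # cumulative control successes
    k = np.arange(1, n + 1)                 # users seen per arm so far
    log_e = (betaln(a + s_t, b + k - s_t)
             + betaln(a + s_c, b + k - s_c)
             - betaln(a + s_t + s_c, b + 2 * k - s_t - s_c)
             - betaln(a, b))
    crossed = np.flatnonzero(log_e >= np.log(1 / alpha))
    stop = int(crossed[0]) + 1 if crossed.size else None
    return stop, log_e
```

Plotting `np.exp(log_e)` against users processed, with a horizontal line at 20 and a log-scaled y-axis, gives the running e-value chart described above.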

Step 4: Compare Power Against a Fixed-Sample Test

Run a two-proportion z-test on the same dataset using the full sample. Compare the sample size at which the mSPRT crossed its threshold against the full sample the z-test required. The mSPRT typically stops earlier when the true effect is present, quantifying the opportunity cost savings from sequential testing.
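A minimal sketch of the fixed-sample comparison, using a pooled-variance two-proportion z-test written directly in NumPy and SciPy. The function name is hypothetical.

```python
import numpy as np
from scipy import stats


def two_proportion_ztest(treatment, control):
    """Two-sided pooled z-test for a difference in completion rates."""
    t = np.asarray(treatment)
    c = np.asarray(control)
    # Pooled completion rate under the null of no difference.
    pooled = (t.sum() + c.sum()) / (t.size + c.size)
    se = np.sqrt(pooled * (1 - pooled) * (1 / t.size + 1 / c.size))
    z = (t.mean() - c.mean()) / se
    p_value = 2 * stats.norm.sf(abs(z))
    return float(z), float(p_value)
```

Run it once on the full sample, then compare that sample size with the point at which the sequential test stopped.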

Validate Against Ground Truth

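The data generator sets the true average treatment effect at +5 percentage points on task_completed. After the mSPRT stops, compute the raw difference in completion rates between the two arms and confirm it lands close to 0.05. Expect the estimate at the stopping point to run somewhat high, because stopping on a threshold crossing selects for runs where the observed difference happens to be large. This sanity check confirms the test stopped on a real signal rather than noise.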
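The check itself is a one-liner. The sketch below assumes the arms are NumPy arrays in arrival order and that `stop` is the per-arm sample size at which the test stopped. Both names are hypothetical.

```python
import numpy as np


def observed_lift(treatment, control, stop=None):
    """Difference in completion rates, using the first `stop` users per arm."""
    t = np.asarray(treatment)[:stop]   # stop=None keeps the full arm
    c = np.asarray(control)[:stop]
    return float(t.mean() - c.mean())
```

Comparing `observed_lift(treatment, control, stop)` with `observed_lift(treatment, control)` shows how much the estimate moves between the stopping point and the full sample.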

Step 5: Bootstrap Confidence Intervals

After stopping, resample the observed arm-level outcomes with replacement 2,000 times and recompute the difference in completion rates each time. The 2.5th and 97.5th percentiles of the bootstrap distribution give a 95% confidence interval for the effect size. This is a separate inference step: the mSPRT tells you when to stop, and the bootstrap interval tells you the magnitude and uncertainty of the effect. The interval is computed on the stopped sample and does not inherit the mSPRT’s always-valid guarantee.
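A minimal sketch of the percentile bootstrap. For binary outcomes, resampling n observations with replacement is the same as drawing the success count from a Binomial(n, observed rate), so the sketch uses that shortcut instead of materializing 2,000 resampled arrays. The function name and seed are illustrative.

```python
import numpy as np


def bootstrap_ci(treatment, control, n_boot=2_000, level=0.95, seed=0):
    """Percentile bootstrap CI for the difference in completion rates."""
    rng = np.random.default_rng(seed)
    t = np.asarray(treatment)
    c = np.asarray(control)
    # Resampled completion rates for each arm, drawn independently.
    t_rates = rng.binomial(t.size, t.mean(), size=n_boot) / t.size
    c_rates = rng.binomial(c.size, c.mean(), size=n_boot) / c.size
    diffs = t_rates - c_rates
    tail = (1 - level) / 2
    lo, hi = np.quantile(diffs, [tail, 1 - tail])
    return float(lo), float(hi)
```

Pass in only the observations collected up to the stopping point, so the interval describes the data the decision was based on.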

When mSPRT Fails

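The first three failure modes below correspond directly to identification assumptions above (stationarity, independence, and prior specification). The fourth, delayed outcomes, is a data-pipeline problem rather than a modeling one.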

Non-stationary data. If completion rates shift mid-experiment due to a model rollout or a seasonal cohort, the e-value accumulates signal that isn’t attributable to your feature. The test may cross the threshold for the wrong reason, or fail to cross it even when your feature works.

Correlated observations. Users who share a workspace or influence each other’s behavior violate the independence assumption. The effective sample size is smaller than the row count suggests, and the e-value grows faster than it should under the null, inflating Type I error.

Severe prior misspecification. If the true effect is an order of magnitude smaller than the differences your Beta(1,1) priors treat as plausible, the Bayes factor grows slowly. You may exhaust your sample budget without crossing 20 even when the effect is real, producing a false negative.

Delayed outcomes. If task_completed is only recorded days after the session — common in products where success requires a follow-up action — streaming the e-value update in real time will use incomplete outcome data. The e-value computation assumes each observation is final when it arrives.
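One possible guard, sketched under loud assumptions: the logs carry a per-session timestamp column (called `session_time` here, a hypothetical name that may not exist in the synthetic dataset), and outcomes are treated as final once a fixed maturity window (7 days here, also an assumption) has elapsed. Only matured rows are passed to the e-value update.

```python
import pandas as pd


def matured_observations(df, as_of, maturity_days=7, time_col="session_time"):
    """Keep rows whose session is at least `maturity_days` older than `as_of`.

    `time_col` and the 7-day default are illustrative assumptions.
    """
    cutoff = pd.Timestamp(as_of) - pd.Timedelta(days=maturity_days)
    return df[pd.to_datetime(df[time_col]) <= cutoff]
```

The cost is latency: the e-value at any moment reflects the experiment as it stood one maturity window earlier.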

What to Do Next

The mSPRT gives you a principled way to stop experiments early without inflating your false positive rate, grounded in Ville’s inequality and the supermartingale structure of e-values. The practical workflow is straightforward: stream outcomes, update the running e-value, stop when it crosses 1/α.

The next natural extensions are handling continuous outcomes (Gaussian mSPRT), incorporating covariate adjustment to improve power, and integrating sequential testing into a multi-armed bandit framework where you’re simultaneously allocating traffic and testing. Each of those builds on the same e-value foundation introduced here.