Stratified data: Stratified PPI++¶
Stratified PPI++ extends PPI++ to datasets naturally partitioned into strata (e.g. by region, language, or data source). It adapts the bias-correction per stratum, yielding narrower confidence intervals when strata differ in proxy quality or hallucination rates. This guide walks through a realistic scenario where your global product handles conversations from different regions.
In many real-world scenarios, your data naturally falls into distinct groups or strata. These might be:
- Geographic regions (like our example: NA, EU, Asia)
- Languages (English, French, Mandarin, etc.)
- Data sources (in-house data vs. partner data)
- Time periods (model versions deployed at different times)
- Customer segments (free tier vs. premium)
Key insight: When strata exist and proxy quality differs across them, ignoring strata wastes statistical precision. Accounting for strata yields more precise estimates.
What you will learn:
- Why LLM-as-Judge metrics are regionally biased
- How Stratified PPI++ adapts to different strata simultaneously
The problem: your LLM judge has different biases in different regions¶
Let's assume you run a global customer-facing assistant handling conversations in three regions: North America, Europe, and Asia.
The signals¶
You deploy an LLM judge to measure hallucination rates per region. However, you notice something troubling:
| Region | Judge Reports | User Complaints | Discrepancy |
|---|---|---|---|
| North America | 2% | 5% | Judge too optimistic by 3 pp |
| Europe | 4% | 7% | Judge too optimistic by 3 pp |
| Asia | 5% | 12% | Judge too optimistic by 7 pp |
The bias is regional: the judge systematically underestimates hallucinations, but the severity varies. Asia (likely due to language and translation nuances) sees the worst judge performance.
The challenge¶
You have:
- ~4,400 LLM judgements (cheap, fast, but regionally biased)
- ~400 human annotations spread across three regions (accurate, but sparse)
Bias correction is necessary in order to combine both data sources into statistically valid estimates.
Stratified PPI++ additionally partitions data by region, adapting the bias correction to each region's signal characteristics, and producing a reliable global estimate that respects regional differences.
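For orientation, the estimator can be written as follows. This is the form commonly used in the PPI++ literature and is an assumption here: GLIDE's exact formulation may differ. In stratum $k$, let $y_{k,i}$ and $f_{k,i}$ be the human and judge labels on the $n_k$ annotated conversations, $\tilde f_{k,j}$ the judge labels on the $N_k$ conversations without annotation, and $w_k$ the population weight of the stratum.

$$\hat\mu = \sum_{k=1}^{K} w_k\,\hat\mu_k, \qquad \hat\mu_k = \frac{1}{n_k}\sum_{i=1}^{n_k} y_{k,i} + \lambda_k\left(\frac{1}{N_k}\sum_{j=1}^{N_k} \tilde f_{k,j} - \frac{1}{n_k}\sum_{i=1}^{n_k} f_{k,i}\right)$$

The tuning parameter $\lambda_k$ controls how much the judge is trusted in stratum $k$: with $\lambda_k = 0$ the stratum falls back to the human-only mean, and with $\lambda_k = 1$ it reduces to the original PPI estimator.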
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
from glide.estimators import StratifiedClassicalMeanEstimator, StratifiedPPIMeanEstimator
from glide.samplers import StratifiedSampler
from glide.simulators import generate_stratified_binary_dataset, simulate_annotation
# ── Colour palette ──────────────────────────────────────────
C_JUDGE = "#E74C3C" # LLM judge — red-orange
C_HUMAN = "#2E86AB" # Human-only — blue
C_GLIDE = "#27AE60" # GLIDE — green
C_TRUTH = "#2C3E50" # True value — dark slate
C_NA = "#3498DB" # North America — lighter blue
C_EU = "#E67E22" # Europe — orange
C_ASIA = "#E74C3C" # Asia — red
# ── Global plot style ───────────────────────────────────────
plt.rcParams.update(
{
"figure.facecolor": "white",
"axes.facecolor": "#FAFAFA",
"axes.grid": True,
"grid.color": "#E5E5E5",
"grid.linewidth": 0.8,
"font.size": 18,
"axes.labelsize": 18,
"axes.titlesize": 18,
"legend.fontsize": 16,
"xtick.labelsize": 16,
"ytick.labelsize": 16,
"figure.titlesize": 19,
}
)
Sampling data: partitioning by region¶
To apply stratified inference, we first partition our dataset into strata — one per region. Each stratum contains:
- A small number of manually annotated conversations (labeled)
- A larger number of LLM-judged only conversations (proxy signal)
The simulation replicates the practical workflow: proxy labels are generated for every conversation, a random subset is selected per stratum for human annotation, and the remaining ground-truth labels are masked with np.nan to reflect what would be unobserved in practice.
- Generate synthetic data with generate_stratified_binary_dataset: proxy labels cover all samples.
- Sample a subset for annotation per stratum with StratifiedSampler.
- Annotate selected samples and mask the rest via simulate_annotation.
The table below summarizes the arrays produced and what each contains.
| Array | Meaning |
|---|---|
| y_true_oracle | Ground-truth labels for all conversations (revealed only after annotation) |
| y_proxy | Proxy labels for all samples |
| groups | Integer stratum identifiers: 0 = North America, 1 = Europe, 2 = Asia |
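To make the sampling and masking steps concrete, here is a minimal NumPy sketch of proportional allocation followed by NaN masking. The helpers `proportional_allocation` and `mask_unlabeled` are hypothetical names introduced for illustration; they are not part of GLIDE, and `StratifiedSampler` may round or select differently.

```python
import numpy as np


def proportional_allocation(stratum_sizes, budget):
    """Split an annotation budget across strata in proportion to their size."""
    sizes = np.asarray(stratum_sizes, dtype=float)
    raw = budget * sizes / sizes.sum()
    alloc = np.floor(raw).astype(int)
    # Hand out units lost to flooring to the strata with the largest remainders.
    remainder = int(budget - alloc.sum())
    order = np.argsort(-(raw - alloc))
    alloc[order[:remainder]] += 1
    return alloc


def mask_unlabeled(y_oracle, groups, allocation, seed=0):
    """Keep ground truth for a random subset per stratum; set the rest to NaN."""
    rng = np.random.default_rng(seed)
    y_oracle = np.asarray(y_oracle, dtype=float)
    y_observed = np.full(len(y_oracle), np.nan)
    for stratum, n_labeled in enumerate(allocation):
        idx = np.flatnonzero(np.asarray(groups) == stratum)
        chosen = rng.choice(idx, size=n_labeled, replace=False)
        y_observed[chosen] = y_oracle[chosen]
    return y_observed


# Stratum sizes and budget used in this guide.
allocation = proportional_allocation([1100, 1650, 1650], budget=400)
print(allocation)  # [100 150 150]
```

With the stratum sizes of this guide, a budget of 400 splits into 100, 150, and 150 annotations, which matches the per-region counts printed by the notebook.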
true_means_by_stratum = np.array([0.05, 0.07, 0.12]) # NA, EU, Asia
proxy_means_by_stratum = np.array([0.02, 0.04, 0.05])
correlations_by_stratum = np.array([0.6, 0.65, 0.60])
population_sizes_by_stratum = np.array([1000, 1500, 1500])
stratum_weights = population_sizes_by_stratum / (population_sizes_by_stratum.sum())
TRUE_RATE = np.average(true_means_by_stratum, weights=stratum_weights)
CONFIDENCE_LEVEL = 0.95
CONFIDENCE_LEVEL_PCT = int(CONFIDENCE_LEVEL * 100)
RANDOM_SEED = 5
N_TOTAL_BY_STRATUM = [1100, 1650, 1650]
N_LABELED = 400
# Generate stratified data: 3 regions with different hallucination rates and judge biases
y_true_oracle, y_proxy, groups = generate_stratified_binary_dataset(
n_total=N_TOTAL_BY_STRATUM,
true_mean=true_means_by_stratum.tolist(),
proxy_mean=proxy_means_by_stratum.tolist(),
correlation=correlations_by_stratum.tolist(),
random_seed=RANDOM_SEED,
)
# Sample annotations proportionally across regions
xi = StratifiedSampler().sample(y_proxy, groups, budget=N_LABELED, strategy="proportional", random_seed=RANDOM_SEED)
y_true = simulate_annotation(y_true_oracle, xi)
# Label strata for readability
region_ids_to_names = {0: "North America", 1: "Europe", 2: "Asia"}
region_names_to_ids = {v: k for k, v in region_ids_to_names.items()}
region_counts = np.bincount(groups.astype(int), minlength=len(region_ids_to_names))
# Count labeled samples per region
labeled_mask = ~np.isnan(y_true)
label_region_counts = np.bincount(groups[labeled_mask].astype(int), minlength=len(region_ids_to_names))
table_width = 75
print("=" * table_width)
print("Stratified data: Partitioned by Region")
print("=" * table_width)
print()
total_size = len(y_true)
labeled_count = np.sum(~np.isnan(y_true))
unlabeled_count = total_size - labeled_count
print(f"Total samples : {total_size:,}")
print(f"Labeled samples : {labeled_count:,}")
print(f"Unlabeled samples: {unlabeled_count:,}")
print()
print("Distribution by region (stratum):")
print("-" * table_width)
for region_id in range(len(region_ids_to_names)):
region_name = region_ids_to_names[region_id]
total = region_counts[region_id]
unlabeled_in_region = total - label_region_counts[region_id]
print(
f" {region_name:<15} | Total: {total:5,} | "
f"Labeled: {label_region_counts[region_id]:3} | Unlabeled: {unlabeled_in_region:5,}"
)
print("-" * table_width)
===========================================================================
Stratified data: Partitioned by Region
===========================================================================

Total samples    : 4,400
Labeled samples  : 400
Unlabeled samples: 4,000

Distribution by region (stratum):
---------------------------------------------------------------------------
  North America   | Total: 1,100 | Labeled: 100 | Unlabeled: 1,000
  Europe          | Total: 1,650 | Labeled: 150 | Unlabeled: 1,500
  Asia            | Total: 1,650 | Labeled: 150 | Unlabeled: 1,500
---------------------------------------------------------------------------
Two naive strategies both fail¶
We compare two naive baselines against Stratified PPI++:
Option A: use the LLM judge across all regions. Fast and cheap (~4,400 conversations), but regionally biased. It underestimates hallucinations everywhere, and the bias is worst in Asia.
Option B: trust only human annotations. Unbiased, but sparse (~400 labeled conversations). The small sample size leads to wide confidence intervals.
Stratified PPI++, introduced next, partitions by region, computes a tailored bias correction per region, and combines estimates with population-proportional weights — giving you unbiased estimates and narrow confidence intervals by respecting regional data structure.
# Option A: LLM judge on all conversations
judge_estimate = StratifiedClassicalMeanEstimator().estimate(y_proxy, groups, confidence_level=CONFIDENCE_LEVEL)
# Option B: Human labels only, small sample
human_estimate = StratifiedClassicalMeanEstimator().estimate(y_true, groups, confidence_level=CONFIDENCE_LEVEL)
# Extract CI bounds for cleaner formatting
judge_lo = judge_estimate.confidence_interval.lower_bound
judge_hi = judge_estimate.confidence_interval.upper_bound
human_lo = human_estimate.confidence_interval.lower_bound
human_hi = human_estimate.confidence_interval.upper_bound
sep = "-" * 75
print(f"{'Method':<38} {'Estimate':>8} {CONFIDENCE_LEVEL_PCT}{'% Confidence Interval':>18}")
print(sep)
print(f"{'Option A: LLM judge (all regions)':<38} {judge_estimate.mean:>7.1%} [{judge_lo:.1%}, {judge_hi:.1%}]")
print(f"{'Option B: Human-only (small sample)':<38} {human_estimate.mean:>7.1%} [{human_lo:.1%}, {human_hi:.1%}]")
print(sep)
print(f"{'True rate (simulation)':<38} {TRUE_RATE:>7.1%}")
print(sep)
Method                                 Estimate 95% Confidence Interval
---------------------------------------------------------------------------
Option A: LLM judge (all regions)         3.9% [3.3%, 4.5%]
Option B: Human-only (small sample)       7.0% [4.5%, 9.5%]
---------------------------------------------------------------------------
True rate (simulation)                    8.4%
---------------------------------------------------------------------------
Note that, even when taking regional structure into account, Option A (LLM judge) is wrong because of its bias: its interval [3.3%, 4.5%] misses the true rate of 8.4% entirely. Option B produces a valid result (its interval contains the true rate), but the interval remains wide.
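Both baselines rely on a stratified classical mean. The sketch below shows one textbook way to compute it: a weighted mean of per-stratum sample means with a normal-approximation interval. The function `stratified_mean_ci` and its explicit `weights` argument are assumptions made for illustration; `StratifiedClassicalMeanEstimator` may derive weights and variances differently.

```python
import numpy as np
from statistics import NormalDist


def stratified_mean_ci(y, groups, weights, confidence_level=0.95):
    """Weighted mean of per-stratum sample means with a normal-approximation CI.

    NaN entries in `y` are treated as unlabeled and ignored.
    """
    y = np.asarray(y, dtype=float)
    groups = np.asarray(groups)
    mean, variance = 0.0, 0.0
    for stratum, w in enumerate(weights):
        values = y[(groups == stratum) & ~np.isnan(y)]
        mean += w * values.mean()
        # Variance of a sample mean, scaled by the squared stratum weight.
        variance += w**2 * values.var(ddof=1) / len(values)
    z = NormalDist().inv_cdf(0.5 + confidence_level / 2)
    half_width = z * np.sqrt(variance)
    return mean, (mean - half_width, mean + half_width)


# Toy data: two equally weighted strata, one unlabeled (NaN) entry.
toy_y = np.array([0, 1, 0, 1, np.nan, 1, 1, 0, 1])
toy_groups = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1])
toy_mean, toy_ci = stratified_mean_ci(toy_y, toy_groups, weights=[0.5, 0.5])
print(f"{toy_mean:.3f}")  # 0.625
```

Applied to the proxy labels, this kind of estimator gives a tight interval around the wrong value; applied to the 400 human labels, it gives a valid but wide one.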
The root cause: the LLM judge is biased everywhere, worst in Asia¶
Per-region analysis shows the problem clearly: the judge underestimates hallucinations in all regions, but the bias is especially severe in Asia.
judge_means = []
human_means = []
biases = []
# Filter to labeled samples only
labeled_mask = ~np.isnan(y_true)
y_true_filtered = y_true[labeled_mask]
y_proxy_labeled = y_proxy[labeled_mask]
groups_labeled = groups[labeled_mask]
for region_id in range(len(region_ids_to_names)):
region_mask = groups_labeled == region_id
judge_mean = np.mean(y_proxy_labeled[region_mask])
human_mean = np.mean(y_true_filtered[region_mask])
bias = judge_mean - human_mean
judge_means.append(judge_mean)
human_means.append(human_mean)
biases.append(bias)
x_pos = np.arange(len(region_ids_to_names))
bar_width = 0.35
colors_list = [C_NA, C_EU, C_ASIA]
region_labels = ["North America", "Europe", "Asia"]
fig, ax = plt.subplots(figsize=(10, 5))
# Plot bars
ax.bar(
x_pos - bar_width / 2,
judge_means,
bar_width,
label="LLM Judge (on annotated subset)",
color=C_JUDGE,
edgecolor="white",
linewidth=2,
zorder=3,
)
ax.bar(
x_pos + bar_width / 2,
human_means,
bar_width,
label="Human Annotation (ground truth)",
color=C_HUMAN,
edgecolor="white",
linewidth=2,
zorder=3,
)
# Annotate biases
for i, (j_mean, h_mean, bias) in enumerate(zip(judge_means, human_means, biases)):
max_y = max(j_mean, h_mean)
ax.annotate(
"",
xy=(i + bar_width / 2 + 0.05, h_mean + 0.01),
xytext=(i - bar_width / 2 - 0.05, j_mean + 0.01),
arrowprops=dict(arrowstyle="<->", color="#666666", lw=2),
)
ax.text(
i,
max_y + 0.04,
f"Bias={bias:+.1%}",
ha="center",
fontsize=11,
bbox=dict(boxstyle="round,pad=0.3", facecolor="#FFFDE7", edgecolor="#CCCCCC"),
)
ax.set_xticks(x_pos)
ax.set_xticklabels(region_labels)
ax.set_ylabel("Hallucination Rate")
ax.set_ylim(0, 0.20)
ax.yaxis.set_major_formatter(plt.FuncFormatter(lambda v, _: f"{v:.0%}"))
ax.spines[["top", "right"]].set_visible(False)
ax.legend(loc="upper left")
plt.tight_layout()
plt.show()
print("Per-region bias (Judge - Human):")
for region_id, region_label in region_ids_to_names.items():
print(f" {region_label:<15} : {biases[region_id]:+.1%}")
Per-region bias (Judge - Human):
  North America   : -1.0%
  Europe          : -3.3%
  Asia            : -5.3%
Stratified PPI++: adaptive bias correction per region¶
StratifiedPPIMeanEstimator partitions your set of samples into strata and computes the bias correction independently within each region:
- Partition the samples by stratum_id (region)
- Measure per-region bias: in each region's labeled subset, compute the gap between judge and human annotations
- Adapt to each region's data quality: regions with worse judge quality (like Asia) get a more conservative estimate; regions with better judge quality leverage more of the proxy signal
- Combine with population weights: the final global estimate is a weighted mean of per-region estimates
This way, Asia's worse judge performance doesn't drag down the estimate for North America and Europe — each region's bias correction is tailored to its own signal characteristics.
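The steps above can be sketched in plain NumPy. This is an illustrative reimplementation under stated assumptions, not GLIDE's code: the function name `stratified_ppi_mean` is hypothetical, the tuning rule follows the PPI++ formula for mean estimation, the clipping of the tuning parameter to [0, 1] is a choice made here, and stratum weights are taken as each stratum's share of all samples.

```python
import numpy as np
from statistics import NormalDist


def stratified_ppi_mean(y_true, y_proxy, groups, confidence_level=0.95):
    """Sketch of a stratified PPI++ mean with per-stratum power tuning.

    `y_true` holds NaN wherever no human label exists.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_proxy = np.asarray(y_proxy, dtype=float)
    groups = np.asarray(groups)
    mean, variance = 0.0, 0.0
    for stratum in np.unique(groups):
        in_stratum = groups == stratum
        labeled = in_stratum & ~np.isnan(y_true)
        unlabeled = in_stratum & np.isnan(y_true)
        y, f_lab, f_unl = y_true[labeled], y_proxy[labeled], y_proxy[unlabeled]
        n, big_n = len(y), len(f_unl)
        weight = in_stratum.mean()  # population-proportional weight

        # Power tuning: lambda near 0 ignores the proxy, near 1 trusts it fully.
        var_f = np.var(np.concatenate([f_lab, f_unl]), ddof=1)
        cov_yf = np.cov(y, f_lab, ddof=1)[0, 1]
        lam = 0.0 if var_f == 0 else float(np.clip(cov_yf / ((1 + n / big_n) * var_f), 0, 1))

        # Human mean, shifted by the scaled gap between proxy means.
        stratum_mean = y.mean() + lam * (f_unl.mean() - f_lab.mean())
        stratum_var = np.var(y - lam * f_lab, ddof=1) / n + lam**2 * np.var(f_unl, ddof=1) / big_n
        mean += weight * stratum_mean
        variance += weight**2 * stratum_var
    z = NormalDist().inv_cdf(0.5 + confidence_level / 2)
    half = z * np.sqrt(variance)
    return mean, (mean - half, mean + half)


# Small synthetic demo: two strata, a judge that misses 40% of hallucinations.
rng = np.random.default_rng(0)
demo_groups = np.repeat([0, 1], [500, 500])
demo_truth = rng.binomial(1, np.where(demo_groups == 0, 0.05, 0.12)).astype(float)
demo_proxy = demo_truth * rng.binomial(1, 0.6, size=demo_truth.size)
demo_observed = demo_truth.copy()
demo_observed[rng.random(demo_truth.size) > 0.2] = np.nan  # keep about 20% of labels
print(stratified_ppi_mean(demo_observed, demo_proxy, demo_groups))
```

Because the tuning parameter is computed per stratum, a region where judge and human labels barely co-vary gets a value near zero and effectively falls back to its human-only mean, while a region with a well-correlated judge borrows more from the proxy.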
estimator = StratifiedPPIMeanEstimator()
stratified_result = estimator.estimate(
y_true,
y_proxy,
groups,
metric_name="Global Hallucination Rate",
confidence_level=CONFIDENCE_LEVEL,
power_tuning=True,
)
print(stratified_result.summary())
Metric: Global Hallucination Rate
Point Estimate: 0.070
Confidence Interval (95%): [0.050, 0.090]
Estimator : StratifiedPPIMeanEstimator
n_true: 400
n_proxy: 4400
Effective Sample Size: 603
The variance reduction achieved by Stratified PPI++ is quantified by the effective sample size displayed above: the annotation budget that would be required to reach the same variance using human labels only. Here, 400 human labels combined with the judge labels are worth about 603 human-only labels.
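As a rough sanity check on that number, the effective sample size can be approximated from interval widths alone, since variance scales as 1/n and interval width as the square root of variance. The helper `effective_sample_size` below is a hypothetical illustration and not necessarily how GLIDE computes the reported value.

```python
def effective_sample_size(n_labeled, classical_ci, ppi_ci):
    """Labels a human-only estimator would need to match the PPI interval width."""
    classical_width = classical_ci[1] - classical_ci[0]
    ppi_width = ppi_ci[1] - ppi_ci[0]
    # Variance ratio equals the squared ratio of interval widths.
    return n_labeled * (classical_width / ppi_width) ** 2


# Rounded intervals printed in this guide: human-only vs Stratified PPI++.
approx_ess = effective_sample_size(400, (0.045, 0.095), (0.050, 0.090))
print(round(approx_ess))  # 625
```

The approximation lands in the same range as the reported 603; the gap is consistent with the interval bounds being rounded to one decimal place before this calculation.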
How Stratified PPI++ reduces uncertainty by exploiting regional structure¶
The plot below compares all estimators. Note that Stratified PPI++ provides a valid estimate with the narrowest confidence interval among the valid ones.
# Build estimates list
n_total = len(y_true)
n_labeled = np.sum(~np.isnan(y_true))
estimates = [
(
f"Option A — LLM judge\n({n_total:,} | raw proxy)",
judge_estimate.mean,
judge_estimate.confidence_interval.lower_bound,
judge_estimate.confidence_interval.upper_bound,
C_JUDGE,
),
(
f"Option B — Human annotations\n({n_labeled:,} | small sample)",
human_estimate.mean,
human_estimate.confidence_interval.lower_bound,
human_estimate.confidence_interval.upper_bound,
C_HUMAN,
),
(
f"Stratified PPI++\n({stratified_result.n_true} + {stratified_result.n_proxy})",
stratified_result.mean,
stratified_result.confidence_interval.lower_bound,
stratified_result.confidence_interval.upper_bound,
C_GLIDE,
),
]
fig, ax = plt.subplots(figsize=(12, 6))
y_pos = [2, 1, 0]
for y, (label, mean, lo, hi, color) in zip(y_pos, estimates):
# Confidence interval line
ax.plot([lo, hi], [y, y], color=color, linewidth=5, solid_capstyle="round", zorder=3)
# Cap marks
for xc in [lo, hi]:
ax.plot([xc, xc], [y - 0.2, y + 0.2], color=color, linewidth=3, zorder=3)
# Point estimate
ax.scatter(mean, y, s=250, color=color, zorder=5, edgecolors="white", linewidths=2.5)
# Value label above
ax.text(mean, y + 0.38, f"{mean:.1%}", ha="center", va="bottom", fontsize=12, color=color, fontweight="bold")
# Confidence interval bounds below
ax.text(mean, y - 0.38, f"[{lo:.1%}, {hi:.1%}]", ha="center", va="top", fontsize=10, color="#888888")
# True value
ax.axvline(TRUE_RATE, color=C_TRUTH, linestyle="--", linewidth=2.5, zorder=4)
ax.text(TRUE_RATE + 0.004, 2.75, f"True rate {TRUE_RATE:.1%}", color=C_TRUTH, fontsize=11, fontweight="bold")
ax.set_yticks(y_pos)
ax.set_yticklabels([e[0] for e in estimates])
ax.xaxis.set_major_formatter(plt.FuncFormatter(lambda v, _: f"{v:.0%}"))
ax.set_xlabel("Estimated Global Hallucination Rate")
ax.set_xlim(-0.01, 0.22)
ax.set_ylim(-0.8, 3.2)
ax.spines[["top", "right", "left"]].set_visible(False)
ax.tick_params(left=False)
plt.tight_layout()
plt.show()
Testing hypothesis: is the true rate significantly above 5%?¶
We would like to test whether our AI system's global hallucination rate is below a business tolerance of 5%.
GLIDE's estimation output enables a formal hypothesis test: is the true hallucination rate within the fixed limit?
$$H_0 : \mu \leq 5\% \qquad H_1 : \mu > 5\%$$
Without Stratified PPI++, one option is to rely on the LLM judge estimate alone, but its confidence interval is misleading: it sits entirely below the threshold, so we would almost never reject the null hypothesis. The other option, human annotations only, yields a wide confidence interval and therefore conservative decisions.
threshold = 0.05
threshold_pct = 100 * threshold
z_stat, p_value, _ = stratified_result.confidence_interval.test_null_hypothesis(
h0_value=threshold,
alternative="larger",
)
sep = "-" * 60
print("Hypothesis Test — Stratified PPI++ Estimator")
print(sep)
print(f"H0 : global hallucination rate <= {threshold_pct:.1f}%")
print(f"H1 : global hallucination rate > {threshold_pct:.1f}%")
print()
print(f"z-statistic : {z_stat:.2f}")
print(f"p-value : {p_value:.6f}")
print()
if p_value < 0.05:
print("Decision : Reject H0 at the 5% level.")
print(f"=> The global hallucination rate is significantly larger than {threshold_pct:.1f}%.")
else:
print("Decision : Cannot reject H0 at the 5% level.")
Hypothesis Test — Stratified PPI++ Estimator
------------------------------------------------------------
H0 : global hallucination rate <= 5.0%
H1 : global hallucination rate > 5.0%

z-statistic : 1.94
p-value : 0.026134

Decision : Reject H0 at the 5% level.
=> The global hallucination rate is significantly larger than 5.0%.
Notice that the null hypothesis is rejected, signalling that the hallucination rate is significantly above the fixed threshold.
Let us try the same hypothesis test using human annotations only.
z_stat, p_value, _ = human_estimate.confidence_interval.test_null_hypothesis(
h0_value=threshold,
alternative="larger",
)
sep = "-" * 60
print("Hypothesis test — Classical Mean Estimator (Human labels only)")
print(sep)
print(f"H0 : global hallucination rate <= {threshold_pct:.1f}%")
print(f"H1 : global hallucination rate > {threshold_pct:.1f}%")
print()
print(f"z-statistic : {z_stat:.2f}")
print(f"p-value : {p_value:.6f}")
print()
if p_value < 0.05:
print("Decision : Reject H0 at the 5% level.")
print(f"=> The global hallucination rate is significantly larger than {threshold_pct:.1f}%.")
else:
print("Decision : Cannot reject H0 at the 5% level.")
Hypothesis test — Classical Mean Estimator (Human labels only)
------------------------------------------------------------
H0 : global hallucination rate <= 5.0%
H1 : global hallucination rate > 5.0%

z-statistic : 1.57
p-value : 0.057883

Decision : Cannot reject H0 at the 5% level.
With human labels alone, the uncertainty is too high to reject the null hypothesis, so we cannot draw the same conclusion. As we saw above, Stratified PPI++ fixes this by leveraging the LLM judge labels and the stratified data structure to reduce uncertainty.
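For readers who want to see what such a test does, here is a minimal sketch of a one-sided z-test reconstructed from a point estimate and a symmetric normal confidence interval. The function `one_sided_z_test` is a hypothetical helper written for illustration, using only the Python standard library; it is not GLIDE's `test_null_hypothesis`, whose internals may differ.

```python
from statistics import NormalDist


def one_sided_z_test(mean, ci_lower, ci_upper, h0_value, confidence_level=0.95):
    """Test H0: mu <= h0_value against H1: mu > h0_value from a symmetric normal CI."""
    normal = NormalDist()
    # Recover the standard error from the two-sided interval width.
    z_crit = normal.inv_cdf(0.5 + confidence_level / 2)
    standard_error = (ci_upper - ci_lower) / (2 * z_crit)
    z_stat = (mean - h0_value) / standard_error
    # Upper-tail probability, because H1 claims the rate is larger.
    p_value = 1 - normal.cdf(z_stat)
    return z_stat, p_value


# Rounded human-only estimate and interval printed earlier in this guide.
z_human, p_human = one_sided_z_test(0.070, 0.045, 0.095, h0_value=0.05)
print(f"z = {z_human:.2f}, p = {p_human:.3f}")
```

Fed with the rounded human-only interval, this lands close to the z-statistic of 1.57 reported above. A narrower interval around the same point estimate shrinks the standard error, which raises the z-statistic and lowers the p-value; that is why the Stratified PPI++ result crosses the 5% significance line while the human-only result does not.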
Summary: Stratified PPI++ adapts to regional data structure¶
| | Baseline A: LLM judge | Baseline B: Human-only | Stratified PPI++ |
|---|---|---|---|
| Sample size | 4,400 (all) | 400 | 400 + 4,400 |
| Unbiased | ❌ | ✅ | ✅ |
| Respects regional structure | ❌ | ❌ | ✅ |
| Narrow confidence interval | 🟠 (misleading) | ❌ | ✅ |
| Labeling cost | Low | High | Small |
Key takeaways:
Regional bias matters. A global LLM judge may have systematically different error rates across regions. Ignoring this structure leaves precision on the table.
Stratified PPI++. It carries out the estimation per region, then combines the results with population-proportional weights, thereby respecting the data structure.
Want to go further? The Stratified PPI++ scientific validation notebook provides rigorous empirical evidence — coverage validity and confidence interval width across a sweep of proxy correlations and confidence levels.