Scientific Validity of Clustered PPI++ for Mean Estimation¶
This notebook provides empirical evidence that GLIDE's Clustered Prediction-Powered Inference (Clustered PPI++) implementation is statistically valid.
Setup: We estimate the mean of a binary outcome (e.g., the hallucination rate of an AI system). Observations are grouped into clusters, and entire clusters are annotated together rather than individual samples. We have:
- A small set of fully annotated clusters with true labels (y_true), expensive but unbiased
- A large set of unannotated clusters with proxy labels only (y_proxy), cheap but potentially biased
Clustered PPI++ combines both to produce confidence intervals that are:
- Valid: they cover the true mean at the specified rate (e.g. 90% confidence), accounting for within-cluster correlation
- Shorter: narrower than intervals built from annotated clusters only, especially when the proxy is strongly correlated with the truth
We test these two claims empirically across a range of proxy/true correlation levels.
from functools import partial
import matplotlib.pyplot as plt
import numpy as np
from glide.estimators import ClusterClassicalMeanEstimator, ClusterPPIMeanEstimator
from glide.samplers import UniformClusterSampler
from glide.scientific_validation import compute_hits, coverage_with_error_bar, run_monte_carlo
from glide.simulators import generate_clustered_binary_dataset, simulate_annotation
plt.rcParams.update(
    {
        "font.size": 18,
        "axes.labelsize": 18,
        "axes.titlesize": 18,
        "legend.fontsize": 16,
        "xtick.labelsize": 16,
        "ytick.labelsize": 16,
        "figure.titlesize": 19,
    }
)
Experiment Parameters¶
We fix all parameters up front so every section of this notebook uses a consistent setup. We define:
- CONFIDENCE_LEVEL: the confidence level at which we will compute confidence intervals.
- N_TOTAL: the total number of observations across all clusters.
- N_CLUSTERS: the total number of clusters in the dataset.
- BUDGET: the number of clusters selected for annotation. Each annotated cluster contributes its full set of true labels.
- TRUE_MEAN: the true mean value of human labels.
- PROXY_MEAN: the (biased) proxy mean value.
- N_SEEDS: the number of simulations we will make in our Monte Carlo experiments.
Note on correlation bounds: Depending on the values of TRUE_MEAN and PROXY_MEAN, extreme correlation values (close to -1 or 1) may not be possible. Correlation sweeps are kept within these limits.
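The correlation limits mentioned in the note follow from the Fréchet bounds on the joint probability of two binary variables. The sketch below is an illustration only: `bernoulli_correlation_bounds` is a hypothetical helper, not part of GLIDE, and it assumes both labels are plain Bernoulli variables with the stated means.

```python
import math


def bernoulli_correlation_bounds(p, q):
    """Return the (min, max) Pearson correlation achievable by two
    Bernoulli variables with means p and q (hypothetical helper)."""
    # P(both = 1) must lie between the Frechet bounds.
    p11_min = max(0.0, p + q - 1.0)
    p11_max = min(p, q)
    # Pearson correlation = (P(both = 1) - p * q) / (sd_p * sd_q).
    scale = math.sqrt(p * (1 - p) * q * (1 - q))
    return (p11_min - p * q) / scale, (p11_max - p * q) / scale


# Values used in this notebook: TRUE_MEAN = 0.55, PROXY_MEAN = 0.5
lo, hi = bernoulli_correlation_bounds(0.55, 0.5)
print(f"feasible correlation range: [{lo:.3f}, {hi:.3f}]")
```

With the means used here, the feasible range is roughly ±0.90, so a sweep that stops at 0.9 stays inside it.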
Finally, we define the estimation methods for comparison:
- True only: refers to using true labels from annotated clusters only, computed with the cluster-aware classical estimator.
- Proxy only: refers to using proxy labels only, computed with the cluster-aware classical estimator.
- Cluster PPI++: refers to Cluster Prediction-Powered Inference with power-tuning, details are provided below.
CONFIDENCE_LEVEL = 0.9
N_TOTAL = 2500
N_CLUSTERS = 500
AVERAGE_CLUSTER_SIZE = N_TOTAL / N_CLUSTERS
BUDGET = 100
TRUE_MEAN = 0.55
PROXY_MEAN = 0.5
N_SEEDS = 1000
METHODS = ["True only", "Proxy only", "Cluster PPI++"]
correlations = np.arange(0.1, 0.95, 0.1)
n_correlations = len(correlations)
correlations_lmh = [
    correlations[n_correlations // 4],
    correlations[n_correlations // 2],
    correlations[3 * n_correlations // 4],
]  # low, medium and high values
corr_labels = ["Low", "Medium", "High"]
Data Simulation¶
We use generate_clustered_binary_dataset to simulate a realistic evaluation scenario. It draws N_TOTAL pairs (y_true, y_proxy) randomly partitioned into N_CLUSTERS clusters. The absence of ground-truth labels for unannotated clusters is simulated by randomly selecting BUDGET clusters to annotate via UniformClusterSampler, which marks every observation in a selected cluster as annotated. The remaining observations are masked with np.nan via simulate_annotation.
The correlation parameter controls the Pearson correlation between true and proxy labels.
# Single example dataset for illustration
y_true, y_proxy, clusters = generate_clustered_binary_dataset(
    n_total=N_TOTAL,
    n_clusters=N_CLUSTERS,
    true_mean=TRUE_MEAN,
    proxy_mean=PROXY_MEAN,
    correlation=0.8,
    random_seed=42,
)
xi = UniformClusterSampler().sample(clusters=clusters, n_clusters=BUDGET, random_seed=42)
y_true = simulate_annotation(y_true, xi)
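To make the simulation steps concrete, here is a minimal NumPy sketch of the same idea: draw correlated binary pairs from their joint distribution, assign observations to clusters at random, then mask the true labels of every cluster that was not selected. The function names are hypothetical and the internals of GLIDE's `generate_clustered_binary_dataset`, `UniformClusterSampler` and `simulate_annotation` may well differ; this is only an assumption-laden illustration.

```python
import numpy as np


def sketch_clustered_binary_data(n_total, n_clusters, true_mean, proxy_mean, correlation, seed):
    """Hypothetical stand-in for the dataset generator (not GLIDE's code)."""
    rng = np.random.default_rng(seed)
    # Joint probabilities implied by the two means and the Pearson correlation.
    scale = np.sqrt(true_mean * (1 - true_mean) * proxy_mean * (1 - proxy_mean))
    p11 = true_mean * proxy_mean + correlation * scale  # true = 1, proxy = 1
    p10 = true_mean - p11                               # true = 1, proxy = 0
    p01 = proxy_mean - p11                              # true = 0, proxy = 1
    p00 = 1.0 - p11 - p10 - p01                         # true = 0, proxy = 0
    probs = np.array([p00, p01, p10, p11])
    if probs.min() < 0:
        raise ValueError("correlation is not achievable for these means")
    cell = rng.choice(4, size=n_total, p=probs)
    y_true = (cell >= 2).astype(float)
    y_proxy = (cell % 2).astype(float)
    # Random partition; a few cluster ids may end up empty.
    clusters = rng.integers(0, n_clusters, size=n_total)
    return y_true, y_proxy, clusters


def sketch_mask_unannotated(y_true, clusters, budget, seed):
    """Keep true labels only for `budget` uniformly chosen clusters."""
    rng = np.random.default_rng(seed)
    chosen = rng.choice(np.unique(clusters), size=budget, replace=False)
    annotated = np.isin(clusters, chosen)
    return np.where(annotated, y_true, np.nan), annotated


y, f, c = sketch_clustered_binary_data(2500, 500, 0.55, 0.5, 0.8, seed=42)
y_masked, annotated = sketch_mask_unannotated(y, c, budget=100, seed=42)
print(f"annotated observations: {annotated.sum()} of {len(y)}")
```

Because whole clusters are selected, the number of annotated observations varies from seed to seed even though the number of annotated clusters is fixed.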
Inference Results¶
We compare three estimation methods, all cluster-aware:
| Estimation method | Data used | Notes |
|---|---|---|
| True only | y_true (annotated clusters) | Cluster classical estimator, the gold standard for validity |
| Proxy only | y_proxy (all clusters) | Biased, cheap but wrong |
| Cluster PPI++ | y_true + y_proxy (rectified) | Cluster-aware, valid and efficient |
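As an illustration of what "cluster-aware" means for the classical baseline, the sketch below computes a mean with a cluster-robust standard error: residuals are summed within each cluster before being squared, so correlated observations in the same cluster do not artificially shrink the interval. This is a textbook linearization estimator written under my own assumptions (normal approximation, G/(G-1) small-sample factor); `cluster_robust_mean` is a hypothetical name and not necessarily what `ClusterClassicalMeanEstimator` does internally.

```python
from statistics import NormalDist

import numpy as np


def cluster_robust_mean(y, clusters, confidence_level=0.9):
    """Mean of y with a cluster-robust standard error (illustrative sketch)."""
    y = np.asarray(y, dtype=float)
    _, inverse = np.unique(clusters, return_inverse=True)
    sizes = np.bincount(inverse)             # observations per cluster
    sums = np.bincount(inverse, weights=y)   # sum of y per cluster
    n_clusters = len(sizes)
    mean = sums.sum() / sizes.sum()
    # Cluster-level residual totals: sum over the cluster of (y_i - mean).
    resid = sums - mean * sizes
    var = n_clusters / (n_clusters - 1) * np.sum(resid**2) / sizes.sum() ** 2
    se = float(np.sqrt(var))
    z = NormalDist().inv_cdf(0.5 + confidence_level / 2)
    return mean, se, (mean - z * se, mean + z * se)


# Two perfectly homogeneous clusters: the robust SE is larger than an i.i.d. SE.
mean, se, ci = cluster_robust_mean([1, 1, 0, 0], [0, 0, 1, 1])
print(mean, se, ci)
```

When labels within a cluster are independent, this standard error is close to the usual i.i.d. one; it grows as within-cluster correlation increases.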
The function below simulates a dataset for a given seed and correlation level, then runs all three estimation methods on it.
def simulate_estimates(seed, correlation):
    y_true, y_proxy, clusters = generate_clustered_binary_dataset(
        n_total=N_TOTAL,
        n_clusters=N_CLUSTERS,
        true_mean=TRUE_MEAN,
        proxy_mean=PROXY_MEAN,
        correlation=correlation,
        random_seed=seed,
    )
    xi = UniformClusterSampler().sample(clusters, n_clusters=BUDGET, random_seed=seed)
    y_true = simulate_annotation(y_true, xi)
    # --- Cluster PPI++ ---
    estimator = ClusterPPIMeanEstimator()
    ppi_result = estimator.estimate(y_true, y_proxy, clusters, confidence_level=CONFIDENCE_LEVEL)
    # --- Classical baselines ---
    classical_estimator = ClusterClassicalMeanEstimator()
    true_only_result = classical_estimator.estimate(y_true, clusters, confidence_level=CONFIDENCE_LEVEL)
    proxy_only_result = classical_estimator.estimate(y_proxy, clusters, confidence_level=CONFIDENCE_LEVEL)
    return {
        "True only": {
            "mean": true_only_result.mean,
            "std": true_only_result.std,
            "confidence_interval": true_only_result.confidence_interval,
        },
        "Proxy only": {
            "mean": proxy_only_result.mean,
            "std": proxy_only_result.std,
            "confidence_interval": proxy_only_result.confidence_interval,
        },
        "Cluster PPI++": {
            "mean": ppi_result.mean,
            "std": ppi_result.std,
            "confidence_interval": ppi_result.confidence_interval,
            "effective_sample_size": ppi_result.effective_sample_size,
        },
    }
Cluster PPI++ is implemented by ClusterPPIMeanEstimator, whereas ClusterClassicalMeanEstimator implements conventional cluster-aware mean estimation.
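For intuition about what a power-tuned PPI estimator computes, the sketch below applies the standard PPI++ mean formula to cluster-level means, treating each cluster as one unit. This is a simplification made for illustration, not GLIDE's implementation: it assumes whole clusters are either annotated or masked with NaN, weights every cluster equally (reasonable only when cluster size is unrelated to the outcome, as in a random partition), and uses a normal approximation. The name `cluster_ppi_mean_sketch` is hypothetical.

```python
from statistics import NormalDist

import numpy as np


def cluster_ppi_mean_sketch(y_true, y_proxy, clusters, confidence_level=0.9):
    """PPI++ mean estimate on cluster means (illustrative, equal cluster weights)."""
    y_true = np.asarray(y_true, dtype=float)
    y_proxy = np.asarray(y_proxy, dtype=float)
    _, inverse = np.unique(clusters, return_inverse=True)
    sizes = np.bincount(inverse)
    # A cluster counts as annotated when none of its true labels is NaN.
    nan_counts = np.bincount(inverse, weights=np.isnan(y_true).astype(float))
    annotated = nan_counts == 0
    proxy_means = np.bincount(inverse, weights=y_proxy) / sizes
    true_means = np.bincount(inverse, weights=np.nan_to_num(y_true)) / sizes

    y_lab, f_lab = true_means[annotated], proxy_means[annotated]
    f_unlab = proxy_means[~annotated]
    n, n_unlab = len(y_lab), len(f_unlab)

    # Power tuning: lambda minimizing the variance of the estimator below.
    var_f = np.var(proxy_means, ddof=1)
    cov_yf = np.cov(y_lab, f_lab, ddof=1)[0, 1]
    lam = cov_yf / ((1 + n / n_unlab) * var_f)

    # Labeled mean, rectified by the proxy gap between unlabeled and labeled clusters.
    mean = y_lab.mean() + lam * (f_unlab.mean() - f_lab.mean())
    var = np.var(y_lab - lam * f_lab, ddof=1) / n + lam**2 * np.var(f_unlab, ddof=1) / n_unlab
    half = NormalDist().inv_cdf(0.5 + confidence_level / 2) * np.sqrt(var)
    return mean, lam, (mean - half, mean + half)
```

With `lam = 0` the estimate reduces to the mean of the annotated clusters, which is why a weak proxy costs little: the tuning step shrinks its contribution toward zero.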
Coverage Validity¶
A confidence interval is valid if it reliably captures the true value at the nominal rate: a 90% confidence interval is valid if, across many repetitions, around 90% of the resulting intervals contain the true value.
We run a Monte Carlo experiment to verify this for each method. We check that the empirical coverage tracks the nominal level throughout. See the Scientific Validation Methodology page for more details about the verification protocol.
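The coverage check itself is simple to state in code. The sketch below counts how often simulated intervals contain the true value and attaches a Wilson score interval to that proportion as an error bar. It is a generic illustration with a hypothetical helper name; the error bar used by GLIDE's `coverage_with_error_bar` may be computed differently.

```python
from statistics import NormalDist

import numpy as np


def empirical_coverage(lower_bounds, upper_bounds, true_value, error_bar_level=0.9):
    """Fraction of intervals containing true_value, with a Wilson error bar."""
    lower_bounds = np.asarray(lower_bounds, dtype=float)
    upper_bounds = np.asarray(upper_bounds, dtype=float)
    hits = (lower_bounds <= true_value) & (true_value <= upper_bounds)
    n = hits.size
    p_hat = float(hits.mean())
    z = NormalDist().inv_cdf(0.5 + error_bar_level / 2)
    # Wilson score interval for a binomial proportion.
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half = z * np.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return p_hat, center - half, center + half


# Four simulated intervals, two of which contain the true value 0.55.
cov, lo, hi = empirical_coverage([0.4, 0.5, 0.6, 0.3], [0.6, 0.7, 0.8, 0.5], 0.55)
print(cov, lo, hi)
```

A method is consistent with valid coverage at a given level when the nominal level lies inside, or below, this error bar.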
Coverage vs confidence level for three correlation levels¶
We sweep the confidence level from 0.55 to 0.95 and plot the observed coverage. For a valid estimation method, the dots should fall on or above the black diagonal $y = \text{confidence level}$.
We do this for low, medium and high proxy correlation.
# Run Monte Carlo simulations for each correlation level
confidence_levels = np.arange(0.55, 1.00, 0.05)
confidence_levels = np.round(confidence_levels, 2)
raw_stats = {
    corr: run_monte_carlo(confidence_levels, partial(simulate_estimates, correlation=corr)) for corr in correlations
}
# Derive coverage for every (correlation, confidence_level) pair
coverages_confidence_intervals = {}
for correlation in correlations_lmh:
    coverages_confidence_intervals[correlation] = {}
    for confidence_level in confidence_levels:
        hits = compute_hits(raw_stats[correlation], confidence_level, TRUE_MEAN)
        coverages_confidence_intervals[correlation][confidence_level] = dict()
        for method in METHODS:
            coverage_confidence_interval = coverage_with_error_bar(hits[method], confidence_level=CONFIDENCE_LEVEL)
            coverages_confidence_intervals[correlation][confidence_level][method] = coverage_confidence_interval
fig, axes = plt.subplots(1, 3, figsize=(15, 5), sharey=True)
colors = {"True only": "steelblue", "Cluster PPI++": "darkorange", "Proxy only": "red"}
for ax, correlation, label in zip(axes, correlations_lmh, corr_labels):
    ax.plot(confidence_levels, confidence_levels, color="black", lw=1.5, linestyle="--", label="Ideal")
    for method in METHODS:
        mean_ci = np.array([coverages_confidence_intervals[correlation][cl][method] for cl in confidence_levels])
        mean = mean_ci[:, 0]
        lo = mean_ci[:, 1]
        hi = mean_ci[:, 2]
        ax.plot(confidence_levels, mean, marker="o", color=colors[method], label=method)
        ax.fill_between(confidence_levels, lo, hi, alpha=0.15, color=colors[method])
    ax.set_title(f"{label} correlation (${round(correlation, 2)}$)")
    ax.set_xlabel("Target confidence level")
    ax.set_ylabel("Observed coverage")
    ax.legend()
    ax.set_xlim(0.5, 1.0)
    ax.set_ylim(0.5, 1.0)
plt.tight_layout()
plt.show()
Both Cluster PPI++ and True only track the diagonal closely across all correlation levels, confirming that Cluster PPI++ achieves valid coverage regardless of proxy quality. Proxy only relies on biased labels, so its intervals do not have valid coverage.
Coverage vs correlation for fixed confidence level¶
We now fix the confidence level and sweep a range of proxy-true correlation levels. This shows that Cluster PPI++'s validity does not degrade as the proxy becomes weaker.
coverage_by_corr = {}  # {correlation: {method: observed mean coverage}}
coverage_ci_by_corr = {}  # {correlation: {method: (lower, upper) Confidence Interval on coverage}}
for correlation in correlations:
    hits = compute_hits(raw_stats[correlation], CONFIDENCE_LEVEL, TRUE_MEAN)
    coverage_by_corr[correlation] = {}
    coverage_ci_by_corr[correlation] = {}
    for method in METHODS:
        mean_cov, lo, hi = coverage_with_error_bar(hits[method], CONFIDENCE_LEVEL)
        coverage_by_corr[correlation][method] = mean_cov
        coverage_ci_by_corr[correlation][method] = (lo, hi)
fig, ax = plt.subplots(figsize=(8, 5))
method_colors = {"True only": "steelblue", "Cluster PPI++": "darkorange"}
for method in ["True only", "Cluster PPI++"]:
    obs = np.array([coverage_by_corr[correlation][method] for correlation in correlations])
    ci_bounds = np.array([coverage_ci_by_corr[correlation][method] for correlation in correlations])
    lo = ci_bounds[:, 0]
    hi = ci_bounds[:, 1]
    ax.plot(correlations, obs, marker="o", color=method_colors[method], label=method)
    ax.fill_between(correlations, lo, hi, alpha=0.15, color=method_colors[method])
ax.axhline(y=CONFIDENCE_LEVEL, color="red", linestyle="--", lw=2, label=f"Target coverage {CONFIDENCE_LEVEL:.0%}")
ax.set_xlabel("Proxy–true correlation")
ax.set_ylabel("Observed coverage")
ax.set_xlim(0, 1)
ax.set_ylim(0.8, 1.0)
ax.yaxis.set_ticks(ax.get_yticks()[1:-1:2])
ax.legend()
plt.tight_layout()
plt.show()
Note that Proxy only is not plotted: because the proxy is biased (proxy mean ≠ true mean), its intervals have invalid coverage, whereas Cluster PPI++ and True only remain valid across all correlation levels.
Confidence Interval Width¶
Coverage validity is necessary but not sufficient: we also want short intervals. Cluster PPI++'s promise is that, by leveraging the unannotated proxy data, it remains statistically valid, just like using annotated data alone, but with a shorter interval when the proxy is informative.
We compare mean confidence interval widths for Cluster PPI++ and True only across correlation levels.
width_by_corr = {}
for correlation in correlations:
    width_by_corr[correlation] = {}
    for method in METHODS:
        lower_bound = raw_stats[correlation][method]["lower_bounds"][CONFIDENCE_LEVEL]
        upper_bound = raw_stats[correlation][method]["upper_bounds"][CONFIDENCE_LEVEL]
        width_by_corr[correlation][method] = upper_bound - lower_bound
fig, ax = plt.subplots(figsize=(9, 5))
plot_methods = ["True only", "Cluster PPI++"]
colors_w = {"True only": "steelblue", "Cluster PPI++": "darkorange"}
# Compute percentiles based on CONFIDENCE_LEVEL
lower_percentile = round(((1 - CONFIDENCE_LEVEL) / 2) * 100)
upper_percentile = 100 - lower_percentile
for method in plot_methods:
    means_w = [np.mean(width_by_corr[correlation][method]) for correlation in correlations]
    q_lower = [np.percentile(width_by_corr[correlation][method], lower_percentile) for correlation in correlations]
    q_upper = [np.percentile(width_by_corr[correlation][method], upper_percentile) for correlation in correlations]
    ax.plot(correlations, means_w, marker="o", label=method, color=colors_w[method])
    ax.fill_between(correlations, q_lower, q_upper, alpha=0.15, color=colors_w[method])
ax.set_xlabel("Proxy–true correlation")
ax.set_ylabel("Confidence Interval width")
ax.set_xlim(0.05, 0.95)
ax.legend()
plt.tight_layout()
plt.show()
As expected, Cluster PPI++'s interval width decreases with increasing correlation. Leveraging the unannotated proxy data is only beneficial when the proxy is informative.
Effective Sample Size¶
A natural summary of Cluster PPI++'s efficiency gain is the effective sample size (ESS): the number of true labeled observations that would be needed by the cluster classical estimator to match Cluster PPI++'s confidence interval width.
We report Cluster PPI++'s effective sample size across correlation levels, translating the width reduction into an equivalent number of true labeled observations. See the Scientific Validation Methodology page for the formal definition and formula of ESS.
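One way to see how a width reduction translates into an equivalent sample size: under a normal approximation, the width of a classical interval shrinks like one over the square root of the number of labeled observations, so matching a narrower interval requires scaling the sample by the squared width ratio. The helper below encodes that reasoning; it is a hypothetical illustration, and the formal ESS definition used by GLIDE is the one given on the methodology page, which may differ in detail.

```python
def effective_sample_size_from_widths(n_labeled, classical_width, ppi_width):
    """Labeled sample size a classical interval would need to match ppi_width.

    Assumes interval width is proportional to 1 / sqrt(n) (illustrative only).
    """
    return n_labeled * (classical_width / ppi_width) ** 2


# Halving the interval width is worth four times as many labeled observations.
ess = effective_sample_size_from_widths(n_labeled=500, classical_width=0.10, ppi_width=0.05)
print(ess)
```

The same reasoning shows that an ESS of twice the annotated sample corresponds to intervals that are narrower by a factor of about 1.41, the square root of two.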
ess_mean = [np.mean(raw_stats[correlation]["Cluster PPI++"]["effective_sample_sizes"]) for correlation in correlations]
ess_q_lower = [
    np.percentile(raw_stats[correlation]["Cluster PPI++"]["effective_sample_sizes"], lower_percentile)
    for correlation in correlations
]
ess_q_upper = [
    np.percentile(raw_stats[correlation]["Cluster PPI++"]["effective_sample_sizes"], upper_percentile)
    for correlation in correlations
]
fig, ax = plt.subplots(figsize=(8, 5))
ax.plot(correlations, ess_mean, marker="o", color="darkorange", label="Cluster PPI++ ESS (mean)")
ax.fill_between(
    correlations,
    ess_q_lower,
    ess_q_upper,
    alpha=0.15,
    color="darkorange",
    label=f"{lower_percentile:.0f}th–{upper_percentile:.0f}th percentile",
)
ax.axhline(y=BUDGET * AVERAGE_CLUSTER_SIZE, color="steelblue", linestyle="--", lw=2, label="Baseline (True only)")
ax.set_xlabel("Proxy–true correlation")
ax.set_ylabel("Effective sample size")
ax.set_xlim(0.05, 0.95)
ax.legend()
plt.tight_layout()
plt.show()
Summary¶
This notebook has empirically validated that GLIDE's Cluster PPI++ implementation satisfies two key statistical properties:
| Property | Result |
|---|---|
| Coverage validity | Cluster PPI++ achieves the nominal coverage across all correlation levels and confidence levels tested |
| Efficiency | Cluster PPI++ produces shorter confidence intervals than the annotated-only baseline whenever correlation is positive, with the gain growing with correlation |
Crucially, the biased baseline (Proxy only) fails the coverage test. It appears precise but is systematically wrong. Cluster PPI++ avoids this by correcting for proxy bias using the annotated clusters.
The ESS analysis shows that the benefit of leveraging the unannotated proxy data increases with the proxy's correlation with the truth and can be equivalent to having more than twice as much labeled data, a significant practical gain in scenarios where annotation is expensive. This highlights the importance of a good LLM judge when evaluating an AI system.