Scientific Validity of Multi-PPI++ for Mean Estimation
This notebook provides empirical evidence that GLIDE's Multi Proxy Prediction-Powered Inference (Multi-PPI++) implementation is statistically valid.
Setup: We estimate the mean of a binary outcome (e.g., the hallucination rate of an AI system). We have:
- A small set of true labels (y_true), expensive but unbiased
- A large set of proxy labels from 2 proxy models (y_proxies), cheap but potentially biased
Multi-PPI++ finds the optimal linear combination of the 2 proxies that minimises the confidence interval width, then applies the PPI rectifier with this combined proxy. This yields confidence intervals that are:
- Valid: they cover the true mean at the specified rate (e.g. 90% confidence)
- Shorter: they are narrower than intervals built from true labels only, especially when at least one proxy is strongly correlated with the truth
We test these claims empirically across a range of proxy/true correlation levels.
from functools import partial
import matplotlib.pyplot as plt
import numpy as np
from glide.estimators import ClassicalMeanEstimator, MultiPPIMeanEstimator
from glide.samplers import UniformSampler
from glide.scientific_validation import compute_hits, coverage_with_error_bar, run_monte_carlo
from glide.simulators import generate_multi_binary_dataset, simulate_annotation
plt.rcParams.update(
    {
        "font.size": 18,
        "axes.labelsize": 18,
        "axes.titlesize": 18,
        "legend.fontsize": 16,
        "xtick.labelsize": 16,
        "ytick.labelsize": 16,
        "figure.titlesize": 19,
    }
)
Experiment Parameters
We fix all parameters up front so every section of this notebook uses a consistent setup. We define:
- CONFIDENCE_LEVEL: the confidence level at which we compute confidence intervals.
- N_TOTAL: total number of samples.
- BUDGET: total number of human-annotated samples.
- TRUE_MEAN: the true mean value of human labels.
- PROXY_MEANS: per-proxy (biased) mean values. Both proxies have a mean different from TRUE_MEAN.
- PROXY_CORRELATION_OFFSETS: per-proxy offsets added to the base correlation level between true and proxy values. Proxy 2 has the higher offset, making it the stronger proxy.
- N_SEEDS: number of simulations in Monte Carlo experiments.
Note on correlation bounds: Depending on the values of TRUE_MEAN and PROXY_MEANS, extreme correlation values (close to -1 or 1) may not be possible. The correlation sweep is kept within safe limits for all proxies.
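The feasibility limits mentioned in this note can be checked directly. The sketch below computes the range of Pearson correlations that two Bernoulli variables with given means can attain (the Fréchet-Hoeffding bounds). The helper name `bernoulli_correlation_bounds` is hypothetical and not part of GLIDE; the numeric values are copied from the parameter cell of this notebook.

```python
import numpy as np


def bernoulli_correlation_bounds(p, q):
    """Return the (min, max) Pearson correlation attainable by two Bernoulli
    variables with means p and q (Frechet-Hoeffding bounds)."""
    scale = np.sqrt(p * (1 - p) * q * (1 - q))
    # P(both = 1) lies between max(0, p + q - 1) and min(p, q).
    rho_min = (max(0.0, p + q - 1) - p * q) / scale
    rho_max = (min(p, q) - p * q) / scale
    return rho_min, rho_max


# Values copied from the parameter cell of this notebook.
true_mean = 0.55
proxy_means = [0.5, 0.45]
offsets = [-0.05, 0.0]
base_correlations = np.arange(0.0, 0.85, 0.1)

bounds = [bernoulli_correlation_bounds(true_mean, q) for q in proxy_means]
for (lo, hi), q in zip(bounds, proxy_means):
    print(f"proxy mean {q}: feasible correlation in [{lo:.3f}, {hi:.3f}]")
```

Under these assumptions, the largest requested correlation for each proxy (base value plus offset) stays below its upper bound, which is what keeps the sweep feasible.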
Finally, we define the baseline estimation methods for comparison:
- True only: uses true human labels only, the gold standard for validity
- Proxy 1 only: uses the first proxy labels only, biased but cheap
- Proxy 2 only: uses the second proxy labels only, also biased but with a higher correlation to the truth
- Multi-PPI++: Multi Prediction-Powered Inference with optimal lambda tuning across both proxies
CONFIDENCE_LEVEL = 0.9
N_TOTAL = 1500
BUDGET = 500
TRUE_MEAN = 0.55
PROXY_MEANS = np.array([0.5, 0.45])
PROXY_CORRELATION_OFFSETS = np.array([-0.05, 0.0])
N_SEEDS = 1000
METHODS = ["True only", "Proxy 1 only", "Proxy 2 only", "Multi-PPI++"]
# Correlation sweep, kept within the feasible range for all proxies
correlations = np.arange(0.0, 0.85, 0.1)
n_correlations = len(correlations)
correlations_lmh = [
    correlations[n_correlations // 4],
    correlations[n_correlations // 2],
    correlations[3 * n_correlations // 4],
]  # low, medium and high values
corr_labels = ["Low", "Medium", "High"]
print(f"TRUE_MEAN = {TRUE_MEAN}")
print(f"PROXY_MEANS = {PROXY_MEANS}")
TRUE_MEAN = 0.55
PROXY_MEANS = [0.5 0.45]
Data Simulation
We use generate_multi_binary_dataset to simulate a realistic evaluation scenario with multiple proxy models. It generates correlated binary labels: a single ground-truth array y_true_oracle and a 2D proxy array y_proxies with one column per proxy model. Each proxy is generated independently given y_true_oracle, so proxies are correlated with the truth but conditionally independent of one another.
Missing ground-truth labels are then simulated by randomly selecting BUDGET samples to annotate via UniformSampler and masking the rest with np.nan via simulate_annotation.
The correlation parameter controls the Pearson correlation between true labels and each proxy. In the sweep below, all proxies receive the same base correlation value, shifted by PROXY_CORRELATION_OFFSETS to simulate heterogeneous proxy quality.
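To make the simulation step concrete, here is a minimal sketch of one way to draw a binary proxy with a target mean and a target Pearson correlation to a binary truth, followed by NaN masking of the unlabeled rows. It is an illustration only: `sample_binary_proxy` is a hypothetical helper, and GLIDE's generate_multi_binary_dataset, UniformSampler and simulate_annotation may work differently internally.

```python
import numpy as np


def sample_binary_proxy(y, proxy_mean, correlation, rng):
    """Draw a binary proxy with a target mean and Pearson correlation to y.
    Hypothetical helper, not GLIDE's generate_multi_binary_dataset."""
    p = y.mean()  # empirical P(y = 1)
    q = proxy_mean
    # Joint probability P(y = 1, proxy = 1) implied by the target correlation.
    p11 = p * q + correlation * np.sqrt(p * (1 - p) * q * (1 - q))
    # Conditional probabilities of proxy = 1 given each value of y.
    prob_given_1 = p11 / p
    prob_given_0 = (q - p11) / (1 - p)
    probs = np.where(y == 1, prob_given_1, prob_given_0)
    return (rng.random(len(y)) < probs).astype(float)


rng = np.random.default_rng(0)
y_oracle = (rng.random(200_000) < 0.55).astype(float)
proxy = sample_binary_proxy(y_oracle, proxy_mean=0.45, correlation=0.7, rng=rng)

# Keep the true label for a uniformly sampled subset and mask the rest with NaN.
budget = 500
labeled_idx = rng.choice(len(y_oracle), size=budget, replace=False)
y_masked = np.full_like(y_oracle, np.nan)
y_masked[labeled_idx] = y_oracle[labeled_idx]

print(proxy.mean(), np.corrcoef(y_oracle, proxy)[0, 1])
```

Because each proxy value depends only on the corresponding truth value, several proxies drawn this way are conditionally independent given the truth, as described above.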
# Single example dataset for illustration
y_true_oracle, y_proxies = generate_multi_binary_dataset(
    n_total=N_TOTAL,
    true_mean=TRUE_MEAN,
    proxy_means=PROXY_MEANS,
    correlations=0.7 + PROXY_CORRELATION_OFFSETS,
    random_seed=42,
)
xi = UniformSampler().sample(n_total=N_TOTAL, n_samples=BUDGET, random_seed=42)
y_true = simulate_annotation(y_true_oracle, xi)
n_labeled = int(np.sum(~np.isnan(y_true)))
n_unlabeled = len(y_true) - n_labeled
print(f"Total samples: {len(y_true)}")
print(f"Labeled samples: {n_labeled}")
print(f"Unlabeled samples: {n_unlabeled}")
print(f"y_proxies shape: {y_proxies.shape}")
Total samples: 1500
Labeled samples: 500
Unlabeled samples: 1000
y_proxies shape: (1500, 2)
Inference Results
We compare four estimation methods:
| Estimation method | Data used | Notes |
|---|---|---|
| True only | y_true | Classical CLT confidence interval, the gold standard for validity |
| Proxy 1 only | y_proxies[:, 0] | Biased, cheap but wrong |
| Proxy 2 only | y_proxies[:, 1] | Also biased, but more correlated with the truth |
| Multi-PPI++ | y_true + y_proxies (optimally combined and rectified) | Valid and efficient, exploiting both proxies |
The function below simulates a dataset for a given seed and base correlation level, then runs all four estimation methods on it.
Note that we add PROXY_CORRELATION_OFFSETS to the base correlation to simulate heterogeneous proxy quality: proxy 2 receives a higher correlation than proxy 1.
def simulate_estimates(seed, correlation):
    y_true_oracle, y_proxies = generate_multi_binary_dataset(
        n_total=N_TOTAL,
        true_mean=TRUE_MEAN,
        proxy_means=PROXY_MEANS,
        correlations=(correlation + PROXY_CORRELATION_OFFSETS),
        random_seed=seed,
    )
    xi = UniformSampler().sample(N_TOTAL, BUDGET, random_seed=seed)
    y_true = simulate_annotation(y_true_oracle, xi)
    # --- Multi-PPI++ ---
    estimator = MultiPPIMeanEstimator()
    multi_ppi_result = estimator.estimate(y_true, y_proxies, confidence_level=CONFIDENCE_LEVEL)
    # --- Classical baselines ---
    classical_estimator = ClassicalMeanEstimator()
    true_only_result = classical_estimator.estimate(y_true, confidence_level=CONFIDENCE_LEVEL)
    proxy1_only_result = classical_estimator.estimate(y_proxies[:, 0], confidence_level=CONFIDENCE_LEVEL)
    proxy2_only_result = classical_estimator.estimate(y_proxies[:, 1], confidence_level=CONFIDENCE_LEVEL)
    return {
        "True only": {
            "mean": true_only_result.mean,
            "std": true_only_result.std,
            "confidence_interval": true_only_result.confidence_interval,
        },
        "Proxy 1 only": {
            "mean": proxy1_only_result.mean,
            "std": proxy1_only_result.std,
            "confidence_interval": proxy1_only_result.confidence_interval,
        },
        "Proxy 2 only": {
            "mean": proxy2_only_result.mean,
            "std": proxy2_only_result.std,
            "confidence_interval": proxy2_only_result.confidence_interval,
        },
        "Multi-PPI++": {
            "mean": multi_ppi_result.mean,
            "std": multi_ppi_result.std,
            "confidence_interval": multi_ppi_result.confidence_interval,
            "effective_sample_size": multi_ppi_result.effective_sample_size,
        },
    }
MultiPPIMeanEstimator solves for the optimal lambda vector that minimises the confidence interval width, forms a single combined proxy prediction, and applies the PPI rectifier. ClassicalMeanEstimator implements conventional mean estimation using true labels (or a single proxy column) only.
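For intuition, the three steps named above (solve for lambda, combine the proxies, rectify) can be sketched in plain NumPy. This is not GLIDE's implementation: `multi_ppi_mean_sketch` is a hypothetical function that assumes the labeled and unlabeled rows are independent samples, uses plug-in covariances, and relies on the closed-form variance-minimising weights for a mean, lambda = Cov(f)^-1 Cov(f, y) / (1 + n / N), with n labeled and N unlabeled rows.

```python
from statistics import NormalDist

import numpy as np


def multi_ppi_mean_sketch(y_true, y_proxies, confidence_level=0.9):
    """Power-tuned PPI mean estimate with several proxies (illustrative only).
    y_true holds NaN for unlabeled rows; y_proxies has one column per proxy."""
    labeled = ~np.isnan(y_true)
    y_lab, f_lab, f_unl = y_true[labeled], y_proxies[labeled], y_proxies[~labeled]
    n, n_unl = len(y_lab), len(f_unl)

    # Plug-in covariances: proxies among themselves, and each proxy with the truth.
    cov_ff = np.cov(y_proxies, rowvar=False)
    cov_fy = np.array([np.cov(f_lab[:, j], y_lab)[0, 1] for j in range(f_lab.shape[1])])

    # Variance-minimising weights for the combined proxy (assumed closed form).
    lam = np.linalg.solve(cov_ff, cov_fy) / (1 + n / n_unl)
    combined_lab, combined_unl = f_lab @ lam, f_unl @ lam

    # Combined-proxy mean on unlabeled rows, plus the rectifier from labeled rows.
    rectifier = (y_lab - combined_lab).mean()
    estimate = combined_unl.mean() + rectifier
    variance = np.var(y_lab - combined_lab, ddof=1) / n + np.var(combined_unl, ddof=1) / n_unl
    half_width = NormalDist().inv_cdf(0.5 + confidence_level / 2) * np.sqrt(variance)
    return estimate, (estimate - half_width, estimate + half_width), lam


# Toy data: two proxies that flip the truth with different probabilities.
rng = np.random.default_rng(0)
truth = (rng.random(1500) < 0.55).astype(float)
proxies = np.column_stack([np.abs(truth - (rng.random(1500) < flip)) for flip in (0.2, 0.1)])
y_obs = truth.copy()
y_obs[500:] = np.nan  # only the first 500 rows keep their true label
estimate, (low, high), lam = multi_ppi_mean_sketch(y_obs, proxies)
print(estimate, (low, high), lam)
```

The rectifier is what removes proxy bias: any constant offset in the combined proxy cancels between the unlabeled mean and the labeled correction term.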
Coverage Validity
A confidence interval is valid if it reliably captures the true value at the nominal rate: a 90% confidence interval is valid if, across many repetitions, around 90% of the resulting intervals contain the true value.
We run a Monte Carlo experiment to verify this for each method. We check that the empirical coverage tracks the nominal level throughout. See the Scientific Validation Methodology page for more details about the verification protocol.
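As a rough illustration of such a check, the sketch below counts how often a batch of intervals contains the true value and attaches a normal-approximation error bar to the observed hit rate. The helper `coverage_with_normal_error_bar` is hypothetical; GLIDE's compute_hits and coverage_with_error_bar may use a different interval construction.

```python
from statistics import NormalDist

import numpy as np


def coverage_with_normal_error_bar(lower, upper, true_value, confidence_level=0.9):
    """Empirical coverage of a batch of intervals with a normal-approximation
    error bar. Hypothetical helper, not GLIDE's coverage_with_error_bar."""
    hits = (np.asarray(lower) <= true_value) & (true_value <= np.asarray(upper))
    coverage = hits.mean()
    z = NormalDist().inv_cdf(0.5 + confidence_level / 2)
    # Binomial standard error of the hit rate over the Monte Carlo runs.
    half_width = z * np.sqrt(coverage * (1 - coverage) / len(hits))
    return coverage, max(0.0, coverage - half_width), min(1.0, coverage + half_width)


# Toy check: classical 90% intervals for a Bernoulli(0.55) mean, 1000 repetitions.
rng = np.random.default_rng(0)
samples = (rng.random((1000, 500)) < 0.55).astype(float)
means = samples.mean(axis=1)
half = NormalDist().inv_cdf(0.95) * samples.std(axis=1, ddof=1) / np.sqrt(500)
coverage, low, high = coverage_with_normal_error_bar(means - half, means + half, 0.55)
print(coverage, (low, high))
```

A method passes the check when the nominal level falls inside, or below, the error bar around its observed coverage.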
Coverage vs confidence level for three correlation levels
We sweep the confidence level from 0.55 to 0.95 and plot the observed coverage. For a valid estimation method, the dots should fall on or above the black diagonal $y = \text{confidence level}$.
We do this for low, medium and high base proxy correlation.
# Run Monte Carlo simulations for each correlation level
confidence_levels = np.arange(0.55, 1.00, 0.05)
confidence_levels = np.round(confidence_levels, 2)
raw_stats = {
    corr: run_monte_carlo(confidence_levels, partial(simulate_estimates, correlation=corr)) for corr in correlations
}
# Derive coverage for every (correlation, confidence_level) pair
coverages_confidence_intervals = {}
for correlation in correlations_lmh:
    coverages_confidence_intervals[correlation] = {}
    for confidence_level in confidence_levels:
        hits = compute_hits(raw_stats[correlation], confidence_level, TRUE_MEAN)
        coverages_confidence_intervals[correlation][confidence_level] = dict()
        for method in METHODS:
            coverage_confidence_interval = coverage_with_error_bar(hits[method], confidence_level=CONFIDENCE_LEVEL)
            coverages_confidence_intervals[correlation][confidence_level][method] = coverage_confidence_interval
fig, axes = plt.subplots(1, 3, figsize=(15, 5), sharey=True)
colors = {"True only": "steelblue", "Multi-PPI++": "darkorange", "Proxy 1 only": "red", "Proxy 2 only": "firebrick"}
for ax, correlation, label in zip(axes, correlations_lmh, corr_labels):
    ax.plot(confidence_levels, confidence_levels, color="black", lw=1.5, linestyle="--", label="Ideal")
    for method in METHODS:
        mean_ci = np.array([coverages_confidence_intervals[correlation][cl][method] for cl in confidence_levels])
        mean = mean_ci[:, 0]
        lo = mean_ci[:, 1]
        hi = mean_ci[:, 2]
        ax.plot(confidence_levels, mean, marker="o", color=colors[method], label=method)
        ax.fill_between(confidence_levels, lo, hi, alpha=0.15, color=colors[method])
    ax.set_title(f"{label} correlation (${round(correlation, 2)}$)")
    ax.set_xlabel("Target confidence level")
    ax.set_ylabel("Observed coverage")
    ax.legend(loc="lower right")
    ax.set_xlim(0.5, 1.0)
    ax.set_ylim(0.5, 1.0)
plt.tight_layout()
plt.show()
Both Multi-PPI++ and True only track the diagonal closely across all correlation levels, confirming that Multi-PPI++ achieves valid coverage regardless of proxy quality. Both proxy-only baselines fall far from the diagonal because their means differ from TRUE_MEAN, so their coverage is invalid.
Coverage vs correlation for fixed confidence level
We now fix the confidence level and sweep a range of base proxy-true correlation levels. This shows that Multi-PPI++ validity does not degrade as the proxies become weaker.
coverage_by_corr = {} # {correlation: {method: observed mean coverage}}
coverage_ci_by_corr = {} # {correlation: {method: (lower, upper) confidence interval on coverage}}
for correlation in correlations:
    hits = compute_hits(raw_stats[correlation], CONFIDENCE_LEVEL, TRUE_MEAN)
    coverage_by_corr[correlation] = {}
    coverage_ci_by_corr[correlation] = {}
    for method in METHODS:
        mean_cov, lo, hi = coverage_with_error_bar(hits[method], CONFIDENCE_LEVEL)
        coverage_by_corr[correlation][method] = mean_cov
        coverage_ci_by_corr[correlation][method] = (lo, hi)
fig, ax = plt.subplots(figsize=(8, 5))
method_colors = {"True only": "steelblue", "Multi-PPI++": "darkorange"}
for method in ["True only", "Multi-PPI++"]:
    obs = np.array([coverage_by_corr[correlation][method] for correlation in correlations])
    ci_bounds = np.array([coverage_ci_by_corr[correlation][method] for correlation in correlations])
    lo = ci_bounds[:, 0]
    hi = ci_bounds[:, 1]
    ax.plot(correlations, obs, marker="o", color=method_colors[method], label=method)
    ax.fill_between(correlations, lo, hi, alpha=0.15, color=method_colors[method])
ax.axhline(y=CONFIDENCE_LEVEL, color="red", linestyle="--", lw=2, label=f"Target coverage {CONFIDENCE_LEVEL:.0%}")
ax.set_xlabel("Base proxy–true correlation")
ax.set_ylabel("Observed coverage")
ax.set_xlim(-0.05, 0.85)
ax.set_ylim(0.8, 1.0)
ax.yaxis.set_ticks(ax.get_yticks()[1:-1:2])
ax.legend()
plt.tight_layout()
plt.show()
Proxy 1 only and Proxy 2 only are not plotted: both proxies are biased (proxy mean ≠ true mean), so their coverage is invalid (close to 0), whereas Multi-PPI++ and True only remain valid across all correlation levels.
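A back-of-the-envelope calculation shows why the proxy-only coverage collapses. The sketch below approximates the coverage of a CLT interval centred on a biased Bernoulli mean, assuming the proxy-only estimate uses all N_TOTAL = 1500 proxy labels (as in simulate_estimates) and that the normal approximation holds. The function name `biased_interval_coverage` is hypothetical.

```python
import math
from statistics import NormalDist


def biased_interval_coverage(proxy_mean, true_mean, n, confidence_level=0.9):
    """Approximate coverage of a CLT interval centred on a biased Bernoulli mean."""
    normal = NormalDist()
    z = normal.inv_cdf(0.5 + confidence_level / 2)
    standard_error = math.sqrt(proxy_mean * (1 - proxy_mean) / n)
    shift = (true_mean - proxy_mean) / standard_error  # bias in standard-error units
    # The interval covers true_mean when the standardised estimate
    # lands in [shift - z, shift + z].
    return normal.cdf(shift + z) - normal.cdf(shift - z)


# Means copied from the parameter cell: TRUE_MEAN = 0.55, PROXY_MEANS = [0.5, 0.45].
coverage_proxy_1 = biased_interval_coverage(0.5, 0.55, n=1500)
coverage_proxy_2 = biased_interval_coverage(0.45, 0.55, n=1500)
print(f"Proxy 1 only: {coverage_proxy_1:.4f}, Proxy 2 only: {coverage_proxy_2:.2e}")
```

With these parameters the bias is several standard errors wide, so the approximate coverage is on the order of 1% for proxy 1 and essentially zero for proxy 2, far below the 90% target.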
Confidence Interval Width
Coverage validity is necessary but not sufficient: we also want short intervals. The width difference between True only and Multi-PPI++ is attributable solely to the proxy labels.
We compare mean confidence interval widths for Multi-PPI++ and True only across correlation levels.
width_by_corr = {}
for correlation in correlations:
    width_by_corr[correlation] = {}
    for method in METHODS:
        lower_bound = raw_stats[correlation][method]["lower_bounds"][CONFIDENCE_LEVEL]
        upper_bound = raw_stats[correlation][method]["upper_bounds"][CONFIDENCE_LEVEL]
        width_by_corr[correlation][method] = upper_bound - lower_bound
fig, ax = plt.subplots(figsize=(9, 5))
plot_methods = ["True only", "Multi-PPI++"]
colors_w = {"True only": "steelblue", "Multi-PPI++": "darkorange"}
# Compute percentiles based on CONFIDENCE_LEVEL
lower_percentile = round(((1 - CONFIDENCE_LEVEL) / 2) * 100)
upper_percentile = 100 - lower_percentile
for method in plot_methods:
    means_w = [np.mean(width_by_corr[correlation][method]) for correlation in correlations]
    q_lower = [np.percentile(width_by_corr[correlation][method], lower_percentile) for correlation in correlations]
    q_upper = [np.percentile(width_by_corr[correlation][method], upper_percentile) for correlation in correlations]
    ax.plot(correlations, means_w, marker="o", label=method, color=colors_w[method])
    ax.fill_between(correlations, q_lower, q_upper, alpha=0.15, color=colors_w[method])
ax.set_xlabel("Base proxy–true correlation")
ax.set_ylabel("Confidence interval width")
ax.set_xlim(-0.05, 0.85)
ax.yaxis.set_ticks(ax.get_yticks()[1:-1:2])
ax.legend()
plt.tight_layout()
plt.show()
As expected, Multi-PPI++'s interval width decreases with increasing correlation. The optimal lambda tuning allows the estimator to upweight informative proxies and downweight weaker ones, automatically exploiting the best available proxy signal.
Effective Sample Size
A natural summary of Multi-PPI++'s efficiency gain is the effective sample size (ESS): the number of true labels that would be needed to match Multi-PPI++'s mean confidence interval width.
We report the effective sample size of Multi-PPI++ across correlation levels, translating the width reduction into an equivalent number of true labels. See the Scientific Validation Methodology page for the formal definition and formula of ESS.
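As a rough guide to reading ESS values, the sketch below uses the fact that a CLT interval's width scales as 1/sqrt(n), so matching a narrower interval requires scaling the label count by the squared width ratio. This is an assumption made for illustration: the formal definition on the Scientific Validation Methodology page may differ, `effective_sample_size` is a hypothetical helper, and the widths are made-up numbers rather than results from this notebook.

```python
def effective_sample_size(n_labeled, classical_width, ppi_width):
    """Labels a classical interval would need to match ppi_width (illustrative)."""
    # CLT interval width scales as 1 / sqrt(n), so matching a narrower width
    # requires scaling n by the squared width ratio.
    return n_labeled * (classical_width / ppi_width) ** 2


# Made-up widths, for illustration only.
ess = effective_sample_size(n_labeled=500, classical_width=0.07, ppi_width=0.05)
print(round(ess))
```

Under this scaling, an ESS equal to BUDGET means the proxies contributed nothing, and any value above BUDGET quantifies the number of annotations saved.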
ess_mean = [np.mean(raw_stats[correlation]["Multi-PPI++"]["effective_sample_sizes"]) for correlation in correlations]
ess_q_lower = [
    np.percentile(raw_stats[correlation]["Multi-PPI++"]["effective_sample_sizes"], lower_percentile)
    for correlation in correlations
]
ess_q_upper = [
    np.percentile(raw_stats[correlation]["Multi-PPI++"]["effective_sample_sizes"], upper_percentile)
    for correlation in correlations
]
fig, ax = plt.subplots(figsize=(8, 5))
ax.plot(correlations, ess_mean, marker="o", color="darkorange", label="Multi-PPI++ ESS (mean)")
ax.fill_between(
    correlations,
    ess_q_lower,
    ess_q_upper,
    alpha=0.15,
    color="darkorange",
    label=f"{lower_percentile:.0f}th–{upper_percentile:.0f}th percentile",
)
ax.axhline(y=BUDGET, color="steelblue", linestyle="--", lw=2, label=f"Baseline (True only, n={BUDGET})")
ax.set_xlabel("Base proxy–true correlation")
ax.set_ylabel("Effective sample size")
ax.set_xlim(-0.05, 0.85)
ax.legend()
plt.tight_layout()
plt.show()
Summary
This notebook has empirically validated that GLIDE's Multi-PPI++ implementation satisfies two key statistical properties:
| Property | Result |
|---|---|
| Coverage validity | Multi-PPI++ achieves the nominal coverage across all correlation levels and confidence levels tested |
| Efficiency | Multi-PPI++ produces shorter confidence intervals than True only whenever at least one proxy has positive correlation, with the gain growing with correlation |
Crucially, both biased baselines (Proxy 1 only and Proxy 2 only) fail the coverage test. They appear precise but are systematically wrong because their means differ from TRUE_MEAN. Multi-PPI++ avoids this by correcting for proxy bias using the labeled subset.
The ESS analysis shows that with moderate proxy correlation, Multi-PPI++ is equivalent to having twice as much labeled data, a substantial practical gain in scenarios where true annotation is expensive. By jointly optimising the combination weights across both proxies, the estimator can fully exploit the more informative proxy while gracefully degrading toward the labeled-only estimate when both proxies are weak.