Scientific Validity of IPW PTD for Mean Estimation
This notebook provides empirical evidence that GLIDE's Inverse Probability Weighted Predict-Then-Debias (IPW PTD) implementation is statistically valid.
Setup: We estimate the mean of a binary outcome (e.g., the hallucination rate of an AI system). We have:
- A pool of N_TOTAL samples, each with a proxy label $\tilde{Y}$ (y_proxy) and an oracle proxy uncertainty that quantifies how unreliable the proxy is for each individual sample
- A labeling budget of N_LABELED samples: we can reveal the true label $Y$ (y_true_oracle) for only a fraction of the pool
Samples are selected for labeling based on sampling probabilities ($\pi_i \propto \text{uncertainty}_i$): samples where the proxy is least reliable are labeled with higher probability. IPW PTD then corrects for this non-uniform selection via Inverse Probability Weighting (IPW), yielding confidence intervals that are:
- Valid: they cover the true mean at the specified rate regardless of the sampling rule
- Shorter: active sampling concentrates the labeling budget on uncertain samples, producing shorter intervals when the proxy is sufficiently informative
We test these two claims empirically across a range of proxy/true correlation levels.
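To see why the IPW correction matters, here is a minimal sketch, independent of GLIDE, of the classical inverse-probability-weighted mean under Bernoulli sampling. The sampling probabilities below, and the fact that selection depends directly on the outcome, are hypothetical choices made to exaggerate the selection bias; they are not the notebook's setup.

```python
import numpy as np

rng = np.random.default_rng(0)
n_total = 100_000

# Hypothetical binary outcome with a true mean of about 0.55
y = rng.binomial(1, 0.55, size=n_total).astype(float)

# Hypothetical non-uniform sampling rule: zeros are labeled three times as often as ones
pi = np.where(y == 1, 0.02, 0.06)
xi = rng.random(n_total) < pi  # Bernoulli selection indicators

# Naive mean over labeled samples ignores the sampling rule and is biased
naive_mean = y[xi].mean()

# IPW mean reweights each labeled sample by 1 / pi_i and averages over the whole pool
ipw_mean = np.sum(xi * y / pi) / n_total

print(f"pool mean : {y.mean():.3f}")
print(f"naive mean: {naive_mean:.3f}")
print(f"IPW mean  : {ipw_mean:.3f}")
```

The naive mean is pulled toward the over-sampled class, while the IPW mean stays close to the pool mean because each selected sample stands in for $1/\pi_i$ samples of the pool.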
from functools import partial
import matplotlib.pyplot as plt
import numpy as np
from glide.estimators import ClassicalMeanEstimator, IPWClassicalMeanEstimator, IPWPTDMeanEstimator
from glide.samplers import ActiveSampler
from glide.scientific_validation import compute_hits, coverage_with_error_bar, run_monte_carlo
from glide.simulators import generate_binary_dataset_with_oracle_sampling, simulate_annotation
plt.rcParams.update(
    {
        "font.size": 18,
        "axes.labelsize": 18,
        "axes.titlesize": 18,
        "legend.fontsize": 16,
        "xtick.labelsize": 16,
        "ytick.labelsize": 16,
        "figure.titlesize": 19,
    }
)
Experiment Parameters
We fix all parameters up front so every section of this notebook uses a consistent setup. We define:
- CONFIDENCE_LEVEL: the confidence level at which we will compute confidence intervals.
- N_TOTAL: the total number of samples in the pool. Every sample has a proxy prediction and an oracle proxy uncertainty.
- N_LABELED: the number of labeled samples in the pool.
- TRUE_MEAN: the true mean value of human labels.
- PROXY_MEAN: the (biased) proxy mean value.
- N_SEEDS: the number of simulations we will run in our Monte Carlo experiments.
Note on correlation bounds: Depending on the values of TRUE_MEAN and PROXY_MEAN, extreme correlation values (close to -1 or 1) may not be achievable. Correlation sweeps are kept within these limits.
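The limits mentioned in this note can be illustrated with the standard Fréchet bounds on the correlation of two Bernoulli variables. The helper name below is hypothetical, and whether GLIDE's simulator computes the bounds this way is an assumption; the sketch only shows why a sweep up to 0.9 is feasible for the means used here.

```python
import math


def bernoulli_correlation_bounds(p, q):
    """Smallest and largest correlation achievable between Bernoulli(p) and Bernoulli(q)."""
    denom = math.sqrt(p * (1 - p) * q * (1 - q))
    # P(both = 1) is bounded by max(0, p + q - 1) from below and min(p, q) from above
    lower = (max(0.0, p + q - 1) - p * q) / denom
    upper = (min(p, q) - p * q) / denom
    return lower, upper


lower, upper = bernoulli_correlation_bounds(0.55, 0.5)
print(f"achievable correlation range: [{lower:.3f}, {upper:.3f}]")
```

With a true mean of 0.55 and a proxy mean of 0.5, the upper bound is slightly above 0.9, so correlations closer to 1 cannot be simulated for these means.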
Finally, we define the baseline estimation methods for comparison:
- True only: uses N_LABELED actively sampled true labels with an IPW-corrected classical CLT confidence interval. No proxy labels are used.
- Proxy only: uses proxy labels only, without correction.
- IPW PTD: Inverse Probability Weighted Predict-Then-Debias, which uses the same actively sampled true labels as True only, further combined with IPW-rectified proxy labels for additional efficiency.
CONFIDENCE_LEVEL = 0.9
N_TOTAL = 6500 # total pool size (all samples have oracle uncertainty)
N_LABELED = 200 # labeling budget
TRUE_MEAN = 0.55
PROXY_MEAN = 0.5
N_SEEDS = 1000
METHODS = ["True only", "Proxy only", "IPW PTD"]
correlations = np.arange(0.1, 0.95, 0.1)
n_correlations = len(correlations)
correlations_lmh = [
    correlations[n_correlations // 4],
    correlations[n_correlations // 2],
    correlations[3 * n_correlations // 4],
]  # low, medium and high values
corr_labels = ["Low", "Medium", "High"]
Data Simulation
We use generate_binary_dataset_with_oracle_sampling to simulate a realistic evaluation scenario.
It returns three parallel arrays of length N_TOTAL, one value per sample:
- y_true_oracle ($Y$): ground-truth binary label (latent, revealed only for labeled samples)
- y_proxy ($\tilde{Y}$): proxy binary prediction (always available for every sample)
- uncertainty: oracle proxy uncertainty $\sqrt{\mathbb{E}[(\tilde{Y}_i - Y_i)^2 \mid x_i]}$, which quantifies per-sample proxy reliability
Samples with high uncertainty are those where the proxy is least reliable. The True only and IPW PTD methods assign higher labeling probabilities $\pi_i$ to these samples via ActiveSampler by solving the optimization:
$$\mathrm{minimize} \sum_i \frac{\text{uncertainty}_i^2}{\pi_i}$$
subject to $\pi_i \in (0, 1]$ for all $i$ and $\sum_i \pi_i = N_{\text{labeled}}$. This concentrates the labeling budget on samples where true labels add the most information.
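As an illustration of this optimization, the sketch below computes its textbook solution: probabilities proportional to the uncertainty, with any value that would exceed 1 capped and the remaining budget redistributed. The function name is hypothetical, strictly positive uncertainties and a budget no larger than the pool are assumed, and this is not ActiveSampler's actual code (in particular it omits the budget cut-off described later in this notebook).

```python
import numpy as np


def proportional_allocation(uncertainty, budget):
    """Minimize sum(u_i^2 / pi_i) subject to sum(pi_i) = budget and 0 < pi_i <= 1."""
    u = np.asarray(uncertainty, dtype=float)
    pi = np.zeros_like(u)
    capped = np.zeros(u.shape, dtype=bool)
    while True:
        free = ~capped
        remaining = budget - capped.sum()
        # Uncapped samples share the remaining budget proportionally to u_i
        pi[free] = remaining * u[free] / u[free].sum()
        pi[capped] = 1.0
        over = free & (pi > 1.0)
        if not over.any():
            return pi
        capped |= over  # cap offenders at 1 and redistribute on the next pass


u = np.array([0.1, 0.2, 0.3, 0.4])  # hypothetical uncertainties
pi = proportional_allocation(u, budget=2)
uniform = np.full_like(u, 2 / len(u))
print("pi:", pi)
print("objective, proportional:", np.sum(u**2 / pi))
print("objective, uniform     :", np.sum(u**2 / uniform))
```

On this toy input the proportional rule gives a lower objective value than uniform sampling with the same budget, which is the sense in which it concentrates labels where they add the most information.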
The build_dataset helper below applies ActiveSampler to the uncertainty to compute sampling probabilities $\pi_i$ and Bernoulli selection indicators $\xi_i$ for each sample. Samples with $\xi_i = 0$ have their y_true_oracle value set to np.nan (unobserved).
def build_dataset(y_true_oracle, y_proxy, uncertainty, seed):
    # Active sampling via ActiveSampler (shared by True only and IPW PTD)
    pi, xi = ActiveSampler().sample(uncertainty, budget=N_LABELED, random_seed=seed)
    y_true = simulate_annotation(y_true_oracle, xi)
    return y_true, y_proxy, pi
We now use this helper to simulate a single example dataset, with a correlation of 0.5, for illustration.
y_true_oracle, y_proxy, uncertainty = generate_binary_dataset_with_oracle_sampling(
    n_total=N_TOTAL,
    true_mean=TRUE_MEAN,
    proxy_mean=PROXY_MEAN,
    correlation=0.5,
    random_seed=0,
)
y_true, y_proxy, pi = build_dataset(y_true_oracle, y_proxy, uncertainty, seed=0)
n_labeled = int(np.sum(~np.isnan(y_true)))
We now print some statistics about the labeling budget and the sampling probabilities.
print(f"Total samples : {N_TOTAL}")
print(f"Labeling budget : {N_LABELED}")
print(f"Labeled (realized, Bernoulli) : {n_labeled}")
print(f"\nSampling probability p, — min: {pi.min():.3f}, max: {pi.max():.3f}, mean: {pi.mean():.3f}")
Total samples : 6500
Labeling budget : 200
Labeled (realized, Bernoulli) : 200

Sampling probability p, — min: 0.000, max: 0.041, mean: 0.029
Let's look at how the active sampling probability $\pi_i$ is distributed across samples in this example. Since both True only and IPW PTD share this sampling rule, this distribution applies to both.
fig, ax = plt.subplots(figsize=(8, 4))
ax.hist(pi[pi > 0], bins=30, color="darkorange", alpha=0.75, label="Active sampling probability $\\pi_i$")
ax.set_xlabel("Sampling probability $\\pi_i$")
ax.set_ylabel("Count")
ax.legend()
ax.xaxis.set_ticks(ax.get_xticks()[1:-1:2])
plt.tight_layout()
plt.show()
print(f"Cut-off samples : {np.sum(pi == 0)}/{len(pi)}")
Cut-off samples : 433/6500
The histogram shows that $\pi_i$ values are spread around the mean labeling rate. Samples where the proxy is unreliable (high uncertainty) receive a higher sampling probability, while samples where the proxy is already reliable receive a lower one.
Note that independent random sampling can result in more sampled annotations than the given budget. To prevent this, the ActiveSampler uses a cut-off mechanism which sets some samples' probabilities to zero once the budget is reached. These are excluded from the above histogram.
Inference Results
All three methods receive the same labeled samples (drawn with the same active sampling rule). Their differences are summarised below:
| Estimation method | Data used | Notes |
|---|---|---|
| True only | y_true (active sampling, IPW-corrected) | No proxy labels |
| Proxy only | y_proxy | Biased, cheap but wrong |
| IPW PTD | y_true (active sampling, IPW-rectified) + y_proxy | Same labels as True only, plus proxy rectification |
The function below simulates a dataset for a given seed and correlation level, then runs all three estimation methods on it.
def simulate_estimates(seed, correlation):
    y_true_oracle, y_proxy, uncertainty = generate_binary_dataset_with_oracle_sampling(
        n_total=N_TOTAL,
        true_mean=TRUE_MEAN,
        proxy_mean=PROXY_MEAN,
        correlation=correlation,
        random_seed=seed,
    )
    y_true, y_proxy, pi = build_dataset(y_true_oracle, y_proxy, uncertainty, seed)
    # --- IPW PTD (active sampling, IPW-corrected proxy rectification) ---
    estimator = IPWPTDMeanEstimator()
    ipwptd_result = estimator.estimate(
        y_true,
        y_proxy,
        pi,
        confidence_level=CONFIDENCE_LEVEL,
        n_bootstrap=500,
    )
    # --- True only (active sampling, IPW-corrected, no proxy) ---
    true_only_result = IPWClassicalMeanEstimator().estimate(y_true, pi, confidence_level=CONFIDENCE_LEVEL)
    # --- Proxy only (no sampling correction, biased) ---
    classical_estimator = ClassicalMeanEstimator()
    proxy_only_result = classical_estimator.estimate(y_proxy, confidence_level=CONFIDENCE_LEVEL)
    return {
        "True only": {
            "mean": true_only_result.mean,
            "std": true_only_result.std,
            "confidence_interval": true_only_result.confidence_interval,
        },
        "Proxy only": {
            "mean": proxy_only_result.mean,
            "std": proxy_only_result.std,
            "confidence_interval": proxy_only_result.confidence_interval,
        },
        "IPW PTD": {
            "mean": ipwptd_result.mean,
            "std": ipwptd_result.std,
            "confidence_interval": ipwptd_result.confidence_interval,
            "effective_sample_size": ipwptd_result.effective_sample_size,
        },
    }
Three estimators are used here: IPWPTDMeanEstimator implements IPW PTD, IPWClassicalMeanEstimator implements IPW-corrected mean estimation, and ClassicalMeanEstimator implements conventional mean estimation.
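For intuition, the point estimate behind the rectification idea can be sketched in its textbook form: the proxy mean plus an IPW-weighted average of the observed proxy errors. The function name is hypothetical and this is only an assumption about the general technique; IPWPTDMeanEstimator may differ in its details (for example, its n_bootstrap argument suggests the interval is built by resampling, which is not shown here).

```python
import numpy as np


def ipw_ptd_mean_sketch(y_true, y_proxy, pi):
    """Proxy mean plus IPW-weighted rectifier; unlabeled entries of y_true are NaN."""
    y_true = np.asarray(y_true, dtype=float)
    y_proxy = np.asarray(y_proxy, dtype=float)
    pi = np.asarray(pi, dtype=float)

    labeled = ~np.isnan(y_true)
    # Rectifier term: (Y_i - proxy_i) / pi_i on labeled samples, 0 elsewhere
    rectifier = np.zeros_like(y_proxy)
    rectifier[labeled] = (y_true[labeled] - y_proxy[labeled]) / pi[labeled]
    return y_proxy.mean() + rectifier.mean()


# Toy check: when every sample is labeled with probability 1,
# the estimate reduces to the mean of the true labels.
y_true_toy = np.array([1.0, 0.0, 1.0, 1.0])
y_proxy_toy = np.array([0.0, 0.0, 1.0, 1.0])
estimate = ipw_ptd_mean_sketch(y_true_toy, y_proxy_toy, np.ones(4))
print(estimate, y_true_toy.mean())
```

When the proxy agrees with the true label on most samples, the rectifier terms are mostly zero, which is why a well-correlated proxy reduces the variance of the estimate.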
Coverage Validity
A confidence interval is valid if it reliably captures the true value at the nominal rate: a 90% confidence interval is valid if, across many repetitions, around 90% of the resulting intervals contain the true value.
The IPW correction is what maintains coverage: weighting each labeled sample by the inverse of its sampling probability, $1/\pi_i$, removes the bias introduced by selecting samples according to the oracle uncertainty, so the resulting confidence intervals remain valid, as they would under uniform sampling.
We run a Monte Carlo experiment to verify this for each method. We check that the empirical coverage tracks the nominal level throughout, including under the non-uniform active sampling rule. See the Scientific Validation Methodology page for more details about the verification protocol.
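The idea of such a coverage check can be sketched without GLIDE, using a plain CLT interval on uniformly sampled binary data. This is not the run_monte_carlo protocol; the function name and its parameters are hypothetical and only illustrate how hits are counted over repetitions.

```python
from statistics import NormalDist

import numpy as np


def empirical_coverage(true_mean, n, confidence_level, n_repeats, seed=0):
    """Fraction of repetitions whose CLT interval contains the true mean."""
    rng = np.random.default_rng(seed)
    # Two-sided normal quantile, e.g. about 1.645 for a 90% interval
    z = NormalDist().inv_cdf(0.5 + confidence_level / 2)
    samples = rng.binomial(1, true_mean, size=(n_repeats, n))
    means = samples.mean(axis=1)
    half_widths = z * samples.std(axis=1, ddof=1) / np.sqrt(n)
    # A "hit" is an interval [mean - hw, mean + hw] that covers the true mean
    hits = np.abs(means - true_mean) <= half_widths
    return hits.mean()


coverage = empirical_coverage(true_mean=0.55, n=200, confidence_level=0.9, n_repeats=2000)
print(f"observed coverage: {coverage:.3f}")
```

For a valid method the observed coverage lands near the nominal level, up to Monte Carlo noise that shrinks as the number of repetitions grows.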
Coverage vs confidence level for three correlation levels
We sweep the confidence level from 0.55 to 0.95 and plot the observed coverage. For a valid estimation method, the dots should fall on or around the black diagonal $y = \text{confidence level}$.
We do this for low, medium and high proxy correlation.
# Run Monte Carlo simulations for each correlation level
confidence_levels = np.arange(0.55, 1.00, 0.05)
confidence_levels = np.round(confidence_levels, 2)
raw_stats = {
    corr: run_monte_carlo(confidence_levels, partial(simulate_estimates, correlation=corr)) for corr in correlations
}
# Derive coverage for every (correlation, confidence_level) pair
coverages_confidence_intervals = {}
for correlation in correlations_lmh:
    coverages_confidence_intervals[correlation] = {}
    for confidence_level in confidence_levels:
        hits = compute_hits(raw_stats[correlation], confidence_level, TRUE_MEAN)
        coverages_confidence_intervals[correlation][confidence_level] = dict()
        for method in METHODS:
            coverage_confidence_interval = coverage_with_error_bar(hits[method], confidence_level=CONFIDENCE_LEVEL)
            coverages_confidence_intervals[correlation][confidence_level][method] = coverage_confidence_interval
fig, axes = plt.subplots(1, 3, figsize=(15, 5), sharey=True)
colors = {"True only": "steelblue", "IPW PTD": "darkorange", "Proxy only": "red"}
for ax, correlation, label in zip(axes, correlations_lmh, corr_labels):
    ax.plot(confidence_levels, confidence_levels, color="black", lw=1.5, linestyle="--", label="Ideal")
    for method in METHODS:
        mean_ci = np.array([coverages_confidence_intervals[correlation][cl][method] for cl in confidence_levels])
        mean = mean_ci[:, 0]
        lo = mean_ci[:, 1]
        hi = mean_ci[:, 2]
        ax.plot(confidence_levels, mean, marker="o", color=colors[method], label=method)
        ax.fill_between(confidence_levels, lo, hi, alpha=0.15, color=colors[method])
    ax.set_xlabel("Target confidence level")
    ax.set_ylabel("Observed coverage")
    ax.set_title(f"{label} correlation (${round(correlation, 2)}$)")
    ax.legend(loc="lower right")
    ax.set_xlim(0.5, 1.0)
    ax.set_ylim(0.5, 1.0)
plt.tight_layout()
plt.show()
Both IPW PTD and True only track the diagonal closely across all correlation levels, confirming that IPW PTD achieves valid coverage regardless of proxy quality. The Proxy only curve falls below the plotted range: because it relies on biased data, its coverage is invalid (close to zero).
Since both IPW PTD and True only use the same active sampling rule, this comparison directly isolates the effect of incorporating proxy labels: IPW PTD's validity is preserved even after adding the proxy rectification step.
Coverage vs correlation for fixed confidence level
We now fix the confidence level and sweep a range of proxy-true correlation levels. This shows that IPW PTD's validity does not degrade as the proxy becomes weaker.
coverage_by_corr = {} # {correlation: {method: observed mean coverage}}
coverage_ci_by_corr = {} # {correlation: {method: (lower, upper) Confidence Interval on coverage}}
for correlation in correlations:
    hits = compute_hits(raw_stats[correlation], CONFIDENCE_LEVEL, TRUE_MEAN)
    coverage_by_corr[correlation] = {}
    coverage_ci_by_corr[correlation] = {}
    for method in METHODS:
        mean_cov, lo, hi = coverage_with_error_bar(hits[method], CONFIDENCE_LEVEL)
        coverage_by_corr[correlation][method] = mean_cov
        coverage_ci_by_corr[correlation][method] = (lo, hi)
fig, ax = plt.subplots(figsize=(8, 5))
method_colors = {"True only": "steelblue", "Proxy only": "red", "IPW PTD": "darkorange"}
for method in ["True only", "IPW PTD"]:
    obs = np.array([coverage_by_corr[correlation][method] for correlation in correlations])
    ci_bounds = np.array([coverage_ci_by_corr[correlation][method] for correlation in correlations])
    lo = ci_bounds[:, 0]
    hi = ci_bounds[:, 1]
    ax.plot(correlations, obs, marker="o", color=method_colors[method], label=method)
    ax.fill_between(correlations, lo, hi, alpha=0.15, color=method_colors[method])
ax.axhline(y=CONFIDENCE_LEVEL, color="red", linestyle="--", lw=2, label=f"Target coverage {CONFIDENCE_LEVEL:.0%}")
ax.set_xlabel("Proxy–true correlation")
ax.set_ylabel("Observed coverage")
ax.set_xlim(0, 1)
ax.set_ylim(0.8, 1.0)
ax.yaxis.set_ticks(ax.get_yticks()[1:-1:2])
ax.legend()
plt.tight_layout()
plt.show()
Proxy only is omitted from this plot: because the proxy is biased (proxy mean ≠ true mean), its coverage is invalid (close to 0), whereas IPW PTD and True only remain valid across all correlation levels.
Confidence Interval Width
Coverage validity is necessary but not sufficient: we also want short intervals. Both True only and IPW PTD use the same active sampling, so any width difference between them is attributable solely to the proxy labels. IPW PTD uses the proxy signal to rectify the estimate, extracting additional information beyond what the true labels alone provide.
We compare mean confidence interval widths for IPW PTD and True only across correlation levels.
width_by_corr = {}
for correlation in correlations:
    width_by_corr[correlation] = {}
    for method in METHODS:
        lower_bound = raw_stats[correlation][method]["lower_bounds"][CONFIDENCE_LEVEL]
        upper_bound = raw_stats[correlation][method]["upper_bounds"][CONFIDENCE_LEVEL]
        width_by_corr[correlation][method] = upper_bound - lower_bound
fig, ax = plt.subplots(figsize=(9, 5))
plot_methods = ["True only", "IPW PTD"]
colors_w = {"True only": "steelblue", "IPW PTD": "darkorange"}
# Compute percentiles based on CONFIDENCE_LEVEL
lower_percentile = round(((1 - CONFIDENCE_LEVEL) / 2) * 100)
upper_percentile = 100 - lower_percentile
for method in plot_methods:
    means_w = [np.mean(width_by_corr[correlation][method]) for correlation in correlations]
    q_lower = [np.percentile(width_by_corr[correlation][method], lower_percentile) for correlation in correlations]
    q_upper = [np.percentile(width_by_corr[correlation][method], upper_percentile) for correlation in correlations]
    ax.plot(correlations, means_w, marker="o", label=method, color=colors_w[method])
    ax.fill_between(correlations, q_lower, q_upper, alpha=0.15, color=colors_w[method])
ax.set_xlabel("Proxy–true correlation")
ax.set_ylabel("Confidence Interval width")
ax.set_xlim(0.05, 0.95)
ax.legend()
plt.tight_layout()
plt.show()
As expected, IPW PTD's interval width decreases as the correlation increases. Since both methods share the same actively sampled labels, the width reduction is entirely due to the proxy rectification step. The benefit is visible mainly at moderate to high correlation values (> 0.4).
Effective Sample Size
A natural summary of IPW PTD's efficiency gain is the effective sample size (ESS): the number of true labels that would be needed to match IPW PTD's mean confidence interval width.
We report IPW PTD's effective sample size across correlation levels, translating the width reduction into an equivalent number of true labels. See the Scientific Validation Methodology page for the formal definition and formula of ESS.
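As a rough illustration of how a width reduction translates into labels, the sketch below assumes that a CLT interval's width shrinks as $1/\sqrt{n}$, so matching a narrower interval requires scaling the number of labels by the squared width ratio. This is an assumption made for intuition only; the formal ESS definition used by GLIDE is the one on the methodology page, and the function name and widths here are hypothetical.

```python
def effective_sample_size_sketch(n_labeled, baseline_width, ptd_width):
    """Labels the baseline would need to match ptd_width, assuming width ~ 1 / sqrt(n)."""
    return n_labeled * (baseline_width / ptd_width) ** 2


# Hypothetical widths: a baseline interval 1.5 times wider than the rectified one
ess = effective_sample_size_sketch(n_labeled=200, baseline_width=0.12, ptd_width=0.08)
print(f"effective sample size: {ess:.0f}")
```

Under this assumption, an interval that is 1.5 times narrower corresponds to 2.25 times as many labels.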
ess_mean = [np.mean(raw_stats[correlation]["IPW PTD"]["effective_sample_sizes"]) for correlation in correlations]
ess_q_lower = [
    np.percentile(raw_stats[correlation]["IPW PTD"]["effective_sample_sizes"], lower_percentile)
    for correlation in correlations
]
ess_q_upper = [
    np.percentile(raw_stats[correlation]["IPW PTD"]["effective_sample_sizes"], upper_percentile)
    for correlation in correlations
]
fig, ax = plt.subplots(figsize=(8, 5))
ax.plot(correlations, ess_mean, marker="o", color="darkorange", label="IPW PTD ESS (mean)")
ax.fill_between(
    correlations,
    ess_q_lower,
    ess_q_upper,
    alpha=0.15,
    color="darkorange",
    label=f"{lower_percentile}th–{upper_percentile}th percentile",
)
ax.axhline(y=N_LABELED, color="steelblue", linestyle="--", lw=2, label=f"Baseline (True only, n={N_LABELED})")
ax.set_xlabel("Proxy–true correlation")
ax.set_ylabel("Effective sample size")
ax.set_xlim(0.05, 0.95)
ax.legend()
plt.tight_layout()
plt.show()
Summary
This notebook has empirically validated that GLIDE's IPW PTD implementation satisfies two key statistical properties:
| Property | Result |
|---|---|
| Coverage validity | IPW PTD achieves the nominal coverage across all correlation levels and confidence levels tested |
| Efficiency | IPW PTD produces shorter confidence intervals than True only for sufficient correlation levels, with the gain growing with correlation |
Because both IPW PTD and True only share the same active sampling rule, every observed difference (in interval width or effective sample size) is attributable exclusively to the proxy labels. This clean comparison confirms that the proxy rectification step in IPW PTD adds genuine statistical efficiency without sacrificing validity.
Crucially, the biased baseline (Proxy only) fails the coverage test. It appears precise but is systematically wrong. IPW PTD avoids this by correcting for proxy bias via IPW using the labeled subset.
The ESS analysis shows that at a proxy correlation of $0.9$, IPW PTD is equivalent to having more than twice as much labeled data, a practical gain in scenarios where true annotation is expensive. This highlights the value of a good LLM judge when evaluating an AI system.