Scientific Validity of ASI for Mean Estimation
This notebook provides empirical evidence that GLIDE's Active Statistical Inference (ASI) implementation is statistically valid.
Setup: We estimate the mean of a binary outcome (e.g., the hallucination rate of an AI system). We have:
- A pool of N_TOTAL samples, each with a proxy label $\tilde{Y}$ (y_proxy) and an oracle proxy uncertainty that quantifies how unreliable the proxy is for each individual sample
- A labeling budget of N_LABELED samples: we can reveal the true label $Y$ (y_true_oracle) for only a fraction of the pool
ASI selects which samples to label using sampling probabilities ($\pi_i \propto \text{uncertainty}_i$): samples where the proxy is least reliable are labeled with higher probability. It then corrects for this non-uniform selection via Inverse Probability Weighting (IPW), yielding confidence intervals that are:
- Valid: they cover the true mean at the specified rate regardless of the sampling rule
- Shorter: active sampling concentrates the labeling budget on uncertain samples, producing shorter intervals when the proxy is sufficiently informative
We test these two claims empirically across a range of proxy/true correlation levels.
from functools import partial
import matplotlib.pyplot as plt
import numpy as np
from glide.estimators import ASIMeanEstimator, ClassicalMeanEstimator, IPWClassicalMeanEstimator
from glide.samplers.active import ActiveSampler
from glide.scientific_validation import compute_hits, coverage_with_error_bar, run_monte_carlo
from glide.simulators import generate_binary_dataset_with_oracle_sampling, simulate_annotation
plt.rcParams.update(
    {
        "font.size": 18,
        "axes.labelsize": 18,
        "axes.titlesize": 18,
        "legend.fontsize": 16,
        "xtick.labelsize": 16,
        "ytick.labelsize": 16,
        "figure.titlesize": 19,
    }
)
Experiment Parameters
We fix all parameters up front so every section of this notebook uses a consistent setup. We define:
- CONFIDENCE_LEVEL: the confidence level at which we will compute confidence intervals.
- N_TOTAL: the total number of samples in the pool. Every sample has a proxy prediction and an oracle proxy uncertainty.
- N_LABELED: the number of labeled samples in the pool.
- TRUE_MEAN: the true mean value of human labels.
- PROXY_MEAN: the (biased) proxy mean value.
- N_SEEDS: the number of simulations we will run in our Monte Carlo experiments.
Note on correlation bounds: Depending on the values of TRUE_MEAN and PROXY_MEAN, extreme correlation values (close to -1 or 1) may not be achievable. Correlation sweeps are kept within these limits.
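The achievable range follows from the Fréchet bounds on the joint probability $P(Y = 1, \tilde{Y} = 1)$ of two Bernoulli variables. A minimal sketch of that calculation is given here; the helper name bernoulli_correlation_bounds is ours and is not part of GLIDE.

```python
import numpy as np


def bernoulli_correlation_bounds(p, q):
    """Return (min, max) Pearson correlation for Bernoulli(p) and Bernoulli(q).

    Hypothetical helper for illustration; not part of GLIDE.
    """
    # Frechet bounds on the joint probability P(Y = 1, Y_proxy = 1).
    joint_min = max(0.0, p + q - 1.0)
    joint_max = min(p, q)
    # Correlation = (joint - p * q) / (std_p * std_q).
    scale = np.sqrt(p * (1 - p) * q * (1 - q))
    return (joint_min - p * q) / scale, (joint_max - p * q) / scale


lo, hi = bernoulli_correlation_bounds(0.55, 0.5)
print(f"achievable correlation range: [{lo:.3f}, {hi:.3f}]")
```

With TRUE_MEAN = 0.55 and PROXY_MEAN = 0.5 the bounds come out at roughly ±0.905, so a sweep that stops at 0.9 stays inside the achievable range.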
Finally, we define the estimation methods to compare:
- True only: uses N_LABELED actively sampled true labels with an IPW-corrected classical CLT confidence interval. No proxy labels are used.
- Proxy only: uses proxy labels only, without correction.
- ASI: Active Statistical Inference, which uses the same actively sampled true labels as True only, further combined with IPW-rectified proxy labels for additional efficiency.
CONFIDENCE_LEVEL = 0.9
N_TOTAL = 4400 # total pool size (all samples have oracle uncertainty)
N_LABELED = 200 # labeling budget
TRUE_MEAN = 0.55
PROXY_MEAN = 0.5
N_SEEDS = 1000
METHODS = ["True only", "Proxy only", "ASI"]
correlations = np.arange(0.1, 0.95, 0.1)
n_correlations = len(correlations)
correlations_lmh = [
correlations[n_correlations // 4],
correlations[n_correlations // 2],
correlations[3 * n_correlations // 4],
] # low, medium and high values
corr_labels = ["Low", "Medium", "High"]
Data Simulation
We use generate_binary_dataset_with_oracle_sampling to simulate a realistic evaluation scenario.
It returns three parallel arrays of length N_TOTAL, one value per sample:
- y_true_oracle ($Y$): ground-truth binary label (latent, revealed only for labeled samples)
- y_proxy ($\tilde{Y}$): proxy binary prediction (always available for every sample)
- uncertainty: oracle proxy uncertainty $\sqrt{\mathbb{E}[(\tilde{Y}_i - Y_i)^2 \mid x_i]}$, which quantifies per-sample proxy reliability
Samples with high uncertainty are those where the proxy is least reliable. The True only and ASI methods assign higher labeling probabilities $\pi_i$ to these samples via ActiveSampler by solving the optimization:
$$\mathrm{minimize} \sum_i \frac{\text{uncertainty}_i^2}{\pi_i}$$
subject to $\pi_i \in (0, 1]$ for all $i$ and $\sum_i \pi_i = N_{\text{labeled}}$. This concentrates the labeling budget on samples where true labels add the most information.
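Ignoring the upper bound, a Lagrange-multiplier argument gives the minimizer $\pi_i \propto \text{uncertainty}_i$, which is the sampling rule stated in the introduction. The sketch below adds a simple capping loop for the $\pi_i \le 1$ constraint. It assumes strictly positive uncertainties and a budget smaller than the pool; the function name is hypothetical, and this is not necessarily how ActiveSampler solves the problem.

```python
import numpy as np


def proportional_probabilities(uncertainty, budget):
    """Sampling probabilities proportional to uncertainty, capped at 1.

    Illustrative sketch only; assumes uncertainty > 0 and budget < len(uncertainty).
    """
    u = np.asarray(uncertainty, dtype=float)
    pi = np.zeros_like(u)
    capped = np.zeros(u.shape, dtype=bool)
    while True:
        free = ~capped
        # Spread the budget left after the capped samples over the free ones.
        remaining = budget - capped.sum()
        pi[free] = remaining * u[free] / u[free].sum()
        pi[capped] = 1.0
        over = free & (pi > 1.0)
        if not over.any():
            return pi
        capped |= over


rng = np.random.default_rng(0)
u = rng.uniform(0.1, 1.0, size=1000)
pi = proportional_probabilities(u, budget=200)

objective = np.sum(u**2 / pi)
uniform_objective = np.sum(u**2 / (200 / 1000))
print(f"sum(pi) = {pi.sum():.1f}")
print(f"objective: proportional = {objective:.1f}, uniform = {uniform_objective:.1f}")
```

By the Cauchy-Schwarz inequality the proportional rule never does worse than uniform probabilities on this objective, and the two coincide only when all uncertainties are equal.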
The build_dataset helper below applies ActiveSampler to the uncertainty to compute sampling probabilities $\pi_i$ and Bernoulli selection indicators $\xi_i$ for each sample. Samples with $\xi_i = 0$ have their y_true_oracle value set to np.nan (unobserved).
def build_dataset(y_true_oracle, y_proxy, uncertainty, seed):
    # Active sampling via ActiveSampler, shared by True only and ASI
    pi, xi = ActiveSampler().sample(uncertainty, budget=N_LABELED, random_seed=seed)
    y_true = simulate_annotation(y_true_oracle, xi)
    return y_true, y_proxy, pi
We now use build_dataset to simulate a single example dataset for illustration, with correlation = 0.5.
y_true_oracle, y_proxy, uncertainty = generate_binary_dataset_with_oracle_sampling(
    n_total=N_TOTAL,
    true_mean=TRUE_MEAN,
    proxy_mean=PROXY_MEAN,
    correlation=0.5,
    random_seed=0,
)
y_true, y_proxy, pi = build_dataset(y_true_oracle, y_proxy, uncertainty, seed=0)
n_labeled = int(np.sum(~np.isnan(y_true)))
We now print some statistics about the labeling budget and sampling probabilities.
print(f"Total samples : {N_TOTAL}")
print(f"Labeling budget : {N_LABELED}")
print(f"Labeled (realized, Bernoulli) : {n_labeled}")
print(f"\nSampling probability p, — min: {pi.min():.3f}, max: {pi.max():.3f}, mean: {pi.mean():.3f}")
Total samples : 4400 Labeling budget : 200 Labeled (realized, Bernoulli) : 200 Sampling probability p, — min: 0.000, max: 0.061, mean: 0.045
Let's look at how the active sampling probability $\pi_i$ is distributed across samples in this example. Since both True only and ASI share this sampling rule, this distribution applies to both.
fig, ax = plt.subplots(figsize=(8, 4))
ax.hist(pi[pi > 0], bins=30, color="darkorange", alpha=0.75, label="Active sampling probability $\\pi_i$")
ax.set_xlabel("Sampling probability $\\pi_i$")
ax.xaxis.set_ticks(ax.get_xticks()[1:-1:2])
ax.set_ylabel("Count")
ax.legend()
plt.tight_layout()
plt.show()
print(f"Cut-off samples : {np.sum(pi == 0)}/{len(pi)}")
Cut-off samples : 5/4400
The histogram shows that $\pi_i$ values are spread around the mean labeling rate. Samples where the proxy is unreliable (high uncertainty) receive a higher sampling probability, while samples where the proxy is already reliable receive a lower one.
Note that independent random sampling can result in more sampled annotations than the given budget. To prevent this, the ActiveSampler uses a cut-off mechanism which sets some samples' probabilities to zero once the budget is reached. These are excluded from the above histogram.
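To see why a cut-off is needed, the sketch below draws independent Bernoulli indicators with uniform probabilities equal to the budget divided by the pool size, then counts the labels selected in each draw. It uses plain NumPy rather than ActiveSampler, so it shows the uncapped behaviour only.

```python
import numpy as np

n_total, budget = 4400, 200
rng = np.random.default_rng(0)
pi = np.full(n_total, budget / n_total)

# One row of Bernoulli selection indicators xi per repetition.
xi = rng.random((1000, n_total)) < pi
counts = xi.sum(axis=1)

print(f"mean labels per draw : {counts.mean():.1f}")
print(f"share over budget    : {(counts > budget).mean():.0%}")
```

The number of selected samples matches the budget only on average, so a sizable share of uncapped draws would exceed it.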
In the following sections, we will perform Monte Carlo experiments to estimate quantities such as confidence interval width.
This consists of running N_SEEDS simulations: in each one we simulate data, compute a confidence interval and measure its width. We end up with N_SEEDS values of the measured quantity, from which we can compute statistics.
The same method can be used to evaluate coverage, which will be defined and illustrated below.
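The loop described here can be sketched with a classical CLT interval on uniformly sampled labels. This is a stand-alone NumPy illustration with a hypothetical simulate_interval helper, not the implementation behind run_monte_carlo.

```python
from statistics import NormalDist

import numpy as np

CONFIDENCE_LEVEL = 0.9
TRUE_MEAN = 0.55
N_LABELED = 200
N_SEEDS = 1000


def simulate_interval(seed):
    """Draw N_LABELED uniform labels and return a classical CLT interval."""
    rng = np.random.default_rng(seed)
    y = (rng.random(N_LABELED) < TRUE_MEAN).astype(float)
    z = NormalDist().inv_cdf(0.5 + CONFIDENCE_LEVEL / 2)
    half_width = z * y.std(ddof=1) / np.sqrt(N_LABELED)
    return y.mean() - half_width, y.mean() + half_width


# One interval per seed, then summary statistics over all seeds.
bounds = np.array([simulate_interval(seed) for seed in range(N_SEEDS)])
widths = bounds[:, 1] - bounds[:, 0]
hits = (bounds[:, 0] <= TRUE_MEAN) & (TRUE_MEAN <= bounds[:, 1])

coverage = hits.mean()
# Binomial standard error of the coverage estimate itself.
coverage_se = np.sqrt(coverage * (1 - coverage) / N_SEEDS)
print(f"mean width: {widths.mean():.3f}")
print(f"coverage  : {coverage:.3f} +/- {coverage_se:.3f}")
```

Because coverage is itself estimated from a finite number of seeds, it carries its own error bar, which is why the coverage plots later in the notebook show shaded bands.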
Inference Results
All three methods are run on the same simulated dataset, and True only and ASI share the same labeled samples (drawn with the same active sampling rule). Their differences are summarised below:
| Estimation method | Data used | Notes |
|---|---|---|
| True only | y_true (active sampling, IPW-corrected) | No proxy labels |
| Proxy only | y_proxy | Biased, cheap but wrong |
| ASI | y_true (active sampling, IPW-rectified) + y_proxy | Same labels as True only, plus proxy rectification |
The function below simulates a dataset for a given seed and correlation level, then runs all three estimation methods on it.
def simulate_estimates(seed, correlation):
    y_true_oracle, y_proxy, uncertainty = generate_binary_dataset_with_oracle_sampling(
        n_total=N_TOTAL,
        true_mean=TRUE_MEAN,
        proxy_mean=PROXY_MEAN,
        correlation=correlation,
        random_seed=seed,
    )
    y_true, y_proxy, pi = build_dataset(y_true_oracle, y_proxy, uncertainty, seed)
    # --- ASI (active sampling, IPW-rectified proxy) ---
    estimator = ASIMeanEstimator()
    asi_result = estimator.estimate(y_true, y_proxy, pi, confidence_level=CONFIDENCE_LEVEL)
    # --- True only (active sampling, IPW-corrected, no proxy) ---
    true_only_result = IPWClassicalMeanEstimator().estimate(y_true, pi, confidence_level=CONFIDENCE_LEVEL)
    # --- Proxy only (no sampling correction, biased) ---
    classical_estimator = ClassicalMeanEstimator()
    proxy_only_result = classical_estimator.estimate(y_proxy, confidence_level=CONFIDENCE_LEVEL)
    return {
        "True only": {
            "mean": true_only_result.mean,
            "std": true_only_result.std,
            "confidence_interval": true_only_result.confidence_interval,
        },
        "Proxy only": {
            "mean": proxy_only_result.mean,
            "std": proxy_only_result.std,
            "confidence_interval": proxy_only_result.confidence_interval,
        },
        "ASI": {
            "mean": asi_result.mean,
            "std": asi_result.std,
            "confidence_interval": asi_result.confidence_interval,
            "effective_sample_size": asi_result.effective_sample_size,
        },
    }
ASI is implemented by the ASIMeanEstimator whereas IPWClassicalMeanEstimator implements IPW-corrected mean estimation and ClassicalMeanEstimator implements conventional mean estimation.
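For intuition, the three point estimates can be written as means of per-sample terms. The sketch below uses the Horvitz-Thompson form of IPW and a plain CLT interval on toy data with uniform sampling probabilities. All function names and the toy data are assumptions for illustration; GLIDE's estimators may differ in details such as weight normalization or the variance formula.

```python
from statistics import NormalDist

import numpy as np


def clt_interval(terms, confidence_level):
    """Mean of per-sample terms with a CLT confidence interval."""
    z = NormalDist().inv_cdf(0.5 + confidence_level / 2)
    center = terms.mean()
    half_width = z * terms.std(ddof=1) / np.sqrt(terms.size)
    return center, (center - half_width, center + half_width)


def ipw_terms(y_true, pi):
    """True only: y_i * xi_i / pi_i, which is zero for unlabeled samples."""
    labeled = ~np.isnan(y_true)
    terms = np.zeros(y_true.shape)
    terms[labeled] = y_true[labeled] / pi[labeled]
    return terms


def asi_terms(y_true, y_proxy, pi):
    """ASI: proxy plus IPW-weighted residual (y_i - proxy_i) * xi_i / pi_i."""
    labeled = ~np.isnan(y_true)
    terms = y_proxy.astype(float)  # astype returns a copy
    terms[labeled] += (y_true[labeled] - y_proxy[labeled]) / pi[labeled]
    return terms


# Toy data (assumed): the proxy agrees with the truth 80% of the time.
rng = np.random.default_rng(0)
n_total, budget = 4400, 200
y_oracle = (rng.random(n_total) < 0.55).astype(float)
y_proxy = np.where(rng.random(n_total) < 0.2, 1 - y_oracle, y_oracle)
pi = np.full(n_total, budget / n_total)
y_true = np.where(rng.random(n_total) < pi, y_oracle, np.nan)

for name, terms in [
    ("Proxy only", y_proxy),
    ("True only", ipw_terms(y_true, pi)),
    ("ASI", asi_terms(y_true, y_proxy, pi)),
]:
    mean, (lo, hi) = clt_interval(terms, 0.9)
    print(f"{name:10s} mean={mean:.3f}  CI=[{lo:.3f}, {hi:.3f}]")
```

In this form the ASI terms replace the large values $Y_i / \pi_i$ with the smaller residuals $(Y_i - \tilde{Y}_i) / \pi_i$, which is where the variance reduction comes from when the proxy is informative.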
Coverage Validity
A confidence interval is valid if it reliably captures the true value at the nominal rate: a 90% confidence interval is valid if, across many repetitions, around 90% of the resulting intervals contain the true value.
The IPW correction preserves coverage: weighting each labeled sample by the inverse of its sampling probability, $1/\pi_i$, removes the bias introduced by non-uniform selection, so the resulting confidence intervals remain valid, as they would be under uniform sampling.
We run a Monte Carlo experiment to verify this for each method. We check that the empirical coverage tracks the nominal level throughout, including under the non-uniform active sampling rule. See the Scientific Validation Methodology page for more details about the verification protocol.
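The de-biasing effect of IPW can be checked in isolation. The sketch below uses a deliberately extreme selection rule, assumed purely for illustration, in which positives are labeled four times as often as negatives. In ASI the probabilities depend on uncertainty rather than directly on the label, but the mechanism is the same.

```python
import numpy as np

rng = np.random.default_rng(0)
n_total = 4400
y = (rng.random(n_total) < 0.55).astype(float)
pool_mean = y.mean()

# Assumed selection rule: the labeled subset over-represents positives.
pi = np.where(y == 1, 0.08, 0.02)

naive, ipw = [], []
for _ in range(500):
    xi = rng.random(n_total) < pi
    # Plain mean of the labeled subset, ignoring how it was selected.
    naive.append(y[xi].mean())
    # Inverse probability weighting over the whole pool.
    ipw.append(np.sum(y * xi / pi) / n_total)

print(f"pool mean : {pool_mean:.3f}")
print(f"naive mean: {np.mean(naive):.3f}")
print(f"IPW mean  : {np.mean(ipw):.3f}")
```

Averaged over repetitions, the naive mean drifts well above the pool mean while the IPW mean stays centred on it.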
Coverage vs confidence level for three correlation levels
We sweep the confidence level from 0.55 to 0.95 and plot the observed coverage. For a valid estimation method, the dots should fall on or around the black diagonal $y = \text{confidence level}$.
We do this for low, medium and high proxy correlation.
# Run Monte Carlo simulations for each correlation level
confidence_levels = np.arange(0.55, 1.00, 0.05)
confidence_levels = np.round(confidence_levels, 2)
raw_stats = {
    corr: run_monte_carlo(confidence_levels, partial(simulate_estimates, correlation=corr)) for corr in correlations
}
# Derive coverage for every (correlation, confidence_level) pair
coverages_confidence_intervals = {}
for correlation in correlations_lmh:
    coverages_confidence_intervals[correlation] = {}
    for confidence_level in confidence_levels:
        hits = compute_hits(raw_stats[correlation], confidence_level, TRUE_MEAN)
        coverages_confidence_intervals[correlation][confidence_level] = {}
        for method in METHODS:
            coverage_confidence_interval = coverage_with_error_bar(hits[method], confidence_level=CONFIDENCE_LEVEL)
            coverages_confidence_intervals[correlation][confidence_level][method] = coverage_confidence_interval
fig, axes = plt.subplots(1, 3, figsize=(15, 5), sharey=True)
colors = {"True only": "steelblue", "ASI": "darkorange", "Proxy only": "red"}
for ax, correlation, label in zip(axes, correlations_lmh, corr_labels):
    ax.plot(confidence_levels, confidence_levels, color="black", lw=1.5, linestyle="--", label="Ideal")
    for method in METHODS:
        mean_ci = np.array([coverages_confidence_intervals[correlation][cl][method] for cl in confidence_levels])
        mean = mean_ci[:, 0]
        lo = mean_ci[:, 1]
        hi = mean_ci[:, 2]
        ax.plot(confidence_levels, mean, marker="o", color=colors[method], label=method)
        ax.fill_between(confidence_levels, lo, hi, alpha=0.15, color=colors[method])
    ax.set_xlabel("Target confidence level")
    ax.set_ylabel("Observed coverage")
    ax.legend()
    ax.set_xlim(0.5, 1.0)
    ax.set_ylim(0.5, 1.0)
plt.tight_layout()
plt.show()
Both ASI and True only track the diagonal closely across all correlation levels, confirming that ASI achieves valid coverage regardless of proxy quality. The Proxy only curve does not appear within the plotted range: because the proxy is biased, its coverage is far below the nominal level (close to zero).
Since both ASI and True only use the same active sampling rule, this comparison directly isolates the effect of incorporating proxy labels: ASI's validity is preserved even after adding the proxy rectification step.
Coverage vs correlation for fixed confidence level
We now fix the confidence level and sweep a range of proxy-true correlation levels. This shows that ASI's validity does not degrade as the proxy becomes weaker.
coverage_by_corr = {} # {correlation: {method: observed mean coverage}}
coverage_ci_by_corr = {} # {correlation: {method: (lower, upper) Confidence Interval on coverage}}
for correlation in correlations:
    hits = compute_hits(raw_stats[correlation], CONFIDENCE_LEVEL, TRUE_MEAN)
    coverage_by_corr[correlation] = {}
    coverage_ci_by_corr[correlation] = {}
    for method in METHODS:
        mean_cov, lo, hi = coverage_with_error_bar(hits[method], CONFIDENCE_LEVEL)
        coverage_by_corr[correlation][method] = mean_cov
        coverage_ci_by_corr[correlation][method] = (lo, hi)
fig, ax = plt.subplots(figsize=(8, 5))
method_colors = {"True only": "steelblue", "Proxy only": "red", "ASI": "darkorange"}
for method in ["True only", "ASI"]:
    obs = np.array([coverage_by_corr[correlation][method] for correlation in correlations])
    ci_bounds = np.array([coverage_ci_by_corr[correlation][method] for correlation in correlations])
    lo = ci_bounds[:, 0]
    hi = ci_bounds[:, 1]
    ax.plot(correlations, obs, marker="o", color=method_colors[method], label=method)
    ax.fill_between(correlations, lo, hi, alpha=0.15, color=method_colors[method])
ax.axhline(y=CONFIDENCE_LEVEL, color="red", linestyle="--", lw=2, label=f"Target coverage {CONFIDENCE_LEVEL:.0%}")
ax.set_xlabel("Proxy–true correlation")
ax.set_ylabel("Observed coverage")
ax.set_xlim(0, 1)
ax.set_ylim(0.8, 1.0)
ax.yaxis.set_ticks(ax.get_yticks()[1:-1:2])
ax.legend()
plt.tight_layout()
plt.show()
Note that Proxy only is not plotted because the proxy is biased (proxy mean ≠ true mean). Therefore it has invalid coverage (close to 0) whereas ASI and True only remain valid across all correlation levels.
Confidence Interval Width
Coverage validity is necessary but not sufficient: we also want short intervals. Both True only and ASI use the same active sampling, so any width difference between them is attributable solely to the proxy labels. ASI uses the proxy signal to rectify the estimate, extracting additional information beyond what the true labels alone provide.
We compare mean confidence interval widths for ASI and True only across correlation levels.
width_by_corr = {}
for correlation in correlations:
    width_by_corr[correlation] = {}
    for method in METHODS:
        lower_bound = raw_stats[correlation][method]["lower_bounds"][CONFIDENCE_LEVEL]
        upper_bound = raw_stats[correlation][method]["upper_bounds"][CONFIDENCE_LEVEL]
        width_by_corr[correlation][method] = upper_bound - lower_bound
fig, ax = plt.subplots(figsize=(9, 5))
plot_methods = ["True only", "ASI"]
colors_w = {"True only": "steelblue", "ASI": "darkorange"}
# Compute percentiles based on CONFIDENCE_LEVEL
lower_percentile = round(((1 - CONFIDENCE_LEVEL) / 2) * 100)
upper_percentile = 100 - lower_percentile
for method in plot_methods:
    means_w = [np.mean(width_by_corr[correlation][method]) for correlation in correlations]
    q_lower = [np.percentile(width_by_corr[correlation][method], lower_percentile) for correlation in correlations]
    q_upper = [np.percentile(width_by_corr[correlation][method], upper_percentile) for correlation in correlations]
    ax.plot(correlations, means_w, marker="o", label=method, color=colors_w[method])
    ax.fill_between(correlations, q_lower, q_upper, alpha=0.15, color=colors_w[method])
ax.set_xlabel("Proxy–true correlation")
ax.set_ylabel("Confidence Interval width")
ax.set_xlim(0.05, 0.95)
ax.legend()
plt.tight_layout()
plt.show()
As expected, ASI's interval width decreases as the correlation increases. Since both methods share the same active sample, the width reduction is entirely due to the proxy rectification step. The gain is visible mainly for correlations above roughly 0.4.
Effective Sample Size
A natural summary of ASI's efficiency gain is the effective sample size (ESS): the number of true labels that would be needed to match ASI's mean confidence interval width.
We report ASI's effective sample size across correlation levels, translating the width reduction into an equivalent number of true labels. See the Scientific Validation Methodology page for the formal definition and formula of ESS.
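As a rough intuition only: if interval width scales as $1/\sqrt{n}$, a width ratio converts to an equivalent label count as sketched below. This scaling rule is our assumption, and the function name and inputs are hypothetical; GLIDE's actual definition is the one given on the methodology page and may differ.

```python
def effective_sample_size(n_labeled, baseline_width, asi_width):
    """Labels a baseline would need to match asi_width, assuming width ~ 1/sqrt(n)."""
    return n_labeled * (baseline_width / asi_width) ** 2


# Made-up widths for illustration: an interval 1.5x narrower than the baseline.
ess = effective_sample_size(200, baseline_width=0.12, asi_width=0.08)
print(f"effective sample size: {ess:.0f}")
```

Under this assumption the gain is quadratic in the width ratio, so an interval 1.5 times narrower corresponds to 2.25 times as many labels.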
ess_mean = [np.mean(raw_stats[correlation]["ASI"]["effective_sample_sizes"]) for correlation in correlations]
ess_q_lower = [
    np.percentile(raw_stats[correlation]["ASI"]["effective_sample_sizes"], lower_percentile)
    for correlation in correlations
]
ess_q_upper = [
    np.percentile(raw_stats[correlation]["ASI"]["effective_sample_sizes"], upper_percentile)
    for correlation in correlations
]
fig, ax = plt.subplots(figsize=(8, 5))
ax.plot(correlations, ess_mean, marker="o", color="darkorange", label="ASI ESS (mean)")
ax.fill_between(
    correlations,
    ess_q_lower,
    ess_q_upper,
    alpha=0.15,
    color="darkorange",
    label=f"{lower_percentile}th–{upper_percentile}th percentile",
)
ax.axhline(y=N_LABELED, color="steelblue", linestyle="--", lw=2, label=f"Baseline (True only, n={N_LABELED})")
ax.set_xlabel("Proxy–true correlation")
ax.set_ylabel("Effective sample size")
ax.set_xlim(0.05, 0.95)
ax.legend()
plt.tight_layout()
plt.show()
Summary
This notebook has empirically validated that GLIDE's ASI implementation satisfies two key statistical properties:
| Property | Result |
|---|---|
| Coverage validity | ASI achieves the nominal coverage across all correlation levels and confidence levels tested |
| Efficiency | ASI produces shorter confidence intervals than True only for sufficient correlation levels, with the gain growing with correlation |
Because both ASI and True only share the same active sampling rule, every observed difference (in interval width or effective sample size) is attributable exclusively to the proxy labels. This clean comparison confirms that the proxy rectification step in ASI adds genuine statistical efficiency without sacrificing validity.
Crucially, the biased baseline (Proxy only) fails the coverage test. It appears precise but is systematically wrong. ASI avoids this by correcting for proxy bias via IPW using the labeled subset.
The ESS analysis shows that with a proxy correlation of $0.9$, ASI is equivalent to having more than twice as much labeled data, a practical gain in scenarios where true annotation is expensive. This highlights the importance of a good LLM judge when evaluating an AI system.