Cost and uncertainty scores available: Cost-Optimal Sampling¶
When evaluating an AI system, you often have access to two annotation sources: a cheap but biased LLM-as-judge and an expensive but accurate human annotator. Given a fixed budget, how should you split it between them? And should every conversation be treated the same, or should annotation effort focus on cases where the judge is least reliable?
A key observation is that cheap proxies rarely come as pure black boxes. An LLM judge can also give confidence scores on its outputs, and many annotation pipelines produce uncertainty estimates alongside each label. These signals are essentially free, and they carry direct information about where the proxy is likely to err. Rather than discarding them, a well-designed sampling strategy can use them to focus the expensive annotation budget where it is most needed.
Cost-Optimal Sampling turns this into a concrete strategy: it uses per-sample uncertainty estimates to compute annotation probabilities that concentrate the human annotation budget where the proxy is least reliable, minimising estimation variance for a given spend.
What you will learn:
- How to fit CostOptimalSampler on a burn-in dataset of ground-truth labels
- How to supply per-sample uncertainty scores to derive annotation probabilities that vary across conversations
- How to use the sampler's outputs with ASIMeanEstimator for valid, bias-corrected estimates
- How cost-optimal sampling compares to proxy-only and human-only strategies in balancing cost and accuracy
The evaluation challenge: two annotation sources, one budget¶
Suppose you are measuring the hallucination rate of a customer-facing AI assistant. Two annotation sources are available:
- LLM-as-judge: $0.05 per sample. Fast and scalable, but systematically under-reports hallucinations.
- Human annotator: $10.00 per sample. Ground-truth quality, but 200 times more expensive.
Your budget¶
You have 8,150 new conversations to evaluate and a total budget of $6,000, covering all annotation costs from start to finish.
The plan¶
You decide to spend the first part of your budget on a burn-in phase: annotate 150 conversations with both sources. This costs $1,507.50 ($10.00 + $0.05 per conversation) and serves two purposes:
- It reveals the judge's bias: the judge systematically under-reports hallucinations compared to human annotators.
- It provides the ground-truth labels that can be used to estimate the variance of true outcomes, which is needed to compute the optimal annotation policy.
The question¶
With $4,492.50 left, you now need to evaluate the remaining 8,000 conversations. How do you allocate the rest of your budget between proxy and human annotations?
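The budget arithmetic above can be checked in a few lines. This is a minimal sketch using only the figures quoted in the text; the variable names `max_human_without_proxy` and `max_human_with_full_proxy` are illustrative and not part of any library.

```python
# Figures taken from the scenario described above.
COST_PROXY = 0.05
COST_HUMAN = 10.00
BUDGET = 6000.00
N_BURN_IN = 150
N_MAIN = 8000

# Burn-in: every conversation gets both a human label and a proxy label.
burn_in_cost = N_BURN_IN * (COST_HUMAN + COST_PROXY)
remaining_budget = BUDGET - burn_in_cost

# Two extreme ways to spend what is left on the main segment:
# (a) buy human labels only, (b) proxy-label everything, then buy human labels.
max_human_without_proxy = int(remaining_budget // COST_HUMAN)
max_human_with_full_proxy = int((remaining_budget - N_MAIN * COST_PROXY) // COST_HUMAN)

print(f"Burn-in cost                 : ${burn_in_cost:,.2f}")
print(f"Remaining budget             : ${remaining_budget:,.2f}")
print(f"Human labels, no proxy       : {max_human_without_proxy}")
print(f"Human labels, all 8,000 proxied: {max_human_with_full_proxy}")
```

The two extremes bracket the decision: roughly 449 human labels with no proxy labels at all, or 409 human labels if every new conversation also receives a proxy label. The sampler's job is to pick a point between these extremes, and to decide which conversations receive the human labels.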
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
from glide.estimators import ASIMeanEstimator, ClassicalMeanEstimator
from glide.samplers import CostOptimalSampler, UniformSampler
from glide.simulators import generate_binary_dataset_with_oracle_sampling, simulate_annotation
C_OPTIMAL = "#27AE60" # cost-optimal (green)
C_PROXY = "#E67E22" # proxy only (orange)
C_HUMAN = "#2E86AB" # human only (blue)
C_TRUTH = "#2C3E50" # true value (dark slate)
plt.rcParams.update(
    {
        "figure.facecolor": "white",
        "axes.facecolor": "#FAFAFA",
        "axes.grid": True,
        "grid.color": "#E5E5E5",
        "grid.linewidth": 0.8,
        "font.size": 18,
        "axes.labelsize": 18,
        "axes.titlesize": 18,
        "legend.fontsize": 16,
        "xtick.labelsize": 16,
        "ytick.labelsize": 16,
        "figure.titlesize": 19,
    }
)
Simulating 8,150 conversations with a biased judge¶
generate_binary_dataset_with_oracle_sampling produces a synthetic dataset that matches the scenario above.
| Array | Meaning |
|---|---|
| y_true_oracle | Ground-truth binary label: $Y_i = 1$ means a hallucination confirmed by a human annotator |
| y_proxy | Binary label from the LLM judge: $\tilde{Y}_i = 1$ means the judge flagged a hallucination |
| uncertainty | Root mean squared error of the proxy relative to the ground-truth label; lower values indicate more reliable entries |
The first N_BURN_IN = 150 conversations are annotated with both sources: these form the burn-in dataset used to calibrate the sampler.
Simulated annotation: In practice, labels for the 8,000 new conversations are revealed only after being queried. Here we generate all labels upfront to simulate the annotation process: a new conversation only gets its labels revealed when the sampler selects it and annotates it.
Let us first generate the data and isolate the burn-in segment from the main one.
N_BURN_IN = 150
N_MAIN = 8000
N_TOTAL = N_BURN_IN + N_MAIN
TRUE_RATE = 0.10
PROXY_RATE = 0.05
COST_PROXY = 0.05
COST_HUMAN = 10.00
BUDGET = 6000
CORRELATION = 0.65
RANDOM_SEED = 42
y_true_oracle, y_proxy, uncertainty = generate_binary_dataset_with_oracle_sampling(
    n_total=N_TOTAL,
    true_mean=TRUE_RATE,
    proxy_mean=PROXY_RATE,
    correlation=CORRELATION,
    random_seed=RANDOM_SEED,
)
y_true_burn_in = y_true_oracle[:N_BURN_IN]
y_proxy_burn_in = y_proxy[:N_BURN_IN]
print(f"Total conversations : {N_TOTAL:,}")
print(f" Burn-in (labeled) : {N_BURN_IN}")
print(f" Main (new) : {N_MAIN:,}")
print()
print(f"LLM judge cost : ${COST_PROXY:.2f} per sample")
print(f"Human annotator cost : ${COST_HUMAN:.2f} per sample")
print(f"Cost ratio : {COST_HUMAN / COST_PROXY:.0f}x")
print()
print(f"Burn-in: judge rate = {y_proxy_burn_in.mean():.1%}")
print(f"Burn-in: human rate = {y_true_burn_in.mean():.1%}")
print(f"Proxy bias = {y_proxy_burn_in.mean() - y_true_burn_in.mean():+.1%}")
Total conversations : 8,150
 Burn-in (labeled) : 150
 Main (new) : 8,000

LLM judge cost : $0.05 per sample
Human annotator cost : $10.00 per sample
Cost ratio : 200x

Burn-in: judge rate = 6.7%
Burn-in: human rate = 15.3%
Proxy bias = -8.7%
CostOptimalSampler to the rescue¶
The burn-in data confirms the proxy is biased, so we cannot trust it directly. We need prediction-powered estimation to correct for this bias. The question is: how many new human annotations should we purchase alongside the cheap proxy labels, and for which conversations?
CostOptimalSampler answers exactly this question. The workflow has two steps:
- Fit the sampler on the burn-in dataset to estimate the variance of the ground-truth labels.
- Sample from the new conversations using per-sample uncertainty scores to obtain annotation assignments.
Step 1: Fit on the burn-in dataset¶
The sampler estimates one quantity from the burn-in data:
- $\text{Var}(Y)$: variance of the ground-truth labels, capturing how spread out the true outcomes are.
This is computed by calling sampler.fit() using the burn-in samples. Note that only ground-truth labels are needed here: the per-sample uncertainty scores, which measure proxy reliability for each individual conversation, are supplied at sampling time.
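For binary labels, the variance of the ground truth has a simple closed form, $\text{Var}(Y) = p(1 - p)$, where $p$ is the hallucination rate. The sketch below illustrates the quantity being estimated on a toy burn-in set. It is not the library's code: the helper name `label_variance` is hypothetical, and the choice of `ddof=1` (the unbiased sample variance) is an assumption, since the text does not say which convention `fit()` uses.

```python
import numpy as np


def label_variance(y_true: np.ndarray) -> float:
    """Sample variance of ground-truth labels (ddof=1 is an assumption)."""
    return float(np.var(y_true, ddof=1))


# Toy burn-in set: 23 hallucinations out of 150 conversations,
# the count consistent with the 15.3% burn-in human rate printed above.
y_toy = np.array([1] * 23 + [0] * 127)

p_hat = y_toy.mean()
print(f"p_hat                  = {p_hat:.3f}")
print(f"p_hat * (1 - p_hat)    = {p_hat * (1 - p_hat):.4f}")
print(f"sample variance (ddof=1) = {label_variance(y_toy):.4f}")
```

The two variance figures differ only by the factor $n / (n - 1)$, which is negligible at $n = 150$.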
sampler = CostOptimalSampler()
sampler.fit(y_true_burn_in)
<glide.samplers.cost_optimal.CostOptimalSampler at 0x7bfe90381460>
The optimal annotation policy¶
The sampler computes per-sample annotation probabilities $\pi_i$ based on three factors:
- Cost ratio: How much more expensive is human annotation than the proxy? A 200x cost gap means you can afford fewer human annotations than proxies overall.
- Per-sample proxy reliability: How reliable is the LLM judge for each individual conversation? The sampler uses per-sample uncertainty scores, estimates of the root mean squared proxy error, to assign higher annotation probabilities to conversations where the proxy is least reliable and lower probabilities where the proxy is trustworthy.
- Data variability: How diverse are the true labels? More variability in the ground truth requires more human annotations to estimate accurately.
In the next step, we call sampler.sample() with the uncertainty scores to compute the $\pi_i$ and draw Bernoulli indicators for each conversation.
Step 2: Draw annotation assignments for new conversations¶
With the burn-in variance estimate in hand, sampler.sample() runs in three stages:
- Compute per-sample annotation probabilities $\pi_i$ from the uncertainty scores, concentrating the budget on conversations where the proxy is least reliable.
- Draw Bernoulli annotation indicators. Conversations are randomly shuffled to ensure the outcome does not depend on the input order. Each conversation is then independently sent to the human annotator with probability $\pi_i$, by drawing:
$$\xi_i \sim \text{Bernoulli}(\pi_i)$$
- Apply the budget cutoff. The actual cost of conversation $i$ is $c_{\tilde{y}} + c_y \cdot \xi_i$. The sampler accumulates these costs in shuffled order until the budget is exhausted. Conversations not reached before the budget runs out are excluded and receive $\pi_i = 0$ and $\xi_i = \mathrm{NaN}$. Results are returned in the original input order.
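The second and third stages can be sketched in plain NumPy. This is an illustrative re-implementation under stated assumptions, not the code inside `CostOptimalSampler`: the function name `draw_with_budget_cutoff` is hypothetical, and the probabilities $\pi_i$ are taken as given because the formula used in the first stage is not spelled out here.

```python
import numpy as np


def draw_with_budget_cutoff(pi, cost_proxy, cost_human, budget, seed=0):
    """Shuffle, draw xi_i ~ Bernoulli(pi_i), then cut off at the budget."""
    rng = np.random.default_rng(seed)
    pi = np.asarray(pi, dtype=float)
    n = pi.size

    # Shuffle so that the cutoff does not depend on the input order.
    order = rng.permutation(n)

    # Stage 2: independent Bernoulli draws, one per conversation.
    xi_shuffled = (rng.random(n) < pi[order]).astype(float)

    # Stage 3: conversation i costs c_proxy + c_human * xi_i.
    running_cost = np.cumsum(cost_proxy + cost_human * xi_shuffled)
    reached = running_cost <= budget  # a prefix of the shuffled order

    # Map back to the original order; excluded rows get pi = 0 and xi = NaN.
    pi_out = np.zeros(n)
    xi_out = np.full(n, np.nan)
    kept = order[reached]
    pi_out[kept] = pi[kept]
    xi_out[kept] = xi_shuffled[reached]
    return pi_out, xi_out


# Toy run: a constant (hypothetical) annotation probability of 5%.
pi_toy = np.full(1000, 0.05)
pi_out, xi_out = draw_with_budget_cutoff(pi_toy, 0.05, 10.0, budget=200.0)

spent = np.nansum(0.05 + 10.0 * xi_out)  # NaN rows (excluded) are ignored
print(f"Included: {int(np.sum(~np.isnan(xi_out)))}, spent: ${spent:.2f}")
```

Because every cost is positive, the running cost is increasing, so the conversations that fit within the budget always form a contiguous prefix of the shuffled order.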
sampler.sample() returns two values:
| Return value | Meaning |
|---|---|
| pi | Per-sample annotation probabilities $\pi_i$: positive for included conversations, $0$ for excluded ones |
| xi | Bernoulli indicators: $1$ = human annotation requested, $0$ = proxy label only, $\mathrm{NaN}$ = excluded |
pi, xi = sampler.sample(
    uncertainties=uncertainty[N_BURN_IN:],
    y_true_cost=COST_HUMAN,
    y_proxy_cost=COST_PROXY,
    budget=BUDGET - N_BURN_IN * (COST_HUMAN + COST_PROXY),
    random_seed=RANDOM_SEED,
)
n_human = int(np.nansum(xi))
n_proxy = int(np.sum(~np.isnan(xi)))
n_not_annotated = N_MAIN - n_proxy
print(f"Conversations with human annotation : {n_human}")
print(f"Conversations with proxy annotation : {n_proxy}")
print(f"Conversations without annotation : {n_not_annotated}")
Conversations with human annotation : 415
Conversations with proxy annotation : 6707
Conversations without annotation : 1293
Running the full estimation pipeline¶
Building the combined arrays¶
Before calling the estimator, the burn-in and main segments are concatenated into three arrays:
- y_true_full: the human labels for annotated samples (burn-in + main samples with $\xi_i = 1$), NaN for unannotated samples ($\xi_i = 0$).
- y_proxy_full: the proxy labels for selected samples (burn-in + main samples with $\pi_i > 0$).
- pi_full: the annotation probability for each sample, set to $1$ for burn-in samples and to the computed $\pi_i$ for selected main samples.
We exclude unselected samples ($\pi_i = 0$) since no annotation is available for them.
Performing estimation with ASI¶
ASIMeanEstimator implements Active Statistical Inference (ASI), which corrects for selective annotation bias via inverse probability weighting. For each sample, the proxy label serves as a baseline; when a human label is available ($\xi_i = 1$), the human-proxy residual is added back, scaled by $1/\pi_i$ to account for the sampling design. For burn-in samples ($\pi_i = 1$) this correction is exact and reduces to the human label alone.
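The correction described above can be written out directly. The sketch below computes only the point estimate, following the verbal description: proxy label as baseline, plus the human-proxy residual divided by $\pi_i$ wherever a human label exists. The function name `ipw_corrected_mean` is hypothetical, and this is not claimed to be the implementation inside `ASIMeanEstimator`, which also produces a confidence interval and may differ in detail.

```python
import numpy as np


def ipw_corrected_mean(y_true, y_proxy, pi):
    """Mean of y_proxy + (y_true - y_proxy) / pi, with the residual
    counted only where a human label is available (y_true is not NaN)."""
    y_true = np.asarray(y_true, dtype=float)
    y_proxy = np.asarray(y_proxy, dtype=float)
    pi = np.asarray(pi, dtype=float)

    annotated = ~np.isnan(y_true)
    correction = np.zeros_like(y_proxy)
    correction[annotated] = (y_true[annotated] - y_proxy[annotated]) / pi[annotated]
    return float(np.mean(y_proxy + correction))


# Toy data: six conversations, three of them with a human label.
y_proxy_toy = np.array([0, 0, 1, 0, 0, 1])
y_true_toy = np.array([1, np.nan, 1, np.nan, 0, np.nan])
pi_toy = np.array([1.0, 0.5, 1.0, 0.5, 0.5, 0.25])

estimate = ipw_corrected_mean(y_true_toy, y_proxy_toy, pi_toy)
print(f"Corrected estimate: {estimate:.3f}")
print(f"Proxy-only mean   : {y_proxy_toy.mean():.3f}")
```

When every sample is annotated with $\pi_i = 1$, the proxy terms cancel and the function returns the plain mean of the human labels, which is the burn-in behaviour described above.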
Pass y_true_full, y_proxy_full, and pi_full, in that order, to ASIMeanEstimator().estimate().
Note: In practice, an LLM-as-judge's uncertainty scores are often obtained at the same time as the associated proxy labels, so the proxy budget has already been spent or is negligible. In this case, call sample() with y_proxy_cost=0 and pass the remaining budget to choose which samples to send for human annotation.
y_true_annotated = simulate_annotation(y_true_oracle[N_BURN_IN:], xi)
y_true_annotated_filtered = y_true_annotated[pi > 0]
y_proxy_main_filtered = y_proxy[N_BURN_IN:][pi > 0]
pi_main_filtered = pi[pi > 0]
y_true_full = np.hstack([y_true_burn_in, y_true_annotated_filtered])
y_proxy_full = np.hstack([y_proxy_burn_in, y_proxy_main_filtered])
pi_full = np.hstack([np.ones(N_BURN_IN), pi_main_filtered])
result = ASIMeanEstimator().estimate(
    y_true_full,
    y_proxy_full,
    pi_full,
    metric_name="Hallucination Rate",
    confidence_level=0.95,
)
n_human_total = N_BURN_IN + n_human
n_proxy_labeled = N_BURN_IN + n_proxy
print(result.summary())
print()
print(f"Human labels used : {n_human_total} ({N_BURN_IN} burn-in + {n_human} main)")
print(f"Proxy labels used : {n_proxy_labeled} ({N_BURN_IN} burn-in + {n_proxy} main)")
print(f"True rate : {TRUE_RATE:.1%}")
Metric: Hallucination Rate
Point Estimate: 0.110
Confidence Interval (95%): [0.086, 0.133]
Estimator : ASIMeanEstimator
n_true: 565
n_proxy: 6857
Effective Sample Size: 940

Human labels used : 565 (150 burn-in + 415 main)
Proxy labels used : 6857 (150 burn-in + 6707 main)
True rate : 10.0%
Comparing with two baselines at a fixed budget¶
Recall that, after the burn-in phase, you have a remaining budget of $4,492.50 to evaluate 8,000 new conversations. Three possible spending strategies:
Proxy-only baseline: Label every conversation with the LLM-as-judge and nothing else. All 8,000 conversations plus the 150 burn-in samples cost only $407.50 at $0.05 each, far below the budget. Use the proxy labels directly as your estimate. There are no human annotations, so the judge's systematic bias goes uncorrected.
Human-only baseline: Spend the entire remaining budget on human annotations. At $10 per sample, this buys about 450 of the 8,000 new conversations. Apply a classical estimator to those labels together with the burn-in human labels; no proxy labels are used.
Cost-optimal: Use the burn-in to calibrate the sampler and estimate ground-truth label variance. Compute per-sample annotation probabilities $\pi_i$ from the uncertainty scores, then allocate the remaining budget across as many conversations as affordable, combining cheap proxy labels with selectively chosen human annotations. ASI corrects for the non-uniform sampling design and combines both signals efficiently.
# Proxy-only: ClassicalMeanEstimator on proxy labels only (no human annotations)
result_proxy_only = ClassicalMeanEstimator().estimate(
    y_proxy,
    metric_name="Hallucination Rate",
)
# Human-only: ClassicalMeanEstimator on as many new human labels as budget allows (no proxy)
remaining_budget = BUDGET - (COST_HUMAN + COST_PROXY) * N_BURN_IN
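# np.ceil rounds 449.25 up to 450 labels ($4,500), slightly above the $4,492.50 remaining.
# This matches a human-only plan whose burn-in buys no proxy labels ($1,500 + $4,500 = $6,000).
n_human_only_main = int(np.ceil(remaining_budget / COST_HUMAN))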
xi_uniform = UniformSampler().sample(N_MAIN, budget=n_human_only_main, random_seed=RANDOM_SEED)
y_true_human_only_main = simulate_annotation(y_true_oracle[N_BURN_IN:], xi_uniform)
y_true_human_only = np.hstack([y_true_burn_in, y_true_human_only_main])
n_human_total_human_only = N_BURN_IN + n_human_only_main
result_human_only = ClassicalMeanEstimator().estimate(
    y_true_human_only,
    metric_name="Hallucination Rate",
)
cost_optimal_total = n_human_total * COST_HUMAN + n_proxy_labeled * COST_PROXY
print(f"Proxy-only : 0 human labels + {N_TOTAL} proxy labels (budget used: ${N_TOTAL * COST_PROXY:.0f})")
print(f"Human-only : {n_human_total_human_only} human labels + 0 proxy labels (budget used: ${BUDGET})")
print(
    f"Cost-optimal: {n_human_total} human labels + {n_proxy_labeled} proxy labels"
    f" (budget used: ${cost_optimal_total:.0f})"
)
Proxy-only : 0 human labels + 8150 proxy labels (budget used: $408)
Human-only : 600 human labels + 0 proxy labels (budget used: $6000)
Cost-optimal: 565 human labels + 6857 proxy labels (budget used: $5993)
human_cost_optimal = n_human_total * COST_HUMAN
proxy_cost_optimal = n_proxy_labeled * COST_PROXY
baseline_estimates = [
    (
        f"Proxy-only\n${N_TOTAL * COST_PROXY:.0f} on LLM-as-Judge",
        result_proxy_only.mean,
        result_proxy_only.confidence_interval.lower_bound,
        result_proxy_only.confidence_interval.upper_bound,
        C_PROXY,
    ),
    (
        f"Human-only\n${BUDGET:.0f} on human annotations",
        result_human_only.mean,
        result_human_only.confidence_interval.lower_bound,
        result_human_only.confidence_interval.upper_bound,
        C_HUMAN,
    ),
    (
        f"Cost-optimal\n\\${human_cost_optimal:.0f} on humans + \\${proxy_cost_optimal:.0f} on proxy",
        result.mean,
        result.confidence_interval.lower_bound,
        result.confidence_interval.upper_bound,
        C_OPTIMAL,
    ),
]
fig, ax = plt.subplots(figsize=(11, 5.5))
y_pos = [2, 1, 0]
for y, (label, mean, lo, hi, color) in zip(y_pos, baseline_estimates):
    ax.plot([lo, hi], [y, y], color=color, linewidth=4, solid_capstyle="round", zorder=3)
    for xc in [lo, hi]:
        ax.plot([xc, xc], [y - 0.2, y + 0.2], color=color, linewidth=2.5, zorder=3)
    ax.scatter(mean, y, s=200, color=color, zorder=5, edgecolors="white", linewidths=2.5)
    ax.text(mean, y + 0.25, f"{mean:.1%}", ha="center", va="bottom", color=color, fontweight="bold")
    ax.text(mean, y - 0.25, f"[{lo:.1%}, {hi:.1%}]", ha="center", va="top", color="#888888")
ax.axvline(TRUE_RATE, color=C_TRUTH, linestyle="--", linewidth=2.5, zorder=4)
ax.text(TRUE_RATE + 0.004, 2.72, "True rate 10%", color=C_TRUTH, fontweight="bold")
ax.set_yticks(y_pos)
ax.set_yticklabels([e[0] for e in baseline_estimates])
ax.xaxis.set_major_formatter(plt.FuncFormatter(lambda v, _: f"{v:.0%}"))
ax.set_xlabel("Estimated Hallucination Rate")
ax.set_xlim(-0.01, 0.26)
ax.set_ylim(-0.8, 3.2)
ax.spines[["top", "right", "left"]].set_visible(False)
ax.tick_params(left=False)
plt.tight_layout()
plt.show()
Summary: what each strategy contributes¶
| | Proxy-only | Human-only | Cost-optimal |
|---|---|---|---|
| Proxy labels | 8,150 | 0 | 6,857 |
| Human labels | 0 | 600 | 565 |
| Unbiased | no | yes | yes |
| Total cost | $408 | $6,000 | $5,993 |
Key takeaways:
Proxy-only estimation gives false confidence. The LLM judge's systematic bias places the estimate well below the true rate. A tight interval around the wrong value is worse than a wide interval that covers the true rate.
Spending the entire budget on human annotations leaves precision on the table. Every $10.00 buys exactly one data point, and the proxy signal is discarded entirely.
Cost-optimal sampling buys more precision for the same budget. The sampler finds the annotation policy that minimises estimation error within the budget, boosting the effective sample size beyond what human annotation alone would achieve at the same cost.
Concentrating annotations where the proxy struggles makes a difference. Higher annotation probabilities on less reliable conversations target the budget at exactly the cases that reduce variance the most.
ASI is valid under the non-uniform annotation design. Inverse probability weighting accounts for the per-sample probabilities $\pi_i$, ensuring the final estimate is unbiased despite the non-uniform sampling plan.
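The first takeaway can be made concrete with a back-of-the-envelope interval. The sketch below is an assumption-laden illustration: it plugs the nominal rates from the simulation settings (`PROXY_RATE = 0.05`, `TRUE_RATE = 0.10`) into a standard normal-approximation interval for a proportion, rather than using the realised sample mean, and the text does not state that `ClassicalMeanEstimator` uses this exact interval. The helper name `normal_ci` is hypothetical.

```python
import math


def normal_ci(p_hat: float, n: int, z: float = 1.96) -> tuple[float, float]:
    """Normal-approximation (Wald) interval for a proportion."""
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half_width, p_hat + half_width


TRUE_RATE = 0.10   # nominal hallucination rate in the simulation
PROXY_RATE = 0.05  # nominal rate reported by the biased judge
N_TOTAL = 8150     # every conversation gets a proxy label

lo, hi = normal_ci(PROXY_RATE, N_TOTAL)
print(f"Proxy-only 95% interval : [{lo:.3f}, {hi:.3f}]")
print(f"Covers the true rate?   : {lo <= TRUE_RATE <= hi}")
```

With more than eight thousand labels the interval is narrow, yet it sits entirely below the true rate: more proxy labels shrink the interval without moving it closer to the truth.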
Want to go further? The ASI deep dive provides a detailed validation of the ASI estimator, with analytical comparisons and coverage experiments.