Cost estimates available: Cost-Optimal Random Sampling¶
When evaluating an AI system, you often have access to two annotation sources: a cheap but biased LLM-as-judge and an expensive but accurate human annotator. Given a fixed budget, how should you split it between them?
Cost-Optimal Random Sampling computes the fraction of samples to send to the expensive annotator so that estimation variance is minimised within your budget.
What you will learn:
- How to use CostOptimalRandomSampler to compute the optimal annotation fraction from a small labeled dataset
- How to use the sampler's outputs with ASIMeanEstimator for valid, bias-corrected estimates
- How cost-optimal sampling compares to proxy-only and human-only strategies in balancing cost and accuracy
The evaluation challenge: two annotation sources, one budget¶
Suppose you are measuring the hallucination rate of a customer-facing AI assistant. Two annotation sources are available:
- LLM-as-judge: $0.05 per sample. Fast and scalable, but systematically under-reports hallucinations.
- Human annotator: $10.00 per sample. Ground-truth quality, but 200 times more expensive.
Your budget¶
You have 8,150 new conversations to evaluate and a total budget of $6,000, covering all annotation costs from start to finish.
The plan¶
You decide to spend the first part of your budget on a burn-in phase: annotate 150 conversations with both sources. This costs $1,507.50 ($10.00 + $0.05 per conversation) and serves two purposes:
- It reveals the judge's bias: the judge systematically under-reports hallucinations compared to human annotators.
- It allows you to calibrate the cost-optimal sampler before spending the remaining budget.
The question¶
With $4,492.50 left, you now need to evaluate the remaining 8,000 conversations. How do you allocate the rest of your budget between proxy and human annotations?
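The budget arithmetic above can be checked in a few lines. This is a minimal sketch that only uses the figures stated in the scenario; the variable names are illustrative.

```python
# Scenario constants taken from the text above.
COST_PROXY = 0.05   # dollars per LLM-judge label
COST_HUMAN = 10.00  # dollars per human label
BUDGET = 6000.00
N_BURN_IN = 150
N_MAIN = 8000

# Every burn-in conversation is labeled by both sources.
burn_in_cost = N_BURN_IN * (COST_HUMAN + COST_PROXY)
remaining = BUDGET - burn_in_cost

# Two extremes for the remaining budget:
# proxy labels for every new conversation, or human labels only.
proxy_all_cost = N_MAIN * COST_PROXY
max_human_labels = int(remaining // COST_HUMAN)

print(f"Burn-in cost          : ${burn_in_cost:,.2f}")
print(f"Remaining budget      : ${remaining:,.2f}")
print(f"Proxy labels for all  : ${proxy_all_cost:,.2f}")
print(f"Max human-only labels : {max_human_labels}")
```

The two extremes frame the allocation problem: proxy labels for all 8,000 new conversations use only a small part of the remaining budget, while human labels alone cover only a small part of the conversations.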
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
from glide.estimators import ASIMeanEstimator, ClassicalMeanEstimator
from glide.samplers import CostOptimalRandomSampler, UniformSampler
from glide.simulators import generate_binary_dataset, simulate_annotation
N_BURN_IN = 150
N_MAIN = 8000
N_TOTAL = N_BURN_IN + N_MAIN
TRUE_RATE = 0.10
PROXY_RATE = 0.05
COST_PROXY = 0.05
COST_HUMAN = 10.00
BUDGET = 6000
RANDOM_SEED = 0
C_OPTIMAL = "#27AE60" # cost-optimal (green)
C_PROXY = "#E67E22" # proxy only (orange)
C_HUMAN = "#2E86AB" # human only (blue)
C_TRUTH = "#2C3E50" # true value (dark slate)
plt.rcParams.update(
{
"figure.facecolor": "white",
"axes.facecolor": "#FAFAFA",
"axes.grid": True,
"grid.color": "#E5E5E5",
"grid.linewidth": 0.8,
"font.size": 18,
"axes.labelsize": 18,
"axes.titlesize": 18,
"legend.fontsize": 16,
"xtick.labelsize": 16,
"ytick.labelsize": 16,
"figure.titlesize": 19,
}
)
Simulating 8,150 conversations with a biased judge¶
generate_binary_dataset produces a synthetic dataset that matches the scenario above.
| Array | Meaning |
|---|---|
| y_true | Ground-truth binary label: $Y_i = 1$ means a hallucination confirmed by a human annotator |
| y_proxy | Binary label from the LLM judge: $\tilde{Y}_i = 1$ means the judge flagged a hallucination |
The first N_BURN_IN = 150 conversations are annotated with both sources: these form the burn-in dataset used to calibrate the sampler. The remaining 8,000 conversations are new, and only the proxy label is available initially.
Simulated annotation: In practice, the human label for a new conversation becomes available only once a human annotator has reviewed it. Here we generate all ground-truth labels upfront to simulate the annotation process: a main-set conversation only gets its true label revealed when the sampler decides to annotate it.
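The reveal step can be pictured as a simple mask. The sketch below uses a hypothetical helper, reveal_labels, written for illustration; it is an assumption about the behavior described here, not the implementation of glide's simulate_annotation.

```python
import math


def reveal_labels(y_true, xi):
    """Hypothetical helper: keep y_true[i] only where xi[i] == 1, else NaN."""
    return [float(y) if x == 1 else math.nan for y, x in zip(y_true, xi)]


# Ground truth exists for every conversation, but stays hidden by default.
y_true_hidden = [1, 0, 0, 1, 0]
# 1 = sent to a human, 0 = proxy only, NaN = excluded by the budget cutoff.
xi = [1, 0, math.nan, 0, 1]

revealed = reveal_labels(y_true_hidden, xi)
print(revealed)
```

Only the first and last conversations expose their true label; every other entry stays NaN, which is how an unannotated sample is represented downstream.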
y_true, y_proxy = generate_binary_dataset(
n_total=N_TOTAL,
true_mean=TRUE_RATE,
proxy_mean=PROXY_RATE,
correlation=0.65,
random_seed=RANDOM_SEED,
)
y_true_burn_in = y_true[:N_BURN_IN]
y_proxy_burn_in = y_proxy[:N_BURN_IN]
print(f"Total conversations : {N_TOTAL:,}")
print(f" Burn-in (labeled) : {N_BURN_IN}")
print(f" Main (new) : {N_MAIN:,}")
print()
print(f"LLM judge cost : ${COST_PROXY:.2f} per sample")
print(f"Human annotator cost : ${COST_HUMAN:.2f} per sample")
print(f"Cost ratio : {COST_HUMAN / COST_PROXY:.0f}x")
print()
print(f"Burn-in: judge rate = {y_proxy_burn_in.mean():.1%}")
print(f"Burn-in: human rate = {y_true_burn_in.mean():.1%}")
print(f"Proxy bias = {y_proxy_burn_in.mean() - y_true_burn_in.mean():+.1%}")
Total conversations : 8,150
 Burn-in (labeled) : 150
 Main (new) : 8,000

LLM judge cost : $0.05 per sample
Human annotator cost : $10.00 per sample
Cost ratio : 200x

Burn-in: judge rate = 6.7%
Burn-in: human rate = 14.7%
Proxy bias = -8.0%
CostOptimalRandomSampler to the rescue¶
The burn-in data confirms the proxy is biased, so we cannot trust it directly. We need prediction-powered estimation to correct for this bias. The question is: how many new human annotations should we purchase alongside the cheap proxy labels?
CostOptimalRandomSampler answers exactly this question. The workflow has two steps:
- Fit the sampler on the burn-in dataset to estimate the proxy's reliability.
- Sample from the new conversations to obtain annotation assignments.
Step 1: Fit on the burn-in dataset¶
The sampler estimates two quantities from the burn-in data:
- $\text{Var}(Y)$: variance of the ground-truth labels, capturing how spread out the true outcomes are.
- $\text{MSE}(\tilde{Y}, Y)$: mean squared error of the proxy relative to ground truth, measuring how unreliable the judge is.
Both quantities are computed during sampler.fit(y_true_burn_in, y_proxy_burn_in) using the burn-in samples.
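To make the two quantities concrete, here is a minimal sketch that computes them on a toy burn-in set. The helper name proxy_diagnostics is hypothetical, and the use of the population variance is an assumption: the sampler may use a different normalisation internally.

```python
from statistics import fmean, pvariance


def proxy_diagnostics(y_true, y_proxy):
    """Hypothetical helper returning (Var(Y), MSE(proxy, Y)) for paired labels."""
    var_y = pvariance(y_true)
    mse = fmean([(p - t) ** 2 for p, t in zip(y_proxy, y_true)])
    return var_y, mse


# Toy burn-in: two true hallucinations, the judge catches only one of them.
toy_true = [1, 0, 0, 0, 1, 0, 0, 0, 0, 0]
toy_proxy = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]

var_y, mse = proxy_diagnostics(toy_true, toy_proxy)
print(f"Var(Y) = {var_y:.3f}, MSE = {mse:.3f}")
```

For binary labels the squared error is 1 exactly when judge and human disagree, so the MSE is simply the disagreement rate between the two sources.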
sampler = CostOptimalRandomSampler()
sampler.fit(y_true_burn_in, y_proxy_burn_in)
<glide.samplers.cost_optimal_random.CostOptimalRandomSampler at 0x7e41df2de930>
The optimal annotation rate¶
The sampler computes an optimal annotation probability $\pi^*$ based on three factors:
- Cost ratio: How much more expensive is human annotation than the proxy? A 200x cost gap means you need fewer human annotations than proxies to stay within budget.
- Proxy quality: How reliable is the LLM judge? A more accurate proxy requires fewer human annotations to correct its bias.
- Data variability: How diverse are the true labels? More variability requires more human annotations to estimate accurately.
The formula for $\pi^*$ balances these trade-offs. In the next step, we call sampler.sample() to compute $\pi^*$ and draw Bernoulli indicators for each conversation.
Step 2: Draw annotation assignments for new conversations¶
With $\pi^{*}$ determined, sampler.sample() runs in two stages:
- Draw Bernoulli annotation indicators. Conversations are randomly shuffled to ensure the outcome does not depend on the input order. Each conversation is then independently sent to the human annotator with probability $\pi^{*}$, by drawing:
$$\xi_i \sim \text{Bernoulli}(\pi^{*})$$
- Apply the budget cutoff. The actual cost of conversation $i$ is $c_{\tilde{y}} + c_y \cdot \xi_i$. The sampler accumulates these costs in shuffled order until the budget is exhausted. Conversations not reached before the budget runs out are excluded and receive $\pi_i = 0$ and $\xi_i = \mathrm{NaN}$. Results are returned in the original input order.
sampler.sample() returns two values:
| Return value | Meaning |
|---|---|
| pi | Annotation probabilities: $\pi^{*}$ for included conversations, $0$ for excluded ones |
| xi | Bernoulli indicators: $1$ = human annotation obtained, $0$ = proxy label only, $\mathrm{NaN}$ = excluded |
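The two stages can be sketched in plain Python. The function sample_with_budget below is hypothetical and is not glide's code; in particular, stopping at the first conversation that no longer fits the budget is one possible reading of the cutoff rule, and the annotation probability is passed in directly rather than computed.

```python
import math
import random


def sample_with_budget(n_samples, pi_star, cost_human, cost_proxy, budget, seed=0):
    """Hypothetical sketch of the two-stage procedure; not glide's actual code."""
    rng = random.Random(seed)

    # Stage 1: shuffle, then draw one Bernoulli(pi_star) indicator per conversation.
    order = list(range(n_samples))
    rng.shuffle(order)
    draws = {i: int(rng.random() < pi_star) for i in order}

    # Stage 2: accumulate costs in shuffled order until the budget is exhausted.
    pi = [0.0] * n_samples
    xi = [math.nan] * n_samples
    spent = 0.0
    for i in order:
        cost = cost_proxy + cost_human * draws[i]
        if spent + cost > budget:
            break  # assumption: stop at the first conversation that does not fit
        spent += cost
        pi[i] = pi_star
        xi[i] = float(draws[i])
    return pi, xi, spent  # lists are indexed in the original input order


pi, xi, spent = sample_with_budget(8000, 0.093, 10.00, 0.05, 4492.50)
n_included = sum(not math.isnan(x) for x in xi)
n_human = sum(x == 1.0 for x in xi)
print(f"included={n_included}, human={n_human}, spent=${spent:.2f}")
```

Because the cutoff is applied in shuffled order, the excluded conversations form a random subset, so the included ones remain a random sample of the full set.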
pi, xi = sampler.sample(
n_samples=N_MAIN,
y_true_cost=COST_HUMAN,
y_proxy_cost=COST_PROXY,
budget=BUDGET - N_BURN_IN * (COST_HUMAN + COST_PROXY),
random_seed=RANDOM_SEED,
)
pi_opt = pi[pi > 0][-1]
n_human = int(np.nansum(xi))
n_proxy = int(np.sum(~np.isnan(xi)))
n_not_annotated = N_MAIN - n_proxy
print(f"Conversations with human annotation : {n_human}")
print(f"Conversations with proxy annotation : {n_proxy}")
print(f"Conversations without annotation : {n_not_annotated}")
print()
print(f"Annotation probability : pi = {pi_opt:.3f}")
# Every included conversation has a proxy label, so n_proxy already counts the n_human ones.
print(f"Realized annotation rate : {n_human / n_proxy:.3f}")
Conversations with human annotation : 425
Conversations with proxy annotation : 4669
Conversations without annotation : 3331

Annotation probability : pi = 0.093
Realized annotation rate : 0.091
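These counts can be sanity-checked from the per-conversation cost $c_{\tilde{y}} + c_y \cdot \xi_i$, whose expectation is $c_{\tilde{y}} + \pi^{*} c_y$. The sketch below is a back-of-the-envelope calculation using the rounded probability printed above; the variable names are illustrative.

```python
cost_proxy, cost_human = 0.05, 10.00
remaining_budget = 4492.50
pi_star = 0.093  # rounded value printed by the sampler above

# Expected spend per included conversation: one proxy label, plus a human
# label with probability pi_star.
expected_cost = cost_proxy + pi_star * cost_human

# Expected number of conversations the budget covers, and how many of
# those are expected to receive a human label.
expected_included = remaining_budget / expected_cost
expected_human = pi_star * expected_included

print(f"Expected cost per conversation : ${expected_cost:.2f}")
print(f"Expected conversations covered : {expected_included:.0f}")
print(f"Expected human annotations     : {expected_human:.0f}")
```

With these inputs the expectation is roughly 4,580 covered conversations and 426 human labels, in line with the 4,669 and 425 realized in the run above. The gap reflects sampling randomness and the rounding of pi to three decimals.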
Running the full estimation pipeline¶
Building the combined arrays¶
Before calling the estimator, the burn-in and main segments are concatenated into three arrays:
- y_true_full: the human labels for annotated samples (burn-in + main samples with $\xi_i = 1$), NaN for unannotated samples ($\xi_i = 0$).
- y_proxy_full: the proxy labels for selected samples (burn-in + main samples with $\pi_i > 0$).
- pi_full: the annotation probability for each sample, set to $1$ for burn-in samples and to the computed $\pi_i$ for selected main samples.
We exclude unselected samples ($\pi_i = 0$) since no annotation is available for them.
Performing estimation with ASI¶
ASIMeanEstimator implements Active Statistical Inference (ASI), which corrects for selective annotation bias via inverse probability weighting. For each sample, the proxy label serves as a baseline; when a human label is available ($\xi_i = 1$), the human-proxy residual is added back, scaled by $1/\pi_i$ to account for the sampling design. For burn-in samples ($\pi_i = 1$) this correction is exact and reduces to the human label alone.
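The correction described above amounts to averaging $\tilde{Y}_i + \frac{\xi_i}{\pi_i}(Y_i - \tilde{Y}_i)$ over all included samples. Below is a minimal sketch of that idea under stated assumptions: the function ipw_mean_estimate is hypothetical, the interval is a plain normal approximation, and ASIMeanEstimator may differ in its details.

```python
import math


def ipw_mean_estimate(y_true, y_proxy, pi, z=1.96):
    """Hypothetical sketch: mean of y_proxy_i + (xi_i / pi_i) * (y_true_i - y_proxy_i)."""
    terms = []
    for t, p, prob in zip(y_true, y_proxy, pi):
        if math.isnan(t):      # xi_i = 0: only the proxy label is available
            terms.append(float(p))
        else:                  # xi_i = 1: add the residual, reweighted by 1 / pi_i
            terms.append(p + (t - p) / prob)
    n = len(terms)
    mean = sum(terms) / n
    var = sum((v - mean) ** 2 for v in terms) / (n - 1)
    half_width = z * math.sqrt(var / n)
    return mean, (mean - half_width, mean + half_width)


# Toy input: two burn-in samples (pi = 1) followed by four main samples (pi = 0.5).
toy_true = [1.0, 0.0, math.nan, math.nan, 1.0, math.nan]
toy_proxy = [1, 0, 0, 0, 0, 1]
toy_pi = [1.0, 1.0, 0.5, 0.5, 0.5, 0.5]

estimate, ci = ipw_mean_estimate(toy_true, toy_proxy, toy_pi)
print(f"estimate = {estimate:.3f}, CI = [{ci[0]:.3f}, {ci[1]:.3f}]")
```

The fifth sample shows the reweighting at work: the judge missed a hallucination that the human caught, and because that sample had only a 50% chance of being checked, its residual counts twice.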
Pass y_true_full, y_proxy_full, and pi_full, in that order, to ASIMeanEstimator().estimate().
y_true_annotated = simulate_annotation(y_true[N_BURN_IN:], xi)
y_true_full = np.hstack([y_true_burn_in, y_true_annotated[pi > 0]])
y_proxy_full = np.hstack([y_proxy_burn_in, y_proxy[N_BURN_IN:][pi > 0]])
pi_full = np.hstack([np.ones(N_BURN_IN), pi[pi > 0]])
result = ASIMeanEstimator().estimate(
y_true_full,
y_proxy_full,
pi_full,
metric_name="Hallucination Rate",
confidence_level=0.95,
)
n_human_total = N_BURN_IN + n_human
n_proxy_labeled = N_BURN_IN + n_proxy
print(result.summary())
print()
print(f"Human labels used : {n_human_total} ({N_BURN_IN} burn-in + {n_human} main)")
print(f"Proxy labels used : {n_proxy_labeled} ({N_BURN_IN} burn-in + {n_proxy} main)")
print(f"True rate : {TRUE_RATE:.1%}")
Metric: Hallucination Rate
Point Estimate: 0.093
Confidence Interval (95%): [0.073, 0.113]
Estimator : ASIMeanEstimator
n_true: 575
n_proxy: 4819
Effective Sample Size: 1374

Human labels used : 575 (150 burn-in + 425 main)
Proxy labels used : 4819 (150 burn-in + 4669 main)
True rate : 10.0%
Comparing with two baselines at a fixed budget¶
Recall that, after the burn-in phase, you have a remaining budget of $4,492.50 to evaluate 8,000 new conversations. Consider three possible spending strategies:
Proxy-only baseline: Buy LLM-as-judge labels only. All 8,000 conversations plus the 150 burn-in samples cost only $407.50 at $0.05 each, far below the budget. Use the proxy labels directly as your estimate. No human annotations are needed, but the judge's systematic bias goes uncorrected.
Human-only baseline: Spend the entire budget on human annotations. At $10 per sample, this covers only a fraction of the conversations. Apply a classical estimator to those labels together with the burn-in human labels; no proxy labels are used.
Cost-optimal: Use the burn-in to calibrate the sampler and learn the proxy's reliability. Compute the optimal annotation rate $\pi^*$, then allocate the remaining budget across as many conversations as affordable, combining cheap proxy labels with selectively chosen human annotations. ASI corrects for selective sampling and combines both signals efficiently.
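Both baselines rely on a classical mean estimate, whose precision depends only on how many labels are available. The sketch below is a hypothetical stand-in for that computation, assuming a normal-approximation interval and assuming that NaN entries (unannotated samples) are dropped; it is not the code of ClassicalMeanEstimator. The 600 labels with a 10% rate are an illustrative input, not a result from this notebook.

```python
import math


def classical_mean_ci(labels, z=1.96):
    """Hypothetical sketch: sample mean with a normal-approximation interval."""
    values = [v for v in labels if not math.isnan(v)]  # assumption: NaN = unlabeled
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / (n - 1)
    half_width = z * math.sqrt(var / n)
    return mean, (mean - half_width, mean + half_width), n


# Illustrative input: 600 human labels containing 60 hallucinations,
# padded with NaN for conversations that were never annotated.
toy_labels = [1.0] * 60 + [0.0] * 540 + [math.nan] * 100

mean, (lo, hi), n_used = classical_mean_ci(toy_labels)
print(f"n = {n_used}, mean = {mean:.3f}, CI = [{lo:.3f}, {hi:.3f}]")
```

The interval width shrinks like $1/\sqrt{n}$, which is why a strategy that spends $10 per data point runs out of precision quickly.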
# Proxy-only: ClassicalMeanEstimator on proxy labels only (no human annotations)
result_proxy_only = ClassicalMeanEstimator().estimate(
y_proxy,
metric_name="Hallucination Rate",
)
# Human-only: ClassicalMeanEstimator on as many new human labels as budget allows (no proxy)
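# This strategy buys no proxy labels, so each burn-in sample costs COST_HUMAN only.
remaining_budget = BUDGET - COST_HUMAN * N_BURN_IN
# Round down so the number of purchased labels never exceeds the budget.
n_human_only_main = int(remaining_budget // COST_HUMAN)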
xi_uniform = UniformSampler().sample(N_MAIN, budget=n_human_only_main, random_seed=RANDOM_SEED)
y_true_human_only_main = simulate_annotation(y_true[N_BURN_IN:], xi_uniform)
y_true_human_only = np.hstack([y_true_burn_in, y_true_human_only_main])
n_human_total_human_only = N_BURN_IN + n_human_only_main
result_human_only = ClassicalMeanEstimator().estimate(
y_true_human_only,
metric_name="Hallucination Rate",
)
cost_optimal_total = n_human_total * COST_HUMAN + n_proxy_labeled * COST_PROXY
print(f"Proxy-only : 0 human labels + {N_TOTAL} proxy labels (budget used: ${N_TOTAL * COST_PROXY:.0f})")
print(f"Human-only : {n_human_total_human_only} human labels + 0 proxy labels (budget used: ${BUDGET})")
print(
f"Cost-optimal: {n_human_total} human labels + {n_proxy_labeled} proxy labels"
f" (pi* = {pi_opt:.3f}, budget used: ${cost_optimal_total:.0f})"
)
Proxy-only : 0 human labels + 8150 proxy labels (budget used: $408)
Human-only : 600 human labels + 0 proxy labels (budget used: $6000)
Cost-optimal: 575 human labels + 4819 proxy labels (pi* = 0.093, budget used: $5991)
human_cost_optimal = n_human_total * COST_HUMAN
proxy_cost_optimal = n_proxy_labeled * COST_PROXY
baseline_estimates = [
(
f"Proxy-only\n${N_TOTAL * COST_PROXY:.0f} on LLM-as-Judge",
result_proxy_only.mean,
result_proxy_only.confidence_interval.lower_bound,
result_proxy_only.confidence_interval.upper_bound,
C_PROXY,
),
(
f"Human-only\n${BUDGET:.0f} on human annotations",
result_human_only.mean,
result_human_only.confidence_interval.lower_bound,
result_human_only.confidence_interval.upper_bound,
C_HUMAN,
),
(
f"Cost-optimal\n\\${human_cost_optimal:.0f} on humans + \\${proxy_cost_optimal:.0f} on proxy",
result.mean,
result.confidence_interval.lower_bound,
result.confidence_interval.upper_bound,
C_OPTIMAL,
),
]
fig, ax = plt.subplots(figsize=(11, 5.5))
y_pos = [2, 1, 0]
for y, (label, mean, lo, hi, color) in zip(y_pos, baseline_estimates):
    ax.plot([lo, hi], [y, y], color=color, linewidth=4, solid_capstyle="round", zorder=3)
    for xc in [lo, hi]:
        ax.plot([xc, xc], [y - 0.2, y + 0.2], color=color, linewidth=2.5, zorder=3)
    ax.scatter(mean, y, s=200, color=color, zorder=5, edgecolors="white", linewidths=2.5)
    ax.text(mean, y + 0.25, f"{mean:.1%}", ha="center", va="bottom", color=color, fontweight="bold")
    ax.text(mean, y - 0.25, f"[{lo:.1%}, {hi:.1%}]", ha="center", va="top", color="#888888")
ax.axvline(TRUE_RATE, color=C_TRUTH, linestyle="--", linewidth=2.5, zorder=4)
ax.text(TRUE_RATE + 0.004, 2.72, "True rate 10%", color=C_TRUTH, fontweight="bold")
ax.set_yticks(y_pos)
ax.set_yticklabels([e[0] for e in baseline_estimates])
ax.xaxis.set_major_formatter(plt.FuncFormatter(lambda v, _: f"{v:.0%}"))
ax.set_xlabel("Estimated Hallucination Rate")
ax.set_xlim(-0.01, 0.26)
ax.set_ylim(-0.8, 3.2)
ax.spines[["top", "right", "left"]].set_visible(False)
ax.tick_params(left=False)
plt.tight_layout()
plt.show()
Summary: what each strategy contributes¶
| | Proxy-only | Human-only | Cost-optimal |
|---|---|---|---|
| Proxy labels | 8,150 | 0 | 4,819 |
| Human labels | 0 | 600 | 575 |
| Unbiased | no | yes | yes |
| Total cost | $408 | $6,000 | $5,991 |
Key takeaways:
- Proxy-only estimation gives false confidence. The LLM judge's systematic bias places the estimate well below the true hallucination rate. A tight interval around the wrong value is worse than a wide one.
- Spending the entire budget on human annotations leaves precision on the table. Each annotation dollar buys exactly one data point; the proxy signal is discarded entirely.
- Cost-optimal sampling buys more precision for the same budget. The sampler finds the optimal mix of human and proxy annotations within the budget, boosting the effective sample size beyond what human annotation alone would achieve at the same cost.
- ASI is valid under the non-uniform annotation design. Inverse probability weighting corrects for the selective sampling plan, ensuring the final estimate is unbiased.
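The unbiasedness claim can be illustrated with a small simulation. This is a sketch on a synthetic toy population, with all parameters chosen for illustration (a judge that misses about half of the hallucinations, an annotation probability of 0.10); it does not use glide and is not a substitute for a formal validation.

```python
import random

rng = random.Random(0)
N, PI, N_REPEATS = 2000, 0.10, 300

# Fixed toy population: ~10% hallucinations, and the judge misses about half of them.
y_true = [int(rng.random() < 0.10) for _ in range(N)]
y_proxy = [t if rng.random() < 0.5 else 0 for t in y_true]

true_mean = sum(y_true) / N
proxy_mean = sum(y_proxy) / N

# Redraw the annotation indicators many times and average the estimates.
estimates = []
for _ in range(N_REPEATS):
    total = 0.0
    for t, p in zip(y_true, y_proxy):
        annotated = rng.random() < PI  # human label bought with probability PI
        total += (p + (t - p) / PI) if annotated else p
    estimates.append(total / N)

ipw_mean = sum(estimates) / N_REPEATS
print(f"true mean        : {true_mean:.3f}")
print(f"proxy-only mean  : {proxy_mean:.3f}")
print(f"average IPW mean : {ipw_mean:.3f}")
```

Averaged over repeated draws, the reweighted estimate lands on the population mean, while the proxy-only mean stays well below it however many proxy labels are collected.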
Want to go further? The ASI deep dive provides a detailed validation of the ASI estimator, with analytical comparisons and coverage experiments.