Uncertainty scores available: ASI
ASI (Active Statistical Inference) combines a small set of expensive true evaluation labels with a large pool of cheap proxy evaluation labels to produce a statistically valid, bias-corrected quality metric. It goes one step further than basic bias correction: when your LLM judge reports not only a label but also an uncertainty score, ASI uses that score to concentrate your annotation budget on the conversations where the judge is least reliable — extracting more information from the same number of human annotations.
What you will learn:
- Why LLM-as-Judge metrics are systematically biased
- How to exploit judge uncertainty scores to allocate your annotation budget efficiently
- How to use ASI to produce a valid, bias-corrected metric with a tight confidence interval
The problem: your LLM judge disagrees with your users
Let's assume you run a customer-facing agentic assistant handling thousands of conversations per day.
The signals
- Every tenth user reports incorrect or fabricated information.
- You deploy an LLM judge to measure the hallucination rate. It tells you the rate is 5%.
The users and the judge disagree: the judge reports half the rate the users do, so it is systematically optimistic.
Fixing an annotation budget
Manual annotation is expensive. With thousands of conversations per day, annotating them all is not an option. You fix a budget of 200 manual annotations — accurate ground truth, but scarce.
The judge's hidden signal
Your LLM judge is more sophisticated than a plain yes/no classifier. For each conversation it evaluates, it also outputs an uncertainty score: a measure of how confident it is in its own label. Some conversations are clear-cut; others are genuinely ambiguous, and the judge knows it.
This uncertainty signal is valuable: it tells you where a human annotator's time would be best spent. Rather than picking 200 conversations at random, you can concentrate the budget on the conversations where the judge is least reliable.
The challenge
You now have:
- 2,200 LLM judgements: cheap and fast, but biased
- Budget for 200 human annotations: accurate but scarce, so each one should be spent on an informative conversation
Two decisions to make:
- Which conversations to annotate? Uncertainty-guided (active) sampling concentrates the budget where the judge is most uncertain, rather than spreading it uniformly at random.
- How to combine human and proxy labels? Using the proxy labels naively re-introduces bias; they need a statistical correction.
ASI addresses both: it uses active sampling for the annotation budget and combines the resulting human labels with all 2,200 LLM judgements via a bias correction, yielding an unbiased estimate with a tighter confidence interval than either source alone.
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
from glide.estimators import ASIMeanEstimator, ClassicalMeanEstimator, IPWClassicalMeanEstimator
from glide.samplers import ActiveSampler
from glide.simulators import generate_binary_dataset_with_oracle_sampling, simulate_annotation
# ── Colour palette ──────────────────────────────────────────
C_JUDGE = "#E74C3C" # LLM judge — red-orange
C_HUMAN = "#2E86AB" # Human-only — blue
C_GLIDE = "#27AE60" # GLIDE / ASI — green
C_TRUTH = "#2C3E50" # True value — dark slate
C_ACTIVE = "#E67E22" # Active sampling — amber
# ── Global plot style ───────────────────────────────────────
plt.rcParams.update(
    {
        "figure.facecolor": "white",
        "axes.facecolor": "#FAFAFA",
        "axes.grid": True,
        "grid.color": "#E5E5E5",
        "grid.linewidth": 0.8,
        "font.size": 18,
        "axes.labelsize": 18,
        "axes.titlesize": 18,
        "legend.fontsize": 16,
        "xtick.labelsize": 16,
        "ytick.labelsize": 16,
        "figure.titlesize": 19,
    }
)
Simulating 2,200 conversations with a biased judge that reports uncertainty
generate_binary_dataset_with_oracle_sampling produces a synthetic dataset that mirrors the scenario above.
It returns three arrays, one value per sample (conversation):
| Array | Meaning |
|---|---|
| `y_true_oracle` | Ground-truth binary label: $Y_i = 1$ = hallucination confirmed by a human annotator |
| `y_proxy` | Binary label from the LLM judge: $\tilde{Y}_i = 1$ = hallucination flagged |
| `uncertainty` | Oracle uncertainty score: $\sqrt{\mathbb{E}[(\tilde{Y}_i - Y_i)^2 \mid x_i]}$ |
where $Y_i$ is the human annotation, $\tilde{Y}_i$ is the LLM judge annotation, and $x_i$ is the conversation in question. A high uncertainty value means the judge is more likely to be wrong on that sample, which is exactly where a human annotator adds the most value.
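Because both labels are binary, the squared error $(\tilde{Y}_i - Y_i)^2$ is 1 when judge and human disagree and 0 otherwise, so the oracle uncertainty is simply the square root of the per-conversation disagreement probability. A minimal sketch of that relationship, using made-up disagreement probabilities (the array `p_disagree` is purely illustrative and not produced by GLIDE):

```python
import numpy as np

# Hypothetical per-conversation probabilities that judge and human disagree.
# Illustrative values only; they do not come from the simulator.
p_disagree = np.array([0.01, 0.05, 0.08, 0.25])

# For binary labels, (y_proxy - y_true) ** 2 is 1 on disagreement and 0 otherwise,
# so E[(y_proxy - y_true) ** 2 | x] equals P(disagree | x).
oracle_uncertainty = np.sqrt(p_disagree)

for p, u in zip(p_disagree, oracle_uncertainty):
    print(f"P(disagree) = {p:.2f} -> uncertainty = {u:.3f}")
```

Read the other way round, an uncertainty score of 0.25 corresponds to a 6.25% chance that the judge's label is wrong for that conversation.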
Simulated annotation: In practice, `y_true_oracle` is revealed only after a human annotates the conversation. Here we generate the ground-truth labels upfront so that we can later simulate the annotation process by masking unlabeled samples with `np.nan`.
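The masking step described above can be pictured with plain NumPy. This is only a sketch of the idea on toy arrays (`y_oracle_toy` and `xi_toy` are hypothetical names), and it assumes that masking means nothing more than replacing unselected labels with `np.nan`:

```python
import numpy as np

# Hypothetical toy data: five ground-truth labels and a 0/1 selection indicator.
y_oracle_toy = np.array([0, 1, 0, 0, 1], dtype=float)
xi_toy = np.array([1, 0, 0, 1, 1])

# Keep the label where the conversation was annotated, hide it everywhere else.
y_observed = np.where(xi_toy == 1, y_oracle_toy, np.nan)

n_observed = int(np.sum(~np.isnan(y_observed)))
print(y_observed)
print(f"annotated: {n_observed} of {len(y_observed)}")
```

Downstream estimators can then recover the annotated subset from the NaN pattern alone, without a separate indicator array.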
N_TOTAL = 2200 # total conversations evaluated by the LLM judge
N_LABELED = 200 # human annotation budget
TRUE_RATE = 0.10 # true hallucination rate (unknown in practice)
RANDOM_SEED = 0
y_true_oracle, y_proxy, uncertainty = generate_binary_dataset_with_oracle_sampling(
    n_total=N_TOTAL,
    true_mean=TRUE_RATE,
    proxy_mean=0.05,
    correlation=0.55,
    random_seed=RANDOM_SEED,
)
print(f"Total conversations : {len(y_true_oracle):,}")
print(f"Annotation budget : {N_LABELED}")

Total conversations : 2,200
Annotation budget : 200
The judge's uncertainty varies across conversations
Before deciding which conversations to annotate, let's inspect the distribution of the judge's uncertainty scores. If every score were identical, all conversations would be equally hard for the judge and active sampling would reduce to uniform sampling; the actual distribution shows that the scores vary, so some conversations are harder for the judge than others.
fig, ax = plt.subplots(figsize=(9, 4.5))
ax.hist(uncertainty, bins=40, color=C_ACTIVE, alpha=0.75, label="Judge uncertainty")
ax.axvline(
    uncertainty.mean(),
    color=C_TRUTH,
    linestyle="--",
    linewidth=2,
    label=f"Mean uncertainty = {uncertainty.mean():.3f}",
)
ax.set_xlabel("Judge uncertainty score")
ax.set_ylabel("Number of conversations")
ax.legend()
plt.tight_layout()
plt.show()
print(
    f"Uncertainty — min: {uncertainty.min():.3f} | mean: {uncertainty.mean():.3f} | max: {uncertainty.max():.3f}"
)
Uncertainty — min: 0.228 | mean: 0.260 | max: 0.290
Smart choice of conversations to annotate with ActiveSampler
ActiveSampler translates the uncertainty scores into Bernoulli sampling probabilities $\pi_i$ by solving the optimization:
$$\mathrm{minimize} \sum_i \frac{\text{uncertainty}_i^2}{\pi_i}$$
subject to $\pi_i \in (0, 1]$ for all $i$ and $\sum_i \pi_i = \text{budget}$.
Each sample is then selected independently with probability $\pi_i$. The number of selected samples is therefore random: it equals the budget in expectation, and any single draw can land somewhat below or above it.
Samples with high uncertainty receive the highest $\pi_i$ and are therefore most likely to be annotated.
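Ignoring the cap at 1, a Lagrange-multiplier argument gives the minimizer in closed form: $\pi_i = \text{budget} \cdot \text{uncertainty}_i / \sum_j \text{uncertainty}_j$, so the probabilities are proportional to the uncertainty scores. The sketch below implements that rule and re-spreads the budget whenever a probability would exceed 1. The function name `proportional_probabilities` is hypothetical, and this is not necessarily how `ActiveSampler` is implemented internally.

```python
import numpy as np


def proportional_probabilities(uncertainty, budget):
    """Sketch: pi_i proportional to uncertainty_i, capped at 1, summing to budget.

    Assumes strictly positive uncertainty scores.
    """
    u = np.asarray(uncertainty, dtype=float)
    if budget > len(u):
        raise ValueError("budget cannot exceed the number of samples")
    pi = np.zeros_like(u)
    capped = np.zeros(len(u), dtype=bool)
    while True:
        remaining = budget - capped.sum()
        # Spread what is left of the budget over uncapped samples, proportionally to uncertainty.
        pi[~capped] = remaining * u[~capped] / u[~capped].sum()
        pi[capped] = 1.0
        over = (pi > 1.0) & ~capped
        if not over.any():
            return pi
        capped |= over


# No capping needed: probabilities are a rescaled copy of the scores.
u_sketch = np.array([0.23, 0.26, 0.29, 0.26])
pi_sketch = proportional_probabilities(u_sketch, budget=2)
print(pi_sketch.round(3), pi_sketch.sum())

# One very uncertain sample hits the cap; the rest share the remaining budget.
pi_capped = proportional_probabilities(np.array([10.0, 1.0, 1.0]), budget=2)
print(pi_capped)
```

With scores that vary as little as the ones in this tutorial, the cap never binds and the probabilities are simply a rescaled copy of the uncertainty scores.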
The sampler returns two arrays:
| Array | Meaning |
|---|---|
| `pi` | Sampling probability $\pi_i \in (0, 1]$ for each sample |
| `xi` | Bernoulli indicator: 1 if the sample was selected for annotation, 0 otherwise |
# Compute sampling probabilities and draw Bernoulli annotation indicators
pi, xi = ActiveSampler().sample(
    uncertainty,
    budget=N_LABELED,
    random_seed=RANDOM_SEED,
)
# Simulate active annotation: withhold y_true_oracle for conversations not selected by the sampler
y_true = simulate_annotation(y_true_oracle, xi)
n_annotated = int(np.sum(xi))
print(f"Total conversations: {N_TOTAL:,}")
print(f"Annotation budget: {N_LABELED}")
print(f"Actually annotated (realized): {n_annotated}")
print()
print(f"Sampling probabilities — min: {pi.min():.3f} | mean: {pi.mean():.3f} | max: {pi.max():.3f}")
print(f"Uniform pi (for reference): {N_LABELED / N_TOTAL:.3f}")

Total conversations: 2,200
Annotation budget: 200
Actually annotated (realized): 183

Sampling probabilities — min: 0.079 | mean: 0.091 | max: 0.101
Uniform pi (for reference): 0.091
Human-only active sampling: unbiased but forfeits the proxy
Before deploying ASI, consider two alternatives:
Option A — Trust the judge on all conversations.
Precise (large sample), but the judge's systematic bias makes the estimate wrong.
Option B — Trust only the human annotations, but sample actively.
Uses the same uncertainty-guided sampling rule as ASI, so the annotation budget is concentrated on the conversations the judge finds hardest. Unbiased, but the 95% confidence interval is wider because the proxy labels are ignored entirely.
The gap between Option B and ASI isolates exactly what the proxy labels contribute: both use the same active sample of ~200 annotations; ASI additionally leverages all 2,200 LLM judgements to sharpen the estimate.
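Option B has to account for the fact that conversations were not sampled uniformly. A standard way to do this is inverse probability weighting: each annotated label is divided by its sampling probability, and the weighted labels are averaged over all conversations. The sketch below shows that idea with a normal-approximation interval. The helper `ipw_mean` is hypothetical and is only assumed to resemble what `IPWClassicalMeanEstimator` computes.

```python
import numpy as np


def ipw_mean(y_observed, pi, z=1.96):
    """Inverse-probability-weighted mean from Bernoulli-sampled labels (NaN = not annotated)."""
    y_observed = np.asarray(y_observed, dtype=float)
    pi = np.asarray(pi, dtype=float)
    labeled = ~np.isnan(y_observed)
    # Each annotated label is up-weighted by 1 / pi_i; unannotated samples contribute 0.
    terms = np.where(labeled, np.nan_to_num(y_observed) / pi, 0.0)
    mean = terms.mean()
    std_error = terms.std(ddof=1) / np.sqrt(len(terms))
    return mean, (mean - z * std_error, mean + z * std_error)


# Toy example: two of four conversations were annotated, each with probability 0.5.
y_obs_toy = np.array([1.0, np.nan, 0.0, np.nan])
pi_toy = np.array([0.5, 0.25, 0.5, 0.25])
ipw_estimate, ipw_interval = ipw_mean(y_obs_toy, pi_toy)
print(ipw_estimate, ipw_interval)
```

The single positive label was sampled with probability 0.5, so it counts double: the weighted terms are 2, 0, 0, 0 and their average over the four conversations is 0.5.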
# Option A: LLM judge — average proxy labels over all 2,200 conversations
judge_estimate = ClassicalMeanEstimator().estimate(y_proxy)
judge_mean = judge_estimate.mean
judge_lower = judge_estimate.confidence_interval.lower_bound
judge_upper = judge_estimate.confidence_interval.upper_bound
# Option B: human labels on the actively sampled conversations (same sampling rule as ASI, no proxy labels)
human_estimate = IPWClassicalMeanEstimator().estimate(y_true, pi)
human_mean = human_estimate.mean
human_lower = human_estimate.confidence_interval.lower_bound
human_upper = human_estimate.confidence_interval.upper_bound
human_label = f"Human labels only (n={n_annotated}, active)"
sep = "-" * 72
print(sep)
print(f"{'Method':<36} {'Estimate':>8} {'95% confidence interval':>16}")
print(sep)
print(f"{'LLM Judge only (2,200)':<36} {judge_mean:>7.1%} [{judge_lower:.1%}, {judge_upper:.1%}]")
print(f"{human_label:<36} {human_mean:>7.1%} [{human_lower:.1%}, {human_upper:.1%}]")
print(sep)
print(f"{'True rate (simulation)':<36} {'10.0%':>8}")
------------------------------------------------------------------------
Method                               Estimate 95% confidence interval
------------------------------------------------------------------------
LLM Judge only (2,200)                  4.7% [3.8%, 5.6%]
Human labels only (n=183, active)       7.5% [3.7%, 11.3%]
------------------------------------------------------------------------
True rate (simulation)                  10.0%
ASI: active sampling plus proxy labels
ASIMeanEstimator implements Active Statistical Inference, which does everything Option B does and adds one more layer: it measures and removes the judge's systematic bias. It estimates the gap between proxy and true labels on the annotated subset and uses that gap to correct the proxy-based estimate across all conversations.
Compared with Option B (human-only active sampling), ASI additionally feeds all 2,200 LLM proxy labels into the estimate. This extra information reduces variance further, shrinking the confidence interval beyond what a human-only strategy can achieve at the same annotation budget. This improvement is quantified by the effective sample size displayed below. It is the annotation budget that the human-only strategy would need to achieve the same variance as ASI.
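The bias correction can be written in a few lines. The sketch below follows the textbook form of the active-inference mean estimator: average the proxy labels over all conversations, then add the inverse-probability-weighted judge error measured on the annotated ones. The helper `asi_mean` is hypothetical; `ASIMeanEstimator` may differ in details such as variance estimation or additional tuning.

```python
import numpy as np


def asi_mean(y_observed, y_proxy, pi, z=1.96):
    """Sketch: proxy mean plus an inverse-probability-weighted bias correction."""
    y_observed = np.asarray(y_observed, dtype=float)
    y_proxy = np.asarray(y_proxy, dtype=float)
    pi = np.asarray(pi, dtype=float)
    labeled = ~np.isnan(y_observed)
    # Judge error (y_true - y_proxy), weighted by 1 / pi_i; zero where no human label exists.
    correction = np.where(labeled, (np.nan_to_num(y_observed) - y_proxy) / pi, 0.0)
    terms = y_proxy + correction
    mean = terms.mean()
    std_error = terms.std(ddof=1) / np.sqrt(len(terms))
    return mean, (mean - z * std_error, mean + z * std_error)


y_proxy_toy = np.array([0.0, 1.0, 0.0, 0.0])
pi_toy = np.full(4, 0.5)

# Judge agrees with every human label: the correction vanishes.
agree_estimate, _ = asi_mean(np.array([0.0, np.nan, np.nan, 0.0]), y_proxy_toy, pi_toy)

# Judge missed a hallucination on the first conversation: the estimate moves up.
miss_estimate, _ = asi_mean(np.array([1.0, np.nan, np.nan, 0.0]), y_proxy_toy, pi_toy)

print(agree_estimate, miss_estimate)
```

When the judge is right on every annotated conversation, the estimate collapses to the plain proxy mean (0.25 here); each observed judge error shifts it, scaled by how unlikely that conversation was to be sampled.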
estimator = ASIMeanEstimator()
asi_result = estimator.estimate(
    y_true,
    y_proxy,
    pi,
    metric_name="Hallucination Rate",
    confidence_level=0.95,
)
print(asi_result.summary())

Metric: Hallucination Rate
Point Estimate: 0.088
Confidence Interval (95%): [0.052, 0.123]
Estimator : ASIMeanEstimator
n_true: 183
n_proxy: 2200
Effective Sample Size: 208
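As a rough sanity check on the effective sample size, one can compare the two printed intervals by hand. The sketch below assumes that the variance of the human-only estimate shrinks in inverse proportion to the number of annotations, so the ratio of squared interval widths converts directly into an equivalent annotation count. This is a back-of-the-envelope reading, not the library's formula.

```python
# Values copied from the outputs above (rounded as printed).
n_annotated = 183
human_width = 0.113 - 0.037   # human-only 95% interval width
asi_width = 0.123 - 0.052     # ASI 95% interval width

# Interval width scales with the standard error, so squared widths compare variances.
variance_ratio = (human_width / asi_width) ** 2
approx_ess = n_annotated * variance_ratio

print(f"variance ratio: {variance_ratio:.3f}")
print(f"approximate effective sample size: {approx_ess:.0f}")
```

The result is about 210, in line with the reported 208; the small difference is plausibly due to the intervals being rounded to three decimals.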
The proxy labels make ASI sharper
The plot below compares point estimates and 95% confidence intervals for all three methods against the true hallucination rate (dashed line):
- LLM judge: very narrow confidence interval, but wrong.
- Human-only (active): uses the same uncertainty-guided sample as ASI — unbiased, but wider because it ignores the proxy labels.
- ASI: unbiased and narrower — active sampling concentrates the budget on the most informative conversations, and all 2,200 proxy labels add variance reduction on top.
estimates = [
    (
        f"LLM Judge\n({N_TOTAL:,} | raw proxy)",
        judge_mean,
        judge_lower,
        judge_upper,
        C_JUDGE,
    ),
    (
        f"Human Annotation\n({n_annotated} | active, no proxy)",
        human_mean,
        human_lower,
        human_upper,
        C_HUMAN,
    ),
    (
        f"ASI (GLIDE)\n({n_annotated} active + {N_TOTAL:,} proxy)\n(uncertainty-guided)",
        asi_result.mean,
        asi_result.confidence_interval.lower_bound,
        asi_result.confidence_interval.upper_bound,
        C_GLIDE,
    ),
]
fig, ax = plt.subplots(figsize=(11, 5.5))
y_pos = [2, 1, 0]
for y, (label, mean, lo, hi, color) in zip(y_pos, estimates):
    ax.plot([lo, hi], [y, y], color=color, linewidth=4, solid_capstyle="round", zorder=3)
    for xc in [lo, hi]:
        ax.plot([xc, xc], [y - 0.2, y + 0.2], color=color, linewidth=2.5, zorder=3)
    ax.scatter(mean, y, s=200, color=color, zorder=5, edgecolors="white", linewidths=2.5)
    ax.text(mean, y + 0.34, f"{mean:.1%}", ha="center", va="bottom", fontsize=12, color=color, fontweight="bold")
    ax.text(mean, y - 0.34, f"[{lo:.1%}, {hi:.1%}]", ha="center", va="top", fontsize=11, color="#888888")
ax.axvline(TRUE_RATE, color=C_TRUTH, linestyle="--", linewidth=2.5, zorder=4)
ax.text(TRUE_RATE + 0.004, 2.72, "True rate 10%", color=C_TRUTH, fontsize=10.5, fontweight="bold")
ax.set_yticks(y_pos)
ax.set_yticklabels([e[0] for e in estimates])
ax.xaxis.set_major_formatter(plt.FuncFormatter(lambda v, _: f"{v:.0%}"))
ax.set_xlabel("Estimated Hallucination Rate")
ax.set_xlim(-0.01, 0.26)
ax.set_ylim(-0.8, 3.2)
ax.spines[["top", "right", "left"]].set_visible(False)
ax.tick_params(left=False)
plt.tight_layout()
plt.show()
Testing whether the hallucination rate is within acceptable limits
GLIDE's estimation output enables a formal hypothesis test: is the true hallucination rate significantly higher than 5%?
$$H_0 : \mu = 5\% \qquad H_1 : \mu > 5\%$$
Without ASI, one option is to rely on the LLM judge estimate alone. Its confidence interval sits misleadingly close to 5%, so the null hypothesis would rarely be rejected. The human-only active baseline is unbiased, but its wide confidence interval makes the test conservative. ASI uses the same active sample plus all proxy labels, which gives an unbiased estimate with a narrower interval and therefore a more powerful test.
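A one-sided z-test of this kind can be reproduced by hand from a point estimate and its symmetric normal interval: recover the standard error from the interval width, standardize the distance to the null value, and read the upper-tail probability. The sketch below uses only the Python standard library. The function `one_sided_z_test` is hypothetical, and whether `test_null_hypothesis` follows exactly these steps is an assumption.

```python
from statistics import NormalDist


def one_sided_z_test(mean, ci_lower, ci_upper, h0_value, confidence_level=0.95):
    """Test H0: mu = h0_value against H1: mu > h0_value from a symmetric normal interval."""
    standard_normal = NormalDist()
    # Critical value used to build the two-sided interval (about 1.96 for 95%).
    z_crit = standard_normal.inv_cdf(0.5 + confidence_level / 2)
    std_error = (ci_upper - ci_lower) / (2 * z_crit)
    z_stat = (mean - h0_value) / std_error
    p_value = 1 - standard_normal.cdf(z_stat)
    return z_stat, p_value


# Rounded ASI summary values printed earlier in this tutorial.
z_sketch, p_sketch = one_sided_z_test(0.088, 0.052, 0.123, h0_value=0.05)
print(f"z = {z_sketch:.2f}, p = {p_sketch:.4f}")
```

Because the inputs are rounded to three decimals, the z-statistic obtained this way (about 2.1) can differ slightly from the value computed on the unrounded estimate.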
z_stat, p_value, _ = asi_result.confidence_interval.test_null_hypothesis(
    h0_value=0.05,  # LLM judge's claimed rate
    alternative="larger",  # H1: true rate > 5%
)
sep = "-" * 48
print("Hypothesis test — ASI Estimator")
print(sep)
print("H0 : hallucination rate = 5% (LLM says so)")
print("H1 : hallucination rate > 5% (users complain!)")
print()
print(f"z-statistic : {z_stat:.2f}")
print(f"p-value : {p_value:.10f}")
print()
if p_value < 0.05:
    print("Decision : Given the p-value is lower than our threshold (p-value < 0.05), we reject H0 at the 5% level.")
    print("=> The true hallucination rate is significantly above 5%.")
else:
    print(
        "Decision : Given the p-value is higher than our threshold (p-value >= 0.05), "
        "we cannot reject H0 at the 5% level."
    )

Hypothesis test — ASI Estimator
------------------------------------------------
H0 : hallucination rate = 5% (LLM says so)
H1 : hallucination rate > 5% (users complain!)

z-statistic : 2.06
p-value : 0.0196009466

Decision : Given the p-value is lower than our threshold (p-value < 0.05), we reject H0 at the 5% level.
=> The true hallucination rate is significantly above 5%.
The null hypothesis is rejected, signalling that the hallucination rate is significantly above the fixed threshold.
Let us run the same hypothesis test using the human-only active baseline.
z_stat_cm, p_value_cm, _ = human_estimate.confidence_interval.test_null_hypothesis(
    h0_value=0.05,
    alternative="larger",
)
sep = "-" * 48
print("Hypothesis test — Human labels only (active sample, no proxy)")
print(sep)
print("H0 : hallucination rate = 5% (LLM says so)")
print("H1 : hallucination rate > 5% (users complain!)")
print()
print(f"z-statistic : {z_stat_cm:.2f}")
print(f"p-value : {p_value_cm:.10f}")
print()
if p_value_cm < 0.05:
    print("Decision : Given the p-value is lower than our threshold (p-value < 0.05), we reject H0 at the 5% level.")
    print("=> The true hallucination rate is significantly above 5%.")
else:
    print(
        "Decision : Given the p-value is higher than our threshold (p-value >= 0.05), "
        "we cannot reject H0 at the 5% level."
    )

Hypothesis test — Human labels only (active sample, no proxy)
------------------------------------------------
H0 : hallucination rate = 5% (LLM says so)
H1 : hallucination rate > 5% (users complain!)

z-statistic : 1.30
p-value : 0.0970925750

Decision : Given the p-value is higher than our threshold (p-value >= 0.05), we cannot reject H0 at the 5% level.
The null hypothesis is not rejected: even with the annotation budget concentrated on the hardest conversations, ~200 human labels without any proxy information leave the confidence interval too wide to draw a firm conclusion. ASI uses the same active sample but additionally leverages all 2,200 proxy labels, reducing variance enough to reject the null — demonstrating that the proxy information closes the gap.
Summary: what each layer adds
| | LLM Judge | Human-only (active) | ASI |
|---|---|---|---|
| Sample size | 2,200 | ~200 | ~200 + 2,200 |
| Unbiased estimate | ❌ | ✅ | ✅ |
| Narrow confidence interval | 🟠 (misleading) | ❌ | ✅ |
| Uncertainty-guided sampling | ❌ | ✅ | ✅ |
| Labeling cost | Low | ~200 annotations | ~200 annotations (same budget) |
Key takeaways:
- LLM judges are biased. A narrow confidence interval around the wrong value is worse than useless: it gives false confidence.
- Active sampling alone is not enough. Concentrating the annotation budget on uncertain conversations (Option B) is a good strategy, but without proxy labels the confidence interval remains wide.
- The proxy labels make ASI sharper. ASI uses the same active sample as Option B and additionally incorporates all LLM proxy labels with a bias correction, shrinking the interval further.
- 200 human annotations is all you need. In this run, ASI's bias correction draws on 2,200 cheap proxy labels to narrow the 95% confidence interval from 7.6 to 7.1 percentage points relative to human-only estimation on the same annotations, and to turn an inconclusive hypothesis test into a rejection. The effective sample size is 208, against 183 actual annotations.
- ASI is valid under non-uniform sampling. Inverse Probability Weighting corrects for the fact that uncertain conversations are over-represented in the annotated set, preserving statistical validity.
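The last takeaway can be checked with a small simulation. The sketch below builds a hypothetical population in which conversations with a higher sampling probability are also more likely to be hallucinations, then compares the plain average of the annotated labels with the inverse-probability-weighted average. All names and distributions are illustrative assumptions and do not come from GLIDE.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000

# Hypothetical setup: sampling probability and label are positively related,
# as when hard conversations are both over-sampled and more often hallucinated.
pi_toy = rng.uniform(0.05, 0.5, size=n)
y_full = (rng.random(n) < pi_toy).astype(float)

# Bernoulli selection with probability pi_i for each conversation.
xi_toy = rng.random(n) < pi_toy

# Plain average of the annotated labels ignores how they were selected.
naive = y_full[xi_toy].mean()
# Inverse probability weighting: weight each annotated label by 1 / pi_i, average over all n.
ipw = np.sum(y_full[xi_toy] / pi_toy[xi_toy]) / n

print(f"full-population mean: {y_full.mean():.3f}")
print(f"naive annotated mean: {naive:.3f}")
print(f"IPW estimate:         {ipw:.3f}")
```

In this setup the naive average overshoots the population mean because over-sampled conversations carry more positives, while the weighted average stays close to it.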
Want to go further? The ASI scientific validation notebook provides rigorous empirical evidence — coverage validity and confidence interval width across a sweep of proxy correlations and confidence levels.