Clustered data: Cluster PPI++¶
Cluster PPI++ combines a small set of expensive true evaluation labels with a large pool of cheap proxy evaluation labels to produce a statistically valid, bias-corrected quality metric. It is designed for datasets where samples are grouped into clusters (such as sentences in paragraphs or turns within conversations), because samples in the same cluster are correlated and must be treated as a single annotation unit. The annotation budget is therefore defined in terms of the number of clusters (conversations) to annotate, not individual samples.
This guide walks through a realistic hallucination-detection scenario end-to-end.
What you will learn:
- Why LLM-as-Judge metrics are systematically biased
- How to use Cluster PPI++ to produce a bias-corrected metric on clustered data
- How to test that your metric fits your expectations
The problem: your LLM judge disagrees with your users¶
Suppose you run a customer-facing AI assistant that handles thousands of multi-turn conversations per day. Each conversation is made up of individual turns (phrases or exchanges), and the quality metric is measured at the turn level: a turn is flagged if it contains a hallucination. Turns within the same conversation share context and are therefore not independent, so each conversation must be treated as a cluster.
The signals¶
- One user in ten reports incorrect or fabricated information, which management considers unacceptable.
- You deploy an LLM judge to rate each conversation turn for hallucination. It reports a hallucination rate of 5%.
The users and the LLM judge disagree. You decide to dig deeper.
The manual investigation¶
You budget for 40 conversations to annotate manually — expensive but accurate ground truth. Because annotators review full conversations, this covers roughly 330 individual turns. Annotators find that ~10% of turns contain a blatant hallucination.
That is double what the LLM reports. The judge is systematically optimistic.
The challenge¶
You now have:
- 8,000 LLM judgements across 800 conversations — cheap and fast, but biased
- 330 human annotations across 40 conversations — accurate, but covering only a small portion of your data
Ignoring the conversation structure would underestimate the uncertainty, because correlated turns carry less information than the same number of independent ones. Cluster PPI++ accounts for the within-conversation correlation and combines both sources to produce a reliable, unbiased estimate of the true hallucination rate.
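To make the underestimation concrete, here is a minimal NumPy sketch on toy data. The Beta-distributed per-conversation rate and the equal conversation length of 10 turns are assumptions chosen for illustration; none of this is GLIDE code. It compares the standard error obtained by pretending all turns are independent with the one obtained by treating each conversation as a single unit.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: 800 conversations of 10 turns each. Every conversation
# draws its own hallucination propensity, which correlates its turns.
n_clusters, turns_per_cluster = 800, 10
cluster_rate = rng.beta(0.5, 4.5, size=n_clusters)  # mean 0.10
y = rng.binomial(1, np.repeat(cluster_rate, turns_per_cluster))
clusters = np.repeat(np.arange(n_clusters), turns_per_cluster)

# Naive standard error: pretends all 8,000 turns are independent.
naive_se = y.std(ddof=1) / np.sqrt(y.size)

# Cluster-aware standard error: one observation per conversation.
# With equal cluster sizes, the mean of cluster means equals the overall mean.
cluster_means = np.bincount(clusters, weights=y) / np.bincount(clusters)
cluster_se = cluster_means.std(ddof=1) / np.sqrt(n_clusters)

print(f"naive SE   : {naive_se:.4f}")
print(f"cluster SE : {cluster_se:.4f}")
```

The cluster-aware standard error comes out clearly larger, which is the gap a turn-level analysis would silently ignore.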
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
from glide.estimators import ClusterClassicalMeanEstimator, ClusterPPIMeanEstimator
from glide.samplers import UniformClusterSampler
from glide.simulators import generate_clustered_binary_dataset, simulate_annotation
# ── Colour palette ──────────────────────────────────────────
C_JUDGE = "#E74C3C" # LLM judge — red-orange
C_HUMAN = "#2E86AB" # Human-only — blue
C_GLIDE = "#27AE60" # GLIDE — green
C_TRUTH = "#2C3E50" # True value — dark slate
# ── Global plot style ───────────────────────────────────────
plt.rcParams.update(
    {
        "figure.facecolor": "white",
        "axes.facecolor": "#FAFAFA",
        "axes.grid": True,
        "grid.color": "#E5E5E5",
        "grid.linewidth": 0.8,
        "font.size": 18,
        "axes.labelsize": 18,
        "axes.titlesize": 18,
        "legend.fontsize": 16,
        "xtick.labelsize": 16,
        "ytick.labelsize": 16,
        "figure.titlesize": 19,
    }
)
Simulating a clustered dataset with a biased judge¶
generate_clustered_binary_dataset produces synthetic data that mirrors the scenario above, with ground-truth labels available for all samples. Each row represents a single conversation turn; turns are grouped into conversations via the clusters array.
The simulation replicates the practical workflow: proxy labels are generated for every turn, a random subset of conversations is selected for human annotation (the budget is specified as a number of conversations), and the remaining ground-truth labels are masked with np.nan to reflect what would be unobserved in practice.
- Generate synthetic data with generate_clustered_binary_dataset: proxy labels cover all turns.
- Sample a subset of conversations for annotation with UniformClusterSampler.
- Annotate selected turns and mask the rest via simulate_annotation.
The table below summarizes the arrays produced and what each contains.
| Array | Meaning |
|---|---|
| y_true_oracle | Ground-truth labels for all turns (revealed only after annotation) |
| y_proxy | Proxy predictions for all turns |
| clusters | Cluster (conversation) ID for each turn |
| xi | Annotation indicator: 1 if the turn's conversation was annotated, 0 otherwise |
| y_true | Observed labels: ground-truth where annotated, np.nan elsewhere |
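The relationship between clusters, xi and y_true can be sketched in plain NumPy. The helper below, sample_clusters_and_mask, is a hypothetical stand-in written for this illustration; it is not how UniformClusterSampler or simulate_annotation are implemented, only a picture of what the resulting arrays look like.

```python
import numpy as np


def sample_clusters_and_mask(y_oracle, clusters, n_clusters_to_label, seed=0):
    """Toy stand-in for the sample-then-annotate step (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    # Draw whole conversations, not individual turns.
    chosen = rng.choice(np.unique(clusters), size=n_clusters_to_label, replace=False)
    # xi is 1 for every turn that belongs to a chosen conversation.
    xi = np.isin(clusters, chosen).astype(int)
    # Observed labels: ground truth where annotated, NaN elsewhere.
    y_observed = np.where(xi == 1, y_oracle.astype(float), np.nan)
    return xi, y_observed


clusters = np.array([0, 0, 0, 1, 1, 2, 2, 2, 2])
y_oracle = np.array([0, 1, 0, 0, 0, 1, 0, 0, 1])
xi, y_observed = sample_clusters_and_mask(y_oracle, clusters, n_clusters_to_label=1)
print(xi)
print(y_observed)
```

Because selection happens at the conversation level, xi is constant within each cluster: a conversation is either fully annotated or not annotated at all.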
N_TOTAL = 8000
N_CLUSTERS = 800
N_LABELED = 40
RANDOM_SEED = 14
y_true_oracle, y_proxy, clusters = generate_clustered_binary_dataset(
    n_total=N_TOTAL,
    n_clusters=N_CLUSTERS,
    true_mean=0.10,
    proxy_mean=0.05,
    correlation=0.65,
    random_seed=RANDOM_SEED,
)
xi = UniformClusterSampler().sample(clusters, n_clusters=N_LABELED, random_seed=RANDOM_SEED)
y_true = simulate_annotation(y_true_oracle, xi)
labeled_mask = ~np.isnan(y_true)
n_labeled = np.sum(labeled_mask)
n_unlabeled = len(y_true) - n_labeled
print(f"Total turns : {len(y_true):,}")
print(f"Total conversations : {N_CLUSTERS:,}")
print(f" Annotated conversations: {N_LABELED:,}")
print(f" Annotated turns : {n_labeled:,}")
print(f" LLM-judged only (turns): {n_unlabeled:,}")
print()
print("Sample values")
print(f" y_true (annotated turn): {y_true[labeled_mask][0]}")
print(f" y_true (unannotated turn): {y_true[~labeled_mask][0]}")
print(f" y_proxy (all turns): {y_proxy[0]}")
Total turns : 8,000
Total conversations : 800
 Annotated conversations: 40
 Annotated turns : 330
 LLM-judged only (turns): 7,670

Sample values
 y_true (annotated turn): 0.0
 y_true (unannotated turn): nan
 y_proxy (all turns): 0.0
Two naive strategies both fail¶
Two obvious approaches to estimating the true hallucination rate each have a fatal flaw:
Option A — Trust the judge on all turns. Precise (large sample), but the judge's systematic bias makes the estimate wrong.
Option B — Trust only the human annotations. Unbiased, but the 95% confidence interval is very wide because only a few conversations were annotated.
Cluster PPI++, introduced in the next section, fixes both problems simultaneously.
# Option A: LLM judge — average proxy labels over all conversations
judge_estimate = ClusterClassicalMeanEstimator().estimate(y_proxy, clusters)
judge_mean = judge_estimate.mean
judge_lower_bound = judge_estimate.confidence_interval.lower_bound
judge_upper_bound = judge_estimate.confidence_interval.upper_bound
# Option B: human labels only — average over labeled conversations
human_estimate = ClusterClassicalMeanEstimator().estimate(y_true, clusters)
human_mean = human_estimate.mean
human_lower_bound = human_estimate.confidence_interval.lower_bound
human_upper_bound = human_estimate.confidence_interval.upper_bound
sep = "-" * 70
print(sep)
print(f"{'Method':<34} {'Estimate':>8} {'95% confidence interval':>16}")
print(sep)
print(f"{'LLM Judge (800 conversations)':<34} {judge_mean:>7.1%} [{judge_lower_bound:.1%}, {judge_upper_bound:.1%}]")
print(f"{'Human-only (40 conversations)':<34} {human_mean:>7.1%} [{human_lower_bound:.1%}, {human_upper_bound:.1%}]")
print(sep)
print(f"{'True rate (simulation)':<34} {'10.0%':>8}")
----------------------------------------------------------------------
Method                             Estimate 95% confidence interval
----------------------------------------------------------------------
LLM Judge (800 conversations)         3.9% [3.2%, 4.5%]
Human-only (40 conversations)         8.3% [4.2%, 12.3%]
----------------------------------------------------------------------
True rate (simulation)                10.0%
The root cause: the LLM judge is systematically biased¶
The gap is clear: on average, the judge under-reports hallucinations relative to the human annotators.
The Cluster PPI++ rectifier measures this systematic error on the annotated turns, then applies the correction across all turns.
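The idea of measuring the error on annotated turns and applying it everywhere can be shown on a tiny hand-made example. This is the plain PPI point estimate with the proxy weight fixed at 1, on made-up arrays; Cluster PPI++ additionally tunes that weight and computes its variance over conversations, which this sketch leaves out.

```python
import numpy as np

# Made-up toy arrays: 10 turns, of which the first 4 were annotated.
y_proxy = np.array([0, 0, 1, 0, 0, 0, 0, 0, 0, 0], dtype=float)
y_true = np.array([0, 1, 1, 0] + [np.nan] * 6)

labeled = ~np.isnan(y_true)

# Rectifier: average judge error, measured where both labels exist.
rectifier = np.mean(y_true[labeled] - y_proxy[labeled])

# Corrected estimate: judge mean over all turns, shifted by the rectifier.
corrected = y_proxy.mean() + rectifier

print(f"judge mean : {y_proxy.mean():.2f}")
print(f"rectifier  : {rectifier:+.2f}")
print(f"corrected  : {corrected:.2f}")
```

The rectifier has the opposite sign of the bias plotted in this section (proxy minus truth): a judge that under-reports gets a positive correction.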
labeled_mask = ~np.isnan(y_true)
y_true_filtered = y_true[labeled_mask]
y_proxy_labeled = y_proxy[labeled_mask]
p_mean = np.mean(y_proxy_labeled)
t_mean = np.mean(y_true_filtered)
bias = p_mean - t_mean
fig, ax = plt.subplots(figsize=(7, 4.5))
x_pos = np.array([0, 1])
bar_vals = [p_mean, t_mean]
bar_colors = [C_JUDGE, C_HUMAN]
bar_labels = ["LLM Judge\n on annotated subset", "Human Annotation\n(ground truth)"]
ax.bar(x_pos, bar_vals, color=bar_colors, width=0.45, edgecolor="white", linewidth=2, zorder=3)
# Loop variable is named "x" rather than "xi" so the annotation indicator xi is not overwritten
for x, val, c in zip(x_pos, bar_vals, bar_colors):
    ax.text(x, val + 0.005, f"{val:.1%}", ha="center", fontsize=14, fontweight="bold", color=c)
ax.annotate(
    "", xy=(1, t_mean + 0.010), xytext=(0, p_mean + 0.010), arrowprops=dict(arrowstyle="<->", color="#666666", lw=2.5)
)
ax.text(
    0.5,
    max(bar_vals) + 0.033,
    f"Bias = {bias:+.1%}",
    ha="center",
    fontsize=12,
    color="#555555",
    fontstyle="italic",
    bbox=dict(boxstyle="round,pad=0.35", facecolor="#FFFDE7", edgecolor="#CCCCCC"),
)
ax.axhline(0.10, color=C_TRUTH, linestyle="--", linewidth=1.8, zorder=2, label="True rate: 10%")
ax.set_xticks(x_pos)
ax.set_xticklabels(bar_labels)
ax.set_ylabel("Hallucination Rate")
ax.set_ylim(0, 0.26)
ax.set_xlim(-0.5, 1.5)
ax.yaxis.set_major_formatter(plt.FuncFormatter(lambda v, _: f"{v:.0%}"))
ax.spines[["top", "right"]].set_visible(False)
ax.legend(loc="upper right")
plt.tight_layout()
plt.show()
Cluster PPI++ corrects biased labels by leveraging true ones¶
ClusterPPIMeanEstimator implements Cluster PPI++, a cluster-aware extension of Prediction-Powered Inference (PPI++). It combines:
- n annotated turns from the labeled conversations (with human annotations and LLM judge labels)
- N turns from unlabeled conversations (LLM judge labels only)
It accounts for the within-conversation correlation structure, so the confidence interval correctly reflects the variance due to sampling whole conversations. The more the judge agrees with human annotators, the more it reduces the uncertainty of the final estimate, so a better-calibrated judge directly translates into cost savings.
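The link between judge quality and uncertainty can be sketched with a simplified, equal-cluster-size variant of the idea. Everything here is an assumption made for illustration: the toy data, the judge that misses half of the hallucinations, and the shortcut of collapsing each conversation to its mean so that the units are independent. GLIDE's ClusterPPIMeanEstimator may well be implemented differently, in particular for conversations of unequal length.

```python
import numpy as np

rng = np.random.default_rng(1)
n_clusters, turns, n_labeled = 800, 10, 40

# Toy data (hypothetical): each conversation has its own hallucination propensity.
rate = rng.beta(0.5, 4.5, size=n_clusters)
y_true = rng.binomial(1, np.repeat(rate, turns)).reshape(n_clusters, turns)
# Toy judge: misses each real hallucination with probability 0.5, never invents one.
y_proxy = np.where(rng.random(y_true.shape) < 0.5, 0, y_true)

# Collapse every conversation to one number so the sampling units are independent.
t = y_true.mean(axis=1)
f = y_proxy.mean(axis=1)
labeled = np.zeros(n_clusters, dtype=bool)
labeled[rng.choice(n_clusters, size=n_labeled, replace=False)] = True
t_l, f_l, f_u = t[labeled], f[labeled], f[~labeled]
n, N = t_l.size, f_u.size

# Weight on the proxy term, chosen to minimise the plug-in variance used below.
var_f_l, var_f_u = f_l.var(ddof=1), f_u.var(ddof=1)
lam = np.cov(t_l, f_l)[0, 1] / (var_f_l + (n / N) * var_f_u)

# Human mean on labeled conversations, corrected by the weighted proxy gap.
estimate = t_l.mean() + lam * (f_u.mean() - f_l.mean())
se_ppi = np.sqrt((t_l - lam * f_l).var(ddof=1) / n + lam**2 * var_f_u / N)
se_human = t_l.std(ddof=1) / np.sqrt(n)

print(f"lambda        : {lam:.2f}")
print(f"estimate      : {estimate:.3f}")
print(f"SE human-only : {se_human:.4f}")
print(f"SE with proxy : {se_ppi:.4f}")
```

Setting the weight to 0 recovers the human-only estimate, so the tuned weight can never do worse on this plug-in variance. The stronger the covariance between judge and human cluster means, the larger the reduction.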
estimator = ClusterPPIMeanEstimator()
cluster_ppi_result = estimator.estimate(
    y_true,
    y_proxy,
    clusters,
    metric_name="Hallucination Rate",
    confidence_level=0.95,
)
print(cluster_ppi_result.summary())
Metric: Hallucination Rate
Point Estimate: 0.084
Confidence Interval (95%): [0.048, 0.121]
Estimator : ClusterPPIMeanEstimator
n_true: 330
n_proxy: 8000
Effective Sample Size: 405
When judge and human labels are correlated, the difference $y_{\text{true}} - y_{\text{proxy}}$ has low variance at the turn level, so the rectifier adds little noise. Cluster PPI++ achieves a sizable increase in effective sample size: this is the number of turns that a human-only strategy would need to achieve the same variance as Cluster PPI++, quantifying the variance reduction.
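As a rough sanity check, the reported effective sample size is consistent with scaling the number of annotated turns by the variance ratio between the two estimators. The sketch below assumes that reading of the definition (it is not taken from GLIDE's source) and uses only the rounded interval bounds printed in the outputs above.

```python
# Rounded values copied from the printed outputs above.
n_true = 330
human_ci = (0.042, 0.123)  # human-only 95% confidence interval
ppi_ci = (0.048, 0.121)    # Cluster PPI++ 95% confidence interval

# At a fixed confidence level, interval width is proportional to the standard
# error, so the squared width ratio approximates the variance ratio.
width_ratio = (human_ci[1] - human_ci[0]) / (ppi_ci[1] - ppi_ci[0])
variance_ratio = width_ratio**2

effective_sample_size = n_true * variance_ratio
print(f"variance ratio        : {variance_ratio:.2f}")
print(f"effective sample size : {effective_sample_size:.0f}")
```

The result lands within a turn or two of the reported 405; the small gap is expected because the inputs are rounded to three decimals.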
Cluster PPI++ Delivers an Unbiased Estimate at Low Cost¶
The plot below compares point estimates and 95% confidence intervals for all three methods against the true hallucination rate (dashed line):
- LLM judge: very narrow confidence interval, but wrong.
- Human-only: unbiased, but the confidence interval is very wide because only a few conversations were annotated.
- Cluster PPI++: unbiased and narrow, combining the accuracy of human labels with the precision of many proxy judgements, while correctly accounting for within-conversation correlation.
TRUE_RATE = 0.10
estimates = [
    (
        f"LLM Judge\n({cluster_ppi_result.n_proxy} | raw proxy)",
        judge_mean,
        judge_lower_bound,
        judge_upper_bound,
        C_JUDGE,
    ),
    (
        f"Human Annotation\n({cluster_ppi_result.n_true} | small sample)",
        human_mean,
        human_lower_bound,
        human_upper_bound,
        C_HUMAN,
    ),
    (
        f"Cluster PPI++ (GLIDE)\n({cluster_ppi_result.n_true} + {cluster_ppi_result.n_proxy})\n(full data exploited)",
        cluster_ppi_result.mean,
        cluster_ppi_result.confidence_interval.lower_bound,
        cluster_ppi_result.confidence_interval.upper_bound,
        C_GLIDE,
    ),
]
fig, ax = plt.subplots(figsize=(11, 5.5))
y_pos = [2, 1, 0]
for y, (label, mean, lo, hi, color) in zip(y_pos, estimates):
    # Confidence interval line
    ax.plot([lo, hi], [y, y], color=color, linewidth=4, solid_capstyle="round", zorder=3)
    # Cap marks
    for xc in [lo, hi]:
        ax.plot([xc, xc], [y - 0.2, y + 0.2], color=color, linewidth=2.5, zorder=3)
    # Point estimate
    ax.scatter(mean, y, s=200, color=color, zorder=5, edgecolors="white", linewidths=2.5)
    # Value label above
    ax.text(mean, y + 0.34, f"{mean:.1%}", ha="center", va="bottom", fontsize=12, color=color, fontweight="bold")
    # Confidence interval bounds below
    ax.text(mean, y - 0.34, f"[{lo:.1%}, {hi:.1%}]", ha="center", va="top", fontsize=11, color="#888888")
# True value
ax.axvline(TRUE_RATE, color=C_TRUTH, linestyle="--", linewidth=2.5, zorder=4)
ax.text(TRUE_RATE + 0.004, 2.72, "True rate 10%", color=C_TRUTH, fontsize=10.5, fontweight="bold")
ax.set_yticks(y_pos)
ax.set_yticklabels([e[0] for e in estimates])
ax.xaxis.set_major_formatter(plt.FuncFormatter(lambda v, _: f"{v:.0%}"))
ax.set_xlabel("Estimated Hallucination Rate")
ax.set_xlim(-0.01, 0.26)
ax.set_ylim(-0.8, 3.2)
ax.spines[["top", "right", "left"]].set_visible(False)
ax.tick_params(left=False)
plt.tight_layout()
plt.show()
Testing whether the hallucination rate is within acceptable limits¶
GLIDE's estimation output enables a formal hypothesis test: is the true hallucination rate significantly higher than 5%?
$$H_0 : \mu = 5\% \qquad H_1 : \mu > 5\%$$
Is our system's hallucination rate sufficiently above the business tolerance of 5% to act on?
Without Cluster PPI++, neither source supports a sound decision. The LLM judge yields a narrow but misleading confidence interval that sits below 5%, so a test based on it would not reject the null hypothesis. Human-only annotations cover only a few conversations and produce a wide confidence interval, which leads to overly conservative decisions.
Cluster PPI++ combines both sources to perform accurate hypothesis testing.
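For intuition, a one-sided z-test can be derived from a point estimate and its confidence interval using only the standard library. The helper below is a hypothetical sketch, not the implementation behind test_null_hypothesis; it assumes a symmetric normal interval and is fed the rounded Cluster PPI++ values printed earlier.

```python
from statistics import NormalDist


def one_sided_z_test(estimate, ci_lower, ci_upper, h0_value, confidence_level=0.95):
    """Test H1: true value > h0_value, assuming a symmetric normal interval."""
    # Recover the standard error from the interval half-width.
    z_crit = NormalDist().inv_cdf(0.5 + confidence_level / 2)
    std_error = (ci_upper - ci_lower) / (2 * z_crit)
    # Distance from the null value, in standard errors.
    z = (estimate - h0_value) / std_error
    # Upper-tail probability of a standard normal.
    p_value = 1.0 - NormalDist().cdf(z)
    return z, p_value


z, p = one_sided_z_test(0.084, 0.048, 0.121, h0_value=0.05)
print(f"z = {z:.2f}, p = {p:.3f}")
```

With rounded inputs the statistic comes out slightly different from the value GLIDE reports on unrounded numbers, but the mechanics are the same: a smaller standard error pushes z up and the p-value down.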
z_stat, p_value, _ = cluster_ppi_result.confidence_interval.test_null_hypothesis(
    h0_value=0.05,  # LLM judge's claimed rate
    alternative="larger",  # H1: true rate > 5%
)
sep = "-" * 48
print("Hypothesis test — PPI++ Estimator")
print(sep)
print("H0 : hallucination rate = 5% (LLM says so)")
print("H1 : hallucination rate > 5% (users complain!)")
print()
print(f"z-statistic : {z_stat:.2f}")
print(f"p-value : {p_value:.10f}")
print()
if p_value < 0.05:
    print("Decision : Given the p-value is lower than our threshold (p-value < 0.05), we reject H0 at the 5% level.")
    print("=> The true hallucination rate is significantly above 5%.")
else:
    print(
        "Decision : Given the p-value is higher than our threshold (p-value >= 0.05), "
        "we cannot reject H0 at the 5% level."
    )
Hypothesis test — PPI++ Estimator
------------------------------------------------
H0 : hallucination rate = 5% (LLM says so)
H1 : hallucination rate > 5% (users complain!)

z-statistic : 1.84
p-value : 0.0329014524

Decision : Given the p-value is lower than our threshold (p-value < 0.05), we reject H0 at the 5% level.
=> The true hallucination rate is significantly above 5%.
The null hypothesis is rejected, which indicates that the hallucination rate is significantly above the 5% threshold.
Let us try the same hypothesis test using human annotations only.
human_labels_ci = human_estimate.confidence_interval
z_stat_cm, p_value_cm, _ = human_labels_ci.test_null_hypothesis(
    h0_value=0.05,  # LLM judge's claimed rate
    alternative="larger",  # H1: true rate > 5%
)
sep = "-" * 48
print("Hypothesis test — Classical Mean Estimator (Human labels only)")
print(sep)
print("H0 : hallucination rate = 5% (LLM says so)")
print("H1 : hallucination rate > 5% (users complain!)")
print()
print(f"z-statistic : {z_stat_cm:.2f}")
print(f"p-value : {p_value_cm:.10f}")
print()
if p_value_cm < 0.05:
    print("Decision : Given the p-value is lower than our threshold (p-value < 0.05), we reject H0 at the 5% level.")
    print("=> The true hallucination rate is significantly above 5%.")
else:
    print(
        "Decision : Given the p-value is higher than our threshold (p-value >= 0.05), "
        "we cannot reject H0 at the 5% level."
    )
Hypothesis test — Classical Mean Estimator (Human labels only)
------------------------------------------------
H0 : hallucination rate = 5% (LLM says so)
H1 : hallucination rate > 5% (users complain!)

z-statistic : 1.58
p-value : 0.0566632362

Decision : Given the p-value is higher than our threshold (p-value >= 0.05), we cannot reject H0 at the 5% level.
With human labels alone, the null hypothesis is not rejected: the estimate is too uncertain to support the same conclusion. As we saw above, Cluster PPI++ fixes this by leveraging the LLM judge labels to reduce uncertainty while properly accounting for the cluster structure.
Summary: Cluster PPI++ combines accuracy and precision¶
| | LLM Judge | Human-only | Cluster PPI++ |
|---|---|---|---|
| Conversations covered | 800 | 40 | 800 |
| Turns covered | 8,000 | 330 | 8,000 |
| Unbiased estimate | ❌ | ✅ | ✅ |
| Narrow confidence interval | 🟠 (misleading) | ❌ | ✅ |
| Labeling cost | Low | High | Small |
Key takeaways:
LLM judges are biased. A narrow confidence interval around the wrong value is worse than useless: it gives false confidence.
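A small annotation budget goes a long way. With only 40 annotated conversations, the rectifier draws on 8,000 cheap proxy labels to shrink the confidence interval relative to human-only estimation (an effective sample size of 405 turns versus 330 annotated turns in this example).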
Cluster PPI++ efficiency relies on conversation count and LLM-judge quality. To shrink the confidence interval further, annotate more conversations or improve the LLM judge's calibration.
Want to go further? The Cluster PPI scientific validation notebook provides rigorous empirical evidence — coverage validity and confidence interval width across a sweep of proxy correlations and confidence levels.