Clustered data: Cluster PPI++

Cluster PPI++ combines a small set of expensive true evaluation labels with a large pool of cheap proxy evaluation labels to produce a statistically valid, bias-corrected quality metric. It is designed for datasets where samples are grouped into clusters (such as sentences in paragraphs or turns within conversations), because samples in the same cluster are correlated and must be treated as a single annotation unit. The annotation budget is therefore defined in terms of the number of clusters (conversations) to annotate, not individual samples.

This guide walks through a realistic hallucination-detection scenario end-to-end.

What you will learn:

Why LLM-as-Judge metrics are systematically biased
How to use Cluster PPI++ to produce a bias-corrected metric on clustered data
How to test that your metric fits your expectations

The problem: your LLM judge disagrees with your users

Suppose you run a customer-facing AI assistant that handles thousands of multi-turn conversations per day. Each conversation is made up of individual turns (phrases or exchanges), and quality is measured at the turn level: a turn is flagged if it contains a hallucination. Turns within the same conversation share context and are therefore not independent, so each conversation must be treated as a cluster.

The signals

One in ten users reports incorrect or fabricated information, which management considers unacceptable.
You deploy an LLM judge to rate each conversation turn for hallucination. It reports a hallucination rate of 5%.

The users and the LLM judge disagree. You decide to dig deeper.

The manual investigation

You budget for 40 conversations to annotate manually — expensive but accurate ground truth. Because annotators review full conversations, this covers roughly 330 individual turns. Annotators find that ~10% of turns contain a blatant hallucination.

That is double what the LLM reports. The judge is systematically optimistic.

The challenge

You now have:

8,000 LLM judgements across 800 conversations — cheap and fast, but biased
330 human annotations across 40 conversations — accurate, but covering only a small portion of your data

Ignoring the conversation structure would underestimate the uncertainty. Cluster PPI++ accounts for the within-conversation correlation and combines both sources to produce a reliable, unbiased estimate of the true hallucination rate.
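
To make the effect of ignoring conversation structure concrete, here is a minimal sketch in plain Python on a small hand-built dataset (not the GLIDE simulator): 40 conversations of 10 turns, with all 40 hallucinated turns concentrated in 5 conversations. The helper names are illustrative, and the cluster-level formula assumes equal-sized conversations.

```python
import math
import statistics

# Hand-built toy data: 40 conversations x 10 turns, 10% hallucination rate overall,
# with the hallucinations concentrated in 5 "bad" conversations (8 of 10 turns each).
conversations = [[1] * 8 + [0] * 2 for _ in range(5)] + [[0] * 10 for _ in range(35)]
turns = [label for conv in conversations for label in conv]


def naive_standard_error(labels):
    """Standard error that wrongly treats every turn as independent."""
    p = sum(labels) / len(labels)
    return math.sqrt(p * (1 - p) / len(labels))


def cluster_standard_error(clusters):
    """Standard error that treats each conversation as one sampling unit.

    Assumes equal-sized clusters, so the mean of cluster means equals the turn-level mean.
    """
    cluster_means = [sum(c) / len(c) for c in clusters]
    return statistics.stdev(cluster_means) / math.sqrt(len(cluster_means))


naive_se = naive_standard_error(turns)
cluster_se = cluster_standard_error(conversations)
print(f"hallucination rate:           {sum(turns) / len(turns):.3f}")
print(f"naive SE (turns independent): {naive_se:.4f}")
print(f"cluster SE (conversations):   {cluster_se:.4f}")
```

On this toy data the cluster-aware standard error is more than twice the naive one, so an interval built from the naive figure would be far too narrow.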

%matplotlib inline

import matplotlib.pyplot as plt
import numpy as np

from glide.estimators import ClusterClassicalMeanEstimator, ClusterPPIMeanEstimator
from glide.samplers import UniformClusterSampler
from glide.simulators import generate_clustered_binary_dataset, simulate_annotation

# ── Colour palette ──────────────────────────────────────────
C_JUDGE = "#E74C3C"  # LLM judge  — red-orange
C_HUMAN = "#2E86AB"  # Human-only — blue
C_GLIDE = "#27AE60"  # GLIDE      — green
C_TRUTH = "#2C3E50"  # True value — dark slate

# ── Global plot style ───────────────────────────────────────
plt.rcParams.update(
    {
        "figure.facecolor": "white",
        "axes.facecolor": "#FAFAFA",
        "axes.grid": True,
        "grid.color": "#E5E5E5",
        "grid.linewidth": 0.8,
        "font.size": 18,
        "axes.labelsize": 18,
        "axes.titlesize": 18,
        "legend.fontsize": 16,
        "xtick.labelsize": 16,
        "ytick.labelsize": 16,
        "figure.titlesize": 19,
    }
)

Simulating a clustered dataset with a biased judge

generate_clustered_binary_dataset produces synthetic data that mirrors the scenario above, with ground-truth labels available for all samples. Each row represents a single conversation turn; turns are grouped into conversations via the clusters array.

The simulation replicates the practical workflow: proxy labels are generated for every turn, a random subset of conversations is selected for human annotation (the budget is specified as a number of conversations), and the remaining ground-truth labels are masked with np.nan to reflect what would be unobserved in practice.

Generate synthetic data with generate_clustered_binary_dataset: proxy labels cover all turns.
Sample a subset of conversations for annotation with UniformClusterSampler.
Annotate selected turns and mask the rest via simulate_annotation.

The table below summarizes the arrays produced and what each contains.

Array	Meaning
`y_true_oracle`	Ground-truth labels for all turns (revealed only after annotation)
`y_proxy`	Proxy predictions for all turns
`clusters`	Cluster (conversation) ID for each turn
`xi`	Annotation indicator: 1 if the turn's conversation was annotated, 0 otherwise
`y_true`	Observed labels: ground-truth where annotated, `np.nan` elsewhere
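
The annotation-by-conversation step can be pictured with a plain-Python sketch. This is not GLIDE's implementation: `sample_clusters` and `mask_labels` are hypothetical helpers that only mirror what the `xi` and `y_true` rows of the table describe, on invented toy data.

```python
import math
import random


def sample_clusters(clusters, n_clusters, seed):
    """Return one indicator per turn: 1 if its conversation was drawn, else 0."""
    rng = random.Random(seed)
    chosen = set(rng.sample(sorted(set(clusters)), n_clusters))
    return [1 if c in chosen else 0 for c in clusters]


def mask_labels(y_true_oracle, xi):
    """Keep ground truth on annotated turns and hide the rest behind NaN."""
    return [float(y) if flag else math.nan for y, flag in zip(y_true_oracle, xi)]


# Toy data: 6 conversations with 3 turns each.
clusters = [c for c in range(6) for _ in range(3)]
y_true_oracle = [0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0]

xi = sample_clusters(clusters, n_clusters=2, seed=0)
y_true = mask_labels(y_true_oracle, xi)

annotated = sorted({c for c, flag in zip(clusters, xi) if flag})
print("annotated conversations:", annotated)
print("observed labels:", y_true)
```

Because sampling happens at the conversation level, every turn of a drawn conversation is annotated and every turn of the others stays masked.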

N_TOTAL = 8000
N_CLUSTERS = 800
N_LABELED = 40
RANDOM_SEED = 14

y_true_oracle, y_proxy, clusters = generate_clustered_binary_dataset(
    n_total=N_TOTAL,
    n_clusters=N_CLUSTERS,
    true_mean=0.10,
    proxy_mean=0.05,
    correlation=0.65,
    random_seed=RANDOM_SEED,
)
xi = UniformClusterSampler().sample(clusters, n_clusters=N_LABELED, random_seed=RANDOM_SEED)
y_true = simulate_annotation(y_true_oracle, xi)

labeled_mask = ~np.isnan(y_true)
n_labeled = np.sum(labeled_mask)
n_unlabeled = len(y_true) - n_labeled

print(f"Total turns              : {len(y_true):,}")
print(f"Total conversations      : {N_CLUSTERS:,}")
print(f"  Annotated conversations: {N_LABELED:,}")
print(f"  Annotated turns        : {n_labeled:,}")
print(f"  LLM-judged only (turns): {n_unlabeled:,}")
print()
print("Sample values")
print(f"  y_true (annotated turn):    {y_true[labeled_mask][0]}")
print(f"  y_true (unannotated turn):  {y_true[~labeled_mask][0]}")
print(f"  y_proxy (all turns):        {y_proxy[0]}")

Total turns              : 8,000
Total conversations      : 800
  Annotated conversations: 40
  Annotated turns        : 330
  LLM-judged only (turns): 7,670

Sample values
  y_true (annotated turn):    0.0
  y_true (unannotated turn):  nan
  y_proxy (all turns):        0.0

Two naive strategies both fail

Two obvious approaches to estimating the true hallucination rate each have a fatal flaw:

Option A: Trust the judge on all turns. Precise (large sample), but the judge's systematic bias makes the estimate wrong.

Option B: Trust only the human annotations. Unbiased, but the 95% confidence interval is very wide because only a few conversations were annotated.

Cluster PPI++, introduced in the next section, fixes both problems simultaneously.

# Option A: LLM judge — average proxy labels over all conversations
judge_estimate = ClusterClassicalMeanEstimator().estimate(y_proxy, clusters)
judge_mean = judge_estimate.mean
judge_lower_bound = judge_estimate.confidence_interval.lower_bound
judge_upper_bound = judge_estimate.confidence_interval.upper_bound

# Option B: human labels only — average over labeled conversations
human_estimate = ClusterClassicalMeanEstimator().estimate(y_true, clusters)
human_mean = human_estimate.mean
human_lower_bound = human_estimate.confidence_interval.lower_bound
human_upper_bound = human_estimate.confidence_interval.upper_bound

sep = "-" * 70
print(sep)
print(f"{'Method':<34} {'Estimate':>8}   {'95% confidence interval':>16}")
print(sep)
print(f"{'LLM Judge  (800 conversations)':<34} {judge_mean:>7.1%}   [{judge_lower_bound:.1%}, {judge_upper_bound:.1%}]")
print(f"{'Human-only  (40 conversations)':<34} {human_mean:>7.1%}   [{human_lower_bound:.1%}, {human_upper_bound:.1%}]")
print(sep)
print(f"{'True rate  (simulation)':<34} {'10.0%':>8}")

----------------------------------------------------------------------
Method                             Estimate   95% confidence interval
----------------------------------------------------------------------
LLM Judge  (800 conversations)        3.9%   [3.2%, 4.5%]
Human-only  (40 conversations)        8.3%   [4.2%, 12.3%]
----------------------------------------------------------------------
True rate  (simulation)               10.0%

The root cause: the LLM judge is systematically biased

The gap is clear: on average, the judge reports fewer hallucinations than the human annotators do.

The Cluster PPI++ rectifier measures this systematic error on the annotated turns, then applies the correction across all turns.
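
As a rough illustration of that correction, the sketch below uses hand-built turn-level labels and ignores both the cluster structure and any weighting that PPI++ applies to the proxy term. It only shows the bias being measured on annotated turns and then removed from the judge's overall rate. All values are invented for the example, and some formulations take the proxy mean over the unannotated turns only.

```python
# Hand-built toy labels (1 = hallucination).
y_true_labeled = [1, 0, 0, 0, 1, 0, 0, 0, 0, 0]   # human labels on 10 annotated turns
y_proxy_labeled = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]  # judge labels on the same 10 turns
y_proxy_rest = [1, 1, 1] + [0] * 37               # judge labels on 40 unannotated turns
y_proxy_all = y_proxy_labeled + y_proxy_rest


def mean(values):
    return sum(values) / len(values)


# Systematic error of the judge, measured where both label sources exist.
bias = mean(y_proxy_labeled) - mean(y_true_labeled)

# Judge rate over every turn, shifted by the measured bias.
judge_rate_all = mean(y_proxy_all)
corrected_rate = judge_rate_all - bias

print(f"judge rate on all turns: {judge_rate_all:.2f}")
print(f"measured bias:           {bias:+.2f}")
print(f"corrected rate:          {corrected_rate:.2f}")
```

The sign convention matches the notebook's own `bias = p_mean - t_mean`: a negative bias means the judge under-reports, so subtracting it raises the estimate.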

labeled_mask = ~np.isnan(y_true)
y_true_filtered = y_true[labeled_mask]
y_proxy_labeled = y_proxy[labeled_mask]

p_mean = np.mean(y_proxy_labeled)
t_mean = np.mean(y_true_filtered)
bias = p_mean - t_mean

fig, ax = plt.subplots(figsize=(7, 4.5))

x_pos = np.array([0, 1])
bar_vals = [p_mean, t_mean]
bar_colors = [C_JUDGE, C_HUMAN]
bar_labels = ["LLM Judge\n on annotated subset", "Human Annotation\n(ground truth)"]

ax.bar(x_pos, bar_vals, color=bar_colors, width=0.45, edgecolor="white", linewidth=2, zorder=3)

# Loop variable is named `x` so it does not overwrite the `xi` annotation indicator array.
for x, val, c in zip(x_pos, bar_vals, bar_colors):
    ax.text(x, val + 0.005, f"{val:.1%}", ha="center", fontsize=14, fontweight="bold", color=c)

ax.annotate(
    "", xy=(1, t_mean + 0.010), xytext=(0, p_mean + 0.010), arrowprops=dict(arrowstyle="<->", color="#666666", lw=2.5)
)
ax.text(
    0.5,
    max(bar_vals) + 0.033,
    f"Bias = {bias:+.1%}",
    ha="center",
    fontsize=12,
    color="#555555",
    fontstyle="italic",
    bbox=dict(boxstyle="round,pad=0.35", facecolor="#FFFDE7", edgecolor="#CCCCCC"),
)

ax.axhline(0.10, color=C_TRUTH, linestyle="--", linewidth=1.8, zorder=2, label="True rate: 10%")

ax.set_xticks(x_pos)
ax.set_xticklabels(bar_labels)
ax.set_ylabel("Hallucination Rate")
ax.set_ylim(0, 0.26)
ax.set_xlim(-0.5, 1.5)
ax.yaxis.set_major_formatter(plt.FuncFormatter(lambda v, _: f"{v:.0%}"))
ax.spines[["top", "right"]].set_visible(False)
ax.legend(loc="upper right")
plt.tight_layout()
plt.show()

Cluster PPI++ corrects biased labels by leveraging true ones

ClusterPPIMeanEstimator implements Cluster PPI++, a cluster-aware extension of Prediction-Powered Inference (PPI++). It combines:

n annotated turns from the labeled conversations (with human annotations and LLM judge labels)
N turns from unlabeled conversations (LLM judge labels only)

It accounts for the within-conversation correlation structure, so the confidence interval correctly reflects the variance due to sampling whole conversations. The more the judge agrees with human annotators, the more it reduces the uncertainty of the final estimate, so a better-calibrated judge directly translates into cost savings.
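
A minimal sketch of how such an interval could be assembled, assuming equal-sized conversations and a proxy weight fixed at 1 (PPI++ tunes this weight, which is omitted here). It works on per-conversation means so that each conversation counts as one observation. The numbers and function names are illustrative and are not taken from GLIDE, and with so few conversations the normal approximation is only indicative.

```python
import math
import statistics
from statistics import NormalDist

# Per-conversation hallucination rates (toy values, equal-sized conversations assumed).
true_labeled = [0.2, 0.0, 0.1, 0.3]    # human labels, annotated conversations
proxy_labeled = [0.1, 0.0, 0.1, 0.2]   # judge labels, same conversations
proxy_unlabeled = [0.0, 0.1, 0.0, 0.1, 0.0, 0.1, 0.0, 0.1]  # judge labels, other conversations


def interval(estimate, variance, confidence_level=0.95):
    """Normal-approximation interval around an estimate."""
    z = NormalDist().inv_cdf(0.5 + confidence_level / 2)
    half_width = z * math.sqrt(variance)
    return estimate, estimate - half_width, estimate + half_width


def cluster_ppi_mean(true_lab, proxy_lab, proxy_unlab):
    """Proxy mean on unlabeled conversations plus the rectifier from labeled ones."""
    rectifier = [t - p for t, p in zip(true_lab, proxy_lab)]
    estimate = statistics.fmean(proxy_unlab) + statistics.fmean(rectifier)
    variance = (
        statistics.variance(proxy_unlab) / len(proxy_unlab)
        + statistics.variance(rectifier) / len(rectifier)
    )
    return interval(estimate, variance)


def cluster_classical_mean(true_lab):
    """Human-only estimate from conversation-level means."""
    variance = statistics.variance(true_lab) / len(true_lab)
    return interval(statistics.fmean(true_lab), variance)


ppi = cluster_ppi_mean(true_labeled, proxy_labeled, proxy_unlabeled)
classical = cluster_classical_mean(true_labeled)
print(f"PPI sketch:  {ppi[0]:.3f}  [{ppi[1]:.3f}, {ppi[2]:.3f}]")
print(f"Human-only:  {classical[0]:.3f}  [{classical[1]:.3f}, {classical[2]:.3f}]")
```

The interval narrows because the rectifier varies less across conversations than the human labels themselves, which is the mechanism the paragraph above describes.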

estimator = ClusterPPIMeanEstimator()

cluster_ppi_result = estimator.estimate(
    y_true,
    y_proxy,
    clusters,
    metric_name="Hallucination Rate",
    confidence_level=0.95,
)

print(cluster_ppi_result.summary())

Metric: Hallucination Rate
Point Estimate: 0.084
Confidence Interval (95%): [0.048, 0.121]
Estimator : ClusterPPIMeanEstimator
n_true: 330
n_proxy: 8000
Effective Sample Size: 405

When judge and human labels are correlated, the difference $y_{\text{true}} - y_{\text{proxy}}$ has low variance at the turn level, so the rectifier adds little noise. The effective sample size quantifies the resulting variance reduction: it is the number of annotated turns a human-only strategy would need to match the variance of Cluster PPI++. Here it is 405, compared with the 330 turns actually annotated.
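
As a sanity check, the reported effective sample size can be approximately recovered from the printed intervals, under the assumption that it is the annotated-turn count scaled by the ratio of human-only variance to Cluster PPI++ variance (which is how the paragraph above describes it). Interval width is proportional to the standard error, so the variance ratio is the squared width ratio. The bounds are the rounded values printed in this notebook, so the result only matches to within rounding.

```python
n_true = 330  # annotated turns

human_lower, human_upper = 0.042, 0.123  # human-only 95% interval (rounded)
ppi_lower, ppi_upper = 0.048, 0.121      # Cluster PPI++ 95% interval (rounded)

# Width is proportional to the standard error, so squared widths compare variances.
variance_ratio = ((human_upper - human_lower) / (ppi_upper - ppi_lower)) ** 2
effective_sample_size = n_true * variance_ratio

print(f"variance ratio:        {variance_ratio:.2f}")
print(f"effective sample size: {effective_sample_size:.0f}")
```

The result lands within a couple of turns of the 405 reported by `summary()`, which is consistent with the variance-ratio reading.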

Cluster PPI++ delivers an unbiased estimate at low cost

The plot below compares point estimates and 95% confidence intervals for all three methods against the true hallucination rate (dashed line):

LLM judge: very narrow confidence interval, but wrong.
Human-only: unbiased, but the confidence interval is wide because only a few conversations were annotated.
Cluster PPI++: unbiased, with a narrower interval than human-only ([4.8%, 12.1%] versus [4.2%, 12.3%] in this run), combining the accuracy of human labels with the precision of many proxy judgements while correctly accounting for within-conversation correlation.

TRUE_RATE = 0.10

estimates = [
    (
        f"LLM Judge\n({cluster_ppi_result.n_proxy}  |  raw proxy)",
        judge_mean,
        judge_lower_bound,
        judge_upper_bound,
        C_JUDGE,
    ),
    (
        f"Human Annotation\n({cluster_ppi_result.n_true} |  small sample)",
        human_mean,
        human_lower_bound,
        human_upper_bound,
        C_HUMAN,
    ),
    (
        f"Cluster PPI++ (GLIDE)\n({cluster_ppi_result.n_true}  +  {cluster_ppi_result.n_proxy})\n(full data exploited)",
        cluster_ppi_result.mean,
        cluster_ppi_result.confidence_interval.lower_bound,
        cluster_ppi_result.confidence_interval.upper_bound,
        C_GLIDE,
    ),
]

fig, ax = plt.subplots(figsize=(11, 5.5))
y_pos = [2, 1, 0]

for y, (label, mean, lo, hi, color) in zip(y_pos, estimates):
    # Confidence interval line
    ax.plot([lo, hi], [y, y], color=color, linewidth=4, solid_capstyle="round", zorder=3)
    # Cap marks
    for xc in [lo, hi]:
        ax.plot([xc, xc], [y - 0.2, y + 0.2], color=color, linewidth=2.5, zorder=3)
    # Point estimate
    ax.scatter(mean, y, s=200, color=color, zorder=5, edgecolors="white", linewidths=2.5)
    # Value label above
    ax.text(mean, y + 0.34, f"{mean:.1%}", ha="center", va="bottom", fontsize=12, color=color, fontweight="bold")
    # Confidence interval bounds below
    ax.text(mean, y - 0.34, f"[{lo:.1%},  {hi:.1%}]", ha="center", va="top", fontsize=11, color="#888888")

# True value
ax.axvline(TRUE_RATE, color=C_TRUTH, linestyle="--", linewidth=2.5, zorder=4)
ax.text(TRUE_RATE + 0.004, 2.72, "True rate  10%", color=C_TRUTH, fontsize=10.5, fontweight="bold")

ax.set_yticks(y_pos)
ax.set_yticklabels([e[0] for e in estimates])
ax.xaxis.set_major_formatter(plt.FuncFormatter(lambda v, _: f"{v:.0%}"))
ax.set_xlabel("Estimated Hallucination Rate")
ax.set_xlim(-0.01, 0.26)
ax.set_ylim(-0.8, 3.2)
ax.spines[["top", "right", "left"]].set_visible(False)
ax.tick_params(left=False)
plt.tight_layout()
plt.show()

Testing whether the hallucination rate is within acceptable limits

GLIDE's estimation output enables a formal hypothesis test: is the true hallucination rate significantly higher than 5%?

$$H_0 : \mu = 5\% \qquad H_1 : \mu > 5\%$$

Is our system's hallucination rate sufficiently above the business tolerance of 5% to act on?

Without Cluster PPI++, the LLM judge gives a misleadingly narrow confidence interval that sits below 5%, so it would not reject the null hypothesis. Human-only annotations cover only a few conversations and produce a wide confidence interval, leading to conservative decisions.

Cluster PPI++ combines both sources to perform accurate hypothesis testing.

z_stat, p_value, _ = cluster_ppi_result.confidence_interval.test_null_hypothesis(
    h0_value=0.05,  # LLM judge's claimed rate
    alternative="larger",  # H1: true rate > 5%
)

sep = "-" * 48
print("Hypothesis test — PPI++ Estimator")
print(sep)
print("H0 : hallucination rate = 5%   (LLM says so)")
print("H1 : hallucination rate > 5%   (users complain!)")
print()
print(f"z-statistic : {z_stat:.2f}")
print(f"p-value     : {p_value:.10f}")
print()
if p_value < 0.05:
    print("Decision  : Given the p-value is lower than our threshold (p-value < 0.05), we reject H0 at the 5% level.")
    print("=> The true hallucination rate is significantly above 5%.")
else:
    print(
        "Decision  : Given the p-value is higher than our threshold (p-value >= 0.05), "
        " we cannot reject H0 at the 5% level."
    )

Hypothesis test — PPI++ Estimator
------------------------------------------------
H0 : hallucination rate = 5%   (LLM says so)
H1 : hallucination rate > 5%   (users complain!)

z-statistic : 1.84
p-value     : 0.0329014524

Decision  : Given the p-value is lower than our threshold (p-value < 0.05), we reject H0 at the 5% level.
=> The true hallucination rate is significantly above 5%.

The null hypothesis is rejected (p ≈ 0.033), indicating that the hallucination rate is significantly above the 5% threshold.

Let us try the same hypothesis test using human annotations only.

human_labels_ci = human_estimate.confidence_interval
z_stat_cm, p_value_cm, _ = human_labels_ci.test_null_hypothesis(
    h0_value=0.05,  # LLM judge's claimed rate
    alternative="larger",  # H1: true rate > 5%
)

sep = "-" * 48
print("Hypothesis test — Classical Mean Estimator (Human labels only)")
print(sep)
print("H0 : hallucination rate = 5%   (LLM says so)")
print("H1 : hallucination rate > 5%   (users complain!)")
print()
print(f"z-statistic : {z_stat_cm:.2f}")
print(f"p-value     : {p_value_cm:.10f}")
print()
if p_value_cm < 0.05:
    print("Decision  : Given the p-value is lower than our threshold (p-value < 0.05), we reject H0 at the 5% level.")
    print("=> The true hallucination rate is significantly above 5%.")
else:
    print(
        "Decision  : Given the p-value is higher than our threshold (p-value >= 0.05), "
        " we cannot reject H0 at the 5% level."
    )

Hypothesis test — Classical Mean Estimator (Human labels only)
------------------------------------------------
H0 : hallucination rate = 5%   (LLM says so)
H1 : hallucination rate > 5%   (users complain!)

z-statistic : 1.58
p-value     : 0.0566632362

Decision  : Given the p-value is higher than our threshold (p-value >= 0.05),  we cannot reject H0 at the 5% level.

With human labels alone, the null hypothesis is not rejected (p ≈ 0.057) because the uncertainty is too high, so the same conclusion cannot be drawn. As shown above, Cluster PPI++ fixes this by leveraging the LLM judge labels to reduce uncertainty while properly accounting for the cluster structure.
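
Both z-statistics can be roughly reproduced from the printed estimates, assuming the test is a one-sided z-test whose standard error equals the 95% interval half-width divided by the normal quantile. That assumption is inferred from the reported output, not from GLIDE's source, and the inputs are rounded, so the figures agree only approximately. `one_sided_z_test` is a hypothetical helper.

```python
from statistics import NormalDist


def one_sided_z_test(estimate, lower, upper, h0_value, confidence_level=0.95):
    """Return (z, p) for H1: mean > h0_value, recovering the SE from the interval."""
    quantile = NormalDist().inv_cdf(0.5 + confidence_level / 2)
    standard_error = (upper - lower) / (2 * quantile)
    z = (estimate - h0_value) / standard_error
    p_value = 1 - NormalDist().cdf(z)
    return z, p_value


# Rounded values printed earlier in this notebook.
ppi_z, ppi_p = one_sided_z_test(0.084, 0.048, 0.121, h0_value=0.05)
human_z, human_p = one_sided_z_test(0.083, 0.042, 0.123, h0_value=0.05)

print(f"Cluster PPI++ : z = {ppi_z:.2f}, p = {ppi_p:.3f}")
print(f"Human-only    : z = {human_z:.2f}, p = {human_p:.3f}")
```

The two point estimates are almost identical, so the different decisions come from the standard error alone: the tighter Cluster PPI++ interval pushes the p-value just below 0.05, while the human-only one stays just above.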

Summary: Cluster PPI++ combines accuracy and precision

	LLM Judge	Human-only	Cluster PPI++
Conversations covered	800	40	800
Turns covered	8,000	330	8,000
Unbiased estimate	❌	✅	✅
Narrow confidence interval	🟠 (misleading)	❌	✅
Labeling cost	Low	High	Small

Key takeaways:

LLM judges are biased. A narrow confidence interval around the wrong value is worse than useless: it gives false confidence.
Annotating 40 conversations was enough here. The rectifier uses information from 8,000 cheap proxy labels to shrink the confidence interval relative to human-only estimation, which was sufficient to reject the 5% null hypothesis at the 5% level.
Cluster PPI++ efficiency relies on conversation count and LLM-judge quality. To shrink the confidence interval further, annotate more conversations or improve the LLM judge's calibration.
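
For planning, a back-of-the-envelope sketch: if the interval half-width shrinks with the square root of the number of annotated conversations (a standard large-sample approximation, and one that assumes the annotation-driven term dominates the variance), the budget needed for a target width follows directly. `required_conversations` is a hypothetical helper, not a GLIDE function.

```python
import math


def required_conversations(current_conversations, current_half_width, target_half_width):
    """Conversations needed for a target half-width under 1/sqrt(n) scaling."""
    scale = (current_half_width / target_half_width) ** 2
    return math.ceil(current_conversations * scale)


# Cluster PPI++ interval printed in this notebook: [0.048, 0.121] from 40 conversations.
current_half_width = (0.121 - 0.048) / 2
needed = required_conversations(40, current_half_width, target_half_width=0.02)
print(f"current half-width: {current_half_width:.4f}")
print(f"conversations needed for a half-width of 0.02: {needed}")
```

Treat the output as an order-of-magnitude guide: a better-calibrated judge would lower the requirement, and the scaling stops helping once the proxy term dominates.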

Want to go further? The Cluster PPI scientific validation notebook provides rigorous empirical evidence — coverage validity and confidence interval width across a sweep of proxy correlations and confidence levels.
