Quickstart
GLIDE combines a small set of true evaluation labels (e.g. human annotations) with a large pool of proxy evaluation labels (e.g. LLM-as-judge annotations) to produce a bias-corrected metric with statistically valid confidence intervals.
| Ingredient | Role |
|---|---|
| True labels | Ground truth — accurate but slow and expensive |
| Proxy labels | Proxy — cheap and fast but biased |
```python
import numpy as np

from glide.estimators.ppi import PPIMeanEstimator
from glide.io import to_json
from glide.samplers import UniformSampler
from glide.simulators import generate_binary_dataset, simulate_annotation
```
Step 1 — Assemble Your Data
GLIDE works with numpy arrays:
- y_true: labeled ground truth values, with np.nan for unlabeled rows
- y_proxy: proxy predictions for all rows
We generate all 1000 labels, then mask 900 of them to simulate the unobserved ground truths. The true rate is 15% and the proxy rate is 10% (biased low).
```python
y_true_oracle, y_proxy = generate_binary_dataset(
    n_total=1000,
    true_mean=0.15,  # true rate
    proxy_mean=0.10,  # proxy rate (biased)
    correlation=0.70,
    random_seed=0,
)
xi = UniformSampler().sample(n_samples=len(y_true_oracle), budget=100, random_seed=0)
y_true = simulate_annotation(y_true_oracle, xi)

# Count labeled vs unlabeled
n_labeled = np.sum(~np.isnan(y_true))
n_unlabeled = len(y_true) - n_labeled
print(f"Total observations: {len(y_true)}")
print(f"  Labeled  : {n_labeled}")
print(f"  Unlabeled: {n_unlabeled}")
print()
print("Sample values")
# With this seed, the last row is labeled and the first row is unlabeled.
print(f"  y_true (labeled)  : {y_true[-1]}")
print(f"  y_true (unlabeled): {y_true[0]}")
print(f"  y_proxy           : {y_proxy[0]}")
```

```
Total observations: 1000
  Labeled  : 100
  Unlabeled: 900

Sample values
  y_true (labeled)  : 0.0
  y_true (unlabeled): nan
  y_proxy           : 0.0
```
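The masking step can also be pictured without GLIDE. The sketch below is a plain NumPy illustration of the y_true convention (values for labeled rows, NaN elsewhere). It is not how UniformSampler or simulate_annotation are implemented, and the names y_full, is_labeled, and y_masked are made up for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical fully labeled data: 1000 binary outcomes with a rate near 15%.
y_full = (rng.random(1000) < 0.15).astype(float)

# Pick 100 rows uniformly at random, without replacement, to keep labeled.
labeled_idx = rng.choice(len(y_full), size=100, replace=False)
is_labeled = np.zeros(len(y_full), dtype=bool)
is_labeled[labeled_idx] = True

# Hide the other 900 labels behind NaN, matching the y_true convention.
y_masked = np.where(is_labeled, y_full, np.nan)

n_labeled = int(np.sum(~np.isnan(y_masked)))
print(n_labeled, len(y_masked) - n_labeled)  # 100 900
```

Any array built this way can be passed wherever the quickstart uses y_true, as long as y_proxy has the same length.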
Step 2 — Estimate the Metric
PPIMeanEstimator.estimate() corrects the proxy's bias using the labeled subset, then applies the correction across the full dataset.
```python
result = PPIMeanEstimator().estimate(
    y_true,
    y_proxy,
    metric_name="Evaluated metric",  # e.g. a hallucination rate
    confidence_level=0.95,
)
print(result.summary())
```

```
Metric: Evaluated metric
Point Estimate: 0.138
Confidence Interval (95%): [0.081, 0.194]
Estimator : PPIMeanEstimator
n_true: 100
n_proxy: 1000
Effective Sample Size: 152
```
The effective sample size is the number of true labels you would need to reach the same precision using true labels alone. Here, 100 true labels combined with 1000 proxy labels are worth about 152 true labels.
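To make the bias-correction idea concrete, here is a minimal sketch of the textbook prediction-powered mean estimator in plain NumPy. It is an illustration under stated assumptions, not GLIDE's implementation: PPIMeanEstimator may use a different variant, so this sketch is not expected to reproduce the numbers above. The function name ppi_mean_sketch, the synthetic data, and the effective-sample-size formula (variance of the true labels divided by the estimator's variance) are all assumptions made for this example.

```python
import numpy as np


def ppi_mean_sketch(y_true, y_proxy):
    """Textbook PPI mean estimate, standard error, and an effective sample size."""
    y_true = np.asarray(y_true, dtype=float)
    y_proxy = np.asarray(y_proxy, dtype=float)
    labeled = ~np.isnan(y_true)

    # Proxy bias, measured on the rows where a true label exists.
    rectifier = y_true[labeled] - y_proxy[labeled]
    # Proxy mean on the unlabeled rows, so the two terms use disjoint rows.
    proxy_unlabeled = y_proxy[~labeled]

    estimate = proxy_unlabeled.mean() + rectifier.mean()
    variance = (
        proxy_unlabeled.var(ddof=1) / proxy_unlabeled.size
        + rectifier.var(ddof=1) / rectifier.size
    )
    std_error = float(np.sqrt(variance))

    # Assumed definition: labels needed for a labels-only mean to reach std_error.
    ess = y_true[labeled].var(ddof=1) / variance
    return float(estimate), std_error, float(ess)


# Synthetic demo data (not the GLIDE simulator): a proxy that misses ~40% of positives.
rng = np.random.default_rng(0)
y = (rng.random(1000) < 0.15).astype(float)
proxy = y * (rng.random(1000) < 0.6)
y_obs = y.copy()
y_obs[100:] = np.nan  # keep only the first 100 true labels

est, se, ess = ppi_mean_sketch(y_obs, proxy)
print(f"estimate={est:.3f}  se={se:.3f}  ess={ess:.0f}")
```

The proxy mean on its own would be biased low in this demo. The rectifier term adds back the average gap between true and proxy labels, and the better the proxy tracks the truth, the smaller the rectifier's variance and the larger the effective sample size.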
InferenceResult contains the point estimate, standard error and confidence interval.
```python
# Point estimate and confidence interval
ci = result.confidence_interval
print(f"Estimate : {result.mean:.3f}")
print(f"Standard error : {result.std:.3f}")
print(f"95% Confidence Interval : [{ci.lower_bound:.3f}, {ci.upper_bound:.3f}]")
```

```
Estimate : 0.138
Standard error : 0.029
95% Confidence Interval : [0.081, 0.194]
```
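The reported interval is consistent with a standard normal-approximation interval, mean ± z × standard error. The sketch below rebuilds it from the full-precision mean and std values shown in the JSON export in Step 3. It uses only the Python standard library and assumes, without confirmation from the GLIDE source, that this is how the bounds are derived.

```python
from statistics import NormalDist

mean = 0.13752497502497507  # result.mean, full precision
std = 0.029029303223476133  # result.std (the standard error)
confidence_level = 0.95

# Two-sided critical value: about 1.96 for a 95% interval.
z_crit = NormalDist().inv_cdf(1 - (1 - confidence_level) / 2)

lower = mean - z_crit * std
upper = mean + z_crit * std
print(f"[{lower:.3f}, {upper:.3f}]")  # [0.081, 0.194]
```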
Step 3 — Export as JSON
The inference results can be exported in JSON format using the to_json utility function.
```python
print(to_json(result))
```

```json
{
  "confidence_interval": {
    "confidence_level": 0.95,
    "lower_bound": 0.08062858621066935,
    "upper_bound": 0.1944213638392808,
    "width": 0.11379277762861144
  },
  "metric_name": "Evaluated metric",
  "estimator_name": "PPIMeanEstimator",
  "n_true": 100,
  "n_proxy": 1000,
  "effective_sample_size": 152,
  "mean": 0.13752497502497507,
  "std": 0.029029303223476133
}
```
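Downstream tools can read the exported document with any JSON parser. The sketch below parses the output shown above with the standard json module and derives two quantities from it. It assumes to_json returns a JSON string, as the print call suggests; the payload is copied from the output above rather than produced by calling GLIDE.

```python
import json

# Payload copied from the export above.
payload = """
{
  "confidence_interval": {
    "confidence_level": 0.95,
    "lower_bound": 0.08062858621066935,
    "upper_bound": 0.1944213638392808,
    "width": 0.11379277762861144
  },
  "metric_name": "Evaluated metric",
  "estimator_name": "PPIMeanEstimator",
  "n_true": 100,
  "n_proxy": 1000,
  "effective_sample_size": 152,
  "mean": 0.13752497502497507,
  "std": 0.029029303223476133
}
"""

record = json.loads(payload)
ci = record["confidence_interval"]

# The width field is the distance between the two bounds.
width = ci["upper_bound"] - ci["lower_bound"]

# Ratio of effective sample size to true labels actually collected.
gain = record["effective_sample_size"] / record["n_true"]
print(f"width={width:.4f}  gain={gain:.2f}")  # width=0.1138  gain=1.52
```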
Step 4 — Check if you reached the target
InferenceResult also supports hypothesis testing on the estimate. For example, we can run a one-sided test of whether the rate is significantly above a target limit of 10%.
```python
z, p, _ = result.confidence_interval.test_null_hypothesis(
    h0_value=0.10,
    alternative="larger",
)
print("H0: rate <= 10% | H1: rate > 10%")
print(f"z = {z:.3f} | p = {p:.3f}")
print("Reject H0: the rate is significantly above 10%." if p < 0.05 else "Cannot reject H0 at the 5% level.")
```

```
H0: rate <= 10% | H1: rate > 10%
z = 1.293 | p = 0.098
Cannot reject H0 at the 5% level.
```
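H0 is not rejected: p = 0.098 is above 0.05, so the data do not show that the hallucination rate is significantly above the business tolerance of 10%. This does not prove the rate is below 10% either, since the confidence interval [0.081, 0.194] contains values on both sides of the limit. In a real-world example, rejecting H0 would mean the system is not production-ready; here the result is inconclusive.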
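The z statistic and p-value printed above match a one-sided normal test computed by hand from the point estimate and standard error. The sketch below shows that arithmetic using only the standard library. It assumes test_null_hypothesis uses a normal approximation, which the printed numbers are consistent with but which is not confirmed here.

```python
import math

mean = 0.13752497502497507  # point estimate from Step 3
std = 0.029029303223476133  # standard error from Step 3
h0_value = 0.10

# Distance between the estimate and the H0 value, in standard errors.
z = (mean - h0_value) / std

# One-sided p-value for H1: rate > h0_value, i.e. P(Z >= z) for a standard normal Z.
p = 0.5 * math.erfc(z / math.sqrt(2))

print(f"z = {z:.3f} | p = {p:.3f}")  # z = 1.293 | p = 0.098
```

The estimate sits about 1.3 standard errors above the limit, which is short of the roughly 1.645 needed for a one-sided test at the 5% level.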
Using Your Own Data
Replace generate_binary_dataset with your own numpy arrays. Use np.nan for unlabeled rows in y_true.
```python
import json

import numpy as np
import pandas as pd
import polars as pl

# Option 1: from a JSON file (a list of records; a missing or null "y_true" becomes NaN)
with open("annotations.json") as f:
    data = json.load(f)
y_true = np.array([d.get("y_true", np.nan) for d in data], dtype=float)
y_proxy = np.array([d["y_proxy"] for d in data])

# Option 2: from a pandas DataFrame
df = pd.read_csv("your_data.csv")
y_true = df["label"].values.astype(float)  # NaN for unlabeled
y_proxy = df["prediction"].values

# Option 3: from a polars DataFrame
df = pl.read_csv("your_data.csv")
y_true = df["label"].to_numpy().astype(float)
y_proxy = df["prediction"].to_numpy()
```
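Before passing your own arrays to an estimator, it can help to check that they follow the conventions above. The helper below is a hypothetical pre-flight check written for this guide. check_glide_inputs is not part of GLIDE, and the rules it enforces (matching 1-D shapes, no missing proxy values, at least one true label) are assumptions drawn from the description of y_true and y_proxy in Step 1.

```python
import numpy as np


def check_glide_inputs(y_true, y_proxy):
    """Hypothetical validation helper: returns float arrays and the labeled count."""
    # dtype=float turns None entries into NaN.
    y_true = np.asarray(y_true, dtype=float)
    y_proxy = np.asarray(y_proxy, dtype=float)

    if y_true.ndim != 1 or y_true.shape != y_proxy.shape:
        raise ValueError("y_true and y_proxy must be 1-D arrays of equal length")
    if np.isnan(y_proxy).any():
        raise ValueError("y_proxy must have a value for every row")

    n_labeled = int((~np.isnan(y_true)).sum())
    if n_labeled == 0:
        raise ValueError("y_true needs at least one labeled row")
    return y_true, y_proxy, n_labeled


y_true_checked, y_proxy_checked, n_labeled = check_glide_inputs(
    [1.0, None, 0.0, None],  # None marks unlabeled rows
    [1, 0, 0, 1],
)
print(n_labeled, y_true_checked)
```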