GLIDE

Generated Label Inference & Debiasing Engine

🧭 What is GLIDE?

GLIDE is a Python library for rigorous evaluation of GenAI systems using hybrid human/proxy annotations.

GLIDE implements methods from prediction-powered inference (PPI), a family of statistical techniques that combine a small set of human-labeled data with a large set of proxy-labeled data to produce valid, debiased estimates. See the implemented algorithms below.

[Figure: prediction-powered inference schema]

🤔 Why GLIDE?

🤖 GenAI applications are everywhere — and imperfect. Deployed systems make mistakes, and measuring how often matters.
⚖️ LLM-as-judge is biased. Proxy evaluators (models, heuristics) are cheap, but they can systematically over- or underestimate true performance.
🧑 Rigorous evaluation requires a human in the loop. Ground-truth labels from humans are expensive, so only a small subset is feasible.
📐 GLIDE bridges the gap. It combines a small set of human annotations with a large set of proxy predictions to produce statistically valid metrics — correcting proxy bias without requiring full human labeling.
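
The idea in the last point can be sketched in a few lines of NumPy. This is a minimal illustration of the classic PPI mean estimator described in reference [1], not GLIDE's API: the function name `ppi_mean` and its arguments are hypothetical, and the interval relies on a normal approximation. The proxy mean over the large set is corrected by the average human-minus-proxy gap measured on the small labeled set.

```python
import numpy as np
from statistics import NormalDist


def ppi_mean(y_labeled, proxy_labeled, proxy_unlabeled, alpha=0.05):
    """Classic PPI estimate of a mean (hypothetical helper, not GLIDE's API).

    y_labeled:       human labels on the small labeled set (size n)
    proxy_labeled:   proxy labels on the same n items
    proxy_unlabeled: proxy labels on the large unlabeled set (size N)
    """
    y = np.asarray(y_labeled, dtype=float)
    f = np.asarray(proxy_labeled, dtype=float)
    f_big = np.asarray(proxy_unlabeled, dtype=float)
    n, N = len(y), len(f_big)

    # Rectifier: how far the proxy is from the human label on labeled items.
    rectifier = y - f

    # Proxy mean on the large set, shifted by the average rectifier.
    estimate = f_big.mean() + rectifier.mean()

    # Both terms are independent sample means, so their variances add.
    se = np.sqrt(f_big.var(ddof=1) / N + rectifier.var(ddof=1) / n)
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return estimate, (estimate - z * se, estimate + z * se)


# Toy data: the proxy says "correct" more often than the humans do.
human = [1, 0, 1, 1]
proxy_on_human_items = [1, 1, 1, 1]
proxy_on_rest = [1, 1, 0, 1, 1, 1, 0, 1]

estimate, interval = ppi_mean(human, proxy_on_human_items, proxy_on_rest)
print(estimate)  # 0.75 (proxy mean) - 0.25 (average over-estimate) = 0.5
print(interval)
```

With so few labeled items the interval is wide. The estimate tightens as the proxy agrees more closely with the human labels, because the variance of the rectifier shrinks.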

⚡ Quick Start

Install the package with your favorite package manager:

uv add glide-py

or

pip install glide-py

Then take a look at our practical quickstart.

📚 Documentation

Explore the full documentation — from practical tutorials and user guides to scientific deep dives into the methods behind GLIDE.

🤝 Contributing

Contributions are welcome! Please read the contributing guide for setup instructions, an architectural overview, and the checklist to follow before opening a pull request. Feel free to open an issue to report a bug or suggest a feature.

🔢 Versioning

This project follows Semantic Versioning (SemVer): MAJOR.MINOR.PATCH.

📦 Dependency Support

This project follows SPEC 0 for dependency support windows.

📄 License & Citation

This project is licensed under the Apache 2.0 License.

If you use GLIDE in your work, please cite us using the "Cite this repository" button on the GitHub repository page.

📚 Implemented Algorithms

| Name | Class | Reference Paper(s) | Original Implementation |
| --- | --- | --- | --- |
| Prediction-Powered Inference | `estimators.PPIMeanEstimator` (with `power_tuning=False`) | [1] | Link |
| PPI++ | `estimators.PPIMeanEstimator` | [2] | Link |
| Stratified Prediction-Powered Inference | `estimators.StratifiedPPIMeanEstimator` | [3] | — |
| Stratified Sampling | `samplers.StratifiedSampler` | [4] | Link |
| Active Statistical Inference | `estimators.ASIMeanEstimator` | [5], [6] | Link |
| Active Sampling | `samplers.ActiveSampler` | [5], [6] | Link |
| Predict-Then-Debias | `estimators.PTDMeanEstimator`, `estimators.StratifiedPTDMeanEstimator`, `estimators.IPWPTDMeanEstimator` | [7] | Link |
| Cluster Prediction-Powered Inference | `estimators.ClusterPPIMeanEstimator` | — | Link |
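
The first two rows differ only in power tuning. As a rough sketch of what that means for mean estimation, following the variance-minimizing weight described in reference [2], the proxy correction is scaled by a factor `lam`: `lam = 1` recovers classic PPI, and `lam = 0` ignores the proxy and falls back to the plain mean of the human labels. The function `ppi_pp_mean` below is hypothetical and written in plain NumPy. It does not call GLIDE, and reading `power_tuning=False` as `lam = 1` is an inference from the table rather than a statement about GLIDE's internals.

```python
import numpy as np


def ppi_pp_mean(y_labeled, proxy_labeled, proxy_unlabeled, lam=None):
    """Power-tuned PPI mean estimate (hypothetical helper, not GLIDE's API).

    If lam is None, it is set to the value that minimizes the estimated
    variance: Cov(Y, f) / ((1 + n / N) * Var(f)).
    """
    y = np.asarray(y_labeled, dtype=float)
    f = np.asarray(proxy_labeled, dtype=float)
    f_big = np.asarray(proxy_unlabeled, dtype=float)
    n, N = len(y), len(f_big)

    if lam is None:
        # Proxy variance is estimated on all proxy labels pooled together.
        pooled_var = np.concatenate([f, f_big]).var(ddof=1)
        cov = np.cov(y, f, ddof=1)[0, 1]
        # A constant proxy carries no information, so it gets zero weight.
        lam = cov / ((1 + n / N) * pooled_var) if pooled_var > 0 else 0.0

    # Human mean plus a scaled correction from the proxy labels.
    estimate = y.mean() + lam * (f_big.mean() - f.mean())

    # Estimated variance of the estimate for the chosen lam.
    variance = (y - lam * f).var(ddof=1) / n + lam**2 * f_big.var(ddof=1) / N
    return estimate, variance, lam


human = [1, 0, 1, 1, 0, 1]
proxy_on_human_items = [1, 0, 1, 0, 0, 1]
proxy_on_rest = [1, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 1]

classic = ppi_pp_mean(human, proxy_on_human_items, proxy_on_rest, lam=1.0)
tuned = ppi_pp_mean(human, proxy_on_human_items, proxy_on_rest)
print("classic PPI:", classic)
print("power-tuned:", tuned)
```

The tuned weight shrinks toward zero when the proxy is weakly correlated with the human labels, which is why the power-tuned variant should not do worse, asymptotically, than using the human labels alone.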

References

[1] Angelopoulos, Anastasios N., Stephen Bates, Clara Fannjiang, Michael I. Jordan, and Tijana Zrnic. "Prediction-powered inference." Science 382, no. 6671 (2023): 669-674.

[2] Angelopoulos, Anastasios N., John C. Duchi, and Tijana Zrnic. "PPI++: Efficient prediction-powered inference." arXiv preprint arXiv:2311.01453 (2023).

[3] Fisch, Adam, Joshua Maynez, R. Alex Hofer, Bhuwan Dhingra, Amir Globerson, and William W. Cohen. "Stratified prediction-powered inference for effective hybrid evaluation of language models." Advances in Neural Information Processing Systems 37 (2024): 111489-111514.

[4] Fogliato, Riccardo, Pratik Patil, Mathew Monfort, and Pietro Perona. "A framework for efficient model evaluation through stratification, sampling, and estimation." In European Conference on Computer Vision, pp. 140-158. Cham: Springer Nature Switzerland, 2024.

[5] Zrnic, Tijana, and Emmanuel J. Candès. "Active statistical inference." In Proceedings of the 41st International Conference on Machine Learning, pp. 62993-63010. 2024.

[6] Gligorić, Kristina, Tijana Zrnic, Cinoo Lee, Emmanuel J. Candès, and Dan Jurafsky. "Can unconfident LLM annotations be used for confident conclusions?" In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pp. 3514-3533. 2025.

[7] Kluger, Dan M., Kerri Lu, Tijana Zrnic, Sherrie Wang, and Stephen Bates. "Prediction-powered inference with imputed covariates and nonuniform sampling." arXiv preprint arXiv:2501.18577 (2025).

📬 Stay Updated

Follow our LinkedIn newsletter for updates on GLIDE and GenAI evaluation.

🏛️ Affiliation

Developed at Emerton Data.