GLIDE
Generated Label Inference & Debiasing Engine
What is GLIDE?
GLIDE is a Python library for rigorous evaluation of GenAI systems using hybrid human/proxy annotations.
GLIDE implements methods from the field of prediction-powered inference, an approach to system evaluation that combines a small set of human-labeled data with a large set of proxy-labeled data to produce valid, debiased estimates. See the implemented algorithms below.
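For mean estimation, the idea fits in one line. The notation here is introduced for illustration and is not taken from the GLIDE documentation: $Y_i$ are human labels and $f(X_i)$ the proxy scores on the $n$ annotated examples, while $f(\tilde{X}_j)$ are proxy scores on the $N$ proxy-only examples. This is the basic estimator from [1]:

$$
\hat{\theta}_{\mathrm{PP}} = \frac{1}{N}\sum_{j=1}^{N} f(\tilde{X}_j) + \frac{1}{n}\sum_{i=1}^{n}\bigl(Y_i - f(X_i)\bigr)
$$

The first term is the cheap proxy average. The second is the average proxy error measured on the annotated subset, which corrects the bias of the first.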
Why GLIDE?
- GenAI applications are everywhere, and imperfect. Deployed systems make mistakes, and measuring how often matters.
- LLM-as-judge is biased. Proxy evaluators (models, heuristics) are cheap but systematically over- or under-estimate true performance.
- Rigorous evaluation requires a human in the loop. Ground-truth labels from humans are expensive, so only a small subset can be annotated.
- GLIDE bridges the gap. It combines a small set of human annotations with a large set of proxy predictions to produce statistically valid metrics, correcting proxy bias without requiring full human labeling.
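To make the bias correction concrete, here is a minimal sketch on synthetic data. It deliberately uses only the Python standard library and does not call GLIDE: the function `ppi_mean`, its arguments, and the simulated judge are all hypothetical names chosen for this illustration. The interval is a plain normal approximation, assuming the annotated and proxy-only samples are independent.

```python
import random
from statistics import NormalDist, fmean, variance


def ppi_mean(y_labeled, proxy_labeled, proxy_unlabeled, alpha=0.05):
    """Debiased mean estimate with a normal-approximation confidence interval.

    y_labeled       -- human labels on the small annotated set (size n)
    proxy_labeled   -- proxy scores on that same annotated set (size n)
    proxy_unlabeled -- proxy scores on the large proxy-only set (size N)
    """
    n, big_n = len(y_labeled), len(proxy_unlabeled)
    # Rectifier: per-example proxy error measured against the human labels.
    rectifier = [y - f for y, f in zip(y_labeled, proxy_labeled)]
    estimate = fmean(proxy_unlabeled) + fmean(rectifier)
    # The two samples are assumed independent, so their variances add.
    std_err = (variance(proxy_unlabeled) / big_n + variance(rectifier) / n) ** 0.5
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return estimate, (estimate - z * std_err, estimate + z * std_err)


# Synthetic setup: the true pass rate is 0.70 and the simulated judge
# wrongly marks 20% of failures as passes, so it over-estimates quality.
rng = random.Random(0)


def draw(size):
    truth = [1.0 if rng.random() < 0.70 else 0.0 for _ in range(size)]
    proxy = [1.0 if (t == 1.0 or rng.random() < 0.20) else 0.0 for t in truth]
    return truth, proxy


y_small, proxy_small = draw(300)   # small human-annotated subset
_, proxy_large = draw(20_000)      # large subset scored by the proxy only

naive = fmean(proxy_large)
estimate, (low, high) = ppi_mean(y_small, proxy_small, proxy_large)
print(f"proxy-only mean: {naive:.3f}")
print(f"debiased mean:   {estimate:.3f}  95% CI: [{low:.3f}, {high:.3f}]")
```

In this setup the proxy-only mean is pulled toward roughly 0.76, while the debiased estimate moves back toward the true rate of 0.70 using only 300 human labels.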
Quick Start
Install the package with your favorite package manager:
uv add glide-py
or
pip install glide-py
Then work through the practical quickstart.
Documentation
Explore the full documentation, from practical tutorials and user guides to scientific deep dives into the methods behind GLIDE.
Contributing
Contributions are welcome! Please read the contributing guide for setup instructions, an architectural overview, and the checklist to follow before opening a pull request. Feel free to open an issue to report a bug or suggest a feature.
Versioning
This project follows Semantic Versioning (SemVer): MAJOR.MINOR.PATCH.
Dependency Support
This project follows SPEC 0 for dependency support windows.
License & Citation
This project is licensed under the Apache 2.0 License.
If you use GLIDE in your work, please cite us using the "Cite this repository" button on the GitHub repository page.
Implemented Algorithms
| Name | Class | Reference Paper(s) | Original Implementation |
|---|---|---|---|
| Prediction-Powered Inference | estimators.PPIMeanEstimator (with power_tuning=False) | [1] | Link |
| PPI++ | estimators.PPIMeanEstimator | [2] | Link |
| Stratified Prediction-Powered Inference | estimators.StratifiedPPIMeanEstimator | [3] | N/A |
| Stratified Sampling | samplers.StratifiedSampler | [4] | Link |
| Active Statistical Inference | estimators.ASIMeanEstimator | [5], [6] | Link |
| Active Sampling | samplers.ActiveSampler | [5], [6] | Link |
| Predict-Then-Debias | estimators.PTDMeanEstimator, estimators.StratifiedPTDMeanEstimator, estimators.IPWPTDMeanEstimator | [7] | Link |
| Cluster Prediction-Powered Inference | estimators.ClusterPPIMeanEstimator | N/A | Link |
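The difference between the first two rows is the power-tuning weight. As a rough sketch for the mean-estimation case, following the formula in [2] rather than GLIDE's own code, the weight `lam` below minimizes the estimator's variance, and fixing it to 1 recovers plain prediction-powered inference. The function names, arguments, and synthetic proxies are hypothetical and chosen only for this illustration.

```python
import random
from statistics import fmean, variance


def weighted_mean(y_labeled, proxy_labeled, proxy_unlabeled, lam):
    """Human-label mean plus a lam-weighted proxy correction term."""
    return fmean(y_labeled) + lam * (fmean(proxy_unlabeled) - fmean(proxy_labeled))


def tuned_weight(y_labeled, proxy_labeled, proxy_unlabeled):
    """Variance-minimizing weight: Cov(Y, f) / ((1 + n/N) * Var(f))."""
    n, big_n = len(y_labeled), len(proxy_unlabeled)
    y_bar, f_bar = fmean(y_labeled), fmean(proxy_labeled)
    # Sample covariance between human labels and proxy scores.
    cov = sum((y - y_bar) * (f - f_bar)
              for y, f in zip(y_labeled, proxy_labeled)) / (n - 1)
    # Proxy variance estimated from all available proxy scores.
    var_f = variance(list(proxy_labeled) + list(proxy_unlabeled))
    return cov / ((1 + n / big_n) * var_f) if var_f > 0 else 0.0


rng = random.Random(1)
n, big_n = 400, 20_000
truth_small = [1.0 if rng.random() < 0.70 else 0.0 for _ in range(n)]
truth_large = [1.0 if rng.random() < 0.70 else 0.0 for _ in range(big_n)]


def noisy_copy(labels, flip_rate):
    """Proxy that agrees with the label except for random flips."""
    return [1.0 - t if rng.random() < flip_rate else t for t in labels]


# Informative proxy: agrees with the truth 90% of the time.
good_small, good_large = noisy_copy(truth_small, 0.10), noisy_copy(truth_large, 0.10)
# Uninformative proxy: a coin flip that ignores the input entirely.
coin_small = [float(rng.random() < 0.5) for _ in range(n)]
coin_large = [float(rng.random() < 0.5) for _ in range(big_n)]

lam_good = tuned_weight(truth_small, good_small, good_large)
lam_noise = tuned_weight(truth_small, coin_small, coin_large)
est_good = weighted_mean(truth_small, good_small, good_large, lam_good)
print(f"informative proxy:   lam = {lam_good:.2f}, estimate = {est_good:.3f}")
print(f"uninformative proxy: lam = {lam_noise:.2f}")
```

With an uninformative proxy the tuned weight shrinks toward 0, so the estimate falls back to the plain human-label mean instead of inheriting proxy noise.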
References
[1] Angelopoulos, Anastasios N., Stephen Bates, Clara Fannjiang, Michael I. Jordan, and Tijana Zrnic. "Prediction-powered inference." Science 382, no. 6671 (2023): 669-674.
[2] Angelopoulos, Anastasios N., John C. Duchi, and Tijana Zrnic. "PPI++: Efficient prediction-powered inference." arXiv preprint arXiv:2311.01453 (2023).
[3] Fisch, Adam, Joshua Maynez, R. Alex Hofer, Bhuwan Dhingra, Amir Globerson, and William W. Cohen. "Stratified prediction-powered inference for effective hybrid evaluation of language models." Advances in Neural Information Processing Systems 37 (2024): 111489-111514.
[4] Fogliato, Riccardo, Pratik Patil, Mathew Monfort, and Pietro Perona. "A framework for efficient model evaluation through stratification, sampling, and estimation." In European Conference on Computer Vision, pp. 140-158. Cham: Springer Nature Switzerland, 2024.
[5] Zrnic, Tijana, and Emmanuel J. Candès. "Active statistical inference." In Proceedings of the 41st International Conference on Machine Learning, pp. 62993-63010. 2024.
[6] Gligorić, Kristina, Tijana Zrnic, Cinoo Lee, Emmanuel Candès, and Dan Jurafsky. "Can unconfident LLM annotations be used for confident conclusions?" In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pp. 3514-3533. 2025.
[7] Kluger, Dan M., Kerri Lu, Tijana Zrnic, Sherrie Wang, and Stephen Bates. "Prediction-powered inference with imputed covariates and nonuniform sampling." arXiv preprint arXiv:2501.18577 (2025).
Stay Updated
Follow our LinkedIn newsletter for updates on GLIDE and GenAI evaluation.
Affiliation
Developed at Emerton Data.
