Scientific Validation

glide.scientific_validation

compute_hits

compute_hits(stats, confidence_level, true_mean)

Compute per-seed hit indicators from the output of :func:run_monte_carlo.

For each method and each seed, a hit indicator records whether the confidence interval at confidence_level contains true_mean (hit = 1) or does not (hit = 0), allowing empirical coverage to be measured across seeds.

Parameters:

Name	Type	Description	Default
`stats`	`Dict[str, Dict[str, Any]]`	Output of :func:`run_monte_carlo`.	required
`confidence_level`	`float`	The confidence level at which to evaluate coverage.	required
`true_mean`	`float`	The ground-truth value that the intervals should cover.	required

Returns:

Type	Description
`Dict[str, NDArray]`	Dict mapping each method name to a float array of shape `(n_seeds,)` with values in `{0.0, 1.0}`.

Raises:

Type	Description
`ValueError`	If `confidence_level` is not in `(0, 1)`.
`ValueError`	If `confidence_level` was not passed to :func:`run_monte_carlo` and is therefore not available in `stats`.

Examples:

>>> import numpy as np
>>> from glide.scientific_validation import compute_hits
>>> stats = {"M": {"lower_bounds": {0.9: np.array([0.3, 0.6])}, "upper_bounds": {0.9: np.array([0.7, 0.8])}}}
>>> hits = compute_hits(stats, confidence_level=0.9, true_mean=0.5)
>>> hits["M"]
array([1., 0.])

Source code in glide/scientific_validation.py

def compute_hits(
    stats: Dict[str, Dict[str, Any]],
    confidence_level: float,
    true_mean: float,
) -> Dict[str, NDArray]:
    """Compute per-seed hit indicators from the output of :func:`run_monte_carlo`.

    For each method and each seed, a hit indicator records whether the confidence
    interval at ``confidence_level`` contains ``true_mean`` (hit = 1) or does not
    (hit = 0), allowing empirical coverage to be measured across seeds.

    Parameters
    ----------
    stats : Dict[str, Dict[str, Any]]
        Output of :func:`run_monte_carlo`.
    confidence_level : float
        The confidence level at which to evaluate coverage.
    true_mean : float
        The ground-truth value that the intervals should cover.

    Returns
    -------
    Dict[str, NDArray]
        Dict mapping each method name to a float array of shape ``(n_seeds,)``
        with values in ``{0.0, 1.0}``.

    Raises
    ------
    ValueError
        If ``confidence_level`` is not in ``(0, 1)``.
    ValueError
        If ``confidence_level`` was not passed to :func:`run_monte_carlo` and
        is therefore not available in ``stats``.

    Examples
    --------
    >>> import numpy as np
    >>> from glide.scientific_validation import compute_hits
    >>> stats = {"M": {"lower_bounds": {0.9: np.array([0.3, 0.6])}, "upper_bounds": {0.9: np.array([0.7, 0.8])}}}
    >>> hits = compute_hits(stats, confidence_level=0.9, true_mean=0.5)
    >>> hits["M"]
    array([1., 0.])
    """
    _validate_bounds(
        confidence_level, "confidence_level", lower=0, upper=1, left_inclusive=False, right_inclusive=False
    )
    hits = {}
    for method in stats:
        if confidence_level not in stats[method]["lower_bounds"]:
            available_levels = sorted(stats[method]["lower_bounds"].keys())
            raise ValueError(
                f"'confidence_level' {confidence_level!r} was not passed to run_monte_carlo; "
                f"available levels are {available_levels!r}."
            )
        lower = stats[method]["lower_bounds"][confidence_level]
        upper = stats[method]["upper_bounds"][confidence_level]
        hits[method] = np.asarray((lower <= true_mean) & (true_mean <= upper), dtype=float)
    return hits

coverage_with_error_bar

coverage_with_error_bar(hits, confidence_level)

Estimate empirical coverage and its confidence interval from hit indicators.

Uses :class:~glide.estimators.ClassicalMeanEstimator on the binary hit array, which gives a valid confidence interval on the coverage rate via the normal approximation (Wald's interval).

Parameters:

Name	Type	Description	Default
`hits`	`NDArray`	Float array of per-seed hit indicators (values in `{0.0, 1.0}`).	required
`confidence_level`	`float`	Confidence level for the error bar on the coverage estimate.	required

Returns:

Type	Description
`Tuple[float, float, float]`	`(mean_coverage, lower_bound, upper_bound)`.

Raises:

Type	Description
`ValueError`	If `hits` is empty.
`ValueError`	If `confidence_level` is not in `(0, 1)`.

Examples:

>>> import numpy as np
>>> from glide.scientific_validation import coverage_with_error_bar
>>> hits = np.hstack([np.ones(15), np.zeros(5)])
>>> mean_cov, lower, upper = coverage_with_error_bar(hits, confidence_level=0.95)
>>> float(mean_cov)
0.75
>>> round(float(lower), 4)
0.5553
>>> round(float(upper), 4)
0.9447

Source code in glide/scientific_validation.py

def coverage_with_error_bar(
    hits: NDArray,
    confidence_level: float,
) -> Tuple[float, float, float]:
    """Estimate empirical coverage and its confidence interval from hit indicators.

    Uses :class:`~glide.estimators.ClassicalMeanEstimator` on the binary hit array,
    which gives a valid confidence interval on the coverage rate via the normal
    approximation (Wald's interval).

    Parameters
    ----------
    hits : NDArray
        Float array of per-seed hit indicators (values in ``{0.0, 1.0}``).
    confidence_level : float
        Confidence level for the error bar on the coverage estimate.

    Returns
    -------
    Tuple[float, float, float]
        ``(mean_coverage, lower_bound, upper_bound)``.

    Raises
    ------
    ValueError
        If ``hits`` is empty.
    ValueError
        If ``confidence_level`` is not in ``(0, 1)``.

    Examples
    --------
    >>> import numpy as np
    >>> from glide.scientific_validation import coverage_with_error_bar
    >>> hits = np.hstack([np.ones(15), np.zeros(5)])
    >>> mean_cov, lower, upper = coverage_with_error_bar(hits, confidence_level=0.95)
    >>> float(mean_cov)
    0.75
    >>> round(float(lower), 4)
    0.5553
    >>> round(float(upper), 4)
    0.9447
    """
    _validate_non_empty(hits, "hits")
    _validate_bounds(
        confidence_level, "confidence_level", lower=0, upper=1, left_inclusive=False, right_inclusive=False
    )
    estimator = ClassicalMeanEstimator()
    result = estimator.estimate(hits, confidence_level=confidence_level)
    coverage = result.mean
    lower = result.confidence_interval.lower_bound
    upper = result.confidence_interval.upper_bound
    return coverage, lower, upper

run_monte_carlo

run_monte_carlo(confidence_levels, run_seed, n_seeds=500)

Run run_seed for n_seeds independent seeds and collect all outputs into a single dictionary.

The method names are inferred from the keys of the dict returned by run_seed.

Parameters:

Name	Type	Description	Default
`confidence_levels`	`NDArray`	Confidence levels at which to evaluate interval bounds.	required
`run_seed`	`Callable[[int], Dict[str, Any]]`	Function that takes a seed integer and returns a non-empty dict mapping each method name to `{"mean": float, "std": float, "confidence_interval": <result>}`. May also include `"effective_sample_size": float` for any method.	required
`n_seeds`	`int`	Number of Monte Carlo seeds. Default is 500.	`500`

Returns:

Type	Description
`Dict[str, Any]`	Nested dict mapping each method name to `{"means": NDArray, "stds": NDArray, "lower_bounds": {level: NDArray}, "upper_bounds": {level: NDArray}, "effective_sample_sizes": NDArray}`. `effective_sample_sizes` is `NaN` for seeds or methods that did not include it.

Raises:

Type	Description
`ValueError`	If `n_seeds` is not positive.
`ValueError`	If `run_seed` returns an empty dict.
`ValueError`	If any value in `confidence_levels` is not in `(0, 1)`.

Examples:

>>> import numpy as np
>>> from glide.scientific_validation import run_monte_carlo
>>> from glide.estimators import ClassicalMeanEstimator
>>> def run_seed(seed):
...     y = np.array([0.0, 1.0])
...     result = ClassicalMeanEstimator().estimate(y, confidence_level=0.9)
...     return {"M": {"mean": result.mean, "std": result.std, "confidence_interval": result.confidence_interval}}
>>> stats = run_monte_carlo(np.array([0.9]), run_seed, n_seeds=2)
>>> stats["M"]["means"]
array([0.5, 0.5])

Source code in glide/scientific_validation.py

def run_monte_carlo(
    confidence_levels: NDArray,
    run_seed: Callable[[int], Dict[str, Any]],
    n_seeds: int = 500,
) -> Dict[str, Any]:
    """Run ``run_seed`` for ``n_seeds`` independent seeds and collect all outputs into a single dictionary.

    The method names are inferred from the keys of the dict returned by ``run_seed``.

    Parameters
    ----------
    confidence_levels : NDArray
        Confidence levels at which to evaluate interval bounds.
    run_seed : Callable[[int], Dict[str, Any]]
        Function that takes a seed integer and returns a non-empty dict mapping each method
        name to ``{"mean": float, "std": float, "confidence_interval": <result>}``.
        May also include ``"effective_sample_size": float`` for any method.
    n_seeds : int, optional
        Number of Monte Carlo seeds. Default is 500.

    Returns
    -------
    Dict[str, Any]
        Nested dict mapping each method name to
        ``{"means": NDArray, "stds": NDArray, "lower_bounds": {level: NDArray},
        "upper_bounds": {level: NDArray}, "effective_sample_sizes": NDArray}``.
        ``effective_sample_sizes`` is ``NaN`` for seeds or methods that did not
        include it.

    Raises
    ------
    ValueError
        If ``n_seeds`` is not positive.
    ValueError
        If ``run_seed`` returns an empty dict.
    ValueError
        If any value in ``confidence_levels`` is not in ``(0, 1)``.

    Examples
    --------
    >>> import numpy as np
    >>> from glide.scientific_validation import run_monte_carlo
    >>> from glide.estimators import ClassicalMeanEstimator
    >>> def run_seed(seed):
    ...     y = np.array([0.0, 1.0])
    ...     result = ClassicalMeanEstimator().estimate(y, confidence_level=0.9)
    ...     return {"M": {"mean": result.mean, "std": result.std, "confidence_interval": result.confidence_interval}}
    >>> stats = run_monte_carlo(np.array([0.9]), run_seed, n_seeds=2)
    >>> stats["M"]["means"]
    array([0.5, 0.5])
    """
    _validate_strictly_positive(n_seeds, "n_seeds")
    _validate_bounds(
        confidence_levels, "confidence_levels", lower=0, upper=1, left_inclusive=False, right_inclusive=False
    )
    first_estimates = run_seed(0)
    methods = list(first_estimates.keys())
    if not methods:
        raise ValueError("'run_seed' must return a non-empty dict.")
    means = {method: np.zeros(n_seeds) for method in methods}
    stds = {method: np.zeros(n_seeds) for method in methods}
    lower_bounds = {method: {level: np.zeros(n_seeds) for level in confidence_levels} for method in methods}
    upper_bounds = {method: {level: np.zeros(n_seeds) for level in confidence_levels} for method in methods}
    effective_sample_sizes = {method: np.full(n_seeds, np.nan) for method in methods}

    for seed in range(n_seeds):
        estimates = first_estimates if seed == 0 else run_seed(seed)
        for method in methods:
            means[method][seed] = estimates[method]["mean"]
            stds[method][seed] = estimates[method]["std"]
            confidence_interval = estimates[method]["confidence_interval"]
            for level in confidence_levels:
                confidence_interval.confidence_level = level
                lower_bounds[method][level][seed] = confidence_interval.lower_bound
                upper_bounds[method][level][seed] = confidence_interval.upper_bound
            if "effective_sample_size" in estimates[method]:
                effective_sample_sizes[method][seed] = estimates[method]["effective_sample_size"]

    stats = {
        method: {
            "means": means[method],
            "stds": stds[method],
            "lower_bounds": lower_bounds[method],
            "upper_bounds": upper_bounds[method],
            "effective_sample_sizes": effective_sample_sizes[method],
        }
        for method in methods
    }
    return stats