Confidence Sequences

glide.confidence_sequences.empirical_bernstein.EmpiricalBernsteinConfidenceSequence dataclass

Anytime-valid empirical-Bernstein confidence sequence on a running mean.

Holds the per-look running means and the one-sided anytime-valid bound on the side where drift is harmful (a lower bound for a risk, an upper bound for a performance, after the monitor has mapped the sequence back to the original metric orientation). The bounds hold simultaneously at all looks, so testing after every batch does not inflate the false-alarm probability.
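
To illustrate what "hold simultaneously at all looks" means, here is a minimal sketch of a time-uniform lower bound. It is not the empirical-Bernstein bound this class stores. It swaps in a much cruder Hoeffding bound with a union bound over looks, assuming observations in [0, 1]. The function name `hoeffding_lower_sequence` and the simulation settings are illustrative only.

```python
import numpy as np


def hoeffding_lower_sequence(observations, alpha=0.05):
    """Running mean and a time-uniform lower bound for data in [0, 1].

    Assumption: a Hoeffding bound with a union bound over looks, not the
    empirical-Bernstein construction used by the library.
    """
    x = np.asarray(observations, dtype=float)
    t = np.arange(1, x.size + 1)
    running_mean = np.cumsum(x) / t
    # Spend alpha / (t * (t + 1)) at look t; these budgets sum to at most alpha,
    # so the bound is valid at every look simultaneously.
    radius = np.sqrt(np.log(t * (t + 1) / alpha) / (2 * t))
    return running_mean, running_mean - radius


# Simulate repeated monitoring when the true mean never drifts.
rng = np.random.default_rng(0)
true_mean = 0.3
n_runs = 200
false_alarms = 0
for _ in range(n_runs):
    batch_values = rng.binomial(1, true_mean, size=500)
    _, lower = hoeffding_lower_sequence(batch_values, alpha=0.05)
    # A false alarm is any look at which the lower bound exceeds the true mean.
    false_alarms += bool(np.any(lower > true_mean))
false_alarm_rate = false_alarms / n_runs
```

Checking the bound at all 500 looks still keeps the false-alarm rate below `alpha`. The price of the union bound is a wide radius, which is the kind of slack tighter constructions such as empirical-Bernstein bounds aim to reduce.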

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| running_mean_estimates | NDArray | Per-look running mean of the per-batch estimates, in original metric units. | required |
| confidence_bounds | NDArray | Per-look harmful-side anytime-valid bound, in original metric units. | required |

References

Waudby-Smith, Ian, and Aaditya Ramdas. "Estimating means of bounded random variables by betting." Journal of the Royal Statistical Society Series B: Statistical Methodology 86, no. 1 (2024): 1-27.

Howard, Steven R., Aaditya Ramdas, Jon McAuliffe, and Jasjeet Sekhon. "Time-uniform, nonparametric, nonasymptotic confidence sequences." The Annals of Statistics 49, no. 2 (2021): 1055-1080.

Examples:

>>> import numpy as np
>>> from glide.confidence_sequences import EmpiricalBernsteinConfidenceSequence
>>> sequence = EmpiricalBernsteinConfidenceSequence(
...     running_mean_estimates=np.array([0.4, 0.6]),
...     confidence_bounds=np.array([0.1, 0.55]),
... )
>>> sequence.test_null_hypothesis(0.5, alternative="larger")
array([False,  True])
Source code in glide/confidence_sequences/empirical_bernstein.py
@dataclass
class EmpiricalBernsteinConfidenceSequence:
    """Anytime-valid empirical-Bernstein confidence sequence on a running mean.

    Holds the per-look running means and the one-sided anytime-valid bound on the
    side where drift is harmful (a lower bound for a risk, an upper bound for a
    performance, after the monitor has mapped the sequence back to the original
    metric orientation). The bounds hold simultaneously at all looks, so testing
    after every batch does not inflate the false-alarm probability.

    Parameters
    ----------
    running_mean_estimates : NDArray
        Per-look running mean of the per-batch estimates, in original metric units.
    confidence_bounds : NDArray
        Per-look harmful-side anytime-valid bound, in original metric units.

    References
    ----------
    Waudby-Smith, Ian, and Aaditya Ramdas. "Estimating means of bounded random
    variables by betting." Journal of the Royal Statistical Society Series B:
    Statistical Methodology 86, no. 1 (2024): 1-27.

    Howard, Steven R., Aaditya Ramdas, Jon McAuliffe, and Jasjeet Sekhon. "Time-uniform,
    nonparametric, nonasymptotic confidence sequences." The Annals of Statistics 49,
    no. 2 (2021): 1055-1080.

    Examples
    --------
    >>> import numpy as np
    >>> from glide.confidence_sequences import EmpiricalBernsteinConfidenceSequence
    >>> sequence = EmpiricalBernsteinConfidenceSequence(
    ...     running_mean_estimates=np.array([0.4, 0.6]),
    ...     confidence_bounds=np.array([0.1, 0.55]),
    ... )
    >>> sequence.test_null_hypothesis(0.5, alternative="larger")
    array([False,  True])
    """

    running_mean_estimates: NDArray
    confidence_bounds: NDArray

    def test_null_hypothesis(
        self,
        h0_value: float,
        alternative: Literal["larger", "smaller"] = "larger",
    ) -> NDArray:
        """Test the running mean against ``h0_value`` at every look.

        Parameters
        ----------
        h0_value : float
            The threshold the harmful-side bound is tested against (for a monitor,
            the user-supplied business threshold).
        alternative : str, optional
            ``'larger'`` (default) when the metric is a risk: alarm where the lower
            bound exceeds ``h0_value``. ``'smaller'`` when it is a performance: alarm
            where the upper bound falls below ``h0_value``. This class stores only a
            one-sided bound, so ``'two-sided'`` is not accepted.

        Returns
        -------
        NDArray
            Boolean per-look alarm vector, ``True`` at each look where the bound
            lies strictly beyond ``h0_value``. Time-uniform family-wise error is
            controlled over all looks.

        Raises
        ------
        ValueError
            If ``alternative`` is not ``'larger'`` or ``'smaller'``.
        """
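        alternatives = ["larger", "smaller"]
        _validate_literal(alternative, "alternative", alternatives)
        if alternative == "larger":
            # Risk metric: alarm where the lower bound exceeds the threshold.
            alarms = self.confidence_bounds > h0_value
        else:
            # Performance metric: alarm where the upper bound falls below it.
            alarms = self.confidence_bounds < h0_value
        return alarms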

test_null_hypothesis

test_null_hypothesis(h0_value, alternative='larger')

Test the running mean against h0_value at every look.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| h0_value | float | The threshold the harmful-side bound is tested against (for a monitor, the user-supplied business threshold). | required |
| alternative | str | 'larger' (default) when the metric is a risk: alarm where the lower bound exceeds h0_value. 'smaller' when it is a performance: alarm where the upper bound falls below h0_value. This class stores only a one-sided bound, so 'two-sided' is not accepted. | 'larger' |

Returns:

| Type | Description |
| --- | --- |
| NDArray | Boolean per-look alarm vector, True at each look where the bound lies strictly beyond h0_value. Time-uniform family-wise error is controlled over all looks. |
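
A short sketch of how the returned alarm vector might be consumed, for instance to find the first look at which an alarm fired. The bound values and threshold below are made up for illustration, and the comparison is written out directly in NumPy rather than calling the library.

```python
import numpy as np

# Hypothetical upper bounds on a performance metric at four looks.
upper_bounds = np.array([0.95, 0.91, 0.88, 0.86])
threshold = 0.90

# The same element-wise comparison the method applies for alternative="smaller".
alarms = upper_bounds < threshold

# Index of the first look that raised an alarm, or None if no look did.
first_alarm = int(np.argmax(alarms)) if alarms.any() else None
n_alarm_looks = int(alarms.sum())
```

Here the bound first drops below the threshold at look index 2, so `first_alarm` is 2 and two looks are flagged in total.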

Raises:

| Type | Description |
| --- | --- |
| ValueError | If alternative is not 'larger' or 'smaller'. |
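
The validation step can be sketched as follows. The helper `validate_literal` is a hypothetical stand-in: the source calls a private `_validate_literal` whose implementation and error message are not shown here, so the only assumption is that it raises `ValueError` for a value outside the allowed list.

```python
from typing import Optional, Sequence


def validate_literal(value: str, name: str, allowed: Sequence[str]) -> None:
    """Hypothetical stand-in for the library's private validation helper."""
    if value not in allowed:
        raise ValueError(f"{name} must be one of {list(allowed)}, got {value!r}")


# An unsupported alternative is rejected before any comparison is made.
message: Optional[str] = None
try:
    validate_literal("two-sided", "alternative", ["larger", "smaller"])
except ValueError as error:
    message = str(error)

# A supported alternative passes silently.
validate_literal("smaller", "alternative", ["larger", "smaller"])
```

Callers that accept `alternative` from user configuration may want to catch this error early and report the allowed values.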

Source code in glide/confidence_sequences/empirical_bernstein.py
def test_null_hypothesis(
    self,
    h0_value: float,
    alternative: Literal["larger", "smaller"] = "larger",
) -> NDArray:
    """Test the running mean against ``h0_value`` at every look.

    Parameters
    ----------
    h0_value : float
        The threshold the harmful-side bound is tested against (for a monitor,
        the user-supplied business threshold).
    alternative : str, optional
        ``'larger'`` (default) when the metric is a risk: alarm where the lower
        bound exceeds ``h0_value``. ``'smaller'`` when it is a performance: alarm
        where the upper bound falls below ``h0_value``. This class stores only a
        one-sided bound, so ``'two-sided'`` is not accepted.

    Returns
    -------
    NDArray
        Boolean per-look alarm vector, ``True`` at each look where the bound
        lies strictly beyond ``h0_value``. Time-uniform family-wise error is
        controlled over all looks.

    Raises
    ------
    ValueError
        If ``alternative`` is not ``'larger'`` or ``'smaller'``.
    """
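    alternatives = ["larger", "smaller"]
    _validate_literal(alternative, "alternative", alternatives)
    if alternative == "larger":
        # Risk metric: alarm where the lower bound exceeds the threshold.
        alarms = self.confidence_bounds > h0_value
    else:
        # Performance metric: alarm where the upper bound falls below it.
        alarms = self.confidence_bounds < h0_value
    return alarms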