Data Simulators

glide.simulators

generate_binary_dataset

generate_binary_dataset(
    n_total,
    true_mean=0.7,
    proxy_mean=0.6,
    correlation=0.8,
    random_seed=None,
)

Generate a synthetic binary-label oracle dataset.

Parameters:

Name	Type	Description	Default
`n_total`	`int`	Total number of samples. All samples have both true and proxy labels.	required
`true_mean`	`float`	Expected mean value of the true labels.	`0.7`
`proxy_mean`	`float`	Expected mean value of the proxy labels.	`0.6`
`correlation`	`float`	Pearson correlation between true and proxy labels.	`0.8`
`random_seed`	`int or SeedSequence`	Seed for reproducibility.	`None`

Returns:

Type	Description
`Tuple[NDArray, NDArray]`	[0]: array of shape `(n_total,)`, y_true containing ground-truth labels. [1]: array of shape `(n_total,)`, y_proxy containing proxy labels.

Raises:

Type	Description
`ValueError`	If `true_mean` is not in (0, 1).
`ValueError`	If `proxy_mean` is not in (0, 1).
`ValueError`	If the combination of `true_mean`, `proxy_mean`, and `correlation` is impossible (leads to negative joint probabilities).

Notes

Step 1 — Joint distribution

For two binary variables with marginals p_t = P(y_true=1) and p_p = P(y_proxy=1), the Pearson correlation uniquely determines the joint distribution. Let D = sqrt(p_t * p_p * (1-p_t) * (1-p_p)) (product of standard deviations). Then:

p11 = P(y_true=1, y_proxy=1) = correlation * D + p_t * p_p
p00 = P(y_true=0, y_proxy=0) = 1 - p_t - p_p + p11
p01 = P(y_true=0, y_proxy=1) = p_p - p11
p10 = P(y_true=1, y_proxy=0) = p_t - p11

These four probabilities must all be non-negative; otherwise the parameter combination is impossible and a ValueError is raised. Each probability becomes negative under the following condition:

p11 < 0 for correlation < -(p_t * p_p) / D
p00 < 0 for correlation < (p_t + p_p - p_t * p_p - 1) / D
p01 < 0 for correlation > p_p * (1 - p_t) / D
p10 < 0 for correlation > p_t * (1 - p_p) / D

Therefore, the correlation must satisfy:

max(-p_t * p_p, p_t + p_p - p_t * p_p - 1) <= correlation * D <= min(p_t * (1 - p_p), p_p * (1 - p_t))
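The bound above can be checked numerically. The following is a minimal stand-alone sketch, not the library's implementation; the helper names `joint_probabilities` and `feasible_correlation_range` are hypothetical and introduced only for illustration.

```python
import math


def joint_probabilities(p_t, p_p, correlation):
    """Return (p00, p01, p10, p11) for two Bernoulli variables."""
    d = math.sqrt(p_t * p_p * (1 - p_t) * (1 - p_p))  # product of standard deviations
    p11 = correlation * d + p_t * p_p
    return 1 - p_t - p_p + p11, p_p - p11, p_t - p11, p11


def feasible_correlation_range(p_t, p_p):
    """Return the lowest and highest correlation with non-negative cells."""
    d = math.sqrt(p_t * p_p * (1 - p_t) * (1 - p_p))
    low = max(-p_t * p_p, p_t + p_p - p_t * p_p - 1) / d
    high = min(p_t * (1 - p_p), p_p * (1 - p_t)) / d
    return low, high


# Documented defaults: true_mean=0.7, proxy_mean=0.6, correlation=0.8.
p00, p01, p10, p11 = joint_probabilities(0.7, 0.6, 0.8)
low, high = feasible_correlation_range(0.7, 0.6)
print(f"cells: p00={p00:.4f} p01={p01:.4f} p10={p10:.4f} p11={p11:.4f}")
print(f"feasible correlation range: [{low:.4f}, {high:.4f}]")
```

With the default means, the default correlation of 0.8 sits close to the upper end of the feasible range, so p01 is small but still positive.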

Step 2 — Conditional probabilities

The joint probability p11 determines the conditional probability of y_proxy given each value of y_true:

p11 = correlation * D + p_t * p_p
p01 = p_p - p11

P(y_proxy = 1 | y_true = 1) = p11 / p_t
P(y_proxy = 1 | y_true = 0) = p01 / (1 - p_t)
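As a quick sanity check of the two conditional probabilities, the sketch below (plain Python, using the documented default parameters) verifies that the law of total probability recovers the proxy marginal p_p. The variable names are illustrative only.

```python
import math

p_t, p_p, correlation = 0.7, 0.6, 0.8  # documented defaults

d = math.sqrt(p_t * p_p * (1 - p_t) * (1 - p_p))
p11 = correlation * d + p_t * p_p
p01 = p_p - p11

cond_given_1 = p11 / p_t        # P(y_proxy = 1 | y_true = 1)
cond_given_0 = p01 / (1 - p_t)  # P(y_proxy = 1 | y_true = 0)

# Law of total probability: mixing the two conditionals with the
# y_true marginal must give back P(y_proxy = 1) = p_p.
recovered_p_p = p_t * cond_given_1 + (1 - p_t) * cond_given_0
print(cond_given_1, cond_given_0, recovered_p_p)
```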

Step 3 — Two-stage generation

y_true is sampled first for all n_total observations from a Bernoulli(p_t) distribution. For each observation, the corresponding conditional probability from Step 2 is selected, and y_proxy is then drawn from that conditional Bernoulli distribution:

y_true_i ~ Bernoulli(p_t)
y_proxy_i | y_true_i ~ Bernoulli(P(y_proxy = 1 | y_true_i))
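The two-stage scheme can be sketched with NumPy as follows. This is a simplified illustration that omits input validation; `sample_binary_pairs` is a hypothetical name, and the correlation of 0.5 is chosen here only because it lies well inside the feasible range for the default means.

```python
import numpy as np


def sample_binary_pairs(n, p_t, p_p, correlation, seed=None):
    """Draw y_true first, then y_proxy from the matching conditional."""
    rng = np.random.default_rng(seed)
    d = np.sqrt(p_t * p_p * (1 - p_t) * (1 - p_p))
    p11 = correlation * d + p_t * p_p
    cond_given_1 = p11 / p_t
    cond_given_0 = (p_p - p11) / (1 - p_t)

    # Stage 1: true labels from Bernoulli(p_t).
    y_true = rng.binomial(1, p_t, size=n)
    # Stage 2: pick the conditional probability per sample, then draw the proxy.
    cond = np.where(y_true == 1, cond_given_1, cond_given_0)
    y_proxy = rng.binomial(1, cond)
    return y_true.astype(float), y_proxy.astype(float)


y_true, y_proxy = sample_binary_pairs(200_000, 0.7, 0.6, 0.5, seed=0)
empirical_corr = np.corrcoef(y_true, y_proxy)[0, 1]
print(y_true.mean(), y_proxy.mean(), empirical_corr)
```

On a large sample, the empirical means and Pearson correlation should land close to the requested values.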

References

[SO] Correlation between Bernoulli Variables: https://math.stackexchange.com/questions/610443/finding-a-correlation-between-bernoulli-variables

Examples:

>>> import numpy as np
>>> from glide.simulators import generate_binary_dataset
>>> y_true, y_proxy = generate_binary_dataset(n_total=8, random_seed=42)
>>> len(y_true)
8
>>> len(y_proxy)
8
>>> bool(np.all(np.isin(y_true, [0.0, 1.0])))
True
>>> bool(np.all(np.isin(y_proxy, [0.0, 1.0])))
True

Source code in glide/simulators/binary.py

def generate_binary_dataset(
    n_total: int,
    true_mean: float = 0.7,
    proxy_mean: float = 0.6,
    correlation: float = 0.8,
    random_seed: Optional[Union[int, np.random.SeedSequence]] = None,
) -> Tuple[NDArray, NDArray]:
    """Generate a synthetic binary-label oracle dataset.

    Parameters
    ----------
    n_total : int
        Total number of samples. All samples have both true and proxy labels.
    true_mean : float
        Expected mean value of the true labels.
    proxy_mean : float
        Expected mean value of the proxy labels.
    correlation : float
        Pearson correlation between true and proxy labels.
    random_seed : int or np.random.SeedSequence, optional
        Seed for reproducibility.

    Returns
    -------
    Tuple[NDArray, NDArray]
        [0]: array of shape ``(n_total,)``, y_true containing ground-truth labels.
        [1]: array of shape ``(n_total,)``, y_proxy containing proxy labels.

    Raises
    ------
    ValueError
        If ``true_mean`` is not in (0, 1).
    ValueError
        If ``proxy_mean`` is not in (0, 1).
    ValueError
        If the combination of ``true_mean``, ``proxy_mean``, and ``correlation`` is
        impossible (leads to negative joint probabilities).

    Notes
    -----
    **Step 1 — Joint distribution**

    For two binary variables with marginals ``p_t = P(y_true=1)`` and
    ``p_p = P(y_proxy=1)``, the Pearson correlation uniquely determines the
    joint distribution.  Let ``D = sqrt(p_t * p_p * (1-p_t) * (1-p_p))``
    (product of standard deviations).  Then:

    ```
    p11 = P(y_true=1, y_proxy=1) = correlation * D + p_t * p_p
    p00 = P(y_true=0, y_proxy=0) = 1 - p_t - p_p + p11
    p01 = P(y_true=0, y_proxy=1) = p_p - p11
    p10 = P(y_true=1, y_proxy=0) = p_t - p11
    ```

    These four probabilities must all be non-negative; otherwise the
    parameter combination is impossible and a ``ValueError`` is raised. Each
    probability becomes negative under the following condition:

    ```
    p11 < 0 for correlation < -(p_t * p_p) / D
    p00 < 0 for correlation < (p_t + p_p - p_t * p_p - 1) / D
    p01 < 0 for correlation > p_p * (1 - p_t) / D
    p10 < 0 for correlation > p_t * (1 - p_p) / D
    ```

    Therefore, the correlation must satisfy:

    ```
    max(-p_t * p_p, p_t + p_p - p_t * p_p - 1) <= correlation * D <= min(p_t * (1 - p_p), p_p * (1 - p_t))
    ```

    **Step 2 — Conditional probabilities**

    The joint probability ``p11`` determines the conditional probability of
    ``y_proxy`` given each value of ``y_true``:

    ```
    p11 = correlation * D + p_t * p_p
    p01 = p_p - p11

    P(y_proxy = 1 | y_true = 1) = p11 / p_t
    P(y_proxy = 1 | y_true = 0) = p01 / (1 - p_t)
    ```

    **Step 3 — Two-stage generation**

    ``y_true`` is sampled first for all ``n_total`` observations from a
    ``Bernoulli(p_t)`` distribution.  For each observation, the corresponding
    conditional probability from Step 2 is selected, and ``y_proxy`` is then
    drawn from that conditional Bernoulli distribution:

    ```
    y_true_i ~ Bernoulli(p_t)
    y_proxy_i | y_true_i ~ Bernoulli(P(y_proxy = 1 | y_true_i))
    ```

    References
    ----------
    .. [SO] `Correlation between Bernoulli Variables <https://math.stackexchange.com/questions/610443/finding-a-correlation-between-bernoulli-variables>`_

    Examples
    --------
    >>> import numpy as np
    >>> from glide.simulators import generate_binary_dataset
    >>> y_true, y_proxy = generate_binary_dataset(n_total=8, random_seed=42)
    >>> len(y_true)
    8
    >>> len(y_proxy)
    8
    >>> bool(np.all(np.isin(y_true, [0.0, 1.0])))
    True
    >>> bool(np.all(np.isin(y_proxy, [0.0, 1.0])))
    True
    """
    _validate_bounds(true_mean, "true_mean", lower=0, upper=1, left_inclusive=False, right_inclusive=False)
    _validate_bounds(proxy_mean, "proxy_mean", lower=0, upper=1, left_inclusive=False, right_inclusive=False)

    p_t = true_mean
    p_p = proxy_mean

    D = np.sqrt(p_t * p_p * (1 - p_t) * (1 - p_p))

    min_possible_correlation = max(-p_t * p_p, p_p + p_t - 1 - p_t * p_p) / D
    max_possible_correlation = min(p_t * (1 - p_p), p_p * (1 - p_t)) / D
    if correlation < min_possible_correlation or correlation > max_possible_correlation:
        raise ValueError(
            f"Impossible combination of 'true_mean'={true_mean!r}, 'proxy_mean'={proxy_mean!r}, "
            f"and 'correlation'={correlation!r}: leads to negative joint probabilities; "
            f"possible 'correlation' values are in the range ({min_possible_correlation:.3f}"
            f", {max_possible_correlation:.3f})."
        )

    rng = np.random.default_rng(seed=random_seed)
    y_true = rng.binomial(1, p_t, size=n_total).astype(float)

    p11 = correlation * D + p_t * p_p
    p01 = p_p - p11
    cond_prob_given_1 = p11 / p_t
    cond_prob_given_0 = p01 / (1 - p_t)

    cond_probs = np.where(y_true.astype(bool), cond_prob_given_1, cond_prob_given_0)
    cond_probs = np.clip(cond_probs, 0.0, 1.0)
    y_proxy = rng.binomial(1, cond_probs).astype(float)

    return y_true, y_proxy

generate_binary_dataset_with_oracle_sampling

generate_binary_dataset_with_oracle_sampling(
    n_total,
    true_mean=0.7,
    proxy_mean=0.6,
    correlation=0.8,
    random_seed=None,
)

Generate a synthetic binary dataset with oracle sampling probabilities.

All n_total samples have ground-truth labels (y_true_oracle), proxy predictions (y_proxy), and an oracle uncertainty score derived from the analytical proxy error. The uncertainty values are non-uniform: samples where the proxy is less reliable receive higher uncertainty following the optimal sampling rule.

Sampling is driven by a latent variable that sets the correlation between y_true_oracle and y_proxy for each sample. The latent variable is drawn uniformly, so the per-sample correlation is spread symmetrically around the given correlation value, with the spread limited to the interval of correlation levels that are feasible for true_mean and proxy_mean. As a result, the correlation between y_true_oracle and y_proxy matches the target value on average.

Parameters:

Name	Type	Description	Default
`n_total`	`int`	Total number of samples.	required
`true_mean`	`float`	Expected mean of y_true_oracle. Must be in (0, 1).	`0.7`
`proxy_mean`	`float`	Expected mean of y_proxy. Must be in (0, 1).	`0.6`
`correlation`	`float`	Pearson correlation between y_true_oracle and y_proxy (marginal, across all samples).	`0.8`
`random_seed`	`int`	Seed for reproducibility.	`None`

Returns:

Type	Description
`Tuple[NDArray, NDArray, NDArray]`	[0]: array of shape `(n_total,)`, y_true_oracle with the full ground-truth labels for all n_total samples (no NaN); use `simulate_annotation` to mask unlabeled rows. [1]: array of shape `(n_total,)`, y_proxy with proxy predictions. [2]: array of shape `(n_total,)`, uncertainty (oracle uncertainty score) per sample.

Raises:

Type	Description
`ValueError`	If true_mean is not in (0, 1).
`ValueError`	If proxy_mean is not in (0, 1).
`ValueError`	If the combination of true_mean, proxy_mean, and correlation leads to negative joint probabilities.

Notes

Step 1 — Global joint distribution

For two binary variables with marginals p_t = P(y_true_oracle=1) and p_p = P(y_proxy=1), the Pearson correlation uniquely determines the joint distribution. Let D = sqrt(p_t * p_p * (1-p_t) * (1-p_p)) (product of standard deviations). Then:

p11 = P(y_true_oracle=1, y_proxy=1) = correlation * D + p_t * p_p
p00 = P(y_true_oracle=0, y_proxy=0) = 1 - p_t - p_p + p11
p01 = P(y_true_oracle=0, y_proxy=1) = p_p - p11
p10 = P(y_true_oracle=1, y_proxy=0) = p_t - p11

These four probabilities are fully determined by (p_t, p_p, correlation) and must all be non-negative; otherwise the parameter combination is impossible and a ValueError is raised. Each probability becomes negative under the following condition:

p11 < 0 for correlation < -(p_t * p_p) / D
p00 < 0 for correlation < (p_t + p_p - p_t * p_p - 1) / D
p01 < 0 for correlation > p_p * (1 - p_t) / D
p10 < 0 for correlation > p_t * (1 - p_p) / D

Therefore, the correlation must satisfy:

max(-p_t * p_p, p_t + p_p - p_t * p_p - 1) <= correlation * D <= min(p_t * (1 - p_p), p_p * (1 - p_t))

Step 2 — Latent variable x and per-sample correlation

Each sample receives a latent value x_i ~ Uniform(-1, 1) representing "annotation difficulty". The per-sample Pearson correlation is defined as:

corr(x_i) = correlation + correlation_spread * x_i

Because E[x] = 0 for x ~ Uniform(-1, 1), the marginal correlation E[corr(X)] = correlation exactly, preserving the target value on average. Samples with low x have lower conditional correlation (proxy less reliable → higher uncertainty); samples with high x have higher conditional correlation (proxy more reliable → lower uncertainty).

correlation_spread is set to 90 % of max_safe_correlation_spread, the largest spread for which all four per-sample probabilities remain non-negative for every x in [-1, 1]:

max_safe_correlation_spread = min(p00, p01, p10, p11) / D
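The effect of the 90 % margin can be illustrated numerically. The sketch below assumes the default means and an illustrative correlation of 0.5, then scans x over [-1, 1] to confirm that every per-sample cell probability stays positive; none of the variable names are part of the library's API.

```python
import numpy as np

p_t, p_p, correlation = 0.7, 0.6, 0.5  # correlation chosen for illustration

d = np.sqrt(p_t * p_p * (1 - p_t) * (1 - p_p))
p11 = correlation * d + p_t * p_p
p00, p01, p10 = 1 - p_t - p_p + p11, p_p - p11, p_t - p11

# Largest spread that keeps every cell non-negative, then the 90 % margin.
max_safe_spread = min(p00, p01, p10, p11) / d
spread = 0.9 * max_safe_spread

# Scan the latent variable over its whole support.
x = np.linspace(-1.0, 1.0, 201)
corr_x = correlation + spread * x
p11_x = corr_x * d + p_t * p_p
cells = np.stack([1 - p_t - p_p + p11_x, p_p - p11_x, p_t - p11_x, p11_x])
smallest_cell = cells.min()
print(spread, smallest_cell)
```

The smallest cell over the scan equals 10 % of the smallest global cell, which is the slack left by the 0.9 factor.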

Step 3 — Per-sample probabilities and error probability

p11 becomes a function of x, and the other three probabilities follow from it through the fixed marginals:

p11(x) = corr(x) * D + p_t * p_p          # varies with x
error_prob(x) = p01(x) + p10(x)
                = p_t + p_p - 2 * p11(x)    # proxy ≠ y_true_oracle

error_prob(x) is the per-sample proxy error probability, which decreases linearly as x increases (higher x → better proxy).
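A small sketch makes the linear dependence on x concrete. The function name `error_probability` and the spread value of 0.2 are assumptions made for illustration (0.2 is below the safe maximum for these parameters).

```python
import numpy as np


def error_probability(x, p_t, p_p, correlation, spread):
    """Per-sample probability that the proxy differs from the true label."""
    d = np.sqrt(p_t * p_p * (1 - p_t) * (1 - p_p))
    p11_x = (correlation + spread * x) * d + p_t * p_p
    return p_t + p_p - 2.0 * p11_x


x = np.array([-1.0, 0.0, 1.0])
err = error_probability(x, p_t=0.7, p_p=0.6, correlation=0.5, spread=0.2)
print(err)  # decreasing in x, with equal steps between the three values
```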

Step 4 — Vectorized CDF inversion

Since each sample has its own probability vector, numpy.random.choice (which takes a single fixed probability vector) cannot be used. Instead, the four outcomes (0,0), (0,1), (1,0), (1,1) are encoded as integers 0–3 and sampled via cumulative-threshold comparison on a single u ~ Uniform(0,1) draw:

u < p00(x)                 → outcome 0 : (y_true_oracle=0, y_proxy=0)
u < p00(x)+p01(x)          → outcome 1 : (y_true_oracle=0, y_proxy=1)
u < p00(x)+p01(x)+p10(x)   → outcome 2 : (y_true_oracle=1, y_proxy=0)
else                       → outcome 3 : (y_true_oracle=1, y_proxy=1)

The crucial simplification is that the second threshold collapses to the constant 1 - p_t (independent of x), because:

p00(x) + p01(x) = (1-p_t-p_p+p11) + (p_p-p11) = 1 - p_t

The third threshold simplifies in the same way:

p00(x) + p01(x) + p10(x) = 1 - p11(x)

This means only two of the three thresholds require per-sample arrays. The outcome integer encodes both labels: y_true_oracle = outcome // 2, y_proxy = outcome % 2.
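The threshold comparison can be sketched as below. The per-sample p11 values are drawn from an arbitrary feasible interval purely for illustration, and the second formulation (counting how many thresholds u has passed) is an equivalent alternative that is not claimed to be what the library does.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
p_t, p_p = 0.7, 0.6

# Illustrative per-sample p11 values inside the feasible interval.
p11_x = rng.uniform(0.48, 0.58, size=n)
p00_x = 1.0 - p_t - p_p + p11_x

# One uniform draw per sample, compared against the cumulative thresholds
# p00(x), 1 - p_t and 1 - p11(x).
u = rng.uniform(size=n)
outcome = np.where(
    u < p00_x, 0, np.where(u < 1.0 - p_t, 1, np.where(u < 1.0 - p11_x, 2, 3))
)

# Equivalent form: count the thresholds that u has reached.
outcome_alt = (
    (u >= p00_x).astype(int) + (u >= 1.0 - p_t) + (u >= 1.0 - p11_x)
)

y_true = outcome // 2
y_proxy = outcome % 2
print(y_true.mean(), y_proxy.mean())
```

Because the second threshold is the constant 1 - p_t, the marginal mean of y_true does not depend on the per-sample p11 values.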

Step 5 — Oracle uncertainty

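Following the optimal sampling rule, the oracle uncertainty of a sample is the root-mean-square proxy error conditional on x: uncertainty = sqrt(E[(y_proxy - y_true_oracle)² | x]) = sqrt(error_prob(x)). The second equality holds because, for binary labels, the squared difference is 1 exactly when the proxy disagrees with y_true_oracle and 0 otherwise. These values are returned directly as uncertainty.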
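The identity between the root-mean-square error and sqrt(error_prob) for binary labels can be checked by simulation. The sketch below fixes a single illustrative joint distribution (one value of p11, standing in for one value of x) and compares the analytical value with a Monte Carlo estimate; the numbers are assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
p_t, p_p, p11 = 0.7, 0.6, 0.5  # illustrative joint for one fixed x

# Cell probabilities in the order p00, p01, p10, p11.
probs = [1 - p_t - p_p + p11, p_p - p11, p_t - p11, p11]
outcome = rng.choice(4, size=200_000, p=probs)
y_true, y_proxy = outcome // 2, outcome % 2

error_prob = p_t + p_p - 2 * p11          # P(y_proxy != y_true)
uncertainty = np.sqrt(error_prob)         # analytical value
empirical = np.sqrt(np.mean((y_proxy - y_true) ** 2))  # Monte Carlo estimate
print(uncertainty, empirical)
```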

Examples:

>>> import numpy as np
>>> from glide.simulators import generate_binary_dataset_with_oracle_sampling
>>> y_true_oracle, y_proxy, uncertainty = generate_binary_dataset_with_oracle_sampling(n_total=4, random_seed=0)
>>> len(y_true_oracle)
4
>>> len(y_proxy)
4
>>> len(uncertainty)
4
>>> bool(np.all(np.isin(y_true_oracle, [0.0, 1.0])))
True
>>> bool(np.all(np.isin(y_proxy, [0.0, 1.0])))
True
>>> bool(np.all((uncertainty >= 0) & (uncertainty <= 1)))
True

Source code in glide/simulators/oracle_binary.py

def generate_binary_dataset_with_oracle_sampling(
    n_total: int,
    true_mean: float = 0.7,
    proxy_mean: float = 0.6,
    correlation: float = 0.8,
    random_seed: Optional[int] = None,
) -> Tuple[NDArray, NDArray, NDArray]:
    """Generate a synthetic binary dataset with oracle sampling probabilities.

    All n_total samples have ground-truth labels (y_true_oracle), proxy predictions (y_proxy),
    and an oracle uncertainty score derived from the analytical
    proxy error. The uncertainty values are non-uniform: samples where the proxy is less
    reliable receive higher uncertainty following the optimal sampling rule.

    Sampling is driven by a latent variable that sets the correlation between
    y_true_oracle and y_proxy for each sample. The latent variable is drawn
    uniformly, so the per-sample correlation is spread symmetrically around the
    given correlation value, with the spread limited to the interval of
    correlation levels that are feasible for true_mean and proxy_mean. As a
    result, the correlation between y_true_oracle and y_proxy matches the target
    value on average.

    Parameters
    ----------
    n_total : int
        Total number of samples.
    true_mean : float
        Expected mean of y_true_oracle. Must be in (0, 1).
    proxy_mean : float
        Expected mean of y_proxy. Must be in (0, 1).
    correlation : float
        Pearson correlation between y_true_oracle and y_proxy (marginal, across all samples).
    random_seed : int, optional
        Seed for reproducibility.

    Returns
    -------
    Tuple[NDArray, NDArray, NDArray]
        [0]: array of shape ``(n_total,)``, y_true_oracle with the full ground-truth labels for all n_total
        samples (no NaN); use ``simulate_annotation`` to mask unlabeled rows.
        [1]: array of shape ``(n_total,)``, y_proxy with proxy predictions.
        [2]: array of shape ``(n_total,)``, uncertainty (oracle uncertainty score) per sample.

    Raises
    ------
    ValueError
        If true_mean is not in (0, 1).
    ValueError
        If proxy_mean is not in (0, 1).
    ValueError
        If the combination of true_mean, proxy_mean, and correlation leads to
        negative joint probabilities.

    Notes
    -----
    **Step 1 — Global joint distribution**

    For two binary variables with marginals ``p_t = P(y_true_oracle=1)`` and
    ``p_p = P(y_proxy=1)``, the Pearson correlation uniquely determines the
    joint distribution.  Let ``D = sqrt(p_t * p_p * (1-p_t) * (1-p_p))``
    (product of standard deviations).  Then:

    ```
    p11 = P(y_true_oracle=1, y_proxy=1) = correlation * D + p_t * p_p
    p00 = P(y_true_oracle=0, y_proxy=0) = 1 - p_t - p_p + p11
    p01 = P(y_true_oracle=0, y_proxy=1) = p_p - p11
    p10 = P(y_true_oracle=1, y_proxy=0) = p_t - p11
    ```

    These four probabilities are fully determined by ``(p_t, p_p, correlation)``
    and must all be non-negative; otherwise the parameter combination is
    impossible and a ``ValueError`` is raised. Each probability becomes
    negative under the following condition:

    ```
    p11 < 0 for correlation < -(p_t * p_p) / D
    p00 < 0 for correlation < (p_t + p_p - p_t * p_p - 1) / D
    p01 < 0 for correlation > p_p * (1 - p_t) / D
    p10 < 0 for correlation > p_t * (1 - p_p) / D
    ```

    Therefore, the correlation must satisfy:

    ```
    max(-p_t * p_p, p_t + p_p - p_t * p_p - 1) <= correlation * D <= min(p_t * (1 - p_p), p_p * (1 - p_t))
    ```

    **Step 2 — Latent variable x and per-sample correlation**

    Each sample receives a latent value ``x_i ~ Uniform(-1, 1)`` representing
    "annotation difficulty".  The per-sample Pearson correlation is defined as:

    ```
    corr(x_i) = correlation + correlation_spread * x_i
    ```

    Because ``E[x] = 0`` for ``x ~ Uniform(-1, 1)``, the marginal
    correlation ``E[corr(X)] = correlation`` exactly, preserving the target
    value on average.  Samples with low ``x`` have lower conditional
    correlation (proxy less reliable → higher uncertainty); samples with high
    ``x`` have higher conditional correlation (proxy more reliable → lower uncertainty).

    ``correlation_spread`` is set to 90 % of ``max_safe_correlation_spread``,
    the largest spread for which all four per-sample probabilities remain
    non-negative for every ``x in [-1, 1]``:

    ```
    max_safe_correlation_spread = min(p00, p01, p10, p11) / D
    ```

    **Step 3 — Per-sample probabilities and error probability**

    ``p11`` becomes a function of ``x``, and the other three probabilities
    follow from it through the fixed marginals:

    ```
    p11(x) = corr(x) * D + p_t * p_p          # varies with x
    error_prob(x) = p01(x) + p10(x)
                    = p_t + p_p - 2 * p11(x)    # proxy ≠ y_true_oracle
    ```

    ``error_prob(x)`` is the per-sample proxy error probability, which
    decreases linearly as ``x`` increases (higher x → better proxy).

    **Step 4 — Vectorized CDF inversion**

    Since each sample has its own probability vector, ``numpy.random.choice``
    (which takes a single fixed probability vector) cannot be used.  Instead,
    the four outcomes ``(0,0), (0,1), (1,0), (1,1)`` are encoded as integers
    0–3 and sampled via cumulative-threshold comparison on a single
    ``u ~ Uniform(0,1)`` draw:

    ```
    u < p00(x)                 → outcome 0 : (y_true_oracle=0, y_proxy=0)
    u < p00(x)+p01(x)          → outcome 1 : (y_true_oracle=0, y_proxy=1)
    u < p00(x)+p01(x)+p10(x)   → outcome 2 : (y_true_oracle=1, y_proxy=0)
    else                       → outcome 3 : (y_true_oracle=1, y_proxy=1)
    ```

    The crucial simplification is that the second threshold collapses to the
    constant ``1 - p_t`` (independent of ``x``), because:

    ```
    p00(x) + p01(x) = (1-p_t-p_p+p11) + (p_p-p11) = 1 - p_t
    ```

    The third threshold simplifies in the same way:

    ```
    p00(x) + p01(x) + p10(x) = 1 - p11(x)
    ```

    This means only two of the three thresholds require per-sample arrays.
    The outcome integer encodes both labels: ``y_true_oracle = outcome // 2``,
    ``y_proxy = outcome % 2``.

    **Step 5 — Oracle uncertainty**

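    Following the optimal sampling rule, the oracle uncertainty of a sample
    is the root-mean-square proxy error conditional on ``x``:
    ``uncertainty = sqrt(E[(y_proxy - y_true_oracle)² | x]) = sqrt(error_prob(x))``.
    The second equality holds because, for binary labels, the squared
    difference is 1 exactly when the proxy disagrees with ``y_true_oracle``
    and 0 otherwise. These values are returned directly as ``uncertainty``.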

    Examples
    --------
    >>> import numpy as np
    >>> from glide.simulators import generate_binary_dataset_with_oracle_sampling
    >>> y_true_oracle, y_proxy, uncertainty = generate_binary_dataset_with_oracle_sampling(n_total=4, random_seed=0)
    >>> len(y_true_oracle)
    4
    >>> len(y_proxy)
    4
    >>> len(uncertainty)
    4
    >>> bool(np.all(np.isin(y_true_oracle, [0.0, 1.0])))
    True
    >>> bool(np.all(np.isin(y_proxy, [0.0, 1.0])))
    True
    >>> bool(np.all((uncertainty >= 0) & (uncertainty <= 1)))
    True
    """
    _validate_bounds(true_mean, "true_mean", lower=0, upper=1, left_inclusive=False, right_inclusive=False)
    _validate_bounds(proxy_mean, "proxy_mean", lower=0, upper=1, left_inclusive=False, right_inclusive=False)

    rng = np.random.default_rng(seed=random_seed)
    p_t = true_mean
    p_p = proxy_mean

    # std product of the variable pair will be used multiple times
    D = np.sqrt(p_t * p_p * (1 - p_t) * (1 - p_p))

    # some combinations of true_mean, proxy_mean and correlation are impossible
    # and lead to negative probabilities, raise an error if this is the case
    min_possible_correlation = max(-p_t * p_p, p_p + p_t - 1 - p_t * p_p) / D
    max_possible_correlation = min(p_t * (1 - p_p), p_p * (1 - p_t)) / D
    if correlation < min_possible_correlation or correlation > max_possible_correlation:
        raise ValueError(
            f"Impossible combination of 'true_mean'={true_mean!r}, 'proxy_mean'={proxy_mean!r}, "
            f"and 'correlation'={correlation!r}: leads to negative joint probabilities; "
            f"possible 'correlation' values are in the range ({min_possible_correlation:.3f}"
            f", {max_possible_correlation:.3f})."
        )

    # Global (marginal) joint distribution — same as generate_binary_dataset
    p11 = correlation * D + p_t * p_p
    p00 = 1 - p_t - p_p + p11
    p01 = p_p - p11
    p10 = p_t - p11
    probs = [p00, p01, p10, p11]

    # Spread parameter: modulates the conditional correlation across samples
    max_safe_correlation_spread = min(probs) / D
    correlation_spread = 0.9 * max_safe_correlation_spread

    # Latent variable: controls per-sample proxy correlation
    x = rng.uniform(-1.0, 1.0, size=n_total)

    # Per-sample conditional joint distribution
    correlation_x = correlation + correlation_spread * x
    p11_x = correlation_x * D + p_t * p_p
    error_prob_x = p_t + p_p - 2.0 * p11_x

    # Vectorized CDF inversion to sample (y_true, y_proxy) per sample
    p00_x = 1.0 - p_t - p_p + p11_x
    u = rng.uniform(0.0, 1.0, size=n_total)
    samples = np.where(
        u < p00_x,
        0,
        np.where(
            u < 1.0 - p_t,
            1,
            np.where(u < 1.0 - p11_x, 2, 3),
        ),
    )
    y_true_oracle_arr = samples // 2
    y_proxy_arr = samples % 2

    # Oracle uncertainty: sqrt(P(error | x_i))
    uncertainty = np.sqrt(error_prob_x)

    return y_true_oracle_arr.astype(float), y_proxy_arr.astype(float), uncertainty

generate_clustered_binary_dataset

generate_clustered_binary_dataset(
    n_total,
    n_clusters,
    true_mean=0.7,
    proxy_mean=0.6,
    correlation=0.8,
    within_cluster_diversity=0.9,
    random_seed=None,
)

Generate a synthetic clustered binary-label dataset for evaluation.

Draws n_total i.i.d. (y_true, y_proxy) pairs from the joint binary distribution defined by true_mean, proxy_mean, and correlation, then randomly partitions the observations into n_clusters non-empty groups.

Parameters:

Name	Type	Description	Default
`n_total`	`int`	Exact total number of observations across all clusters.	required
`n_clusters`	`int`	Exact number of clusters. Must be at least 2.	required
`true_mean`	`float`	Expected mean value of the true labels. Must be in `(0, 1)`.	`0.7`
`proxy_mean`	`float`	Expected mean value of the proxy labels. Must be in `(0, 1)`.	`0.6`
`correlation`	`float`	Pearson correlation between true and proxy labels.	`0.8`
`within_cluster_diversity`	`float`	Controls how many distinct label pairs exist within each cluster. Each cluster retains max(1, floor(within_cluster_diversity * cluster_size)) of its observations as originals; the rest have their labels resampled from those originals, producing repetition. A value of 0 collapses each cluster to a single label pair; a value of 1 leaves all labels unchanged. Must be in [0, 1].	`0.9`
`random_seed`	`int or SeedSequence`	Seed for reproducibility.	`None`

Returns:

Type	Description
`Tuple[NDArray, NDArray, NDArray]`	[0]: `y_true` — shape `(n_total,)`, values in `{0.0, 1.0}`. [1]: `y_proxy` — shape `(n_total,)`, values in `{0.0, 1.0}`. [2]: `clusters` — shape `(n_total,)`, integer cluster identifiers in `{0, 1, ..., n_clusters - 1}`.

Raises:

Type	Description
`ValueError`	If `true_mean` is not in `(0, 1)`.
`ValueError`	If `proxy_mean` is not in `(0, 1)`.
`ValueError`	If the combination of `true_mean`, `proxy_mean`, and `correlation` is impossible (leads to negative joint probabilities).
`ValueError`	If `n_clusters < 2`.
`ValueError`	If `n_total < n_clusters`.

Notes

Step 1 — Draw observations

Call generate_binary_dataset(n_total, ...) to obtain n_total i.i.d. (y_true, y_proxy) pairs from the joint binary distribution defined by true_mean, proxy_mean, and correlation.

Step 2 — Random cluster partition

Draw n_clusters - 1 cut positions uniformly without replacement from {1, 2, ..., n_total - 1} and sort them. Combined with 0 and n_total, these define n_clusters contiguous intervals of random lengths that sum to n_total. Assign cluster identifier k to all observations whose position falls in the k-th interval. Every cluster contains at least 1 observation by construction.

Step 3 — Shuffle

Randomly permute the cluster identifier array so that cluster membership is not determined by position in the output.
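The partition and shuffle steps can be sketched as follows. `random_partition_sizes` is a hypothetical helper written for this illustration; it draws the cut positions, converts them to interval lengths, and the resulting identifier array is then permuted.

```python
import numpy as np


def random_partition_sizes(n_total, n_clusters, rng):
    """Split n_total items into n_clusters non-empty contiguous intervals."""
    # Cut positions are distinct integers in {1, ..., n_total - 1}.
    cuts = np.sort(rng.choice(np.arange(1, n_total), size=n_clusters - 1, replace=False))
    # Differences between consecutive boundaries give the interval lengths.
    return np.diff(np.concatenate(([0], cuts, [n_total])))


rng = np.random.default_rng(0)
sizes = random_partition_sizes(n_total=10, n_clusters=4, rng=rng)

# Contiguous identifiers first, then a random permutation so that
# cluster membership no longer follows the position in the array.
clusters = np.repeat(np.arange(4), sizes)
shuffled = rng.permutation(clusters)
print(sizes, shuffled)
```

Since the cut positions are distinct and lie strictly between 0 and n_total, every interval has length at least 1, and shuffling preserves the cluster sizes.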

Step 4 — Reduce within-cluster diversity

For each cluster, draw a random permutation of its observation indices. The first max(1, floor(within_cluster_diversity * cluster_size)) permuted positions retain their original labels; the remaining positions have their labels replaced by a uniform resample from those originals. Setting within_cluster_diversity to 0 produces clusters where all observations share a single label pair; setting it to 1 leaves the dataset unchanged. As a result, when within_cluster_diversity < 1, observations within a cluster are no longer independent. Standard correlation estimators (such as np.corrcoef) rely on this independence assumption and will therefore generally not recover the correlation input.
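The resampling rule for a single cluster can be sketched as below. `reduce_diversity` is a hypothetical helper that operates on one label array only; it is meant to show the two extreme settings (0 and 1), not to reproduce the library function.

```python
import numpy as np


def reduce_diversity(labels, diversity, rng):
    """Keep a fraction of labels as originals and resample the rest from them."""
    labels = labels.copy()
    size = len(labels)
    n_sources = max(1, int(diversity * size))  # at least one original survives
    order = rng.permutation(size)
    sources, copies = order[:n_sources], order[n_sources:]
    # Overwrite the non-source positions with draws from the source labels.
    labels[copies] = rng.choice(labels[sources], size=len(copies))
    return labels


rng = np.random.default_rng(0)
cluster_labels = np.array([0.0, 1.0, 1.0, 0.0, 1.0, 1.0])
collapsed = reduce_diversity(cluster_labels, 0.0, rng)  # single repeated label
unchanged = reduce_diversity(cluster_labels, 1.0, rng)  # identical to the input
print(collapsed, unchanged)
```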

Examples:

>>> import numpy as np
>>> from glide.simulators import generate_clustered_binary_dataset
>>> y_true, y_proxy, clusters = generate_clustered_binary_dataset(
...     n_total=10, n_clusters=4, random_seed=0
... )
>>> y_true
array([0., 1., 1., 1., 1., 1., 1., 0., 1., 1.])
>>> y_proxy
array([0., 1., 1., 0., 1., 1., 1., 0., 1., 1.])
>>> clusters
array([3, 0, 3, 1, 0, 3, 3, 2, 0, 0])

Source code in glide/simulators/clustered_binary.py

def generate_clustered_binary_dataset(
    n_total: int,
    n_clusters: int,
    true_mean: float = 0.7,
    proxy_mean: float = 0.6,
    correlation: float = 0.8,
    within_cluster_diversity: float = 0.9,
    random_seed: Optional[Union[int, np.random.SeedSequence]] = None,
) -> Tuple[NDArray, NDArray, NDArray]:
    """Generate a synthetic clustered binary-label dataset for evaluation.

    Draws ``n_total`` i.i.d. ``(y_true, y_proxy)`` pairs from the
    joint binary distribution defined by ``true_mean``, ``proxy_mean``, and
    ``correlation``, then randomly partitions the observations into
    ``n_clusters`` non-empty groups.

    Parameters
    ----------
    n_total : int
        Exact total number of observations across all clusters.
    n_clusters : int
        Exact number of clusters. Must be at least 2.
    true_mean : float
        Expected mean value of the true labels. Must be in ``(0, 1)``.
    proxy_mean : float
        Expected mean value of the proxy labels. Must be in ``(0, 1)``.
    correlation : float
        Pearson correlation between true and proxy labels.
    within_cluster_diversity : float
        Controls how many distinct label pairs exist within each cluster.
        Each cluster retains max(1, floor(within_cluster_diversity * cluster_size))
        of its observations as originals; the rest have their labels resampled from
        those originals, producing repetition. A value of 0 collapses each cluster
        to a single label pair; a value of 1 leaves all labels unchanged. Must be
        in [0, 1].
    random_seed : int or np.random.SeedSequence, optional
        Seed for reproducibility.

    Returns
    -------
    Tuple[NDArray, NDArray, NDArray]
        [0]: ``y_true`` — shape ``(n_total,)``, values in ``{0.0, 1.0}``.
        [1]: ``y_proxy`` — shape ``(n_total,)``, values in ``{0.0, 1.0}``.
        [2]: ``clusters`` — shape ``(n_total,)``, integer cluster
             identifiers in ``{0, 1, ..., n_clusters - 1}``.

    Raises
    ------
    ValueError
        If ``true_mean`` is not in ``(0, 1)``.
    ValueError
        If ``proxy_mean`` is not in ``(0, 1)``.
    ValueError
        If the combination of ``true_mean``, ``proxy_mean``, and
        ``correlation`` is impossible (leads to negative joint probabilities).
    ValueError
        If ``n_clusters < 2``.
    ValueError
        If ``n_total < n_clusters``.

    Notes
    -----
    **Step 1 — Draw observations**

    Call ``generate_binary_dataset(n_total, ...)`` to obtain ``n_total``
    i.i.d. ``(y_true, y_proxy)`` pairs from the joint binary
    distribution defined by ``true_mean``, ``proxy_mean``, and
    ``correlation``.

    **Step 2 — Random cluster partition**

    Draw ``n_clusters - 1`` cut positions uniformly without replacement from
    ``{1, 2, ..., n_total - 1}`` and sort them. Combined with ``0`` and
    ``n_total``, these define ``n_clusters`` contiguous intervals of random
    lengths that sum to ``n_total``. Assign cluster identifier ``k`` to all
    observations whose position falls in the ``k``-th interval. Every cluster
    contains at least 1 observation by construction.

    **Step 3 — Shuffle**

    Randomly permute the cluster identifier array so that cluster membership is
    not determined by position in the output.

    **Step 4 — Reduce within-cluster diversity**

    For each cluster, draw a random permutation of its observation indices.
    The first max(1, floor(within_cluster_diversity * cluster_size)) permuted
    positions retain their original labels; the remaining positions have their
    labels replaced by a uniform resample from those originals. Setting
    ``within_cluster_diversity`` to 0 produces clusters where all observations
    share a single label pair; setting it to 1 leaves the dataset unchanged.
    As a result, when ``within_cluster_diversity < 1``, observations within a
    cluster are no longer independent. Standard correlation estimators (such as
    ``np.corrcoef``) rely on this independence assumption and will therefore
    generally not recover the ``correlation`` input.

    Examples
    --------
    >>> import numpy as np
    >>> from glide.simulators import generate_clustered_binary_dataset
    >>> y_true, y_proxy, clusters = generate_clustered_binary_dataset(
    ...     n_total=10, n_clusters=4, random_seed=0
    ... )
    >>> y_true
    array([0., 1., 1., 1., 1., 1., 1., 0., 1., 1.])
    >>> y_proxy
    array([0., 1., 1., 0., 1., 1., 1., 0., 1., 1.])
    >>> clusters
    array([3, 0, 3, 1, 0, 3, 3, 2, 0, 0])
    """
    _validate_bounds(n_clusters, "n_clusters", lower=2, error_message=f"'n_clusters' must be >= 2; got {n_clusters}.")
    _validate_bounds(
        n_total,
        "n_total",
        lower=n_clusters,
        error_message=f"'n_total' must be >= 'n_clusters'; got n_total={n_total} and n_clusters={n_clusters}.",
    )
    _validate_bounds(within_cluster_diversity, "within_cluster_diversity", lower=0, upper=1)

    if isinstance(random_seed, np.random.SeedSequence):
        seed_sequence = random_seed
    else:
        seed_sequence = np.random.SeedSequence(random_seed)
    data_seed, partition_seed = seed_sequence.spawn(2)

    y_true, y_proxy = generate_binary_dataset(
        n_total=n_total,
        true_mean=true_mean,
        proxy_mean=proxy_mean,
        correlation=correlation,
        random_seed=data_seed,
    )

    rng = np.random.default_rng(partition_seed)

    cut_positions = np.sort(rng.choice(n_total - 1, size=n_clusters - 1, replace=False) + 1)
    interval_lengths = np.diff(np.hstack([[0], cut_positions, [n_total]]))
    clusters = np.repeat(np.arange(n_clusters, dtype=np.int64), interval_lengths)
    rng.shuffle(clusters)

    for cluster_id in range(n_clusters):
        cluster_indices = np.where(clusters == cluster_id)[0]
        cluster_size = len(cluster_indices)
        n_sources = max(1, int(within_cluster_diversity * cluster_size))
        permutation = rng.permutation(cluster_size)
        source_indices = cluster_indices[permutation[:n_sources]]
        copy_indices = cluster_indices[permutation[n_sources:]]
        y_true[copy_indices] = rng.choice(y_true[source_indices], size=len(copy_indices))
        y_proxy[copy_indices] = rng.choice(y_proxy[source_indices], size=len(copy_indices))

    return y_true, y_proxy, clusters
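
The random partition of Step 2 can be sketched in isolation. This is a minimal illustration, not part of `glide`: the helper name `random_partition_sizes` is hypothetical and only mirrors the NumPy calls in the source above.

```python
import numpy as np


def random_partition_sizes(n_total, n_clusters, rng):
    """Hypothetical helper: split n_total items into n_clusters non-empty intervals."""
    # Draw n_clusters - 1 distinct cut positions from {1, ..., n_total - 1}.
    cuts = np.sort(rng.choice(n_total - 1, size=n_clusters - 1, replace=False) + 1)
    # Differences between consecutive boundaries are the interval lengths.
    return np.diff(np.hstack([[0], cuts, [n_total]]))


rng = np.random.default_rng(0)
sizes = random_partition_sizes(n_total=10, n_clusters=4, rng=rng)

# Expand the interval lengths into one cluster identifier per observation.
clusters = np.repeat(np.arange(4), sizes)
```

Because the cut positions are distinct and lie strictly between `0` and `n_total`, every interval length is at least 1, which is why no cluster can be empty.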

generate_gaussian_dataset

generate_gaussian_dataset(
    n_total,
    true_mean=0.7,
    true_std=1,
    proxy_mean=0.6,
    proxy_std=1,
    correlation=0.8,
    random_seed=None,
)

Generate a synthetic Gaussian dataset for evaluation.

Parameters:

Name	Type	Description	Default
`n_total`	`int`	Total number of samples to generate.	required
`true_mean`	`float`	Mean of the true label distribution.	`0.7`
`true_std`	`float`	Standard deviation of the true label distribution.	`1`
`proxy_mean`	`float`	Mean of the proxy label distribution.	`0.6`
`proxy_std`	`float`	Standard deviation of the proxy label distribution.	`1`
`correlation`	`float`	Pearson correlation between true and proxy labels.	`0.8`
`random_seed`	`int`	Seed for reproducibility.	`None`

Returns:

Type	Description
`Tuple[NDArray, NDArray]`	[0]: array of shape `(n_total,)`, oracle true labels. [1]: array of shape `(n_total,)`, proxy labels.

Notes

Target distribution

The goal is to jointly sample (y_true, y_proxy) from a bivariate Gaussian:

(y_true, y_proxy) ~ N(μ, Σ)

where:

μ = (true_mean, proxy_mean)

Σ = [[true_std²,                          ρ · true_std · proxy_std],
     [ρ · true_std · proxy_std,           proxy_std²              ]]

and ρ is the target Pearson correlation.

Step 1 — Cholesky decomposition of Σ

To sample from N(0, Σ), we find a lower-triangular matrix L such that Σ = L @ Lᵀ (Cholesky factor). The construction uses the angle θ = arccos(ρ), so that cos(θ) = ρ and sin(θ) = √(1 - ρ²):

L = [[true_std,                  0                  ],
     [proxy_std · cos(θ),        proxy_std · sin(θ) ]]

One can verify L @ Lᵀ = Σ directly:

L @ Lᵀ = [[true_std²,                    true_std · proxy_std · cos(θ)],
          [true_std · proxy_std · cos(θ), proxy_std² · (cos²(θ)+sin²(θ))]]

       = [[true_std²,                    true_std · proxy_std · ρ],
          [true_std · proxy_std · ρ,     proxy_std²              ]]  = Σ
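
The identity can also be checked numerically. The sketch below uses illustrative values for `true_std`, `proxy_std`, and `rho` (they are assumptions, not library defaults) and compares the angle-based factor with NumPy's own Cholesky routine.

```python
import numpy as np

true_std, proxy_std, rho = 1.0, 2.0, 0.8  # illustrative values only
theta = np.arccos(rho)

# Lower-triangular factor built from the angle, as in Step 1.
L = np.array([
    [true_std, 0.0],
    [proxy_std * np.cos(theta), proxy_std * np.sin(theta)],
])

# Target covariance matrix.
sigma = np.array([
    [true_std**2, rho * true_std * proxy_std],
    [rho * true_std * proxy_std, proxy_std**2],
])

reconstruction_ok = np.allclose(L @ L.T, sigma)
# For |rho| < 1 and positive standard deviations the factor is unique,
# so it should coincide with np.linalg.cholesky.
matches_numpy = np.allclose(L, np.linalg.cholesky(sigma))
```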

Step 2 — Sampling via the linear transform

Let Z be a 2 × n_total matrix whose entries are i.i.d. standard normals Z_i ~ N(0, 1). Then:

Y = L @ Z

gives a 2 × n_total matrix where each column is a zero-mean sample from N(0, Σ). In component form, each column (Z₁, Z₂) maps to:

Y₁ = true_std · Z₁
Y₂ = proxy_std · cos(θ) · Z₁ + proxy_std · sin(θ) · Z₂

The resulting properties are:

- Var(Y₁) = true_std² and Var(Y₂) = proxy_std² (correct marginal variances)
- Cov(Y₁, Y₂) = true_std · proxy_std · cos(θ) = true_std · proxy_std · ρ
- Corr(Y₁, Y₂) = ρ (correct Pearson correlation)

Step 3 — Shifting by the means

Adding the desired means shifts the distribution to N(μ, Σ):

y_true  = true_mean  + Y[0, :]
y_proxy = proxy_mean + Y[1, :]
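
Steps 1 to 3 can be put together in a few lines. This is a minimal sketch that follows the description above rather than calling the library; the sample size and seed are arbitrary choices made so that the empirical moments land close to the targets.

```python
import numpy as np

true_mean, proxy_mean = 0.7, 0.6
true_std, proxy_std, rho = 1.0, 1.0, 0.8
n_total = 100_000  # large sample so empirical moments are close to the targets

rng = np.random.default_rng(0)
theta = np.arccos(rho)
L = np.array([
    [true_std, 0.0],
    [proxy_std * np.cos(theta), proxy_std * np.sin(theta)],
])

# Each column of Z is an independent pair of standard normals.
Z = rng.standard_normal(size=(2, n_total))
Y = L @ Z

# Shift the zero-mean samples by the desired means.
y_true = true_mean + Y[0, :]
y_proxy = proxy_mean + Y[1, :]

empirical_rho = np.corrcoef(y_true, y_proxy)[0, 1]
```

With a sample this large, `empirical_rho` should sit close to the requested `rho`, and the sample means close to `true_mean` and `proxy_mean`.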

Examples:

>>> import numpy as np
>>> from glide.simulators import generate_gaussian_dataset
>>> y_true, y_proxy = generate_gaussian_dataset(n_total=8, random_seed=42)
>>> len(y_true)
8
>>> len(y_proxy)
8

Source code in glide/simulators/gaussian.py

def generate_gaussian_dataset(
    n_total: int,
    true_mean: float = 0.7,
    true_std: float = 1,
    proxy_mean: float = 0.6,
    proxy_std: float = 1,
    correlation: float = 0.8,
    random_seed: Optional[int] = None,
) -> Tuple[NDArray, NDArray]:
    """Generate a synthetic Gaussian dataset for evaluation.

    Parameters
    ----------
    n_total : int
        Total number of samples to generate.
    true_mean : float
        Mean of the true label distribution.
    true_std : float
        Standard deviation of the true label distribution.
    proxy_mean : float
        Mean of the proxy label distribution.
    proxy_std : float
        Standard deviation of the proxy label distribution.
    correlation : float
        Pearson correlation between true and proxy labels.
    random_seed : int, optional
        Seed for reproducibility.

    Returns
    -------
    Tuple[NDArray, NDArray]
        [0]: array of shape ``(n_total,)``, oracle true labels
        [1]: array of shape ``(n_total,)``, proxy labels

    Notes
    -----
    **Target distribution**

    The goal is to jointly sample ``(y_true, y_proxy)`` from a bivariate Gaussian:

    ```
    (y_true, y_proxy) ~ N(μ, Σ)
    ```

    where:

    ```
    μ = (true_mean, proxy_mean)

    Σ = [[true_std²,                          ρ · true_std · proxy_std],
         [ρ · true_std · proxy_std,           proxy_std²              ]]
    ```

    and ``ρ`` is the target Pearson correlation.

    **Step 1 — Cholesky decomposition of Σ**

    To sample from ``N(0, Σ)``, we find a lower-triangular matrix ``L`` such that
    ``Σ = L @ Lᵀ`` (Cholesky factor). The construction uses the angle
    ``θ = arccos(ρ)``, so that ``cos(θ) = ρ`` and ``sin(θ) = √(1 - ρ²)``:

    ```
    L = [[true_std,                  0                  ],
         [proxy_std · cos(θ),        proxy_std · sin(θ) ]]
    ```

    One can verify ``L @ Lᵀ = Σ`` directly:

    ```
    L @ Lᵀ = [[true_std²,                    true_std · proxy_std · cos(θ)],
              [true_std · proxy_std · cos(θ), proxy_std² · (cos²(θ)+sin²(θ))]]

           = [[true_std²,                    true_std · proxy_std · ρ],
              [true_std · proxy_std · ρ,     proxy_std²              ]]  = Σ
    ```

    **Step 2 — Sampling via the linear transform**

    Let ``Z`` be a ``2 × n_total`` matrix whose entries are i.i.d. standard normals
    ``Z_i ~ N(0, 1)``. Then:

    ```
    Y = L @ Z
    ```

    gives a ``2 × n_total`` matrix where each column is a zero-mean sample from
    ``N(0, Σ)``. In component form, each column ``(Z₁, Z₂)`` maps to:

    ```
    Y₁ = true_std · Z₁
    Y₂ = proxy_std · cos(θ) · Z₁ + proxy_std · sin(θ) · Z₂
    ```

    The resulting properties are:
    - ``Var(Y₁) = true_std²`` and ``Var(Y₂) = proxy_std²`` (correct marginal variances)
    - ``Cov(Y₁, Y₂) = true_std · proxy_std · cos(θ) = true_std · proxy_std · ρ``
    - ``Corr(Y₁, Y₂) = ρ`` (correct Pearson correlation)

    **Step 3 — Shifting by the means**

    Adding the desired means shifts the distribution to ``N(μ, Σ)``:

    ```
    y_true  = true_mean  + Y[0, :]
    y_proxy = proxy_mean + Y[1, :]
    ```

    Examples
    --------
    >>> import numpy as np
    >>> from glide.simulators import generate_gaussian_dataset
    >>> y_true, y_proxy = generate_gaussian_dataset(n_total=8, random_seed=42)
    >>> len(y_true)
    8
    >>> len(y_proxy)
    8
    """
    _validate_bounds(correlation, "correlation", lower=-1, upper=1)
    rng = np.random.default_rng(seed=random_seed)
    angle = np.arccos(correlation)
    lin_transform = np.array([[true_std, 0], [proxy_std * np.cos(angle), proxy_std * np.sin(angle)]])

    Y = lin_transform @ rng.standard_normal(size=(2, n_total))

    y_true = true_mean + Y[0, :]
    y_proxy = proxy_mean + Y[1, :]

    return y_true, y_proxy

generate_multi_binary_dataset

generate_multi_binary_dataset(
    n_total,
    true_mean,
    proxy_means,
    correlations,
    random_seed=None,
)

Generate a synthetic binary oracle dataset with multiple proxy models.

Parameters:

Name	Type	Description	Default
`n_total`	`int`	Total number of samples.	required
`true_mean`	`float`	Expected mean value of the true labels. Must be in (0, 1).	required
`proxy_means`	`list of float or NDArray of shape (n_proxies,)`	Expected mean value of each proxy label. The length determines the number of proxies. Each value must be in (0, 1).	required
`correlations`	`list of float or NDArray of shape (n_proxies,)`	Pearson correlation between the true label and each proxy label. Length must equal `len(proxy_means)`. Each value must yield a valid joint binary distribution given `true_mean` and the corresponding `proxy_means` entry.	required
`random_seed`	`int or SeedSequence`	Seed for reproducibility.	`None`

Returns:

Type	Description
`Tuple[NDArray, NDArray]`	[0]: array of shape `(n_total,)`, y_true containing ground-truth labels. [1]: array of shape `(n_total, n_proxies)`, y_proxies where column m contains proxy labels with mean `proxy_means[m]` and correlation `correlations[m]` with y_true.

Raises:

Type	Description
`ValueError`	If `proxy_means` and `correlations` have different lengths.
`ValueError`	If `true_mean` is not in (0, 1).
`ValueError`	If any `proxy_means[m]` is not in (0, 1).
`ValueError`	If the combination of `true_mean`, `proxy_means[m]`, and `correlations[m]` is impossible (leads to negative joint probabilities).

Notes

Step 1 — Generate y_true

Each sample's ground-truth label is drawn independently from a Bernoulli distribution:

y_true_i ~ Bernoulli(true_mean)

Step 2 — Conditional probabilities for each proxy

For proxy m with marginal p_p = proxy_means[m] and correlation rho_m, the bivariate binary joint distribution (see generate_binary_dataset) gives:

D_m  = sqrt(p_t * p_p * (1 - p_t) * (1 - p_p))
p11_m = rho_m * D_m + p_t * p_p
p01_m = p_p - p11_m

where p_t = true_mean. The conditional probabilities follow:

P(proxy_m = 1 | y_true = 1) = p11_m / p_t
P(proxy_m = 1 | y_true = 0) = p01_m / (1 - p_t)
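
As a worked instance of these formulas, the sketch below evaluates the conditional probabilities for two proxies. The numeric inputs are illustrative, and the variable names are chosen here for readability.

```python
import numpy as np

p_t = 0.7                        # true_mean
p_p = np.array([0.6, 0.5])       # proxy_means
rho = np.array([0.7, 0.6])       # correlations

D = np.sqrt(p_t * p_p * (1 - p_t) * (1 - p_p))
p11 = rho * D + p_t * p_p        # P(y_true=1, proxy=1)
p01 = p_p - p11                  # P(y_true=0, proxy=1)

cond_given_1 = p11 / p_t         # P(proxy=1 | y_true=1)
cond_given_0 = p01 / (1 - p_t)   # P(proxy=1 | y_true=0)

# Law of total probability: the marginal proxy means must be recovered.
recovered = p_t * cond_given_1 + (1 - p_t) * cond_given_0
```

Checking that `recovered` equals `p_p` is a quick sanity test that the conditional probabilities are consistent with the requested marginals.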

Step 3 — Vectorized conditional generation

For each sample i and proxy m, draw independently:

proxy_{m,i} | y_true_i ~ Bernoulli(P(proxy_m = 1 | y_true_i))

All proxies are generated in a single vectorized call over the (n_total, n_proxies) matrix of conditional probabilities. The proxies are conditionally independent given y_true, so any marginal correlation between two proxies arises only through the shared y_true: it is positive when their correlations with y_true have the same sign and negative when the signs differ.
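
The broadcasting behind the single vectorized draw can be sketched as follows. The conditional probabilities here are made-up illustrative numbers, and the snippet stands alone rather than reusing library internals.

```python
import numpy as np

rng = np.random.default_rng(0)
n_total = 50_000

y_true = rng.binomial(1, 0.7, size=n_total)

# Illustrative per-proxy conditional probabilities (two proxies).
cond_given_1 = np.array([0.82, 0.70])   # P(proxy_m = 1 | y_true = 1)
cond_given_0 = np.array([0.08, 0.04])   # P(proxy_m = 1 | y_true = 0)

# Broadcasting (n_total, 1) against (n_proxies,) yields an (n_total, n_proxies) matrix.
cond_probs = np.where(y_true[:, np.newaxis] == 1, cond_given_1, cond_given_0)

# One Bernoulli draw per matrix entry.
y_proxies = rng.binomial(1, cond_probs)

# The two proxies are correlated only because both depend on y_true.
proxy_corr = np.corrcoef(y_proxies[:, 0], y_proxies[:, 1])[0, 1]
```

Both illustrative proxies are positively associated with `y_true`, so `proxy_corr` comes out positive even though the draws are independent once `y_true` is fixed.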

Validation bounds

The same feasibility constraint as in generate_binary_dataset applies per proxy:

max(-p_t * p_p_m, p_p_m + p_t - 1 - p_t * p_p_m) / D_m
    <= correlations[m]
    <= min(p_t * (1 - p_p_m), p_p_m * (1 - p_t)) / D_m
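
To see which correlations are feasible for a given pair of means, the bound can be evaluated directly. The helper name `feasible_correlation_range` is hypothetical and exists only for this sketch.

```python
import numpy as np


def feasible_correlation_range(p_t, p_p):
    """Hypothetical helper returning the (min, max) feasible correlation."""
    D = np.sqrt(p_t * p_p * (1 - p_t) * (1 - p_p))
    lo = max(-p_t * p_p, p_p + p_t - 1 - p_t * p_p) / D
    hi = min(p_t * (1 - p_p), p_p * (1 - p_t)) / D
    return lo, hi


lo, hi = feasible_correlation_range(p_t=0.7, p_p=0.6)
```

For these inputs the feasible range is roughly -0.53 to 0.80, so a requested correlation of 0.9 would be rejected.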

Examples:

>>> import numpy as np
>>> from glide.simulators import generate_multi_binary_dataset
>>> y_true, y_proxies = generate_multi_binary_dataset(4, 0.7, [0.6, 0.5], [0.7, 0.6], random_seed=42)
>>> y_true
array([0., 1., 0., 1.])
>>> y_proxies
array([[0., 1.],
       [1., 0.],
       [0., 0.],
       [1., 0.]])

Source code in glide/simulators/multi_binary.py

def generate_multi_binary_dataset(
    n_total: int,
    true_mean: float,
    proxy_means: ArrayLike,
    correlations: ArrayLike,
    random_seed: Optional[Union[int, np.random.SeedSequence]] = None,
) -> Tuple[NDArray, NDArray]:
    """Generate a synthetic binary oracle dataset with multiple proxy models.

    Parameters
    ----------
    n_total : int
        Total number of samples.
    true_mean : float
        Expected mean value of the true labels. Must be in (0, 1).
    proxy_means : list of float or NDArray of shape (n_proxies,)
        Expected mean value of each proxy label. The length determines the number of proxies.
        Each value must be in (0, 1).
    correlations : list of float or NDArray of shape (n_proxies,)
        Pearson correlation between the true label and each proxy label. Length must equal
        ``len(proxy_means)``. Each value must yield a valid joint binary distribution
        given ``true_mean`` and the corresponding ``proxy_means`` entry.
    random_seed : int or np.random.SeedSequence, optional
        Seed for reproducibility.

    Returns
    -------
    Tuple[NDArray, NDArray]
        [0]: array of shape ``(n_total,)``, y_true containing ground-truth labels.
        [1]: array of shape ``(n_total, n_proxies)``, y_proxies where column m contains
             proxy labels with mean ``proxy_means[m]`` and correlation ``correlations[m]``
             with y_true.

    Raises
    ------
    ValueError
        If ``proxy_means`` and ``correlations`` have different lengths.
    ValueError
        If ``true_mean`` is not in (0, 1).
    ValueError
        If any ``proxy_means[m]`` is not in (0, 1).
    ValueError
        If the combination of ``true_mean``, ``proxy_means[m]``, and ``correlations[m]``
        is impossible (leads to negative joint probabilities).

    Notes
    -----
    **Step 1 — Generate y_true**

    Each sample's ground-truth label is drawn independently from a Bernoulli distribution:

    ```
    y_true_i ~ Bernoulli(true_mean)
    ```

    **Step 2 — Conditional probabilities for each proxy**

    For proxy m with marginal ``p_p = proxy_means[m]`` and correlation ``rho_m``, the
    bivariate binary joint distribution (see ``generate_binary_dataset``) gives:

    ```
    D_m  = sqrt(p_t * p_p * (1 - p_t) * (1 - p_p))
    p11_m = rho_m * D_m + p_t * p_p
    p01_m = p_p - p11_m
    ```

    where ``p_t = true_mean``. The conditional probabilities follow:

    ```
    P(proxy_m = 1 | y_true = 1) = p11_m / p_t
    P(proxy_m = 1 | y_true = 0) = p01_m / (1 - p_t)
    ```

    **Step 3 — Vectorized conditional generation**

    For each sample i and proxy m, draw independently:

    ```
    proxy_{m,i} | y_true_i ~ Bernoulli(P(proxy_m = 1 | y_true_i))
    ```

    All proxies are generated in a single vectorized call over the ``(n_total, n_proxies)``
    matrix of conditional probabilities. The proxies are conditionally independent given
    y_true, so any marginal correlation between two proxies arises only through the shared
    y_true: it is positive when their correlations with y_true have the same sign and
    negative when the signs differ.

    **Validation bounds**

    The same feasibility constraint as in ``generate_binary_dataset`` applies per proxy:

    ```
    max(-p_t * p_p_m, p_p_m + p_t - 1 - p_t * p_p_m) / D_m
        <= correlations[m]
        <= min(p_t * (1 - p_p_m), p_p_m * (1 - p_t)) / D_m
    ```

    Examples
    --------
    >>> import numpy as np
    >>> from glide.simulators import generate_multi_binary_dataset
    >>> y_true, y_proxies = generate_multi_binary_dataset(4, 0.7, [0.6, 0.5], [0.7, 0.6], random_seed=42)
    >>> y_true
    array([0., 1., 0., 1.])
    >>> y_proxies
    array([[0., 1.],
           [1., 0.],
           [0., 0.],
           [1., 0.]])
    """
    p_p = np.asarray(proxy_means, dtype=float)
    correlations_arr = np.asarray(correlations, dtype=float)

    _validate_equal_lengths(p_p, correlations_arr, names=["proxy_means", "correlations"])
    _validate_bounds(true_mean, "true_mean", lower=0, upper=1, left_inclusive=False, right_inclusive=False)

    for m, proxy_mean in enumerate(p_p):
        _validate_bounds(proxy_mean, f"proxy_means[{m}]", lower=0, upper=1, left_inclusive=False, right_inclusive=False)

    p_t = true_mean
    D = np.sqrt(p_t * p_p * (1 - p_t) * (1 - p_p))
    min_correlations = np.maximum(-p_t * p_p, p_p + p_t - 1 - p_t * p_p) / D
    max_correlations = np.minimum(p_t * (1 - p_p), p_p * (1 - p_t)) / D

    for m, (rho, lo, hi, proxy_mean) in enumerate(zip(correlations_arr, min_correlations, max_correlations, p_p)):
        if rho < lo or rho > hi:
            raise ValueError(
                f"Proxy {m}: impossible combination of 'true_mean'={true_mean!r}, "
                f"'proxy_means[{m}]'={proxy_mean!r}, and 'correlations[{m}]'={rho!r}: "
                f"leads to negative joint probabilities; "
                f"possible 'correlations[{m}]' values are in the range [{lo:.3f}, {hi:.3f}]."
            )

    rng = np.random.default_rng(seed=random_seed)
    y_true = rng.binomial(1, p_t, size=n_total).astype(float)

    p11 = correlations_arr * D + p_t * p_p
    p01 = p_p - p11
    cond_prob_given_1 = p11 / p_t
    cond_prob_given_0 = p01 / (1 - p_t)

    is_true_one = y_true[:, np.newaxis].astype(bool)
    cond_probs = np.where(is_true_one, cond_prob_given_1, cond_prob_given_0)
    cond_probs = np.clip(cond_probs, 0.0, 1.0)
    y_proxies = rng.binomial(1, cond_probs).astype(float)

    return y_true, y_proxies

generate_stratified_binary_dataset

generate_stratified_binary_dataset(
    n_total,
    true_mean,
    proxy_mean,
    correlation,
    random_seed=None,
)

Generate a synthetic stratified binary-label oracle dataset.

Generate multiple strata with potentially different parameters (true_mean, proxy_mean, correlation, n_total per stratum). This enables simulation of heterogeneous data where different groups have different proxy-truth relationships.

Parameters:

Name	Type	Description	Default
`n_total`	`list of int or NDArray of shape (K,)`	Total number of samples per stratum. All samples have both true and proxy labels. Length must equal number of strata.	required
`true_mean`	`list of float or NDArray of shape (K,)`	Expected mean value of the true labels per stratum. Length must equal number of strata.	required
`proxy_mean`	`list of float or NDArray of shape (K,)`	Expected mean value of the proxy labels per stratum. Length must equal number of strata.	required
`correlation`	`list of float or NDArray of shape (K,)`	Pearson correlation between true and proxy per stratum. Length must equal number of strata.	required
`random_seed`	`int`	Seed for reproducibility. If provided, seeds are derived deterministically.	`None`

Returns:

Type	Description
`Tuple[NDArray, NDArray, NDArray]`	Let `N = sum(n_total)` be the total number of samples across all strata. [0]: array of shape `(N,)`, y_true containing ground-truth labels. [1]: array of shape `(N,)`, y_proxy containing proxy labels. [2]: array of shape `(N,)`, integer stratum identifiers.

Raises:

Type	Description
`ValueError`	If input lists have different lengths.
`ValueError`	If no stratum is specified (the inputs are empty).
`ValueError`	If any stratum has invalid parameters (see generate_binary_dataset).

Examples:

>>> import numpy as np
>>> from glide.simulators import generate_stratified_binary_dataset
>>> y_true, y_proxy, groups = generate_stratified_binary_dataset(
...     n_total=[6, 8],
...     true_mean=[0.6, 0.8],
...     proxy_mean=[0.5, 0.7],
...     correlation=[0.7, 0.75],
...     random_seed=42
... )
>>> len(y_true)
14
>>> len(groups)
14
>>> len(y_proxy)
14
>>> bool(np.all(np.isin(y_true, [0.0, 1.0])))
True
>>> bool(np.all(np.isin(y_proxy, [0.0, 1.0])))
True
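
The deterministic seed derivation mentioned for `random_seed` relies on NumPy's `SeedSequence.spawn`. The sketch below shows the idea on its own, with a plain uniform draw standing in for the per-stratum generation.

```python
import numpy as np

# Spawn one child seed per stratum from a single parent seed.
children_a = np.random.SeedSequence(42).spawn(2)
children_b = np.random.SeedSequence(42).spawn(2)

draws_a = [np.random.default_rng(s).random(3) for s in children_a]
draws_b = [np.random.default_rng(s).random(3) for s in children_b]

# Same parent seed -> identical per-stratum streams across runs.
reproducible = all(np.array_equal(a, b) for a, b in zip(draws_a, draws_b))
# Different children -> different streams for different strata.
strata_differ = not np.array_equal(draws_a[0], draws_a[1])
```

Spawning child seeds gives each stratum its own statistically independent stream while keeping the whole dataset reproducible from a single integer.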

Source code in glide/simulators/stratified_binary.py

def generate_stratified_binary_dataset(
    n_total: ArrayLike,
    true_mean: ArrayLike,
    proxy_mean: ArrayLike,
    correlation: ArrayLike,
    random_seed: Optional[int] = None,
) -> Tuple[NDArray, NDArray, NDArray]:
    """Generate a synthetic stratified binary-label oracle dataset.

    Generate multiple strata with potentially different parameters (true_mean, proxy_mean,
    correlation, n_total per stratum). This enables simulation of heterogeneous data where
    different groups have different proxy-truth relationships.

    Parameters
    ----------
    n_total : list of int or NDArray of shape (K,)
        Total number of samples per stratum. All samples have both true and proxy labels.
        Length must equal number of strata.
    true_mean : list of float or NDArray of shape (K,)
        Expected mean value of the true labels per stratum.
        Length must equal number of strata.
    proxy_mean : list of float or NDArray of shape (K,)
        Expected mean value of the proxy labels per stratum.
        Length must equal number of strata.
    correlation : list of float or NDArray of shape (K,)
        Pearson correlation between true and proxy per stratum.
        Length must equal number of strata.
    random_seed : int, optional
        Seed for reproducibility. If provided, seeds are derived deterministically.

    Returns
    -------
    Tuple[NDArray, NDArray, NDArray]
        Let ``N = sum(n_total)`` be the total number of samples across all strata.

        [0]: array of shape ``(N,)``, y_true containing ground-truth labels.
        [1]: array of shape ``(N,)``, y_proxy containing proxy labels.
        [2]: array of shape ``(N,)``, integer stratum identifiers.

    Raises
    ------
    ValueError
        If input lists have different lengths.
    ValueError
        If no stratum is specified (the inputs are empty).
    ValueError
        If any stratum has invalid parameters (see generate_binary_dataset).

    Examples
    --------
    >>> import numpy as np
    >>> from glide.simulators import generate_stratified_binary_dataset
    >>> y_true, y_proxy, groups = generate_stratified_binary_dataset(
    ...     n_total=[6, 8],
    ...     true_mean=[0.6, 0.8],
    ...     proxy_mean=[0.5, 0.7],
    ...     correlation=[0.7, 0.75],
    ...     random_seed=42
    ... )
    >>> len(y_true)
    14
    >>> len(groups)
    14
    >>> len(y_proxy)
    14
    >>> bool(np.all(np.isin(y_true, [0.0, 1.0])))
    True
    >>> bool(np.all(np.isin(y_proxy, [0.0, 1.0])))
    True
    """
    n_total_arr = np.asarray(n_total, dtype=int)
    true_mean_arr = np.asarray(true_mean, dtype=float)
    proxy_mean_arr = np.asarray(proxy_mean, dtype=float)
    correlation_arr = np.asarray(correlation, dtype=float)

    _validate_non_empty(n_total_arr, "n_total")
    num_strata = len(n_total_arr)

    _validate_equal_lengths(
        n_total_arr,
        true_mean_arr,
        proxy_mean_arr,
        correlation_arr,
        names=["n_total", "true_mean", "proxy_mean", "correlation"],
    )

    # Generate data for each stratum
    y_true_per_stratum = []
    y_proxy_per_stratum = []
    groups_per_stratum = []

    seed_sequence = np.random.SeedSequence(random_seed)
    seeds = seed_sequence.spawn(num_strata)

    for stratum_id in range(num_strata):
        y_true_k, y_proxy_k = generate_binary_dataset(
            n_total=n_total_arr[stratum_id],
            true_mean=true_mean_arr[stratum_id],
            proxy_mean=proxy_mean_arr[stratum_id],
            correlation=correlation_arr[stratum_id],
            random_seed=seeds[stratum_id],
        )
        y_true_per_stratum.append(y_true_k)
        y_proxy_per_stratum.append(y_proxy_k)
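        # Use an explicit integer dtype: full_like would inherit the float dtype of y_true_k.
        groups_per_stratum.append(np.full(len(y_true_k), stratum_id, dtype=np.int64))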

    y_true = np.hstack(y_true_per_stratum)
    y_proxy = np.hstack(y_proxy_per_stratum)
    groups = np.hstack(groups_per_stratum)

    return y_true, y_proxy, groups

simulate_annotation

simulate_annotation(y_true_oracle, xi)

Reveal oracle labels where annotated and mask the rest as NaN.

Given a full oracle label array and an annotation indicator, returns an array where labels are kept for annotated elements (xi == 1) and set to np.nan for unannotated ones (xi is 0 or NaN). The input arrays are not mutated.

Parameters:

Name	Type	Description	Default
`y_true_oracle`	`NDArray`	Full oracle ground-truth labels for all elements.	required
`xi`	`NDArray`	Annotation indicator of the same length. A value of `1` means the element was sent to a human annotator; `0` or `np.nan` means it was not.	required

Returns:

Type	Description
`NDArray`	Array of the same length as `y_true_oracle`, with oracle values where `xi == 1` and `np.nan` where `xi` is `0` or NaN.

Raises:

Type	Description
`ValueError`	If `y_true_oracle` and `xi` have different lengths.
`ValueError`	If `y_true_oracle` contains NaN values.
`ValueError`	If `xi` contains values other than `0`, `1`, or `np.nan`.

Examples:

>>> import numpy as np
>>> from glide.simulators import simulate_annotation
>>> y_true_oracle = np.array([0, 1, 1, 0])
>>> xi = np.array([1, 0, 1, np.nan])
>>> simulate_annotation(y_true_oracle, xi)
array([ 0., nan,  1., nan])
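
The masking rule is easiest to understand through NumPy's NaN semantics. The following sketch is independent of the library and only demonstrates why comparing against `1` handles both `0` and NaN entries.

```python
import numpy as np

xi = np.array([1, 0, 1, np.nan])

# NaN never compares equal to anything, including itself.
equals_nan = xi == np.nan        # all False
is_nan = np.isnan(xi)            # True only at the NaN position

# Masking on "not equal to 1" therefore covers both 0 and NaN entries.
y = np.array([0, 1, 1, 0]).astype(float)
y[xi != 1] = np.nan
```

An equality test against `np.nan` would silently match nothing, which is why NaN entries are detected with `np.isnan` or, as here, caught by the `!= 1` comparison.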

Source code in glide/simulators/annotation.py

def simulate_annotation(
    y_true_oracle: NDArray,
    xi: NDArray,
) -> NDArray:
    """Reveal oracle labels where annotated and mask the rest as NaN.

    Given a full oracle label array and an annotation indicator, returns an array where labels
    are kept for annotated elements (``xi == 1``) and set to ``np.nan`` for unannotated ones
    (``xi`` is ``0`` or NaN). The input arrays are not mutated.

    Parameters
    ----------
    y_true_oracle : NDArray
        Full oracle ground-truth labels for all elements.
    xi : NDArray
        Annotation indicator of the same length. A value of ``1`` means the element was sent
        to a human annotator; ``0`` or ``np.nan`` means it was not.

    Returns
    -------
    NDArray
        Array of the same length as ``y_true_oracle``, with oracle values where ``xi == 1``
        and ``np.nan`` where ``xi`` is ``0`` or NaN.

    Raises
    ------
    ValueError
        If ``y_true_oracle`` and ``xi`` have different lengths.
    ValueError
        If ``y_true_oracle`` contains NaN values.
    ValueError
        If ``xi`` contains values other than ``0``, ``1``, or ``np.nan``.

    Examples
    --------
    >>> import numpy as np
    >>> from glide.simulators import simulate_annotation
    >>> y_true_oracle = np.array([0, 1, 1, 0])
    >>> xi = np.array([1, 0, 1, np.nan])
    >>> simulate_annotation(y_true_oracle, xi)
    array([ 0., nan,  1., nan])
    """
    _validate_equal_lengths(y_true_oracle, xi, names=["y_true_oracle", "xi"])
    _validate_has_no_nan(y_true_oracle, "y_true_oracle")
    xi_float = xi.astype(float)
    _validate_binary_or_nan(xi, "xi")

    y_true = y_true_oracle.astype(float)
    y_true[xi_float != 1] = np.nan
    return y_true