Reference

bootstrap

Bootstrap resampling tools.

Compute estimator bias, variance, and confidence intervals with bootstrap resampling.

Several forms of bootstrapping on N-dimensional data are supported: ordinary, balanced, extended, parametric, and stratified sampling; see resample() for details. Parametric bootstrapping fits a user-specified distribution to the data and samples from the parametric distribution. The distributions are taken from scipy.stats.

Confidence intervals can be computed with the ordinary percentile method and with the more efficient BCa method; see confidence_interval() for details.

resample.bootstrap.bootstrap(fn: Callable[[...], ndarray], sample: ArrayLike, *args: ArrayLike, **kwargs: Any) → ndarray

Calculate function values from bootstrap samples.

This is equivalent to numpy.array([fn(b) for b in resample(sample)]) and is implemented for convenience.

Parameters:
  • fn (Callable) – Bootstrap samples are passed to this function.

  • sample (array-like) – Original sample.

  • *args (array-like) – Optional additional arrays of the same length to resample.

  • **kwargs – Keywords are forwarded to resample().

Returns:

Results of fn applied to each bootstrap sample.

Return type:

ndarray

Examples

>>> from resample.bootstrap import bootstrap
>>> import numpy as np
>>> x = np.arange(10)
>>> fx = np.mean(x)
>>> fb = bootstrap(np.mean, x, size=10000, random_state=1)
>>> print(f"f(x) = {fx:.1f} +/- {np.std(fb):.1f}")
f(x) = 4.5 +/- 0.9
resample.bootstrap.confidence_interval(fn: Callable[[...], ndarray], sample: ArrayLike, *args: ArrayLike, cl: float = 0.95, ci_method: str = 'bca', **kwargs: Any) → Tuple[float, float]

Calculate bootstrap confidence intervals.

Parameters:
  • fn (callable) – Function to be bootstrapped.

  • sample (array-like) – Original sample.

  • *args (array-like) – Optional additional arrays of the same length to resample.

  • cl (float, default: 0.95) – Confidence level. Asymptotically, this is the probability that the interval contains the true value.

  • ci_method (str, {'bca', 'percentile'}, optional) – Confidence interval method. Default is ‘bca’. See notes for details.

  • **kwargs – Keyword arguments forwarded to resample().

Returns:

Lower and upper confidence limits.

Return type:

(float, float)

Examples

Compute confidence interval for arithmetic mean.

>>> from resample.bootstrap import confidence_interval
>>> import numpy as np
>>> x = np.arange(10)
>>> a, b = confidence_interval(np.mean, x, size=10000, random_state=1)
>>> round(a, 1), round(b, 1)
(2.6, 6.2)

Notes

Both the ‘percentile’ and ‘bca’ methods produce intervals that are invariant to monotonic transformations of the data values, a desirable property.

The ‘percentile’ method is straightforward and useful as a fallback. The ‘bca’ method is second-order accurate (to O(1/n), where n is the sample size) and generally preferred. It computes a jackknife estimate in addition to the bootstrap, which increases the number of function evaluations compared to ‘percentile’. However, the increase in accuracy should compensate for this, so that fewer bootstrap replicas are needed overall to achieve the same accuracy.
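
The following sketch contrasts the two methods on the sample from the example above. The exact interval endpoints depend on the method and the random state, so only the containment of the sample estimate is checked here; this check is an illustrative assumption, not library output.

>>> from resample.bootstrap import confidence_interval
>>> import numpy as np
>>> x = np.arange(10)
>>> a1, b1 = confidence_interval(np.mean, x, ci_method="percentile", size=10000, random_state=1)
>>> a2, b2 = confidence_interval(np.mean, x, ci_method="bca", size=10000, random_state=1)
>>> bool(a1 < np.mean(x) < b1 and a2 < np.mean(x) < b2)
True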

resample.bootstrap.covariance(fn: Callable[[...], ndarray], sample: ArrayLike, *args: ArrayLike, **kwargs: Any) → ndarray

Calculate bootstrap estimate of covariance.

Parameters:
  • fn (callable) – Estimator. Can be any mapping ℝⁿ → ℝᵏ, where n is the sample size and k is the length of the output array.

  • sample (array-like) – Original sample.

  • *args (array-like) – Optional additional arrays of the same length to resample.

  • **kwargs – Keyword arguments forwarded to resample().

Returns:

Bootstrap estimate of covariance. In general, this is a matrix, but if the function maps to a scalar, the result is a scalar as well.

Return type:

ndarray

Examples

Compute covariance of sample mean and sample variance.

>>> from resample.bootstrap import covariance
>>> import numpy as np
>>> x = np.arange(10)
>>> def fn(x):
...     return np.mean(x), np.var(x)
>>> np.round(covariance(fn, x, size=10000, random_state=1), 1)
array([[0.8, 0. ],
       [0. , 5.5]])
resample.bootstrap.resample(sample: ArrayLike, *args: ArrayLike, size: int = 100, method: str = 'balanced', strata: ArrayLike | None = None, random_state: int | Generator | None = None) → Generator[ndarray, None, None]

Return generator of bootstrap samples.

Parameters:
  • sample (array-like) – Original sample.

  • *args (array-like) – Optional additional arrays of the same length to resample.

  • size (int, optional) – Number of bootstrap samples to generate. Default is 100.

  • method (str or None, optional) – How to generate bootstrap samples. Supported are ‘ordinary’, ‘balanced’, ‘extended’, or a distribution name for a parametric bootstrap. Default is ‘balanced’. Supported distribution names: ‘normal’ (also: ‘gaussian’, ‘norm’), ‘student’ (also: ‘t’), ‘laplace’, ‘logistic’, ‘F’ (also: ‘f’), ‘beta’, ‘gamma’, ‘log-normal’ (also: ‘lognorm’, ‘log-gaussian’), ‘inverse-gaussian’ (also: ‘invgauss’), ‘pareto’, ‘poisson’.

  • strata (array-like, optional) – Stratification labels. Must have the same shape as sample. Default is None.

  • random_state (numpy.random.Generator or int, optional) – Random number generator instance. If an integer is passed, seed the numpy default generator with it. Default is to use numpy.random.default_rng().

Yields:

ndarray – Bootstrap sample.

Examples

Compute uncertainty of arithmetic mean.

>>> from resample.bootstrap import resample
>>> import numpy as np
>>> x = np.arange(10)
>>> fx = np.mean(x)
>>> fb = []
>>> for b in resample(x, size=10000, random_state=1):
...     fb.append(np.mean(b))
>>> print(f"f(x) = {fx:.1f} +/- {np.std(fb):.1f}")
f(x) = 4.5 +/- 0.9

Compute uncertainty of function applied to multivariate data.

>>> from resample.bootstrap import resample
>>> import numpy as np
>>> x = np.arange(10)
>>> y = np.arange(10, 20)
>>> fx = np.mean((x, y))
>>> fb = []
>>> for bx, by in resample(x, y, size=10000, random_state=1):
...     fb.append(np.mean((bx, by)))
>>> print(f"f(x, y) = {fx:.1f} +/- {np.std(fb):.1f}")
f(x, y) = 9.5 +/- 0.9

Notes

Balanced vs. ordinary bootstrap:

The balanced bootstrap produces more accurate results for the same number of bootstrap samples than the ordinary bootstrap, but needs to allocate memory for B integers, where B is the number of bootstrap samples. Since values of B larger than 10000 are rarely needed, this is usually not an issue.
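
A minimal sketch of the balancing property: concatenated over all replicates, each original value occurs exactly B times (here B = 100). The exact array printed is what the balancing guarantee implies, not a documented output.

>>> from resample.bootstrap import resample
>>> import numpy as np
>>> x = np.arange(5)
>>> b = np.concatenate(list(resample(x, size=100, method="balanced", random_state=1)))
>>> np.unique(b, return_counts=True)[1]
array([100, 100, 100, 100, 100])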

Non-parametric vs. parametric bootstrap:

If you know that the data follow a particular parametric distribution, it is better to sample from this parametric distribution, but in most cases it is sufficient and more convenient to do a non-parametric bootstrap (using “balanced”, “ordinary”, “extended”). The parametric bootstrap is essential for estimators sensitive to the tails of a distribution (for example, a quantile close to 0 or 1). In this case, only a parametric bootstrap will give reasonable answers, since the non-parametric bootstrap cannot include rare events in the tail if the original sample did not have them.
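
As a sketch, a tail-sensitive estimator (a quantile close to 1) can be bootstrapped parametrically by passing a distribution name as method, here "normal". The data are hypothetical and the numeric result is omitted, since it depends on the fitted distribution.

>>> from resample.bootstrap import variance
>>> import numpy as np
>>> rng = np.random.default_rng(1)
>>> x = rng.normal(size=100)
>>> def q99(s):
...     return np.quantile(s, 0.99)
>>> v = variance(q99, x, size=1000, method="normal", random_state=1)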

Extended bootstrap:

In particle physics, and perhaps in other fields, estimators are used that are a function of both the size and the shape of a sample (for example, a fit of a peak over a smooth background to the mass distribution of decay candidates). In this case, the normal bootstrap (parametric or non-parametric) is not correct, since the sample size is kept constant. For such cases, one needs the “extended” bootstrap. The name alludes to the so-called extended maximum-likelihood (EML) method in particle physics. Estimates obtained with the EML method need to be bootstrapped with the “extended” bootstrap.

Stratification:

If the sample consists of several distinct classes, stratification ensures that the relative proportions of each class are maintained in each replicated sample. This is a stricter constraint than that offered by the balanced bootstrap, which only guarantees that classes have the original proportions over all replicates, but not within each individual replicate. A sketch follows.
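
With the hypothetical labels below, every replicate draws three values from each class, so the class proportions are preserved exactly; the printed tuple follows from that guarantee.

>>> from resample.bootstrap import resample
>>> import numpy as np
>>> x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
>>> strata = np.array([0, 0, 0, 1, 1, 1])
>>> b = next(resample(x, size=1, strata=strata, random_state=1))
>>> int(np.sum(b <= 3)), int(np.sum(b > 3))
(3, 3)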

resample.bootstrap.variance(fn: Callable[[...], ndarray], sample: ArrayLike, *args: ArrayLike, **kwargs: Any) → ndarray

Calculate bootstrap estimate of variance.

If the function returns a vector, the variance is computed elementwise.

Parameters:
  • fn (callable) – Estimator. Can be any mapping ℝⁿ → ℝᵏ, where n is the sample size and k is the length of the output array.

  • sample (array-like) – Original sample.

  • *args (array-like) – Optional additional arrays of the same length to resample.

  • **kwargs – Keyword arguments forwarded to resample().

Returns:

Bootstrap estimate of variance.

Return type:

ndarray

Examples

Compute variance of arithmetic mean.

>>> from resample.bootstrap import variance
>>> import numpy as np
>>> x = np.arange(10)
>>> round(variance(np.mean, x, size=10000, random_state=1), 1)
0.8

jackknife

Jackknife resampling tools.

Compute estimator bias and variance with jackknife resampling. The implementation supports resampling of N-dimensional data. The interface of this module mimics that of the bootstrap module, so that you can easily switch between bootstrapping and jackknifing bias and variance of an estimator.

The jackknife is an approximation to the bootstrap, so in general bootstrapping is preferred, especially when the sample is small. The computational cost of the jackknife increases quadratically with the sample size, but only linearly for the bootstrap. An advantage of the jackknife can be the deterministic outcome, since no random sampling is involved.
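
For example, switching the variance computation of an estimator between the two modules only changes the module name. The outputs below are taken from the variance() examples elsewhere in this reference; the subpackage import form is assumed here.

>>> from resample import bootstrap, jackknife
>>> import numpy as np
>>> x = np.arange(10)
>>> round(jackknife.variance(np.mean, x), 1)
0.9
>>> round(bootstrap.variance(np.mean, x, size=10000, random_state=1), 1)
0.8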

resample.jackknife.bias(fn: Callable[[...], ndarray], sample: ArrayLike, *args: ArrayLike) → ndarray

Calculate jackknife estimate of bias.

The bias estimate is accurate to O(1/n), where n is the sample size. If the bias is exactly O(1/n), then the estimate is exact.

Wikipedia: https://en.wikipedia.org/wiki/Jackknife_resampling

Parameters:
  • fn (callable) – Estimator. Can be any mapping ℝⁿ → ℝᵏ, where n is the sample size and k is the length of the output array.

  • sample (array-like) – Original sample.

  • *args (array-like) – Optional additional arrays of the same length to resample.

Returns:

Jackknife estimate of bias (= expectation of estimator - true value).

Return type:

ndarray

Examples

Compute bias of numpy.var with and without bias-correction.

>>> from resample.jackknife import bias
>>> import numpy as np
>>> x = np.arange(10)
>>> round(bias(np.var, x), 1)
-0.9
>>> round(bias(lambda x: np.var(x, ddof=1), x), 1)
0.0
resample.jackknife.bias_corrected(fn: Callable[[...], ndarray], sample: ArrayLike, *args: ArrayLike) → ndarray

Calculate bias-corrected estimate of the function with the jackknife.

Removes a bias of O(1/n), where n is the sample size, leaving bias of order O(1/n²). If the original function has a bias of exactly O(1/n), the corrected result is now unbiased.

Wikipedia: https://en.wikipedia.org/wiki/Jackknife_resampling

Parameters:
  • fn (callable) – Estimator. Can be any mapping ℝⁿ → ℝᵏ, where n is the sample size and k is the length of the output array.

  • sample (array-like) – Original sample.

  • *args (array-like) – Optional additional arrays of the same length to resample.

Returns:

Estimate with O(1/n) bias removed.

Return type:

ndarray

Examples

Compute bias-corrected estimate of numpy.var.

>>> from resample.jackknife import bias_corrected
>>> import numpy as np
>>> x = np.arange(10)
>>> round(np.var(x), 1)
8.2
>>> round(bias_corrected(np.var, x), 1)
9.2
resample.jackknife.jackknife(fn: Callable[[...], ndarray], sample: ArrayLike, *args: ArrayLike) → ndarray

Calculate jackknife estimates for a given sample and estimator.

The jackknife is a linear approximation to the bootstrap. In contrast to the bootstrap, it is deterministic and does not use random numbers. The caveat is the computational cost of the jackknife, which is O(n²) for n observations, compared to O(n · k) for k bootstrap replicates. For large samples, the bootstrap is more efficient.

Parameters:
  • fn (callable) – Estimator. Can be any mapping ℝⁿ → ℝᵏ, where n is the sample size and k is the length of the output array.

  • sample (array-like) – Original sample.

  • *args (array-like) – Optional additional arrays of the same length to resample.

Returns:

Jackknife estimates, i.e. the results of fn applied to each jackknife sample.

Return type:

ndarray

Examples

>>> from resample.jackknife import jackknife
>>> import numpy as np
>>> x = np.arange(10)
>>> fx = np.mean(x)
>>> fb = jackknife(np.mean, x)
>>> print(f"f(x) = {fx:.1f} +/- {np.std(fb):.1f}")
f(x) = 4.5 +/- 0.3
resample.jackknife.resample(sample: ArrayLike, *args: ArrayLike, copy: bool = True) → Generator[Any, None, None]

Generate jackknifed samples.

Parameters:
  • sample (array-like) – Sample. If the sequence is multi-dimensional, the first dimension must walk over i.i.d. observations.

  • *args (array-like) – Optional additional arrays of the same length to resample.

  • copy (bool, optional) – If True, return the replicated sample as a copy, otherwise return a view into the internal array buffer of the generator. Setting this to False avoids len(sample) copies, which is more efficient, but see notes for caveats.

Yields:

ndarray – Array with same shape and type as input, but with the size of the first dimension reduced by one. Replicates are missing one value of the original in ascending order, e.g. for a sample (1, 2, 3), one gets (2, 3), (1, 3), (1, 2).

See also

resample.bootstrap.resample

Generate bootstrap samples.

resample.jackknife.jackknife

Generate jackknife estimates.

Notes

On performance:

The generator internally keeps a single array for the replicates, which is updated on each iteration of the generator. The safe default is to return copies of this internal state. To increase performance, it is also possible to return a view into the generator state by setting copy=False. However, this will only produce correct results if the generator is called strictly sequentially in a single-threaded program and the loop body consumes the view and does not try to store it. The following program shows what happens otherwise:

>>> from resample.jackknife import resample
>>> r1 = []
>>> for x in resample((1, 2, 3)): # works as expected
...     r1.append(x)
>>> print(r1)
[array([2, 3]), array([1, 3]), array([1, 2])]
>>>
>>> r2 = []
>>> for x in resample((1, 2, 3), copy=False):
...     r2.append(x) # x is now a view into the same array in memory
>>> print(r2)
[array([1, 2]), array([1, 2]), array([1, 2])]
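>>>
>>> # Sketch: if the replicates must be stored, an explicit copy of the
>>> # view restores the expected behaviour (trading back the copies that
>>> # copy=False avoided).
>>> r3 = []
>>> for x in resample((1, 2, 3), copy=False):
...     r3.append(x.copy())
>>> print(r3)
[array([2, 3]), array([1, 3]), array([1, 2])]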
resample.jackknife.variance(fn: Callable[[...], ndarray], sample: ArrayLike, *args: ArrayLike) → ndarray

Calculate jackknife estimate of variance.

Wikipedia: https://en.wikipedia.org/wiki/Jackknife_resampling

Parameters:
  • fn (callable) – Estimator. Can be any mapping ℝⁿ → ℝᵏ, where n is the sample size and k is the length of the output array.

  • sample (array-like) – Original sample.

  • *args (array-like) – Optional additional arrays of the same length to resample.

Returns:

Jackknife estimate of variance.

Return type:

ndarray

Examples

Compute variance of arithmetic mean.

>>> from resample.jackknife import variance
>>> import numpy as np
>>> x = np.arange(10)
>>> round(variance(np.mean, x), 1)
0.9

permutation

Permutation-based tests.

A collection of statistical tests that use permuted samples. Permutations are used to compute the distribution of a test statistic under some null hypothesis to obtain p-values without relying on approximate asymptotic formulas.

The permutation method is generic: it can be used with any test statistic. We therefore also provide a generic test function that accepts a user-defined function to compute the test statistic and then automatically computes the p-value for that statistic. The other tests internally call this generic test function.

All tests return a TestResult object, which mimics the interface of the result objects returned by tests in scipy.stats, but has a third field that holds the estimated distribution of the test statistic under the null hypothesis.


class resample.permutation.TestResult(statistic: float, pvalue: float, samples: ndarray[Any, dtype[_ScalarType_co]])

Holder of the result of the permutation test.

This class acts like a tuple, which means it can be unpacked and its fields can be accessed by name or by index.

statistic

Value of the test statistic computed on the original data.

Type:

float

pvalue

Estimated probability, under the null hypothesis, of obtaining a test statistic at least as extreme as the observed one. See https://en.wikipedia.org/wiki/P-value for details.

Type:

float

samples

Values of the test statistic from the permuted samples.

Type:

array

resample.permutation.anova(x: ArrayLike, y: ArrayLike, *args: ArrayLike, **kwargs: Any) → TestResult

Test whether the means of two or more samples are compatible.

This test uses one-way analysis of variance (one-way ANOVA), which tests whether the samples have the same mean. It is typically used when there are three or more groups. For two groups, Welch’s t-test is preferred, because ANOVA assumes equal variances for the samples.

Parameters:
  • x (array-like) – First sample.

  • y (array-like) – Second sample.

  • *args (array-like) – Further samples.

  • **kwargs – Keyword arguments are forwarded to same_population().

Return type:

TestResult

Notes

https://en.wikipedia.org/wiki/One-way_analysis_of_variance https://en.wikipedia.org/wiki/F-test
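
A minimal usage sketch with three hypothetical groups; the exact p-value is omitted, only its valid range is checked.

>>> from resample.permutation import anova
>>> import numpy as np
>>> rng = np.random.default_rng(1)
>>> x = rng.normal(0.0, 1.0, size=20)
>>> y = rng.normal(0.0, 1.0, size=20)
>>> z = rng.normal(1.0, 1.0, size=20)
>>> r = anova(x, y, z, random_state=1)
>>> bool(0 <= r.pvalue <= 1)
True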

resample.permutation.kruskal(x: ArrayLike, y: ArrayLike, *args: ArrayLike, **kwargs: Any) → TestResult

Test whether two or more samples have the same mean rank.

This performs a permutation-based Kruskal-Wallis test. In a sense, it extends the Mann-Whitney U test, which also uses ranks, to more than two groups. It does so by comparing the means of the rank distributions.

Parameters:
  • x (array-like) – First sample.

  • y (array-like) – Second sample.

  • *args (array-like) – Further samples.

  • **kwargs – Keyword arguments are forwarded to same_population().

Return type:

TestResult

Notes

https://en.wikipedia.org/wiki/Kruskal%E2%80%93Wallis_one-way_analysis_of_variance

resample.permutation.pearsonr(x: ArrayLike, y: ArrayLike, **kwargs: Any) → TestResult

Test whether two samples are drawn from same population using correlation.

The test statistic is the Pearson correlation coefficient. The test is very sensitive to a linear relationship between x and y. If the relationship is strongly non-linear but monotonic, spearmanr() may be more sensitive.

https://en.wikipedia.org/wiki/Pearson_correlation_coefficient

Parameters:
  • x (array-like) – First sample.

  • y (array-like) – Second sample.

  • **kwargs – Keyword arguments are forwarded to same_population().

Return type:

TestResult

resample.permutation.same_population(fn: Callable[[...], float], x: ArrayLike, y: ArrayLike, *args: ArrayLike, transform: Callable[[ndarray[Any, dtype[_ScalarType_co]]], ndarray[Any, dtype[_ScalarType_co]]] | None = None, size: int = 9999, random_state: int | Generator | None = None) → TestResult

Compute p-value for hypothesis that samples originate from same population.

The computation is based on a user-defined test statistic. The distribution of the test statistic under the null hypothesis is obtained by computing the statistic on random permutations of the inputs, which simulates that they are actually drawn from the same population. The p-value is computed as the fraction of these resampled test statistics that are larger than the original value.

Some test statistics need to be transformed to fulfill the condition above, for example if they are signed. A transform can be passed to this function for those cases.

Parameters:
  • fn (Callable) – Function with signature f(x, …), where the number of arguments corresponds to the number of data samples passed to the test.

  • x (array-like) – First sample.

  • y (array-like) – Second sample.

  • *args (array-like) – Further samples, if the test allows to compare more than two.

  • transform (Callable, optional) – Function with signature f(x), applied to the test statistic to turn it into a measure of deviation. Must be vectorised.

  • size (int, optional) – Number of permutations. Default 9999.

  • random_state (numpy.random.Generator or int, optional) – Random number generator instance. If an integer is passed, seed the numpy default generator with it. Default is to use numpy.random.default_rng().

Return type:

TestResult
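
Examples

A sketch with a user-defined difference-of-means statistic on hypothetical data; since the statistic is signed, np.abs is passed as transform to turn it into a measure of deviation. The exact p-value is omitted.

>>> from resample.permutation import same_population
>>> import numpy as np
>>> rng = np.random.default_rng(1)
>>> x = rng.normal(0.0, 1.0, size=30)
>>> y = rng.normal(0.5, 1.0, size=30)
>>> def diff_of_means(a, b):
...     return np.mean(a) - np.mean(b)
>>> r = same_population(diff_of_means, x, y, transform=np.abs, random_state=1)
>>> bool(0 <= r.pvalue <= 1)
True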

resample.permutation.spearmanr(x: ArrayLike, y: ArrayLike, **kwargs: Any) → TestResult

Test whether two samples are drawn from same population using rank correlation.

The test statistic is Spearman’s rank correlation coefficient. The test is very sensitive to monotonic relationships between x and y, even if the relationship is strongly non-linear.

Parameters:
  • x (array-like) – First sample.

  • y (array-like) – Second sample.

  • **kwargs – Keyword arguments are forwarded to same_population().

Return type:

TestResult

resample.permutation.ttest(x: ArrayLike, y: ArrayLike, **kwargs: Any) → TestResult

Test whether the means of two samples are compatible with Welch’s t-test.

See https://en.wikipedia.org/wiki/Welch%27s_t-test for details on this test. The p-value computed is for the null hypothesis that the two population means are equal. The test is two-sided, which means that swapping x and y gives the same p-value. Welch’s t-test does not require the sample sizes to be equal and it does not require the samples to have the same variance.

Parameters:
  • x (array-like) – First sample.

  • y (array-like) – Second sample.

  • **kwargs – Keyword arguments are forwarded to same_population().

Return type:

TestResult

resample.permutation.usp(w: ArrayLike, *, size: int = 9999, method: str = 'auto', random_state: int | Generator | None = None) → TestResult

Test independence of two discrete data sets with the U-statistic.

The USP test is described in this paper: https://doi.org/10.1098/rspa.2021.0549. According to the paper, it outperforms Pearson’s χ² test and the G-test in both stability and power.

It requires that the input is a contingency table (a 2D histogram of value pairs). Whether the original values were discrete or continuous does not matter for the test. In the case of continuous values, using a large number of bins is safe, since the test is not negatively affected by bins with zero entries.

Parameters:
  • w (array-like) – Two-dimensional array which represents the counts in a histogram. The counts can be of floating point type, but must have integral values.

  • size (int, optional) – Number of permutations. Default 9999.

  • method (str, optional) – Method used to generate random tables under the null hypothesis. ‘auto’: use a heuristic to select the fastest algorithm for the given table. ‘boyett’: Boyett’s algorithm, which requires extra space to store N + 1 integers for N entries in total and has O(N) time complexity. It performs poorly when N is large, but does not depend on the number K of table cells. ‘patefield’: Patefield’s algorithm, which does not require extra space and has O(K log(N)) time complexity. It performs well even if N is huge. For small N and large K, Boyett’s algorithm is faster. Default is ‘auto’.

  • random_state (numpy.random.Generator or int, optional) – Random number generator instance. If an integer is passed, seed the numpy default generator with it. Default is to use numpy.random.default_rng().

Return type:

TestResult
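
Examples

A minimal sketch on a small hypothetical 2×2 contingency table; the exact p-value is omitted.

>>> from resample.permutation import usp
>>> import numpy as np
>>> w = np.array([[20, 5], [5, 20]])
>>> r = usp(w, random_state=1)
>>> bool(0 <= r.pvalue <= 1)
True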

empirical

Empirical functions.

Empirical functions based on a data sample instead of a parametric density function, like the empirical CDF. Implemented here are mostly tools used internally.

resample.empirical.cdf_gen(sample: ArrayLike) → Callable[[ndarray], ndarray]

Return the empirical distribution function for the given sample.

Parameters:

sample (array-like) – Sample.

Returns:

Empirical distribution function.

Return type:

callable
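
Examples

A minimal sketch: for the sample 0, …, 9, five of the ten values lie below 4.5, so the empirical CDF evaluates to 0.5 there.

>>> from resample.empirical import cdf_gen
>>> import numpy as np
>>> x = np.arange(10)
>>> F = cdf_gen(x)
>>> float(F(4.5))
0.5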

resample.empirical.influence(fn: Callable[[ArrayLike], ndarray], sample: ArrayLike) → ndarray

Calculate the empirical influence function for a given sample and estimator.

Parameters:
  • fn (callable) – Estimator. Can be any mapping ℝⁿ → ℝᵏ, where n is the sample size and k is the length of the output array.

  • sample (array-like) – Sample. Must be one-dimensional.

Returns:

Empirical influence values.

Return type:

ndarray
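
Examples

A minimal sketch: one influence value is returned per observation; the numeric values are omitted here.

>>> from resample.empirical import influence
>>> import numpy as np
>>> x = np.arange(10)
>>> infl = influence(np.mean, x)
>>> len(infl)
10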

resample.empirical.quantile_function_gen(sample: ArrayLike) → Callable[[float | ArrayLike], float | ndarray]

Return the empirical quantile function for the given sample.

Parameters:

sample (array-like) – Sample.

Returns:

Empirical quantile function.

Return type:

callable
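
Examples

A minimal usage sketch; the output is omitted because the exact value at a given probability depends on the interpolation convention of the empirical quantile function.

>>> from resample.empirical import quantile_function_gen
>>> import numpy as np
>>> x = np.arange(10)
>>> Q = quantile_function_gen(x)
>>> median = Q(0.5)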