dask_histogram.histogramdd

dask_histogram.histogramdd(a, bins=10, range=None, normed=None, weights=None, density=False, *, histogram=None, storage=Double(), threads=None, split_every=None)

Histogram Dask data in multiple dimensions.

Parameters
  • a (DaskCollection or tuple[DaskCollection, ...]) –

    Data to histogram. Acceptable input data can be of the form:

    • A dask.array.Array of shape (N, D) where each row is a sample and each column is a specific coordinate for the samples.

    • A sequence of dask collections where each collection (e.g. array or series) contains all values for one coordinate of all samples.

  • bins (sequence of arrays, int, or sequence of ints) –

    The bin specification.

    The possible binning configurations are:

    • A sequence of arrays describing the monotonically increasing bin edges along each dimension.

    • A single int describing the total number of bins that will be used in each dimension (this requires the range argument to be defined).

    • A sequence of ints describing the total number of bins to be used in each dimension (this requires the range argument to be defined).

    When bins are described by arrays, the rightmost edge is included. Bins described by arrays also allow for non-uniform bin widths.

  • range (tuple(tuple(float, float), ...), optional) – A sequence of length D, each entry a (min, max) tuple giving the outer bin edges to be used if the edges are not given explicitly in bins. If defined, this argument must have an entry for each dimension. Unlike numpy.histogramdd(), this argument is required whenever bins does not define bin edges: the function will not automatically use the min and max of the values in a given dimension, because the input data may be lazy in Dask.

  • normed (bool, optional) – An unsupported argument that has been deprecated in the NumPy API (preserved to maintain calls dependent on argument order).

  • weights (dask.array.Array or dask.dataframe.Series, optional) – An array of values weighting each sample in the input data. The chunks of the weights must be identical to the chunking along the 0th (row) axis of the data sample.

  • density (bool) – If False (default), the returned array represents the number of samples in each bin. If True, the returned array represents the probability density function at each bin.

  • histogram (Any, optional) – If not None, a collection instance is returned instead of the array-style return.

  • storage (boost_histogram.storage.Storage) – Define the storage used by the Histogram object.

  • threads (int, optional) – Enable threading on Histogram.fill() calls.
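
The integer forms of bins only fix how many bins each dimension gets, so range has to supply the outer edges. The following is a minimal NumPy sketch of the uniform edges such a pair implies. The helper name uniform_edges and its ranges parameter are hypothetical, and the code illustrates the arithmetic only; it is not a description of how dask_histogram builds its axes internally.

```python
import numpy as np


def uniform_edges(bins, ranges):
    """Return one array of equally spaced bin edges per dimension.

    Hypothetical helper for illustration only; not part of dask_histogram.
    """
    # A single int means "the same number of bins in every dimension".
    if isinstance(bins, int):
        bins = (bins,) * len(ranges)
    if len(bins) != len(ranges):
        raise ValueError("ranges needs one (min, max) entry per dimension")
    # n bins between lo and hi require n + 1 edges.
    return [np.linspace(lo, hi, n + 1) for n, (lo, hi) in zip(bins, ranges)]


edges = uniform_edges((7, 5, 6), ((-3, 3), (-2.9, 2.9), (-3.1, 3.1)))
print([len(e) for e in edges])  # [8, 6, 7]
```

Without a (min, max) pair there is no way to place these edges short of scanning the data, which is what the lazy input is meant to avoid.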

Returns

The default return follows the style of dask.array.histogramdd(): an array of bin contents and arrays of bin edges. If the histogram argument is used, the return is a dask_histogram.AggHistogram collection instance.

Return type

tuple[da.Array, tuple[da.Array, …]] or AggHistogram

Examples

Creating a three-dimensional histogram with variable-width bins in each dimension. First, using three 1D arrays, one for each coordinate:

>>> import dask.array as da
>>> import dask_histogram as dh
>>> x = da.random.standard_normal(size=(10000,), chunks=(2000,))
>>> y = da.random.standard_normal(size=(10000,), chunks=(2000,))
>>> z = da.random.standard_normal(size=(10000,), chunks=(2000,))
>>> bins = [
...    [-3, -2, 0, 1, 3],
...    [-3, -1, 1, 2, 3],
...    [-3, -2, 0, 2, 3],
... ]
>>> h, edges = dh.histogramdd((x, y, z), bins=bins)
>>> type(h)
<class 'dask.array.core.Array'>
>>> h.shape
(4, 4, 4)
>>> len(edges)
3
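
For comparison, here is a sketch of the eager NumPy call that produces the same array-style return. It runs on in-memory data drawn with NumPy's own generator (the seed and variable names are arbitrary choices), so the counts will differ from the Dask example above; only the shapes are expected to line up.

```python
import numpy as np

rng = np.random.default_rng(0)
sample = rng.standard_normal(size=(10000, 3))  # (N, D): one row per sample

bins = [
    [-3, -2, 0, 1, 3],
    [-3, -1, 1, 2, 3],
    [-3, -2, 0, 2, 3],
]
counts, edges = np.histogramdd(sample, bins=bins)

print(counts.shape)  # (4, 4, 4): five edges give four bins per dimension
print(len(edges))    # 3: one edge array per dimension
```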

Now the same histogram but instead of a dask.array.histogramdd() style return (which mirrors the return style of numpy.histogramdd()), we use the histogram argument to trigger the return of a collection object:

>>> import dask.array as da
>>> import dask_histogram as dh
>>> x = da.random.standard_normal(size=(10000,), chunks=(2000,))
>>> y = da.random.standard_normal(size=(10000,), chunks=(2000,))
>>> z = da.random.standard_normal(size=(10000,), chunks=(2000,))
>>> bins = [
...    [-3, -2, 0, 1, 3],
...    [-3, -1, 1, 2, 3],
...    [-3, -2, 0, 2, 3],
... ]
>>> h = dh.histogramdd((x, y, z), bins=bins, histogram=True)
>>> h
dask_histogram.AggHistogram<histreduce-agg, ndim=3, storage=Double()>
>>> h.ndim
3
>>> h = h.compute()
>>> h
Histogram(
  Variable([-3, -2, 0, 1, 3]),
  Variable([-3, -1, 1, 2, 3]),
  Variable([-3, -2, 0, 2, 3]),
  storage=Double()) # Sum: 9919.0 (10000.0 with flow)

Another 3D histogram example but with an alternative dataset form (a single array with three columns), fixed bin widths, sample weights, and usage of the boost-histogram Weight() storage:

>>> a = da.random.standard_normal(size=(10000, 3), chunks=(2000, 3))
>>> w = da.random.uniform(0.5, 0.7, size=(10000,), chunks=2000)
>>> bins = (7, 5, 6)
>>> range = ((-3, 3), (-2.9, 2.9), (-3.1, 3.1))
>>> h = dh.histogramdd(
...     a,
...     bins=bins,
...     range=range,
...     weights=w,
...     histogram=True,
...     storage=dh.storage.Weight()
... )
>>> h
dask_histogram.AggHistogram<histreduce-agg, ndim=3, storage=Weight()>