dask_histogram.factory

dask_histogram.factory(*data, histref=None, axes=None, storage=None, weights=None, sample=None, split_every=None, keep_partitioned=False)

Daskified Histogram collection factory function.

Given some data represented by Dask collections and the characteristics of a histogram (either a reference boost_histogram.Histogram object or a set of axes), this routine will create an AggHistogram or PartitionedHistogram collection.

Parameters
  • *data (DaskCollection) –

    The data to histogram. The supported forms of input data:

    • Single one dimensional dask array or Series: for creating a 1D histogram.

    • Single multidimensional dask array or DataFrame: for creating multidimensional histograms.

    • Multiple one dimensional dask arrays or Series: for creating multidimensional histograms.

  • histref (bh.Histogram or tuple, optional) – A reference histogram object, required if axes is not used. The dimensionality of histref must be compatible with the input data. If a tuple, it must have exactly three elements, in this order: a tuple of axes, the storage, and the metadata.

  • axes (Sequence[bh.axis.Axis], optional) – The axes of the histogram, required if histref is not used. The total number of axes must be equal to the number of dimensions of the resulting histogram given the structure of data.

  • storage (bh.storage.Storage, optional) – Storage type of the histogram, only compatible with use of the axes argument.

  • weights (DaskCollection, optional) – Weights associated with the data. The partitioning/chunking of the weights must be compatible with the input data.

  • sample (DaskCollection, optional) – Provide samples if the histogram storage allows it. The partitioning/chunking of the samples must be compatible with the input data.

  • split_every (int, optional) – How many blocks to combine in each split (reduction step) during aggregation.

  • keep_partitioned (bool, optional) – Deprecated argument. Use partitioned_factory().
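The role of split_every can be pictured as a tree reduction over per-block partial histograms. The following is a minimal pure-Python sketch of that idea, not the library's implementation: the names merge and tree_aggregate are hypothetical, and partial histograms are modeled as plain lists of bin counts.

```python
from typing import List, Sequence


def merge(parts: Sequence[Sequence[float]]) -> List[float]:
    """Element-wise sum of several partial histograms (lists of bin counts)."""
    return [sum(column) for column in zip(*parts)]


def tree_aggregate(partials: Sequence[Sequence[float]], split_every: int = 8) -> List[float]:
    """Hypothetical helper: combine partial histograms in groups of `split_every`.

    Each pass merges up to `split_every` neighbouring partial results into one,
    and passes repeat until a single histogram remains.
    """
    if split_every < 2:
        raise ValueError("split_every must be at least 2")
    if not partials:
        raise ValueError("need at least one partial histogram")
    level = [list(p) for p in partials]
    while len(level) > 1:
        level = [
            merge(level[i:i + split_every])
            for i in range(0, len(level), split_every)
        ]
    return level[0]


# Five blocks, three bins each; combine two blocks at a time.
partials = [[1, 0, 2], [0, 3, 1], [2, 2, 0], [1, 1, 1], [0, 0, 4]]
total = tree_aggregate(partials, split_every=2)
print(total)  # [4, 6, 8]
```

A smaller split_every gives a deeper tree with cheaper individual merges, while a larger value gives a shallower tree with more data combined per step. The final result is the same in either case.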

Returns

The resulting histogram collection.

Return type

AggHistogram or PartitionedHistogram

Raises

ValueError – If both histref and axes are given (neither is None), or if storage is used together with histref.
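The two documented error conditions can be summarized as a small argument check. This is an illustrative sketch only: check_factory_args is a hypothetical function, not part of dask_histogram, and it encodes only the rules stated above.

```python
def check_factory_args(histref=None, axes=None, storage=None):
    """Hypothetical check mirroring the documented ValueError conditions."""
    if histref is not None and axes is not None:
        raise ValueError("Pass either histref or axes, not both.")
    if histref is not None and storage is not None:
        raise ValueError("storage can only be combined with axes, not histref.")


# Placeholder objects stand in for real histogram, axis, and storage instances.
check_factory_args(axes=[object()], storage=object())  # accepted
check_factory_args(histref=object())                   # accepted

try:
    check_factory_args(histref=object(), axes=[object()])
except ValueError as err:
    print(f"rejected: {err}")
```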

Examples

Creating a three dimensional histogram using the axes argument:

>>> import boost_histogram as bh
>>> import dask.array as da
>>> import dask_histogram as dh
>>> x = da.random.uniform(size=(10000,), chunks=(2000,))
>>> y = da.random.uniform(size=(10000,), chunks=(2000,))
>>> z = da.random.uniform(size=(10000,), chunks=(2000,))
>>> bins = [
...    [0.0, 0.3, 0.4, 0.5, 1.0],
...    [0.0, 0.1, 0.2, 0.8, 1.0],
...    [0.0, 0.2, 0.3, 0.4, 1.0],
... ]
>>> axes = [bh.axis.Variable(b) for b in bins]
>>> h = dh.factory(x, y, z, axes=axes)
>>> h.shape
(4, 4, 4)
>>> h.compute()
Histogram(
  Variable([0, 0.3, 0.4, 0.5, 1]),
  Variable([0, 0.1, 0.2, 0.8, 1]),
  Variable([0, 0.2, 0.3, 0.4, 1]),
  storage=Double()) # Sum: 10000.0

Creating a weighted one dimensional histogram with the histref argument, then converting to the dask.array histogramming return style:

>>> x = da.random.uniform(size=(10000,), chunks=(2000,))
>>> w = da.random.uniform(size=(10000,), chunks=(2000,))
>>> ref = bh.Histogram(bh.axis.Regular(10, 0, 1))
>>> h = dh.factory(x, histref=ref, weights=w)
>>> counts, edges = h.to_dask_array()
>>> counts
dask.array<to-dask-array, shape=(10,), dtype=float64, chunksize=(10,), chunktype=numpy.ndarray>
>>> edges
dask.array<array, shape=(11,), dtype=float64, chunksize=(11,), chunktype=numpy.ndarray>