dask_histogram.factory
- dask_histogram.factory(*data, histref=None, axes=None, storage=None, weights=None, sample=None, split_every=None, keep_partitioned=False)
Daskified Histogram collection factory function.
Given some data represented by Dask collections and the characteristics of a histogram (either a reference boost_histogram.Histogram object or a set of axes), this routine creates an AggHistogram or PartitionedHistogram collection.
- Parameters
*data (DaskCollection) –
The data to histogram. The supported forms of input data:
Single one dimensional dask array or Series: for creating a 1D histogram.
Single multidimensional dask array or DataFrame: for creating multidimensional histograms.
Multiple one dimensional dask arrays or Series: for creating multidimensional histograms.
histref (bh.Histogram or tuple, optional) – A reference histogram object, required if axes is not used. The dimensionality of histref must be compatible with the input data. If a tuple, it must have exactly three elements, in this order: a tuple of axes, the storage, and the metadata.
axes (Sequence[bh.axis.Axis], optional) – The axes of the histogram, required if histref is not used. The total number of axes must be equal to the number of dimensions of the resulting histogram given the structure of data.
storage (bh.storage.Storage, optional) – Storage type of the histogram, only compatible with use of the axes argument.
weights (DaskCollection, optional) – Weights associated with the data. The partitioning/chunking of the weights must be compatible with the input data.
sample (DaskCollection, optional) – Provide samples if the histogram storage allows it. The partitioning/chunking of the samples must be compatible with the input data.
split_every (int, optional) – How many blocks to use in each split during aggregation.
keep_partitioned (bool, optional) – Deprecated argument. Use partitioned_factory() instead.
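The rule that links the form of data to the required number of axes can be sketched in plain Python. The helper below, expected_ndim, is hypothetical and not part of dask_histogram. It works on shape tuples only and assumes that a "multidimensional" input is two dimensional, with one histogram axis per column.

```python
from typing import Sequence, Tuple


def expected_ndim(shapes: Sequence[Tuple[int, ...]]) -> int:
    """Number of histogram axes implied by the shapes of the *data arguments.

    Hypothetical helper for illustration only; not part of dask_histogram.
    """
    if not shapes:
        raise ValueError("at least one data argument is required")
    if len(shapes) == 1:
        shape = shapes[0]
        if len(shape) == 1:
            return 1  # single 1D array or Series: 1D histogram
        if len(shape) == 2:
            return shape[1]  # assumed (rows, columns): one axis per column
        raise ValueError("a single data argument must be 1D or 2D")
    if any(len(shape) != 1 for shape in shapes):
        raise ValueError("multiple data arguments must all be one dimensional")
    return len(shapes)  # multiple 1D arrays or Series: one axis each


print(expected_ndim([(10000,)]))       # single 1D input
print(expected_ndim([(10000, 3)]))     # single 2D input with three columns
print(expected_ndim([(10000,)] * 3))   # three 1D inputs
```

Under these assumptions, the length of axes (or the dimensionality of histref) should match the value returned for the given inputs.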
- Returns
The resulting histogram collection.
- Return type
- Raises
ValueError – If histref and axes are both not None, or if storage is used with histref.
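The two documented error conditions can be expressed as a small check. This is a sketch only: check_factory_args is a hypothetical name, the error messages are invented for illustration, and the library's actual validation may be ordered or worded differently.

```python
def check_factory_args(histref=None, axes=None, storage=None):
    """Mirror the documented ValueError conditions of factory().

    Hypothetical helper for illustration only; not part of dask_histogram.
    The case where histref and axes are both None is left unchecked here
    because the documentation above does not say what is raised for it.
    """
    if histref is not None and axes is not None:
        raise ValueError("pass either histref or axes, not both")
    if histref is not None and storage is not None:
        raise ValueError("storage can only be combined with axes")


# Accepted combinations (placeholder objects stand in for real arguments).
check_factory_args(axes=[object()], storage=object())
check_factory_args(histref=object())

# Rejected combination: histref together with axes.
try:
    check_factory_args(histref=object(), axes=[object()])
except ValueError as err:
    print("rejected:", err)
```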
Examples
Creating a three dimensional histogram using the axes argument:
>>> import boost_histogram as bh
>>> import dask.array as da
>>> import dask_histogram as dh
>>> x = da.random.uniform(size=(10000,), chunks=(2000,))
>>> y = da.random.uniform(size=(10000,), chunks=(2000,))
>>> z = da.random.uniform(size=(10000,), chunks=(2000,))
>>> bins = [
...     [0.0, 0.3, 0.4, 0.5, 1.0],
...     [0.0, 0.1, 0.2, 0.8, 1.0],
...     [0.0, 0.2, 0.3, 0.4, 1.0],
... ]
>>> axes = [bh.axis.Variable(b) for b in bins]
>>> h = dh.factory(x, y, z, axes=axes)
>>> h.shape
(4, 4, 4)
>>> h.compute()
Histogram(
  Variable([0, 0.3, 0.4, 0.5, 1]),
  Variable([0, 0.1, 0.2, 0.8, 1]),
  Variable([0, 0.2, 0.3, 0.4, 1]),
  storage=Double()) # Sum: 10000.0
Creating a weighted one dimensional histogram with the histref argument, then converting to the dask.array histogramming return style.
>>> x = da.random.uniform(size=(10000,), chunks=(2000,))
>>> w = da.random.uniform(size=(10000,), chunks=(2000,))
>>> ref = bh.Histogram(bh.axis.Regular(10, 0, 1))
>>> h = dh.factory(x, histref=ref, weights=w)
>>> counts, edges = h.to_dask_array()
>>> counts
dask.array<to-dask-array, shape=(10,), dtype=float64, chunksize=(10,), chunktype=numpy.ndarray>
>>> edges
dask.array<array, shape=(11,), dtype=float64, chunksize=(11,), chunktype=numpy.ndarray>
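The role of split_every during aggregation can be illustrated with a plain-Python tree reduction. This is a conceptual sketch, not the library's implementation: tree_reduce is a hypothetical function, and per-partition histograms are represented here as simple lists of bin counts.

```python
from typing import List


def tree_reduce(partials: List[List[float]], split_every: int) -> List[float]:
    """Sum per-partition bin counts in groups of at most `split_every`.

    Hypothetical illustration of a tree reduction; dask's real aggregation
    builds a task graph rather than looping eagerly like this.
    """
    if split_every < 2:
        raise ValueError("split_every must be at least 2")
    if not partials:
        raise ValueError("at least one partial result is required")
    level = partials
    while len(level) > 1:
        # Combine neighbouring groups of partial histograms bin by bin.
        level = [
            [sum(column) for column in zip(*level[i:i + split_every])]
            for i in range(0, len(level), split_every)
        ]
    return level[0]


# Five partitions, three bins each.
partials = [[1, 0, 2], [0, 3, 1], [2, 2, 2], [1, 1, 0], [0, 0, 4]]
total = tree_reduce(partials, split_every=2)
print(total)  # [4, 6, 9]
```

In this sketch a larger split_every gives fewer reduction levels but more inputs per combine step; the total is the same regardless of the value chosen.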