Examples¶
Using the dask_histogram.factory function¶
The dask_histogram.factory() function is the core piece of the dask-histogram API; all other parts of the public API use it.
The function takes two core inputs: the Dask data to be histogrammed and the information that defines the histogram’s structure. The Dask data can be in Array, Series, or DataFrame form. The histogram structure is defined either with the axes and (optionally) storage arguments, or with the histref argument, which takes a reference histogram.
Histogramming one dimensional data:
>>> import boost_histogram as bh
>>> import dask.array as da
>>> import dask_histogram as dh
>>> x = da.random.uniform(size=(1000,), chunks=(250,))
>>> h = dh.factory(x, axes=(bh.axis.Regular(10, 0, 1),))
>>> h
dask_histogram.AggHistogram<histreduce-agg, ndim=1, storage=Double()>
>>> h.compute()
Histogram(Regular(10, 0, 1), storage=Double()) # Sum: 1000.0
Using weights and a reference histogram:
>>> w = da.random.uniform(size=(1000,), chunks=(250,))
>>> ref = bh.Histogram(bh.axis.Regular(10, 0, 1), storage=bh.storage.Weight())
>>> h = dh.factory(x, weights=w, histref=ref)
>>> h
dask_histogram.AggHistogram<histreduce-agg, ndim=1, storage=Weight()>
dask.array/NumPy-like Interface¶
We can create histograms via an API that mirrors the functions in the dask.array module (which in turn mirrors the numpy API).
First, we explicitly ask for an AggHistogram object by passing histogram=True.
>>> import dask.array as da
>>> import dask_histogram as dh
>>> x = da.random.standard_normal(size=(10000, 2), chunks=(2000, 2))
>>> h = dh.histogramdd(x, bins=(10, 10), range=((-3, 3), (-3, 3)), histogram=True)
>>> h
dask_histogram.AggHistogram<histreduce-agg, ndim=2, storage=Double()>
If the histogram argument is left as the default value (None), we get the return style of the dask.array module (which itself supports a NumPy-like API), but we’re using the AggHistogram object in the background; again, the computation is still lazy:
>>> h, edges = dh.histogramdd(x, bins=(10, 10), range=((-3, 3), (-3, 3)))
>>> type(h)
<class 'dask.array.core.Array'>
>>> len(edges)
2
>>> type(edges[0])
<class 'dask.array.core.Array'>
>>> h.compute()
<result will be a NumPy array>
Let’s consider a DataFrame called df with four columns: a, b, c, and w:
>>> df
Dask DataFrame Structure:
                     a        b        c        w
npartitions=5
0              float64  float64  float64  float64
200                ...      ...      ...      ...
...                ...      ...      ...      ...
800                ...      ...      ...      ...
999                ...      ...      ...      ...
Dask Name: from_pandas, 5 tasks
First let’s consider a one dimensional histogram of a with weights w:
>>> h, edges = dh.histogram(df["a"], bins=12, range=(-3, 3), weights=df["w"])
>>> h
dask.array<from-value, shape=(12,), dtype=float64, chunksize=(12,), chunktype=numpy.ndarray>
>>> edges
dask.array<array, shape=(13,), dtype=float64, chunksize=(13,), chunktype=numpy.ndarray>
Note that the same histogram can be created with dask_histogram.factory() like so:
>>> h = dh.factory(df["a"], axes=(bh.axis.Regular(12, -3, 3),), weights=df["w"])
>>> h
dask_histogram.AggHistogram<histreduce-agg, ndim=1, storage=Double()>
We can also select multiple columns to histogram and return an AggHistogram object:
>>> h = dh.histogramdd(
... df[["a", "b", "c"]],
... bins=(6, 7, 8),
... range=((-3, 3),) * 3,
... histogram=True,
... )
>>> h
dask_histogram.AggHistogram<histreduce-agg, ndim=3, storage=Double()>
With weights and variable width bins:
>>> h = dh.histogramdd(
... df[["a", "c"]],
... bins=[
... [-3, -2, 0, 1, 2, 3],
... [-2, -1, 1, 2],
... ],
... weights=df["w"],
... storage=dh.storage.Weight(),
... histogram=True,
... )
>>> h
dask_histogram.AggHistogram<histreduce-agg, ndim=2, storage=Weight()>
boost-histogram Inheriting Example¶
You’re encouraged to check out the documentation for boost-histogram; any example you see there should work in dask-histogram if the input data is a Dask collection.
In this example we will fill a 2D histogram with Gaussian data in both dimensions (notice that, for convenience, the boost_histogram.axis and boost_histogram.storage namespaces are brought in as dh.axis and dh.storage):
>>> import dask_histogram.boost as dhb
>>> import dask.array as da
>>> x = da.random.standard_normal(size=(100_000_000, 2), chunks=(10_000_000, 2))
>>> h = dhb.Histogram(
... dh.axis.Regular(10, -3, 3),
... dh.axis.Regular(10, -3, 3),
... storage=dh.storage.Double(),
... )
>>> h.fill(x) # <-- no computation occurs
Histogram(
  Regular(10, -3, 3),
  Regular(10, -3, 3),
  storage=Double()) # (has staged fills)
>>> h.empty()
True
>>> h.compute()
Histogram(
  Regular(10, -3, 3),
  Regular(10, -3, 3),
  storage=Double()) # Sum: 99459483.0 (100000000.0 with flow)
>>> import dask
>>> dask.compute(h)
(Histogram(
  Regular(10, -3, 3),
  Regular(10, -3, 3),
  storage=Double()) # Sum: 99459483.0 (100000000.0 with flow),)
>>> dask.compute(h.to_delayed())
(Histogram(
  Regular(10, -3, 3),
  Regular(10, -3, 3),
  storage=Double()) # Sum: 99459483.0 (100000000.0 with flow),)
Note
More examples are shown in the API Reference.