dask-histogram¶
This is the documentation for dask-histogram, a Python library for computing histograms with Dask collections. The project is hosted on GitHub.
Getting Started¶
Installation¶
The only dependencies are Dask and boost-histogram.
Install dask-histogram with pip:
pip install dask-histogram
Or with conda via the conda-forge channel:
conda install dask-histogram -c conda-forge
We test dask-histogram on GNU/Linux, macOS, and Windows.
Overview¶
Dask-histogram provides a new collection type for lazily constructing histogram objects. The API provided by boost-histogram is leveraged to calculate histograms on chunked/partitioned data from the core Dask Array and DataFrame collections.
The main component is the dask_histogram.AggHistogram class. Users will typically create AggHistogram objects via the dask_histogram.factory() function, or via the NumPy/dask.array-like functions in the dask_histogram.routines module. Another histogram class exists in the dask_histogram.boost module (dask_histogram.boost.Histogram), which inherits from boost_histogram.Histogram and overrides the fill function such that it is aware of chunked/partitioned Dask collections. This class is backed by dask_histogram.AggHistogram.
Examples¶
Using the dask_histogram.factory function¶
The dask_histogram.factory() function is the core piece of the dask-histogram API; all other parts of the public API use it.
The function takes in two core inputs: the Dask data to be histogrammed and the information that defines the histogram’s structure. The Dask data can be in Array, Series, or DataFrame form. The histogram structure can be defined using the axes and (optionally) storage arguments, or the histref argument can be used.
Histogramming one dimensional data:
>>> import boost_histogram as bh
>>> import dask.array as da
>>> import dask_histogram as dh
>>> x = da.random.uniform(size=(1000,), chunks=(250,))
>>> h = dh.factory(x, axes=(bh.axis.Regular(10, 0, 1),))
>>> h
dask_histogram.AggHistogram<hist-aggregate, ndim=1, storage=Double()>
>>> h.compute()
Histogram(Regular(10, 0, 1), storage=Double()) # Sum: 1000.0
Using weights and a reference histogram:
>>> w = da.random.uniform(size=(1000,), chunks=(250,))
>>> ref = bh.Histogram(bh.axis.Regular(10, 0, 1), storage=bh.storage.Weight())
>>> h = dh.factory(x, weights=w, histref=ref)
>>> h
dask_histogram.AggHistogram<hist-aggregate, ndim=1, storage=Weight()>
dask.array/NumPy-like Interface¶
We can create histograms via the API which mirrors the functions in the dask.array module (which itself mirrors the numpy API). First, we explicitly ask for an AggHistogram object by using the histogram argument.
>>> import dask.array as da
>>> import dask_histogram as dh
>>> x = da.random.standard_normal(size=(10000, 2), chunks=(2000, 2))
>>> h = dh.histogramdd(x, bins=(10, 10), range=((-3, 3), (-3, 3)), histogram=True)
>>> h
dask_histogram.AggHistogram<hist-aggregate, ndim=2, storage=Double()>
If the histogram argument is left as the default value (None), we get the return style of the dask.array module (which itself supports a NumPy-like API), but we are still using the AggHistogram object in the background; the computation remains lazy:
>>> h, edges = dh.histogramdd(x, bins=(10, 10), range=((-3, 3), (-3, 3)))
>>> type(h)
<class 'dask.array.core.Array'>
>>> len(edges)
2
>>> type(edges[0])
<class 'dask.array.core.Array'>
>>> h.compute()
<result will be a NumPy array>
Let’s consider a DataFrame called df with four columns: a, b, c, and w:
>>> df
Dask DataFrame Structure:
a b c w
npartitions=5
0 float64 float64 float64 float64
200 ... ... ... ...
... ... ... ... ...
800 ... ... ... ...
999 ... ... ... ...
Dask Name: from_pandas, 5 tasks
First let’s consider a one dimensional histogram of a with weights w:
>>> h, edges = dh.histogram(df["a"], bins=12, range=(-3, 3), weights=df["w"])
>>> h
dask.array<from-value, shape=(12,), dtype=float64, chunksize=(12,), chunktype=numpy.ndarray>
>>> edges
dask.array<array, shape=(13,), dtype=float64, chunksize=(13,), chunktype=numpy.ndarray>
Note that the same histogram can be created with dask_histogram.factory() like so:
>>> h = dh.factory(df["a"], axes=(bh.axis.Regular(12, -3, 3),), weights=df["w"])
>>> h
dask_histogram.AggHistogram<hist-aggregate, ndim=1, storage=Double()>
We can also grab multiple columns to histogram and return an AggHistogram object:
>>> h = dh.histogramdd(
... df[["a", "b", "c"]],
... bins=(6, 7, 8),
... range=((-3, 3),) * 3,
... histogram=True,
... )
>>> h
dask_histogram.AggHistogram<hist-aggregate, ndim=3, storage=Double()>
With weights and variable width bins:
>>> h = dh.histogramdd(
... df[["a", "c"]],
... bins=[
... [-3, -2, 0, 1, 2, 3],
... [-2, -1, 1, 2],
... ],
... weights=df["w"],
... storage=dh.storage.Weight(),
... histogram=True,
... )
>>> h
dask_histogram.AggHistogram<hist-aggregate, ndim=2, storage=Weight()>
boost-histogram Inheriting Example¶
You’re encouraged to check out the documentation for boost-histogram; any example you see there should work in dask-histogram if the input data is a Dask collection.
In this example we will fill a 2D histogram with Gaussian data in both dimensions (notice that, for convenience, the boost_histogram.axis and boost_histogram.storage namespaces are brought in as dh.axis and dh.storage):
>>> import dask_histogram as dh
>>> import dask_histogram.boost as dhb
>>> import dask.array as da
>>> x = da.random.standard_normal(size=(100_000_000, 2), chunks=(10_000_000, 2))
>>> h = dhb.Histogram(
... dh.axis.Regular(10, -3, 3),
... dh.axis.Regular(10, -3, 3),
... storage=dh.storage.Double(),
... )
>>> h.fill(x) # <-- no computation occurs
Histogram(
Regular(10, -3, 3),
Regular(10, -3, 3),
storage=Double()) # (has staged fills)
>>> h.empty()
True
>>> h.compute()
Histogram(
Regular(10, -3, 3),
Regular(10, -3, 3),
storage=Double()) # Sum: 99459483.0 (100000000.0 with flow)
>>> import dask
>>> dask.compute(h)
(Histogram(
Regular(10, -3, 3),
Regular(10, -3, 3),
storage=Double()) # Sum: 99459483.0 (100000000.0 with flow),)
>>> dask.compute(h.to_delayed())
(Histogram(
Regular(10, -3, 3),
Regular(10, -3, 3),
storage=Double()) # Sum: 99459483.0 (100000000.0 with flow),)
Note
More examples are shown in the API Reference.
Contributing¶
Development¶
The recommended way to set up development of a feature branch is an editable install of a fork of the git repository. First make sure that you are working in a conda environment or your favorite style of virtual environment.
$ git clone git@github.com:<username>/dask-histogram.git
$ cd dask-histogram
$ git remote add upstream https://github.com/dask-contrib/dask-histogram.git
$ git checkout -b my-feature upstream/main
$ pip install -e .[complete]
The use of [complete] ensures that you install the dependencies for testing and building the documentation. You can also use [test] or [docs] to install only the dependencies for running tests or building documentation, respectively.
After running the tests on your new feature, push your branch to your fork and create a pull request.
Testing, etc.¶
We use pytest for testing; after installing with the [complete] option you can run (from the top level of the repository):
$ python -m pytest
We use black for formatting:
$ black src tests
And ruff for linting:
$ python -m ruff .
Type hints are encouraged; we use mypy for static type checking:
$ mypy
Documentation¶
We use Sphinx to build the documentation.
$ cd docs
$ make html
# Open _build/html/index.html in a web browser
Install sphinx-autobuild to get a live-updating instance of the documentation, and run it from the project root directory.
$ pip install sphinx-autobuild
$ sphinx-autobuild docs docs/_build/html
# Open a web browser at http://127.0.0.1:8000/
dask_histogram¶
dask.array/NumPy-like API¶
histogram: Histogram Dask data in one dimension.
histogram2d: Histogram Dask data in two dimensions.
histogramdd: Histogram Dask data in multiple dimensions.
Collection API¶
AggHistogram: Aggregated Histogram collection.
PartitionedHistogram: Partitioned Histogram collection.
factory: Daskified Histogram collection factory function.
partitioned_factory: Daskified Histogram collection factory function; keep partitioned.