dask-histogram

This is the documentation for dask-histogram, a Python library for computing histograms with Dask collections. The project is hosted on GitHub.

Getting Started

Installation

The only dependencies are Dask and boost-histogram.

Install dask-histogram with pip:

pip install dask-histogram

Or with conda via the conda-forge channel:

conda install dask-histogram -c conda-forge

We test dask-histogram on GNU/Linux, macOS, and Windows.

Overview

Dask-histogram provides a new collection type for lazily constructing histogram objects. It leverages the boost-histogram API to calculate histograms on chunked/partitioned data from the core Dask Array and DataFrame collections.

The main component is the dask_histogram.AggHistogram class. Users will typically create AggHistogram objects via the dask_histogram.factory() function, or the NumPy/dask.array-like functions in the dask_histogram.routines module. Another histogram class exists in the dask_histogram.boost module (dask_histogram.boost.Histogram) which inherits from boost_histogram.Histogram and overrides the fill function such that it is aware of chunked/partitioned Dask collections. This class is backed by dask_histogram.AggHistogram.
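
As a condensed sketch of these entry points (mirroring the examples in the next section; the axis definition is an arbitrary example):

>>> import boost_histogram as bh
>>> import dask.array as da
>>> import dask_histogram as dh
>>> import dask_histogram.boost as dhb
>>> x = da.random.uniform(size=(1000,), chunks=(250,))
>>> h1 = dh.factory(x, axes=(bh.axis.Regular(10, 0, 1),))  # AggHistogram
>>> h2, edges = dh.histogram(x, bins=10, range=(0, 1))     # dask.array-style return
>>> h3 = dhb.Histogram(bh.axis.Regular(10, 0, 1))          # boost-histogram subclass
>>> _ = h3.fill(x)  # stages a lazy fill; backed by an AggHistogram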

Examples

Using the dask_histogram.factory function

The dask_histogram.factory() function is the core piece of the dask-histogram API; all other parts of the public API use it.

The function takes in two core inputs: the Dask data to be histogrammed and the information that defines the histogram’s structure. The Dask data can be in Array, Series, or DataFrame form. The histogram structure can be defined using the axes and (optionally) storage arguments, or the histref argument can be used.

Histogramming one dimensional data:

>>> import boost_histogram as bh
>>> import dask.array as da
>>> import dask_histogram as dh
>>> x = da.random.uniform(size=(1000,), chunks=(250,))
>>> h = dh.factory(x, axes=(bh.axis.Regular(10, 0, 1),))
>>> h
dask_histogram.AggHistogram<histreduce-agg, ndim=1, storage=Double()>
>>> h.compute()
Histogram(Regular(10, 0, 1), storage=Double()) # Sum: 1000.0
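
Because compute() produces a concrete boost_histogram.Histogram, the usual boost-histogram accessors apply; a sketch with outputs elided (see the API reference for exact signatures):

>>> result = h.compute()
>>> counts = result.view()    # bin counts as a NumPy array
>>> arrs = h.to_dask_array()  # still-lazy counts and bin-edge arrays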

Using weights and a reference histogram:

>>> w = da.random.uniform(size=(1000,), chunks=(250,))
>>> ref = bh.Histogram(bh.axis.Regular(10, 0, 1), storage=bh.storage.Weight())
>>> h = dh.factory(x, weights=w, histref=ref)
>>> h
dask_histogram.AggHistogram<histreduce-agg, ndim=1, storage=Weight()>
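
Multidimensional histograms can be built by passing several one dimensional collections to factory() (its first parameter is *data). A sketch continuing from the arrays above; the repr is written to match the earlier examples:

>>> y = da.random.uniform(size=(1000,), chunks=(250,))
>>> h2d = dh.factory(
...     x, y,
...     axes=(bh.axis.Regular(10, 0, 1), bh.axis.Regular(10, 0, 1)),
... )
>>> h2d
dask_histogram.AggHistogram<histreduce-agg, ndim=2, storage=Double()>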

dask.array/NumPy-like Interface

Histograms can also be created via an API that mirrors the histogram functions in the dask.array module (which themselves mirror the NumPy API).

First, we explicitly ask for an AggHistogram object by passing histogram=True.

>>> import dask.array as da
>>> import dask_histogram as dh
>>> x = da.random.standard_normal(size=(10000, 2), chunks=(2000, 2))
>>> h = dh.histogramdd(x, bins=(10, 10), range=((-3, 3), (-3, 3)), histogram=True)
>>> h
dask_histogram.AggHistogram<histreduce-agg, ndim=2, storage=Double()>

If the histogram argument is left at its default value (None), we get the return style of the dask.array module (which itself follows the NumPy API): a tuple of bin counts and edges. An AggHistogram object is still used in the background, and the computation remains lazy:

>>> h, edges = dh.histogramdd(x, bins=(10, 10), range=((-3, 3), (-3, 3)))
>>> type(h)
<class 'dask.array.core.Array'>
>>> len(edges)
2
>>> type(edges[0])
<class 'dask.array.core.Array'>
>>> h.compute()
<result will be a 10x10 NumPy array of bin counts>
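
The lazy counts and edges can be materialized in a single pass with dask.compute(), which shares the underlying graph. A sketch continuing the example above:

>>> import dask
>>> counts, (ex, ey) = dask.compute(h, edges)
>>> counts.shape
(10, 10)
>>> ex.shape, ey.shape
((11,), (11,))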

Let’s consider a DataFrame called df with four columns: a, b, c, and w:

>>> df  
Dask DataFrame Structure:
                     a        b        c        w
npartitions=5
0              float64  float64  float64  float64
200                ...      ...      ...      ...
...                ...      ...      ...      ...
800                ...      ...      ...      ...
999                ...      ...      ...      ...
Dask Name: from_pandas, 5 tasks

First let’s consider a one dimensional histogram of a with weights w:

>>> h, edges = dh.histogram(df["a"], bins=12, range=(-3, 3), weights=df["w"]) 
>>> h  
dask.array<from-value, shape=(12,), dtype=float64, chunksize=(12,), chunktype=numpy.ndarray>
>>> edges 
dask.array<array, shape=(13,), dtype=float64, chunksize=(13,), chunktype=numpy.ndarray>

Note that the same histogram can be created with dask_histogram.factory() like so:

>>> h = dh.factory(df["a"], axes=(bh.axis.Regular(12, -3, 3),), weights=df["w"]) 
>>> h 
dask_histogram.AggHistogram<histreduce-agg, ndim=1, storage=Double()>

We can also grab multiple columns to histogram and return an AggHistogram object:

>>> h = dh.histogramdd(  
...     df[["a", "b", "c"]],
...     bins=(6, 7, 8),
...     range=((-3, 3),) * 3,
...     histogram=True,
... )
>>> h 
dask_histogram.AggHistogram<histreduce-agg, ndim=3, storage=Double()>

With weights and variable width bins:

>>> h = dh.histogramdd(  
...     df[["a", "c"]],
...     bins=[
...         [-3, -2, 0, 1, 2, 3],
...         [-2, -1, 1, 2],
...     ],
...     weights=df["w"],
...     storage=dh.storage.Weight(),
...     histogram=True,
... )
>>> h 
dask_histogram.AggHistogram<histreduce-agg, ndim=2, storage=Weight()>
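
With the Weight() storage, the computed result carries both the sum of weights and the sum of squared weights per bin; reading them back uses standard boost-histogram structured views (a sketch):

>>> result = h.compute()
>>> values = result.view()["value"]        # per-bin sum of weights
>>> variances = result.view()["variance"]  # per-bin sum of squared weights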

boost-histogram Inheriting Example

You’re encouraged to check out the documentation for boost-histogram; any example you see there should work in dask-histogram if the input data is a Dask collection.

In this example we will fill a 2D histogram with Gaussian data in both dimensions (notice that, for convenience, the boost_histogram.axis and boost_histogram.storage namespaces are brought in as dh.axis and dh.storage):

>>> import dask_histogram as dh
>>> import dask_histogram.boost as dhb
>>> import dask.array as da
>>> x = da.random.standard_normal(size=(100_000_000, 2), chunks=(10_000_000, 2))
>>> h = dhb.Histogram(
...     dh.axis.Regular(10, -3, 3),
...     dh.axis.Regular(10, -3, 3),
...     storage=dh.storage.Double(),
... )
>>> h.fill(x)  # <-- no computation occurs
Histogram(
  Regular(10, -3, 3),
  Regular(10, -3, 3),
  storage=Double()) # (has staged fills)
>>> h.empty()  # fills are only staged; the concrete histogram is still empty
True
>>> h.compute() 
Histogram(
  Regular(10, -3, 3),
  Regular(10, -3, 3),
  storage=Double()) # Sum: 99459483.0 (100000000.0 with flow)
>>> import dask
>>> dask.compute(h)  
(Histogram(
  Regular(10, -3, 3),
  Regular(10, -3, 3),
  storage=Double()) # Sum: 99459483.0 (100000000.0 with flow),)
>>> dask.compute(h.to_delayed())  
(Histogram(
  Regular(10, -3, 3),
  Regular(10, -3, 3),
  storage=Double()) # Sum: 99459483.0 (100000000.0 with flow),)
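
The "(has staged fills)" marker in the repr above suggests that fills accumulate; assuming that behavior, multiple fills can be staged before a single compute (a sketch):

>>> y = da.random.standard_normal(size=(100_000_000, 2), chunks=(10_000_000, 2))
>>> _ = h.fill(y)  # stage a second fill; still no computation
>>> result = h.compute()  # one graph executes both staged fills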

Note

More examples are shown in the API Reference.

Contributing

Development

The recommended way to set up development of a feature branch is an editable install of a fork of the git repository. First, make sure that you are working in a conda environment or your favorite style of virtual environment.

$ git clone git@github.com:<username>/dask-histogram.git
$ cd dask-histogram
$ git remote add upstream https://github.com/dask-contrib/dask-histogram.git
$ git checkout -b my-feature upstream/main
$ pip install -e .[complete]

The use of [complete] ensures that you install the dependencies for testing and building the documentation. You can also use [test] or [docs] to install only the dependencies for running tests or building documentation, respectively.
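
For example:

$ pip install -e .[test]
$ pip install -e .[docs]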

After running the tests on your new feature, push your branch to your fork and create a pull request.

Testing, etc.

We use pytest for testing; after installing with the [complete] option you can run (from the top level of the repository):

$ python -m pytest

We use black for formatting:

$ black src tests

And ruff for linting:

$ python -m ruff check .

Type hints are encouraged; we use mypy for static type checking:

$ mypy

Documentation

We use Sphinx to build the documentation.

$ cd docs
$ make html
# Open _build/html/index.html in a web browser

Install sphinx-autobuild and run it from the project root directory to get a live-updating instance of the documentation.

$ pip install sphinx-autobuild
$ sphinx-autobuild docs docs/_build/html
# Open a web browser at http://127.0.0.1:8000/

Releasing

Tagging a version

We use calendar versioning (CalVer) with the format YYYY.MM.X, where X is incremented according to how many releases have already occurred in the same year and month. For example, if the most recent release is the first release from March of 2023, it would be 2023.3.0; the next release (on any day in that same month) would be 2023.3.1.

Check the latest tag with git (or just visit the GitHub repository tags list):

$ git fetch --all --tags
$ git describe --tags $(git rev-list --tags --max-count=1)
2023.3.0

Create a new tag that follows our CalVer convention (continuing the 2023.3.0 example above, the next tag would be 2023.3.1):

$ git tag -a -m "2023.3.1" 2023.3.1

Push the tag to GitHub (assuming origin points to the dask-contrib/dask-histogram remote):

$ git push origin 2023.3.1

Making the release

To make a release of dask-histogram we just need the build and twine packages:

$ pip install build twine

The build-system that we use (hatch with hatch-vcs) will automatically set a version based on the latest tag in the repository; after making the tag we just need to generate the source distribution and wheel. This is handled by the build package:

$ python -m build

Now a new dist/ directory will appear, which contains the files (continuing to use our example version 2023.3.1):

dask_histogram-2023.3.1.tar.gz
dask_histogram-2023.3.1-py3-none-any.whl

Now we just upload these files to PyPI with twine:

$ twine upload dist/dask_histogram-2023.3.1*

The GitHub regro-cf-autotick-bot account will automatically create a pull request to release a new version on conda-forge.

dask_histogram

dask.array/NumPy-like API

histogram(x[, bins, range, normed, weights, ...])

Histogram Dask data in one dimension.

histogram2d(x, y[, bins, range, normed, ...])

Histogram Dask data in two dimensions.

histogramdd(a[, bins, range, normed, ...])

Histogram Dask data in multiple dimensions.

Collection API

AggHistogram(dsk, name, histref[, layer])

Aggregated Histogram collection.

PartitionedHistogram(dsk, name, npartitions, ...)

Partitioned Histogram collection.

factory(*data[, histref, axes, storage, ...])

Daskified Histogram collection factory function.

partitioned_factory(*data[, histref, axes, ...])

Daskified Histogram collection factory function; keeps the result partitioned (see the sketch after this listing).

boost-histogram-like API

Histogram(*axes[, storage, metadata])

Histogram object capable of lazy computation.
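
As a closing sketch, partitioned_factory() takes the same inputs as factory() but skips the final aggregation step, yielding a PartitionedHistogram with one filled histogram per input partition (the repr below is illustrative):

>>> import boost_histogram as bh
>>> import dask.array as da
>>> import dask_histogram as dh
>>> x = da.random.uniform(size=(1000,), chunks=(250,))
>>> ph = dh.partitioned_factory(x, axes=(bh.axis.Regular(10, 0, 1),))
>>> ph
dask_histogram.PartitionedHistogram<...>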