Examples
--------

Using the dask_histogram.factory function
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The :py:func:`dask_histogram.factory` function is the core piece of
the dask-histogram API; all other parts of the public API use it. The
function takes two core inputs: the Dask data to be histogrammed and
the information that defines the histogram's structure. The Dask data
can be in Array, Series, or DataFrame form. The histogram structure
can be defined using the `axes` and (optionally) `storage` arguments,
or with the `histref` argument.

Histogramming one-dimensional data:

.. code-block:: python

   >>> import boost_histogram as bh
   >>> import dask.array as da
   >>> import dask_histogram as dh
   >>> x = da.random.uniform(size=(1000,), chunks=(250,))
   >>> h = dh.factory(x, axes=(bh.axis.Regular(10, 0, 1),))
   >>> h
   dask_histogram.AggHistogram
   >>> h.compute()
   Histogram(Regular(10, 0, 1), storage=Double()) # Sum: 1000.0

Using weights and a reference histogram:

.. code-block:: python

   >>> w = da.random.uniform(size=(1000,), chunks=(250,))
   >>> ref = bh.Histogram(bh.axis.Regular(10, 0, 1), storage=bh.storage.Weight())
   >>> h = dh.factory(x, weights=w, histref=ref)
   >>> h
   dask_histogram.AggHistogram

dask.array/NumPy-like Interface
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

We can create histograms via an API that mirrors the functions in the
``dask.array`` module (which, in turn, mirrors the ``numpy`` API).

First, we explicitly ask for an
:py:obj:`AggHistogram <dask_histogram.AggHistogram>` object by using
the `histogram` argument.

.. code-block:: python

   >>> import dask.array as da
   >>> import dask_histogram as dh
   >>> x = da.random.standard_normal(size=(10000, 2), chunks=(2000, 2))
   >>> h = dh.histogramdd(x, bins=(10, 10), range=((-3, 3), (-3, 3)), histogram=True)
   >>> h
   dask_histogram.AggHistogram

If the `histogram` argument is left as the default value (``None``) we
get the return style of the ``dask.array`` module (which itself
supports a NumPy-like API), but we're using the ``AggHistogram``
object in the background; again, the computation is still lazy:

.. code-block:: python

   >>> h, edges = dh.histogramdd(x, bins=(10, 10), range=((-3, 3), (-3, 3)))
   >>> type(h)
   <class 'dask.array.core.Array'>
   >>> len(edges)
   2
   >>> type(edges[0])
   <class 'dask.array.core.Array'>
   >>> h.compute()  # doctest:+SKIP

Let's consider a DataFrame called ``df`` with four columns: `a`, `b`,
`c`, and `w`:

.. code-block:: python

   >>> df  # doctest:+SKIP
   Dask DataFrame Structure:
                        a        b        c        w
   npartitions=5
   0              float64  float64  float64  float64
   200                ...      ...      ...      ...
   ...                ...      ...      ...      ...
   800                ...      ...      ...      ...
   999                ...      ...      ...      ...
   Dask Name: from_pandas, 5 tasks

First let's consider a one-dimensional histogram of `a` with weights
`w`:

.. code-block:: python

   >>> h, edges = dh.histogram(df["a"], bins=12, range=(-3, 3), weights=df["w"])  # doctest:+SKIP
   >>> h  # doctest:+SKIP
   dask.array
   >>> edges  # doctest:+SKIP
   dask.array

Note that the same histogram can be created with
:py:func:`dask_histogram.factory` like so:

.. code-block:: python

   >>> h = dh.factory(df["a"], axes=(bh.axis.Regular(12, -3, 3),), weights=df["w"])  # doctest:+SKIP
   >>> h  # doctest:+SKIP
   dask_histogram.AggHistogram

We can also histogram multiple columns at once and return an
:py:obj:`AggHistogram <dask_histogram.AggHistogram>` object:

.. code-block:: python

   >>> h = dh.histogramdd(  # doctest:+SKIP
   ...     df[["a", "b", "c"]],
   ...     bins=(6, 7, 8),
   ...     range=((-3, 3),) * 3,
   ...     histogram=True,
   ... )
   >>> h  # doctest: +SKIP
   dask_histogram.AggHistogram

With weights and variable width bins:

.. code-block:: python

   >>> h = dh.histogramdd(  # doctest:+SKIP
   ...     df[["a", "c"]],
   ...     bins=[
   ...         [-3, -2, 0, 1, 2, 3],
   ...         [-2, -1, 1, 2],
   ...     ],
   ...     weights=df["w"],
   ...     storage=dh.storage.Weight(),
   ...     histogram=True,
   ... )
   >>> h  # doctest:+SKIP
   dask_histogram.AggHistogram

boost-histogram Inheriting Example
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

You're encouraged to check out the documentation for boost-histogram_;
any example you see there should work in dask-histogram if the input
data is a Dask collection.

In this example we will fill a 2D histogram with Gaussian data in both
dimensions (notice that, for convenience, the ``boost_histogram.axis``
and ``boost_histogram.storage`` namespaces are brought in as
``dh.axis`` and ``dh.storage``):

.. code-block:: python

   >>> import dask_histogram as dh
   >>> import dask_histogram.boost as dhb
   >>> import dask.array as da
   >>> x = da.random.standard_normal(size=(100_000_000, 2), chunks=(10_000_000, 2))
   >>> h = dhb.Histogram(
   ...     dh.axis.Regular(10, -3, 3),
   ...     dh.axis.Regular(10, -3, 3),
   ...     storage=dh.storage.Double(),
   ... )
   >>> h.fill(x)  # <-- no computation occurs
   Histogram(
     Regular(10, -3, 3),
     Regular(10, -3, 3),
     storage=Double()) # (has staged fills)
   >>> h.empty()
   True
   >>> h.compute()  # doctest:+SKIP
   Histogram(
     Regular(10, -3, 3),
     Regular(10, -3, 3),
     storage=Double()) # Sum: 99459483.0 (100000000.0 with flow)
   >>> import dask
   >>> dask.compute(h)  # doctest:+SKIP
   (Histogram(
     Regular(10, -3, 3),
     Regular(10, -3, 3),
     storage=Double()) # Sum: 99459483.0 (100000000.0 with flow),)
   >>> dask.compute(h.to_delayed())  # doctest:+SKIP
   (Histogram(
     Regular(10, -3, 3),
     Regular(10, -3, 3),
     storage=Double()) # Sum: 99459483.0 (100000000.0 with flow),)

.. note:: More examples are shown in the API Reference.

.. _boost-histogram: https://boost-histogram.readthedocs.io/en/latest/
.. _Dask: https://docs.dask.org/en/latest/
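
The examples above all histogram chunked Dask collections lazily and only produce counts when ``compute()`` is called. The reason this can work is that histograms sharing the same axes add bin by bin, so each chunk can be filled independently and the partial results summed afterwards. The plain-Python sketch below illustrates that idea only; it is not how dask-histogram is implemented, and ``fill_regular`` is a hypothetical helper written for this illustration. It mirrors the first example (1000 uniform values, four chunks of 250, ten regular bins on [0, 1)).

```python
import random


def fill_regular(values, bins, start, stop):
    """Count values into `bins` equal-width bins on [start, stop).

    Hypothetical helper for illustration; values outside the range are ignored.
    """
    counts = [0] * bins
    width = (stop - start) / bins
    for v in values:
        if start <= v < stop:
            # Clamp to guard against floating-point rounding at the upper edge.
            index = min(int((v - start) / width), bins - 1)
            counts[index] += 1
    return counts


random.seed(0)
data = [random.random() for _ in range(1000)]

# Split the data into four chunks of 250 values, like chunks=(250,).
chunks = [data[i:i + 250] for i in range(0, len(data), 250)]

# Fill one partial histogram per chunk, then add them bin by bin.
partials = [fill_regular(chunk, 10, 0.0, 1.0) for chunk in chunks]
total = [sum(column) for column in zip(*partials)]

# The aggregated result matches histogramming all of the data at once.
assert total == fill_regular(data, 10, 0.0, 1.0)
print(sum(total))  # 1000
```

Because the per-chunk fills are independent, they can be deferred and run in any order or in parallel, with the sum performed once at compute time.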