Notebook for xarray.map_blocks lacks description of how chunks affect the computation #317

Merged 2 commits on Jun 30, 2025
Changes from all commits
90 changes: 78 additions & 12 deletions advanced/map_blocks/simple_map_blocks.ipynb
@@ -12,7 +12,7 @@
"`map_blocks` is inspired by the `dask.array` function of the same name and lets\n",
"you map a function on blocks of the xarray object (including Datasets!).\n",
"\n",
"At _compute_ time, your function will receive an xarray object with concrete\n",
"At _compute_ time, your function will receive a chunk of an xarray object with concrete\n",
"(computed) values along with appropriate metadata. This function should return\n",
"an xarray object.\n"
]
@@ -89,7 +89,9 @@
"id": "7",
"metadata": {},
"source": [
"Let's open a dataset. We specify `chunks` so that we create a dask arrays for the DataArrays"
"Let's open a dataset. We specify `chunks` so that we create a dask arrays for the DataArrays.\n",
"\n",
"Depending on the desired function to be applied on the chunks, it is vital to set the chunks correctly. Our goal is to compute the mean along the time dimension. Therefore we do not chunk the time dimension at all (indicated by `\"time\": -1`). We deliberately set `lat` and `lon` chunks to something smaller then the size of their respective dimension (otherwise we would potentially end up with a single big chunk for the entire `ds`)."
]
},
{
@@ -99,7 +101,7 @@
"metadata": {},
"outputs": [],
"source": [
"ds = xr.tutorial.open_dataset(\"air_temperature\", chunks={\"time\": 100})\n",
"ds = xr.tutorial.open_dataset(\"air_temperature\", chunks={\"time\": -1, \"lat\": 5, \"lon\": 10})\n",
"ds"
]
},
@@ -120,11 +122,11 @@
"metadata": {},
"outputs": [],
"source": [
"def time_mean(obj):\n",
"def time_mean(obj: xr.Dataset):\n",
" # use xarray's convenient API here\n",
" # you could convert to a pandas dataframe and use pandas' extensive API\n",
" # or use .plot() and plt.savefig to save visualizations to disk in parallel.\n",
" return obj.mean(\"lat\")\n",
" return obj.mean(\"time\")\n",
"\n",
"\n",
"ds.map_blocks(time_mean) # this is lazy!"
@@ -136,33 +138,97 @@
"id": "11",
"metadata": {},
"outputs": [],
"source": [
"# this triggers the actual computation\n",
"ds.map_blocks(time_mean).compute()"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "12",
"metadata": {},
"outputs": [],
"source": [
"# this will calculate values and will return True if the computation works as expected\n",
"ds.map_blocks(time_mean).identical(ds.mean(\"lat\"))"
"ds.map_blocks(time_mean).equals(ds.mean(\"time\"))"
]
},
{
"cell_type": "markdown",
"id": "12",
"id": "13",
"metadata": {},
"source": [
"### Exercises\n"
]
},
{
"cell_type": "markdown",
"id": "14",
"metadata": {},
"source": [
"::::{admonition} Exercise 1\n",
":class: tip\n",
"\n",
"When opening the dataset, set the chunks for the dimension to anything smaller than the size of the time dimension (< 2920), e.g., `\"time\": 100`, and keep the size of the other chunks the same:\n",
"\n",
"```python\n",
"ds = xr.tutorial.open_dataset(\n",
" \"air_temperature\",\n",
" chunks={\"time\": 100, \"lat\": 5, \"lon\": 10},\n",
")\n",
"```\n",
"\n",
"Now run the notebook again. The result of `ds.map_blocks(time_mean)` is no more equivalent to `ds.mean(\"time\")`. Why does `ds.map_blocks(time_mean)` return a different result this time?\n",
"\n",
":::{admonition} Solution\n",
":class: dropdown\n",
"\n",
"Quoting from the documentation of `map_blocks`: _The function will receive a subset or ‘block’ of obj (see below), corresponding to one chunk along each chunked dimension._\n",
"\n",
"`ds.mean(\"time\")` computes the mean over the entire time dimension. In our example `ds.map_blocks(time_mean)` passes individual chunks of `ds` to `time_mean`. Once the time dimension is chunked, `time_mean` receives more than a single chunk along the dimension, meaning `time_mean` computes the mean along the time dimension for a single chunk rather than along the entire time dimension. Therefore we do not receive an identical result.\n",
"\n",
"You can also modify the function to show the shape of the chunks passed to `time_mean`. Compare the output of the modified function with `ds.chunks` to find out how they relate to each other!\n",
"\n",
"```python\n",
"def time_mean(obj: xr.Dataset):\n",
" print(f\"received obj of type {type(obj)}\")\n",
" print(\"obj contains the following data variables:\")\n",
" for data_var in obj.data_vars:\n",
" print(f\"'{data_var}' with shape {obj[data_var].shape}\")\n",
"\n",
" return obj.mean(\"time\")\n",
"```\n",
"\n",
":::\n",
"::::\n"
]
},
{
"cell_type": "markdown",
"id": "15",
"metadata": {},
"source": [
"### Exercise\n",
"::::{admonition} Exercise 2\n",
":class: tip \n",
"\n",
"Try applying the following function with `map_blocks`. Specify `scale` as an\n",
"argument and `offset` as a kwarg.\n",
"\n",
"The docstring should help:\n",
"https://docs.xarray.dev/en/stable/generated/xarray.map_blocks.html\n",
"\n",
"```\n",
"```python\n",
"def time_mean_scaled(obj, scale, offset):\n",
" return obj.mean(\"lat\") * scale + offset\n",
"```\n"
"```\n",
"\n",
"::::"
]
},
{
"cell_type": "markdown",
"id": "13",
"id": "16",
"metadata": {},
"source": [
"### More advanced functions\n",
Expand All @@ -177,7 +243,7 @@
{
"cell_type": "code",
"execution_count": null,
"id": "14",
"id": "17",
"metadata": {},
"outputs": [],
"source": [