diff --git a/advanced/map_blocks/simple_map_blocks.ipynb b/advanced/map_blocks/simple_map_blocks.ipynb index 0fe06b16..af46f405 100644 --- a/advanced/map_blocks/simple_map_blocks.ipynb +++ b/advanced/map_blocks/simple_map_blocks.ipynb @@ -12,7 +12,7 @@ "`map_blocks` is inspired by the `dask.array` function of the same name and lets\n", "you map a function on blocks of the xarray object (including Datasets!).\n", "\n", - "At _compute_ time, your function will receive an xarray object with concrete\n", + "At _compute_ time, your function will receive a chunk of an xarray object with concrete\n", "(computed) values along with appropriate metadata. This function should return\n", "an xarray object.\n" ] @@ -89,7 +89,9 @@ "id": "7", "metadata": {}, "source": [ - "Let's open a dataset. We specify `chunks` so that we create a dask arrays for the DataArrays" + "Let's open a dataset. We specify `chunks` so that we create dask arrays for the DataArrays.\n", + "\n", + "The right chunking depends on the function you want to apply to the chunks. Our goal is to compute the mean along the time dimension, so we do not chunk the time dimension at all (indicated by `\"time\": -1`). We deliberately set the `lat` and `lon` chunk sizes to something smaller than the size of their respective dimensions (otherwise we could end up with a single big chunk for the entire `ds`)."
] }, { @@ -99,7 +101,7 @@ "metadata": {}, "outputs": [], "source": [ - "ds = xr.tutorial.open_dataset(\"air_temperature\", chunks={\"time\": 100})\n", + "ds = xr.tutorial.open_dataset(\"air_temperature\", chunks={\"time\": -1, \"lat\": 5, \"lon\": 10})\n", "ds" ] }, @@ -120,11 +122,11 @@ "metadata": {}, "outputs": [], "source": [ - "def time_mean(obj):\n", + "def time_mean(obj: xr.Dataset):\n", "    # use xarray's convenient API here\n", "    # you could convert to a pandas dataframe and use pandas' extensive API\n", "    # or use .plot() and plt.savefig to save visualizations to disk in parallel.\n", - "    return obj.mean(\"lat\")\n", + "    return obj.mean(\"time\")\n", "\n", "\n", "ds.map_blocks(time_mean) # this is lazy!" @@ -136,17 +138,79 @@ "id": "11", "metadata": {}, "outputs": [], + "source": [ + "# this triggers the actual computation\n", + "ds.map_blocks(time_mean).compute()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "12", + "metadata": {}, + "outputs": [], "source": [ "# this will calculate values and will return True if the computation works as expected\n", - "ds.map_blocks(time_mean).identical(ds.mean(\"lat\"))" + "ds.map_blocks(time_mean).equals(ds.mean(\"time\"))" ] }, { "cell_type": "markdown", - "id": "12", + "id": "13", + "metadata": {}, + "source": [ + "### Exercises\n" + ] + }, + { + "cell_type": "markdown", + "id": "14", + "metadata": {}, + "source": [ + "::::{admonition} Exercise 1\n", + ":class: tip\n", + "\n", + "When opening the dataset, set the chunk size for the time dimension to anything smaller than the size of the time dimension (< 2920), e.g., `\"time\": 100`, and keep the other chunk sizes the same:\n", + "\n", + "```python\n", + "ds = xr.tutorial.open_dataset(\n", + "    \"air_temperature\",\n", + "    chunks={\"time\": 100, \"lat\": 5, \"lon\": 10},\n", + ")\n", + "```\n", + "\n", + "Now run the notebook again. The result of `ds.map_blocks(time_mean)` is no longer equal to `ds.mean(\"time\")`.
Why does `ds.map_blocks(time_mean)` return a different result this time?\n", + "\n", + ":::{admonition} Solution\n", + ":class: dropdown\n", + "\n", + "Quoting from the documentation of `map_blocks`: _The function will receive a subset or ‘block’ of obj (see below), corresponding to one chunk along each chunked dimension._\n", + "\n", + "`ds.mean(\"time\")` computes the mean over the entire time dimension. In our example `ds.map_blocks(time_mean)` passes individual chunks of `ds` to `time_mean`. Once the time dimension is split into several chunks, each call to `time_mean` sees only one of those chunks, so it computes the mean over that chunk's time steps rather than over the entire time dimension. Therefore the result is no longer equal to `ds.mean(\"time\")`.\n", + "\n", + "You can also modify the function to print the shape of the chunks passed to `time_mean`. Compare the output of the modified function with `ds.chunks` to find out how they relate to each other!\n", + "\n", + "```python\n", + "def time_mean(obj: xr.Dataset):\n", + "    print(f\"received obj of type {type(obj)}\")\n", + "    print(\"obj contains the following data variables:\")\n", + "    for data_var in obj.data_vars:\n", + "        print(f\"'{data_var}' with shape {obj[data_var].shape}\")\n", + "\n", + "    return obj.mean(\"time\")\n", + "```\n", + "\n", + ":::\n", + "::::\n" + ] + }, + { + "cell_type": "markdown", + "id": "15", "metadata": {}, "source": [ - "### Exercise\n", + "::::{admonition} Exercise 2\n", + ":class: tip\n", "\n", "Try applying the following function with `map_blocks`.
Specify `scale` as an\n", "argument and `offset` as a kwarg.\n", @@ -154,15 +218,17 @@ "The docstring should help:\n", "https://docs.xarray.dev/en/stable/generated/xarray.map_blocks.html\n", "\n", - "```\n", + "```python\n", "def time_mean_scaled(obj, scale, offset):\n", "    return obj.mean(\"lat\") * scale + offset\n", - "```\n" + "```\n", + "\n", + "::::" ] }, { "cell_type": "markdown", - "id": "13", + "id": "16", "metadata": {}, "source": [ "### More advanced functions\n", @@ -177,7 +243,7 @@ { "cell_type": "code", "execution_count": null, - "id": "14", + "id": "17", "metadata": {}, "outputs": [], "source": [