Use path-like tools with DataTree #281
Description
Hello,
I am opening this issue to discuss integrating more path manipulation tools into the datatree library.
TLDR:
To summarize:
- pathlib-like methods for DataTree
- Indexing with Path objects
- To what extent is it possible to delegate some path operations to pathlib?
Details
This is a suggestion regarding DataTrees.
Since a DataTree can be accessed with strings representing Unix-like paths (slash-separated strings), and since Python has the pathlib library, I wondered whether DataTree could comply with part of the "pathlib API". This is especially relevant for Zarr files, which are persisted as literal directories on the filesystem. It seems natural to have the usual Path manipulation tools available inside a DataTree.
For instance, with pathlib we can glob or rglob on a directory path. Would it be possible to glob and rglob in DataTrees too? Currently there is a match function, but it introduces a new word, "match". Either dt.match("*/B") or dt.glob("*/B") would be okay, as would dt.glob('**/B') or dt.rglob('B').
The idea is that glob and rglob are already "standard" in the pathlib library and well known to Python developers who use pathlib.
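To make the difference between the two concrete, here is a rough sketch of glob-like and rglob-like matching over a plain list of node paths. Nothing here is datatree API: glob_paths, rglob_paths and node_paths are hypothetical names, and the sketch only handles per-component wildcards (no "**" support). It uses PurePosixPath just to split paths into components, so no filesystem is involved.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath


def _components(path):
    # Split an absolute slash-separated path into its components, without the root.
    return PurePosixPath(path).relative_to("/").parts


def glob_paths(paths, pattern):
    """Hypothetical helper: the pattern must match the whole path below the root."""
    pat_parts = PurePosixPath(pattern).parts
    matches = []
    for path in paths:
        parts = _components(path)
        if len(parts) == len(pat_parts) and all(
            fnmatchcase(part, pat) for part, pat in zip(parts, pat_parts)
        ):
            matches.append(path)
    return matches


def rglob_paths(paths, pattern):
    """Hypothetical helper: the pattern must match the trailing components at any depth."""
    pat_parts = PurePosixPath(pattern).parts
    matches = []
    for path in paths:
        parts = _components(path)
        if len(parts) < len(pat_parts):
            continue
        tail = parts[-len(pat_parts):]
        if all(fnmatchcase(part, pat) for part, pat in zip(tail, pat_parts)):
            matches.append(path)
    return matches


# Assumed example hierarchy, standing in for the paths of the nodes in a tree.
node_paths = ["/", "/a", "/a/B", "/b", "/b/B", "/b/c", "/b/c/B"]

one_level = glob_paths(node_paths, "*/B")   # only nodes named B exactly one level down
any_level = rglob_paths(node_paths, "B")    # nodes named B at any depth
```

With this example hierarchy, the glob-style call selects "/a/B" and "/b/B", while the rglob-style call also picks up "/b/c/B".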
Also, to keep the path aspect separate from the datatree API, we could imagine indexing a datatree with a Path: dt[Path('/a/b/c')]. Or, as an equivalent to dt.rglob('B'), dt[Path('/').rglob('B')]. I don't know how tightly pathlib is tied to the underlying filesystem. We could imagine using it on a "virtual filesystem hierarchy" provided by the DataTree itself:
dt_path: DataTreePath = dt.path()
for path in dt_path.rglob('B'):
    print(path)
(Note: I saw that a path property already exists on the DataTree object.)
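On the filesystem question: pathlib's pure classes (PurePath, PurePosixPath) only do string-level path computation and never touch the disk, whereas glob and rglob are defined on the concrete Path classes and walk the real filesystem. So indexing with a path object looks easy to support, while Path('/').rglob('B') as written would scan the actual disk. Below is a minimal sketch of the indexing part, assuming a hypothetical PathIndexedTree class backed by a dict; it is a stand-in for illustration, not the DataTree implementation.

```python
from pathlib import PurePath, PurePosixPath


class PathIndexedTree:
    """Hypothetical stand-in for a DataTree: maps absolute node paths to payloads."""

    def __init__(self, nodes):
        self._nodes = dict(nodes)

    def __getitem__(self, key):
        # Accept either a plain string or any pathlib path object.
        # as_posix() always yields forward slashes, even for Windows-flavoured paths.
        if isinstance(key, PurePath):
            key = key.as_posix()
        return self._nodes[key]


tree = PathIndexedTree({"/a": "node a", "/a/b": "node b", "/a/b/c": "node c"})

# Pure paths can be built and navigated without any filesystem access.
target = PurePosixPath("/a") / "b" / "c"

by_path = tree[target]           # same node as tree["/a/b/c"]
by_parent = tree[target.parent]  # same node as tree["/a/b"]
```

Using PurePosixPath rather than Path for the keys keeps the separator fixed to "/" on every platform, which matches how node paths are written in this issue.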
The current way to iterate over all nodes is to use .subtree:
for node in vertebrates.subtree:
    print(node.path)
For a Python developer who is discovering datatree but already knows how to work with paths, we could imagine:
for path, node in dt.rglob('*').items():
    ...
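As a sketch of what such an rglob could return, here is a toy version built on top of a subtree-style iteration. The Node class and the rglob function are hypothetical and only mimic the path and subtree attributes mentioned above; the pattern is matched against the node name only, and the root itself is excluded, as pathlib's rglob('*') does for the starting directory.

```python
from fnmatch import fnmatchcase


class Node:
    """Hypothetical minimal tree node; not the DataTree class."""

    def __init__(self, name, *children):
        self.name = name
        self.parent = None
        self.children = list(children)
        for child in self.children:
            child.parent = self

    @property
    def path(self):
        # The root is "/", every other node appends its name to its parent's path.
        if self.parent is None:
            return "/"
        return f"{self.parent.path.rstrip('/')}/{self.name}"

    @property
    def subtree(self):
        # Depth-first iteration over this node and all of its descendants.
        yield self
        for child in self.children:
            yield from child.subtree


def rglob(root, pattern):
    """Hypothetical helper: map path -> node for descendants whose name matches."""
    return {
        node.path: node
        for node in root.subtree
        if node is not root and fnmatchcase(node.name, pattern)
    }


# Assumed example tree with two nodes named B at different depths.
root = Node("", Node("a", Node("B")), Node("b", Node("c", Node("B"))))

all_paths = []
for path, node in rglob(root, "*").items():
    all_paths.append(path)

b_paths = list(rglob(root, "B"))
```

Returning a mapping from path to node keeps the familiar glob vocabulary while still giving direct access to the matched nodes, which is the ergonomics suggested above.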