
[NOMRG] TransformsV2 questions / comments #7092


Closed
wants to merge 15 commits
25 changes: 25 additions & 0 deletions torchvision/prototype/datapoints/_bounding_box.py
@@ -15,6 +15,31 @@ class BoundingBoxFormat(StrEnum):
CXCYWH = StrEnum.auto()


# What if... we just removed the format and spatial_size meta-data?
# A: We could, but it comes with trade-offs. For the format, this wouldn't lead
# to much of a difference, except that users would have to convert to XYXY
# before doing anything. All of the current stable ops expect XYXY already, so
# it's not much of a change. Worth noting as well that a few BBox transforms
# only have an implementation for the XYXY format, and they convert / re-convert
# internally (see e.g. affine_bounding_box, but there are others).
# Removing spatial_size, however, would make the dispatcher level more clunky
# for users. It wouldn't change much for the transforms classes as long as
# they're called with their respective image, e.g.
#     T(image, bbox)
# because the spatial_size can be known from the image param. But in a mid-level
# dispatcher which only accepts one kind of input, like
#     dispatcher(bbox)
# there's no way to know the spatial_size unless it's passed as a parameter.
# Users would also need to keep track of it, since some transforms actually
# change it:
#     bbox, sz = resize(bbox, spatial_size=sz)
# This also means the mid-level dispatchers:
# - need to accept as input anything that was meta-data (in this case
#   spatial_size)
# - need to return it as well, which means they need to return either a single
#   image, a single video, or a tuple of (bbox, spatial_size).
# TL;DR: things would get messy for users and for us.

class BoundingBox(Datapoint):
format: BoundingBoxFormat # TODO: do not use a builtin?
# TODO: This is the size of the image, not the box. Maybe make this explicit in the name?
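To make the clunkiness concrete, here is what a mid-level dispatcher could look like without the spatial_size metadata. This is a hypothetical sketch, not the torchvision API: resize_bounding_box and its signature are illustrative only. The caller has to thread spatial_size in and out manually.

```python
import torch

# Hypothetical sketch, NOT the torchvision API: a mid-level resize dispatcher
# for bare XYXY boxes, as it would look if BoundingBox carried no spatial_size.
def resize_bounding_box(bbox: torch.Tensor, spatial_size, size):
    old_h, old_w = spatial_size
    new_h, new_w = size
    # Per-coordinate scaling factors for XYXY boxes: (x1, y1, x2, y2).
    ratios = torch.tensor([new_w / old_w, new_h / old_h] * 2)
    # The caller must capture the returned spatial_size to stay consistent.
    return bbox * ratios, (new_h, new_w)

bbox = torch.tensor([[10.0, 20.0, 30.0, 40.0]])
bbox, sz = resize_bounding_box(bbox, spatial_size=(100, 100), size=(200, 200))
# bbox is now [[20., 40., 60., 80.]] and sz is (200, 200)
```

With spatial_size stored on the BoundingBox datapoint instead, a dispatcher like resize(bbox) can read and update the metadata internally and return just the box.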
2 changes: 1 addition & 1 deletion torchvision/prototype/transforms/_augment.py
@@ -94,7 +94,7 @@ def _get_params(self, flat_inputs: List[Any]) -> Dict[str, Any]:
def _transform(
self, inpt: Union[datapoints.ImageType, datapoints.VideoType], params: Dict[str, Any]
) -> Union[datapoints.ImageType, datapoints.VideoType]:
if params["v"] is not None:
if params["v"] is not None: # What is this?
Collaborator:
It's a value tensor or None used to erase the image

Collaborator:

Basically it is the replacement that is put in the "erased" area. In v1, in case we didn't find an area to erase, we return the bounding box of the whole image as well as the image

return 0, 0, img_h, img_w, img

With that we call F.erase unconditionally, which ultimately leads to replacing every value in the original image with itself:

if not inplace:
img = img.clone()
img[..., i : i + h, j : j + w] = v
return img

Since that is quite nonsensical, we opted to also allow None as a return value and use it as a sentinel to do nothing. I think the previous implementation came from a time when JIT didn't support Union (or Optional, for that matter) and thus we couldn't return Optional[torch.Tensor].

inpt = F.erase(inpt, **params, inplace=self.inplace)

return inpt
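The sentinel pattern described in the comments above can be sketched as follows. This is a simplified stand-in, not the actual RandomErasing implementation: get_erase_params and apply_erase are hypothetical names, and the parameter search is stubbed out to always "fail".

```python
import torch

def get_erase_params(img: torch.Tensor) -> dict:
    # The real transform samples a random area to erase; here we pretend the
    # search failed, so v=None is the sentinel meaning "do nothing".
    return {"i": 0, "j": 0, "h": 0, "w": 0, "v": None}

def apply_erase(img: torch.Tensor, params: dict) -> torch.Tensor:
    if params["v"] is not None:  # the check the inline question is about
        img = img.clone()
        i, j, h, w = params["i"], params["j"], params["h"], params["w"]
        img[..., i : i + h, j : j + w] = params["v"]
    return img  # returned untouched when v is None

img = torch.ones(3, 8, 8)
out = apply_erase(img, get_erase_params(img))
assert out is img  # no clone, no write: the no-op branch was taken
```

Compared to the v1 behavior quoted above, the no-op case skips the clone and the self-assignment entirely instead of erasing the whole image with itself.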
5 changes: 5 additions & 0 deletions torchvision/prototype/transforms/functional/_meta.py
@@ -190,6 +190,7 @@ def _xyxy_to_cxcywh(xyxy: torch.Tensor, inplace: bool) -> torch.Tensor:
return xyxy


# TODO: Maybe make this available as a class transform as well?
def convert_format_bounding_box(
bounding_box: torch.Tensor, old_format: BoundingBoxFormat, new_format: BoundingBoxFormat, inplace: bool = False
) -> torch.Tensor:
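For reference, the XYXY to CXCYWH case that convert_format_bounding_box dispatches to boils down to the arithmetic below. This is a simplified sketch of the private _xyxy_to_cxcywh helper shown in the hunk above, omitting its inplace path.

```python
import torch

def xyxy_to_cxcywh(xyxy: torch.Tensor) -> torch.Tensor:
    # XYXY stores the two corners; CXCYWH stores center, width, and height.
    x1, y1, x2, y2 = xyxy.unbind(-1)
    cx = (x1 + x2) / 2
    cy = (y1 + y2) / 2
    w = x2 - x1
    h = y2 - y1
    return torch.stack([cx, cy, w, h], dim=-1)

box = torch.tensor([[0.0, 0.0, 10.0, 20.0]])
# center (5, 10), width 10, height 20
assert torch.equal(xyxy_to_cxcywh(box), torch.tensor([[5.0, 10.0, 10.0, 20.0]]))
```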
@@ -437,6 +438,10 @@ def convert_dtype_video(video: torch.Tensor, dtype: torch.dtype = torch.float) -
return convert_dtype_image_tensor(video, dtype)


# TODO: this doesn't just change the dtype, it also changes the value range.
# This name relies on the implicit assumption that the value range is determined
# by the dtype. Maybe think of a more descriptive name if we can (once and for
# all)
def convert_dtype(
inpt: Union[datapoints.ImageTypeJIT, datapoints.VideoTypeJIT], dtype: torch.dtype = torch.float
) -> torch.Tensor:
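A quick illustration of the range change the TODO is pointing at, under the implicit assumption it mentions: uint8 images span [0, 255] while float images span [0.0, 1.0]. The function below (convert_dtype_sketch) is a hypothetical, simplified stand-in that only covers the uint8-to-float direction; the real convert_dtype handles many more dtype pairs.

```python
import torch

def convert_dtype_sketch(image: torch.Tensor, dtype: torch.dtype = torch.float) -> torch.Tensor:
    # Converting the dtype also rescales the values: 255 (uint8 max) maps
    # to 1.0, so the value range changes along with the dtype.
    if image.dtype == torch.uint8 and dtype.is_floating_point:
        return image.to(dtype) / 255.0
    raise NotImplementedError("sketch only covers uint8 -> float")

img = torch.tensor([0, 128, 255], dtype=torch.uint8)
out = convert_dtype_sketch(img)
# values are now 0.0, ~0.502, 1.0: the range changed, not just the dtype
```

That coupling is why a plain `.to(dtype)` is not equivalent, and why a name mentioning the value-range normalization might be more descriptive.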