After the initial publication of the blog post for transforms v2, we made some changes to the API:
- We have renamed our tensor subclasses from `Feature` to `Datapoint` and changed the namespace from `torchvision.features` to `torchvision.datapoints` accordingly.
- We have changed the fallback heuristic for plain tensors: previously, any plain tensor input was treated as an image and transformed as such. However, this was too limiting, as it prohibited passing any non-image data as tensors to the transforms, which in theory should just be passed through. The new heuristic goes as follows: if we find an explicit image or video (`datapoints.Image`, `datapoints.Video`, `PIL.Image.Image`) in the input sample, all other plain tensors are passed through. If there is no explicit image or video, only the first plain tensor will be treated as an image. The order is defined by traversing depth-first through the input sample, which is compatible with all torchvision datasets and should also work well for the vast majority of datasets out there. A short sketch of this behavior follows this list.
- We have removed the `color_space` metadata from `datapoints.Image` and `datapoints.Video` as well as the general `ConvertColorSpace` conversion transform and the corresponding functionals. This was done for the following reasons:
  - There is no apparent need for it. v1 comprises `Grayscale` and `RandomGrayscale`, and so far they seem to be sufficient. Apart from `ConvertColorSpace`, no other transform in v2 relied on the attribute. We acknowledge that there are of course use cases for color space conversions in a general CV library, but that doesn't apply to `torchvision`.
  - It is inefficient. Instead of reading an image in its native color space and converting it afterwards to the color space we want on the tensor level, `torchvision.io` offers the `ImageReadMode` enum, which handles this on the C level with the highly optimized routines of the decoding libraries we build against (see the decoding sketch after this list).
- Some transforms, with `Normalize` being the most prominent here, returned plain tensors instead of returning `datapoints.Image`s or `datapoints.Video`s. We dropped that in favor of preserving the original type everywhere, i.e. they now return `datapoints.Image`s or `datapoints.Video`s (the sketch after this list shows this as well), for two reasons:
  - Returning a tensor was originally chosen in order to add an extra layer of security: after the image is normalized, its range becomes non-standard, and so an RGB image in `[0, 1]` can now be in an arbitrary range, e.g. `[-2, 3]`. By returning a tensor instead of an Image, we wanted to convey the sense that it's not clear whether the image is still RGB. However, we realized that this didn't add any security layer, since plain tensors fall back to being transformed as an image or video anyway. On top of that, while a lot of transforms make an assumption about the range of an image (0-1 or 0-255), this assumption is embedded in the dtype of the image, not its type. Returning tensors would only change the type, not the dtype, and so wouldn't prevent the assumption from being applied anyway.
  - With the new fallback heuristic, this could even lead to problems when plain tensors come before the explicit image or video in the sample.
- Transformations that can partially or completely remove objects from the image, i.e. the affine transformations (`F.affine`, `F.rotate`, `F.perspective`, `F.elastic`) as well as cropping (`F.crop`), now clamp bounding boxes before returning them. Note that this does not remove bounding boxes that are fully outside the image. See the next point for that.
- We introduced the `SanitizeBoundingBox` transform that removes degenerate bounding boxes, for example bounding boxes that are fully outside the image after cropping, as well as the corresponding labels and, optionally, masks. It should be sufficient to have a single one of these transforms at the end of the pipeline, but it can also be used multiple times throughout. This sanitization was removed from transformations that previously had it built in, e.g. `RandomIoUCrop` (a usage sketch follows this list).
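To make the new fallback heuristic and the type preservation concrete, here is a minimal sketch. It assumes the v2 transforms are importable as `torchvision.transforms.v2`; adjust the import if they still live under `torchvision.prototype.transforms` in your version:

```python
import torch

from torchvision import datapoints
from torchvision.transforms import v2 as transforms  # assumption: adjust for your version

# A sample that mixes an explicit image with non-image data passed as a plain tensor.
sample = {
    "image": datapoints.Image(torch.rand(3, 224, 224)),  # float image in [0, 1]
    "timestamp": torch.tensor([1_234_567_890]),  # plain tensor, not an image
}

pipeline = transforms.Compose(
    [
        transforms.RandomCrop(128),
        transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]),
    ]
)

out = pipeline(sample)

# The sample contains an explicit image, so the plain tensor is passed through untouched ...
assert torch.equal(out["timestamp"], sample["timestamp"])
# ... and Normalize preserves the datapoint type instead of downgrading to a plain tensor.
assert isinstance(out["image"], datapoints.Image)
```

Under the old heuristic, the `timestamp` tensor above would have been treated as an image and cropped and normalized as well.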
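The decoding-time alternative to a `ConvertColorSpace` transform looks as follows; the file name is just a placeholder:

```python
from torchvision.io import ImageReadMode, read_image

# Decode straight into the desired color space instead of converting afterwards on the
# tensor level; the conversion happens inside the highly optimized decoding libraries.
img_rgb = read_image("example.jpg", mode=ImageReadMode.RGB)
img_gray = read_image("example.jpg", mode=ImageReadMode.GRAY)
```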
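And a sketch of where `SanitizeBoundingBox` fits into a detection pipeline. The bounding box constructor arguments and the `labels_getter` keyword are assumptions and may differ between versions:

```python
import torch

from torchvision import datapoints
from torchvision.transforms import v2 as transforms  # assumption: adjust for your version

sample = {
    "image": datapoints.Image(torch.rand(3, 256, 256)),
    # The constructor arguments (format / spatial_size) are assumptions and may differ between versions.
    "boxes": datapoints.BoundingBox(
        torch.tensor([[10.0, 10.0, 50.0, 50.0], [200.0, 200.0, 250.0, 250.0]]),
        format="XYXY",
        spatial_size=(256, 256),
    ),
    "labels": torch.tensor([1, 2]),
}

pipeline = transforms.Compose(
    [
        # Cropping clamps the surviving boxes, but keeps boxes that end up fully outside the image ...
        transforms.RandomCrop(128),
        # ... so the pipeline ends with a single sanitization step that drops degenerate boxes
        # together with their labels. Passing the sample key here is an assumption.
        transforms.SanitizeBoundingBox(labels_getter="labels"),
    ]
)

out = pipeline(sample)
```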
None of the above should affect the UX in a negative way. Unfortunately, there are also a few things that didn't make the initial cut:
- Batch transformations: `RandomCutMix`, `RandomMixUp`, and `SimpleCopyPaste` all operate on a batch of samples. This doesn't fit the canonical way of passing transforms to the `torchvision.datasets`, since there they are applied on a per-sample level. Thus, they will have to be used after batching is done, either in a custom collation function or separately from the data loader. In any case, using the default collation function loses the datapoint subclass, and thus the sample needs to be rewrapped before being passed into the transform. In their current state, these transforms barely improve on the current workflow (i.e. relying on the implementation in our training references). We're trying to come up with a significant workflow improvement before releasing these transforms to a wide range of users.
- `FixedSizeCrop` is the same as `RandomCrop`, but with a slightly different padding strategy in case the crop size is larger than the input. Although it is a 1-to-1 replica from a research paper, we feel it makes little sense to have both at the same time. Since `RandomCrop` is already present in v1, we kept it. Note that, similar to `RandomIoUCrop`, `FixedSizeCrop` had the bounding box sanitization built in, while `RandomCrop` does not.
- `datapoints.Label` and `datapoints.OneHotLabel`: These datapoints were needed for `RandomCutMix` and `RandomMixUp` as well as for the sanitization behavior of `RandomIoUCrop` and `FixedSizeCrop`. Since we are not releasing the former just yet and the new fallback heuristic allows us to pass plain tensors as images, the label datapoints currently don't have a use case. Another reason we removed the Label class is that it was not really clear whether a label referred to a `datapoints.BoundingBox`, a `datapoints.Mask`, or a `datapoints.Image`; there was nothing that structurally enforced that in our API. So each transform would make its own assumption about what the labels correspond to, and that could quickly lead to conflicts. Instead, we have decided to remove the Label class altogether and to always pass labels through (as plain tensors) in all transforms. The assumption about "what does the label correspond to" is now encapsulated in the `SanitizeBoundingBox` transform, which lets users manually specify the mapping. This avoids all other transforms having to make assumptions about whether they should be transforming the labels or not, and simplifies the mental model.
- `PermuteDimensions`, `TransposeDimensions`, and the `temporal_dim` parameter on `UniformTemporalSubsample`: These were introduced to improve the UX for video users, since the transformations expect videos in `*TCHW` format, while our models expect `CTHW`. However, this violates the assumptions about the format that we make for all transformations, meaning these transformations could only ever come at the end of a pipeline. Thus, we require users to call `video.transpose(-4, -3)` themselves for now (a short sketch follows this list).
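A minimal sketch of that manual layout change, assuming a single video tensor in `TCHW` order coming out of the transforms:

```python
import torch

# Transforms v2 operate on videos in (..., T, C, H, W) order.
video_tchw = torch.rand(16, 3, 112, 112)  # T=16 frames, C=3 channels

# The video models expect (..., C, T, H, W), so swap the temporal and channel
# dimensions manually at the end of the pipeline.
video_cthw = video_tchw.transpose(-4, -3)
assert video_cthw.shape == (3, 16, 112, 112)
```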
To be clear, we didn't remove this functionality. It is still available under `torchvision.prototype`. We want to add the batch transformations to the API, but haven't figured out a way to do it without making the API inconsistent in general. The others are less clear-cut and need a more general discussion first. Please stay tuned for any updates here.
cc @vfdev-5