
Recent changes to transforms v2 #7384

Open
@pmeier

Description

After the initial publication of the blog post for transforms v2, we made some changes to the API:

  • We have renamed our tensor subclasses from Feature to Datapoint and changed the namespace from torchvision.features to torchvision.datapoints accordingly.
  • We have changed the fallback heuristic for plain tensors: previously, any plain tensor input was treated as an image and transformed as such. However, this was too limiting, since it prohibited passing non-image data as plain tensors to the transforms, which in theory should just be passed through. The new heuristic goes as follows: if there is an explicit image or video (datapoints.Image, datapoints.Video, PIL.Image.Image) in the input sample, all plain tensors are passed through. If there is no explicit image or video, only the first plain tensor is treated as an image. "First" is determined by a depth-first traversal of the input sample, which is compatible with all torchvision datasets and should also work well for the vast majority of datasets out there (see the heuristic sketch after this list).
  • We have removed the color_space metadata from datapoints.Image and datapoints.Video as well as the general ConvertColorSpace conversion transform and corresponding functionals. This was done for two reasons:
    1. There is no apparent need for it. v1 only offers Grayscale and RandomGrayscale, and so far they seem to be sufficient. Apart from ConvertColorSpace, no other transform in v2 relied on the attribute. We acknowledge that there are of course use cases for color space conversions in a general CV library, but that doesn't apply to torchvision.
    2. It is inefficient. Instead of reading an image in its native color space and converting it afterwards to the desired color space on the tensor level, torchvision.io offers the ImageReadMode enum, which handles the conversion at the C level with the highly optimized routines of the decoding libraries we build against (see the decoding sketch after this list).
  • Some transforms, with Normalize being the most prominent, returned plain tensors instead of datapoints.Image or datapoints.Video instances. We dropped that in favor of preserving the original type everywhere (i.e. they now return datapoints.Image or datapoints.Video), for two reasons:
    1. Returning a tensor was originally chosen in order to add an extra layer of safety: after the image is normalized, its range becomes non-standard, so an RGB image in [0, 1] can end up in an arbitrary range, e.g. [-2, 3]. By returning a tensor instead of an Image, we wanted to convey that it is no longer clear whether the image is still RGB. However, we realized that this didn't add any safety, since plain tensors fall back to being transformed as image or video anyway. On top of that, while a lot of transforms make an assumption about the range of an image (0-1, 0-255), this assumption is embedded in the dtype of the image, not its type. Returning tensors would only change the type, not the dtype, and so wouldn't prevent the assumption from being applied anyway.
    2. With the new fallback heuristic, this could even lead to problems when you have plain tensors before the explicit image or video in the sample.
  • Transformations that can partially or completely push objects out of the image, i.e. the affine transformations (F.affine, F.rotate, F.perspective, F.elastic) as well as cropping (F.crop), now clamp bounding boxes to the image before returning them. Note that this does not remove bounding boxes that end up fully outside the image. See the next point for that.
  • We introduced the SanitizeBoundingBox transform that removes degenerate bounding boxes, for example bounding boxes that are fully outside the image after cropping, as well as the corresponding labels and optionally masks. A single one of these transforms at the end of the pipeline should be sufficient, but it can also be used multiple times throughout. This sanitization was removed from transformations that previously had it built in, e.g. RandomIoUCrop (see the pipeline sketch after this list).
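
The heuristic sketch referenced above: a minimal, hedged example assuming the beta torchvision.transforms.v2 and torchvision.datapoints namespaces; the dictionary keys are made up for illustration.

```python
import torch

from torchvision import datapoints
from torchvision.transforms import v2 as transforms

transform = transforms.RandomHorizontalFlip(p=1.0)

# An explicit image is present, so the plain tensor is passed through untouched.
sample = {
    "image": datapoints.Image(torch.rand(3, 32, 32)),
    "sensor_data": torch.arange(10),  # non-image data, left as-is
}
flipped = transform(sample)

# No explicit image or video: the first plain tensor found by a depth-first
# traversal ("pixels") is treated as the image and flipped; the remaining
# plain tensor is passed through.
sample = {
    "pixels": torch.rand(3, 32, 32),
    "sensor_data": torch.arange(10),
}
flipped = transform(sample)
```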
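
The decoding sketch referenced above: reading an image directly in the desired color space via ImageReadMode (the file path is a placeholder).

```python
from torchvision.io import ImageReadMode, read_image

# Decode straight to grayscale at the C level instead of decoding to RGB first
# and converting on the tensor level afterwards.
img = read_image("example.jpg", mode=ImageReadMode.GRAY)  # uint8 tensor of shape [1, H, W]
```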
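
The pipeline sketch referenced above: a hedged example of ending a detection pipeline with SanitizeBoundingBox. The transforms are assumed to come from the beta torchvision.transforms.v2 namespace, and the constructor arguments are left at their defaults since the exact signature may still change.

```python
from torchvision.transforms import v2 as transforms

# Hypothetical detection pipeline: the crop can leave degenerate boxes behind,
# and a single SanitizeBoundingBox at the end of the pipeline removes them
# together with the corresponding labels.
pipeline = transforms.Compose([
    transforms.RandomIoUCrop(),              # no longer sanitizes boxes itself
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.SanitizeBoundingBox(),
])
```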

None of the above should affect the UX in a negative way. Unfortunately, there are also a few things that didn't make the initial cut:

  • Batch transformations: RandomCutMix, RandomMixUp, and SimpleCopyPaste all operate on a batch of samples. This doesn't fit the canonical way of passing transforms to torchvision datasets, since there they are applied on a per-sample level. Thus, they have to be used after batching is done, either in a custom collation function or separately from the data loader. In any case, using the default collation function loses the datapoint subclass, so the sample needs to be rewrapped before being passed into the transform (see the rewrapping sketch after this list). In their current state, these transforms barely improve on the current workflow, i.e. relying on the implementation in our training references. We're trying to come up with a significant workflow improvement before releasing these transforms to a wide range of users.
  • FixedSizeCrop is the same as RandomCrop, but with a slightly different padding strategy in case the crop size is larger than the input. Although it is a 1-to-1 replica from a research paper, we feel it makes little sense to have both at the same time. Since RandomCrop is already present in v1, we kept it. Note that, similar to RandomIoUCrop, FixedSizeCrop had the bounding box sanitization built in, while RandomCrop does not.
  • datapoints.Label and datapoints.OneHotLabel: These datapoints were needed for RandomCutMix and RandomMixUp as well as for the sanitization behavior of RandomIoUCrop and FixedSizeCrop. Since we are not releasing the former just yet and the new fallback heuristic allows us to pass plain tensors as images, the label datapoints currently don't have a use case. Another reason we removed the Label class is that it was never really clear whether a label referred to a datapoints.BoundingBox, a datapoints.Mask, or a datapoints.Image - there was nothing in our API that structurally enforced this. Each transform would therefore make its own assumption about what the labels correspond to, which could quickly lead to conflicts. Instead, we have decided to remove the Label class altogether and to always pass labels through (as plain tensors) in all transforms. The assumption about what the labels correspond to is now encapsulated in the SanitizeBoundingBox transform, which lets users specify the mapping manually. This avoids all other transforms having to make assumptions about whether they should be transforming the labels and simplifies the mental model.
  • PermuteDimensions, TransposeDimensions, and the temporal_dim parameter on UniformTemporalSubsample: These were introduced to improve the UX for video users, since the transformations expect videos in *TCHW format, while our models expect CTHW. However, reordering the dimensions violates the format assumptions that all transformations make, meaning these transforms could only ever come at the end of a pipeline. Thus, we require users to call video.transpose(-4, -3) for now (see the transpose sketch after this list).
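
The rewrapping sketch referenced above: a hedged illustration of applying a batch transform after the default collation. The batch transform itself is abstracted as a callable, since the prototype constructors and call conventions are not shown here; the point is only that the collated plain tensors have to be rewrapped as datapoints first.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

from torchvision import datapoints

# Dummy stand-in for a real torchvision dataset.
dataset = TensorDataset(torch.rand(64, 3, 32, 32), torch.randint(0, 10, (64,)))
loader = DataLoader(dataset, batch_size=16)  # default collate_fn returns plain tensors


def apply_batch_transform(batch_transform, images, labels):
    # The default collation drops the datapoint subclass, so rewrap the batch
    # before handing it to a batch transform such as RandomCutMix or
    # RandomMixUp from torchvision.prototype.transforms.
    images = datapoints.Image(images)
    return batch_transform(images, labels)


def identity(images, labels):
    # Trivial stand-in for a real batch transform, just to keep the sketch runnable.
    return images, labels


for images, labels in loader:
    images, labels = apply_batch_transform(identity, images, labels)
```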
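
The transpose sketch referenced above, using a dummy video tensor:

```python
import torch

# The transforms operate on videos in [..., T, C, H, W] layout, while the video
# models expect [..., C, T, H, W]; swapping the third- and fourth-to-last
# dimensions converts between the two.
video_tchw = torch.rand(16, 3, 112, 112)   # [T, C, H, W]
video_cthw = video_tchw.transpose(-4, -3)  # [C, T, H, W]
```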

To be clear, we didn't remove this functionality. It is still available under torchvision.prototype. We want to add the batch transformations to the API, but haven't figured out a way to do it without making the API inconsistent in general. The others are less clear and need a more general discussion first. Please stay tuned for updates here.

cc @vfdev-5
