After the initial publication of the blog post for transforms v2, we made some changes to the API:
- We have renamed our tensor subclasses from `Feature` to `Datapoint` and changed the namespace from `torchvision.features` to `torchvision.datapoints` accordingly.
- We have changed the fallback heuristic for plain tensors: previously, any plain tensor input was treated as an image and transformed as such. However, this was too limiting, as it prohibited passing any non-image data as tensors to the transforms, which in theory should just be passed through. The new heuristic goes as follows: if we find an explicit image or video (`datapoints.Image`, `datapoints.Video`, `PIL.Image.Image`) in the input sample, all other plain tensors are passed through. If there is no explicit image or video, only the first plain tensor will be treated as an image. The order is defined by traversing depth-first through the input sample, which is compatible with all torchvision datasets and should also work well for the vast majority of datasets out there. A short sketch of this behavior follows this list.
- We have removed the `color_space` metadata from `datapoints.Image` and `datapoints.Video` as well as the general `ConvertColorSpace` conversion transform and the corresponding functionals. This was done for the following reasons:
  - There is no apparent need for it. v1 comprises `Grayscale` and `RandomGrayscale`, and so far they seem to be sufficient. Apart from `ConvertColorSpace`, no other transform in v2 relied on the attribute. We acknowledge that there are of course use cases for color space conversions in a general CV library, but that doesn't apply to `torchvision`.
  - It is inefficient. Instead of reading an image in its native color space and converting it afterwards to the color space we want on the tensor level, `torchvision.io` offers the `ImageReadMode` enum, which handles this on the C level with the highly optimized routines of the decoding libraries we build against (see the decoding sketch after this list).
- Some transforms, with `Normalize` being the most prominent here, returned plain tensors instead of returning `datapoints.Image`s or `datapoints.Video`s. We dropped that in favor of preserving the original type everywhere, i.e. they now return `datapoints.Image`s or `datapoints.Video`s (the sketch after this list shows this as well), for two reasons:
  - Returning a tensor was originally chosen in order to add an extra layer of security: after the image is normalized, its range becomes non-standard, and so an RGB image in `[0, 1]` can now be in an arbitrary range, e.g. `[-2, 3]`. By returning a tensor instead of an Image, we wanted to convey the sense that it's not clear whether the image is still RGB. However, we realized that this didn't add any security layer, since plain tensors fall back to being transformed as an image or video anyway. On top of that, while a lot of transforms make an assumption about the range of an image (0-1 or 0-255), this assumption is embedded in the dtype of the image, not its type. Returning tensors would only change the type, not the dtype, and so wouldn't prevent the assumption from being applied anyway.
  - With the new fallback heuristic, this could even lead to problems when plain tensors come before the explicit image or video in the sample.
- Transformations that can partially or completely remove objects from the image, i.e. the affine transformations (`F.affine`, `F.rotate`, `F.perspective`, `F.elastic`) as well as cropping (`F.crop`), now clamp bounding boxes before returning them. Note that this does not remove bounding boxes that are fully outside the image. See the next point for that.
- We introduced the `SanitizeBoundingBox` transform that removes degenerate bounding boxes, for example bounding boxes that are fully outside the image after cropping, as well as the corresponding labels and, optionally, masks. It should be sufficient to have a single one of these transforms at the end of the pipeline, but it can also be used multiple times throughout. This sanitization was removed from transformations that previously had it built in, e.g. `RandomIoUCrop` (a usage sketch follows this list).
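To make the new fallback heuristic and the type preservation concrete, here is a minimal sketch. It assumes the v2 transforms are importable as `torchvision.transforms.v2`; adjust the import if they still live under `torchvision.prototype.transforms` in your version:

```python
import torch

from torchvision import datapoints
from torchvision.transforms import v2 as transforms  # assumption: adjust for your version

# A sample that mixes an explicit image with non-image data passed as a plain tensor.
sample = {
    "image": datapoints.Image(torch.rand(3, 224, 224)),  # float image in [0, 1]
    "timestamp": torch.tensor([1_234_567_890]),  # plain tensor, not an image
}

pipeline = transforms.Compose(
    [
        transforms.RandomCrop(128),
        transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]),
    ]
)

out = pipeline(sample)

# The sample contains an explicit image, so the plain tensor is passed through untouched ...
assert torch.equal(out["timestamp"], sample["timestamp"])
# ... and Normalize preserves the datapoint type instead of downgrading to a plain tensor.
assert isinstance(out["image"], datapoints.Image)
```

Under the old heuristic, the `timestamp` tensor above would have been treated as an image and cropped and normalized as well.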
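The decoding-time alternative to a `ConvertColorSpace` transform looks as follows; the file name is just a placeholder:

```python
from torchvision.io import ImageReadMode, read_image

# Decode straight into the desired color space instead of converting afterwards on the
# tensor level; the conversion happens inside the highly optimized decoding libraries.
img_rgb = read_image("example.jpg", mode=ImageReadMode.RGB)
img_gray = read_image("example.jpg", mode=ImageReadMode.GRAY)
```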
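And a sketch of where `SanitizeBoundingBox` fits into a detection pipeline. The bounding box constructor arguments and the `labels_getter` keyword are assumptions and may differ between versions:

```python
import torch

from torchvision import datapoints
from torchvision.transforms import v2 as transforms  # assumption: adjust for your version

sample = {
    "image": datapoints.Image(torch.rand(3, 256, 256)),
    # The constructor arguments (format / spatial_size) are assumptions and may differ between versions.
    "boxes": datapoints.BoundingBox(
        torch.tensor([[10.0, 10.0, 50.0, 50.0], [200.0, 200.0, 250.0, 250.0]]),
        format="XYXY",
        spatial_size=(256, 256),
    ),
    "labels": torch.tensor([1, 2]),
}

pipeline = transforms.Compose(
    [
        # Cropping clamps the surviving boxes, but keeps boxes that end up fully outside the image ...
        transforms.RandomCrop(128),
        # ... so the pipeline ends with a single sanitization step that drops degenerate boxes
        # together with their labels. Passing the sample key here is an assumption.
        transforms.SanitizeBoundingBox(labels_getter="labels"),
    ]
)

out = pipeline(sample)
```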
None of the above should affect the UX in a negative way. Unfortunately, there are also a few things that didn't make the initial cut:
- Batch transformations: `RandomCutMix`, `RandomMixUp`, and `SimpleCopyPaste` all operate on a batch of samples. This doesn't fit the canonical way of passing transforms to the `torchvision.datasets`, since there they are applied on a per-sample level. Thus, they will have to be used after batching is done, either in a custom collation function or separately from the data loader. In any case, using the default collation function loses the datapoint subclass, and thus the sample needs to be rewrapped before being passed into the transform. In their current state, these transforms barely improve on the current workflow (i.e. relying on the implementation in our training references). We're trying to come up with a significant workflow improvement before releasing these transforms to a wide range of users.
- `FixedSizeCrop` is the same as `RandomCrop`, but with a slightly different padding strategy in case the crop size is larger than the input. Although it is a 1-to-1 replica from a research paper, we feel it makes little sense to have both at the same time. Since `RandomCrop` is already present in v1, we kept it. Note that, similar to `RandomIoUCrop`, `FixedSizeCrop` had the bounding box sanitization built in, while `RandomCrop` does not.
- `datapoints.Label` and `datapoints.OneHotLabel`: These datapoints were needed for `RandomCutMix` and `RandomMixUp` as well as for the sanitization behavior of `RandomIoUCrop` and `FixedSizeCrop`. Since we are not releasing the former just yet and the new fallback heuristic allows us to pass plain tensors as images, the label datapoints currently don't have a use case. Another reason we removed the Label class is that it was not really clear whether a label referred to a `datapoints.BoundingBox`, a `datapoints.Mask`, or a `datapoints.Image`; there was nothing that structurally enforced that in our API. So each transform would make its own assumption about what the labels correspond to, and that could quickly lead to conflicts. Instead, we have decided to remove the Label class altogether and to always pass labels through (as plain tensors) in all transforms. The assumption about "what does the label correspond to" is now encapsulated in the `SanitizeBoundingBox` transform, which lets users manually specify the mapping. This avoids all other transforms having to make assumptions about whether they should be transforming the labels or not, and simplifies the mental model.
- `PermuteDimensions`, `TransposeDimensions`, and the `temporal_dim` parameter on `UniformTemporalSubsample`: These were introduced to improve the UX for video users, since the transformations expect videos in `*TCHW` format, while our models expect `CTHW`. However, this violates the assumptions about the format that we make for all transformations, meaning these transformations could only ever come at the end of a pipeline. Thus, we require users to call `video.transpose(-4, -3)` themselves for now (a short sketch follows this list).
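A minimal sketch of that manual layout change, assuming a single video tensor in `TCHW` order coming out of the transforms:

```python
import torch

# Transforms v2 operate on videos in (..., T, C, H, W) order.
video_tchw = torch.rand(16, 3, 112, 112)  # T=16 frames, C=3 channels

# The video models expect (..., C, T, H, W), so swap the temporal and channel
# dimensions manually at the end of the pipeline.
video_cthw = video_tchw.transpose(-4, -3)
assert video_cthw.shape == (3, 16, 112, 112)
```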
To be clear, we didn't remove this functionality. It is still available under `torchvision.prototype`. We want to add the batch transformations to the API, but haven't figured out a way to do it without making the API inconsistent in general. The others are less clear-cut and need a more general discussion first. Please stay tuned for any updates here.
cc @vfdev-5