🚀 The feature
Add support for Creative Commons No-AI-Training license flags
Motivation, pitch
Hello! Creative Commons is introducing "preference signal" licenses, an addition to CC licenses that indicates that a contributor does not wish for their data to be used in model training without attribution, or at all (https://github.com/creativecommons/cc-signals). Currently, they are indicated in robots.txt and in an HTTP header.
From what I can tell, this mechanism can't be meaningfully enforced at the point of site scraping (a scraper has no indication that the data will later be passed to a model), but I am curious whether the strictest of these signals could be enforced at a technical level at the point of ingestion into the PyTorch DataLoader.
What features would need to be added to ensure that data explicitly flagged as do-not-train cannot be ingested by a model (is this even technically doable)? If it is not doable, would that change if the license information were embedded in EXIF metadata or similar?
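To make the question concrete, here is a rough sketch of what ingestion-time filtering might look like, assuming the signal has somehow travelled with each sample as metadata. Everything named here is hypothetical: the `OptOutFilteredDataset` class, the `license_flags` key, and the `no-ai-training` value are placeholders, not part of PyTorch or of the CC signals proposal. The wrapper only implements `__len__` and `__getitem__`, the map-style dataset protocol, so it runs without importing torch.

```python
from typing import Any, Mapping, Sequence

# Hypothetical flag value; the real CC signal vocabulary may differ.
DO_NOT_TRAIN = "no-ai-training"


class OptOutFilteredDataset:
    """Map-style wrapper that hides samples flagged as do-not-train.

    Assumes each sample is a mapping with a "data" entry and an optional
    list of flags under `flag_key` (both names are illustrative).
    """

    def __init__(
        self,
        samples: Sequence[Mapping[str, Any]],
        flag_key: str = "license_flags",
    ) -> None:
        self._samples = samples
        # Record only the indices of samples that do not carry the opt-out flag.
        self._allowed = [
            i
            for i, sample in enumerate(samples)
            if DO_NOT_TRAIN not in sample.get(flag_key, ())
        ]

    def __len__(self) -> int:
        return len(self._allowed)

    def __getitem__(self, idx: int) -> Any:
        # Translate the filtered index back to the underlying sample.
        return self._samples[self._allowed[idx]]["data"]


samples = [
    {"data": "img_0", "license_flags": ["cc-by"]},
    {"data": "img_1", "license_flags": ["no-ai-training"]},
    {"data": "img_2"},  # no metadata at all
]
dataset = OptOutFilteredDataset(samples)
print([dataset[i] for i in range(len(dataset))])
```

Note that this sketch treats samples with no metadata as allowed, which is a policy choice; a stricter variant could reject anything without an explicit permission. It also only works if the flag survives from the scrape to the dataset, which is the hard part of the question above.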
Alternatives
There may be other ways to implement this at other stages within training pipelines.
Additional context
I am not affiliated with Creative Commons! This just seemed like a good discussion to kick off.