[feature-request] Block data that is flagged by a CC "no-AI-training" license from being ingested into a DataLoader #1509

### 🚀 The feature

Add support for Creative Commons no-AI-training license flags, so that data carrying such a flag can be blocked from ingestion into a DataLoader.

### Motivation, pitch

Hello! Creative Commons is introducing "preference signal" licenses, an addition to CC licenses that lets a contributor indicate that they do not want their data used in model training without attribution, or at all (https://github.com/creativecommons/cc-signals). Currently, these signals are declared in robots.txt and in HTTP headers.

From what I can tell, this mechanism can't be meaningfully enforced at scraping time, since a scraper has no indication that the data will later be passed to a model. I am curious whether the strictest of these signals could be enforced technically at the point of ingestion into the PyTorch DataLoader.

What features would need to be added to ensure that data explicitly flagged as do-not-train cannot be ingested by a model? Is this even technically doable? If not, would that change if the license information were embedded in EXIF metadata or similar?
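To make the question concrete, here is a minimal sketch of what an ingestion-time filter might look like. It assumes each sample already comes with a metadata dict; the `license_signal` key, the `no-ai-training` value, and the `LicenseFilteredDataset` class are all hypothetical names invented for illustration, not part of PyTorch or of the CC signals proposal. The wrapper only implements `__len__` and `__getitem__`, which as far as I know is what a map-style dataset needs in order to be passed to a DataLoader.

```python
from typing import Any, Callable, Mapping, Sequence

# Hypothetical metadata key and values; NOT an official CC signals vocabulary.
LICENSE_KEY = "license_signal"
BLOCKED_SIGNALS = frozenset({"no-ai-training"})


def is_trainable(metadata: Mapping[str, Any]) -> bool:
    """Return False when the sample's metadata carries a blocked signal."""
    return metadata.get(LICENSE_KEY) not in BLOCKED_SIGNALS


class LicenseFilteredDataset:
    """Map-style wrapper that exposes only samples passing `predicate`."""

    def __init__(
        self,
        samples: Sequence[Any],
        metadata: Sequence[Mapping[str, Any]],
        predicate: Callable[[Mapping[str, Any]], bool] = is_trainable,
    ) -> None:
        if len(samples) != len(metadata):
            raise ValueError("samples and metadata must have the same length")
        self._samples = samples
        # Precompute the indices of allowed samples once, up front.
        self._allowed = [i for i, m in enumerate(metadata) if predicate(m)]

    def __len__(self) -> int:
        return len(self._allowed)

    def __getitem__(self, index: int) -> Any:
        # Raises IndexError past the end, like a list would.
        return self._samples[self._allowed[index]]
```

Note that in this sketch a sample with no flag at all is treated as trainable, which is a policy choice rather than a given. It also only works if the flag survives all the way to the dataset, and anyone can bypass it by not using the wrapper, so it is closer to a convenience than to enforcement.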

### Alternatives

There may be ways to implement this at other stages of a training pipeline.
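One such stage might be download time: the fetcher could record whatever signal it saw next to each file, so that later stages have something to check. The sketch below is only an illustration of that idea. The header name `x-example-usage-signal` is made up (I do not know what header name the CC proposal will settle on), and `extract_signal` and `manifest_line` are hypothetical helpers, not functions from any existing library.

```python
import json
from typing import Mapping, Optional

# Hypothetical header name, chosen for illustration only.
SIGNAL_HEADER = "x-example-usage-signal"


def extract_signal(headers: Mapping[str, str]) -> Optional[str]:
    """Return the normalized signal value from response headers, if any."""
    for name, value in headers.items():
        # HTTP header names are case-insensitive, so compare in lowercase.
        if name.lower() == SIGNAL_HEADER:
            return value.strip().lower()
    return None


def manifest_line(url: str, local_path: str, headers: Mapping[str, str]) -> str:
    """Build one JSON Lines record linking a downloaded file to its signal."""
    record = {
        "url": url,
        "path": local_path,
        "license_signal": extract_signal(headers),
    }
    return json.dumps(record, sort_keys=True)
```

A manifest like this would give a later filtering step per-file metadata to read, at the cost of relying on every scraper to write it faithfully.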

### Additional context

I am not affiliated with Creative Commons! This just seemed like a good discussion to kick off.
