[feature-request] Block data that is flagged by a CC "no-AI-training" license from being ingested into a DataLoader #1509

@GarrettMerz

🚀 The feature

Add support for Creative Commons No-AI-Training license flags

Motivation, pitch

Hello! Creative Commons is introducing "preference signal" licenses, an addition to CC licenses that lets a contributor indicate that they do not want their data used in model training without attribution, or at all (https://github.com/creativecommons/cc-signals). Currently, these signals are declared in robots.txt and in HTTP headers.

From what I can tell, this mechanism can't be meaningfully enforced at the point of site scraping (a scraper has no indication that the data will later be passed to a model), but I am curious whether the strictest of these signals could be enforced at a technical level at the point of ingestion into the PyTorch DataLoader.

What features would need to be added to ensure that data explicitly flagged as do-not-train cannot be ingested by a model (is this even technically doable)? If it is not doable, would that change if the license information were embedded in EXIF metadata or similar?
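To make the question concrete, here is a minimal sketch of what filtering at the dataset level might look like, assuming the flag has already been carried from scrape time into each stored sample. Everything named here is hypothetical: the `Sample` record, its `license_flag` field, the `"no-ai-training"` value, and the `LicenseFilteredDataset` wrapper are illustrations, not part of PyTorch or of the CC signals proposal.

```python
from dataclasses import dataclass
from typing import Any, FrozenSet, List, Sequence

# Hypothetical flag vocabulary; the real CC signal values may differ.
NO_TRAIN_FLAGS: FrozenSet[str] = frozenset({"no-ai-training"})


@dataclass(frozen=True)
class Sample:
    """Hypothetical record: payload plus the flag captured at scrape time."""
    data: Any
    license_flag: str = ""


class LicenseFilteredDataset:
    """Map-style wrapper (__len__/__getitem__) that hides blocked samples."""

    def __init__(self, samples: Sequence[Sample],
                 blocked: FrozenSet[str] = NO_TRAIN_FLAGS) -> None:
        self._samples = samples
        # Record only the indices whose normalized flag is not blocked.
        self._allowed: List[int] = [
            i for i, s in enumerate(samples)
            if s.license_flag.strip().lower() not in blocked
        ]

    def __len__(self) -> int:
        return len(self._allowed)

    def __getitem__(self, index: int) -> Any:
        # Raises IndexError past the end, as the sequence protocol expects.
        return self._samples[self._allowed[index]].data


samples = [
    Sample("a.jpg"),
    Sample("b.jpg", "No-AI-Training"),
    Sample("c.jpg", "cc-by"),
]
dataset = LicenseFilteredDataset(samples)
```

A wrapper like this exposes the map-style interface, so in principle it could be handed to a DataLoader in place of the raw data. It only helps if the flag survives from the scraper to the stored sample, and it cannot stop someone from bypassing the wrapper, so it is an opt-in safeguard rather than enforcement.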

Alternatives

There may be other ways to implement this at other stages within training pipelines.

Additional context

I am not affiliated with Creative Commons! This just seemed like a good discussion to kick off.
