I’ve been thinking about this lately with all the AI coding tools becoming popular. These systems are trained on huge amounts of existing code to generate new code.
Are there any practical ways to stop AI crawlers from using my open source repositories for training?
I know websites can use robots.txt files to tell search bots what to avoid. Is there something similar for code repositories?
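For comparison, here’s the web-side version I mean: asking the documented AI crawler user agents to stay away via robots.txt (GPTBot is OpenAI’s crawler, CCBot is Common Crawl’s, and Google-Extended is Google’s AI-training control token):

```
# robots.txt: asks known AI-training crawlers to skip the whole site
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Google-Extended
Disallow: /
```

Is there an equivalent signal for a repository?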
I’m looking for technical solutions that responsible AI companies might actually respect. Could be special files, metadata, or anything else that works. Just want to understand my options here.
Been dealing with this at work lately. Most big AI companies do check licenses, so that’s your best bet.
Switch to GPL or AGPL if possible. These copyleft licenses require derivative works to stay open source, and whether a trained model counts as a derivative work is legally unsettled, which is exactly why commercial AI companies often filter copyleft code out of training sets rather than risk entangling their proprietary models.
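If you go this route, make the license machine-readable, since pipelines that filter by license rely on automated scanners rather than humans reading your README. The usual convention is an SPDX identifier at the top of every source file; here’s a sketch for a Python file (the year and name are placeholders):

```python
# SPDX-License-Identifier: AGPL-3.0-or-later
# Copyright (C) <year> <your name>
#
# Licensed under the GNU Affero General Public License v3.0 or later;
# see the LICENSE file in the repository root for the full terms.
```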
Technically, you can drop a .ai-training-exclusion file in your repo root. Be aware this is an emerging convention, not an enforced standard that platforms like GitHub formally recognize yet. Treat it as a signal: it won’t stop everyone, but companies that crawl politely get an unambiguous opt-out to honor.
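Because there’s no ratified schema for this file, the contents below are purely illustrative, just a machine-readable statement of intent that a well-behaved crawler could parse (every field name here is made up):

```
# .ai-training-exclusion (illustrative; no standardized format exists yet)
ai-training: disallow
scope: all
contact: maintainer@example.com
```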
I’ve also seen people add explicit anti-training clauses to their README and source headers. Enforceability is untested, but it makes your intent unambiguous.
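For instance, a notice along these lines (the wording is illustrative, not legal advice):

```
NOTICE: The author does not consent to the use of this repository or
its contents for training machine-learning or AI models. This notice
states intent; the binding terms are in the LICENSE file.
```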
Reality check though? If someone really wants your code for training, they’ll get it. Focus on legal protections first, then layer on technical barriers.
Also check your platform settings. GitHub and other hosts keep adding AI-related controls, so it’s worth watching your repository settings for an explicit training opt-out.