Thursday, September 19, 2024

How to stop AI-powered companies from using your online content to train their models

US company Cloudflare has created a button that lets website owners block AI crawlers, Euronews reports.

First there were ad blockers; now there is an artificial intelligence (AI) blocker. US cybersecurity company Cloudflare has created a button that lets its customers block AI crawlers from using their data. AI crawlers are internet bots that roam the web collecting training data.

John Graham-Cumming, the company’s chief technology officer, said:

We helped people protect against the scraping of their websites by bots (…) so I really think AI is the new iteration of content owners wanting to control how their content is used.

When a connection is made to a website hosted on Cloudflare, the company can see who is requesting access, including any artificial intelligence crawlers that identify themselves. The blocker responds to those crawlers with an error.
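Cloudflare has not published how its blocker is implemented, but the basic idea of refusing self-identified crawlers can be sketched in a few lines of server code. The sketch below (Python, standard library only; the list of bot names is illustrative, not complete) answers any request whose User-Agent header matches a known AI crawler with an HTTP 403 error instead of the page.

    from http.server import BaseHTTPRequestHandler, HTTPServer

    # Illustrative list only; real blockers track far more crawler names.
    AI_CRAWLERS = ("GPTBot", "ChatGPT-User", "OAI-SearchBot", "ClaudeBot")

    class BlockingHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            agent = self.headers.get("User-Agent", "")
            if any(bot in agent for bot in AI_CRAWLERS):
                # Self-identified AI crawler: return an error, not the content.
                self.send_error(403, "AI crawling not permitted")
                return
            self.send_response(200)
            self.send_header("Content-Type", "text/plain")
            self.end_headers()
            self.wfile.write(b"Hello, human visitor.\n")

    if __name__ == "__main__":
        HTTPServer(("", 8000), BlockingHandler).serve_forever()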

Graham-Cumming said that some AI bots impersonate human users when accessing a website, so Cloudflare built a machine learning model that estimates the likelihood that a given request comes from a human or from a bot.
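Cloudflare has not disclosed its model or the signals it uses, so the following is only a toy illustration of the idea: a logistic regression (Python with scikit-learn) fitted on invented request features, which outputs the probability that a request is automated.

    from sklearn.linear_model import LogisticRegression

    # Invented features per request: [requests per minute, pages per session,
    # ran JavaScript (1/0), sent cookies (1/0)]. Production systems rely on
    # far richer signals such as TLS fingerprints and request timing.
    X = [
        [300, 500, 0, 0],  # behaves like a bot
        [250, 420, 0, 1],  # behaves like a bot
        [2, 5, 1, 1],      # behaves like a human
        [4, 8, 1, 1],      # behaves like a human
    ]
    y = [1, 1, 0, 0]  # 1 = bot, 0 = human

    model = LogisticRegression().fit(X, y)

    # Estimated probability that a new request is from a bot.
    print(model.predict_proba([[180, 300, 0, 0]])[0][1])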

The CTO could not say which customers were using the new button, but said it was “very popular” among companies both small and large. Blocking AI crawlers is becoming more common in general, according to a study by the Data Provenance Initiative, a group of independent AI researchers.

Their recent analysis of more than 14,000 web domains found that five per cent of the data in C4, RefinedWeb, and Dolma, three publicly available web-scraped datasets, is now restricted. The researchers note that the figure rises to 25 per cent when only the highest-quality sources are considered.

Ways to block AI crawlers

There are ways to manually block AI crawlers from accessing your content. Raptive, a US-based company that advocates for authors, wrote in a guide that website owners can manually add commands to robots.txt, the file that tells crawlers which parts of a site they may access.

To do so, add the name of a popular AI company’s crawler, such as Anthropic’s, as the user-agent, then add a “Disallow” directive followed by a colon and a forward slash, as in the sketch below. Website owners should then clear their site’s cache and append /robots.txt to the end of their domain in the browser’s address bar to check that the changes are live.
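For illustration, such robots.txt entries might look like the following. GPTBot is the crawler name OpenAI has published; ClaudeBot is the name Anthropic is understood to use for its crawler, though each company’s documentation should be checked for its current user-agent tokens.

    User-agent: GPTBot
    Disallow: /

    User-agent: ClaudeBot
    Disallow: /

A “Disallow: /” line tells the named crawler to stay away from every path on the site. Compliant crawlers respect it, though robots.txt cannot technically force a bot to obey.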

In 2023, OpenAI published lines of code that website owners can use to block its three bots: OAI-SearchBot, ChatGPT-User, and GPTBot.
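Applied to a robots.txt file, blocking all three would look something like this (a sketch based on the bot names above; OpenAI’s documentation lists the current tokens):

    User-agent: OAI-SearchBot
    Disallow: /

    User-agent: ChatGPT-User
    Disallow: /

    User-agent: GPTBot
    Disallow: /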

OpenAI is also working on Media Manager, a tool intended to let creators better control what content is used to train generative AI. The company stated:

This will (be) (…) the first-ever tool of its kind to help us identify copyrighted text, images, audio and video across multiple sources and reflect creator preferences.

Some websites, such as Squarespace and Substack, have simple commands or toggles to disable AI scraping. Others, such as Tumblr and WordPress, have “prevent third-party sharing” settings that can be enabled to keep content out of AI training.

Slack users can also opt out of AI scraping by emailing the company’s support team.
