AI Startup Anthropic Accused of Ignoring Anti-Scraping Rules

July 29, 2024 – AI startup Anthropic has been accused of violating the anti-scraping policies of the websites iFixit and Freelancer. Anthropic, best known for its Claude large language models, allegedly scraped these sites aggressively despite “do not crawl” directives in their robots.txt files.

Anthropic’s ClaudeBot is “the most aggressive scraper by far,” Freelancer CEO Matt Barrie told The Information, reporting that the bot generated 3.5 million hits in just four hours, a volume said to far exceed that of other AI crawlers. Similarly, iFixit CEO Kyle Wiens reported that Anthropic’s bot hit iFixit’s servers a million times in a single day. “You’re not only stealing our content without paying, you’re tying up our devops resources,” Wiens wrote on X/Twitter.

Wired’s June report alleging similar behavior by Perplexity sparked closer scrutiny of AI companies circumventing robots.txt rules. The Robots Exclusion Protocol lets site operators tell web crawlers which pages they may access, although compliance is voluntary. Cases of non-compliance have drawn attention: TollBit has disclosed that other AI companies, including Anthropic and OpenAI, have disregarded such directives.
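To make the mechanism concrete, here is a minimal sketch of how a well-behaved crawler is expected to honor these rules, using Python’s standard-library robots.txt parser. The rules and URL below are illustrative only, not iFixit’s or Freelancer’s actual files; “ClaudeBot” is the user-agent name reported in this story.

```python
# Sketch of the Robots Exclusion Protocol, using Python's stdlib parser.
# The rules below are hypothetical examples, not any site's real file.
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.parse([
    # Block one named crawler from the entire site...
    "User-agent: ClaudeBot",
    "Disallow: /",
    "",
    # ...while allowing everyone else.
    "User-agent: *",
    "Allow: /",
])

# A compliant crawler checks can_fetch() before requesting a page.
# Nothing technically stops a crawler that skips this check, which is
# why compliance is described as voluntary.
print(rp.can_fetch("ClaudeBot", "https://example.com/guides/"))      # False
print(rp.can_fetch("SomeOtherBot", "https://example.com/guides/"))   # True
```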

According to Barrie, Freelancer initially tried refusing the bot’s access requests and ultimately had to block Anthropic’s crawler entirely. “This is egregious scraping which makes the site slower for everyone and affects our revenue,” he stated. Wiens said Anthropic’s activity forced iFixit to set up alerts for high traffic, and that his team was woken at three in the morning by the crawler. The bot only stopped after it was expressly disallowed in iFixit’s robots.txt file.

In response, Anthropic said that its crawler complies with robots.txt and “respected that signal when iFixit implemented it.” The company says it is investigating the matter and aims “for minimal disruption by being thoughtful about how quickly [it crawls] the same domains”.

AI companies use web crawlers to collect training data for generative AI models, a practice that has drawn several copyright-infringement lawsuits from publishers. To reduce legal risk, companies such as OpenAI have struck content deals with publishers including News Corp, Vox Media, the Financial Times, and Reddit. Wiens indicated that iFixit is open to licensing agreements, saying the company might discuss terms for commercial use of its material.
