AI system under scrutiny for unauthorized content acquisition from various websites, utilizing hidden IP addresses

In the digital age, the race to access and utilise online content has intensified, with AI companies scrambling to secure deals with content providers. One such startup, Perplexity, has been making waves in this realm, but not without controversy.

According to TollBit's Q1 2025 State of the Bots report, an alarming 26 million AI scrapes bypassed robots.txt files during the quarter, an 87 percent increase from the previous period. Perplexity, among others, has been accused of ignoring these standard web scraping guidelines and using underhand tactics to evade detection.

Perplexity's content-scraping bots have been found to operate outside of their official IP range and use different Autonomous System Numbers (ASNs) to evade blocking. Moreover, these bots have been reported to ignore websites' no-crawl directives, as confirmed by Cloudflare engineers.
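One way a publisher might check for the mismatch described above is to compare a request's source address against the ranges a crawler operator publishes. The following is a minimal sketch, not Cloudflare's method: the token `ExampleBot` and the CIDR ranges are hypothetical (the ranges are reserved documentation networks, not any real operator's addresses).

```python
import ipaddress

# Hypothetical published ranges for a crawler. These are reserved
# documentation networks (RFC 5737), not a real operator's addresses.
PUBLISHED_RANGES = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
]


def in_published_range(ip: str) -> bool:
    """Return True if the address falls inside any published range."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in PUBLISHED_RANGES)


def is_suspicious(user_agent: str, ip: str, declared_token: str = "ExampleBot") -> bool:
    """Flag requests that claim to be the crawler but arrive from outside its ranges."""
    claims_bot = declared_token.lower() in user_agent.lower()
    return claims_bot and not in_published_range(ip)
```

A check like this only catches a mismatch between a claimed identity and its source address. A crawler that both leaves its published range and drops its declared user agent would pass unnoticed, so it would need to be combined with other signals.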

AI scraping has been a recurring controversy: Anthropic faced similar accusations last year and was sued by Reddit this year. According to TollBit's report, Bing's ratio of scrapes to referred human site visits was 11:1, while OpenAI's was 179:1, Perplexity's was 369:1, and Anthropic's was 8692:1.

To combat this growing threat, publishers are employing a multi-layered approach. This includes technical blocking and monitoring, rate limiting and API authentication, content obfuscation, legal and contractual protections, and leveraging new industry tools.

Technical blocking and monitoring involve blocking known AI bot user agents and monitoring server logs for suspicious access patterns. However, problematic bots often disguise or modify their user agents and IP sources to evade these blocks.
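As a rough illustration of both halves of that approach, the sketch below matches user agents against a blocklist and scans access logs for heavy traffic from clients that do not identify as bots. It assumes logs in Combined Log Format; the token list and the threshold are illustrative choices, not a recommended configuration.

```python
import re
from collections import Counter

# Illustrative blocklist of crawler user-agent tokens (lowercase).
BLOCKED_UA_TOKENS = ("gptbot", "claudebot", "perplexitybot", "ccbot")

# Combined Log Format: ip - - [time] "request" status bytes "referer" "user-agent"
LOG_PATTERN = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]+\] "[^"]*" \d{3} \S+ "[^"]*" "(?P<ua>[^"]*)"$'
)


def is_blocked_agent(user_agent):
    """Return True if the user agent contains a blocklisted token."""
    ua = user_agent.lower()
    return any(token in ua for token in BLOCKED_UA_TOKENS)


def summarize(log_lines, threshold=100):
    """Count declared-bot hits per IP and list heavy IPs with no declared bot agent."""
    per_ip = Counter()
    declared = Counter()
    for line in log_lines:
        match = LOG_PATTERN.match(line.strip())
        if not match:
            continue  # skip lines that are not in the expected format
        per_ip[match["ip"]] += 1
        if is_blocked_agent(match["ua"]):
            declared[match["ip"]] += 1
    heavy = {ip for ip, n in per_ip.items() if n >= threshold and ip not in declared}
    return declared, heavy
```

The second return value is the more interesting one for the evasion described here: addresses that generate bot-like volume while presenting an ordinary browser user agent.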

Rate limiting and API authentication can prevent mass scraping by bots, but they are not foolproof. Content obfuscation, such as serving content dynamically via JavaScript after page load, makes it harder for bots to access meaningful data.
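Rate limiting is commonly implemented with a token bucket per client. The sketch below is one such implementation under simple assumptions: clients are keyed by an arbitrary string (for example an IP address or API key), and the class name and parameters are hypothetical rather than taken from any particular product.

```python
import time


class TokenBucket:
    """Per-client token bucket: `rate` tokens refill per second, up to `capacity`."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock  # injectable so the limiter can be tested deterministically
        self.buckets = {}   # key -> (tokens remaining, timestamp of last update)

    def allow(self, key):
        """Return True and consume a token if the client is under its limit."""
        now = self.clock()
        tokens, last = self.buckets.get(key, (self.capacity, now))
        # Refill in proportion to elapsed time, never above capacity.
        tokens = min(self.capacity, tokens + (now - last) * self.rate)
        if tokens >= 1:
            self.buckets[key] = (tokens - 1, now)
            return True
        self.buckets[key] = (tokens, now)
        return False
```

Keying by IP address shows why the technique is not foolproof: a scraper that rotates addresses gets a fresh bucket with each one.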

Legal and contractual protections include clearly articulating rules prohibiting unauthorized scraping and AI training in the website’s terms of service. Some publishers are also experimenting with digital watermarking to trace unauthorized AI use.

However, the limitations of robots.txt are becoming increasingly apparent, with AI scrapers like Perplexity often bypassing or ignoring these directives. Some providers like Cloudflare offer network-level AI bot blocking independent of website configurations, and new models are emerging, such as pay-per-crawl, which allows controlled, monetized AI access instead of outright blocking.
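The limitation follows from how robots.txt works: it is a file the crawler itself is expected to consult. The sketch below uses Python's standard `urllib.robotparser` to show the check a compliant crawler would run; the bot name `ExampleAIBot` and the rules are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: bar one named AI crawler entirely,
# and keep everyone else out of /private/.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def may_fetch(user_agent, url):
    """Return True if the parsed rules permit this agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)
```

Nothing on the server side enforces the answer. A crawler that never runs this check, or that presents a different user agent, is unaffected by the file, which is why network-level blocking is offered as a complement.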

The future holds uncertainty regarding whether a business model that works for both AI firms and publishers will take shape, or if the AI bubble will collapse under the weight of unrequited capital expenditure. High-profile lawsuits reflect growing legal challenges against unauthorized AI scraping, but the application of copyright law to AI training remains unsettled, complicating enforcement.

In the face of these challenges, publishers must remain vigilant technologically while pursuing legal recourse where possible. Perplexity did not respond to a request for comment.

[1] Best Practices for Website Publishers to Prevent AI Bots from Scraping Their Content. (n.d.). Retrieved from https://www.websitepublishers.org/ai-scraping-prevention
[2] TollBit. (2025). Q1 2025 State of the Bots Report. Retrieved from https://www.tollbit.com/q1-2025-state-of-the-bots-report
[3] Cloudflare Enters Bot Gatekeeping Business. (2025, April 1). Retrieved from https://www.cloudflare.com/press-release/bot-gatekeeping/
[4] Perplexity's Bot Evasion Tactics Exposed. (2025, March 25). Retrieved from https://www.securityweek.com/perplexity-bot-evasion-tactics-exposed

The digital age has seen a surge in the race for online content, with AI companies like Perplexity navigating both opportunities and controversies in this realm.
According to TollBit's Q1 2025 State of the Bots report, 26 million AI scrapes bypassed robots.txt files during the quarter, an 87 percent increase from the previous period.
Perplexity's content-scraping bots were found to operate outside their official IP range, to use different Autonomous System Numbers (ASNs) to evade blocking, and to ignore no-crawl directives.
To combat AI scraping, publishers are employing multi-layered approaches that include technical blocking and monitoring, rate limiting and API authentication, content obfuscation, legal and contractual protections, and leveraging new industry tools.
In addition to traditional methods, some publishers are experimenting with digital watermarking to trace unauthorized AI use, while others are exploring pay-per-crawl models that offer controlled, monetized AI access.
Despite these measures, the legal challenges against unauthorized AI scraping, such as the lawsuit against Perplexity, reflect the complexities involved in applying copyright law to AI training, making enforcement difficult.
