The AI company Perplexity is complaining their bots can't bypass Cloudflare's firewall
(www.searchenginejournal.com)
I actually agree with them
This feels like cloudflare trying to collect rent from both sides instead of doing what’s best for the website owners.
There is a problem with AI crawlers, but these technologies are essentially doing a search, fetching several pages, scanning/summarizing them, then presenting the findings to the user.
I don’t really think that’s wrong, it’s just a faster version of rummaging through the SEO shit you do when you Google something.
(I’ve never used Perplexity; I do use Kagi’s Ki assistant for similar searches. It runs 3 searches, scans the top results, and then provides citations.)
What’s best for the website owners is to have people actually visit and interact with their website. Blocking AI tools is consistent with that.
For a lot of AI search I actually end up reading the pages, so I don’t know how much this stops that
You're the outlier, I promise. People are literally forfeiting their brains in favor of an LLM transplant these days.
On the flip side, most websites are so ad-ridden these days a reader mode or other summary tool is almost required for normal browsing. Not saying that AI is the right move, but I can understand not wanting to visit the actual page any more.
I put uBlock Origin, or another ad blocker, on all my browsers, including the phone ones and forks.
Firefox with uBlock Origin works perfectly fine and pages load faster without the ads!
Maybe I missed something, but uBlock still works fine for me, even on mobile. And running a Pi-hole, while not trivial, also takes care of some ad traffic. Firefox comes with a reader mode (a feature I really like even with the ad blockers!).
So why do people not want to visit pages anymore, if all these tools already existed?
Most people aren’t technical enough to install an ad blocker, believe it or not.
Search engines have been doing this relatively fine for decades now. But the crawlers from AI companies basically DDoS hosts in comparison, sending so many requests in such a short interval. They also crawl dynamic links that are expensive to render compared to a static page, ignore robots.txt entirely, or even use it to discover unlinked pages.
Servers have finite resources, especially self-hosted sites, while AI companies have disproportionately more at their disposal, easily grinding other systems to a halt by overwhelming them with requests.
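To make the contrast concrete, here's a rough Python sketch of what a well-behaved crawler does (a hypothetical illustration, not any particular company's code): it checks robots.txt before fetching and rate-limits itself per host, which is exactly what the aggressive AI crawlers reportedly skip.

```python
# Minimal sketch of a "polite" crawler (hypothetical example):
# honor robots.txt and rate-limit requests per host.
import time
import urllib.robotparser
from urllib.parse import urlparse

import requests  # assumed available; any HTTP client works

USER_AGENT = "ExampleBot/1.0"
CRAWL_DELAY = 2.0  # seconds between requests to the same host

_robots_cache = {}
_last_fetch = {}

def allowed(url: str) -> bool:
    """Check the host's robots.txt before fetching."""
    host = urlparse(url).netloc
    if host not in _robots_cache:
        rp = urllib.robotparser.RobotFileParser()
        rp.set_url(f"https://{host}/robots.txt")
        try:
            rp.read()
        except OSError:
            pass  # unreachable robots.txt; real code should be conservative here
        _robots_cache[host] = rp
    return _robots_cache[host].can_fetch(USER_AGENT, url)

def polite_get(url: str):
    """Fetch a URL only if robots.txt allows it, pausing between hits to one host."""
    if not allowed(url):
        return None
    host = urlparse(url).netloc
    wait = CRAWL_DELAY - (time.monotonic() - _last_fetch.get(host, 0.0))
    if wait > 0:
        time.sleep(wait)
    _last_fetch[host] = time.monotonic()
    return requests.get(url, headers={"User-Agent": USER_AGENT}, timeout=10)
```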
That explains why Cloudflare keeps asking whether you're a bot or not, making you do that captcha.
If a neighborhood is beset by roving bands of thieves, sooner or later strangers will be greeted by a shotgun rather than an invitation to tea, regardless of their intentions. Them's the breaks. Bots are going to take a hit now and their operators are just going to have to deal with it. Sucks when people don't play nice, but this is what you get.
I’m sure people that are attempting to drive to their house in a new vehicle wouldn’t appreciate being riddled with bullets because the neighborhood watch makes no attempt to distinguish between thieves and homeowners.
So sad for them. Try not living in a war zone?
It isn’t a war zone, it’s a gated community where the guards have suddenly decided that any vehicle made after 2020 is full of thieves.
They didn’t bother to consult the residents or give them the ability to opt out of having their dinner guests murdered for driving a vehicle the security guards don’t like.
So you're a cloudflare customer and you wish they would let the perplexity traffic multiplier through to your website? You can leave cloudflare any time you want.
🙄You’re an Internet user and you don’t like AI so you can leave the Internet anytime you want.
That’s not a good argument, what about the users who want to block mass scraping but want to make their content available to users who are using these tools? Cloudflare exists because it allows legitimate traffic, that websites want, and blocks mass scraping which the sites don’t want.
If they’re not able to distinguish mass scraping traffic from user created traffic then they’re blocking legitimate users that some website owners want.
Yes your "leave the internet any time you want" strawman is not a good argument.
If allowing perplexity while blocking the bad guys is so easy why not find a service that does that for you?
The topic is that Cloudflare is classifying human sourced traffic as bot sourced traffic.
Saying “Just don’t use it” is a straw man. It doesn’t change the fact that Cloudflare, one of the largest CDNs representing a significant portion of the websites and services in the US, is misclassifying traffic.
I used mine intentionally while knowing it was a straw man, did you?
The same with “if it’s so easy, just don’t use it” hopefully for obvious reasons.
This affects both the customers of Cloudflare (the web service owners) as well as the users of the web services. A single site/user opting out doesn’t change the fact that a large portion of the Internet is classifying human sourced traffic as bot sourced traffic.
LOL "human sourced traffic" oh the tragedy. I for one am rooting for perplexity to go out of business forever.
Yeah, I know.
You’re engaging in motivated reasoning. That’s why you’re saying irrational things, because you’re working backwards from a conclusion (AI bad).
I don't see how categorically blocking non-human traffic is irrational given the current environment of AI scanning. And what's rational about demanding Cloudflare distinguish between the 'good guy' AI and 'bad guy' AI without proposing any methodology for doing so?
It is blocking human traffic, that’s the entire premise of the article.
Attempting to say that this is non-human traffic makes no sense if you understand how a browser works. When you load a website your browser, acting as an agent, does a lot of tasks for you and generates a bunch of web requests across multiple hosts.
Your browser downloads the HTML from the website, parses the contents of that file for image, script and CSS links, retrieves them from the various websites which host them, and interprets the JavaScript, making further web requests based on that. Often the scripting has the user's browser constantly sending requests to a website in order to update the content (like using web-based email).
All of this is automated and done on your behalf. But you wouldn’t classify this traffic as non-human because a person told the browser to do that task and the task resulted in a flurry of web requests and processing on behalf of the user.
Summarization is just another task, which is requested by a human.
The primary difference, and why it is incorrectly classified, is because the summarization tools use a stripped down browser. It doesn’t need JavaScript to be rendered or CSS to change the background color so it doesn’t waste resources on rendering that stuff.
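For what it's worth, here's roughly what that stripped-down fetch looks like in practice, as a Python sketch (my own illustration, not Perplexity's or Kagi's actual code): one GET for the HTML, the text pulled out, and nothing else downloaded.

```python
# Rough sketch of a "stripped-down browser" fetch for summarization
# (an illustration, not any real product's code). It grabs one HTML page,
# ignores CSS/JS/images entirely, and returns plain text for a model.
from html.parser import HTMLParser

import requests  # assumed available

class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> contents."""
    def __init__(self):
        super().__init__()
        self._skip = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.chunks.append(data.strip())

def fetch_page_text(url: str) -> str:
    # One GET for the HTML -- no follow-up requests for stylesheets,
    # scripts, fonts, images, or ad/tracking beacons.
    html = requests.get(url, timeout=10).text
    extractor = TextExtractor()
    extractor.feed(html)
    return " ".join(extractor.chunks)
```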
Cloudflare detects this kind of environment, one that doesn’t fully render a page, and assumes that it is a web scraper. This used to be a good way to detect scraping because the average user didn’t use web automation tools and scrapers did.
Regular users do use automation tools now, so detecting automation doesn’t guarantee that the agent is a scraper bot.
The point of the article is that this heuristic doesn't work anymore: users now use automation tools in a manner that doesn't generate tens of millions of requests per second or overwhelm servers, so it shouldn't classify them the same way.
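To be clear about the kind of heuristic I mean, something like this naive check (my guess at the general idea, not Cloudflare's actual detection logic):

```python
# Naive version of the "doesn't fully render the page" heuristic
# (a guess at the general idea, not Cloudflare's real logic): a client
# that keeps requesting HTML but never fetches any page assets looks
# like a scraper -- except stripped-down summarization agents behave
# exactly the same way, which is the false-positive problem.
def looks_like_scraper(requests_by_client: list[dict]) -> bool:
    html_hits = sum(1 for r in requests_by_client
                    if r.get("content_type", "").startswith("text/html"))
    asset_hits = sum(1 for r in requests_by_client
                     if r.get("content_type", "").startswith(
                         ("text/css", "application/javascript", "image/")))
    # Plenty of HTML, zero assets: the page was never actually rendered.
    return html_hits >= 3 and asset_hits == 0
```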
The point of Cloudflare’s bot blocking is to prevent a single user from overwhelming a site’s resources. These tools don’t do that. Go use any search summarization tool and see for yourself, it usually grabs one page from each source. That kind of traffic uses less resources than a human user (because it only grabs static content).
So how would Cloudflare tell the difference between the good 'stripped down' queries and the bad? Still not hearing how that is supposed to work. If there's no way to tell the difference, the baby will be thrown out with the bathwater, and I can't blame them.
A large portion of this kind of traffic comes from identifiable sources, like Perplexity’s data centers, so Cloudflare could whitelist known safe sources. This seems to be what they’re doing now, a user replied to one of my comments saying their Cloudflare control panel now has the option of allowing AI queries from Perplexity.
Another way is to allow users to apply for session keys provided they obey rate limits, and whitelist users with valid session keys. Non-compliant accounts could be banned, maybe with identity verification required to prevent ban evasion.
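As a sketch of that second idea (entirely hypothetical, not something Cloudflare offers today as far as I know): the agent presents a registered session key and the edge enforces a simple per-key token-bucket rate limit, with keys that keep blowing through it getting banned.

```python
# Hypothetical sketch of "session keys + rate limits" at the edge
# (not an existing Cloudflare feature): an agent presents a registered
# key, and a token bucket limits how fast that key can make requests.
import time

RATE = 1.0      # allowed requests per second per key
BURST = 5.0     # short bursts tolerated

registered_keys = {"perplexity-example-key"}   # hypothetical key issued after sign-up / identity check
buckets = {}                                   # key -> (tokens, last_refill_time)

def allow_request(session_key: str) -> bool:
    if session_key not in registered_keys:
        return False                           # unknown key: fall back to normal bot blocking
    tokens, last = buckets.get(session_key, (BURST, time.monotonic()))
    now = time.monotonic()
    tokens = min(BURST, tokens + (now - last) * RATE)
    if tokens < 1.0:
        buckets[session_key] = (tokens, now)
        return False                           # over the limit: throttle, or ban repeat offenders
    buckets[session_key] = (tokens - 1.0, now)
    return True
```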