this post was submitted on 20 Mar 2025
362 points (99.7% liked)

Open Source

top 29 comments
[–] HiddenLayer555@lemmy.ml 17 points 19 hours ago* (last edited 18 hours ago)

LLM scraping is a parasite on the internet, in the actual ecological sense of the word: it places a burden on other unwitting ~~organisms~~ computer systems, making it harder for the host to survive or carry out its own necessary processes, solely for the parasite's own benefit, while giving nothing to the host in return.

I know there's an ongoing debate (both in the courts and on social media) about whether AI companies should have to pay royalties for their training data under copyright law, but I think they should at the very least be paying for the infrastructure they use while collecting that data, even freely available data. Being scraped costs the organisation hosting the data real money and resources, and it's orders of magnitude more than serving that data to individual people.

The case can certainly be made that copying is not theft, but copying is by no means free either, especially when done at the scales LLMs do.

[–] carrylex@lemmy.world 7 points 20 hours ago (2 children)

While AI crawlers are a problem, I'm also kind of astonished that so many projects don't use tools like rate limiters or IP blocklists. These are pretty simple to set up, add little to no load, and don't cause collateral damage for legitimate users who just happen to use a different browser.
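
Roughly the kind of thing I mean, as a minimal per-IP token-bucket sketch (the rate and burst numbers are made up, not taken from any real deployment):

```python
import time
from collections import defaultdict

# Illustrative numbers, not from any real deployment.
RATE = 5    # tokens refilled per second, per client IP
BURST = 20  # maximum bucket size

_buckets = defaultdict(lambda: {"tokens": BURST, "last": time.monotonic()})

def allow_request(client_ip: str) -> bool:
    """Return True if this IP still has budget, False if it should get a 429."""
    bucket = _buckets[client_ip]
    now = time.monotonic()
    # Refill the bucket in proportion to the time since the last request.
    bucket["tokens"] = min(BURST, bucket["tokens"] + (now - bucket["last"]) * RATE)
    bucket["last"] = now
    if bucket["tokens"] >= 1:
        bucket["tokens"] -= 1
        return True
    return False
```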

[–] bountygiver@lemmy.ml 7 points 17 hours ago

The article posted yesterday mentioned that a lot of these requests are made only once per IP address; the botnet is absolutely huge.

[–] MTK@lemmy.world 3 points 19 hours ago (2 children)

IP-based blocking is complicated once you are big enough or when access to your service is critical for users.

For example, if you are providing a critical service such as health care, you cannot block a user from accessing health care info unless you have hard proof that they are causing an issue and you have done your best to avoid blocking them.

Let's say you have a household of 5 people with 20 devices on the LAN. One device could be infected and running a bot; you do not want to block all 5 people and 20 devices.

Another example: double NAT, where you could have literally hundreds or even thousands of people behind one IP.

[–] litchralee@sh.itjust.works 4 points 18 hours ago* (last edited 18 hours ago)

> Let's say you have a household of 5 people with 20 devices on the LAN. One device could be infected and running a bot; you do not want to block all 5 people and 20 devices.

Why not, though? If a home network is misbehaving, whoever is maintaining that network needs to: 1) be aware that there's something wrong, and 2) fix it on their end. Most homes don't have a Network Operations Center to contact, but throwing an error code in a web browser is often effective since someone in the household will notice. Unlike institutional users, home devices are not totally SOL when blocked, as they can be moved to cellular networks or other WiFi networks.

At the root of the problem, NAT deprives the users behind it of agency: they're all in the same barrel, and the maxim about bad apples will apply. You're right that it gets even worse for CGNAT, but that's more a reason to refuse all types of NAT and prefer end-to-end IPv6.

[–] carrylex@lemmy.world -2 points 13 hours ago

> IP-based blocking is complicated once you are big enough

It's literally as simple as importing an ipset into iptables and refreshing it from time to time. There are even predefined tools for that.
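
A sketch of the "refresh it from time to time" part, shelling out to the standard ipset CLI (the set name and blocklist file are placeholders for whatever feed you use):

```python
import subprocess

SET_NAME = "ai_blocklist"         # hypothetical set name
BLOCKLIST_FILE = "blocklist.txt"  # one IP or CIDR per line, from whatever feed you trust

def refresh_blocklist() -> None:
    # Create the set if it doesn't exist yet (hash:net holds CIDR ranges).
    subprocess.run(["ipset", "create", SET_NAME, "hash:net", "-exist"], check=True)
    subprocess.run(["ipset", "flush", SET_NAME], check=True)
    with open(BLOCKLIST_FILE) as f:
        for line in f:
            entry = line.strip()
            if entry and not entry.startswith("#"):
                subprocess.run(["ipset", "add", SET_NAME, entry, "-exist"], check=True)

# The matching iptables rule only has to be added once:
#   iptables -I INPUT -m set --match-set ai_blocklist src -j DROP
```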

[–] naught@sh.itjust.works 15 points 1 day ago

I have a small site that mirrors Hacker News but with dark mode and stuff, and it is getting blasted by bot traffic. All the data is freely available from the official API, but they're scraping my piddling site, which runs on an anemic VPS, because it looks like user-generated content. My provider's bot protection does little to help. Gonna have to rethink my whole architecture now. Very annoying.

[–] Matriks404@lemmy.world 5 points 22 hours ago

They'd better not attack too much, because all of the internet is built on FOSS infrastructure, and it might stop working, lol.

[–] some_guy@lemmy.sdf.org 79 points 1 day ago (1 children)

> In a blogpost called "AI crawlers need to be more respectful", they claim that blocking all AI crawlers immediately decreased their traffic by 75%, going from 800GB/day to 200GB/day. This saves the project around $1500 a month.

"AI" companies are a plague on humanity. From now on, I'm mentally designating them as terrorists.

[–] Gonzako@lemmy.world 3 points 23 hours ago

that's the pure definition of a parasite

[–] LiveLM@lemmy.zip 88 points 1 day ago* (last edited 1 day ago) (1 children)

If you're wondering if it's really that bad, have this quote:

> GNOME sysadmin Bart Piotrowski kindly shared some numbers to let people fully understand the scope of the problem. According to him, in around two and a half hours they received 81k total requests, and out of those only 3% passed Anubis's proof of work, hinting at 97% of the traffic being bots.

And this is just one quote. The article is full of quotes from people all over reporting that they can't focus on their work because either the infra they rely on is constantly down or they're the ones fighting to keep it functional.

This shit is unsustainable. Fuck all of these AI companies.

[–] gon@lemm.ee 34 points 1 day ago (1 children)

Great write-up by Niccolò.

I actually agree with the commenter on that post: the lack of proper quoting (using images instead) is pretty bad, especially for screen readers (which I use), and not directly linking sources (though they are made clear regardless) is a bit of a pain.

[–] WorkingLemmy@lemmy.world 16 points 1 day ago

Definitely agree. Love TheLibre as it covers subjects I don't see hit on as often, but the lack of actual source links and proper quotes blows.

[–] comfy@lemmy.ml 18 points 1 day ago* (last edited 1 day ago)

One of my sites was close to being DoS'd by openAI's crawler along with a couple of other crawlers. Blocking them made the site much faster.

I'll admit the software design of offering search suggestions as HTML links didn't exactly help (this is FOSS software used by hundreds of sites, and the issue likely applies to similar sites), but their rapid rate of requests turned this from pointless queries into a negligent security threat.

[–] beeng@discuss.tchncs.de 14 points 1 day ago (2 children)

You'd think these centralised LLM search providers, e.g. Perplexity or Claude, would be caching a lot of this stuff.

[–] droplet6585@lemmy.ml 39 points 1 day ago (1 children)

There are two prongs to this:

  1. Caching is an optimization strategy used by legitimate software engineers. AI dorks are anything but.

  2. Crippling information sources outside the service means information is more easily "found" inside the service.

So if it was ever a bug, it's now a feature.

[–] jacksilver@lemmy.world 16 points 1 day ago

Third prong: constantly looking for new information. Yeah, most of these sites may be basically static, but it's probably cheaper and easier to just constantly recrawl things.

[–] fuckwit_mcbumcrumble@lemmy.dbzer0.com 9 points 1 day ago (1 children)

They're absolutely not crawling it every time they need to access the data. That would be an incredible waste of processing power on their end as well.

In the case of code, though, that does change somewhat often. At the bare minimum they'd still need to check whether the code has been updated.
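
A well-behaved crawler could do that check with a standard HTTP conditional request instead of re-downloading everything; a rough sketch (URL handling and storage omitted):

```python
import requests

def fetch_if_changed(url: str, etag: str | None):
    """Return (etag, body); body is None when the server says nothing changed."""
    headers = {"If-None-Match": etag} if etag else {}
    resp = requests.get(url, headers=headers, timeout=10)
    if resp.status_code == 304:  # Not Modified: the cached copy is still good
        return etag, None
    return resp.headers.get("ETag"), resp.text
```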

[–] beeng@discuss.tchncs.de 2 points 23 hours ago

Hashes for cached content. Anyone know what sort of DB makes sense here?
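
Something as simple as SQLite would probably do; a rough sketch of what I'm picturing (the table layout is made up):

```python
import hashlib
import sqlite3

db = sqlite3.connect("crawl_cache.db")
db.execute("CREATE TABLE IF NOT EXISTS pages (url TEXT PRIMARY KEY, sha256 TEXT)")

def content_changed(url: str, body: bytes) -> bool:
    """Store the page's hash and report whether it differs from the cached one."""
    digest = hashlib.sha256(body).hexdigest()
    row = db.execute("SELECT sha256 FROM pages WHERE url = ?", (url,)).fetchone()
    db.execute(
        "INSERT INTO pages (url, sha256) VALUES (?, ?) "
        "ON CONFLICT(url) DO UPDATE SET sha256 = excluded.sha256",
        (url, digest),
    )
    db.commit()
    return row is None or row[0] != digest
```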

[–] jagged_circle@feddit.nl 3 points 1 day ago* (last edited 1 day ago) (1 children)

Sad there's no mention of running an Onion Service. That has built-in PoW for DoS protection, so you don't have to be an asshole and block all of Brazil or China or Edge users.

Just use Tor, silly sysadmins

[–] Max_P@lemmy.max-p.me 12 points 1 day ago (2 children)

Proof of work is what those modern captchas tend to use, I believe. Not useful for stopping account creation and such, but very effective at stopping crawlers.

Have the same problem at work, and Cloudflare does jack shit about it. Half that traffic uses user agents that have no chance of even supporting TLS 1.3: I see some IE5, IE6, Opera with the old Presto engine, I've even seen Netscape. Complete and utter bullshit. At this point, if you're not on an allowlist of known common user agents or logged in, you get a PoW captcha.
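
The idea behind the PoW captcha, stripped down to a hashcash-style sketch (real deployments like Anubis are far more elaborate; the difficulty number is illustrative):

```python
import hashlib
import secrets

DIFFICULTY = 20  # leading zero bits required; illustrative, tune per deployment

def make_challenge() -> str:
    """Server side: hand the visitor a random challenge."""
    return secrets.token_hex(16)

def verify(challenge: str, nonce: int) -> bool:
    """Server side: checking a solution costs a single hash."""
    digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
    return int.from_bytes(digest, "big") >> (256 - DIFFICULTY) == 0

def solve(challenge: str) -> int:
    """Client side: cheap for one visitor, expensive at crawler scale."""
    nonce = 0
    while not verify(challenge, nonce):
        nonce += 1
    return nonce
```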

[–] lightnegative@lemmy.world 1 points 22 hours ago

If I were a bot author intent on causing misery, I'd just use the user agent from the latest version of Firefox/Chrome/Edge that legitimate users would send.

It's just a string controlled by the client at the end of the day, and I'm surprised the GPT and OpenAI bots announce themselves in it at all. Attaching meaning to it on the server side is always going to be problematic if the client controls the value.
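
To illustrate how little effort that takes (the URL and UA string below are just placeholders):

```python
import requests

# Any scraper can claim to be any browser; the server can't verify the string.
headers = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:128.0) Gecko/20100101 Firefox/128.0"}
requests.get("https://example.org/", headers=headers, timeout=10)
```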

[–] jagged_circle@feddit.nl 2 points 1 day ago

Yeah, but Tor's doesn't require JavaScript, so you don't have to block at-risk users and oppress them further.