this post was submitted on 20 Mar 2025
496 points (99.6% liked)

Technology

top 50 comments
[–] Fijxu@programming.dev 24 points 6 hours ago (1 children)

AI scraping is so cancerous. I host a public RedLib instance (redlib.nadeko.net), and due to BingBot and the Amazon bots my instance was always rate limited because the number of requests they make is insane. What makes me even angrier is that these fucking fuckers use free, privacy-respecting services to access Reddit and scrape it. THEY CAN'T BE SO GREEDY. Hopefully, blocking their user agents works fine ;)
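For illustration, a rough Python sketch of that kind of user-agent check (a minimal WSGI middleware; the bot names are only examples, and in practice this would usually live in the reverse-proxy config in front of the instance rather than in application code):

    # Minimal sketch: refuse requests whose User-Agent matches known crawler tokens.
    # The token list is illustrative, not a complete or authoritative blocklist.
    BLOCKED_UA_TOKENS = ("bingbot", "amazonbot", "gptbot", "ccbot", "bytespider")

    class BlockCrawlers:
        def __init__(self, app):
            self.app = app  # the wrapped WSGI application

        def __call__(self, environ, start_response):
            ua = environ.get("HTTP_USER_AGENT", "").lower()
            if any(token in ua for token in BLOCKED_UA_TOKENS):
                start_response("403 Forbidden", [("Content-Type", "text/plain")])
                return [b"Crawlers are not welcome here.\n"]
            return self.app(environ, start_response)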

[–] green@feddit.nl 3 points 5 hours ago

Thanks for hosting your instances. I use them often and they're really well maintained.

[–] grysbok@lemmy.sdf.org 19 points 6 hours ago

It's also a huge problem for library/archive/museum websites. We try so hard to make data available to everyone, then some rude bots come along and bring the site down. Adding more resources just uses more resources--the bots expand to fill the container.

[–] grue@lemmy.world 25 points 16 hours ago (4 children)

ELI5 why the AI companies can't just clone the git repos and do all the slicing and dicing (running git blame etc.) locally instead of running expensive queries on the projects' servers?
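For reference, the local approach described here is not much code. A rough sketch in Python (the repository URL and file name are placeholders, not any specific project):

    # Sketch of the "clone once, analyze locally" approach instead of hitting
    # the forge's web UI for every blame/log query. URL and path are placeholders.
    import subprocess

    REPO_URL = "https://example.com/some-project.git"  # placeholder
    CLONE_DIR = "some-project"

    # One clone (followed by occasional `git fetch`) replaces thousands of page loads.
    subprocess.run(["git", "clone", "--quiet", REPO_URL, CLONE_DIR], check=True)

    # The expensive history queries then run against the local copy.
    blame = subprocess.run(
        ["git", "-C", CLONE_DIR, "blame", "--line-porcelain", "README.md"],
        capture_output=True, text=True, check=True,
    )
    history = subprocess.run(
        ["git", "-C", CLONE_DIR, "log", "--pretty=format:%H %an %s"],
        capture_output=True, text=True, check=True,
    )
    print(blame.stdout[:200])
    print(history.stdout[:200])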

[–] Retropunk64@lemmy.world 6 points 4 hours ago

They're stealing people's data to begin with; they don't give a fuck at all.

[–] green@feddit.nl 6 points 5 hours ago (1 children)

Too many people overestimate the actual capabilities of these companies.

I really do not like saying this because it lacks a lot of nuance, but 90% of programmers are not skilled in their profession. This is not to say they are stupid (though they likely are, see cat-v/harmful), but they do not care about efficiency or gracefulness, as long as the job gets done.

You assume they are using source control (which is unironically unlikely), you assume they know they can run a server locally (which I pray they do), and you assume their deadlines allow them to think about actual solutions to problems (which they probably don't).

Yes, they get paid a lot of money. But that does not say much about skill in an age of apathy and lawlessness.

[–] turmacar@lemmy.world 3 points 3 hours ago

Also, everyone's solution to a problem is stupid if they're only given 5 minutes to work on it.

Combine that with it being "free" for them to query the website, while it's expensive to have enough local storage to replicate, even temporarily, all the stuff they want to scrape, and it's kind of a no-brainer to 'just not do that'. The only thing stopping them is morals / whether they want to keep paying rent.

[–] Realitaetsverlust@lemmy.zip 16 points 7 hours ago

Because that would cost them money; just "abusing" someone else's infrastructure is much cheaper.

[–] zovits@lemmy.world 16 points 14 hours ago (1 children)

It takes more effort and results in a static snapshot, without being able to track the evolution of the project. (Disclaimer: I don't work with AI, but I'd bet this is the reason. I don't intend to defend those scraping twatwaffles in any way, just to offer a possible explanation.)

[–] Sturgist@lemmy.ca 11 points 8 hours ago

Also, having your victim bear the costs is an added benefit.

[–] daq@lemmy.sdf.org -2 points 6 hours ago (1 children)

I'm not sure how they actually implemented it, but you can easily block ML crawlers via Cloudflare. Isn't just about every small site/service behind CF anyway?

[–] grysbok@lemmy.sdf.org 5 points 5 hours ago (1 children)

Last I checked, Cloudflare requires the user to have JavaScript and cookies enabled. My institution doesn't want to require those because it would likely impact legitimate users as well as bots.

[–] daq@lemmy.sdf.org 1 points 5 hours ago (1 children)

Huh? I can reach my site via curl that has neither. How did you come up with this random set of requirements?

[–] grysbok@lemmy.sdf.org 0 points 3 hours ago (1 children)

Odd. I just tried

curl https://www.scrapingcourse.com/cloudflare-challenge

and got

Enable JavaScript and cookies to continue

I'm clearly not on the same setup as you are, but my off-the-cuff guess is that your curl command was issued from a system that Cloudflare already recognized (IP whitelist, cookies, I dunno).

Anyways, I'm reading through this blog post on using cURL with Cloudflare-protected sites and I'm finding it interesting.

[–] daq@lemmy.sdf.org 1 points 1 hour ago

Of course their challenge requires those things. How else could they implement it? Most users will never be presented with a challenge though and it is trivial to disable if you don't want to ever challenge anyone. I was just saying CF blocks ML crawlers.

[–] melpomenesclevage@lemmy.dbzer0.com 28 points 21 hours ago* (last edited 19 hours ago) (1 children)

I hear there's a tool called (I think) 'Nepenthes' that creates a loop for an LLM crawler. If you use that in combination with a fairly tight blacklist of IPs you're certain are LLM crawlers, I bet you could do a lot of damage, and maybe make them slow their shit down or do this in a more reasonable way.

[–] PrivacyDingus@lemmy.world 6 points 12 hours ago (1 children)

Nepenthes

It's a Markov-chain-based text generator, which could be difficult for people to deploy on their repos depending on how they're hosting them. Regardless, any sensibly built crawler will have rate limits. So although Nepenthes is an interesting thought exercise, it's only going to do anything to crawlers knocked together by people who haven't thought about it, not the big companies with the real resources who are likely having the biggest impact.

It might hit a few times, or maybe there's a version that can puff up the data in terms of space and salt it in terms of utility.
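For a concrete picture of what that kind of tarpit does, here's a toy Python version of the Markov-chain idea (an illustration of the technique only, not Nepenthes' actual implementation):

    # Toy Markov-chain babbler: learn word transitions from seed text, then
    # emit endless plausible-looking filler for a crawler to chew on.
    import random
    from collections import defaultdict

    def build_chain(text):
        chain = defaultdict(list)
        words = text.split()
        for current, nxt in zip(words, words[1:]):
            chain[current].append(nxt)
        return chain

    def babble(chain, length=50):
        word = random.choice(list(chain))
        out = [word]
        for _ in range(length - 1):
            followers = chain.get(word)
            word = random.choice(followers) if followers else random.choice(list(chain))
            out.append(word)
        return " ".join(out)

    seed = "the quick brown fox jumps over the lazy dog and the dog barks at the quick fox"
    print(babble(build_chain(seed), 30))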

[–] db0@lemmy.dbzer0.com 61 points 1 day ago

Yep, it hit many Lemmy servers as well, including mine. I had to block multiple Alibaba subnets to get things back to normal, but I'm expecting the next spam wave.

[–] Buelldozer@lemmy.today 46 points 1 day ago (4 children)

I too read Drew DeVault's article the other day and I'm still wondering how the hell these companies have access to "tens of thousands" of unique IP addresses. Seriously, how the hell do they have access to so many IP addresses that SysAdmins are resorting to banning entire countries to make it stop?

[–] festus@lemmy.ca 7 points 8 hours ago (1 children)

There are residential IP providers that sell services to scrapers and the like, which involves them having thousands of IPs available from the same IP ranges as real users. They route traffic through these IPs via malware, hacked routers, "free" VPN clients, etc. If you block the IP range for one of these addresses, you'll also block real users.

[–] Buelldozer@lemmy.today 3 points 6 hours ago (1 children)

There are residential IP providers that sell services to scrapers and the like, which involves them having thousands of IPs available from the same IP ranges as real users.

Now that makes sense. I hadn't considered rogue ISPs.

[–] festus@lemmy.ca 2 points 6 hours ago

It's not even necessarily the ISPs that are doing it. In many cases they don't like this, because their users start getting blocked on websites; it's bad actors piggybacking on legitimate users' connections without those users' knowledge.

[–] werefreeatlast@lemmy.world 10 points 20 hours ago (1 children)

If you get something like 156.67.234.6, then .7, then .56, etc., just block 156.67.234.0/24.
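A quick sanity check of that block using just the Python standard library (the addresses are the examples from the comment; the /24 covers every 156.67.234.x host):

    # Verify that the example addresses fall inside the proposed /24.
    import ipaddress

    offenders = ["156.67.234.6", "156.67.234.7", "156.67.234.56"]
    block = ipaddress.ip_network("156.67.234.0/24")

    for addr in offenders:
        print(addr, ipaddress.ip_address(addr) in block)  # all True

    # The wider 156.67.0.0/16 would also match, but blocks 65,536 addresses
    # instead of 256.
    print(block.num_addresses)  # 256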

[–] Buelldozer@lemmy.today 2 points 6 hours ago

Sure, network blocking like this has been a thing for decades, but it still requires ongoing manual intervention, which is what these sysadmins are complaining about.

[–] wjs018@piefed.social 110 points 1 day ago* (last edited 1 day ago) (7 children)

Really great piece. We have recently seen many popular Lemmy instances struggle under recent scraping waves, and that is hardly the first time it's happened. I have some firsthand experience with the second part of this article, which talks about AI-generated bug reports/vulnerabilities for open source projects.

I help maintain a Python library and got a bug report a couple of weeks back from a user hitting a type-checking issue, with a bit of additional information. It didn't strictly follow the bug report template we use, but it was well organized enough, so I spent some time digging into it and came up with no way to reproduce it at all. Thankfully, the lead maintainer was able to spot the report for what it was and just closed it, saving me from further efforts to diagnose the issue (after an hour or two had already been burned).

[–] Dave@lemmy.nz 31 points 1 day ago

AI scrapers are a massive issue for Lemmy instances. I'm gonna try some things from this article, because enough of them identify themselves with their user agents that I hadn't even thought about the ones lying about it.

I guess a bonus (?) is that with 1,000 Lemmy instances, the bots get the Lemmy content 1,000 times, so our input has 1,000 times the weighting of Reddit's.

[–] fjordo@feddit.uk 63 points 1 day ago (1 children)

I wish these companies would realise that acting like this is a very fast way to get scraping outlawed altogether, which is a shame because it can be genuinely useful (archival, automation, etc).

[–] jol@discuss.tchncs.de 52 points 1 day ago (17 children)

How can you outlaw something a company on another continent is doing? Especially when they are becoming better at disguising themselves as normal traffic? What will happen is that politicians will see this as another reason to push for everyone having their ID associated with their Internet traffic.

[–] klu9@lemmy.ca 51 points 1 day ago (2 children)

The Linux Mint forums have been knocked offline multiple times over the last few months, to the point where the admins had to block all Chinese and Brazilian IPs for a while.
