this post was submitted on 07 Jul 2025

145 points (98.7% liked)

The Open-Source Software Saving the Internet From AI Bot Scrapers (www.404media.co)

submitted 4 weeks ago by sabreW4K3@lazysoci.al to c/technology@beehaw.org

[–] theangriestbird@beehaw.org 43 points 4 weeks ago (1 children)

This snip at the end is so good:

Iaso said she thinks AI companies follow her work, and that if they really want to stop her and Anubis they just need to distract her.

“If you are working at an AI company, here's how you can sabotage Anubis development as easily and quickly as possible,” she wrote on her site. “So first is quit your job, second is work for Square Enix, and third is make absolute banger stuff for Final Fantasy XIV. That’s how you can sabotage this the best.”

[–] Geodad@beehaw.org 9 points 4 weeks ago

I'd be fine with this... 🤣

[–] who@feddit.org 16 points 4 weeks ago* (last edited 4 weeks ago) (1 children)

She told me she’s [...] also thinking about a version that doesn’t require JavaScript, which some privacy-minded disable in their browsers.

As someone who is keenly aware of the privacy and security problems that come with allowing web scripts, I hope she prioritizes this soon. It's really disappointing to find sites that were formerly readable without JavaScript suddenly inaccessible since adopting Anubis. The more sites that do this, the more people are pushed toward enabling scripts by default, exposing them to a great many trackers and web exploits that would otherwise be blocked.

[–] exu@feditown.com 2 points 4 weeks ago (1 children)

There's an option using some very new HTML tag, but it's not the default.

https://anubis.techaro.lol/docs/admin/configuration/challenges/metarefresh

[–] who@feddit.org 1 points 4 weeks ago (1 children)

Interesting. Judging by that option's name, it seems to refer to use of the HTML <meta> tag to refresh a page.

https://developer.mozilla.org/en-US/docs/Web/HTML/Reference/Elements/meta/http-equiv

Neither this tag nor using it for refresh is new at all. I don't think I've seen it used to detect bots, though. I wonder what Anubis is doing here.

[–] JohnEdwa@sopuli.xyz 2 points 4 weeks ago

It's simply checking if the connection is from an actual browser, as a scraper pretending to be one won't actually refresh the page as instructed. It's going to buy some time, but like the rest of Anubis in general, it will only work until the scrapers get modified to work around it.
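To make the idea concrete, here is a minimal sketch of a meta-refresh check in Python. It is not Anubis's actual implementation; the function names, the `challenge` query parameter, the signing scheme, and the timing limits are all assumptions made for illustration. The server hands out a page whose `<meta http-equiv="refresh">` tag points at a URL carrying a signed timestamp, and only accepts the follow-up request if the token is valid and arrives after the requested delay.

```python
import hashlib
import hmac
import secrets

# Hypothetical settings, not taken from Anubis.
SECRET = secrets.token_bytes(32)  # per-process signing key
MIN_DELAY = 1.0   # seconds the client is told to wait before refreshing
MAX_AGE = 30.0    # seconds after which a token is considered stale


def _sign(issued_at: float) -> str:
    """Return an HMAC over the issue time so clients cannot forge tokens."""
    msg = f"{issued_at:.3f}".encode()
    return hmac.new(SECRET, msg, hashlib.sha256).hexdigest()


def challenge_page(path: str, now: float) -> str:
    """Build an HTML page that asks the browser to reload with a signed token.

    Simplification: assumes `path` has no query string of its own.
    """
    token = f"{now:.3f}.{_sign(now)}"
    return (
        "<html><head>"
        f'<meta http-equiv="refresh" content="{int(MIN_DELAY)}; '
        f'url={path}?challenge={token}">'
        "</head><body>Checking your browser...</body></html>"
    )


def verify(token: str, now: float) -> bool:
    """Accept a token only if it is authentic and came back in the time window."""
    try:
        issued_raw, sig = token.rsplit(".", 1)
        issued_at = float(issued_raw)
    except ValueError:
        return False  # malformed token
    if not hmac.compare_digest(sig.encode(), _sign(issued_at).encode()):
        return False  # signature does not match
    elapsed = now - issued_at
    return MIN_DELAY <= elapsed <= MAX_AGE
```

A client that never parses the HTML never sees the token, and one that replays it instantly fails the delay check. As the comment above notes, a scraper that does parse the tag and waits would pass, so this raises the cost of scraping rather than preventing it.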

[–] FundMECFSResearch@lemmy.blahaj.zone 13 points 4 weeks ago (3 children)

Anubis always flags me for some reason. I use Mullvad and Safari (iOS) with some ad- and tracker-blocking extensions.

[–] Photuris@lemmy.ml 6 points 4 weeks ago (2 children)

More sites in general are blocking Mullvad traffic lately (in my experience), and I’m not sure what, if anything, can be done about it.

[–] FundMECFSResearch@lemmy.blahaj.zone 6 points 4 weeks ago (1 children)

I expect better from a popular FOSS tool being used by privacy-aware people, though.

[–] SweetCitrusBuzz@beehaw.org 2 points 4 weeks ago

Can you open an issue, or see if one is open already for this?

[–] Powderhorn@beehaw.org 3 points 4 weeks ago

Agreed. Luckily, they don't seem to have the full list of Mullvad IPs, so if I really want to read something, I just try another tunnel.

[–] simple@piefed.social 6 points 4 weeks ago (1 children)

Do you have JavaScript or cookies disabled? That might stop you from getting past.

[–] FundMECFSResearch@lemmy.blahaj.zone 3 points 4 weeks ago

nope

[–] Appoxo@lemmy.dbzer0.com 4 points 4 weeks ago

I wonder why traffic from known VPN companies is under more scrutiny than traffic from domestic households...

[–] leaky_shower_thought@feddit.nl 11 points 4 weeks ago (1 children)

I like this one better than Cloudflare's Turnstile.

Cloudflare blocks me all the time for the smallest reasons, and I can't seem to find their nag email.

[–] fuckwit_mcbumcrumble@lemmy.dbzer0.com 2 points 4 weeks ago (1 children)

I have no issues with Cloudflare, but Anubis always takes its sweet-ass time to verify me. Like 30+ seconds just sitting there, but then eventually I get in.

[–] Vanilla_PuddinFudge@infosec.pub 1 points 4 weeks ago* (last edited 4 weeks ago)

Windows XP support ended over a decade ago, in case you were wondering whether the Pentium 4 build you're using is still viable.

[–] remington@beehaw.org 2 points 4 weeks ago (2 children)

Would you edit your post and add the following archive link to the body, please?

https://archive.is/VcoE1

[–] who@feddit.org 7 points 4 weeks ago* (last edited 4 weeks ago) (1 children)

Unfortunately, archive.is seems to have moved behind a big corporate CAPTCHA service, subjecting readers to having their reading habits (both the articles and the referring communities) tracked at a large scale.

I suggest this archive link instead:

https://web.archive.org/web/20250707135819/https://www.404media.co/the-open-source-software-saving-the-internet-from-ai-bot-scrapers/

[–] remington@beehaw.org 1 points 4 weeks ago (1 children)

Unfortunately, archive.is has moved behind Cloudflare, subjecting readers to having their reading habits (both the articles and the referring communities) tracked at a large scale.

How do you know this?

What about https://ghostarchive.org/?

[–] who@feddit.org 6 points 4 weeks ago* (last edited 4 weeks ago) (1 children)

Sorry; I shouldn't have written Cloudflare specifically. Their CAPTCHA page now contains scripts from Google, not Cloudflare. I have corrected my comment.

How do you know this?

Because a couple months ago, archive.is/archive.today started showing me CAPTCHA pages instead of the archived articles when I use Firefox with scripts disabled. The current page contains scripts hosted by Google, which I won't enable, so I can't read the archived articles.

What about https://ghostarchive.org/?

I haven't used that site enough to have a consistent picture of what it's doing. When I tried it a few minutes ago, it directed me to a CAPTCHA wall when trying to submit an article, but not when searching for an archived article. I'll try to remember to look at it again periodically, to be able to answer this question in the future.

[–] remington@beehaw.org 3 points 4 weeks ago

Thanks. I appreciate the info and effort.

[–] sabreW4K3@lazysoci.al 5 points 4 weeks ago (1 children)

To be honest with you, I refuse on moral grounds. 404 are independent and do good work. You've already linked a paywall bypass in the comments; if anyone would like to find it, it's not hard to scroll.

[–] remington@beehaw.org 4 points 4 weeks ago

OK. Fair enough.