this post was submitted on 25 Aug 2024

322 points (97.9% liked)

A new web crawler launched by Meta last month is quietly scraping the web for AI training data (fortune.com)

submitted 1 year ago by lemmee_in@lemm.ee to c/fuck_ai@lemmy.world

Meta has quietly unleashed a new web crawler to scour the internet and collect data en masse to feed its AI model.

The crawler, named the Meta External Agent, was launched last month, according to three firms that track web scrapers and bots across the web. The automated bot essentially copies, or “scrapes,” all the data that is publicly displayed on websites, for example the text in news articles or the conversations in online discussion groups.

A representative of Dark Visitors, which offers a tool for website owners to automatically block all known scraper bots, said Meta External Agent is analogous to OpenAI’s GPTBot, which scrapes the web for AI training data. Two other entities involved in tracking web scrapers confirmed the bot’s existence and its use for gathering AI training data.

While close to 25% of the world’s most popular websites now block GPTBot, only 2% are blocking Meta’s new bot, data from Dark Visitors shows.

Earlier this year, Mark Zuckerberg, Meta’s cofounder and longtime CEO, boasted on an earnings call that his company’s social platforms had amassed a data set for AI training that was even “greater than the Common Crawl,” an entity that has scraped roughly 3 billion web pages each month since 2011.

[–] aniki 8 points 1 year ago (1 children)

Crawling the web has fuck all to do with the function of the internet. Most crawlers range from useless at best to downright disrespectful.

[–] GarrulousBrevity@lemmy.world 1 points 1 year ago* (last edited 1 year ago) (1 children)

Have you used a search engine? Crawlers are not generative AI.

[–] aniki 7 points 1 year ago* (last edited 1 year ago) (1 children)

The internet is not a search engine, and no, search engines are not generative AI. That's new.

Do you have any idea how many content bot crawlers there are? Most of the corporate sites I host at work are serving content to bots more than half the time.

Do you know AltaVista still has bots??

When was the last time you used that search engine?

[–] GarrulousBrevity@lemmy.world -1 points 1 year ago (2 children)

I guess I don't really see the problem with that though. There are configuration levers you could be pulling that the sites you're hosting apparently are not. There are lots of shady questions about how these models are getting training data, but crawlers have a well-defined opt-out mechanism.

The web would not be what we know it as without them, because crawlers are how you find sites. Why shouldn't AltaVista have one? I don't object to what AltaVista does with the data.

[–] aniki 7 points 1 year ago (1 children)

Mate, we have an absurdly restrictive robots.txt, including a custom WordPress plugin that automatically generates the file, and the bots don't give a fuck.

[–] GarrulousBrevity@lemmy.world -1 points 1 year ago

But Meta's will, and so will AltaVista's. I'm not angry at them when a script kiddie makes a bad crawler.

[+] ZDL@ttrpg.network 2 points 1 year ago* (last edited 4 months ago) (1 children)

[removed by mod]

[–] GarrulousBrevity@lemmy.world 1 points 1 year ago (1 children)

I know what you're trying to say, but that phrasing though. Being able to opt out is an important part of consent. No means no, man.

[+] ZDL@ttrpg.network 1 points 1 year ago* (last edited 4 months ago) (1 children)

[removed by mod]

[–] GarrulousBrevity@lemmy.world 1 points 1 year ago (1 children)

I think of this as a problem with opt-in-only systems. Think of how sites ask you to opt in to tracking cookies every goddamn time a page loads. A rule-based system like robots.txt, one that lets you both opt in and opt out, so you could refuse cookie requests up front and tell all sites to fuck off, would be great. @aniki@lemm.ee is complaining about malicious crawlers that ignore those rules (assuming they're right and that the rules are set up correctly), and lumping that malware in with software made by established corporations. However, ignoring configurations like robots.txt hasn't historically been the problem with Meta and other big tech companies. They have had an issue with using the data they scrape in ways that differ from what they claimed, or with scraping a site that does not allow scraping by reaching it through a URL on a page they legitimately scraped. But that's not the kind of shenanigans this article is about, as Meta is being pretty upfront about what they're doing with the data. At least after they announced it existed.

An opt-in-only solution would just lead to a world where all hosts were being constantly bombarded with requests to opt in. My major takeaway from how Meta handled this is that you should configure any site you own to disallow any action from bots you don't recognize. As much as Reddit can fuck off, I don't disagree with their move to change their configuration to:

User-agent: *
Disallow: /
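
If you want to sanity-check what a rule set like that means for a crawler that actually honors robots.txt, here is a minimal sketch using Python's standard-library urllib.robotparser. The allowlist variant, the example.com URL, and the "SomeUnknownBot" name are my own illustrative assumptions, not anything Reddit or Meta published. It only models well-behaved bots; a crawler that ignores robots.txt is unaffected.

```python
from urllib.robotparser import RobotFileParser

# The blanket rule quoted above: every crawler is denied every path.
BLOCK_ALL = """\
User-agent: *
Disallow: /
"""

# Hypothetical allowlist variant: one recognized bot is let in,
# everything else falls through to the blanket deny.
ALLOWLIST = """\
User-agent: Googlebot
Allow: /

User-agent: *
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if a compliant crawler with this user agent may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "https://example.com/some/page"  # placeholder URL
    for name, rules in (("block-all", BLOCK_ALL), ("allowlist", ALLOWLIST)):
        for agent in ("Googlebot", "SomeUnknownBot"):
            print(name, agent, is_allowed(rules, agent, url))
```

The allowlist form is closer to "disallow any action from bots you don't recognize": you name the crawlers you want and deny the rest by default.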