Someone made a dataset of one million Bluesky posts for 'machine learning research' (www.404media.co)

submitted 1 month ago by Emperor@feddit.uk to c/bluesky@lemmy.ml


A machine learning librarian at Hugging Face just released a dataset composed of one million Bluesky posts, complete with when they were posted and who posted them, intended for machine learning research.

...

The data isn’t anonymous. In the dataset, each post is listed alongside the user’s decentralized identifier, or DID; van Strien also made a search tool for finding users based on their DID and published it on Hugging Face. A quick skim through the first few hundred of the million posts shows people doing normal types of Bluesky posting (arguing about politics, talking about concerts, saying stuff like “The cat is gay” and “When’s the last time yall had Boston baked beans?”), but the dataset has also swept up a lot of adult content.

It’s also noteworthy that it’s a “snapshot” of time on Bluesky, meaning it could, and probably does, include since-deleted posts.

This dataset could be used for “training and testing language models on social media content, analyzing social media posting patterns, studying conversation structures and reply networks, research on social media content moderation, [and] natural language processing tasks using social media data,” the project page says. “Out of scope use” includes “building automated posting systems for Bluesky, creating fake or impersonated content, extracting personal information about users, [and] any purpose that violates Bluesky's Terms of Service.”

The dataset is already popular: as of writing, it’s one of the top trending Hugging Face projects.
