Mateys! We have plundered the shores of TV shows and movies while these corporations flounder in stopping us from seeding and spreading their files without regard for the flag of copyright. We have long plundered the shores of gaming and broken the DRM that has been plaguing modern games, opening up access to games in countries where a game would cost a week or even a month of wages (I was once in this situation, so I am grateful to the pirating community for letting me enjoy the golden era of games back in 2012-2015).
But there, upon the horizon, lies a larger plunder. A kraken who guards a lair of untouched gold and emeralds, ready for the taking.
Closed-source AI models.
These corporations have stolen what was once ours, our own data, and put it in their AI models so that only they can profit off of it. These corporations raze the internet with their spiders and their bots to gather every morsel of data from us that they can feed to their shiny new toy. We might not be able to stop them from stealing our data, but we have proven ourselves adept at copying things and leaking software, and this is what we need to do. AI is already too dangerous and too powerful for a select few corporations to control.
As long as AI is within the hands of corporations, not people, the AI will serve their goals, not ours. This needs to change, so this is what I propose for our next voyage.
Unless they start offering on-prem or there are some very high-profile server hacks, I don't see that being possible. Unlike media and client software, they don't need to provide the core functionality to end users, just the output.
I agree. As for the how, it's gonna be tricky, to say the least.
You can start by using the same data sources they do. Several have admitted to using Books3.
https://huggingface.co/datasets/the_pile_books3
let me just check how much supercompute I have and ... oh, zero.
Well, let's just assume we have a can opener.