PGSub - A Giant Archive of Subtitles For Everyone (gitlab.com)

submitted 1 year ago by liliumstar@lemmy.dbzer0.com to c/datahoarder@lemmy.ml

I've been working on this subtitle archive project for some time. It is a Postgres database along with a CLI and API application allowing you to easily extract the subs you want. It is primarily intended for encoders or people with large libraries, but anyone can use it!

PGSub is composed of three dumps:

opensubtitles.org.Actually.Open.Edition.2022.07.25
Subscene V2 (prior to shutdown)
Gnome's Hut of Subs (as of 2024-04)

As such, it is a good resource for films and series up to around 2022.

Some stats (copied from README):

Out of 9,503,730 files originally obtained from dumps, 9,500,355 (99.96%) were inserted into the database.
Out of the 9,500,355 inserted, 8,389,369 (88.31%) are matched with a film or series.
There are 154,737 unique films or series represented, though note the lines get a bit hazy when considering TV movies, specials, and so forth. 133,780 are films, 20,957 are series.
93 languages are represented, with a special '00' language indicating a .mks file with multiple languages present.
55% of matched items have an FPS value present.
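As a quick sanity check on the figures above, here is a minimal sketch that recomputes the percentages and the film/series total from the raw counts. The counts are copied from the stats; the constant and function names are made up for illustration and are not part of PGSub.

```python
# Raw counts copied from the stats above; all names are illustrative only.
FILES_IN_DUMPS = 9_503_730
FILES_INSERTED = 9_500_355
FILES_MATCHED = 8_389_369
FILMS = 133_780
SERIES = 20_957


def pct(part: int, whole: int) -> float:
    """Return part/whole as a percentage rounded to two decimal places."""
    return round(100 * part / whole, 2)


inserted_pct = pct(FILES_INSERTED, FILES_IN_DUMPS)  # share of dump files inserted
matched_pct = pct(FILES_MATCHED, FILES_INSERTED)    # share of inserted files matched
unique_titles = FILMS + SERIES                      # films plus series

print(f"inserted: {inserted_pct}%")
print(f"matched:  {matched_pct}%")
print(f"unique films or series: {unique_titles:,}")
```

The results agree with the quoted 99.96%, 88.31%, and 154,737.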

Once imported, the recommended way to access it is via the CLI application. The CLI and API can be compiled on Windows and Linux (and maybe Mac), and there are also pre-built binaries available.

The database dump is distributed via torrent, which you can find in the repo (if it doesn't work for you, let me know). It is ~243 GiB compressed, and uses a little under 300 GiB of table space once imported.
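If you want to check free space before downloading, something like the sketch below may help. It rests on assumptions not stated in the post: that the compressed dump stays on disk while the import runs, and that "same volume" means the dump and the Postgres data directory share one filesystem. The function names and the example path are hypothetical; only the two sizes come from the post, and both are approximate.

```python
import shutil

GIB = 1024 ** 3
COMPRESSED_DUMP_GIB = 243  # "~243 GiB compressed" per the post
TABLE_SPACE_GIB = 300      # "a little under 300 GiB" per the post, rounded up


def required_gib(same_volume: bool) -> int:
    """Rough peak space for the database volume.

    Assumption: the compressed dump is kept on disk during the import,
    so it counts toward the total when it shares the volume.
    """
    if same_volume:
        return COMPRESSED_DUMP_GIB + TABLE_SPACE_GIB
    return TABLE_SPACE_GIB


def has_room(path: str, needed_gib: float) -> bool:
    """True if the filesystem containing `path` has at least needed_gib free."""
    return shutil.disk_usage(path).free >= needed_gib * GIB


if __name__ == "__main__":
    target = "."  # hypothetical: replace with your download or data directory
    need = required_gib(same_volume=True)
    print(f"need about {need} GiB, enough free space: {has_room(target, need)}")
```

This ignores any temporary space the import itself might use, so treat the result as a lower bound.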

For a limited time I will devote some resources to bug-fixing the applications, or perhaps adding some small QoL improvements. But, of course, you can always fork them or make your own if they don't suit you.

[–] liliumstar@lemmy.dbzer0.com 16 points 1 year ago (1 children)

As mentioned in the post, from three sources. The two site dumps were publicly available as torrents. The third was distributed privately.

[–] ksynwa@lemmy.ml 4 points 1 year ago

My bad. I read the readme but not the post.