this post was submitted on 19 Sep 2025

Self Hosted - Self-hosting your services.


Context: my father is a lawyer and therefore has a bajillion PDF files that were digitised and stored on a server. I have an idea of how to run OCR on all of them.

But after that, how can I make them easily searchable? (Keep in mind that, unfortunately, the directory structure is important information for classifying the files, i.e. you may have a path like clientABC/caseAV1/d.pdf.)

solrize@lemmy.ml, 1 week ago (last edited 1 week ago):

What's a bajillion? If the OCR output is less than a few GB, which is a heck of a lot of text (like a million pages), just grepping the files is not too bad: maybe a second or two per search. Beyond that you need search software. Solr (solr.apache.org) is what I'm used to, but there are tons of options.
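
For the "just grep it" case, here is a rough sketch of what that could look like in Python. It assumes the OCR step writes a plain `.txt` file next to each PDF (that layout is my assumption, not something stated in the thread), and `grep_ocr_text` is a made-up helper name. It reports each hit with its path relative to the root, so the client/case folders stay visible in the results.

```python
import re
from pathlib import Path


def grep_ocr_text(root, pattern, suffix=".txt"):
    """Yield (relative_path, line_number, line) for each matching line under root.

    Assumes OCR output is stored as UTF-8 text files ending in `suffix`
    (a hypothetical layout; adjust to whatever the OCR tool produces).
    """
    regex = re.compile(pattern, re.IGNORECASE)
    root = Path(root)
    # rglob walks the whole tree, so nested client/case folders are covered.
    for path in sorted(root.rglob(f"*{suffix}")):
        if not path.is_file():
            continue
        # errors="replace" keeps the scan going if OCR produced odd bytes.
        with path.open(encoding="utf-8", errors="replace") as handle:
            for line_number, line in enumerate(handle, start=1):
                if regex.search(line):
                    relative = path.relative_to(root).as_posix()
                    yield relative, line_number, line.rstrip("\n")
```

Calling `list(grep_ocr_text("/srv/scans", "appeal"))` (the root path is only an example) would give tuples such as a relative path like `clientABC/caseAV1/d.txt`, a line number, and the matching line. This is a linear scan on every query, so it only makes sense while the text stays in the few-GB range mentioned above. Past that, an indexer such as Solr is the better fit.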