Tesseract OCR is Open Source Software. How can it be a site that they steal information from?
Good question that I don't have the answer to. I could speculate that this is all likely sourced from some sort of marketing material ShadowDragon put out, where they flatly say they're gathering this information from Tesseract, and that in reality they're gathering whatever information they can on users who search for and download the software. But like I said, I'm speculating.
If you're really interested, I would say you should email the author of this article, reach out to Tesseract's development team, or find a way to get a subpoena against ShadowDragon and/or ICE.
I hope you'll update us if you chase this down. I like 404 Media and I want to keep liking them, but only if the reporting is good. Hopefully it's a typical tech journalism mistranslation where they use Tesseract OCR to scrape PDFs and the author just misunderstood, or something like that.
Edit: after looking, I don't have any issues. It looks like just a raw list from whatever source; I don't need 404 Media to try to "curate" it or remove elements that seem irrelevant. They can leave that to us.
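For anyone unfamiliar with the tool itself: Tesseract is a local OCR engine you run against your own images or documents, not a site that holds user data. Here's a minimal sketch of how it's commonly used via the pytesseract wrapper (the file name and setup are my own assumptions, not anything from the article):

```python
# Minimal sketch: Tesseract is an OCR engine you run locally on your own
# images/documents; it doesn't host or serve anyone's data.
# Assumes the Tesseract binary is installed and on PATH, plus the
# pytesseract and Pillow packages.
from PIL import Image
import pytesseract

def extract_text(image_path: str) -> str:
    """Run Tesseract OCR on one image and return the recognized text."""
    with Image.open(image_path) as img:
        return pytesseract.image_to_string(img)

if __name__ == "__main__":
    # "scanned_page.png" is a hypothetical example file.
    print(extract_text("scanned_page.png"))
```

Which is consistent with the theory above: a scraping pipeline could use Tesseract on PDFs or screenshots, and an author might then mistake the tool for a data source.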
I hate this timeline.
Archive.is isn't working for me. Is there another site you can archive to?
Sorry to hear that, try this one
Much appreciated
They're not only pulling data from all the sites in the list; they're also pulling something else in the meantime.