this post was submitted on 10 Nov 2024

How to archive a website in a future-proof way (involves PDF hybrid) (lemmy.sdf.org)

submitted 8 months ago* (last edited 8 months ago) by evenwicht@lemmy.sdf.org to c/sustainabletech@lemmy.sdf.org


MAFF (a shit-show, unsustained)

Firefox used to have an in-house format called MAFF (Mozilla Archive File Format), which boiled down to a zip file that had HTML and a tree of media. I saved several web pages that way. It worked well. Then Mozilla dropped the ball and completely abandoned their own format. WTF. Did not even give people a MAFF→mhtml conversion tool. Just abandoned people while failing to realize the meaning and purpose of archival. Now Firefox today has no replacement. No MHTML. Choices are:

HTML only
HTML complete (but not as a single file but a tree of files)

MHTML (shit-show due to non-portable browser-dependency)

Chromium-based browsers can save a complete web page to a single MHTML file. Seems like a good move, but if you open Chromium-generated MHTML files in Firefox, you just get an ASCII text dump of the contents, which resembles a fake email header followed by MIME parts and encoded data (probably base64). So that’s a show-stopper.
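The "fake email header" is the giveaway: MHTML is a MIME multipart/related message, the same kind of container email uses. A minimal sketch for checking whether a saved file is MHTML before picking a browser to open it with. The function name looks_like_mhtml and the 30-line header window are assumptions made up for illustration.

```shell
# Hypothetical helper: succeed (exit 0) if the file's leading headers
# declare a MIME multipart/related message, which is what MHTML is.
looks_like_mhtml() {
    # Only inspect the first 30 lines; the MIME headers sit at the top.
    head -n 30 "$1" | grep -i 'multipart/related' >/dev/null
}
```

Usage would be something like: looks_like_mhtml webpage.mhtml && echo "MHTML archive"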

exceptionally portable approach: A plugin adds a right-click option called “Save page WE” (available in both Firefox and Chromium). That extension produces an MHTML file that both Chromium and Firefox can open.

PDF (lossy)

Saving or printing a web page to PDF mostly guarantees that the content and representation can reasonably be reproduced well into the future. The problem is that PDF inherently forces the content to be arranged on a fixed width that matches a physical paper geometry (A4, US letter, etc). So you lose some data. You lose information about how to re-render it on different devices with different widths. You might save on A4 paper then later need to print it to US letter paper, which is a bit sloppy and messy.

PDF+MHTML hybrid

First use Firefox with the “Save page WE” plugin to produce an MHTML file. But relying on this alone is foolish considering how unstable HTML specs are even still today in 2024 with a duopoly of browser makers doing whatever the fuck they want - abusing their power. So you should also print the webpage to a PDF file. The PDF will ensure you have a reliable way to reproduce the content in the future. Then embed the MHTML file in the PDF (because PDF is a container format). Use this command:

$ pdfattach webpage.pdf webpage.mhtml webpage_with_HTML.pdf

The PDF will just work as you expect a PDF to, but you also have the option to extract the MHTML file using pdfdetach -saveall webpage_with_HTML.pdf if the need arises to re-render the content on a different device.
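The embed step can be wrapped so that a failed attach is noticed at archive time. This is only a sketch: make_hybrid is a made-up name, and it assumes the Poppler versions of pdfattach and pdfdetach are installed.

```shell
# Hypothetical wrapper around the two Poppler tools named above.
# Usage: make_hybrid page.pdf page.mhtml page_with_HTML.pdf
make_hybrid() {
    pdf=$1 mhtml=$2 out=$3
    # Refuse to clobber an existing archive.
    if [ -e "$out" ]; then
        echo "refusing to overwrite $out" >&2
        return 1
    fi
    # Embed the MHTML file as an attachment in a copy of the PDF.
    pdfattach "$pdf" "$mhtml" "$out" || return 1
    # List the attachments so a failed embed is noticed now, not years later.
    pdfdetach -list "$out"
}
```

The listing printed at the end should show the MHTML file as an embedded file.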

The downside is duplication. Every image has one copy stored in the MHTML file and another copy stored separately in the PDF next to it. So it’s shitty from a storage space standpoint. The other downside is plugin dependency. Mozilla proved browser extensions are unsustainable when they kicked some of them out of their protectionist official repository and made it painful for exiled projects to reach their users. There is also the mere fact that plugins are less likely to be maintained than a built-in browser function.

We need to evolve

What we need is a way to save the webpage as a sprawled out tree of files the way Firefox does, then a way to stuff that whole tree of files into a PDF, while also producing a PDF vector graphic that references those other embedded images. I think it’s theoretically possible but no tool exists like this. PDF has no concept of directories AFAIK, so the HTML tree would likely have to be flattened before stuffing into the PDF.

Other approaches I have overlooked? I’m not up to speed on all the ereader formats but I think they are made for variable widths. So saving a webpage to an ereader format of some kind might be more sensible than PDF, if possible.

(update) The goals

1. Capture the webpage as a static snapshot in time which requires no network to render. Must have a simple and stable format whereby future viewers are unlikely to change their treatment of the archive. PDF comes close to this.
2. Record the raw original web content in a non-lossy way. This is to enable us to re-render the content on different devices with different widths. Future-proofness of the raw content is likely impossible because we cannot stop the unstable web standards from changing. But capturing a timestamp and web browser user-agent string would facilitate installation of the original browser. A snapshot of audio, video, and the code (JavaScript) which makes the page dynamic is also needed both for forensic purposes (suitable for court) and for being able to faithfully reproduce the dynamic elements if needed. This is to faithfully capture what’s more of an application than a document. wget -m possibly satisfies this. But it is perhaps tricky to capture 3rd party JS without recursing too far on other links.
3. A raw code-free (thus partially lossy) snapshot for offline rendering is also needed if goal 1 leads to a width-constrained format. Save page WE and WebScrapBook apparently satisfy this.

PDF satisfies goal 1; wget satisfies goal 2; maff/mhtml satisfies goal 3. There is likely no single format that does all of the above, AFAIK. But I still need to explore these suggestions.
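For the timestamp and user-agent part of goal 2, a small sidecar text file stored next to the archive would probably do. A rough sketch, where the function name write_capture_info and the three field labels are invented for illustration and not any standard:

```shell
# Hypothetical helper: record when and with which browser a page was captured.
# Usage: write_capture_info URL "USER-AGENT STRING" > capture-info.txt
write_capture_info() {
    printf 'url: %s\n' "$1"
    printf 'user-agent: %s\n' "$2"
    # UTC timestamp in ISO 8601 form, e.g. 2024-11-10T12:00:00Z.
    printf 'captured: %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)"
}
```

The user-agent string has to be copied from the browser that rendered the page; the helper only writes down what it is given.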


[–] mp3@lemmy.ca 6 points 8 months ago (1 children)

I often don't really care about the actual webpage format, compared to the actual content, and my strategy has been to convert and archive webpages to Markdown. At least I get to keep most of the text formatting, tables, etc in an easily readable format, the only annoyance is that images have to be stored in a file tree.

[–] evenwicht@lemmy.sdf.org 2 points 8 months ago* (last edited 8 months ago) (1 children)

The other thing is, what about JavaScript? JS changes the presentation.

Markdown is probably ideal when saving an article, like a news story. It might even be quite useful to get it into a Gemini-compatible language. But what if you are saving the receipt for a purchase? A tax auditor would suspect shenanigans. So the idea with archival is generally to closely (faithfully) preserve the doc.

[–] mp3@lemmy.ca 4 points 8 months ago* (last edited 8 months ago)

Yeah in that case it would be better to preserve as close as possible the original.

In my case, most of the stuff I archive are articles, tutorials, documentation and stuff that doesn't change often so markdown fits that bill relatively well, and can be read in plain-text quite easily which is great for future-proofing readability.

[+] smpl@discuss.tchncs.de 4 points 8 months ago (1 children)

[deleted]

[–] evenwicht@lemmy.sdf.org 3 points 8 months ago* (last edited 8 months ago) (1 children)

IIUC you are referring to this extension, which is Firefox-only (~~like~~unlike the save page WE, which has a Chromium version).

Indeed the beauty of ZIP is stability. But the contents are not. HTML changes so rapidly, I bet if I unzip an old MAFF file it would not have stood the test of time well. That’s why I like the PDF wrapper. Nonetheless, this WebScrapBook could stand in place of the MHTML from the save page WE extension. In fact, save page WE usually fails to save all objects for some reason. So WebScrapBook is probably more complete.

(edit) Apparently webscrapbook gives a choice between htz and maff. I like that it timestamps the content, which is a good idea for archived docs.

(edit2) Do you know what happens with JavaScript? I think JS can be quite disruptive to archival. If webscrapbook saves the JS, it’s saving an app, in effect, and that language changes. The JS also may depend on being able to access the web, which makes a shitshow of archival because obviously you must be online and all the same external URLs must still be reachable. OTOH, saving the JS is probably desirable if doing the hybrid PDF save because the PDF version would always contain the static result, not the JS. Yet the JS could still be useful to have a copy of.

(edit3) I installed webscrapbook but it had no effect. Right-clicking does not give any new functions.

[–] moonpiedumplings@programming.dev 2 points 8 months ago* (last edited 8 months ago)

Maybe: https://archivebox.io

Or: https://www.httrack.com

I'm pretty sure httrack just saves relevant embedded images along with the site.

[–] CanadaPlus@lemmy.sdf.org 2 points 8 months ago (1 children)

Is there a reason wget -m is bad?

[–] evenwicht@lemmy.sdf.org 4 points 8 months ago (1 children)

It’s perhaps the best way for someone that has a good handle on it. Docs say it “sets infinite recursion depth and keeps FTP directory listings. It is currently equivalent to -r -N -l inf --no-remove-listing.” So you would need to tune it so that it’s not grabbing objects that are irrelevant to the view, and probably exclude some file types like videos and audio. If you get a well-tuned command worked out, that would be quite useful. But I do see a couple shortcomings nonetheless:

If you’re on a page that required you to log in and do some interactive things to get there, then I think passing the cookie from the GUI browser to wget would be non-trivial.
If you’re on a capped internet connection, you might want to save from the browser’s cache rather than refetch everything.

But those issues aside I like the fact that wget does not rely on a plugin.
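One possible starting point for a tuned command is to skip -m entirely and fetch a single page plus whatever it needs to render. This is a sketch and not a tested recipe: fetch_page is a made-up name, the rejected suffixes are only an example, and how well it works will depend on the site.

```shell
# Hypothetical wrapper: fetch one page plus what it needs to render,
# instead of mirroring the whole site with -m.
# Usage: fetch_page URL DIR
fetch_page() {
    # -p  page requisites (images, CSS, scripts referenced by the page)
    # -k  rewrite links so the saved copy works offline
    # -E  add .html/.css extensions where the server omitted them
    # -H  allow requisites hosted on other domains (CDNs, 3rd-party JS)
    # -R  skip files by suffix; this list is only an example
    # -P  directory to save everything under
    wget -p -k -E -H -R 'mp4,webm,mp3,ogg' -P "$2" "$1"
}
```

Since there is no -r here, -H should only pull in third-party requisites of that one page and not wander off across other sites.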

[–] CanadaPlus@lemmy.sdf.org 2 points 8 months ago* (last edited 8 months ago) (1 children)

I find that the things most likely to disappear (like a tinkerer's web 1.0 homepage) tend to have limited recursion depth anyway.

A Tumblr blog takes an awfully long time to crawl politely, IIRC, but the end result wasn't too big on disk. Now I'm wondering how you would pass a cookie to wget, and how you might set a data cap so you can stop and wait for the month to be up before you call it again. I kind of feel like I've done a cookie before to get around a captcha or something...

Edit: There's a couple of ideas for limiting size on StackOverflow. The wget specific one is -Q for quota, which you'd want to set conservatively in case there's one huge file somewhere, since it only checks between individual downloads.
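A sketch of what a capped, polite crawl could look like. The name capped_mirror is invented, and the one-second wait is an arbitrary choice.

```shell
# Hypothetical wrapper: polite mirror that stops starting new downloads
# once roughly QUOTA has been fetched (e.g. 200m). wget checks the quota
# between files, so one huge file can still overshoot it.
# Usage: capped_mirror URL QUOTA
capped_mirror() {
    # --wait/--random-wait space out requests; -Q sets the download quota.
    wget -m --wait=1 --random-wait -Q "$2" "$1"
}
```

Because -m implies -N (time-stamping), calling it again next month should skip files that have not changed on the server, though it still has to ask the server about each one.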

Looks like there's a --load-cookies option that will read a browser export of cookies from a file, as well as load POST data and save cookie options if you want to do something interactive that way.

Edit edit: What I'm remembering is actually adding headers, like this.

[–] evenwicht@lemmy.sdf.org 1 points 8 months ago* (last edited 8 months ago) (1 children)

wget has a --load-cookies file option. It wants the original Netscape cookie file format. Depending on your GUI browser you may have to convert it. I recall in one case I had to parse the session ID out of a cookie file then build the expected format around it. I don’t recall the circumstances.
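For the case where you only have a session ID, building the file by hand could look roughly like this. The helper name netscape_cookie is made up, and the fixed choices (path /, HTTPS only, no subdomains) are assumptions you would adjust to match the real cookie.

```shell
# Hypothetical helper: print a one-cookie file in the Netscape format.
# Fields are TAB-separated: domain, include-subdomains flag, path,
# HTTPS-only flag, expiry (Unix seconds), cookie name, cookie value.
# Usage: netscape_cookie DOMAIN NAME VALUE EXPIRY > cookies.txt
netscape_cookie() {
    printf '# Netscape HTTP Cookie File\n'
    printf '%s\tFALSE\t/\tTRUE\t%s\t%s\t%s\n' "$1" "$4" "$2" "$3"
}
```

The resulting file is then passed along as wget --load-cookies cookies.txt followed by the URL.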

Another problem: some anti-bot mechanisms crudely look at user-agent headers and block curl attempts on that basis alone.

(edit) when cookies are not an issue, wkhtmltopdf is a good way to get a PDF of a webpage. So you could have a script do a wget to get the HTML faithfully, and wkhtmltopdf to get a PDF, then pdfattach to put the HTML inside the PDF.
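Roughly what such a script could look like. The function name archive_url and the output naming are made up, and this version only grabs the top-level HTML, not images or CSS.

```shell
# Hypothetical end-to-end sketch of the three steps described above.
# Usage: archive_url URL BASENAME   (produces BASENAME_with_HTML.pdf)
archive_url() {
    url=$1 base=$2
    # 1. Raw HTML as the server sent it (no page requisites here).
    wget -O "$base.html" "$url" || return 1
    # 2. Fixed-layout rendering for long-term viewing.
    wkhtmltopdf "$url" "$base.pdf" || return 1
    # 3. Embed the HTML inside the PDF.
    pdfattach "$base.pdf" "$base.html" "${base}_with_HTML.pdf"
}
```

Note that the page gets fetched twice, once by each tool, so dynamic content may differ slightly between the HTML and the PDF.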

(edit2) It’s worth noting there is a project called curl-impersonate which makes curl look more like a GUI browser to get more equal treatment. As far as I can tell it does this by mimicking a browser’s TLS and HTTP handshake, not by adding a JavaScript engine.

[–] CanadaPlus@lemmy.sdf.org 2 points 8 months ago

Ah, looks like you beat my edit by a few seconds.

Good to know about the Netscape thing. It looks like Firefox (still, being a successor to NS) does it that way, and Chrome can do it that way. If you're using a true third option you probably don't need my help.

For the sake of completeness, on Tor Browser you have to copy the SQLite database from the browser directory, since it's too locked down to just export the normal way. Then I'd try just subbing it in on an offline Firefox instance and proceeding the normal way. And obviously, use wget over torsocks as well.