Hello fedi

I would like to find a way I can have a cold copy of articles that I view, something like a self-hosted archive.is … but it does it automatically for every site I visit.

Please help me find out how or where!

@fluffy

1) Yes, you can do this up to a point. However, manual capture works better than automatic.

2) One approach is Pale Moon plus ScrapBook X. This approach is simple, but it only works for some sites.

3) For near-perfect captures, learn to use WARC toolsets. I've tested openwayback and pywb and suggest starting with those two. This isn't plug and play, but I've gotten it to work pretty well; there's a rough scripting sketch after the links below.

4) Links:

github.com/webrecorder/pywb
github.com/iipc/openwayback/wi
loc.gov/preservation/digital/f
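
5) In case scripting helps: here's a minimal sketch in Python that fetches a page and stores it as a WARC. It uses the warcio library, which comes from the same webrecorder project as pywb but isn't linked above, so treat the library choice, the URL, and the file name as my assumptions rather than the only way to do it.

    # save_page.py -- sketch: fetch one page and store it in a WARC file
    # Assumes: pip install warcio requests
    from warcio.capture_http import capture_http
    import requests  # per warcio's docs, import requests after capture_http

    def capture(url, warc_path="captures.warc.gz"):
        # Everything fetched inside this block is written to the WARC file.
        with capture_http(warc_path):
            requests.get(url)

    if __name__ == "__main__":
        capture("https://example.com/")

The resulting WARC can then be added to a pywb collection (wb-manager init, wb-manager add) and replayed through the wayback server.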

@oldcoder Thanks for the detailed explanation.

It looks like I’ll have to do some building to get exactly what I want!

@fluffy

1) These tools do work, though in different ways. One of the WARC tools saves captures in multiple formats, but you'll find that only the WARC format and multimedia files work well.

2) I started to do this in 2006 using pre-Quantum Firefox and the original ScrapBook extension. It's nice to still be able to read websites that are long gone.

3) There's one more approach that's tricky but automatic, as you wished: one sets up a copy of Squid to decrypt HTTPS and cache content indefinitely. A rough sketch of the general idea is below.
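
I haven't published my Squid setup, but the intercept-and-store idea can be illustrated with a small mitmproxy addon in Python. mitmproxy isn't one of the tools mentioned above, and the archive directory here is a placeholder, so take this as a sketch of the concept rather than my actual configuration. Like the Squid approach, it only works if the browser trusts the proxy's local CA.

    # archive_addon.py -- sketch: save every HTTP(S) response that passes through
    # Run with:  mitmdump -s archive_addon.py
    import hashlib
    import os
    from mitmproxy import http

    ARCHIVE_DIR = os.path.expanduser("~/web-archive")  # placeholder location

    def response(flow: http.HTTPFlow) -> None:
        # Name each capture after a hash of its URL so revisits overwrite cleanly.
        os.makedirs(ARCHIVE_DIR, exist_ok=True)
        name = hashlib.sha256(flow.request.url.encode()).hexdigest()
        with open(os.path.join(ARCHIVE_DIR, name), "wb") as f:
            f.write(flow.request.url.encode() + b"\n")
            f.write(flow.response.content or b"")

This keeps raw page bodies rather than proper WARC records, so it trades fidelity for simplicity compared with the pywb route.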

@oldcoder My main grievance is that I occasionally want to refer back to an article I read, or to find it by searching my browser history. But if I store a link in my database, years later the page is gone, and browser history is no longer full-text searchable, let alone that far back.

Is there some reason you don’t automatically archive all of the web pages you visit?

@fluffy

1) Regarding the grievance: Yes, I started to write my own tools in 1995 to capture websites for this and other reasons.

2) Q. Why not capture everything? A. In the past, disks cost a lot more, and complete, automated captures, especially of HTTPS sites, were more difficult.

But 1 TB disks, even SSDs, are cheap now and the Squid approach is both fast and automatic. So I'll probably capture more pages automatically in the future.
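
As a rough order-of-magnitude check, and the 2 MB figure is only my guess at an average captured page: 1 TB divided by 2 MB is about 500,000 pages, which is years of ordinary browsing.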

@fluffy

I noticed Conifer but thought that it was a commercial service. However, I see that you've found a FOSS core. I'll give it a try later this Fall.

@oldcoder I look forward to reading about it. Do you have a newsletter or an RSS feed?

@fluffy

I had an RSS feed for some years and plan to start a new one in 2021. Since you've expressed interest, I'll mention it to you at the time.

This is a link to one of my technical sites, presently paused but expected to resume next year:

laclin.com/

I expect to distribute useful scripts that I've developed over the years, including site capture tools, on pages there. A more active FOSS tools site that I recommend is Fossies at:

fossies.org/

@Gamercat @fluffy

I think that openwayback per se is designed more for interaction with full web browsers, but parts of these toolsets can be run from the CLI. Lynx could be used to trigger captures by way of CLI scripts; a rough sketch follows.
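
For example, assuming a local pywb instance running in recording mode (wb-manager init my-coll, then wayback --record; the collection name and port below are placeholders), a short Python script can walk a list of URLs and pull each one through the recording endpoint with no graphical browser involved:

    # record_urls.py -- sketch: trigger pywb captures from the command line
    # Usage: python record_urls.py urls.txt
    import sys
    import requests

    RECORD_BASE = "http://localhost:8080/my-coll/record/"  # placeholder collection

    def record(url):
        # Fetching a page through the /record/ endpoint makes pywb archive it.
        r = requests.get(RECORD_BASE + url)
        print(url, r.status_code)

    if __name__ == "__main__":
        with open(sys.argv[1]) as f:
            for line in f:
                line = line.strip()
                if line:
                    record(line)

Pointing lynx -dump at the same record URLs from a shell script should work just as well.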
