Archivify

Published January 7, 2020, Page Last Modified July 21, 2023

[Disclaimer: I don’t archive information as much as I used to, though of course, you are free to archive away!].

Milan Griffes (a), Alexey Guzey, and I seem to like the (a) notation for archiving links on our websites.

Here’s a short script, archivify.py, I wrote that takes an html page, archives each link with the Internet Archive, and then automatically adds the ‘(a)’ notation afterward!

Essentially, it converts 'I like tacos' to 'I like tacos (a)'.

Installations

Before you run it, first install Beautiful Soup (a) and archivenow (a):

pip install beautifulsoup4
pip install archivenow

Usage

python archivify.py input.html

Error Handling

While the script works just fine, the Internet Archive sometimes throws errors when trying to archive a link (403 Forbidden, 502 Server Error: Bad Gateway for url). In these cases, the script doesn’t add an ‘(a)’.

archivify.py

import sys
from bs4 import BeautifulSoup
from archivenow import archivenow

filename = sys.argv[1]
f_in = open(filename, "r")
text = f_in.read()
soup = BeautifulSoup(text, 'html.parser')

for a in soup.find_all('a'):
	link = a.get('href')

	# Save link to Internet Archive
	archived_link = archivenow.push(link, "ia")[0]
	print('archived_link')

	# If there was no error archiving
	if not (archived_link.startswith('Error') or 'No HTTP' in archived_link):

		# Add ' (a)' with archived link after linked text
		archived_tag = soup.new_tag("a", href=archived_link)
		archived_tag.append("a")
		a.insert_after(")")
		a.insert_after(archived_tag)
		a.insert_after(" (")

# Write contents to new file
filename_out = filename.split('.')[0] + '-archived.' + filename.split('.')[1]
f_out = open(filename_out, "w")
f_out.write(str(soup))

Citation

Zuckerman, Andrew, "Archivify", January 7, 2020, http://andzuck.com/projects/archivify/

Subscribe

Comments powered by Talkyard.