Archivify

Published January 7, 02020, Page Last Modified January 28, 02020

Milan Griffes (a), Alexey Guzey, and I seem to like the (a) notation for archiving links on our websites.

Here’s a short script, archivify.py, I wrote that takes an html page, archives each link with the Internet Archive, and then automatically adds the ‘(a)’ notation afterward!

Essentially, it converts 'I like tacos' to 'I like tacos (a)'.

Installations

Before you run it, first install Beautiful Soup (a) and archivenow (a):

pip install beautifulsoup4
pip install archivenow


Usage

python archivify.py input.html


Error Handling

While the script works just fine, the Internet Archive sometimes throws errors when trying to archive a link (403 Forbidden, 502 Server Error: Bad Gateway for url). In these cases, the script doesn’t add an ‘(a)’.

archivify.py

import sys
from bs4 import BeautifulSoup
from archivenow import archivenow

filename = sys.argv[1]
f_in = open(filename, "r")
text = f_in.read()
soup = BeautifulSoup(text.decode('UTF-8'), 'html.parser')

for a in soup.find_all('a'):
    link = a.get('href')

    # Save link to Internet Archive
    archived_link = archivenow.push(link, "ia")[0]
    print('archived_link')

    # If there was no error archiving
    if not archived_link.startswith('Error'):

        # Add ' (a)' with archived link after linked text
        archived_tag = soup.new_tag("a", href=archived_link)
        archived_tag.append("a")
        a.insert_after(")")
        a.insert_after(archived_tag)
        a.insert_after(" (")

# Write contents to new file
filename_out = filename.split('.')[0] + '-archived.' + filename.split('.')[1]
f_out = open(filename_out, "w")
f_out.write(soup.encode('UTF-8'))

Citation

Zuckerman, Andrew, "Archivify", January 7, 02020, http://andzuck.com/projects/archivify/

Subscribe

Comments powered by Talkyard.