Finding and Fixing Website Link Rot with Python, BeautifulSoup and Requests

July 17, 2018
Written by Samuel Huang, Contributor
Opinions expressed by Twilio contributors are their own


When hyperlinks go dead by returning 404 or 500 HTTP status codes, or by redirecting to spam websites, that is the awful phenomenon known as “link rot”. Link rot is a widespread problem; research suggests that the average link survives only about four years.

In this blog post, we will look at how link rot affects user experience, using Full Stack Python as our example. We’ll build a Python script that detects link rot in Markdown and HTML files so we can quickly find and fix our issues.

fullstackpython.com is a website created by Twilio employee Matt Makai in 2012. The site has helped many folks, including me, learn how to best use Python and the tools within its ecosystem.

The site now spans over 145,000 words and 150 pages. Its repository contains:

  • 2400+ links
  • 300+ HTML files
  • 150+ Markdown files

More links and files are expected in the future. With 2400+ links on the site, it is really difficult to spot dead links quickly. At best, users report them via issues or pull requests; at worst, they hit a dead page, don’t know what to do, and leave the site. On the maintainer’s side, checking all the URLs by hand is not a sustainable solution: assuming each link takes 10 seconds to check, going through all of them in one sitting would take at least 24,000 seconds, or about 6.7 hours.

There must be an automated solution to handle all of the link rot madness!

Python to the Rescue

Our approach will be to aggregate all the links from the site and check each URL using a Python script. Since the site content is all accessible on GitHub as a repository, we can clone the repository and run our script from the base folder.

The first step is to clone the repository:

git clone https://github.com/mattmakai/fullstackpython.com.git
cd fullstackpython.com

Please make sure that Python 3 is installed on your machine before proceeding further. At the time of this writing, the latest stable release is Python 3.7.0.

We will use the following modules from the Python standard library in this script:

  • concurrent.futures (imported as futures) for running the checks concurrently in a thread pool
  • multiprocessing (imported as mp) for determining the CPU count
  • os for walking the file tree and running shell commands
  • json for printing JSON output
  • uuid for generating random identifiers

We will also use the following third-party packages:

  • BeautifulSoup (installed as beautifulsoup4) for parsing HTML
  • markdown for converting Markdown to HTML so the same parser can handle it
  • requests, an easy-to-use HTTP client
  • urllib3, the lower-level HTTP library that requests builds on (we use it to parse hostnames)

Feel free to install the third-party packages with pip or pipenv.

If you plan to use pip, create a requirements.txt with the following content:

beautifulsoup4==4.6.0
Markdown==2.6.10
requests==2.18.4
urllib3==1.22

These packages are already listed in the repository’s requirements.txt file, so you can also copy and paste them from there. Once the file is in place, install everything with pip install -r requirements.txt (or the pipenv equivalent).

Once the third-party packages are installed on your machine, create a new file named check_urls_twilio.py. Start the file by declaring imports:

from concurrent import futures
import multiprocessing as mp
import os
import json
import uuid

from bs4 import BeautifulSoup
from markdown import markdown
import requests
import urllib3

Now we can write the algorithm to seek out link rot.

Here is how our first attempt at coding our link rot finder will operate:

  1. Identify URLs from Markdown and HTML in fullstackpython repository
  2. Write identified URLs into an input file
  3. Read the URLs one-by-one from an input file
  4. Run a GET request and check if the URL gives a bad response (4xx/5xx)
  5. Write all bad URLs to output file

Why not just run a regular expression across all the files in the repository? After cloning, we have local access to all of them. I first put together a Linux command to do exactly that:

find . -type f \
    | xargs grep -hEo 'https?:\/\/[=a-zA-Z0-9\_\/\?\&\.\-]+' \
    | sed '/Binary/d' \
    | sort \
    | uniq > urlin.txt

Here’s a high-level explanation of the command:

  1. Find all files (not directories)
  2. Pull out anything that looks like an HTTP(S) link
  3. Drop grep's "Binary file ... matches" noise (any line containing "Binary")
  4. Sort the links, remove duplicates, and write the result to urlin.txt

The regular expression finds HTTP(S) links. Reading it from left to right (a quick standalone check with Python's re module follows this list):

  1. http or https (the ? makes the trailing s optional)
  2. ://, the separator between the protocol and the rest of the URL
  3. One or more characters drawn from a class that allows letters and digits
  4. Plus the URL punctuation =, _, /, ?, &, . and -, which covers paths, query parameters, and the dots in domain names
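
Here is a quick, standalone way to sanity-check the pattern with Python's re module; the sample text below is made up purely for illustration:

import re

# Same pattern used in the grep command above.
URL_RE = r'https?:\/\/[=a-zA-Z0-9\_\/\?\&\.\-]+'

sample = 'See https://www.fullstackpython.com/blog.html and http://example.com/page?id=3 for details.'
print(re.findall(URL_RE, sample))
# ['https://www.fullstackpython.com/blog.html', 'http://example.com/page?id=3']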

But we are writing a Python script, not a Bash script! So update your check_urls_twilio.py file with the same commands coded in Python:


from concurrent import futures
import multiprocessing as mp
import os
import json
import uuid

from bs4 import BeautifulSoup
from markdown import markdown
import requests
import urllib3


# Sources of data (file)
IN_PATH = os.path.join(os.getcwd(), 'urlin.txt')
OUT_PATH = os.path.join(os.getcwd(), 'urlout.txt')

_URL_RE = 'https?:\/\/[=a-zA-Z0-9\_\/\?\&\.\-]+'  # proto://host+path+params
_FIND_URLS = "find . -type f | xargs grep -hEo '{regex}'".format(regex=_URL_RE)
_FILTER_URLS = "sed '/Binary/d' | sort | uniq > {urlin}".format(urlin=IN_PATH)
COMMAND = '{find} | {filter}'.format(find=_FIND_URLS, filter=_FILTER_URLS)
os.system(COMMAND)

All of the extracted URLs are now in IN_PATH, one per line, so we can read them back and check each one. Update the script with the following loop (the run_workers, get_url_status and bad_url helpers it relies on are implemented in the next sections):


from concurrent import futures
import multiprocessing as mp
import os
import json
import uuid

from bs4 import BeautifulSoup
from markdown import markdown
import requests
import urllib3


# Sources of data (file)
IN_PATH = os.path.join(os.getcwd(), 'urlin.txt')
OUT_PATH = os.path.join(os.getcwd(), 'urlout.txt')

_URL_RE = 'https?:\/\/[=a-zA-Z0-9\_\/\?\&\.\-]+'  # proto://host+path+params
_FIND_URLS = "find . -type f | xargs grep -hEo '{regex}'".format(regex=_URL_RE)
_FILTER_URLS = "sed '/Binary/d' | sort | uniq > {urlin}".format(urlin=IN_PATH)
COMMAND = '{find} | {filter}'.format(find=_FIND_URLS, filter=_FILTER_URLS)
os.system(COMMAND)

with open(IN_PATH, 'r') as fr:
    urls = map(lambda l: l.strip('\n'), fr.readlines())
with open(OUT_PATH, 'w') as fw:
    url_id = 1
    max_strlen = -1
    for url_path, url_status in run_workers(get_url_status, urls):
        output = 'Currently checking: id={uid} host={uhost}'.format(
            uid=url_id, uhost=urllib3.util.parse_url(url_path).host)
        if max_strlen < len(output):
            max_strlen = len(output)
        print(output.ljust(max_strlen), end='\r')
        if bad_url(url_status) is True:
            fw.write('{}: {}\n'.format(url_path, url_status))
        url_id += 1

The newly added code:

  1. Opens OUT_PATH for writing, not appending, so results from previous runs are discarded.
  2. Gets the status of each URL.
  3. Prints the progress of the run (the ID and hostname of the URL currently being checked).
  4. Checks whether the URL status is bad; if it is, writes the URL and its status as a new line to OUT_PATH (an example of the line format follows this list).
  5. Increments the ID by one each time steps 2 through 4 execute.
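
For reference, each line written to OUT_PATH has the form URL: status code; for example, with a made-up URL:

# Illustration of the line format written to OUT_PATH (the URL here is made up).
print('{}: {}'.format('https://example.com/dead-page', 404))
# prints: https://example.com/dead-page: 404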

Let's see how to implement get_url_status.


from concurrent import futures
import multiprocessing as mp
import os
import json
import uuid

from bs4 import BeautifulSoup
from markdown import markdown
import requests
import urllib3


# Sources of data (file)
IN_PATH = os.path.join(os.getcwd(), 'urlin.txt')
OUT_PATH = os.path.join(os.getcwd(), 'urlout.txt')
URL_TIMEOUT = 10.0

_URL_RE = 'https?:\/\/[=a-zA-Z0-9\_\/\?\&\.\-]+'  # proto://host+path+params
_FIND_URLS = "find . -type f | xargs grep -hEo '{regex}'".format(regex=_URL_RE)
_FILTER_URLS = "sed '/Binary/d' | sort | uniq > {urlin}".format(urlin=IN_PATH)
COMMAND = '{find} | {filter}'.format(find=_FIND_URLS, filter=_FILTER_URLS)

os.system(COMMAND)

with open(IN_PATH, 'r') as fr:
    urls = map(lambda l: l.strip('\n'), fr.readlines())
with open(OUT_PATH, 'w') as fw:
    url_id = 1
    max_strlen = -1
    for url_path, url_status in run_workers(get_url_status, urls):
        output = 'Currently checking: id={uid} host={uhost}'.format(
            uid=url_id, uhost=urllib3.util.parse_url(url_path).host)
        if max_strlen < len(output):
            max_strlen = len(output)
        print(output.ljust(max_strlen), end='\r')
        if bad_url(url_status) is True:
            fw.write('{}: {}\n'.format(url_path, url_status))
        url_id += 1


def get_url_status(url):
    for local in ('localhost', '127.0.0.1', 'app_server'):
        if url.startswith('http://' + local):
            return (url, 0)
    clean_url = url.strip('?.')
    try:
        response = requests.get(
            clean_url, verify=False, timeout=URL_TIMEOUT)
        return (clean_url, response.status_code)
    except requests.exceptions.Timeout:
        return (clean_url, 504)
    except requests.exceptions.ConnectionError:
        return (clean_url, -1)

get_url_status first checks whether the link is local-only (localhost, 127.0.0.1 or app_server), which we can't meaningfully validate, and returns a status of 0 for those. For a real external link, the cleaned URL is sent a GET request with a timeout of 10 seconds.

The function returns the cleaned URL and an appropriate status code to the caller as a tuple; a timeout is reported as 504 and a connection error as -1.
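
Because the regular expression's character class includes . and ?, a URL pulled out of surrounding prose can drag along trailing punctuation, which is what strip('?.') cleans up. A tiny illustration with a made-up URL:

# The trailing period came from the sentence the URL was found in, not the URL itself.
url = 'https://example.com/page.'
print(url.strip('?.'))  # https://example.com/page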

We also need to watch out for rate limiting, so a User-Agent header is added to change the requests library's identifier for each run of the script:


from concurrent import futures
import multiprocessing as mp
import os
import json
import uuid

from bs4 import BeautifulSoup
from markdown import markdown
import requests
import urllib3


# Sources of data (file)
IN_PATH = os.path.join(os.getcwd(), 'urlin.txt')
OUT_PATH = os.path.join(os.getcwd(), 'urlout.txt')
URL_TIMEOUT = 10.0
_URL_BOT_ID = 'Bot {id}'.format(id=str(uuid.uuid4()))
URL_HEADERS = {'User-Agent': _URL_BOT_ID}

_URL_RE = 'https?:\/\/[=a-zA-Z0-9\_\/\?\&\.\-]+'  # proto://host+path+params
_FIND_URLS = "find . -type f | xargs grep -hEo '{regex}'".format(regex=_URL_RE)
_FILTER_URLS = "sed '/Binary/d' | sort | uniq > {urlin}".format(urlin=IN_PATH)
COMMAND = '{find} | {filter}'.format(find=_FIND_URLS, filter=_FILTER_URLS)

os.system(COMMAND)

with open(IN_PATH, 'r') as fr:
    urls = map(lambda l: l.strip('\n'), fr.readlines())
with open(OUT_PATH, 'w') as fw:
    url_id = 1
    max_strlen = -1
    for url_path, url_status in run_workers(get_url_status, urls):
        output = 'Currently checking: id={uid} host={uhost}'.format(
            uid=url_id, uhost=urllib3.util.parse_url(url_path).host)
        if max_strlen < len(output):
            max_strlen = len(output)
        print(output.ljust(max_strlen), end='\r')
        if bad_url(url_status) is True:
            fw.write('{}: {}\n'.format(url_path, url_status))
        url_id += 1


def get_url_status(url):
    for local in ('localhost', '127.0.0.1', 'app_server'):
        if url.startswith('http://' + local):
            return (url, 0)
    clean_url = url.strip('?.')
    try:
        response = requests.get(
            clean_url, verify=False, timeout=URL_TIMEOUT,
            headers=URL_HEADERS)
        return (clean_url, response.status_code)
    except requests.exceptions.Timeout:
        return (clean_url, 504)
    except requests.exceptions.ConnectionError:
        return (clean_url, -1)


def bad_url(url_status):
    if url_status == -1:
        return True
    elif url_status == 401 or url_status == 403:
        return False
    elif url_status == 503:
        return False
    elif url_status >= 400:
        return True
    return False

Note that bad_url treats connection errors (-1) and most 4xx/5xx responses as rot, but deliberately lets 401, 403 and 503 pass: those usually mean the page exists behind an auth wall or is only temporarily unavailable, not that the link is dead. The random User-Agent, meanwhile, meant I was effectively a “different” user each time I ran the program, making it less likely to trip throttling mechanisms on websites like Reddit.
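
To see what that header looks like, here is a minimal snippet you can run on its own; the UUID in your output will of course differ on every run:

import uuid

# A fresh identifier is generated each time the script starts.
url_headers = {'User-Agent': 'Bot {id}'.format(id=str(uuid.uuid4()))}
print(url_headers)
# e.g. {'User-Agent': 'Bot 9b1deb4d-3b7d-4bad-9bdd-2b0d7b3dcb6d'}  (illustrative value)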

We'll use Python threads via run_workers to work around the network I/O bottleneck: the GIL (Global Interpreter Lock) is released while a thread waits on a response, so a thread pool speeds up this request-heavy workload considerably. Here is the complete first version of the script, with the helper functions defined above the top-level loop so the file runs cleanly from top to bottom:


from concurrent import futures
import multiprocessing as mp
import os
import json
import uuid

from bs4 import BeautifulSoup
from markdown import markdown
import requests
import urllib3


# Sources of data (file)
IN_PATH = os.path.join(os.getcwd(), 'urlin.txt')
OUT_PATH = os.path.join(os.getcwd(), 'urlout.txt')
URL_TIMEOUT = 10.0
_URL_BOT_ID = 'Bot {id}'.format(id=str(uuid.uuid4()))
URL_HEADERS = {'User-Agent': _URL_BOT_ID}

_URL_RE = 'https?:\/\/[=a-zA-Z0-9\_\/\?\&\.\-]+'  # proto://host+path+params
_FIND_URLS = "find . -type f | xargs grep -hEo '{regex}'".format(regex=_URL_RE)
_FILTER_URLS = "sed '/Binary/d' | sort | uniq > {urlin}".format(urlin=IN_PATH)
COMMAND = '{find} | {filter}'.format(find=_FIND_URLS, filter=_FILTER_URLS)


def get_url_status(url):
    for local in ('localhost', '127.0.0.1', 'app_server'):
        if url.startswith('http://' + local):
            return (url, 0)
    clean_url = url.strip('?.')
    try:
        response = requests.get(
            clean_url, verify=False, timeout=URL_TIMEOUT,
            headers=URL_HEADERS)
        return (clean_url, response.status_code)
    except requests.exceptions.Timeout:
        return (clean_url, 504)
    except requests.exceptions.ConnectionError:
        return (clean_url, -1)


def bad_url(url_status):
    if url_status == -1:
        return True
    elif url_status == 401 or url_status == 403:
        return False
    elif url_status == 503:
        return False
    elif url_status >= 400:
        return True
    return False


def run_workers(work, data, worker_threads=mp.cpu_count()*4):
    with futures.ThreadPoolExecutor(max_workers=worker_threads) as executor:
        future_to_result = {
            executor.submit(work, arg): arg for arg in data}
        for future in futures.as_completed(future_to_result):
            yield future.result()


os.system(COMMAND)

with open(IN_PATH, 'r') as fr:
    urls = map(lambda l: l.strip('\n'), fr.readlines())
with open(OUT_PATH, 'w') as fw:
    url_id = 1
    max_strlen = -1
    for url_path, url_status in run_workers(get_url_status, urls):
        output = 'Currently checking: id={uid} host={uhost}'.format(
            uid=url_id, uhost=urllib3.util.parse_url(url_path).host)
        if max_strlen < len(output):
            max_strlen = len(output)
        print(output.ljust(max_strlen), end='\r')
        if bad_url(url_status) is True:
            fw.write('{}: {}\n'.format(url_path, url_status))
        url_id += 1

The number of threads run_workers uses defaults to four times the number of CPUs on the machine running the script, but worker_threads can be set to whatever value makes the most sense for your connection and the sites being checked.
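
If you want to see the pattern in isolation, here is a minimal standalone sketch with a toy work function and an assumed cap of eight threads (both are illustrative choices, not part of the real script); note that results arrive in completion order, not submission order:

from concurrent import futures

def square(n):
    # Stand-in for get_url_status: any single-argument function works here.
    return n, n * n

def run_workers_demo(work, data, worker_threads=8):
    # Same shape as run_workers, but with a fixed thread cap for the demo.
    with futures.ThreadPoolExecutor(max_workers=worker_threads) as executor:
        future_to_result = {executor.submit(work, arg): arg for arg in data}
        for future in futures.as_completed(future_to_result):
            yield future.result()

for arg, result in run_workers_demo(square, range(5)):
    print(arg, result)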

Time to test our script! Assuming you named the file check_urls_twilio.py, run it by executing python check_urls_twilio.py. Verify that the script creates an output file called urlout.txt listing the bad links and their status codes.

First Approach Pros and Cons

This was a simple solution to push as a pull request. It handled roughly 80% of the cases and helped remove a lot of dead links. However, the script was not accurate enough for a couple of nasty URLs, and it even flagged false positives! See PR #159 for more context. I soon realized that URLs vary so much that detecting them in arbitrary text is not trivial without using a parser.

Improving our script

We can definitely improve this script, so it's time to use a parser! Luckily, we are only checking Markdown and HTML, which limits our parsing needs considerably. Instead of shelling out with os.system to extract the links, the markdown and bs4 packages can pull out the URLs for us. Remove the os.system call and the variables associated with it so your script contains exactly the following code (again with the helpers defined above the loop):

from concurrent import futures
import multiprocessing as mp
import os
import json
import uuid

from bs4 import BeautifulSoup
from markdown import markdown
import requests
import urllib3


# Sources of data (file)
IN_PATH = os.path.join(os.getcwd(), 'urlin.txt')
OUT_PATH = os.path.join(os.getcwd(), 'urlout.txt')
URL_TIMEOUT = 10.0
_URL_BOT_ID = 'Bot {id}'.format(id=str(uuid.uuid4()))
URL_HEADERS = {'User-Agent': _URL_BOT_ID}


def get_url_status(url):
    for local in ('localhost', '127.0.0.1', 'app_server'):
        if url.startswith('http://' + local):
            return (url, 0)
    clean_url = url.strip('?.')
    try:
        response = requests.get(
            clean_url, verify=False, timeout=URL_TIMEOUT,
            headers=URL_HEADERS)
        return (clean_url, response.status_code)
    except requests.exceptions.Timeout:
        return (clean_url, 504)
    except requests.exceptions.ConnectionError:
        return (clean_url, -1)


def bad_url(url_status):
    if url_status == -1:
        return True
    elif url_status == 401 or url_status == 403:
        return False
    elif url_status == 503:
        return False
    elif url_status >= 400:
        return True
    return False


def run_workers(work, data, worker_threads=mp.cpu_count()*4):
    with futures.ThreadPoolExecutor(max_workers=worker_threads) as executor:
        future_to_result = {
            executor.submit(work, arg): arg for arg in data}
        for future in futures.as_completed(future_to_result):
            yield future.result()


with open(IN_PATH, 'r') as fr:
    urls = map(lambda l: l.strip('\n'), fr.readlines())
with open(OUT_PATH, 'w') as fw:
    url_id = 1
    max_strlen = -1
    for url_path, url_status in run_workers(get_url_status, urls):
        output = 'Currently checking: id={uid} host={uhost}'.format(
            uid=url_id, uhost=urllib3.util.parse_url(url_path).host)
        if max_strlen < len(output):
            max_strlen = len(output)
        print(output.ljust(max_strlen), end='\r')
        if bad_url(url_status) is True:
            fw.write('{}: {}\n'.format(url_path, url_status))
        url_id += 1

The logic for processing the URLs one by one stays the same, so we will focus primarily on the link extraction logic.

Here’s how the code should work:

  1. Find all files recursively
  2. Detect whether they have Markdown or HTML extensions
  3. Extract URLs and add them to a set of unique URLs found so far
  4. Repeat steps 2 through 3 for all files identified in step 1

Let's implement the extract_urls function. Add the following code to your check_urls_twilio.py file, right after the run_workers function:

def extract_urls(discover_path):
    exclude = ['.git', '.vscode']
    all_urls = set()
    max_strlen = -1
    for root, dirs, files in os.walk(discover_path, topdown=True):
        dirs[:] = [d for d in dirs if d not in exclude]
        for file in files:
            output = f'Currently checking: file={file}'
            file_path = os.path.join(root, file)
            if max_strlen < len(output):
                max_strlen = len(output)
            print(output.ljust(max_strlen), end='\r')
            if file_path.endswith('.html'):
                with open(file_path) as f:
                    content = f.read()
                extract_urls_from_html(content, all_urls)
            elif file_path.endswith('.markdown'):
                with open(file_path) as f:
                    content = markdown(f.read())
                extract_urls_from_html(content, all_urls)
    return all_urls

The code above does the following:

  1. Walks the given directory top-down (recursively), skipping excluded directories such as .git.
  2. For each visited directory, obtains the list of files.
  3. Determines whether each file ends with an extension we care about.
  4. If a file ends in .markdown, converts it to HTML and extracts URLs from the result.
  5. If a file ends in .html, extracts URLs from the content directly (no conversion needed).

Let's also look at how the URL extraction itself is implemented. Place the following code after the extract_urls function you just wrote.

def extract_urls_from_html(content, all_urls):
    soup = BeautifulSoup(content, 'html.parser')
    for a in soup.find_all('a', href=True):
        url = a['href']
        if url.startswith('http'):
            all_urls.add(url)

We use BeautifulSoup to find all anchor elements that have a link reference (href=True); if there are none, the loop body simply never runs. For each anchor, we check whether the href starts with http, which keeps absolute external URLs and skips relative links, in-page anchors and other non-HTTP references. Matching URLs are added to all_urls, and since all_urls is a Python set, duplicates are handled automatically.
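
As a quick standalone sanity check of the same BeautifulSoup pattern (the HTML snippet below is made up for illustration):

from bs4 import BeautifulSoup

# One absolute link and one relative link; only the absolute one should be kept.
sample_html = '<p><a href="https://example.com/guide">guide</a> <a href="/about/">about</a></p>'
soup = BeautifulSoup(sample_html, 'html.parser')
external = {a['href'] for a in soup.find_all('a', href=True) if a['href'].startswith('http')}
print(external)  # prints {'https://example.com/guide'}; the relative /about/ link is skipped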

Replace the old file-based link checking code with the improved version so that our full, finished script looks like this:


from concurrent import futures
import multiprocessing as mp
import os
import json
import uuid

from bs4 import BeautifulSoup
from markdown import markdown
import requests
import urllib3


# Sources of data (file)
IN_PATH = os.path.join(os.getcwd(), 'urlin.txt')
OUT_PATH = os.path.join(os.getcwd(), 'urlout.txt')
URL_TIMEOUT = 10.0
_URL_BOT_ID = 'Bot {id}'.format(id=str(uuid.uuid4()))
URL_HEADERS = {'User-Agent': _URL_BOT_ID}


# Remove (or comment out, as shown here) the old file-based loop from your script
#with open(IN_PATH, 'r') as fr:
#    urls = map(lambda l: l.strip('\n'), fr.readlines())
#with open(OUT_PATH, 'w') as fw:
#    url_id = 1
#    max_strlen = -1
#    for url_path, url_status in run_workers(get_url_status, urls):
#        output = 'Currently checking: id={uid} host={uhost}'.format(
#            uid=url_id, uhost=urllib3.util.parse_url(url_path).host)
#        if max_strlen < len(output):
#            max_strlen = len(output)
#        print(output.ljust(max_strlen), end='\r')
#        if bad_url(url_status) is True:
#            fw.write('{}: {}\n'.format(url_path, url_status))
#        url_id += 1


def get_url_status(url):
    for local in ('localhost', '127.0.0.1', 'app_server'):
        if url.startswith('http://' + local):
            return (url, 0)
    clean_url = url.strip('?.')
    try:
        response = requests.get(
            clean_url, verify=False, timeout=URL_TIMEOUT,
            headers=URL_HEADERS)
        return (clean_url, response.status_code)
    except requests.exceptions.Timeout:
        return (clean_url, 504)
    except requests.exceptions.ConnectionError:
        return (clean_url, -1)


def bad_url(url_status):
    if url_status == -1:
        return True
    elif url_status == 401 or url_status == 403:
        return False
    elif url_status == 503:
        return False
    elif url_status >= 400:
        return True
    return False


def run_workers(work, data, worker_threads=mp.cpu_count()*4):
    with futures.ThreadPoolExecutor(max_workers=worker_threads) as executor:
        future_to_result = {
            executor.submit(work, arg): arg for arg in data}
        for future in futures.as_completed(future_to_result):
            yield future.result()


def extract_urls(discover_path):
    exclude = ['.git', '.vscode']
    all_urls = set()
    max_strlen = -1
    for root, dirs, files in os.walk(discover_path, topdown=True):
        dirs[:] = [d for d in dirs if d not in exclude]
        for file in files:
            output = f'Currently checking: file={file}'
            file_path = os.path.join(root, file)
            if max_strlen < len(output):
                max_strlen = len(output)
            print(output.ljust(max_strlen), end='\r')
            if file_path.endswith('.html'):
                with open(file_path) as f:
                    content = f.read()
                extract_urls_from_html(content, all_urls)
            elif file_path.endswith('.markdown'):
                with open(file_path) as f:
                    content = markdown(f.read())
                extract_urls_from_html(content, all_urls)
    return all_urls


def extract_urls_from_html(content, all_urls):
    soup = BeautifulSoup(content, 'html.parser')
    for a in soup.find_all('a', href=True):
        url = a['href']
        if url.startswith('http'):
            all_urls.add(url)


all_urls = extract_urls(os.getcwd())
bad_urls = {}
url_id = 1
max_strlen = -1
for url_path, url_status in run_workers(get_url_status, all_urls):
    output = f'Currently checking: id={url_id} host={urllib3.util.parse_url(url_path).host}'
    if max_strlen < len(output):
        max_strlen = len(output)
    print(output.ljust(max_strlen), end='\r')
    if bad_url(url_status) is True:
        bad_urls[url_path] = url_status
    url_id += 1
print(f'\nBad urls: {json.dumps(bad_urls, indent=4)}')

The rationale behind this change is that we no longer need the intermediate urlin.txt and urlout.txt files at all: bad URLs are collected in memory and printed as JSON at the end. The script can now be validated simply by running python check_urls_twilio.py.

Second Approach Pros and Cons

This approach was even easier because I did not have to worry about the structure of the URL. I could just read the href attribute of each anchor tag and check whether the link started with http, which covered all of the URLs I wanted to check.

What could make this solution better is the following:

  • Add argparse parameters to customize the timeout for GET requests (a rough sketch of this follows the list).
  • Identify which files have a certain URL.
  • Apply this URL checking solution for file-scraping and web-scraping use cases.
  • Open-source it so that others don’t have to reinvent the wheel.
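
As an example of the first bullet, here is a rough, hypothetical sketch of how a --timeout flag could be wired in with argparse; the flag name and default are assumptions, not part of the current script:

import argparse

# Hypothetical command-line interface; --timeout would replace the hard-coded URL_TIMEOUT.
parser = argparse.ArgumentParser(description='Check a site for link rot.')
parser.add_argument('--timeout', type=float, default=10.0,
                    help='seconds to wait for each GET request (default: 10.0)')
args = parser.parse_args()

URL_TIMEOUT = args.timeout  # then pass timeout=URL_TIMEOUT to requests.get as before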

Conclusion

In summary, we covered how to check a website with over 2,400 links for link rot. Creating a pull request to fix bad links in this repository is much easier than before we had the script; I am now able to put one together in just a few minutes.

Running the link rot checker for a few minutes is much better than spending 6.7 hours checking all of the links by hand.

To learn more about link rot and how to defend against it, check out these links:

Thanks for reading through this article. Happy link-checking everyone!