Parse HTML for Book quotes with Python, Beautiful Soup, and WhatsApp

August 25, 2018

Written by

Twilion

My first real paid job was working at the local library in high school. This was perfect because I love reading. With a monthly book stipend company perk and more time to read in my post-grad life, I've started using Goodreads to find new books and to keep track of what I've already read.

This post will go over how to parse the Goodreads quotes page. We'll target popular book quotes and quotes with a specific tag received as input from a WhatsApp message and then send a random quote as an outbound WhatsApp message.

Setting Up Twilio API Sandbox for WhatsApp

At the moment, only approved business accounts can use Twilio's WhatsApp API so we need to use the Twilio API Sandbox for WhatsApp to play around. Let's go on over to the Sandbox in our Twilio console. To activate it we must choose a Sandbox number, agree to the terms of WhatsApp, and select Activate Sandbox.

To join a Sandbox, send “join <your sandbox keyword found in your console>” to your Sandbox number in WhatsApp and you should get a response confirming you’ve joined. You can follow these instructions to install the WhatsApp Sandbox Channel in your account and connect your WhatsApp account with the Sandbox.

Setting up your Developer Environment

You will be using the Beautiful Soup Python package to scrape a webpage and extract data from its HTML. Before we can dig into some code, make sure that your Python and Flask development environment is set up. If you haven't done so already:

Install Python 3--this is important! It must be Python 3.
Install Ngrok to make your Flask app visible from the internet so Twilio can send requests to it
Set up your development environment

If you're new to Python and Flask check out this handy guide for more information on getting started.

Now run the following command in the terminal in the directory you'll put your code in.

ngrok http 5000

You should see a screen similar to the one below:

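Twilio needs that publicly accessible ngrok URL so it can forward each message sent to our WhatsApp number to our Flask app. Once the /whatsapp route below is written, enter the ngrok URL followed by /whatsapp in the Sandbox's "When a message comes in" field in the Twilio console.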

We need the following modules:

Requests to fetch the HTML pages we scrape
Twilio to generate the TwiML that replies with a quote to each message sent to your WhatsApp number
Flask to respond to incoming web requests
string (part of Python's standard library) to help clean the text we parse from the HTML page
bs4 (Beautiful Soup) to parse HTML pages
random (also part of the standard library) to pick a quote at random

In the terminal, run the following command to install the third-party modules. string and random ship with Python, so they don't need to be installed:

pip3 install requests twilio flask bs4

If this throws permission errors, run

pip3 install --user [package]

Reading and Parsing the Web Page

To start off we have to scrape and clean the Goodreads page that is selected based on the inbound message. The message can either be "popular", which is the default for https://goodreads.com/quotes, or it can be anything else, such as a title, series, or author like "Harry Potter" or "Jane Austen".

Create a file called goodreads.py where you'll put all your code. The top of goodreads.py should include the following import statements.

from twilio.twiml.messaging_response import MessagingResponse
from flask import Flask, request
from bs4 import BeautifulSoup 
import requests, string, random

Now we will open the webpage and create a BeautifulSoup object representing the document as a nested data structure. The second argument, "html.parser", selects the parser built into Python's standard library. It is optional, but if you leave it out Beautiful Soup picks the best parser it finds installed and prints a warning, so it is worth being explicit. Other parsers you could use include lxml or html5lib, which both require an external dependency.

def scrape_and_clean(url):
    req = requests.get(url)
    soup = BeautifulSoup(req.content, "html.parser")

Next, we want to get quotes from the page. If you visit https://goodreads.com/quotes and right click -> view page source you would see HTML classes are nested like so:

With this nested structure in mind we will find every div with the class "quoteText." We can do this with the shorthand for Beautiful Soup's find_all method: soup("div"), which is the same as writing soup.find_all("div").

    div_quotes = soup("div", attrs={"class": "quoteText"})  # soup("div") == soup.find_all("div")
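To see the shorthand in action without hitting Goodreads, here is a minimal sketch run against a made-up HTML snippet. The markup and the variable names are assumptions for illustration only; they are not the real Goodreads page source.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, loosely shaped like a quotes page (not the real Goodreads source).
sample_html = """
<div class="quote">
  <div class="quoteText">First quote <a href="/author/1">Author One</a></div>
</div>
<div class="quote">
  <div class="quoteText">Second quote <a href="/author/2">Author Two</a></div>
</div>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# Calling the soup object directly is shorthand for find_all().
short_form = sample_soup("div", attrs={"class": "quoteText"})
long_form = sample_soup.find_all("div", attrs={"class": "quoteText"})

print(len(short_form))  # number of matching divs
print([str(t) for t in short_form] == [str(t) for t in long_form])  # same tags either way
```

Only the inner divs match, because the outer ones carry the class "quote" rather than "quoteText".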

Let's loop through the quotes and find each quote's author. As shown in the HTML code, the only a tag inside each quoteText div is the author's name.

If an a tag exists then we know an author exists. We can access all that text inside a tag as a single Unicode string with the get_text() method. If we can't find an author, we skip it because it may not be a good quote.

    quotes = ''
    for q in div_quotes:
        author = ""
        # If no author, then skip
        try:
            author = q.find("a").get_text() + "\n"
        except AttributeError:  # find() returned None: no a tag, so no author
            continue
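The skip works because find() returns None when nothing matches, so calling get_text() on the result raises an AttributeError. A small sketch of the same idea with an explicit None check instead of try/except follows; the sample markup and the authors list are hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the second quote has no <a> tag, so it has no author.
sample_html = (
    '<div class="quoteText">With author <a href="/a/1">Jane Austen</a></div>'
    '<div class="quoteText">No author here</div>'
)
sample_soup = BeautifulSoup(sample_html, "html.parser")

authors = []
for div in sample_soup.find_all("div", attrs={"class": "quoteText"}):
    link = div.find("a")  # find() returns None when there is no match
    if link is None:
        continue  # skip quotes without an author
    authors.append(link.get_text())  # all text inside the tag as one string

print(authors)
```

Either style skips the author-less quote; the explicit check just makes the reason for the skip visible.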

Now we need to turn multi-line quotes into a single string. Python 3 uses Unicode by default, so every string is a sequence of Unicode characters. We loop through a tag's children, which Beautiful Soup exposes through the tag's .contents list, and encode each child as ASCII while ignoring any non-ASCII characters. Once those characters have been dropped, the bytes are decoded back into a readable string. (A plain string has no .contents because it can't contain anything.) If a child starts with a "<" it is a tag rather than quote text, so we stop there. Otherwise, we append it to the quote string, which turns a multi-line quote into a single string.

        quote = ""
        # turn multi-line quotes into a single string
        for child in q.contents:
            # encode() drops non-ASCII characters; decode() converts the bytes back to a str
            line = child.encode("ascii", errors="ignore").decode("utf-8")
            if line.startswith("<"):  # a tag rather than quote text, so stop
                break
            else:
                quote += line
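The encode-then-decode round trip is easier to follow on a plain string. This sketch uses only built-in str and bytes methods; the sample text is made up.

```python
# Made-up text containing curly quotes and accented letters.
text = "“Café au lait” is naïve"

# errors="ignore" silently drops every character that has no ASCII encoding.
ascii_bytes = text.encode("ascii", errors="ignore")

# ASCII is a subset of UTF-8, so decoding the bytes as UTF-8 is safe.
cleaned = ascii_bytes.decode("utf-8")

print(type(ascii_bytes).__name__)
print(cleaned)
```

Note that the characters are removed, not transliterated, so "Café" loses its last letter instead of becoming "Cafe".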

To clean and format each quote, we call strip() to remove leading and trailing whitespace that would otherwise clutter the quotes we pull from Goodreads. We then wrap the quote in quotation marks, append the author, and add a "#" character so we know where to split the string into individual quotes later. Finally, we filter the combined string so that only printable characters (digits, letters, punctuation, and whitespace) are returned.

        # quote already holds the text collected by the loop above
        quote = "\"" + quote.strip() + "\" "
        quotes += quote + '-' + author + "#"

    # filter() returns an iterator of the printable characters in Python 3
    quotes_to_return = filter(lambda x: x in string.printable, quotes)
    return quotes_to_return
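Because filter() returns a lazy iterator in Python 3, the caller has to join the characters back together before splitting on "#". The sketch below shows that round trip on a made-up string; the sample data and variable names are assumptions.

```python
import string

# Hypothetical scraped text: two quotes separated by "#", plus a stray non-printable character.
raw = '"First quote" -Author One\n#"Second\x00 quote" -Author Two\n#'

# Keep only printable characters, then rebuild a single string from the iterator.
printable = "".join(filter(lambda ch: ch in string.printable, raw))

# The trailing "#" leaves an empty last element after split(), so drop empties.
quotes = [q for q in printable.split("#") if q.strip()]

print(quotes)
```

One limitation of using "#" as a separator is that a quote containing a literal "#" would be split in two.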

Sending the Goodreads Quotes with Twilio's WhatsApp API

Now that we have the quotes we want from Goodreads, let's send them with Twilio's WhatsApp API.

With the following code we make our Flask app and point it at the /whatsapp route where we will take the incoming message we receive.

app = Flask(__name__)
@app.route('/whatsapp', methods=['POST'])

With the inbound message we now figure out which URL to search through for Goodreads quotes. If the message is "popular", we would search the main /quotes page on Goodreads. Otherwise, we'd search the quotes page for the tag that was sent in.

def send_sms():
    #incoming message
    msg = request.values.get("Body", "").strip().lower()  # default to "" if Body is missing
    if msg == "popular": #main quotes page on Goodreads
        url = "https://goodreads.com/quotes"
    else:
        url = "http://www.goodreads.com/quotes/search?utf8=%E2%9C%93&q=" + msg
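Messages such as "Jane Austen" contain spaces, so it can help to escape the text before appending it to the query string. The helper below is a hypothetical sketch, not part of the original code: build_url and the two constants are names introduced here, and it assumes the search page accepts standard "+"-encoded spaces.

```python
from urllib.parse import quote_plus

POPULAR_URL = "https://goodreads.com/quotes"
SEARCH_URL = "http://www.goodreads.com/quotes/search?utf8=%E2%9C%93&q="


def build_url(msg):
    """Hypothetical helper: map an inbound message to a Goodreads URL."""
    msg = msg.strip().lower()
    if msg == "popular":  # main quotes page
        return POPULAR_URL
    # quote_plus() escapes spaces and other reserved characters for a query string
    return SEARCH_URL + quote_plus(msg)


print(build_url("Popular"))
print(build_url("Jane Austen"))
```

Keeping the URL logic in its own function also makes it testable without starting the Flask app.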

We then make a list of quotes from passing that URL to our scrape_and_clean function, splitting at each # character. A different message is sent based on the length of that list of quotes. If there's only one quote for that Goodreads tag, we send that quote. If there are no quotes, we say to send in a different message. Otherwise, we randomly select one of the quotes.

    # split on "#" and drop the empty string left after the final "#"
    quotes = [q for q in "".join(scrape_and_clean(url)).split("#") if q.strip()]
    print("all quotes ", quotes)
    if len(quotes) == 1:
        quote = quotes.pop()
    elif len(quotes) == 0:
        quote = "no tags, try another one like 'jane austen', 'harry potter', or 'lord of the rings'."
    else:
        quote = random.choice(quotes) + '\n\n'
    res = MessagingResponse()
    res.message(quote)
    return str(res)
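The three-way branch can be tried on its own, without Flask or Twilio. The sketch below pulls it into a hypothetical pick_quote helper (a name introduced here, not in the original code) so each case can be checked directly.

```python
import random

FALLBACK = "no tags, try another one like 'jane austen', 'harry potter', or 'lord of the rings'."


def pick_quote(quotes):
    """Hypothetical helper mirroring the branching in the route above."""
    if not quotes:  # nothing found for this tag
        return FALLBACK
    if len(quotes) == 1:  # only one candidate, so no need to randomize
        return quotes[0]
    return random.choice(quotes) + "\n\n"


print(pick_quote([]))
print(pick_quote(["only quote"]))
print(pick_quote(["quote a", "quote b"]).strip() in {"quote a", "quote b"})
```

The empty-list case is only reachable if empty strings have been removed from the list first, since splitting a string always yields at least one element.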

Finally we can run our Flask app by adding this code at the bottom.

if __name__ == '__main__':
    app.run(debug=True)

You may have noticed that this code is the same as what you would write to respond to an incoming text message to a Twilio number--the only difference is where the message is coming from.

If you run

python3 goodreads.py

on the command line in the directory where your file is saved and text your Twilio WhatsApp Sandbox number a book title or author (it can include spaces), you should get a random quote back.

Nice!

We've now parsed an HTML page for book quotes on goodreads.com and returned a randomized quote with Python 3, Beautiful Soup, and Twilio's WhatsApp API. Check out the GitHub repo for completed code and also the official Twilio WhatsApp documentation to keep up-to-date on any new features. If you have any questions, comments, or if you built something neat, feel free to message me online!

Twitter: @lizziepika

GitHub: elizabethsiegle

Email: lsiegle@twilio.com