How to Build a PDF Analyzing Bot with Haystack AI, Python, and WhatsApp

April 02, 2024
Written by
Similoluwa Adegoke
Contributor
Opinions expressed by Twilio contributors are their own
Reviewed by
Diane Phan
Twilion

PDF files are a convenient way to share large amounts of textual information, but the detail you need is sometimes buried somewhere you cannot easily search for. In cases like this, the ability to chat with the PDF comes in handy.

In this post, you will learn how to build a PDF question-answering chatbot in Python using the Twilio Programmable Messaging API for WhatsApp and the Haystack Large Language Model (LLM) framework.

Prerequisites

To follow along with this tutorial, you will need the following:

Building the chatbot functions

Set up the Development Environment

To set up a development environment, open the command line and run the following commands:

mkdir askpdf_whatsapp
cd askpdf_whatsapp
python3 -m venv venv 
. venv/bin/activate

These commands create a new directory named askpdf_whatsapp and navigate into it. Then a Python virtual environment named venv is created and activated.

Next, create a file named requirements.txt that would contain the list of the required packages and add the following code to it.

fastapi
uvicorn
python-multipart
pymupdf
farm-haystack[inference]
twilio
python-decouple

Here is a breakdown of all the packages in the file:

  • twilio: The Twilio Python Helper Library, used to access Twilio API functionality.
  • fastapi: A package for FastAPI, a modern, high-performance web framework for building APIs with Python 3.7+.
  • uvicorn: An ASGI web server implementation for Python, used together with the FastAPI package.
  • python-multipart: A streaming multipart parser for Python. The FastAPI package uses it to accept multipart form data, like files.
  • pymupdf: A Python library for manipulating PDF files, used in this tutorial to read PDF files.
  • farm-haystack[inference]: An open-source Large Language Model framework in Python. This tutorial uses version 1.24.
  • python-decouple: A library used to organize settings and separate settings from code. It can read values from .ini and .env files, which is how it is used in this tutorial.

Run the following command in the terminal to install the packages.

pip install -r requirements.txt

Read PDF to Text

Create a file named utils.py and add the following code that extracts the text from the PDF.

import fitz
import shutil
import os
import tempfile
import requests

def extract_text(pdf_url, temp_file_path):
    response = requests.get(pdf_url)
    with tempfile.NamedTemporaryFile(suffix='.pdf') as temp_file:
        temp_file.write(response.content)
        temp_file.flush()
        doc = fitz.open(temp_file.name)

        # Recreate the output directory so stale pages are removed
        if os.path.exists(temp_file_path):
            shutil.rmtree(temp_file_path)
        os.mkdir(temp_file_path)

        # Write each page's text to a numbered file, starting from 1
        for index, page in enumerate(doc):
            filepath = temp_file_path + "/" + str(index + 1) + "_output.txt"
            with open(filepath, "wb") as out:
                out.write(page.get_text().encode("utf8"))
        doc.close()

The extract_text function takes the URL of the PDF and the path of the output folder as parameters.

The breakdown of the function:

  • The requests library makes an HTTP call to the PDF URL, and the content is written to a temporary file with temp_file.write().
  • fitz.open() opens the PDF file and returns a document whose pages can be iterated over with enumerate(doc). The text is then extracted page by page using page.get_text() and written to the output folder with file names in the style 1_output.txt.
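
The directory handling and naming scheme can be sketched without PyMuPDF by substituting plain strings for the PDF pages. This is a minimal, self-contained illustration; write_pages and askpdf_demo are hypothetical names used only for this sketch.

```python
import os
import shutil
import tempfile

def write_pages(pages, temp_file_path):
    # Mirror extract_text's directory handling: wipe and recreate
    if os.path.exists(temp_file_path):
        shutil.rmtree(temp_file_path)
    os.mkdir(temp_file_path)
    # Page indices from enumerate() start at 0, so 1 is added to get
    # human-friendly names: 1_output.txt for the first page, and so on
    for index, text in enumerate(pages):
        filepath = temp_file_path + "/" + str(index + 1) + "_output.txt"
        with open(filepath, "wb") as out:
            out.write(text.encode("utf8"))

outdir = os.path.join(tempfile.gettempdir(), "askpdf_demo")
write_pages(["page one", "page two"], outdir)
print(sorted(os.listdir(outdir)))  # ['1_output.txt', '2_output.txt']
```

Wiping and recreating the directory on every call keeps pages from a previously uploaded PDF from leaking into the next one.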

Next create the main.py file to expose an API endpoint and use the extract_text() function.

from fastapi import FastAPI, Request
from utils import extract_text

app = FastAPI()

@app.post("/receive")
async def receive(request: Request):
    doc_file_path = "doc_files"
    form_data = await request.form()
    if "MediaUrl0" in form_data:
        pdf_url = form_data["MediaUrl0"]
        if len(pdf_url) > 0:
            extract_text(pdf_url, doc_file_path)
        return ""
    else:
        return {"msg": "Kindly upload a pdf file"}

Here is the breakdown of the code parts:

  • The file imports FastAPI and Request from the FastAPI package and the extract_text function from the utils.py file.
  • An endpoint /receive accepts the incoming request and uses request.form() to retrieve the request body. Media data received from the Twilio WhatsApp API is named in the style MediaUrl{n}, and since you are expecting a single attachment, you can pick the first one with MediaUrl0. Check out more information about the data received from the Twilio WhatsApp API here.
  • If the key MediaUrl0 is present, it is passed into the extract_text function along with the output file path doc_file_path. This will create a local directory named doc_files.
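
The branching logic described above can be sketched independently of FastAPI with a plain dictionary standing in for the submitted form data. The route function below is hypothetical, written only to make the decision flow explicit:

```python
def route(form_data):
    # Twilio posts the message text under "Body" and any attachments
    # under "MediaUrl0", "MediaUrl1", ... — one key per media item.
    if "MediaUrl0" in form_data and form_data["MediaUrl0"]:
        return "extract"           # a PDF was attached: extract its text
    return "prompt_for_upload"     # no media yet: ask the user for a PDF

print(route({"Body": "hi"}))                                          # prompt_for_upload
print(route({"Body": "", "MediaUrl0": "https://example.com/a.pdf"}))  # extract
```

Checking both that the key exists and that its value is non-empty mirrors the two nested conditions in the endpoint.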

Build the function to answer questions

To build the question-and-answer function, you will make use of a predefined pipeline provided by the Haystack LLM framework, the ExtractiveQAPipeline, whose task is to find the answer to a question by selecting a segment of text.

The first step is to build the pipeline. Add the following code to the utils.py file:

from haystack.document_stores import InMemoryDocumentStore
from haystack.pipelines.standard_pipelines import TextIndexingPipeline
from haystack.nodes import BM25Retriever, FARMReader
from haystack.pipelines import ExtractiveQAPipeline
import pickle

def build_pipeline(file_path, model_file):
    document_store = InMemoryDocumentStore(use_bm25=True)
    files_to_index = [file_path + '/' + f for f in os.listdir(file_path)]
    indexing_pipeline = TextIndexingPipeline(document_store)
    indexing_pipeline.run_batch(file_paths=files_to_index)

    retriever = BM25Retriever(document_store=document_store)
    reader = FARMReader(model_name_or_path="deepset/roberta-base-squad2", use_gpu=True)

    pipeline = ExtractiveQAPipeline(reader, retriever)
    with open(model_file, 'wb') as file:
        pickle.dump(pipeline, file)

The build_pipeline function takes file_path, the directory containing the text files of each page of the PDF in the format {index}_output.txt (1_output.txt, 2_output.txt). The function also accepts the name of the file to save the model to.

Here is a breakdown of the code parts:

  • The TextIndexingPipeline is created over an InMemoryDocumentStore, an in-memory document store into which all the text files are indexed.
  • The BM25Retriever retrieves the data from the document store. The retriever combs the DocumentStore and returns only the documents it thinks are relevant to the query.
  • Next, the reader is set up using a premade model deepset/roberta-base-squad2. The reader is responsible for accepting the documents that the retriever returns and selecting a text span that provides an answer to the query.
  • Finally, the pipeline is then built and saved to a model_file using pickle. The pipeline accepts the reader and retriever and uses them for each question.

Dumping the model into a pickle file allows its usage anytime it is needed without building the pipeline again.
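
The save-and-restore cycle can be sketched with a plain dictionary standing in for the pipeline object (pickling the real ExtractiveQAPipeline works the same way, it is just slower and larger):

```python
import os
import pickle
import tempfile

# model_demo is a hypothetical file name used only for this sketch
model_file = os.path.join(tempfile.gettempdir(), "model_demo")

# Serialize the object to disk once...
with open(model_file, "wb") as f:
    pickle.dump({"reader": "roberta", "retriever": "bm25"}, f)

# ...and restore it later without rebuilding it
with open(model_file, "rb") as f:
    restored = pickle.load(f)

print(restored["reader"])  # roberta
```

Note that a pickle file is only safe to load if you created it yourself; never unpickle data received from users.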

To make use of the pipeline to answer the question, create a function as shown below in the utils.py file.

def predict_answer(model_file, question):
    answer_dict = {}

    if not os.path.exists(model_file):
        answer_dict['answer'] = 'No pdf was uploaded'
        return answer_dict

    with open(model_file, 'rb') as file:
        pipeline = pickle.load(file)

    predictions = pipeline.run(
        query=question,
        params={"Retriever": {"top_k": 1}, "Reader": {"top_k": 1}}
    )

    answer_lists = []

    if "answers" not in predictions:
        answer_dict['answer'] = "Sorry, No answer to that question"
        return answer_dict

    if "query" in predictions:
        answer_lists = [predictions["answers"]]
    elif "queries" in predictions:
        answer_lists = predictions["answers"]

    answer_dict['answer'] = answer_lists[0][0].answer
    answer_dict['context'] = answer_lists[0][0].context
    return answer_dict

  • The function takes in the name of the pipeline file model_file and the question, and checks whether the file exists. If it does not, a PDF has not been uploaded yet.
  • Next, the pipeline is loaded with pickle.load(), and the pipeline.run method from the Haystack framework accepts the question and sets some parameters for the Retriever and the Reader. Check out more on the parameters here.
  • The predictions variable that is returned contains other values that do not need to be shown in the result; the next lines filter out the answer and return a dictionary (answer_dict) with answer and context key-value pairs.
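
The filtering step can be illustrated with a stand-in for the prediction output. The namedtuple below is a hypothetical stand-in for Haystack's Answer objects, which expose answer and context attributes:

```python
from collections import namedtuple

Answer = namedtuple("Answer", ["answer", "context"])

# A simulated single-query prediction result in the shape the
# function expects: a "query" key plus a list of answer objects.
predictions = {
    "query": "Who wrote the report?",
    "answers": [Answer(answer="Jane Doe", context="...a report by Jane Doe...")],
}

answer_dict = {}
if "answers" not in predictions:
    answer_dict["answer"] = "Sorry, No answer to that question"
else:
    answer_lists = [predictions["answers"]]  # single-query case
    answer_dict["answer"] = answer_lists[0][0].answer
    answer_dict["context"] = answer_lists[0][0].context

print(answer_dict["answer"])  # Jane Doe
```

Only the first answer's answer and context fields are kept; everything else in the prediction output is discarded before the reply is sent.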

Finally, use the two functions in the API endpoint in the main.py file.

Update the main.py file with the following code:

from fastapi import FastAPI, Request
from utils import extract_text, build_pipeline, predict_answer

app = FastAPI()

@app.post("/receive")
async def receive(request: Request):
    doc_file_path = "doc_files"
    model_file = "model"
    form_data = await request.form()
    body = form_data["Body"]

    if "MediaUrl0" in form_data:
        pdf_url = form_data["MediaUrl0"]

        if len(pdf_url) > 0:
            extract_text(pdf_url, doc_file_path)
            build_pipeline(doc_file_path, model_file)
            return {"msg": "Kindly ask your questions"}
  
    elif len(body) > 0:
        answer = predict_answer(model_file, body)
        return answer

    else:
        return {"msg": "Kindly upload a pdf file"}

The /receive endpoint now does the following:

  • Retrieves the form data using request.form() and extracts fields with the keys of MediaUrl0 as a file and Body as a body (a naming synonymous to Twilio WhatsApp API Message structure). 

  • If the MediaUrl0 key exists and the corresponding value, pdf_url is not empty, the functions extract_text and build_pipeline are called. Then a message is returned to ask the questions.

  • If it is the body that has some value, the predict_answer function is called and the answer is returned to you.

Integrate the Twilio Programmable Messaging API for WhatsApp

If you are on a free trial, you have to join the WhatsApp Sandbox, an environment set up for prototyping.

Do not use the WhatsApp Sandbox for production. For production, you can either get a Twilio phone number approved to be used with WhatsApp or you can get your WhatsApp number to be approved for use on Twilio.
To set up the WhatsApp Sandbox, log in to the Twilio Console. Navigate to Explore Products > Messaging > Try it out > Try WhatsApp.
Scan the QR code or send the "join" message shown on the screen to your Twilio WhatsApp Sandbox number. The sandbox expires after three (3) days, after which you have to rejoin it by scanning the QR code or sending the message again.

Twilio Console for WhatsApp Setup

Next, click on the Accounts > API Keys & Tokens section from the dashboard header.

Accounts drop-down panel on the Twilio dashboard.

Copy the credentials ACCOUNT_SID and AUTH_TOKEN.

Create a .env file to contain the credentials and your WhatsApp Sandbox number.

Update the .env file as follows:

TWILIO_ACCOUNT_SID=[ACCOUNT SID]
TWILIO_AUTH_TOKEN=[AUTH TOKEN]
TWILIO_NUMBER=[YOUR WHATSAPP SANDBOX NUMBER]

Do not store your credentials inside the code files. Always manage them with environment files or secret files.

Add code functionality to send WhatsApp Message

Add the function to send a WhatsApp Message to the utils.py:

from twilio.rest import Client
from decouple import config

account_sid = config("TWILIO_ACCOUNT_SID")
auth_token = config("TWILIO_AUTH_TOKEN")
client = Client(account_sid, auth_token)
twilio_number = config('TWILIO_NUMBER')

def send_message(to_number, body_text):
    message = client.messages.create(
            from_=f"whatsapp:{twilio_number}",
            body=body_text,
            to=f"whatsapp:{to_number}"
            )

A Client object is built from the credentials saved in the .env file and used to call the messages.create method, which accepts the recipient's phone number to_number and the message to send, body_text.

Then, update the main.py file to send a WhatsApp message after every event.

from fastapi import FastAPI, Request
from utils import extract_text, build_pipeline, predict_answer, send_message
from twilio.twiml.messaging_response import MessagingResponse

app = FastAPI()

@app.post("/receive")
async def receive(request: Request):
    resp = MessagingResponse()
    doc_file_path = "doc_files"
    model_file = "model"
    form_data = await request.form()
    body = form_data["Body"]
    whatsappNumber = form_data['From'].split("whatsapp:")[-1]

    if "MediaUrl0" in form_data:
        pdf_url = form_data["MediaUrl0"]

        if len(pdf_url) > 0:
            extract_text(pdf_url, doc_file_path)
            build_pipeline(doc_file_path, model_file)
            send_message(whatsappNumber, 'Kindly ask your questions')
            return str(resp)
    
    elif len(body) > 0:
        answer = predict_answer(model_file, body)
        resp.message(answer["answer"])
        send_message(whatsappNumber, answer["answer"])
        return str(resp)

    else:
        resp.message('Kindly upload a pdf file')
        send_message(whatsappNumber, 'Kindly upload a pdf file')
        return str(resp)

  • For the /receive endpoint to follow the Twilio webhook convention, it now returns a response of type TwiML (Twilio Markup Language), which is created with resp = MessagingResponse().
  • The sender's WhatsApp number is extracted from the request using the From key and split so that only the number remains, without any other text.
  • The responses are then passed into the send_message function alongside the sender's WhatsApp number, whatsappNumber.
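
The number-extraction step can be shown on its own. The sender_number helper below is hypothetical, written only to isolate the split; Twilio delivers the From field as "whatsapp:" followed by the number in E.164 format:

```python
def sender_number(from_field):
    # "whatsapp:+15551234567" -> "+15551234567"; split(...)[-1] also
    # passes a bare number through unchanged if the prefix is absent
    return from_field.split("whatsapp:")[-1]

print(sender_number("whatsapp:+15551234567"))  # +15551234567
```

The bare number is what send_message re-prefixes with "whatsapp:" when composing the reply.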

Test the application

Run the following commands to activate the virtual environment and start the app:

. venv/bin/activate
uvicorn main:app   

To receive messages that are sent to the Twilio WhatsApp Number, you will expose the local endpoints over the internet using the ngrok tool and update the webhook on the Twilio dashboard.

Create an account on ngrok and select your operating system type then run the installation command.

This is the command for macOS:

brew install ngrok/ngrok/ngrok

After installation, run the ngrok command to add the authtoken to the ngrok config.

ngrok config add-authtoken <NGROK_AUTH_TOKEN>
The ngrok dashboard showing the commands for setting up on a macOS

Open another terminal to start ngrok with this command:

ngrok http 8000

The ngrok terminal returns a URL that can be used to set up a webhook on Twilio, e.g. https://be59-102-89-42-55.ngrok-free.app. The full URL for the endpoint is https://be59-102-89-42-55.ngrok-free.app/receive.

To set up the webhook, go to the Twilio Console and navigate to Explore Products > Messaging > Try it out > Try WhatsApp > Sandbox Settings. Insert the full endpoint URL https://be59-102-89-42-55.ngrok-free.app/receive into the When a message comes in field and select the POST method.

This setup forwards every message sent to the Twilio WhatsApp number to the /receive endpoint.

Finally, open the chat you set up when you joined the Twilio Sandbox, upload a PDF file and ask your questions about the PDF file.

What's next for LLM chatbots?

In this tutorial, you have gone through a step-by-step procedure to build a WhatsApp chatbot that can extract and provide answers to questions using knowledge from a PDF. The process covered setting up the development environment, extracting text from the PDF, using a ready-made pipeline for question answering, and integrating it with the Twilio Messaging API for WhatsApp.

Check out this guide on how to use AI with Twilio, or how to interact with the chatbot through audio.

You can also explore how to build a question answering bot with LangChain instead of Haystack.

Similoluwa is a Software Engineer who enjoys tinkering with technologies. When he is not coding, he spends his time reading.