How to Analyze Text Documents Using Langchain and SendGrid

March 19, 2024
Written by
Brian Wachanga
Contributor
Opinions expressed by Twilio contributors are their own
Reviewed by
Diane Phan
Twilion

In today's digital age, efficient document analysis is crucial. This tutorial guides you through building a solution that leverages Flask, LangChain, Cohere AI, and Twilio SendGrid to analyze documents seamlessly. You will create a Flask application that receives documents, processes them using LangChain for summarization and Cohere AI for analysis, and communicates the results via SendGrid email. Together, these technologies form a robust system for analyzing documents and sharing the results via email.

For this use case, you will use PDF documents, but the process can also be applied to other document types such as CSV and TXT files.

Prerequisites

To follow along, ensure you have the following:

  • Python 3 and pip installed on your machine.
  • A SendGrid account with a verified sender email address.
  • A Cohere AI account (the free trial is sufficient for this tutorial).
  • Postman, or another tool for sending HTTP requests.

Set up your environment

In this section, you will set up your development environment. Begin by creating a virtual environment and installing the required dependencies. Open your command line and navigate to your project folder, then run the following commands:

python -m venv venv
source venv/bin/activate  # On Windows, use venv\Scripts\activate
pip install python-dotenv cohere langchain flask chromadb pypdf werkzeug flask_mail langchain-community

After that, log in to your SendGrid account, click on Settings in the left sidebar then select API Keys. Click Create API Key, give it a name and select your preferred access level. Click Create & View, and copy your API Key for later use.

Create SendGrid API Key

Next, log in to your Cohere AI account. Cohere AI was chosen for this tutorial because it offers a free trial for developers and it is supported by LangChain, a framework you’ll use later on. Navigate to API Keys > New Trial Key to create your API key as shown in the screenshot below. Make sure you give your key a name such as "flask_langchain_mail" and click Generate Key. Copy the new API key for later use.

Create Cohere AI API Key

Create a .env file for your environment variables. You will add the Cohere AI API key, the SendGrid API key, and a default mail sender as shown below.

SENDGRID_API_KEY="<your-sendgrid-api-key>"
MAIL_DEFAULT_SENDER="<your-sender-email-address>"
COHERE_API_KEY="<your-cohere-api-key>"

Create a new app.py file in the root of your project folder and import all dependencies as shown below:

import os
from flask import Flask, request
from werkzeug.utils import secure_filename
from langchain_community.embeddings import CohereEmbeddings
from langchain_community.document_loaders import PyPDFLoader
from langchain_community.vectorstores import Chroma
from langchain_community.llms import Cohere
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.chains import RetrievalQA
from langchain.chains.question_answering import load_qa_chain
from flask_mail import Mail, Message

from dotenv import load_dotenv
load_dotenv()

Finally, create a Flask app and add the configurations for SendGrid as shown below:

app = Flask(__name__)
app.config['MAIL_SERVER'] = 'smtp.sendgrid.net'
app.config['MAIL_PORT'] = 587
app.config['MAIL_USE_TLS'] = True
app.config['MAIL_USERNAME'] = "apikey"
app.config['MAIL_PASSWORD'] = os.environ.get('SENDGRID_API_KEY')
app.config['MAIL_DEFAULT_SENDER'] = os.environ.get('MAIL_DEFAULT_SENDER')
mail = Mail(app)

The code above uses the .env variables to configure Flask-Mail to connect to smtp.sendgrid.net. The mail port is set to 587, and TLS (Transport Layer Security) is enabled for secure communication.

The username is set to the literal string "apikey", which is what SendGrid's SMTP relay expects, and the SendGrid API key serves as the password for authentication.

Write the Flask application

In this section, you will develop the application features to:

  • receive and process PDF documents.
  • create embeddings and index them in a vector database.
  • implement a question-answer mechanism to query the documents.

In your project folder, create a new folder called uploads that will host your uploaded PDF documents. Open your app.py file and add the following code block:

UPLOAD_FOLDER = 'uploads'
ALLOWED_EXTENSIONS = {'pdf'}
app.config['UPLOAD_FOLDER'] = UPLOAD_FOLDER

This sets the upload folder in your Flask application and defines the allowed document extensions. Restricting uploads to known extensions helps block malicious files, such as HTML or JavaScript that could enable Cross-Site Scripting (XSS) attacks. Add the following code block to the app.py file:

llm = Cohere(model="command", temperature=0.1)
db = None

This initializes Cohere’s Command model, a text generation large language model (LLM) that you will use to generate responses. The temperature parameter sets the degree of randomness in the answers provided by the LLM. Temperature values range between 0 and 1: lower values make the LLM’s output more predictable, while higher values allow more creative answers.

In app.py, add the following function to extract documents from a PDF file:

def extract_pdf(file_path):
    loader = PyPDFLoader(file_path)
    pages = loader.load_and_split()
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
    docs = text_splitter.split_documents(pages)
    return docs

The code snippet above uses the PyPDF library to load the contents of a PDF document. You are using LangChain’s implementation of the PDF loader. LangChain is a framework that helps developers build powerful AI applications that leverage LLMs. It implements a standardized API that lets developers swap between models and Python libraries with minimal code changes. Once the text is loaded from the PDF, you use LangChain’s RecursiveCharacterTextSplitter to split the text into segments that make further processing easier.
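To build intuition for what chunk_size and chunk_overlap mean, here is a simplified, hypothetical sketch of fixed-size chunking. Note that this is not LangChain’s actual algorithm, which also tries to split on natural boundaries such as paragraphs and sentences:

```python
# A simplified illustration of chunk_size and chunk_overlap -- NOT
# LangChain's actual implementation, which prefers natural boundaries.
def chunk_text(text, chunk_size=1000, chunk_overlap=0):
    # Advance by (chunk_size - chunk_overlap) so consecutive chunks
    # share chunk_overlap characters at their boundary.
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

chunks = chunk_text("x" * 2500, chunk_size=1000, chunk_overlap=100)
# Each chunk is at most 1000 characters, and consecutive chunks share
# 100 characters so context is not lost at the split points.
```

The overlap matters for question answering: without it, a sentence cut in half at a chunk boundary can become unretrievable.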

Add the following function in app.py for creating embeddings using Cohere:

def embeddings(docs):
    embeddings = CohereEmbeddings(model="multilingual-22-12")
    global db
    db = Chroma.from_documents(docs, embeddings)

The function above initializes Cohere’s Embeddings endpoint to generate vector representations of the text. These embeddings are numerical values that represent the text in a way that machine learning models can understand.

The function also uses Chroma, an open-source vector database, to store the embeddings created from the documents in the previous step. Chroma offers an in-memory database that stores the embeddings for later use. You can also persist the data on your local storage as shown in the official documentation. For this tutorial, you are using LangChain’s implementation of Chroma.

Create a new function in app.py for accepting a user query and responding with an answer using the information acquired from PDF files:

def chat_pdf(query):
    matching_docs = db.similarity_search(query)
    chain = load_qa_chain(llm, chain_type="stuff", verbose=True)
    answer = chain.run(input_documents=matching_docs, question=query)
    return answer

Using the query, the code snippet first performs a similarity search against the vector database. A similarity search finds the documents whose embeddings are numerically closest to the embedding of the query.
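To see what “numerically closest” means, here is a toy sketch using cosine similarity, a common metric for comparing embeddings. The 3-dimensional vectors below are made up for illustration; real Cohere embeddings have hundreds of dimensions, and Chroma handles this comparison for you:

```python
import math

def cosine_similarity(a, b):
    # Cosine of the angle between two vectors: 1.0 means same direction.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

query_vec = [0.2, 0.8, 0.1]
doc_on_topic = [0.1, 0.9, 0.2]   # points roughly the same way as the query
doc_off_topic = [0.9, 0.0, 0.1]  # points a different way

# The on-topic document scores higher, so it would be retrieved first.
print(cosine_similarity(query_vec, doc_on_topic) >
      cosine_similarity(query_vec, doc_off_topic))  # True
```

The vector database applies this idea at scale, ranking every stored chunk against the query embedding and returning the best matches.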

Next, you use LangChain to load a retrieval question-answering chain that utilizes Cohere’s command model to generate the answers. Finally, run the chain using the matching documents and the query as input and Cohere will generate the required answer using this information.

Create two functions in app.py as shown below:

def allowed_file(filename):
    return '.' in filename and \
           filename.rsplit('.', 1)[1].lower() in ALLOWED_EXTENSIONS

@app.route('/upload', methods=['POST'])
def upload():
    if 'file' not in request.files:
        return "Upload a file!"
    file = request.files['file']
    if file and allowed_file(file.filename):
        filename = secure_filename(file.filename)
        filepath = os.path.join(app.config['UPLOAD_FOLDER'], filename)
        file.save(filepath)
        docs = extract_pdf(filepath)
        embeddings(docs)
        return "Done"
       
    return "Upload a valid file"

The first function above checks the file extension of the uploaded file so that only PDFs are accepted, blocking unsafe file types. The second function creates an API route that uses the POST method. This route is responsible for uploading the file, extracting text from it, and creating the required embeddings using the functions defined earlier.
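To make the extension check concrete, here is the same helper exercised on its own. Note that it is case-insensitive and only looks at the final suffix, so double extensions like .pdf.js are rejected:

```python
ALLOWED_EXTENSIONS = {'pdf'}

def allowed_file(filename):
    # Accept only filenames that have an extension in the allowlist.
    return '.' in filename and \
           filename.rsplit('.', 1)[1].lower() in ALLOWED_EXTENSIONS

print(allowed_file('report.pdf'))        # True
print(allowed_file('REPORT.PDF'))        # True -- comparison is lowercased
print(allowed_file('malicious.pdf.js'))  # False -- final suffix is .js
print(allowed_file('noextension'))       # False -- no extension at all
```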

Create another route for receiving the user query and sending results as an email as shown below:

@app.route('/query', methods=['POST'])
def emails():
    query = request.json['query']
    email = request.json['email']
    answer = chat_pdf(query)
    msg = Message('PDF Chat', recipients=[email])
    msg.body = answer
    mail.send(msg)
    return "Email Sent"

This route receives a query and an email address as input. The query is passed to the chat_pdf function, which uses LangChain’s retrieval question-answering chain with Cohere to extract an answer from the PDF file. The answer is then sent as an email to the address provided.

Add the following code block at the end of app.py that will be responsible for running your Flask app in debug mode.

if __name__ == '__main__':
    app.run(debug=True)

 

Test the document analyzing Flask application 

You can access the code for this tutorial in this GitHub repository. To test your application, open your terminal and run the following command at the root of your project folder. This starts your Flask application at http://127.0.0.1:5000/:

python app.py

Open the Postman application and create a new POST request. Set the URL to http://127.0.0.1:5000/upload. Select Body and choose the form-data option. Under Key, enter file and switch the type from Text to File. Under Value, choose the PDF file you want to use. For this demonstration, I will use Twilio’s Style Guide. Click Send and you should receive a 200 response as shown below.

Upload and process PDF documents using LangChain and Cohere

Create a new POST request in Postman and use http://127.0.0.1:5000/query as the address. Select Body and choose the raw option. Choose the JSON option and add the payload as shown in the screenshot below. Make sure you add your email.
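As a textual reference for the screenshot, the JSON payload looks something like this (the email address is a placeholder for your own):

```json
{
    "query": "What are the main formatting requirements for code blocks?",
    "email": "<your-email-address>"
}
```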

Query your PDF and get response on email using SendGrid, LangChain and Cohere AI

After hitting the Send button, the request should complete successfully, and you will receive a new email with the relevant answer to the query submitted via the API. The answer is generated using the information in the PDF document uploaded earlier. This concept is called Retrieval-Augmented Generation (RAG) and is used to increase the accuracy of answers generated by LLMs.

For example, the query “What are the main formatting requirements for code blocks?” returns an answer like the one below:

Code blocks should be denoted by markdown triple fences, with the programming language or framework being used specified within the fences. If you are demonstrating a change to a code block, you should highlight the lines that have been modified.
For example:
```
hl_lines="3 5"
const myFunc = () => {
// line 3 and 5 of this code block will be highlighted
const msg = "Hello!";
return msg;
}
```

 

What's next for document analyzing Flask apps?

To conclude, we addressed the need for efficient document analysis by creating a comprehensive solution using Flask, LangChain, Cohere AI, and SendGrid. The technologies selected offer specialized capabilities, providing benefits such as text extraction, summarization, analysis, and convenient email communication. This tutorial serves as a starting point for building more sophisticated document analysis applications. 

You can check out how to receive incoming emails using SendGrid and allow your users to query PDFs directly from their mailbox.

With a decade in the computer science industry, Brian has honed his expertise in diverse technologies. He pursues the fusion of machine learning and design, crafting innovative solutions at the intersection of art and technology through his design studio, Klurdy.