Build an AI Video Analysis App with FastAPI, OpenAI, and SendGrid
AI video analysis has practical applications across industries like healthcare monitoring, security systems, customer service quality assurance, and remote assistance. By combining visual analysis with audio transcription, you can create intelligent systems that understand both what people are saying and what's happening on screen.
In this post, you'll learn how to build a real-time AI video analysis application using FastAPI, OpenAI's GPT-4 Vision and Whisper models, and SendGrid for email notifications. This multi-agent system captures video and audio, analyzes them using AI, and sends detailed reports via email when it detects high-priority situations.
Prerequisites
To follow along with this tutorial, you'll need:
- A free Twilio SendGrid account
- An OpenAI account with API access
- Python 3.8 or higher installed
- Basic familiarity with Python and web development concepts
- A modern web browser with camera and microphone support
Building the application
Project setup
Start by creating your project directory and organizing the folder structure. Open your terminal and run these commands:
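The directory name below is just a placeholder; use whatever name fits your project:

```bash
mkdir ai-video-analysis
cd ai-video-analysis
```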
Next, create a virtual environment to keep your project dependencies isolated:
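```bash
python -m venv venv
```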
Activate the virtual environment. The command differs depending on your operating system:
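```bash
# macOS / Linux
source venv/bin/activate

# Windows
venv\Scripts\activate
```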
Create a requirements.txt file with the following dependencies:
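A minimal set covering everything this tutorial uses looks something like this; pin specific versions if you need reproducible builds:

```
fastapi
uvicorn
openai
sendgrid
pydantic
python-dotenv
```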
Install all the dependencies:
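```bash
pip install -r requirements.txt
```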
Setting up environment variables
Create a .env file in your project root directory to store your API keys and configuration:
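The variable names below are one possible convention; whatever names you choose, they need to match the `os.getenv` calls in `main.py` later:

```
OPENAI_API_KEY=your-openai-api-key
SENDGRID_API_KEY=your-sendgrid-api-key
SENDGRID_FROM_EMAIL=verified-sender@example.com
ALERT_RECIPIENT_EMAIL=recipient@example.com
```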
You'll need to obtain API keys from both OpenAI and SendGrid. For OpenAI, visit the OpenAI Platform, sign in, and create a new secret key. Make sure you have access to the GPT-4 Vision and Whisper models, which may require a paid account.
For SendGrid, you'll need to create an account and verify your sender email address. Follow this guide to set up SendGrid and obtain your API key with Mail Send permissions.
Understanding the multi-agent architecture
This application uses a three-agent pipeline where each agent has a specific responsibility. Understanding this architecture before diving into the code will help you see how the pieces fit together.
- Agent 1: Data Capture Agent handles capturing video frames and transcribing audio using OpenAI Whisper. It validates the quality of captured data and prepares it for analysis.
- Agent 2: Analysis Agent uses OpenAI's GPT-4 Vision model to analyze both the visual content and transcribed audio. It assigns a priority score from 1 to 10 and determines the urgency level.
- Agent 3: Report Agent generates HTML reports and sends email notifications through SendGrid when high-priority situations are detected.
These agents work sequentially, with each agent's output becoming the next agent's input. This modular design makes the system easier to maintain and allows you to swap out individual components without affecting the rest of the pipeline.
Building the FastAPI application structure
Create a file named main.py in your project root. Start by importing the necessary libraries and setting up the FastAPI application:
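Here's a reasonable starting point; the imports cover everything used in the rest of this tutorial, and the exact logging format is a matter of taste:

```python
import base64
import json
import logging
import os
import tempfile
from datetime import datetime
from typing import List, Optional

from dotenv import load_dotenv
from fastapi import FastAPI, HTTPException
from fastapi.responses import FileResponse
from openai import AsyncOpenAI
from pydantic import BaseModel, Field
from sendgrid import SendGridAPIClient
from sendgrid.helpers.mail import Mail

# Logging makes agent activity and API failures visible in the console
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(name)s: %(message)s",
)
logger = logging.getLogger(__name__)

app = FastAPI(title="AI Video Analysis")
```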
The logging setup helps you track what's happening in your application, which is especially useful when debugging issues with API calls or agent processing.
Next, load your environment variables and initialize the API clients:
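A minimal sketch, assuming the environment variable names from the .env example above:

```python
# Load the variables from .env into the process environment
load_dotenv()

OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
SENDGRID_API_KEY = os.getenv("SENDGRID_API_KEY")
FROM_EMAIL = os.getenv("SENDGRID_FROM_EMAIL")
ALERT_EMAIL = os.getenv("ALERT_RECIPIENT_EMAIL")

# The async client keeps the event loop free while OpenAI calls are in flight
openai_client = AsyncOpenAI(api_key=OPENAI_API_KEY)
sendgrid_client = SendGridAPIClient(SENDGRID_API_KEY)
```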
The AsyncOpenAI client allows your application to make non-blocking API calls, which keeps your application responsive while waiting for OpenAI's responses.
Defining data models with Pydantic
Pydantic models define the structure of data that flows through your application. These models provide type checking and validation, which helps catch errors early. Add these model definitions after your initialization code:
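Here's one possible shape for these models; the field names are illustrative and should match whatever your frontend sends and your report template reads:

```python
class VideoData(BaseModel):
    """Raw payload posted by the browser frontend."""
    frame: str                       # base64-encoded JPEG frame
    audio: Optional[str] = None      # base64-encoded audio clip, if any
    timestamp: Optional[str] = None


class CapturedData(BaseModel):
    """Output of Agent 1: validated media plus transcription."""
    frame: str
    transcription: str
    image_valid: bool
    timestamp: str


class AnalysisResult(BaseModel):
    """Output of Agent 2: the AI's findings."""
    priority_score: int = Field(..., ge=1, le=10)
    urgency_level: str
    visual_findings: str
    audio_summary: str
    recommendations: List[str] = []


class AnalysisReport(BaseModel):
    """Final output of the pipeline, returned to the frontend."""
    analysis: AnalysisResult
    transcription: str
    email_sent: bool
    timestamp: str
```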
Each model represents data at different stages of the pipeline. VideoData is what comes in from the frontend, CapturedData includes quality assessments, AnalysisResult contains the AI's findings, and AnalysisReport wraps everything together for the final output.
Implementing Agent 1: Data Capture Agent
The Data Capture Agent is responsible for receiving video frames and audio, then processing them into a format suitable for analysis. Create the DataCaptureAgent class:
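A minimal sketch of the class; the `.webm` suffix assumes the browser records audio with MediaRecorder's default format, so adjust it to match your frontend:

```python
class DataCaptureAgent:
    """Agent 1: validates captured media and transcribes audio."""

    async def process_audio(self, audio_base64: Optional[str]) -> str:
        """Decode base64 audio and transcribe it with Whisper."""
        if not audio_base64:
            return ""
        tmp_path = None
        try:
            audio_bytes = base64.b64decode(audio_base64)
            # Whisper needs a file, so write the clip to a temporary file
            with tempfile.NamedTemporaryFile(suffix=".webm", delete=False) as tmp:
                tmp.write(audio_bytes)
                tmp_path = tmp.name
            with open(tmp_path, "rb") as audio_file:
                transcript = await openai_client.audio.transcriptions.create(
                    model="whisper-1",
                    file=audio_file,
                )
            return transcript.text.strip()
        except Exception as exc:
            logger.error("Audio transcription failed: %s", exc)
            return ""
        finally:
            # Always remove the temporary file, even on failure
            if tmp_path and os.path.exists(tmp_path):
                os.unlink(tmp_path)
```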
The process_audio method takes base64-encoded audio data and uses OpenAI's Whisper model to convert speech to text. Whisper requires an audio file, so the code creates a temporary file, writes the decoded audio data to it, and then cleans up the file after transcription completes.
Add the image validation and main capture methods:
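These methods continue the `DataCaptureAgent` class. The 1 KB size check is an arbitrary sanity threshold, not a hard requirement:

```python
    def validate_image(self, frame_base64: str) -> bool:
        """Basic sanity check: the frame decodes and isn't near-empty."""
        try:
            image_bytes = base64.b64decode(frame_base64)
            return len(image_bytes) > 1000
        except Exception:
            return False

    async def capture_and_process(self, data: VideoData) -> CapturedData:
        """Orchestrate the capture workflow: validate, transcribe, package."""
        transcription = await self.process_audio(data.audio)
        return CapturedData(
            frame=data.frame,
            transcription=transcription or "No speech detected",
            image_valid=self.validate_image(data.frame),
            timestamp=data.timestamp or datetime.utcnow().isoformat(),
        )
```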
The validate_image method performs basic checks to ensure the image data is valid before processing. The capture_and_process method orchestrates the entire capture workflow and returns structured data ready for analysis.
Implementing Agent 2: Analysis Agent
The Analysis Agent uses GPT-4 Vision to examine both the visual content and transcribed audio. Create the AnalysisAgent class:
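One way to structure the class and its prompt; the JSON field names are assumptions chosen to match the report fields used later in this tutorial:

```python
class AnalysisAgent:
    """Agent 2: analyzes the frame and transcript with a vision model."""

    ANALYSIS_PROMPT = """You are a monitoring assistant. Analyze the attached
video frame together with this audio transcription:

"{transcription}"

Respond ONLY with JSON in exactly this shape:
{{
  "priority_score": <integer from 1 to 10>,
  "urgency_level": "<low|medium|high|critical>",
  "visual_findings": "<what you observe in the frame>",
  "audio_summary": "<what the audio indicates>",
  "recommendations": ["<action 1>", "<action 2>"]
}}"""
```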
The prompt instructs GPT-4 Vision on exactly what to analyze and how to format its response. By requesting JSON output, you can easily parse and validate the AI's response programmatically.
Add the analysis method that calls OpenAI's API:
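A sketch of the call, continuing the class. The model name is an assumption; substitute whichever vision-capable GPT-4 model your account has access to:

```python
    async def analyze_with_openai(self, captured: CapturedData) -> dict:
        """Send the frame and transcription to the vision model, parse JSON."""
        response = await openai_client.chat.completions.create(
            model="gpt-4o",  # use the vision-capable model available to you
            messages=[{
                "role": "user",
                "content": [
                    {"type": "text",
                     "text": self.ANALYSIS_PROMPT.format(
                         transcription=captured.transcription)},
                    {"type": "image_url",
                     "image_url": {
                         "url": f"data:image/jpeg;base64,{captured.frame}",
                         "detail": "high",  # examine the image carefully
                     }},
                ],
            }],
            max_tokens=500,
            temperature=0.2,  # low temperature keeps responses consistent
        )
        raw = response.choices[0].message.content.strip()
        # The model may wrap its JSON in a markdown code fence; remove it
        raw = raw.strip("`").strip()
        if raw.lower().startswith("json"):
            raw = raw[4:]
        return json.loads(raw)
```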
The analyze_with_openai method sends both the image and transcription to GPT-4 Vision. The detail: "high" parameter ensures the model examines the image carefully. The temperature of 0.2 keeps responses consistent and focused.
Add the priority assessment method:
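Continuing the class; the fallback values here are illustrative:

```python
    async def assess_priority(self, captured: CapturedData) -> AnalysisResult:
        """Run the analysis and package it, falling back to safe defaults."""
        try:
            findings = await self.analyze_with_openai(captured)
        except Exception as exc:
            logger.error("Analysis failed: %s", exc)
            findings = {}
        # Defaults keep the pipeline running if the response misses fields
        return AnalysisResult(
            priority_score=int(findings.get("priority_score", 1)),
            urgency_level=findings.get("urgency_level", "low"),
            visual_findings=findings.get("visual_findings", "No findings available"),
            audio_summary=findings.get("audio_summary", captured.transcription),
            recommendations=findings.get("recommendations", []),
        )
```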
This method takes the captured data, sends it to OpenAI for analysis, and packages the results into an AnalysisResult object. Default values ensure the application continues working even if the AI's response is missing some fields.
Implementing Agent 3: Report Agent
The Report Agent generates HTML email reports and sends them through SendGrid. Create the ReportAgent class:
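A trimmed-down sketch; the colors and layout are one possible styling, and the real template can be as elaborate as you like:

```python
class ReportAgent:
    """Agent 3: builds the HTML report and emails it via SendGrid."""

    PRIORITY_COLORS = {
        "low": "#2e7d32", "medium": "#f9a825",
        "high": "#e65100", "critical": "#c62828",
    }

    def generate_html_report(self, result: AnalysisResult,
                             transcription: str) -> str:
        """Render a color-coded HTML email body."""
        color = self.PRIORITY_COLORS.get(result.urgency_level, "#555555")
        items = "".join(f"<li>{r}</li>" for r in result.recommendations)
        return f"""
        <h2>AI Video Analysis Alert</h2>
        <p><span style="background:{color};color:#ffffff;
            padding:4px 10px;border-radius:4px;">
            {result.urgency_level.upper()} &ndash; {result.priority_score}/10
        </span></p>
        <h3>Visual Findings</h3><p>{result.visual_findings}</p>
        <h3>Audio Transcription</h3><p>{transcription}</p>
        <h3>Recommendations</h3><ul>{items}</ul>
        """
```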
The generate_html_report method creates a visually appealing email with color-coded priority badges and organized sections.
Add the email sending method:
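This continues the class, assuming the module-level SendGrid client and email addresses defined earlier:

```python
    def send_email(self, subject: str, html_content: str) -> bool:
        """Send the report via SendGrid; a 202 means it was accepted."""
        try:
            message = Mail(
                from_email=FROM_EMAIL,
                to_emails=ALERT_EMAIL,
                subject=subject,
                html_content=html_content,
            )
            response = sendgrid_client.send(message)
            return response.status_code == 202
        except Exception as exc:
            logger.error("Email delivery failed: %s", exc)
            return False
```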
SendGrid returns a status code of 202 when it successfully accepts an email for delivery. The method returns True when the email is sent successfully, allowing other parts of the application to track email status.
Complete the Report Agent with the generate_and_send_report method:
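The 7-or-higher threshold matches the alerting behavior described in the testing section below:

```python
    def generate_and_send_report(self, result: AnalysisResult,
                                 captured: CapturedData) -> bool:
        """Email a report only when the situation is high priority."""
        if result.priority_score < 7:
            return False
        subject = (f"[{result.urgency_level.upper()}] Video Analysis Alert "
                   f"- Priority {result.priority_score}/10")
        html = self.generate_html_report(result, captured.transcription)
        return self.send_email(subject, html)
```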
Connecting the agents with a workflow pipeline
Now create a workflow function that orchestrates all three agents. Add this code after your agent class definitions:
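A minimal sketch of the pipeline, here named `run_pipeline`:

```python
# One instance of each agent for the application's lifetime
capture_agent = DataCaptureAgent()
analysis_agent = AnalysisAgent()
report_agent = ReportAgent()


async def run_pipeline(data: VideoData) -> AnalysisReport:
    """Run the three agents in sequence: capture -> analyze -> report."""
    captured = await capture_agent.capture_and_process(data)   # Agent 1
    result = await analysis_agent.assess_priority(captured)    # Agent 2
    email_sent = report_agent.generate_and_send_report(result, captured)  # Agent 3
    return AnalysisReport(
        analysis=result,
        transcription=captured.transcription,
        email_sent=email_sent,
        timestamp=captured.timestamp,
    )
```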
This workflow function connects all three agents in sequence. Each agent's output feeds into the next agent, creating a clean data pipeline from raw video and audio to a final analysis report.
Creating FastAPI endpoints
Add the HTTP endpoints that handle requests from the frontend:
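A sketch of the endpoints. The root route serves the index.html frontend described below; this version of the health check only confirms configuration is present rather than making live API calls:

```python
@app.get("/")
async def index():
    """Serve the frontend page."""
    return FileResponse("index.html")


@app.post("/analyze")
async def analyze(data: VideoData) -> AnalysisReport:
    """Run incoming video and audio data through the agent pipeline."""
    try:
        return await run_pipeline(data)
    except Exception as exc:
        logger.error("Pipeline failed: %s", exc)
        raise HTTPException(status_code=500, detail="Analysis failed")


@app.get("/health")
async def health():
    """Report configuration status for the external services and agents."""
    return {
        "openai": "configured" if OPENAI_API_KEY else "missing",
        "sendgrid": "configured" if SENDGRID_API_KEY else "missing",
        "agents": {"capture": "ready", "analysis": "ready", "report": "ready"},
    }
```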
The /analyze endpoint receives video and audio data from the frontend and runs it through the entire multi-agent pipeline. The /health endpoint lets you verify that all services are configured correctly and can connect to the external APIs.
Adding the application entry point
Complete your main.py file with the code that starts the web server:
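```python
if __name__ == "__main__":
    import uvicorn

    # host="0.0.0.0" makes the server reachable from your local network
    uvicorn.run(app, host="0.0.0.0", port=8000)
```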
This code runs when you execute python main.py directly. It starts the Uvicorn web server on port 8000 and makes your application accessible from your local network.
Creating the frontend interface
Create an index.html file in your project root directory. This file provides the user interface for capturing video and audio. The complete frontend code includes JavaScript for accessing the camera and microphone, capturing frames and audio clips, and sending them to your FastAPI backend for analysis.
Rather than walking through all the HTML, CSS, and JavaScript in detail, you can download the complete frontend code from the GitHub repository. The frontend handles camera access, audio recording, real-time display of analysis results, and visual feedback for the multi-agent pipeline.
Running and testing the application
Start your application by running:
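```bash
python main.py
```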
Open your web browser and navigate to http://localhost:8000. When the page loads, click the Allow button when prompted to grant camera and microphone access.
Click Start AI Analysis to begin the real-time video and audio capture. The interface shows which agent is currently processing your data. Speak clearly into your microphone and ensure you have good lighting for the camera.
After a few seconds, you should see analysis results appear on the screen, including the priority score, assessment level, and AI-generated recommendations. If the priority score is 7 or higher, the system will automatically send an email report to the address you configured in your .env file.
Check your email inbox for the HTML report. The subject line includes the priority level and score, making it easy to identify urgent alerts. Open the email to see the detailed visual findings, audio transcription, and recommendations.
You can verify your services are working properly by visiting http://localhost:8000/health in your browser. This endpoint shows the connection status for OpenAI and SendGrid, along with the status of all three agents.
Troubleshooting
If you see a Camera Error message, ensure your browser has permission to access your camera. Check if another application is using your camera and close it. Try refreshing the page and granting permissions again.
If audio transcription shows unavailable, verify that your OpenAI API key is correct and that you have access to the Whisper model. Check your internet connection and ensure your microphone is working properly.
If analysis fails with an API request failed error, check your OpenAI account status and verify you have sufficient credits. Make sure you have access to GPT-4 Vision, which may require a paid account. Review the OpenAI rate limits to ensure you haven't exceeded them.
If email notifications aren't working, verify your SendGrid API key has Mail Send permissions. Ensure your sender email address is verified in SendGrid. Check your spam folder for test emails. Review the logs for specific error messages about email delivery.
Conclusion
You now have a working AI video analysis application that captures video and audio, analyzes it using OpenAI's advanced models, and sends email alerts through SendGrid. This multi-agent architecture provides a solid foundation for building more sophisticated monitoring systems.
You could extend this application by adding a database to store analysis history, implementing user authentication for multi-user access, or integrating with other communication channels beyond email. You might also explore analyzing longer video segments, adding custom alert rules based on specific keywords or visual patterns, or deploying the application to a cloud platform for remote access.
For more information about the OpenAI Vision API, check out the OpenAI Vision documentation. To learn more about audio transcription with Whisper, visit the Whisper API documentation. For advanced email features and best practices, explore the SendGrid documentation.
Jacob Muganda is a software engineer specializing in AI automation, cloud communications, and healthcare technology. He builds systems that leverage AI and real-time data to enhance telehealth and improve patient workflows.