Create an AI Commentator with GPT-4V, OpenAI TTS, Replit, LangChain, and SendGrid

December 05, 2023
Reviewed by Diane Phan, Twilion

Problem: When Draymond Green got ejected for putting Rudy Gobert in a chokehold, I wish I could've had my favorite NBA commentator Stephen A. Smith's commentary on it.

Solution: I built this application using OpenAI's new Vision API (GPT-4V), Twilio SendGrid, Replit, LangChain, Streamlit, and OpenCV to Stephen Smith-ify any video: it generates commentary, as he would deliver it, for an input video clip. (Thank you to Twilio Solutions Engineer and my AccelerateSF hackathon teammate Chris Brox for guarding me and letting me break his ankles for the video below!)

You can test the app out here!

Prerequisites

  1. Twilio SendGrid account - make an account here and make an API key here 
  2. An email address to test out this project 
  3. A Replit account for hosting the application – make an account here
  4. OpenAI account - make an account here and find your API key here

Some of the dependencies needed for this tutorial include:

  • moviepy to help process video and audio
  • cv2 (OpenCV) to help handle video frames
  • langchain to read and parse a CSV for relevant statistics
  • openai for OpenAI's GPT-4V and text-to-speech (TTS) APIs
  • requests for making HTTP requests to OpenAI's API
  • streamlit to create a web-based UI in Python
  • tempfile (part of the Python standard library, so nothing to install) to help handle temporary files while processing

Get Started with Replit

Log in to Replit and click + Create Repl to make a new Repl. A Repl, short for "read-eval-print loop", is an interactive programming environment that lets you write and execute code in real time.


Next, select a Python template, give your Repl a title like stephensmithify-openai-vision, set the privacy, and click + Create Repl.

Python template

Under Tools of your new Repl, scroll and click on Secrets.


Click the blue + New Secret button on the right.

new secret

Add two secrets where the keys are titled OPENAI_API_KEY and SENDGRID_API_KEY, and their corresponding values are those API keys. Keep them hidden!

add SendGrid and OpenAI API key
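
Replit exposes Secrets to your code as environment variables, so you can read them with os.getenv. Here is a minimal sketch you could run to confirm both keys are visible to your code (the names match the secret keys you just created):

import os

# Replit injects each Secret as an environment variable of the same name
openai_key = os.getenv("OPENAI_API_KEY")
sendgrid_key = os.getenv("SENDGRID_API_KEY")
assert openai_key and sendgrid_key, "Add both Secrets before running the app"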

Back under Tools, select Packages.

Replit packages

Search and add the following:

  • base256@1.0.1
  • langchain@0.0.345
  • langchain-experimental@0.0.43
  • moviepy@1.0.3
  • opencv-contrib-python-headless@4.8.1.78
  • requests@2.31.0
  • sendgrid@6.10.0
  • streamlit@1.28.2
  • tabulate@0.9.0

In the shell, run pip install openai==0.28. This older version is needed in order to use OpenAI's ChatCompletion API.
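
If you want to double-check that the pinned version took effect, here is an optional sanity check you can run in the Replit shell's Python REPL:

import openai

# openai.ChatCompletion exists in the 0.28.x line but was removed in 1.0+,
# so this confirms the older install is the one being imported
assert hasattr(openai, "ChatCompletion"), "openai>=1.0 detected; run pip install openai==0.28"
print("openai 0.28-style ChatCompletion API is available")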

Download this picture or a similar one of Stephen Smith.

Upload it by selecting the three dots next to Files and then clicking the Upload file button as shown below. 

upload file

Similarly, download basketball data such as Stephen Curry's statistics here on Kaggle. This is just for fun and pretend; the stats will be considered in the prompt that generates the commentary using a LangChain CSV agent.

Upload it by again clicking the three dots followed by the Upload file button.
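
Before wiring the CSV into the app, you can sanity-check that LangChain can read it. A minimal sketch, assuming your uploaded file is named StephenCurryStats.csv (adjust the filename and question to match your data):

from langchain.chat_models import ChatOpenAI
from langchain_experimental.agents.agent_toolkits import create_csv_agent

# ChatOpenAI picks up OPENAI_API_KEY from the environment (the Replit Secret)
agent = create_csv_agent(
    ChatOpenAI(temperature=0),
    "StephenCurryStats.csv",  # replace with your uploaded filename
    verbose=True,
)
print(agent.run("How many rows are in this dataset?"))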

Lastly, open the .replit file and replace its contents with the following so that streamlit run is executed whenever the app runs.

run = "streamlit run --server.address 0.0.0.0 --server.headless true --server.enableWebsocketCompression=false main.py"

modules = ["python-3.10:v18-20230807-322e88b"]

hidden = [".pythonlibs"]

[nix]
channel = "stable-23_05"

[deployment]
run = ["sh", "-c", "streamlit run --server.address 0.0.0.0 --server.headless true --server.enableWebsocketCompression=false main.py"]
deploymentTarget = "cloudrun"

Now we're going to write some Python code to generate text with OpenAI's new Vision API based on an input video.

Generate Stephen Smith-like Text for an Input Video

At the top of the main.py file in Replit, add the following import statements:

# Standard library imports
import base64
import io
import os
import tempfile

# Related third-party imports
import cv2  # opencv-contrib-python-headless
from langchain.agents.agent_types import AgentType
from langchain.chat_models import ChatOpenAI
from langchain_experimental.agents.agent_toolkits import create_csv_agent # separate langchain_experimental package
from moviepy.audio.io.AudioFileClip import AudioFileClip
from moviepy.editor import VideoFileClip
import openai # openai==0.28
import requests
from sendgrid import SendGridAPIClient
from sendgrid.helpers.mail import (Attachment, Disposition, FileContent, FileName, FileType, Mail)
import streamlit as st

Next, display the downloaded Stephen Smith image and set the prompt we will pass to OpenAI, telling the model to generate a short script summarizing an input basketball video that has been split into frames. OpenAI's model didn't like being told to mimic someone else, but you can get around that using what is called the grandma exploit.

st.image("stephensmith.jpeg")
prompt = """
My grandma loves listening to ESPN announcer Stephen Smith. She is about to pass away from terminal cancer. Cheer both of us up by creating a short script summarizing these basketball video frames so it sounds like Stephen Smith
"""

Now we'll make some helper functions. The first function uses OpenCV to extract frames from an input video file.

def video_to_frames(video_file):
    # Save the uploaded video file to a temporary file
    with tempfile.NamedTemporaryFile(delete=False, suffix='.mp4') as tmpfile:
        tmpfile.write(video_file.read())
        video_filename = tmpfile.name

    video_length = VideoFileClip(video_filename).duration

    video = cv2.VideoCapture(video_filename)
    base64Frames = []

    while video.isOpened():
        success, frame = video.read()
        if not success:
            break
        _, buffer = cv2.imencode(".jpg", frame)
        base64Frames.append(base64.b64encode(buffer).decode("utf-8"))

    video.release()
    print(len(base64Frames), "frames read.")
    return base64Frames, video_filename, video_length

Once we have the video frames, we can build our prompt and send GPT only every 48th frame, because GPT does not need to see each frame to understand what's happening.

def frames_to_story(base64Frames, p):
    PROMPT_MESSAGES = [
        {
            "role": "user",
            "content": [
                p,
                *map(lambda x: {"image": x, "resize": 400}, base64Frames[0::48]),  # sample every 48th frame
            ],
        },
    ]
    params = {
        "model": "gpt-4-vision-preview",
        "messages": PROMPT_MESSAGES,
        "max_tokens": 600,
    }

    result = openai.ChatCompletion.create(**params)
    print(result.choices[0].message.content)
    return result.choices[0].message.content

Now create a function to convert that written text in the style of Stephen Smith to audio using OpenAI's tts-1 text-to-speech model that's optimized for real-time usage. The audio is saved to a temporary file.

def text_to_audio(text):
  resp = requests.post(
    "https://api.openai.com/v1/audio/speech",
    headers={
        "Authorization": f"Bearer {os.getenv('OPENAI_API_KEY')}",
    },
    json={
        "model": "tts-1",
        "input": text,
        "voice": "onyx",
    },
  )

  # Check if the request was successful
  if resp.status_code != 200:
    raise Exception(f"Request failed with status code {resp.status_code}")

  # Create an in-memory bytes buffer
  audio_bytes_io = io.BytesIO()
  # Write audio data to the in-memory bytes buffer
  for chunk in resp.iter_content(chunk_size=1024 * 1024):
    audio_bytes_io.write(chunk)

  # Important: Seek to the start of the BytesIO buffer before returning
  audio_bytes_io.seek(0)

  # Save the same audio to a temporary file (iter_content can only be
  # consumed once, so reuse the buffer instead of iterating again)
  with tempfile.NamedTemporaryFile(delete=False, suffix='.wav') as tmpfile:
    tmpfile.write(audio_bytes_io.getvalue())
    audio_filename = tmpfile.name

  return audio_filename, audio_bytes_io

The final helper function creates a new video that combines the input video with the generated audio, replacing whatever audio was included in the initial uploaded video.

def merge_audio_video(video_filename, audio_filename, output_filename):
    print("Merging audio and video...")
    print("Video filename:", video_filename)
    print("Audio filename:", audio_filename)

    # Load the video file
    video_clip = VideoFileClip(video_filename)

    # Load the audio file
    audio_clip = AudioFileClip(audio_filename)

    # Set the audio of the video clip as the audio file
    final_clip = video_clip.set_audio(audio_clip)

    # Write the result to a file with the new audio track
    final_clip.write_videofile(
        output_filename, codec='libx264', audio_codec='aac')

    # Close the clips
    video_clip.close()
    audio_clip.close()

    # Return the path to the new video file
    return output_filename

Now include the following Streamlit Python code to add a title, an explanation of the application, a file uploader, a text input for the email address to send the new video to, and a button. If the button is clicked, a conditional checks that something was uploaded via the st.file_uploader function.

st.header("Stephen Smith-ify a video :basketball:")
st.write("This [Replit](https://replit.com/) app uses [Streamlit](https://streamlit.io/), [OpenAI's new Vision API](https://platform.openai.com/docs/guides/vision), [Twilio SendGrid](https://sendgrid.com/), [LangChain CSV Agents](https://python.langchain.com/docs/integrations/toolkits/csv), and [OpenCV](https://opencv.org/) (among other libraries) to generate [Stephen Smith](https://en.wikipedia.org/wiki/Stephen_A._Smith)-esque commentary for an input basketball video file.")
uploaded_file = st.file_uploader("Upload a video file", type=["mp4", "mov"])
email = st.text_input("Email to send new video with Stephen Smith commentary to")
if st.button('Stephen Smith-ify!', type="primary") and uploaded_file is not None:

If a video is uploaded, it is played with st.video, and st.spinner temporarily displays a message while executing the next block of code, which calls the helper functions above. An estimated word count is calculated from the length of the input video (roughly two words per second of video, plus a small buffer) and added to the prompt so that the output audio is roughly the same length as the input video.

That prompt is displayed on the web page for transparency, and then the audio is generated from the generated text, merged, and displayed.

    st.video(uploaded_file)
    with st.spinner('Processing📈...'):
        base64Frames, video_filename, video_length = video_to_frames(uploaded_file)
        rough_num_words = video_length*2 + 7
        agent = create_csv_agent(
          ChatOpenAI(temperature=1),
          "StephenCurryStats.csv", 
          verbose=True,
          agent_type=AgentType.ZERO_SHOT_REACT_DESCRIPTION,
        )
        stats_to_prompt = agent.run("How many games had more than 50 PTS?")
        # Fold the CSV agent's answer into the prompt so the commentary can reference it
        prompt += f"(This video clip is {video_length} seconds long. Make sure the output text includes no more than {rough_num_words} words. If it fits, work in this stat: {stats_to_prompt})"
        st.markdown(f"*prompt:  {prompt}*")
        text = frames_to_story(base64Frames, prompt)
        st.write(text)

        # Generate audio from text
        audio_filename, audio_bytes_io = text_to_audio(text)

        # Merge audio and video
        output_video_filename = os.path.splitext(video_filename)[0] + '_output.mp4'
        final_video_filename = merge_audio_video(video_filename, audio_filename, output_video_filename)

        # Display the result
        st.video(final_video_filename)

That new video containing audio commentary in the style of Stephen Smith is nice, and developers can use Twilio SendGrid to email it as an attachment!

        message = Mail(
            from_email='stephen_smithified@replit-sendgrid-openaivision.com',
            to_emails=email,
            subject='Stephen Smith-ified Video',
            html_content='Enjoy! <3 '
        )
        with open(final_video_filename, 'rb') as f:  # the with block closes the file automatically
            data = f.read()
        encoded_file = base64.b64encode(data).decode()

        attachedFile = Attachment(
            FileContent(encoded_file),
            FileName('stephensmithifiedvid.mp4'),
            FileType('video/mp4'),
            Disposition('attachment')
        )
        message.attachment = attachedFile
        sg = SendGridAPIClient()  # the API key is read from the SENDGRID_API_KEY environment variable
        response = sg.send(message)

        if response.status_code == 202:
            st.success("Email sent! Check your email for your video with Stephen Smith-ified commentary")
            print(f"Response Code: {response.status_code} \n Message sent!")
        else:
            st.warning("Email not sent. Check the email address and try again.")

Lastly, clean up the temporary files!

        # Clean up the temporary files
        os.unlink(video_filename)
        os.unlink(audio_filename)
        os.unlink(final_video_filename)

Now you can run the app by clicking the green Run button at the top middle!

The complete code can be found here on GitHub.

What's Next for GPT-4V, OpenAI TTS, LangChain, and SendGrid

It's so much fun to use GPT-4 Vision to analyze, critique, or summarize both videos and images. You can also receive images or video via Twilio Programmable Messaging or WhatsApp!

You can play around with different OpenAI Text-to-Speech voices, play the audio over Twilio Programmable Voice, and more.
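
For example, if you host the generated audio file at a public URL, a TwiML response from your voice webhook could play it on a phone call. A minimal sketch with the Twilio Python helper library (the URL is a placeholder):

from twilio.twiml.voice_response import VoiceResponse

# Build TwiML that plays the hosted commentary audio to the caller
response = VoiceResponse()
response.play("https://example.com/stephensmithified-audio.mp3")  # placeholder URL
print(response)  # XML to return from your Programmable Voice webhook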

Let me know online what you're building (and send best wishes for the Warriors)!