3 Ways of Querying Data using LangChain Agents in Python

July 11, 2023
Written by
Twilion
Reviewed by
Twilion

I decided to analyze data from my workouts to use it in a large language model (LLM) app and quickly learned there are multiple ways to query Strava data using LangChain–read on to learn 3 ways of querying Strava data using LangChain's OpenAPI, CSV, and Pandas Dataframe Agents in Python.

LangChain Agents

LangChain, a framework for building applications around LLMs, provides developers an interface for connecting and working with models and that data.

LangChain agents use an LLM to decide what actions to take and the order to take them in, making future decisions by iteratively observing the outcome of prior actions. Agents can be chained together and they can connect the LLM to external knowledge sources or tools for computation. If there is an error, it can attempt to fix it. With Agents, the model makes more decisions.

This tutorial will compare 3 agents: the

  • OpenAPI agent, which helps consume arbitrary APIs that conform to the OpenAPI/Swagger specification (in the case of this tutorial, the Strava API.) We'll need a spec file that allows you to describe the API we wish to consume, including available endpoints, authentication methods, etc.
  • CSV agent, an agent built on top of the Pandas DataFrame agent capable of querying structured data and question-answering over CSVs. It loads data from CSV files and can perform basic querying operations like selecting and filtering columns, sorting data, and querying based on a single condition.
  • Pandas DataFrame agent: an agent built on top of the Python agent capable of question-answering over Pandas dataframes, processing large datasets by loading data from Pandas dataframes, and performing advanced querying operations. It can group and aggregate data, filter data based on complex conditions, and join numerous dataframes.

To get started, make a new folder called Strava-LangChain and you will need the following prerequisites:

Prerequisites

  1. OpenAI Account – make an OpenAI account here
  2. Python installed - download Python here
  3. A Strava account - sign up for a Strava account if you don't have one already

If you're using a Unix or macOS system, open a terminal and enter the following commands:

mkdir strava-langchain
cd strava-langchain
python3 -m venv venv 
source venv/bin/activate 
!pip install langchain
!pip install openai
pip install load_dotenv

If you're following this tutorial on Windows, enter the following commands in a command prompt window:

mkdir strava-langchain
cd strava-langchain
python -m venv venv 
venv\Scripts\activate 
pip install langchain
pip install openai
pip install load_dotenv

Setup the Strava API

In order to use the Strava API, you need to create an app. If you’re reading this, you likely already have a Strava account but if not go ahead and create one now from the Prerequisites link above. Sign in to your Strava account and navigate to your API settings page. You can alternatively find that by selecting My API Application in the dropdown menu on the left of your regular account settings.

Strava My apps

You should now see the “My API Application” page. Fill it out accordingly:

Strava App Settings

Agree to Strava's API agreement and click Create. Yay! You have your first Strava application.

Strava API application with client id, client secret, token

Save your Client Secret as an environment variable called STRAVA_CLIENT_SECRET.

Strava Activity Webhook Authentication

Now make a file called strava-token.py and add the following code:

import requests
from dotenv import dotenv_values

config = dotenv_values(".env")

# Initial Settings
client_id = YOUR-CLIENT-ID
client_secret = config.get('STRAVA_CLIENT_SECRET')
redirect_uri = 'http://localhost/'

# Authorization URL
request_url = f'http://www.strava.com/oauth/authorize?client_id={client_id}' \
                  f'&response_type=code&redirect_uri={redirect_uri}' \
                  f'&approval_prompt=force' \
                  f'&scope=profile:read_all,activity:read_all'

# User prompt showing the Authorization URL asking for the code
print('Click here:', request_url)
print('Please authorize the app and copy/paste the generated code in the URL below')
code = input('Paste the code from the URL: ')

# Get the access token
token = requests.post(
    url='https://www.strava.com/api/v3/oauth/token',
    data= {
        'client_id': client_id,
        'client_secret': client_secret,
        'code': code,
        'grant_type': 'authorization_code'
        })

#print token in a variable
strava_token = token.json()
print('strava_token to replace in .env', strava_token['access_token'])

Run the file with the command python strava-token.py. Then click the link and in the webpage, click Authorize.

authorize strava app

Then grab the code from the generated URL (redirection URL) you set.

strava code in URL

Paste it in your terminal:

terminal asking you to paste code from browser

You should get your Access Token–save it as an environment variable STRAVA_TOKEN.

strava access token returned in terminal from previous command

Now let's look at three different LangChain Agents we can use to collect and work with the Strava data we can get now that we have an API token.

1. Query Strava Data from the Strava API with an OpenAPI Agent

We will interface with Strava's API via the OpenAPI Swagger, a set of open-source tools built around the OpenAPI Specification that can help developers work with and consume REST APIs.

LangChain lets you call an endpoint defined by the OpenAPI specification with purely natural language. They offer two ways of interacting with APIs:

  1. OpenAPIEndpointChain a chain that wraps a specific endpoint to provide a "Natural Language API"
  2. OpenAPI Planner/Executor Agent - an agent good for interacting with large specs

I found that the chain was correct less often than the agent–it didn't correct itself if it couldn't perform a task and outputted "The API response does not contain any information" more often–good thing LangChain is open source so we could dig into the code to see why!

In the current directory, add this code to a file called swagger.yml. This document provides information defining and describing the Strava API and its elements, like what endpoints developers can use.

At the top of a file called openai-agent.py, include the following code to import the following libraries, read the swagger.yml file to collect some OpenAPI specs, load your .env file, set your Strava token, and set the headers:

from langchain.agents.agent_toolkits.openapi.spec import reduce_openapi_spec
from langchain.requests import RequestsWrapper
from langchain.chat_models import ChatOpenAI
from langchain.agents.agent_toolkits.openapi import planner
import yaml
from dotenv import dotenv_values


with open("swagger.yml") as f:
    raw_openai_api_spec = yaml.load(f, Loader=yaml.Loader)
openai_api_spec = reduce_openapi_spec(raw_openai_api_spec)
config = dotenv_values(".env")
strava_token = config.get('STRAVA_TOKEN')
headers = {
    "Authorization": "Bearer {strava_token}".format(strava_token = strava_token),
    "Content-Type": "application/json"
}

Next get the API credentials, create our LLM, and ask it a question.

# Get API credentials.
requests_wrapper = RequestsWrapper(headers=headers)

llm = ChatOpenAI(openai_api_key=config.get('OPENAI_API_KEY'), model_name="gpt-4", temperature=0.0)
openai_agent = planner.create_openapi_agent(openai_api_spec, requests_wrapper, llm)

query = "how far was their last activity of type run? how many minutes did it take them?"
openai_agent.run(query)

gif of returned Strava data

The output is correct! When asked about my last run, the OpenAPI agent knows I want information about the one that happened most recently.

Now before we can compare this with querying data from a CSV Agent or a Pandas Dataframe Agent, we have to get our data into a CSV.

Make a CSV file of data from the Strava API

On the command line, run pip install requests pandas dotenv_values. In a new file called strava_csv_lc.py, include the following lines to import libraries needed to make a request to the Strava API, use a .env file, and make a CSV of your activities:

import requests
from dotenv import dotenv_values
config = dotenv_values(".env") 

The next code hits the Strava API's "/athlete/activities" endpoint and loops through four pages of Strava activities.

activities_url = "https://www.strava.com/api/v3/athlete/activities"

header = {'Authorization': 'Bearer ' + config.get('STRAVA_TOKEN')}
params = {'per_page': 200, 'page': 1} #max 200 per page, can only do 1 page at a time
my_dataset = requests.get(activities_url, headers=header, params=params).json() #activities
page = 0
for x in range(1,5): #loop through 4 pages of strava activities
    page +=1 
    params = {'per_page': 200, 'page': page}
    my_dataset += requests.get(activities_url, headers=header, params=params).json()   

We then only look at certain columns in the dataset, use json_normalize to quickly flatten the JSON into a DataFrame object, and then write the activities to a CSV file called activities.csv.

activities = pd.json_normalize(my_dataset)
cols = ['name', 'type', 'distance', 'moving_time', 'total_elevation_gain', 'start_date']
activities = activities[cols]
activities = activities[activities["start_date"].str.contains("2021") == False] #remove items from 2021, only include workouts from 2022 and 2023
activities.to_csv('activities.csv', index=False)

The CSV file called activities.csv is made! Let's now ask it questions with a CSV agent.

2. Query Strava Data with a CSV Agent

Add the following code to create a CSV agent and pass it the OpenAI model, and our CSV file of activities. Then run it and ask it questions about the data contained in the CSV file:

from langchain.agents import create_pandas_dataframe_agent, create_csv_agent
​​​​csv_agent = create_csv_agent(
    OpenAI(temperature=0),
   "activities.csv",
    verbose=True
)
csv_agent.run("how many runs has Lizzie done") #119 true
csv_agent.run("how many workouts of type ride has she done?") #248 -- get 0 if i ask "how many rides"
csv_agent.run("How many swims has she done?") # 5 correct
csv_agent.run("What was Lizzie's fastest half marathon run?")
csv_agent.run("What was her fastest run of at least 21.0975km?") 
csv_agent.run("How many walks with total elevation gain over 100 has she taken?") #50! true! 2nd time running: 24?

LangChain CSV agent chain coming to the answer of how many total runs and rides I"ve done

The chain does not take the steps from the string "half marathon" to "13.1 miles", but can find the fastest half marathon when given the specific distance (both 13.1 miles or 21.0975km.)

LangChain chain saying I have not done a half marathon before

Pretty good! Let's see how the Pandas Dataframe Agent does.

3. Query Strava Data with a Pandas Dataframe Agent

Run pip install pandas and import it:

import pandas as pd

At the top of the file import the following libraries and set the required OPENAI_API_KEY variable.

from langchain.agents import create_pandas_dataframe_agent, create_csv_agent
from langchain.agents.agent_types import AgentType
from langchain.llms import OpenAI
import os
os.environ["OPENAI_API_KEY"] = config.get('OPENAI_API_KEY')

Finally we can read our CSV file into a Pandas Dataframe object and initialize our agent.

data_df = pd.read_csv('activities.csv')
pd_agent = create_pandas_dataframe_agent(OpenAI(temperature=0), data_df, verbose=True)

Now run the agent asking it about previous Strava activities.

pd_agent.run("how far was their last run?")
pd_agent.run("how many runs have they done")
pd_agent.run("how many bike rides have they done? How many swims?")

LangChain chain showing steps the Agent took to get to the distance of last run

LangChain chain coming to the answer of how many total runs I"ve done

LangChain chain coming to the answer of how many total rides and swims I"ve done

Alas, my last run was not 8509.4km (but the last run in the Dataframe, which is the least recent, is 8509.4km! When asked for my most recent run, the distance is correct!), but the other information is correct! Moral of the story: be careful what you ask and how, and see the steps, thoughts, and observations the agent takes to get to its final answer.

Comparing these different ways of querying data

These agents CSV and Pandas Dataframes agents offer a new approach to querying data, differing from more traditional query languages. Instead of writing code to handle data, these agents let users ask questions via natural language and get answers more conversationally and quickly, no need for crafting complex queries.

However, we did see that this natural language approach sometimes doesn't handle more complex queries with multiple operations or steps and can sometimes be imprecise, leading to inaccurate results.

The outputs of the CSV agent and Pandas Dataframe agents are similar, which makes sense because both agents call the Pandas DataFrame agent under the hood, which in turn calls the Python agent. The CSV agent uses the Python agent to execute code but particularly utilizes the Pandas DataFrame agent to work with CSV files.

Ultimately, I think the Dataframe Agent would be better than the CSV Agent for most operations because it makes it easier for developers to perform operations on the data–a CSV doesn't provide the scientific data manipulation tools that Pandas does.

The OpenAPI Agent hits the API which would give more up-to-date data than a static CSV file. I found its answers to queries to be more accurate than those from the OpenAPI Chain. I thought this was because of proper parsing and ignoring of other things, and looking at LangChain's open source code for its OpenAPI Chain's LLM usage, a line that stands out to me is If the question does not seem related to the API, return I don't know. Do not make up an answer. Only use information provided by the tools to construct your response. It knows the base URL to use to make the request, but sometimes does not get the correct endpoint.

When asked to get the number of runs I've done, it wants to use "GET /athletes/{id}/activities endpoint" but that does not exist so it does not answer that query.

What's Next for Querying Data with LangChain in Python?

The complete code can be found on GitHub here. LangChain's OpenAPI, Pandas Dataframe, and CSV agents are powerful developer tools that offer builders efficient querying capabilities. Being able to chain different agents together to build more complex applications opens up a wide range of possibilities for data-driven applications. Stay tuned for more tutorials on the Twilio blog about querying data with LangChain, especially with APIs like Strava!