Access a Presto Database with Python and SQLAlchemy

December 02, 2022
Written by
Reviewed by

Access a Presto Database with Python and SQLAlchemy

At Twilio, I maintain a blog dashboard that allows my colleagues and I to perform analysis on traffic, conversions, and other metrics for articles published on this blog. I have written in the past about some of the database benchmarking techniques I’ve implemented for this project, and this generated interesting discussions on Hacker News. Today I will be discussing a topic that is a lot less controversial.

The charts and reports generated by this dashboard combine data from several sources that include web traffic, blog article metadata, and user information. The latter comes from a large internal company database hosted in a Presto cluster.

Presto is an open-source distributed query engine for big data created by Facebook. It allows users to query many different data sources, such as Hadoop, Kafka, MySQL, MongoDB, and many more using its own SQL implementation. This effectively abstracts the developer from query language differences among sources.

In this article I’m going to tell you how I enabled the Python back end of my dashboard application to communicate with Presto using the familiar SQLAlchemy library.

Python requirements

There are two Presto client libraries for Python, the official Presto-Python client and PyHive. At the time I’m writing this article, only PyHive has SQLAlchemy support, so that made it an easy choice for me, since I already use this library in my project.

If you are going to attempt to connect to your own Presto server, go ahead and install SQLAlchemy and PyHive in your virtual environment as follows:

pip install sqlalchemy "pyhive[presto]"

Create a connection

In SQLAlchemy, the first step in making a connection to a database is to create an engine for it. The syntax to create an engine is shown below:

from sqlalchemy import create_engine

engine = create_engine(database_url)

The database URL that is passed as an argument to the create_engine() function includes all the connection's parameters. For a Presto database, the URL has the following format:

presto://{username}:{password}@{hostname}:{port}/{catalog}/{schema}?protocol={protocol}

This URL above has a lot of placeholders for information that is going to be specific to your own Presto instance. Let’s review these placeholders:

  • username and password are the credentials that you will use to access the cluster.
  • hostname and port specify the server and port number on which Presto is listening for connections.
  • catalog and schema define the Presto database you intend to use. The catalog is associated with the data source. The schema has different meanings depending on the source, but is in general a method of referencing a collection of tables as a group.
  • protocol can be set to http or https, depending on the requirements of your Presto server.

To keep my scripts clean of sensitive information I prefer to configure the parameters of the connection in environment variables, which are then imported by the Python script. Below you can see an example get_presto_engine() function that returns the SQLAlchemy engine instance:

import os
from sqlalchemy import create_engine

def get_presto_engine():
    username = os.environ['PRESTO_USERNAME']
    password = os.environ['PRESTO_PASSWORD']
    hostname = os.environ['PRESTO_HOSTNAME']
    port = os.environ['PRESTO_PORT']
    catalog = os.environ['PRESTO_CATALOG']
    schema = os.environ['PRESTO_SCHEMA']
    protocol = os.environ['PRESTO_PROTOCOL']

    return create_engine(f'presto://{username}:{password}@{hostname}:{port}/{catalog}/{schema}?protocol={protocol}')

Note how I take advantage of Python’s f-strings to use the URL template from above directly in the code.

Access to database tables

Once you have the engine instance, you can create a MetaData object and add tables to it, which would allow you to start building queries.

To create table object that represent the actual tables in the database you have two options:

  • You can manually define Table objects with Column instances that match the Presto tables you are interested in querying.
  • You can have SQLAlchemy automatically “reflect” these tables for you.

The following example manually defines a pageviews table with five columns, then queries the database for the 10 most recent page views from it:

import os
from sqlalchemy import create_engine
from sqlalchemy import MetaData, Table, Column, BigInteger, String, DateTime
from sqlalchemy import select

def get_presto_engine():
    ... # implementation of this function as shown above

engine = get_presto_engine()
metadata = MetaData()

pageviews = Table(
    'pageviews', metadata,
    Column('user_id', BigInteger),
    Column('event_id', BigInteger),
    Column('session_id', BigInteger),
    Column('time', DateTime),
    Column('path', String),
)

stmt = select(pageviews).order_by(pageviews.c.time.desc()).limit(10)

with engine.connect() as conn:
    for row in conn.execute(stmt):
        print(row)

Be aware that this approach will only work when the columns defined in the Table instances actually match those in the Presto database. An interesting detail is that you do not need to define every column in the SQLAlchemy table, it is sufficient to define only the columns that you need for your queries.

For tables with lots of columns it may be tedious to manually replicate every column, so using table reflection can be more efficient. This is where SQLAlchemy automatically instantiates table objects based on information retrieved from the database. The example above will largely be the same under table reflection, the only change would be in how the Table object is defined:

pageviews = Table('pageviews', metadata, autoload_with=engine)

With this method SQLAlchemy uses the table name and the engine to retrieve the schema of the table, and then creates a corresponding table object and adds it to the metadata instance.

While table reflection sounds like a fantastic solution, I have had some issues with it. The pyhive package appears to not recognize certain column types reported by Presto, so it could not properly reflect some columns.

Integration with the SQLAlchemy ORM

If you’ve got this far, the next logical step is to see if Presto can be queried via the SQLAlchemy ORM.

And this actually worked as well. Instead of creating MetaData and Table instances, a declarative base and a model class can be defined:

import os
from sqlalchemy import create_engine
from sqlalchemy import Column, BigInteger, String, DateTime
from sqlalchemy.orm import declarative_base, Session
from sqlalchemy import select

def get_presto_engine():
    ... # implementation of this function as shown above

engine = get_presto_engine()
Base = declarative_base()

class PageView(Base):
    __tablename__ = 'pageviews'

    user_id = Column(BigInteger, primary_key=True)
    event_id = Column(BigInteger)
    session_id = Column(BigInteger)
    time = Column(DateTime)
    path = Column(String)

    def __repr__(self):
        return (f'PageView(user_id={self.user_id}, event_id={self.event_id}, '
                f'session_id={self.session_id}, time={self.time}, '
                f'path={self.path})')

stmt = select(PageView).order_by(PageView.time.desc()).limit(10)

with Session(engine) as session:
    for row in session.scalars(stmt):
        print(row)  # row is a PageView object now

With this approach, rows from the table are represented by instances from the PageView class. The class includes the __tablename__ attribute with a reference to the Presto table name. Database queries are issued through the Session object, which automatically maps between table rows and PageView objects.

You may have noticed that in the Table example I did not define any primary keys for the table, because my intention was to only read from this table. When using the ORM, SQLAlchemy requires a primary key, so I had to define it to avoid errors.

If you prefer to use table reflection, that also works with the ORM. Here is the PageView model class from the example above, defined with reflection:

class PageView(Base):
    __tablename__ = 'pageviews'
    __table__ = Table('pageviews', Base.metadata, autoload_with=engine)

Conclusion

And there you have it, now you know how to access your Presto cluster using the familiar SQLAlchemy library. I hope this helps you keep your big data needs Pythonic as much as it helped me!

Miguel Grinberg is a Principal Software Engineer for Technical Content at Twilio. Reach out to him at mgrinberg [at] twilio [dot] com if you have a cool project you’d like to share on this blog!