BigQuery Destination
Segment's BigQuery connector makes it easy to load web, mobile, and third-party source data like Salesforce, Zendesk, and Google AdWords into a BigQuery data warehouse. When you integrate BigQuery with Segment you get a fully managed data pipeline loaded into a powerful and cost-effective data warehouse.
The Segment warehouse connector runs a periodic ETL (Extract - Transform - Load) process to pull raw events and objects from your sources and load them into your BigQuery data warehouse. For more information about the ETL process, including how it works and common ETL use cases, refer to Google Cloud's ETL documentation.
View observability metrics with Delivery Overview
Delivery Overview, Segment's built-in observability tool, is now in public beta for storage destinations. For more information, see the Delivery Overview documentation.
To store your Segment data in BigQuery, complete the following steps:
To create a project and enable BigQuery:
- Navigate to the Google Developers Console.
- Configure the Google Cloud Platform:
- If you don't have a project already, create one.
- If you have an existing project, enable the BigQuery API. Once you've done so, you should see BigQuery in the "Resources" section of Cloud Platform.
- Copy the project ID. You'll need it when you create a warehouse source in the Segment app.
Enable billing
When you create your project, you must enable billing so Segment can write data to your BigQuery project.
To create a service account for Segment:
- Open the Google Developer Console, select the Navigation panel and navigate to IAM & admin > Service accounts.
- Click Create Service Account.
- Enter a name for the service account (for example, segment-warehouses) and click Create.
- Assign the service account the following roles: BigQuery Data Owner or BigQuery Data Editor, and BigQuery Job User.
- Create a JSON key. The downloaded file will be used to create your warehouse in the Segment app.
If you have trouble creating a new service account, refer to Google Cloud's documentation about service accounts for more information.
Unified warehouse credentials in public beta
With unified warehouse credentials you can create warehouse credentials and use them across Segment warehouse products. Segment is actively working on this feature. Some functionality may change before it becomes generally available.
- Navigate to the product area in the Segment app.
- For Storage destinations, navigate to Connections > Destinations and select the Storage tab. Click + Add storage destination.
- For Profiles Sync, navigate to Unify > Profiles Sync.
- Select BigQuery as your warehouse.
- Select an existing warehouse credential or create a new warehouse credential by completing the following fields for your BigQuery instance.
- Project ID: The Google Cloud project where your BigQuery database resides. Project IDs start with a letter and contain 6 – 63 lowercase letters, digits, or dashes. Some project IDs also include a domain name separated by a colon.
- Location: The region where you want your BigQuery datasets to reside.
- Service account credentials: The JSON credential that Segment will use to load data into your BigQuery database. Create a service account just for Segment to use.
- Test your connection.
- Click Save.
BigQuery datasets are broken down into tables and views. Tables may contain duplicate data; views do not.
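For example, this sketch (with placeholder project, dataset, and table names) shows how you might check a raw table for rows that share the same Segment id, which the corresponding _view would collapse to a single row:

```sql
-- Placeholder project, dataset, and table names; substitute your own.
-- A raw table can hold more than one row per Segment event id, while the
-- matching _view de-duplicates on id.
select id, count(*) as copies
from `my-project.my_source.pages`
group by id
having count(*) > 1
order by copies desc
limit 10
```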
The Segment connector uses partitioned tables. Partitioned tables allow you to query a subset of data, which increases query performance and decreases costs.
To query a full table, use the following command:
```sql
select *
from <project-id>.<source-name>.<collection-name>
```
To query a specific partitioned table, use the following command:
```sql
select *
from <project-id>.<source-name>.<collection-name>$20160809
```
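You can also limit a query to a range of partitions by filtering on the _PARTITIONTIME pseudo column, which is the same column the Segment views filter on. The identifiers below are placeholders for your own project, source, and collection:

```sql
-- Placeholder identifiers; substitute your project, source, and collection.
-- Filtering on the _PARTITIONTIME pseudo column prunes partitions, so only
-- the matching days are scanned and billed.
select *
from `my-project.my_source.pages`
where _PARTITIONTIME between timestamp('2016-08-01') and timestamp('2016-08-09')
```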
A view is a virtual table defined by a SQL query. Segment uses views in the de-duplication process to ensure that events that you are querying are unique events and contain the latest objects from third-party data. All Segment views are set up to show information from the last 60 days. Segment recommends querying from these views when possible to avoid duplicate events and historical objects.
Views are suffixed with _view, which you can query using this format:
```sql
select *
from <project-id>.<source-name>.<collection-name>_view
```
For early customers using BigQuery with Segment, rather than providing Segment with credentials, access was granted to a shared Service Account (connector@segment-1119.iam.gserviceaccount.com). While convenient for early adopters, this presented potential security risks.
Segment now requires BigQuery customers to create their own Service Accounts and provide the app with those credentials instead. In addition, any attempts to update warehouse connection settings will also require these credentials. This effectively deprecates the shared Service Account.
Migrate your warehouse from the shared Service Account to a dedicated Service Account by creating a new Service Account, following the steps in the Create a service account for Segment section. Then, head to your warehouse's connection settings and update them with the credentials you created. Once you've verified that data is loading properly into your warehouse, remove access to the shared Service Account.
Segment recommends enabling IP allowlists for added security. All Segment users with workspaces hosted in the US who use allowlists in their warehouses must update those allowlists to include the following ranges:
- 52.25.130.38/32
- 34.223.203.0/28
Users with workspaces in the EU must allowlist 3.251.148.96/29.
BigQuery charges based on the amount of data scanned by your queries. Views are derived from your tables and are what Segment uses to de-duplicate events. Segment therefore recommends you query a specific view whenever possible to avoid duplicate events and historical objects. It's important to note that BigQuery views aren't cached.
Understanding BigQuery views
BigQuery's views are logical views, not materialized views, which means that the query that defines the view is re-executed every time the view is queried. Queries are billed according to the total amount of data in all table fields referenced directly or indirectly by the top-level query.
To save money, you can query the view and set a destination table, and then query the destination table.
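As a minimal sketch of that pattern, using placeholder dataset and table names:

```sql
-- Placeholder names; adjust to your own project, dataset, and view.
-- Materialize the de-duplicated view once...
create or replace table `my-project.scratch.pages_deduped` as
select *
from `my-project.my_source.pages_view`;

-- ...then run follow-up queries against the copy instead of re-executing
-- the view's de-duplication query each time.
select count(*) as page_views
from `my-project.scratch.pages_deduped`;
```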
If you start exploratory data analysis with SELECT *, consider specifying the fields to reduce costs.
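For example, a query like the following (with hypothetical column and table names) only bills for the columns it references:

```sql
-- Hypothetical column and table names; BigQuery bills only for the columns a query reads.
select anonymous_id, context_page_path, received_at
from `my-project.my_source.pages_view`
where received_at > timestamp_sub(current_timestamp(), interval 7 day)
```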
Refer to the section on partitioned tables for details on querying sub-sets of tables.
Yes! You just need to modify one of the references to 60 in the view definition to the number of days of your choosing. You can update the definition of the view as long as the name stays the same.
Here is the base query Segment uses when first setting up your views. The placeholders (%s.%s.%s) should be replaced with your project, dataset, and table (in that order).
```sql
SELECT * EXCEPT (ROW_NUMBER) FROM (
  SELECT *, ROW_NUMBER() OVER (PARTITION BY id ORDER BY loaded_at DESC) ROW_NUMBER
  FROM `%s.%s.%s`
  WHERE _PARTITIONTIME BETWEEN
    TIMESTAMP_TRUNC(TIMESTAMP_MICROS(UNIX_MICROS(CURRENT_TIMESTAMP()) - 60 * 60 * 60 * 24 * 1000000), DAY, 'UTC')
    AND TIMESTAMP_TRUNC(CURRENT_TIMESTAMP(), DAY, 'UTC')
)
WHERE ROW_NUMBER = 1
```
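As a sketch, redefining a hypothetical pages_view to keep 90 days instead of 60 might look like the following. Only the day count changes, and the view name stays the same:

```sql
-- Placeholder project, dataset, and table names; keep the view's original name.
-- Only the first 60 (the number of days) changes; 60 * 60 * 24 is seconds per day
-- and 1000000 converts seconds to microseconds.
CREATE OR REPLACE VIEW `my-project.my_source.pages_view` AS
SELECT * EXCEPT (ROW_NUMBER) FROM (
  SELECT *, ROW_NUMBER() OVER (PARTITION BY id ORDER BY loaded_at DESC) ROW_NUMBER
  FROM `my-project.my_source.pages`
  WHERE _PARTITIONTIME BETWEEN
    TIMESTAMP_TRUNC(TIMESTAMP_MICROS(UNIX_MICROS(CURRENT_TIMESTAMP()) - 90 * 60 * 60 * 24 * 1000000), DAY, 'UTC')
    AND TIMESTAMP_TRUNC(CURRENT_TIMESTAMP(), DAY, 'UTC')
)
WHERE ROW_NUMBER = 1
```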
BigQuery offers both a scalable, pay-as-you-go pricing plan based on the amount of data scanned and a flat-rate monthly cost. You can learn more about BigQuery pricing on Google Cloud's BigQuery pricing page.
BigQuery allows you to set up Cost Controls and Alerts to help control and monitor costs. If you want to learn more about the costs associated with BigQuery, Google Cloud provides a calculator to estimate your costs.
You can connect a BI tool like Mode or Looker to BigQuery, or query directly from the BigQuery console.
BigQuery supports standard SQL, which you can enable using Google Cloud's query UI. This doesn't work with views, or with a query that uses table range functions.
Segment's connector doesn't support streaming inserts at this time. If you have a need for streaming data into BigQuery, contact Segment support.
Your data will be available in Warehouses between 24 and 48 hours from your first sync. Your warehouse then syncs once or twice a day depending on your Segment Plan.
Segment allows Business Tier (BT) customers to schedule the time and frequency of warehouse data syncs.
If you are on a BT plan, you can schedule warehouse syncs by going to Warehouse > Settings > Sync Schedule in the Segment web app. You can schedule up to the number of syncs allowed on your billing plan.
This behavior is expected. Segment only de-duplicates data in your views. Refer to the schema section for more details.
If you notice that you are missing older BigQuery data, it might be due to a dataset's default table expiration in BigQuery. The default table expiration sets a standard expiration on all partitioned tables that are created.
You can safely change the default table expiration to 'Never', which removes these expirations from the tables/dataset and changes the dataset's default table expiration. Segment can then run a backfill for you and send all of your historical data to your warehouse.