
Data Lakes Sync Reports and Errors


Plan availability: Business plans.

Segment Data Lakes generates reports with operational metrics about each sync to your data lake so you can monitor sync performance. These sync reports are stored in your S3 bucket and Glue Data Catalog. This means you have access to the raw data, so you can query it to answer questions and set up alerting and monitoring tools.


Sync Report schema


Your sync_reports table stores all of your sync data. You can query it to answer common questions about data synced to your data lake. The table has the following columns:

| Sync Metric | Description |
| --- | --- |
| workspace_id | Distinct ID assigned to each Segment workspace, found in the workspace settings. |
| source_id | Distinct ID assigned to each Segment source, found in Source Settings > API Keys > Source ID. |
| database | Name of the Glue database used to store sync report tables. Segment automatically creates this database during the Data Lakes setup process. |
| emr_cluster_id | ID of the EMR cluster that Data Lakes uses, found on the Data Lakes settings page. |
| s3_bucket | Name of the S3 bucket that Data Lakes uses, found on the Data Lakes settings page. |
| run_id | ID dynamically generated and assigned to each Data Lakes sync run. |
| start_time | Time the sync run started, in UTC. |
| finish_time | Time the sync run finished, in UTC. |
| duration_mins | Length of the sync in minutes, calculated as the difference between the start and finish times. |
| status | Status of the sync: finished for a successful sync or failed for a failed sync. |
| error_code | Type of error, which can include: insufficient permissions, invalid settings, or a Segment internal error. |
| error | If the sync failed, the error that describes the issue, for example "External ID is invalid". |
| table_name | Name of the Segment event table synced to S3. |
| row_count | Number of rows synced to S3 for a specific run. |
| partitions | Partitions added to the event tables during the sync. |
| new_columns | New columns inferred and added to event tables during the sync. |
| day | Day on which the sync occurred. |
| type | Whether the run metrics are at the source or event level. If type = source, the row aggregates data for syncs across all events within the source. If type = event, the row shows detailed sync metrics per event. |
| replay | True or false value that indicates whether the sync run was a replay. |
| replay_from | Start date for the replay, if applicable. |
| replay_to | Finish date for the replay, if applicable. |

The Glue database named __segment_datalake stores the sync_reports table, which has the following schema:

| Column name | Data type | Partition key |
| --- | --- | --- |
| type | string | |
| workspace_id | string | |
| run_id | string | |
| start_time | timestamp | |
| finish_time | timestamp | |
| duration_mins | bigint | |
| status | string | |
| error | string | |
| error_code | string | |
| table_name | string | |
| database | string | |
| partitions | array | |
| new_columns | array | |
| row_count | bigint | |
| is_new | boolean | |
| replay | boolean | |
| replay_from | timestamp | |
| replay_to | timestamp | |
| emr_cluster_id | string | |
| s3_bucket | string | |
| source_id | string | Partition (0) |
| day | string | Partition (1) |

The sync_reports table is available in S3 and Glue only once a sync completes. Sync reports are not available for syncs in progress.


Accessing sync reports

Data Lakes sync reports are stored in Glue and in S3.

Segment automatically creates a Glue Database and table when you set up Data Lakes to store all sync report tables. The Glue Database is named __segment_datalake, and the table is named sync_reports.

The S3 structure is: s3://my-bucket/segment-data/reports/day=YYYY-MM-DD/source=$SOURCE_ID/run_id=$RUN_ID/report.json
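Given that layout, the key for a particular report can be derived from the day, source ID, and run ID. A small sketch in Python (the IDs below are taken from the sample reports later on this page and are illustrative only):

```python
from datetime import date


def report_key(day: date, source_id: str, run_id: str) -> str:
    """Build the S3 key for a sync report, following the
    segment-data/reports layout described above."""
    return (
        f"segment-data/reports/day={day.isoformat()}"
        f"/source={source_id}/run_id={run_id}/report.json"
    )


# Illustrative IDs, matching the sample reports below.
key = report_key(date(2020, 8, 19), "9IP56Shn6", "1597581273464733073")
print(key)
```

Prefix this key with `s3://my-bucket/` (your configured bucket) to get the full object path.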


Sync report format

Sync report data is stored in JSON format so that it is human-readable and can be processed by other systems.

Each table involved in the sync is a separate JSON object that contains the sync metrics for the data loaded to that table.
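Because each table's metrics are a separate JSON object, a report file can contain several objects back to back rather than a single JSON array, so a plain json.loads over the whole file fails. One way to split them, a sketch using only the Python standard library's json.JSONDecoder.raw_decode:

```python
import json


def parse_report(raw: str) -> list:
    """Split a file of concatenated JSON objects into a list of dicts,
    skipping the whitespace that separates the objects."""
    decoder = json.JSONDecoder()
    objects, idx = [], 0
    while idx < len(raw):
        # Skip whitespace/newlines between objects.
        while idx < len(raw) and raw[idx].isspace():
            idx += 1
        if idx >= len(raw):
            break
        obj, end = decoder.raw_decode(raw, idx)
        objects.append(obj)
        idx = end
    return objects


# Two back-to-back objects, abbreviated from the sample reports below.
sample = '{"type": "source", "row_count": 81020}\n{"type": "event", "row_count": 20020}'
reports = parse_report(sample)
print([r["type"] for r in reports])  # ['source', 'event']
```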

The example below shows the raw JSON objects from a successful sync report: one source-level object followed by one object per event table.

```json
{
  "type": "source",
  "workspace_id": "P3IMS7SBDH",
  "source_id": "9IP56Shn6",
  "run_id": "1597581273464733073",
  "start_time": "2020-08-19 22:15:59.044084423",
  "finish_time": "2020-08-19 22:18:12.891",
  "duration_mins": 2,
  "status": "finished",
  "table_name": "",
  "database": "ios_prod",
  "row_count": 81020,
  "emr_cluster_id": "j-3SXSUSDNPIS",
  "s3_bucket": "my-segment-datalakes-bucket"
}
{
  "type": "event",
  "workspace_id": "P3IMS7SBDH",
  "source_id": "9IP56Shn6",
  "run_id": "1597581273464733073",
  "start_time": "2020-08-19 22:15:59.044084423",
  "finish_time": "2020-08-19 22:18:12.891",
  "duration_mins": 2,
  "status": "finished",
  "table_name": "track_order_completed",
  "database": "ios_prod",
  "partitions": [
    {
      "day": "2020-08-16",
      "hr": "10"
    },
    {
      "day": "2020-08-16",
      "hr": "11"
    }
  ],
  "new_columns": [
    {
      "name": "properties_billing_address",
      "type": "string"
    }
  ],
  "row_count": 20020,
  "emr_cluster_id": "j-3SXSUSDNPIS",
  "s3_bucket": "my-segment-datalakes-bucket"
}
{
  "type": "event",
  "workspace_id": "P3IMS7SBDH",
  "source_id": "9IP56Shn6",
  "run_id": "1597581273464733073",
  "start_time": "2020-08-19 22:15:59.044084423",
  "finish_time": "2020-08-19 22:18:12.891",
  "duration_mins": 2,
  "status": "finished",
  "table_name": "track_product_added",
  "database": "ios_prod",
  "partitions": [
    {
      "day": "2020-08-16",
      "hr": "10"
    }
  ],
  "row_count": 20260,
  "emr_cluster_id": "j-3SXSUSDNPIS",
  "s3_bucket": "my-segment-datalakes-bucket"
}
```

The example below shows the raw JSON object for a failed sync report.

```json
{
  "type": "source",
  "workspace_id": "P3IMS7SBDH",
  "source_id": "9IP56Shn6",
  "run_id": "1597867438900010296",
  "start_time": "2020-08-19 20:04:58.368616813",
  "finish_time": "2020-08-19 20:49:48.308318686",
  "duration_mins": 44,
  "status": "failed",
  "error": "Data Lakes Destination has invalid configuration for \"AWS Role ARN\": field is required.",
  "error_code": "Segment.Internal",
  "table_name": "",
  "database": "ios_prod",
  "emr_cluster_id": "j-3SXSUSDNPIS",
  "s3_bucket": "segment-datalakes-demo-stage"
}
```
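Once reports are parsed into dicts, flagging failed syncs is straightforward: check the status field and surface the error and error_code fields from the schema above. A minimal sketch (the summary string format is made up for illustration):

```python
def failed_syncs(reports: list) -> list:
    """Return a one-line summary for each failed sync report."""
    summaries = []
    for r in reports:
        if r.get("status") == "failed":
            summaries.append(
                f"run {r.get('run_id')}: "
                f"[{r.get('error_code', 'unknown')}] {r.get('error', '')}"
            )
    return summaries


# Abbreviated report objects, mirroring the examples above.
reports = [
    {"run_id": "1597581273464733073", "status": "finished"},
    {
        "run_id": "1597867438900010296",
        "status": "failed",
        "error_code": "Segment.Internal",
        "error": 'Data Lakes Destination has invalid configuration for "AWS Role ARN": field is required.',
    },
]
print(failed_syncs(reports))
```

A summary like this could feed whatever alerting channel you already use.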

Querying the Sync Reports table


You can use SQL to query your Sync Reports table to explore and analyze operational sync metrics. A few helpful and commonly used queries are included below.

Return row counts per day for a specific event

```sql
SELECT day, sum(row_count)
FROM "__segment_datalake"."sync_reports"
WHERE source_id = '9IP56Shn6' AND table_name = 'checkout_started'
GROUP BY day
ORDER BY day
```

Return row counts per day for all events in the source

```sql
SELECT day, table_name, sum(row_count)
FROM "__segment_datalake"."sync_reports"
WHERE source_id = '9IP56Shn6' AND type = 'event'
GROUP BY day, table_name
ORDER BY day
```

Find the most recent successful sync

```sql
SELECT max(finish_time)
FROM "__segment_datalake"."sync_reports"
WHERE source_id = '9IP56Shn6' AND status = 'finished' AND date(day) = CURRENT_DATE
LIMIT 1
```

Find all failures in the last N days

```sql
SELECT run_id, status, error, error_code
FROM "__segment_datalake"."sync_reports"
WHERE source_id = '9IP56Shn6' AND status = 'failed' AND date(day) >= (CURRENT_DATE - interval '2' day)
```
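The most-recent-successful-sync query lends itself to a simple freshness alert: if the newest finish_time is older than some threshold, notify someone. A sketch of the comparison logic in Python (the 24-hour threshold is an arbitrary example, and the timestamps come from the sample reports above):

```python
from datetime import datetime, timedelta


def sync_is_stale(last_finish: datetime, now: datetime,
                  max_age: timedelta = timedelta(hours=24)) -> bool:
    """Return True when the most recent successful sync finished
    longer ago than the allowed maximum age."""
    return now - last_finish > max_age


# finish_time values are reported in UTC, e.g. "2020-08-19 22:18:12.891".
last = datetime.strptime("2020-08-19 22:18:12.891", "%Y-%m-%d %H:%M:%S.%f")
print(sync_is_stale(last, datetime(2020, 8, 21, 0, 0)))  # more than 24h later
```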

Sync errors

The following error types can cause your data lake syncs to fail:

  • Insufficient permissions - Segment does not have the permissions necessary to perform a critical operation. You must grant Segment additional permissions.
  • Invalid settings - The settings are invalid. This could be caused by a missing required field, or a validation check that fails. The invalid setting must be corrected before the sync can succeed.
  • Internal error - An error occurred in Segment's internal systems. This should resolve on its own. Contact the Segment Support team if the sync failure persists.

Insufficient permissions


If Data Lakes does not have the correct access permissions for S3, Glue, and EMR, your syncs will fail.

If permissions are the problem, you might see one of the following permissions-related error messages:

  • "Segment was unable to upload staging data to your S3 Bucket due to a lack of sufficient permissions".
  • "Segment does not have permissions to download object from S3 Bucket".
  • "Segment does not have permissions to upload object to S3 Bucket".
  • "Segment does not have permissions to delete S3 objects from S3 Bucket".
  • "Segment does not have permissions to submit an EMR job to cluster".
  • "Segment does not have permissions to check the status of EMR Job on EMR Cluster".
  • "Segment does not have permissions to delete table from Glue Catalog".
  • "Segment does not have permissions to fetch schema information from Glue catalog".

Check the setup guide to ensure that you configured the required permissions for S3, Glue, and EMR.

Invalid settings

One or more settings might be incorrectly configured in the Segment app, preventing your Data Lakes syncs from succeeding.

If you have invalid settings, you might see one of the error messages below:

  • "Data Lakes Destination has invalid configuration."
  • "The Table Partitions configuration for this Data Lake is invalid. The field name does not appear to map to the data being processed, which likely means it is misconfigured."
  • "External ID is invalid. Please ensure the external ID in the IAM role used to connect to your Data Lake matches the source ID."
  • "External ID is not set. Please ensure that the IAM role used to connect to your Data Lake has the source ID in the list of external IDs."

The most common error occurs when you do not list all Source IDs in the External ID section of the IAM role. You can find your Source IDs in the Segment workspace, and you must add each one to the list of External IDs in the IAM policy. You can either update the IAM policy from the AWS Console, or re-run the Data Lakes setup Terraform job.
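For reference, the external IDs live in the trust policy of the IAM role that Segment assumes. A sketch of what that trust relationship looks like; the principal ARN and source IDs below are placeholders, not Segment's real values:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "AWS": "arn:aws:iam::123456789012:root" },
      "Action": "sts:AssumeRole",
      "Condition": {
        "StringEquals": {
          "sts:ExternalId": ["SOURCE_ID_1", "SOURCE_ID_2"]
        }
      }
    }
  ]
}
```

Each Segment source that syncs to the data lake needs its Source ID in the sts:ExternalId list.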

Internal errors

Internal errors occur in Segment's internal systems and should resolve on their own. If sync failures persist, contact the Segment Support team.


How are Data Lakes sync reports different from the sync data for Segment Warehouses?


Both Warehouses and Data Lakes provide similar information about syncs, including the start and finish time, rows synced, and errors.

However, Warehouse sync information is only available in the Segment app, on the Sync History and Warehouse Health pages. With Data Lakes sync reports, the raw sync information is sent directly to your data lake, so you can query the raw data to answer your own questions about syncs and use it to power alerting and monitoring tools.

What happens if a sync is partly successful?


Sync reports are generated only when a sync completes or fails. Partial failure reporting is not currently supported.