
Data Lakes Sync Reports and Errors


Plan availability: Business plans.

Segment Data Lakes generates reports with operational metrics about each sync to your data lake so you can monitor sync performance. These sync reports are stored in your S3 bucket and Glue Data Catalog. This means you have access to the raw data, so you can query it to answer questions and set up alerting and monitoring tools.


Sync Report schema


Your sync_reports table stores all of your sync data. You can query it to answer common questions about data synced to your data lake. The table has the following columns:

| Sync Metric | Description |
| --- | --- |
| workspace_id | Distinct ID assigned to each Segment workspace, found in the workspace settings. |
| source_id | Distinct ID assigned to each Segment source, found in Source Settings > API Keys > Source ID. |
| database | Name of the Glue database used to store sync report tables. Segment automatically creates this database during the Data Lakes setup process. |
| emr_cluster_id | ID of the EMR cluster that Data Lakes uses, found on the Data Lakes settings page. |
| s3_bucket | Name of the S3 bucket that Data Lakes uses, found on the Data Lakes settings page. |
| run_id | ID dynamically generated and assigned to each Data Lakes sync run. |
| start_time | Time the sync run started, in UTC. |
| finish_time | Time the sync run finished, in UTC. |
| duration_mins | Length of the sync in minutes, calculated as the difference between the start and finish times. |
| status | Status of the sync: finished for a successful sync or failed for a failed sync. |
| error_code | Type of error, which can include: insufficient permissions, invalid settings, or a Segment internal error. |
| error | If the sync failed, the error that describes the issue, for example "External ID is invalid". |
| table_name | Name of the Segment event table synced to S3. |
| row_count | Number of rows synced to S3 for a specific run. |
| partitions | Partitions added to the event tables during the sync. |
| new_columns | New columns inferred and added to event tables during the sync. |
| day | Day on which the sync occurred. |
| type | Whether the run metrics are at the source or event level. If type = source, the row aggregates data for syncs across all events within the source. If type = event, the row shows detailed sync metrics per event. |
| replay | True or false value that indicates whether the sync run was a replay. |
| replay_from | Start date for the replay, if applicable. |
| replay_to | Finish date for the replay, if applicable. |

The Glue database named __segment_datalake stores the sync_reports table, which has the following schema:

| Column name | Data type | Partition key |
| --- | --- | --- |
| type | string | |
| workspace_id | string | |
| run_id | string | |
| start_time | timestamp | |
| finish_time | timestamp | |
| duration_mins | bigint | |
| status | string | |
| error | string | |
| error_code | string | |
| table_name | string | |
| database | string | |
| partitions | array | |
| new_columns | array | |
| row_count | bigint | |
| is_new | boolean | |
| replay | boolean | |
| replay_from | timestamp | |
| replay_to | timestamp | |
| emr_cluster_id | string | |
| s3_bucket | string | |
| source_id | string | Partition (0) |
| day | string | Partition (1) |

The sync_reports table is available in S3 and Glue only once a sync completes. Sync reports are not available for syncs in progress.


Accessing sync reports

Data Lakes sync reports are stored in Glue and in S3.

Segment automatically creates a Glue Database and table when you set up Data Lakes to store all sync report tables. The Glue Database is named __segment_datalake, and the table is named sync_reports.

The S3 structure is: s3://my-bucket/segment-data/reports/day=YYYY-MM-DD/source=$SOURCE_ID/run_id=$RUN_ID/report.json
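Given that layout, the key for a particular report can be derived from the day, source ID, and run ID. A small sketch in Python (the IDs below are taken from the sample reports later on this page and are illustrative only):

```python
from datetime import date


def report_key(day: date, source_id: str, run_id: str) -> str:
    """Build the S3 key for a sync report, following the
    segment-data/reports layout described above."""
    return (
        f"segment-data/reports/day={day.isoformat()}"
        f"/source={source_id}/run_id={run_id}/report.json"
    )


# Illustrative IDs, matching the sample reports below.
key = report_key(date(2020, 8, 19), "9IP56Shn6", "1597581273464733073")
print(key)
```

Prefix this key with `s3://my-bucket/` (your configured bucket) to get the full object path.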


Sync report format

Sync report data is stored in JSON format so that it is human-readable and can be processed by other systems.

Each table involved in the sync is a separate JSON object that contains the sync metrics for the data loaded to that table.
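Because each table's metrics are a separate JSON object, a report file can contain several objects back to back rather than a single JSON array, so a plain json.loads over the whole file fails. One way to split them, a sketch using only the Python standard library's json.JSONDecoder.raw_decode:

```python
import json


def parse_report(raw: str) -> list:
    """Split a file of concatenated JSON objects into a list of dicts,
    skipping the whitespace that separates the objects."""
    decoder = json.JSONDecoder()
    objects, idx = [], 0
    while idx < len(raw):
        # Skip whitespace/newlines between objects.
        while idx < len(raw) and raw[idx].isspace():
            idx += 1
        if idx >= len(raw):
            break
        obj, end = decoder.raw_decode(raw, idx)
        objects.append(obj)
        idx = end
    return objects


# Two back-to-back objects, abbreviated from the sample reports below.
sample = '{"type": "source", "row_count": 81020}\n{"type": "event", "row_count": 20020}'
reports = parse_report(sample)
print([r["type"] for r in reports])  # ['source', 'event']
```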

The example below shows the raw JSON objects from a successful sync report: one source-level object followed by one object per event table.

```json
{
  "type": "source",
  "workspace_id": "P3IMS7SBDH",
  "source_id": "9IP56Shn6",
  "run_id": "1597581273464733073",
  "start_time": "2020-08-19 22:15:59.044084423",
  "finish_time": "2020-08-19 22:18:12.891",
  "duration_mins": 2,
  "status": "finished",
  "table_name": "",
  "database": "ios_prod",
  "row_count": 81020,
  "emr_cluster_id": "j-3SXSUSDNPIS",
  "s3_bucket": "my-segment-datalakes-bucket"
}
{
  "type": "event",
  "workspace_id": "P3IMS7SBDH",
  "source_id": "9IP56Shn6",
  "run_id": "1597581273464733073",
  "start_time": "2020-08-19 22:15:59.044084423",
  "finish_time": "2020-08-19 22:18:12.891",
  "duration_mins": 2,
  "status": "finished",
  "table_name": "track_order_completed",
  "database": "ios_prod",
  "partitions": [
    {
      "day": "2020-08-16",
      "hr": "10"
    },
    {
      "day": "2020-08-16",
      "hr": "11"
    }
  ],
  "new_columns": [
    {
      "name": "properties_billing_address",
      "type": "string"
    }
  ],
  "row_count": 20020,
  "emr_cluster_id": "j-3SXSUSDNPIS",
  "s3_bucket": "my-segment-datalakes-bucket"
}
{
  "type": "event",
  "workspace_id": "P3IMS7SBDH",
  "source_id": "9IP56Shn6",
  "run_id": "1597581273464733073",
  "start_time": "2020-08-19 22:15:59.044084423",
  "finish_time": "2020-08-19 22:18:12.891",
  "duration_mins": 2,
  "status": "finished",
  "table_name": "track_product_added",
  "database": "ios_prod",
  "partitions": [
    {
      "day": "2020-08-16",
      "hr": "10"
    }
  ],
  "row_count": 20260,
  "emr_cluster_id": "j-3SXSUSDNPIS",
  "s3_bucket": "my-segment-datalakes-bucket"
}
```

The example below shows the raw JSON object for a failed sync report.

```json
{
  "type": "source",
  "workspace_id": "P3IMS7SBDH",
  "source_id": "9IP56Shn6",
  "run_id": "1597867438900010296",
  "start_time": "2020-08-19 20:04:58.368616813",
  "finish_time": "2020-08-19 20:49:48.308318686",
  "duration_mins": 44,
  "status": "failed",
  "error": "Data Lakes Destination has invalid configuration for \"AWS Role ARN\": field is required.",
  "error_code": "Segment.Internal",
  "table_name": "",
  "database": "ios_prod",
  "emr_cluster_id": "j-3SXSUSDNPIS",
  "s3_bucket": "segment-datalakes-demo-stage"
}
```
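Once reports are parsed into dicts, flagging failed syncs is straightforward: check the status field and surface the error and error_code fields from the schema above. A minimal sketch (the summary string format is made up for illustration):

```python
def failed_syncs(reports: list) -> list:
    """Return a one-line summary for each failed sync report."""
    summaries = []
    for r in reports:
        if r.get("status") == "failed":
            summaries.append(
                f"run {r.get('run_id')}: "
                f"[{r.get('error_code', 'unknown')}] {r.get('error', '')}"
            )
    return summaries


# Abbreviated report objects, mirroring the examples above.
reports = [
    {"run_id": "1597581273464733073", "status": "finished"},
    {
        "run_id": "1597867438900010296",
        "status": "failed",
        "error_code": "Segment.Internal",
        "error": 'Data Lakes Destination has invalid configuration for "AWS Role ARN": field is required.',
    },
]
print(failed_syncs(reports))
```

A summary like this could feed whatever alerting channel you already use.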

Querying the Sync Reports table


You can use SQL to query your Sync Reports table to explore and analyze operational sync metrics. A few helpful and commonly used queries are included below.

Return row counts per day for a specific event

```sql
SELECT day, sum(row_count)
FROM "__segment_datalake"."sync_reports"
WHERE source_id = '9IP56Shn6' AND table_name = 'checkout_started'
GROUP BY day
ORDER BY day
```

Return row counts per day for all events in the source

```sql
SELECT day, table_name, sum(row_count)
FROM "__segment_datalake"."sync_reports"
WHERE source_id = '9IP56Shn6' AND type = 'event'
GROUP BY day, table_name
ORDER BY day
```

Find the most recent successful sync

```sql
SELECT max(finish_time)
FROM "__segment_datalake"."sync_reports"
WHERE source_id = '9IP56Shn6' AND status = 'finished' AND date(day) = CURRENT_DATE
LIMIT 1
```

Find all failures in the last N days

```sql
SELECT run_id, status, error, error_code
FROM "__segment_datalake"."sync_reports"
WHERE source_id = '9IP56Shn6' AND status = 'failed' AND date(day) >= (CURRENT_DATE - interval '2' day)
```
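The most-recent-successful-sync query lends itself to a simple freshness alert: if the newest finish_time is older than some threshold, notify someone. A sketch of the comparison logic in Python (the 24-hour threshold is an arbitrary example, and the timestamps come from the sample reports above):

```python
from datetime import datetime, timedelta


def sync_is_stale(last_finish: datetime, now: datetime,
                  max_age: timedelta = timedelta(hours=24)) -> bool:
    """Return True when the most recent successful sync finished
    longer ago than the allowed maximum age."""
    return now - last_finish > max_age


# finish_time values are reported in UTC, e.g. "2020-08-19 22:18:12.891".
last = datetime.strptime("2020-08-19 22:18:12.891", "%Y-%m-%d %H:%M:%S.%f")
print(sync_is_stale(last, datetime(2020, 8, 21, 0, 0)))  # more than 24h later
```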

Sync errors

The following error types can cause your data lake syncs to fail:

  • Insufficient permissions - Segment does not have the permissions necessary to perform a critical operation. You must grant Segment additional permissions.
  • Invalid settings - The settings are invalid. This could be caused by a missing required field, or a validation check that fails. The invalid setting must be corrected before the sync can succeed.
  • Internal error - An error occurred in Segment's internal systems. This should resolve on its own. Contact the Segment Support team if the sync failure persists.

Insufficient permissions


If Data Lakes does not have the correct access permissions for S3, Glue, and EMR, your syncs will fail.

If permissions are the problem, you might see one of the following permissions-related error messages:

  • "Segment was unable to upload staging data to your S3 Bucket due to a lack of sufficient permissions".
  • "Segment does not have permissions to download object from S3 Bucket".
  • "Segment does not have permissions to upload object to S3 Bucket".
  • "Segment does not have permissions to delete S3 objects from S3 Bucket".
  • "Segment does not have permissions to submit an EMR job to cluster".
  • "Segment does not have permissions to check the status of EMR Job on EMR Cluster".
  • "Segment does not have permissions to delete table from Glue Catalog".
  • "Segment does not have permissions to fetch schema information from Glue catalog".

Check the setup guide to ensure that you configured the required permissions for S3, Glue, and EMR.

Invalid settings

One or more settings might be incorrectly configured in the Segment app, preventing your Data Lakes syncs from succeeding.

If you have invalid settings, you might see one of the error messages below:

  • "Data Lakes Destination has invalid configuration."
  • "The Table Partitions configuration for this Data Lake is invalid. The field name does not appear to map to the data being processed, which likely means it is misconfigured."
  • "External ID is invalid. Please ensure the external ID in the IAM role used to connect to your Data Lake matches the source ID."
  • "External ID is not set. Please ensure that the IAM role used to connect to your Data Lake has the source ID in the list of external IDs."

The most common error occurs when you do not list all Source IDs in the External ID section of the IAM role. You can find your Source IDs in the Segment workspace, and you must add each one to the list of External IDs in the IAM policy. You can either update the IAM policy from the AWS Console, or re-run the Data Lakes setup Terraform job.
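For reference, the external IDs live in the trust policy of the IAM role that Segment assumes. A sketch of what that trust relationship looks like; the principal ARN and source IDs below are placeholders, not Segment's real values:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "AWS": "arn:aws:iam::123456789012:root" },
      "Action": "sts:AssumeRole",
      "Condition": {
        "StringEquals": {
          "sts:ExternalId": ["SOURCE_ID_1", "SOURCE_ID_2"]
        }
      }
    }
  ]
}
```

Each Segment source that syncs to the data lake needs its Source ID in the sts:ExternalId list.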

Internal errors

Internal errors occur in Segment's internal systems and should resolve on their own. If sync failures persist, contact the Segment Support team.


How are Data Lakes sync reports different from the sync data for Segment Warehouses?


Both Warehouses and Data Lakes provide similar information about syncs, including the start and finish time, rows synced, and errors.

However, Warehouse sync information is only available in the Segment app, on the Sync History and Warehouse Health pages. With Data Lakes sync reports, the raw sync information is sent directly to your data lake, so you can query the raw data to answer your own questions about syncs and use it to power alerting and monitoring tools.

What happens if a sync is partly successful?


Sync reports are generated only when a sync completes or fails. Partial failure reporting is not currently supported.