Subscribe to our monthly newsletter to get the latest updates in your inbox
So your Google Analytics export to BigQuery is delayed... Why does this happen? According to Google’s documentation, one file representing your analytics data for the previous data will be loaded to BigQuery “ during the morning of the time zone you set for reporting ”. On top of this, Google makes it clear, “ This feature is not governed by a service-level agreement (SLA) ”. So in theory, you may generally expect your data in the early hours of the morning. In practice, this daily ingestion can occasionally be delayed until the late morning or early afternoon, and Google is not required by contract to account for this delay. Why does this matter? When data is delayed it can affect requirements fulfillment downstream. That might mean that data is not available for a report in time for a meeting or a subsequent data processing step executes late. What can be done? In this documentation, we will discuss two approaches Adswerve recommends to limit the effects of this issue: using an event-driven approach to data processing and incorporating logic to leverage intraday tables if your daily export table is unavailable.
First Approach: Event-Driven Processing - A Solution for Data FreshnessWhy event-driven processing? As Google Analytics data doesn’t ingest into BigQuery at a set time, a pipeline that runs on a set schedule will not be efficient in processing your data. As such, we recommend using an event-driven pipeline to detect if the data has become available in BigQuery and process your data accordingly. This will ensure data freshness downstream. Solution Architecture: Adswerve recommends the following architecture to handle data transformations that can be handled by SQL queries. Use cases include clustering, aggregating and normalizing, to name a few. Step-by-Step Architecture:
- Daily and intraday tables are exported by Google Analytics to BigQuery.
- A Stackdriver Log Sink captures logs for data ingests to ‘ga_sessions_...’ and ‘ga_sessions_intraday_...’ BigQuery tables.
- This captured log is sent as a message to a Pub/Sub topic.
- A Cloud Function is subscribed to this topic and triggers on receipt of the ingestion message.
- The function executes BigQuery API calls to execute queries that perform the required data transformations and overwrite or append the resulting data to new tables or existing tables respectively.
Second Approach: Intraday Tables - A Solution for Data AvailabilityWhy are we concerned with data availability? Consider this scenario: Your marketing team meets every day and checks their reports for website traffic for the past seven days. Every so often, they notice that the data for yesterday is not available in these reports. The issue, of course, is the delayed Google Analytics export for yesterday’s sessions. The solution for this lies in your intraday tables. What are intraday (or realtime) tables? Intraday tables represent Google Analytics data for the current day. They are appended to three times a day with data lagging about two hours and are replaced with a new table when the daily sessions table is ingested to BigQuery. Alternatively, realtime tables are appended to approximately every 15 minutes. Both tables allow reporting on the current day’s analytics data. How can we use intraday tables? Because intraday tables are only overwritten in the event of a new daily table ingestion to BigQuery, they will continue to append while also storing the data for yesterday’s sessions. We can solve for yesterday’s missing data by incorporating logic into our data processing and reporting workflow. The logic will execute data processing queries to pull from intraday tables in the event that the daily table is not yet available. The following will explain how Adswerve recommends implementing this logic: But first, a few considerations:
- Intraday or realtime table export must be configured in your Google Analytics Export linking configuration.
- Intraday tables can (seldomly) have minor discrepancies in comparison to the daily session tables. These discrepancies are due to the following:
- Google Analytics allows for hits that are sent up to four hours late, and these hits may not be available in intraday tables at the time data is requested.
- Google performs some processing of data after hits are collected and before data is exported to BigQuery. Generally, the affected fields are traffic sources and linked marketing products (AdWords, Campaign Manager, etc.)
ImplementationYou have two options for implementation: Option One - Incorporate logic into event-driven processing: If you’re using an event-driven approach for data processing to ensure the freshest data (see part one above), it’s possible to incorporate table selection logic into your data pipeline. You can do this by writing conditionals in your Cloud Function code to select the correct table before executing processing queries. The logic of the code will execute as follows:
- Using the BigQuery API, check for the presence of the daily Google Analytics sessions table.
- If the table exists, select your daily table for yesterday’s data in your processing queries.
- If the daily table is not yet available, select your intraday table for yesterday’s data in your processing queries.
- Using BigQuery scripting syntax, we start by declaring two variables, ‘gaSessionsTableExists’ and ‘alreadyRanToday’. These variables will act as flags to determine our table selection.
- We set each variable with (respectively) a boolean value returned by checking for the presence of the daily Google Analytics table and the results of processing for a given day.
- With conditional IF statements, we decide whether the processing query should do one of the following…
- Process our data using the daily ga_sessions_ table.
- Process our data using the intraday table.
- Return since the data has already been processed.