So your Google Analytics export to BigQuery is delayed…
Why does this happen? According to Google’s documentation, one file representing your analytics data for the previous data will be loaded to BigQuery “during the morning of the time zone you set for reporting”. On top of this, Google makes it clear, “This feature is not governed by a service-level agreement (SLA)”. So in theory, you may generally expect your data in the early hours of the morning. In practice, this daily ingestion can occasionally be delayed until the late morning or early afternoon, and Google is not required by contract to account for this delay.
Why does this matter? When data is delayed it can affect requirements fulfillment downstream. That might mean that data is not available for a report in time for a meeting or a subsequent data processing step executes late.
What can be done? In this documentation, we will discuss two approaches Adswerve recommends to limit the effects of this issue: using an event-driven approach to data processing and incorporating logic to leverage intraday tables if your daily export table is unavailable.
First Approach: Event-Driven Processing – A Solution for Data Freshness
Why event-driven processing? As Google Analytics data doesn’t ingest into BigQuery at a set time, a pipeline that runs on a set schedule will not be efficient in processing your data. As such, we recommend using an event-driven pipeline to detect if the data has become available in BigQuery and process your data accordingly. This will ensure data freshness downstream.
Solution Architecture: Adswerve recommends the following architecture to handle data transformations that can be handled by SQL queries. Use cases include clustering, aggregating and normalizing, to name a few.
- Daily and intraday tables are exported by Google Analytics to BigQuery.
- A Stackdriver Log Sink captures logs for data ingests to ‘ga_sessions_…’ and ‘ga_sessions_intraday_…’ BigQuery tables.
- This captured log is sent as a message to a Pub/Sub topic.
- A Cloud Function is subscribed to this topic and triggers on receipt of the ingestion message.
- The function executes BigQuery API calls to execute queries that perform the required data transformations and overwrite or append the resulting data to new tables or existing tables respectively.
Prerequisites for Implementation: You need a development resource to configure the automated solution. Ideally, this developer would have prior knowledge of Google Cloud, SQL and the chosen programming language to make BigQuery API calls in the Cloud Function. We recommend the developer be granted the Project Owner role to access all the required Google Cloud resources.
Costs: The costs for this solution are negligible because much of the product usage will fall within the free tier for Google Cloud Platform’s pricing model (Stackdriver, Pub/Sub, Cloud Functions). The only cost incurred will likely be dependent on the size of your transformation queries executed via the BigQuery API.
Second Approach: Intraday Tables – A Solution for Data Availability
Why are we concerned with data availability? Consider this scenario: Your marketing team meets every day and checks their reports for website traffic for the past seven days. Every so often, they notice that the data for yesterday is not available in these reports. The issue, of course, is the delayed Google Analytics export for yesterday’s sessions. The solution for this lies in your intraday tables.
What are intraday (or realtime) tables? Intraday tables represent Google Analytics data for the current day. They are appended to three times a day with data lagging about two hours and are replaced with a new table when the daily sessions table is ingested to BigQuery. Alternatively, realtime tables are appended to approximately every 15 minutes. Both tables allow reporting on the current day’s analytics data.
How can we use intraday tables? Because intraday tables are only overwritten in the event of a new daily table ingestion to BigQuery, they will continue to append while also storing the data for yesterday’s sessions. We can solve for yesterday’s missing data by incorporating logic into our data processing and reporting workflow. The logic will execute data processing queries to pull from intraday tables in the event that the daily table is not yet available. The following will explain how Adswerve recommends implementing this logic:
But first, a few considerations:
- Intraday or realtime table export must be configured in your Google Analytics Export linking configuration.
- Intraday tables can (seldomly) have minor discrepancies in comparison to the daily session tables. These discrepancies are due to the following:
- Google Analytics allows for hits that are sent up to four hours late, and these hits may not be available in intraday tables at the time data is requested.
- Google performs some processing of data after hits are collected and before data is exported to BigQuery. Generally, the affected fields are traffic sources and linked marketing products (AdWords, Campaign Manager, etc.)
You have two options for implementation:
Option One – Incorporate logic into event-driven processing: If you’re using an event-driven approach for data processing to ensure the freshest data (see part one above), it’s possible to incorporate table selection logic into your data pipeline. You can do this by writing conditionals in your Cloud Function code to select the correct table before executing processing queries. The logic of the code will execute as follows:
- Using the BigQuery API, check for the presence of the daily Google Analytics sessions table.
- If the table exists, select your daily table for yesterday’s data in your processing queries.
- If the daily table is not yet available, select your intraday table for yesterday’s data in your processing queries.
The following diagram describes the entire process.
Option Two – Incorporate logic directly in queries – BigQuery scripting: In some cases, it may be possible to incorporate this table selection logic directly within your processing queries. This option relies on using BigQuery scripting to procedurally check for the presence of your daily Google Analytics table and select a table accordingly.
The following code snippet and comments describe the BigQuery script used for table selection logic:
A more detailed explanation:
- Using BigQuery scripting syntax, we start by declaring two variables, ‘gaSessionsTableExists’ and ‘alreadyRanToday’. These variables will act as flags to determine our table selection.
- We set each variable with (respectively) a boolean value returned by checking for the presence of the daily Google Analytics table and the results of processing for a given day.
- With conditional IF statements, we decide whether the processing query should do one of the following…
- Process our data using the daily ga_sessions_ table.
- Process our data using the intraday table.
- Return since the data has already been processed.
Finally, using query scheduling we can execute this query on a routine basis, for example, every 15 minutes. This query will automatically handle the delayed data issue we see in the Google Analytics export.
Using either of our recommended solutions for Google Analytics Export lag issues will result in greater data freshness and ensure data availability. If you have a more complex issue or have questions, please feel free to contact us and we’ll help you find a custom solution.