Technical Insights

Universal Analytics to GA4 - BigQuery Export Guide

Written by Luka Cempre | Oct 6, 2022 6:00:00 AM

This is a live article, new definitions will be added and existing definitions may get updated.

High-Level Differences

GA4 export offers a considerably changed schema to work with your Google Analytics data, however, the export is still available in the familiar and ever more powerful BigQuery. Here are some of the main differences between the two exports:

  Universal Analytics (UA) Export GA4 Export
Cost Only available with a GA360 license.* Available with every GA4 property (free or paid).*
Location BQ Dataset:
- Numeric property id
BQ Table:
- ga_sessions_YYYYMMDD (daily shards)
- ga_intraday_YYYYMMDD (intraday)
- ga_realtime_sess
BQ Dataset:
- Property id prefixed with "analytics_"
BQ Tables:
- events_YYYYMMDD (daily shards)
- events_intraday_YYYYMMDD (streaming)
Row Scope - Every row is a session, calculations on the original nested table yield session-scoped results
- To access hits, rows need to be flattened first (unnested)
- Every row is an event/hit
- Data includes custom events as well as automatically collected events (first_visit)
- Some of the events used to be present as column "flags" (first_visit event in GA4 vs totals.newVisits column in UA)
Native Dimensions Over 200 columns of data are available in the export. The export includes almost all the dimensions and metrics available in the UI and even offers a few additional ones. A considerably smaller set of columns is available to be queried directly. Some of the fields are now accessible via a nested event parameters field or as their own events (session_start, first_visit)
Custom Dim. and Metrics vs Parameters Custom dimensions, need to be flattened and are accessed depending on the scope (user, session, hit, product). The two fields in custom dimensions are index and value. Values are always of a string type.

Custom metrics are accessed similarly to custom dimensions, but are only available on a hit level. The data type is always an integer.
Custom fields in GA4 export are available as parameters.

event_params - A repeated field with a string key (same name as in GA) and a value that can be either (a string, an integer, or a float)

user_properties - A repeated field with a similar structure as event_params and a value.set_timestamp_micros field. To signify the time when the property was set.
Current Day Data Current day data is available as either an intraday export (every few hours) or as a streaming export (about a 5-10min delay with duplicated events that require an additional view on top to deduplicate the export) Streaming is available. Hits become available within a few seconds. Name, source, medium are not available and user attribution data may not be fully available yet.
Cookieless Data Cookieless hits will be properly handled by anonymizing users when using the analytics storage consent.
See more
Cookieless hits will be completely removed when analytics storage consent is denied.
See more

*streaming incurs additional cost ($0.01 per 200MB streamed) in GCP, as a rough estimate you can consider an event as 1kB. Storage and processing may also incur additional costs.

Field Translation

UA Field GA4 Field
clientId, fullVisitorId user_pseudo_id
visitNumber, visitId, visitStartTime now available as integer event parameters with the key "ga_session_number" and "ga_session_id"
date event_date
totals.* since all the "totals" metrics were connected to the session, these fields now need to be extracted or calculated
trafficSource.* Traffic source data in GA4 export is a dangerous area. The three fields readily available as traffic_source.name, traffic_source.medium, traffic_source.source refer to user's first (acquisiton) session.

The traffic source values of the current session can be found in the event parameters.
device.* device.* (most but not all of the dimensions are still available)
geoNetwork.* geo.* (latitude and longitude of the city are no longer available)
hits.hour, hits.minute, hits.time Can now be extracted from event_timestamp.
hits.eCommerceAction Now available as recommended events as well as under additional fields under ecommerce.*.
hits.dataSource Additional stream and platform information is available.
hits.isEntrance, hits.isExit Can be extracted from the event sequence of the session, filtering by (page_view and screen_view events).
hits.product.* Product data is now found under items.*.
hits.latencyTracking Latency tracking is not natively available anymore.
hits.page Page dimensions are now available as event parameters on the event, using keys such as page_location, page_title, page_referrer... Hostname is available as device.web_info.hostname.
hits.appInfo.* app_info.*
privacy_info.* privacy_info.*
N/A user_ltv, user_first_touch_timestamp
Dimensions and metrics not supported by GA4 (or deprecated since classic GA) N/A

The new data set has a bigger focus on user acquisition. The traffic source fields now signify the campaign, source, and medium of the user's very first session. We also have the user's first touch timestamp and lifetime value available as a field. These features used to be harder to calculate and required seeking through the user's history.

Sample tables are available in the BigQuery public dataset:

  • UA: bigquery-public-data.google_analytics_sample.ga_sessions_*
  • GA4: bigquery-public-data.ga4_obfuscated_sample_ecommerce.events_*

Basic Metrics and Dimensions

Let's look through some basic queries that you may have used in the universal analytics export and how they translate under the new schema.

Events, Sessions and Users

With tables being event-scoped, the number of rows now directly corresponds to the number of events, therefore calculating all the events is as easy as counting the rows, and calculating the number of page views can be done by filtering these rows on event_name = "page_view".

As in the old export, sessions can be calculated using more than one approach. Since all sessions begin with a "session_start" event, we can take advantage of the event to calculate how many sessions started during the observed date range.

The number of users is calculated as it was in the past by looking at distinct ids (per the mapper above going from clientId to user_pseudo_id).

SELECT 
  count(*) as events,
  countif(event_name = "page_view") as pageViews,
  countif(event_name = "session_start") as sessions,
  count(distinct user_pseudo_id) as users
 FROM `analytics_12345678.events_20221002`

Note that GA4 uses estimates for many metrics in the UI. Therefore, you will likely not be able to get an exact match between your BQ data and the GA4 UI.

Sessions Advanced

In the Universal Analytics export, the number of sessions metric is available via a column (totals.visits). Which is either a 1 or a null and thus allows us to easily calculate the number of sessions by summing the values in that field (when working in a session-scoped table). If our UA table has been untested, we can take advantage of a combination of dimensions such as clientId and visitNumber to create unique ids for each session and calculate the number of sessions using a count distinct.

We can use the same approach with GA4. The user identifier (clientId in UA) is now available as a user_pseudo_id and session identifier is nested within the event parameters under the key ga_session_id.

SELECT 
    COUNT(DISTINCT 
        CONCAT(user_pseudo_id, "-", 
            (SELECT value.int_value FROM unnest(event_params) WHERE key="ga_session_id")
        )
    )  
FROM  
    analytics_12345678.events_20220101

Note that the two approaches for calculating the number of sessions will likely return different results. Since one is looking at sessions starting during the time range selected and the other one at total sessions during the time range. Therefore, if a session started before the observed time range and continues into it, it will be counted using the second, but not the first method.

Engagement Rate and Bounce Rate

The definition of bounce rate has changed from UA to GA4. From a percentage of single-page sessions without interactive events to the percentage of sessions that were not engaged sessions. Because user_engagement event can happen multiple times in a session, we calculate the engagement rate by a distinct count of session ids with the engagement event divided by all sessions.

To calculate the bounce rate we look at the opposite audience:
Bounce Rate = 1 - Engagement Rate

WITH sessionsEngaged AS (
  SELECT  
    count(distinct session_id) as sessions,
    count(DISTINCT if( session_engaged>0, session_id, null)) as engaged_sessions 
  FROM 
    (SELECT 
        (SELECT value.int_value FROM unnest(event_params) WHERE key="session_engaged")  as session_engaged ,    
        CONCAT(user_pseudo_id, "-", 
              (SELECT value.int_value FROM unnest(event_params) WHERE key="ga_session_id"))  as session_id 
          FROM `analytics_12345678.events_20221002`) 
)

SELECT  round(100*(engaged_sessions/sessions), 2) engagement rate, round(100*(1-engaged_sessions/sessions), 2) bounce_rate FROM sessionsEngaged

GA4 export also has a parameter called "engagement_time_msec" which will provide you with the exact time a user was engaged.

Total engagement time in msec:

 SUM((SELECT value.int_value FROM unnest(event_params) WHERE key="engagement_time_msec"))

Advanced Metrics and Dimensions

To get a better sense of these differences let's also build some slightly more complex queries.

User Entry and Exit

Landing page reports focus on the session's first page and the interactions on and following that page. In the example below we look at the number of users, sessions, and views per landing page along with the avg. a number of total events by users who landed on those pages.

A landing page is defined with the event_parameter key entrances. The value of the parameter on a landing page is 1. Note that the entrance page (first pageview) does not necessarily correspond to the first-page location of the session. Therefore using session_start instead of event parameter entrances will yield different results.

SELECT 
  (SELECT value.string_value FROM unnest(event_params) WHERE key = "page_location") as page_location,
  count(*) as sessions,
  count(distinct user_pseudo_id) as users
FROM `analyticspros.com:spotted-cinnamon-834.analytics_206551716.events_20221002`
WHERE  (SELECT value.int_value FROM unnest(event_params) WHERE key="entrances")=1
GROUP BY 1
ORDER BY 2 DESC

User Stickiness (DAU/MAU)

User stickiness is a new concept in Google Analytics 4 and it provides the ratio between active users over a narrower time frame the engagement over a broader time frame*. An example would be to compare the ratio of users active in the past day (DAU) with those active in the past thirty days (MAU).

To calculate the number of users we look at the distinct count of the user_pseudo_id. There are many ways to approach time ranges. To keep the example query simple, we select a 30-day window that is used for Monthly Active Users and then filter down the final week and day in the date range to generate Weekly and Daily Active Users.

SELECT ROUND(100*DAU/MAU, 2) DAU_MAU, ROUND(100*WAU/MAU, 2) WAU_MAU FROM (
    SELECT 
      count(distinct IF(event_date = "20221002", user_pseudo_id, null)) DAU,
      count(distinct IF(event_date BETWEEN "20220926" AND "20221002", user_pseudo_id, null)) WAU,
      count(distinct IF(event_date BETWEEN "20220904" AND "20221002", user_pseudo_id, null)) MAU,
    FROM `analytics_12345678.events_*`
    WHERE _table_suffix BETWEEN "20220903" AND "20221002"
)

GA4 to UA High-Level Metric Comparison

When transitioning from UA to GA4 tracking, it is recommended to run both implementations in parallel to learn about the differences in metric totals that you should expect. The query below joins similar metrics from GA4 and UA export and allows you to compare the two tracking methods.

WITH GA4_stats AS (
  SELECT 
    event_date as date,
    countif(event_name = "page_view") as GA4_pageviews,
    countif(event_name = "session_start") GA4_session_starts,
    COUNT(DISTINCT 
        CONCAT(user_pseudo_id, "-", 
            (SELECT value.int_value FROM unnest(event_params) WHERE key="ga_session_id")
        )
    )   as GA4_sessions,
    count(distinct user_pseudo_id) as GA4_users

  FROM
    `analytics_12345678.events_*`
  WHERE _table_suffix BETWEEN "20220901" AND "20221005"
    GROUP BY 1
),

UA_stats as (SELECT 
    date,
    sum(totals.pageviews) as UA_pageviews,
    sum(totals.visits) as UA_sessions,
    count(distinct clientId) as UA_users
   FROM `view_id.ga_sessions_*` WHERE _table_suffix BETWEEN "20220901" AND "20221005"
   GROUP BY 1)

SELECT * FROM 
  GA4_stats
  INNER JOIN
  UA_stats
  USING(date)

Additionally, you can take the results of the join to calculate the coefficient and standard deviation that will help you translate between UA and GA4.

SELECT 
  round(avg(GA4_users/UA_users), 3) UA2GA4_users_coeff, round(stddev(GA4_users/UA_users), 3) UA2GA4_users_stddev, 
  round(avg(GA4_sessions/UA_sessions), 3) UA2GA4_sessions_coeff, round(stddev(GA4_sessions/UA_sessions), 3) UA2GA4_sessions_stddev, 
  round(avg(GA4_pageviews/UA_pageviews ), 3) UA2GA4_pageviews_coeff, round(stddev(GA4_pageviews/UA_pageviews ), 3) UA2GA4_pageviews_stddev,  
 
  FROM 
  GA4_stats
  INNER JOIN
  UA_stats
  USING(date)

GA4 BigQuqery Guide

  1. GA4 BQ Guide - Users and Sessions

Resources

[GA4] Bounce rate vs engagement rate

[GA4] Engagement Reports

[GA4] Automatically collected events