In the past, we shared a blog series about BigQuery tips for Universal Analytics. Today, we’re kicking off a new series diving into Google Analytics 4 (GA4) in BigQuery, starting with an overview of users and sessions. We’ll use real examples that will through this series gradually become more complex. Just like with Universal Analytics we will be diving into Google Analytics 4 (GA4) metrics and dimensions and exploring different ways to access them using BigQuery.
The GA4 BigQuery export is a big step forward in making data accessible for everyone as it’s now available for standard and 360 properties (standard properties are limited to an export of a million events daily). GA4 has a brand-new schema and there are many things you should know about before you begin querying the data.
As we write our queries and look at the calculated metrics, you’ll notice that your BigQuery results are not matching the UI numbers exactly. There could be many reasons for that, but remember that GA4 UI often uses estimates.
The reason for estimating metrics in the UI stems from the resources required to calculate the reporting tables. In short, when you want exact metric numbers, but are okay with taking it slightly longer to calculate, use BigQuery1.
The number of users in GA4 is calculated using either the user_pseudo_id field, which is a unique device identifier (clientId in the UA export), or user_id if it’s being set manually in the property tracking.
Therefore, to calculate the number of users on a property on a given day, you can take advantage of the count distinct.
SELECT count(distinct user_pseudo_id) FROM analytics_1234.event_20221201
You could also take advantage of the user_pseudo_id stickiness to create a simple segment of all the users who visited your website in January.
SELECT DISTINCT user_pseudo_id FROM `analytics_1234.events_202201*`
Now you can look at how many of those users also visited each day in September.
WITH userVisitedInJan2022 AS ( SELECT DISTINCT user_pseudo_id FROM `analytics_1234.events_202201*`) SELECT _table_suffix as date, COUNT(DISTINCT user_pseudo_id) as users FROM `analytics_1234.events_202209*` WHERE user_pseudo_id in (SELECT * FROM userVisitedInJan2022) GROUP BY 1 ORDER BY date ASC
As a note, data from Google signals is not exported to BigQuery. To quote the documentation directly, “Google Analytics 4 exports event data associated with anonymous cookies to BigQuery”. This may result in different numbers between the GA4 UI and BigQuery in event counts as well as user counts when Google signals are turned on.2,3
The GA4 UI offers three different user definitions; device-based is the only one that does not rely on Google Signals.
You can observe sessions in the BigQuery export in two different ways. You can either look at the number of sessions that started or at the total number of sessions present during the time period.
A few characteristics of the GA4 sessions1:
- A session starts with a page or a screen view when no session is active (no events from the user in the previous 30 minutes) or when opening an application in the foreground on mobile devices
- A session_start event is sent on every session start
- The start of the session also creates two parameters:
- ga_session_id, a unique identifier of user’s session (not globally unique)
- ga_session_number, a metric marking the number of visits (including the current one) by the same user
- By default, a session times out after 30 minutes of the user not generating new events
- Unlike UA, the new GA4 will not create a new session on traffic source changes
- Unlike UA, the new GA4 will also consider a session that spans over two days as a single session
The following query will calculate the number of sessions starting on a given day using the session_start event:
SELECT COUNTIF(event_name = "session_start") FROM `analytics_1234567.events_20221018`
To look at the actual number of sessions on a given day (sessions that started on the day as well as those that carried over from the previous day), you have to concatenate the user identifier (user_pseudo_id) and event parameter that uniquely describes the current session of the user (ga_session_id).
You’ll notice that the query gets a bit more complex since you’re extracting a ga_session_id value from the repeated field event_params.
SELECT COUNT(DISTINCT concat(user_pseudo_id, " ", (SELECT value.int_value FROM unnest(event_params) WHERE key = "ga_session_id") )) as sessions FROM `analytics_1234567.events_20221018`
It’s possible that the value of user_pseudo_id will come through to BigQuery as null. This means that the user identifier was not sent to Google Analytics. This will impact the second of the two calculations and may even cause the number of total sessions in the selected time period to be smaller than the number of sessions that started during the same time.
Stay tuned for part two, which will cover more detailed session information, such as attribution, time on site, time engaged, and similar. In the meantime, please reach out with any questions.
- [GA4] About Analytics sessions – “BigQuery considerations” section talks about differences between the UI and BigQuery export.
- [GA4] Data thresholds – The “Exporting to BigQuery” section holds information about what happens to data from Google signals in BigQuery.
- [GA4] Activate Google signals for Google Analytics 4 properties – Additional information about using Google Signals vs the anonymous cookie export to BigQuery.