Looking GA4 BigQuery export guide? Be sure to check out our latest blog: GA4 BigQuery Guide: Users and Sessions (Part One)
In this first post of our new blog series, Google Analytics + BigQuery Tips, we will look at how users and sessions are counted in BigQuery and many obstacles you may come across when querying these seemingly simple metrics.
Google Analytics 360 users have been exporting raw unsampled data to BigQuery for over five years and we’ve been working with the export ever since. This post will show you how to start querying that data, provide you with tips and tricks and save you from any mistakes working with the export. Basic SQL knowledge is helpful to understand the examples below.
Before we start, it is important to know that you should be getting a 1-to-1 match with your Google Analytics UI for most of the metrics (users and sampled metrics being the exception). The first thing to check is that your dataset id in BigQuery matches the id of the view in Google Analytics. Filters applied to the view will apply to the hits that get exported to BigQuery.
All the data from Google Analytics is aggregated in daily tables, which include nested fields for hits, custom dimensions, products, etc. On the highest level, each row of the table represents a session, or does it?
What is happening? The number of rows does not match the number of sessions.
Google Analytics defines a session as a set of hits with at least one pageview or interactive event (without any session breaking conditions between those hits). It is possible to have a visit with a non-interactive event that will create a new row in the BQ table, but will not count as a session in Google Analytics (interestingly, it will count as a user).
[sourcecode language="sql" title="Number of Sessions"]SELECT sum(totals.visits) FROM `project-id.dataset-id.ga_sessions_20190101`[/sourcecode]
[sourcecode language="sql" title="Number of Sessions & Non-interactive Visits"]SELECT count(*) FROM `project-id.dataset-id.ga_sessions_20190101`[/sourcecode]
ClientID vs fullVisitorId
The number of users can be calculated by counting a distinct number of client ids or fullVisitorIds. FullVisitorId is a hashed version of a client id (which matches the _ga cookie on your digital property). Client id is not available for exports from before June of 2018.
[sourcecode language="sql" title="Number of Users"]SELECT count(distinct clientId) FROM `project-id.dataset-id.ga_sessions_20190101`[/sourcecode]
The user metric is an exception when it comes to matching the number to Google Analytics UI. It is possible that the user metric in your Google Analytics property is approximated and will not match the actual number produced by your query. Read more about this.
Since Google Analytics data is sharded into daily tables, it is important to know how to query across a date range. BigQuery offers a quick solution for this without needing to manually write union queries (utilizing table suffix).
[sourcecode language="sql" title="Querying across multiple days"]SELECT sum(totals.visits) FROM `project-id.dataset-id.ga_sessions_*` WHERE _table_suffix BETWEEN "20191001" AND "20191031" [/sourcecode]
What is happening? Midnight Break.
When counting and dealing with sessions in general, pay close attention to how those behave at midnight. By default, Google Analytics breaks sessions that carryover midnight (time zone of the view). The same session will end up in two daily tables. The screenshot below shows you what happens to a session that gets split between two daily tables:
Google Analytics breaks the session and counts them as two individual visits.
When creating a session id:
And finally, per previous comment, to match sessions in the Google Analytics UI, make sure to filter to only those with totals.visits = 1.
[sourcecode language="sql" title="Using 'session id' to count sessions across multiple days."]SELECT count(distinct sessionId) FROM( SELECT CONCAT(clientId, "-", visitNumber, "-", date) as sessionId FROM `project-id.dataset-id.ga_sessions_*` WHERE _table_suffix BETWEEN "20191001" AND "20191031" AND totals.visits = 1) [/sourcecode]
Using a session id in our queries will make a lot more sense once we start working with data in scopes other than session.
What is happening? Session Count Reset.
One of the most surprising discoveries working with the export has been finding out that a single user can be a new user more than once. I've wasted many hours trying to debug my queries and logic trying to find why this would be happening. The expected logic is that the visit number or session count gets increased by one on every new visit for the same user. However, a little known fact is that if two visits from the same user happen more than 183 days apart, the session count resets and the user is considered new again.
[sourcecode language="sql"]SELECT * FROM( SELECT fullVisitorId, date as visitDate, lag(date, 1) OVER(PARTITION BY fullVisitorId ORDER BY visitStartTime ASC) as previousVisitDate, totals.newVisits, visitNumber, lag(visitNumber, 1) OVER(PARTITION BY fullVisitorId ORDER BY visitStartTime ASC) as previousVisitNumber FROM `project-id.dataset-id.ga_sessions_2019*`) WHERE previousVisitDate is not null and newVisits = 1[/sourcecode]
As you can see in the image below, sessions of users who have not visited the website for 183 days are resetting back to 1. The first-row user has a visit on September 3rd, which is 204 days since their previous visit on February 2nd. In this case, the session count resets to one instead of increasing to three.
There is a lot of unexpected behavior when it comes to measuring users and sessions in BigQuery. To stay safe always pay attention to the date range selected, scope that you are in, non-interactive events, discrepancies with the Google Analytics UI and most importantly, use common sense to assess the numbers.
In future posts, I will dive into nesting in the Google Analytics BigQuery export, scoping (unnesting, custom dimensions & metrics), e-commerce data, window functions (time on page), handling realtime export in standard SQL, more advanced segment creation, recreating attribution reports, etc.
Please reach out with any questions or feedback to @lukaslo and @adswerveinc on Twitter.