On May 10th Google announced new features that make working with BigQuery even more efficient. This article will look into utilizing PIVOT and QUALIFY with the Google Analytics 4 export data in BigQuery. With Google Analytics 4, the raw level export is now available to free and enterprise users; learn more about it here.
In the Google Analytics 4 export, each row is an event. Counting the number of rows will therefore represent the number of events. In the example below, we are grouping those events by 'user_pseudo_id' (or the Google Analytics client id) and 'event_name' (descriptions of actions in your digital property).
[sourcecode lang="SQL"] SELECT user_pseudo_id, event_name, count(*) events FROM `projectId.analytics_123456789.events_20210512` GROUP BY 1, 2 [/sourcecode]A very common overview of users and their events would have rows split by users and column metrics for each event. This is where the PIVOT operator comes in.
To use the PIVOT operator we will start with a table expression (the query above), followed by the PIVOT keyword, and then provide the:
Above syntax and additional documentation on the PIVOT operator is available in the BigQuery documentation.
[sourcecode lang="SQL" highlight="6-8"] WITH events as (SELECT user_pseudo_id, event_name, count(*) events FROM `projectId.analytics_123456789.events_20210512` GROUP BY 1, 2) SELECT * FROM (SELECT * FROM events) PIVOT (SUM(events) for event_name IN("session_start", "page_view", "blog_view", "scroll", "first_visit")) [/sourcecode]We could also completely avoid aggregating in the first step (and at the same time avoid null values) by simply using count(*) on the unaggregated table.
[sourcecode lang="SQL" highlight="8"] WITH events as (SELECT user_pseudo_id, event_name FROM `projectId.analytics_123456789.events_20210512` ) SELECT * FROM (SELECT * FROM events) PIVOT ( count(*) for event_name IN("session_start", "page_view", "blog_view", "scroll", "first_visit")) [/sourcecode]Newly structured data now provides us with a perfect schema for analysis or even machine learning. With BQML, the output from the query above can be plugged into k-means clustering in a matter of minutes.
Typically a query using analytic functions would require us to use subqueries to filter by its outputs. In the example below, we look at the last event of each Google Analytics session without using the QUALIFY function.
The query outputs events associated to users and their sessions along with a "lastToFirst" attribute that ranks the events within session from last to first.
[sourcecode lang="SQL"] WITH sessionEvents AS ( SELECT user_pseudo_id, (SELECT value.int_value FROM unnest(event_params) WHERE key="ga_session_number") as sessionNumber, event_name, event_timestamp FROM `analyticspros.com:spotted-cinnamon-834.analytics_209734898.events_intraday_20210513`) SELECT *, RANK() OVER (PARTITION BY user_pseudo_id, sessionNumber ORDER BY event_timestamp DESC) as lastToFirst FROM sessionEvents [/sourcecode]Up until now, filtering to only the last event of the session (lastToFirst = 1), would require us to use the above as a subquery input of a filtering "wrapper".
[sourcecode lang="SQL"] SELECT * FROM( SELECT *, RANK() OVER (PARTITION BY user_pseudo_id, sessionNumber ORDER BY event_timestamp DESC) as lastToFirst FROM sessionEvents ) WHERE lastToFirst = 1 [/sourcecode]By using the QUALIFY clause, we can take care of filtering on the same hierarchical level of queries, get rid of a few rows of SQL and make our queries more readable.
[sourcecode lang="SQL"] SELECT *, RANK() OVER (PARTITION BY user_pseudo_id, sessionNumber ORDER BY event_timestamp DESC) as lastToFirst FROM sessionEvents WHERE True QUALIFY lastToFirst = 1 [/sourcecode]*Per the documentation, Qualify can only be used with WHERE, GROUP BY or HAVING clause. That's why I'm "forcing" the WHERE 1=1 before the QUALIFY.
Please, feel free to reach out with any requests, questions or feedback to @lukaslo and @adswerve on Twitter.