Looking for GA4 BigQuery export guide? Be sure to check out our latest blog: GA4 BigQuery Guide: Users and Sessions (Part One)
In part one of the Google Analytics + BigQuery Tips series, we covered users and sessions. Up now: Nesting in the Google Analytics (GA) BigQuery export.
One of BigQuery’s key differentiators is its nested data. Instead of a relational table architecture, often BigQuery tables are denormalized and multiple entity types end up in the same table. In the GA export, you will usually first notice this with hits and sessions stored in the same table. Typically, sessions would be expected to have a one-to-many relation to hits.
There are up to three levels of nesting within the GA export pertaining to 11 attributes.
There is a common confusion between records and nesting. Records are similar attributes grouped together and just like other data types, they can be nested. For example, traffic source data is all under the same record (trafficSource). However, the record itself is not nested since it only appears once per session. On the other hand, custom dimensions are a nested (appears multiple times per session, hit or product) attribute of a type record (includes attributes index and value). Even though all repeated (or nested) fields in the GA export are records, repeated fields could be of any type.
As we mentioned earlier, nesting is similar to a one-to-many connection to another table. For this example, we could almost think about hits and sessions as two separate tables that are joined together with the unnest command.
[sourcecode language="sql" title="Flattening a Table"]SELECT * FROM `project-id.dataset-id.ga_sessions_20200214`, unnest(hits) as hits[/sourcecode]
The query above would transform the following simplified Google Analytics export table, which has one row (with hits nested within that row).
Client Id | Visit Number | Medium | Hits.Number | Hits.Page |
---|---|---|---|---|
1234.1234 | 3 | Organic | 1 | / |
2 | /blog | |||
3 | /blog/Luka | |||
4 | /blog/Luka/bq-howto-1 | |||
5 | /blog/Luka/bq-howto-2 |
Into
Client Id | Visit Number | Medium | Hits.Number | Hits.Page |
---|---|---|---|---|
1234.1234 | 3 | Organic | 1 | / |
1234.1234 | 3 | Organic | 2 | /blog |
1234.1234 | 3 | Organic | 3 | /blog/Luka |
1234.1234 | 3 | Organic | 4 | /blog/Luka/bq-howto-1 |
1234.1234 | 3 | Organic | 5 | /blog/Luka/bq-howto-2 |
The "COUNT(*)" function for table 1 would return 1 (treating each row as a session), while the same command when using "unnest(hits)" would return 5. Each row is now a hit, and all the parent (session) attributes (client id, visit number, and medium) were inherited to each hit.
Individual nested rows can be accessed as one of the attributes. This is particularly useful when trying to access specific custom dimensions or certain hits like landing or exit pages. Unlike the previous method, this does not change the scope of the table (number of rows stays the same).
[sourcecode language="sql" title="Accessing Custom Dimension 5"]SELECT (SELECT value FROM unnest(customDimensions) WHERE index = 5) FROM `project-id.dataset-id.ga_sessions_*`[/sourcecode]
[sourcecode language="sql" title="Retrieveing Session's Landing Page"]SELECT (SELECT page.pagePath FROM unnest(hits) WHERE isEntrance) FROM `project-id.dataset-id.ga_sessions_*`[/sourcecode]
[sourcecode language="sql" title="Retrieveing Session's Exit Page"]SELECT (SELECT page.pagePath FROM unnest(hits) WHERE isExit) FROM `project-id.dataset-id.ga_sessions_*`[/sourcecode]
Make sure that your condition (index=5, isExit, isEntrance) returns a single value within a subquery. If results return multiple values, they need to be aggregated.
Rows in repeated (nested) fields are ordered. Custom dimensions are ordered by their index ascending, hits by hit number, products by position, etc. This gives us the ability to access any row within a nested field using either offset (0-based) and ordinal (1-based) indexing.
Note that hits[offset(0)] is not the same as isEntrance = True, the difference is that isEntrance is true on the first pageview or screenview of the session. The first hit could also be of a hit type different than pageview or screenview, like, an event.
[sourcecode language="sql" title="First hit of a Session"]SELECT hits[offset(0)].page.pagePath as firstHitOfSession, hits[ordinal(1)].page.pagePath as firstHitOfSessionOrdinal FROM `project-id.dataset-id.ga_sessions_*`[/sourcecode]
You may encounter issues when trying to access the second or third hit of the session. All sessions have at least one hit, but some may only have a single hit. So accessing a second hit of the session will generate an issue.
[sourcecode language="sql" title="Third hit of a Session"]SELECT hits[offset(2)].page.pagePath as thirdHitOfSession, hits[ordinal(3)].page.pagePath as thirdHitOfSessionOrdinal FROM `project-id.dataset-id.ga_sessions_*`[/sourcecode]
To avoid this, use a prefix safe_ which will use a null value in place of the out of bounds results.
[sourcecode language="sql" title="Third hit of a Session (safe)"]SELECT hits[safe_offset(2)].page.pagePath as thirdHitOfSession, hits[safe_ordinal(3)].page.pagePath as thirdHitOfSessionOrdinal FROM `project-id.dataset-id.ga_sessions_*`[/sourcecode]
In part three of our Google Analytics + BigQuery Tips series, I will dive into more complex queries such as a calculation of time-on-page. Utilizing unnesting, window functions and other BigQuery features.
Please reach out with any requests, questions or feedback to @lukaslo and @adswerve on Twitter.