Google Analytics + BigQuery Tips: Nesting (Part Two)


February 27, 2020

In part one of the Google Analytics + BigQuery Tips series, we covered users and sessions. Up now: Nesting in the Google Analytics (GA) BigQuery export.

One of BigQuery’s key differentiators is its nested data. Instead of a relational table architecture, often BigQuery tables are denormalized and multiple entity types end up in the same table. In the GA export, you will usually first notice this with hits and sessions stored in the same table. Typically, sessions would be expected to have a one-to-many relation to hits.

There are up to three levels of nesting within the GA export pertaining to 11 attributes.

  • Sessions (on the root level each row is a session)
    • Custom Dimensions (these are user and session-scoped custom dimensions only)
    • Hits (every pageview, event or other hit type belonging to the “parent” row i.e. a session)
      • Product (each hit can contain multiple products, either as impressions, part of a transaction, etc.)
        • Custom Dimensions (a product can have multiple custom dimensions)
        • Custom Metrics (a product can have multiple custom metrics; this is the deepest level of nesting)
      • Promotion
      • Experiment
      • Custom Variables
      • Custom Dimensions (hit scoped)
      • Custom Metrics (hit scoped)
      • Publisher Infos

There is a common confusion between records and nesting. Records are similar attributes grouped together and just like other data types,  they can be nested. For example, traffic source data is all under the same record (trafficSource). However, the record itself is not nested since it only appears once per session. On the other hand, custom dimensions are a nested (appears multiple times per session, hit or product) attribute of a type record (includes attributes index and value). Even though all repeated (or nested) fields in the GA export are records, repeated fields could be of any type.

How to Unnest or Access Nested Values

Flatten the Table

As we mentioned earlier, nesting is similar to a one-to-many connection to another table. For this example, we could almost think about hits and sessions as two separate tables that are joined together with the unnest command.

SELECT * FROM `project-id.dataset-id.ga_sessions_20200214`,
unnest(hits) as hits

The query above would transform the following simplified Google Analytics export table, which has one row (with hits nested within that row).

Client Id Visit Number Medium Hits.Number Hits.Page
1234.1234 3 Organic 1 /
2 /blog
3 /blog/Luka
4 /blog/Luka/bq-howto-1
5 /blog/Luka/bq-howto-2

Into

Client Id Visit Number Medium Hits.Number Hits.Page
1234.1234 3 Organic 1 /
1234.1234 3 Organic 2 /blog
1234.1234 3 Organic 3 /blog/Luka
1234.1234 3 Organic 4 /blog/Luka/bq-howto-1
1234.1234 3 Organic 5 /blog/Luka/bq-howto-2

The “COUNT(*)” function for table 1 would return 1 (treating each row as a session), while the same command when using “unnest(hits)” would return 5. Each row is now a hit, and all the parent (session) attributes (client id, visit number, and medium) were inherited to each hit.

Access Individual Rows with a Subquery

Individual nested rows can be accessed as one of the attributes. This is particularly useful when trying to access specific custom dimensions or certain hits like landing or exit pages. Unlike the previous method, this does not change the scope of the table (number of rows stays the same).

SELECT 
    (SELECT value FROM unnest(customDimensions) WHERE index = 5) 
FROM `project-id.dataset-id.ga_sessions_*`
SELECT 
(SELECT page.pagePath FROM unnest(hits) WHERE isEntrance) 
FROM `project-id.dataset-id.ga_sessions_*`
SELECT 
(SELECT page.pagePath FROM unnest(hits) WHERE isExit) 
FROM `project-id.dataset-id.ga_sessions_*`

Make sure that your condition (index=5, isExit, isEntrance) returns a single value within a subquery. If results return multiple values, they need to be aggregated.

Take advantage of indexing

Rows in repeated (nested) fields are ordered. Custom dimensions are ordered by their index ascending, hits by hit number, products by position, etc. This gives us the ability to access any row within a nested field using either offset (0-based) and ordinal (1-based) indexing.

Note that hits[offset(0)] is not the same as isEntrance = True, the difference is that isEntrance is true on the first pageview or screenview of the session. The first hit could also be of a hit type different than pageview or screenview, like, an event.

SELECT 
hits[offset(0)].page.pagePath as firstHitOfSession,
hits[ordinal(1)].page.pagePath as firstHitOfSessionOrdinal
FROM `project-id.dataset-id.ga_sessions_*`

You may encounter issues when trying to access the second or third hit of the session. All sessions have at least one hit, but some may only have a single hit. So accessing a second hit of the session will generate an issue.

SELECT 
hits[offset(2)].page.pagePath as thirdHitOfSession, 
hits[ordinal(3)].page.pagePath as thirdHitOfSessionOrdinal 
FROM `project-id.dataset-id.ga_sessions_*`


To avoid this, use a prefix safe_ which will use a null value in place of the out of bounds results.

SELECT 
hits[safe_offset(2)].page.pagePath as thirdHitOfSession, 
hits[safe_ordinal(3)].page.pagePath as thirdHitOfSessionOrdinal 
FROM `project-id.dataset-id.ga_sessions_*`

UP NEXT: PART THREE

In part three of our Google Analytics + BigQuery Tips series, I will dive into more complex queries such as a calculation of time-on-page. Utilizing unnesting, window functions and other BigQuery features.

Please reach out with any requests, questions or feedback to @lukaslo and @adswerve on Twitter.

Resources

Standard SQL Arrays

Google Analytics BigQuery Export Schema