Audience segmentation with Google Analytics 4 (GA4) and BigQuery (Part 1)

November 13, 2023

Editor's note (updated May 2026): Since this article was first published in November 2023, Google's Marketing Analytics Jumpstart (MAJ) has continued to evolve. The project now includes a fourth ML pipeline — Value Based Bidding — in addition to the original three (Audience Segmentation, Propensity to Convert, and Predicted Lifetime Value). The repository can now be deployed in approximately 30–45 minutes via Google Colaboratory, and Adswerve's Audience Segmentation contribution remains part of the active codebase. The core challenges and approaches described in this series remain accurate and applicable.

In this two-part series, I walk through the methodology behind audience segmentation with Google Analytics 4 (GA4) BigQuery export data, and the approach we contributed to Google's open-source Marketing Analytics Jumpstart (MAJ) project. Part 1 covers the core segmentation considerations. Part 2 goes deeper on solutions, methodology, and the math behind the model.

Key takeaways

Audience segmentation using Google Analytics 4 (GA4) BigQuery export data surfaces behavioral patterns that standard Google Analytics segments often miss, particularly non-obvious cross-interest groups.
Selecting attributes from a single "domain" (interest-based or engagement-based, not both) produces segments that are easier to explain and more actionable for marketing teams.
K-means clustering requires specifying the number of clusters in advance; because random initialization can produce different results on retraining, operational consistency needs to be built into the pipeline design.
Google's open-source Marketing Analytics Jumpstart (MAJ) provides a Terraform-automated framework for deploying audience segmentation, propensity modeling, and lifetime value pipelines on Google Cloud, with Adswerve's segmentation approach included as a contribution.

Audience segmentation using the Google Analytics 4 (GA4) BigQuery export

Segmentation (or clustering) work brings challenges that typically don’t arise in other data science tasks such as propensity modeling. These challenges apply to clustering in general, but here we frame them around Google Analytics 4 (GA4) data.

Consider a realistic scenario: a company that sells sports apparel through its online store wants to know what its visitors are interested in. In other words, which interest groups appear on the site? Some examples could be:

Basketball lovers: Interested in basketball shoes and shorts
Runners: Interested in all things running
T-shirts: Primarily interested in different kinds of t-shirts
Jerseys: Primarily interested in buying jerseys from favorite sports teams

The possibilities are endless, and the most intriguing segments are often less obvious than direct ones such as "basketball lovers." Cross-over interests are a good example: a group passionate about both basketball and running is usually harder to find.

Given a sound and accurate Google Analytics implementation, we can often extract product categories from the page path, which gives easy access to the needed data. Below, we discuss the challenges and offer possible solutions for establishing a reliable process that gives this fictitious company better segmentation options within the Google Analytics 4 UI and new audiences to share through the Google Marketing Platform (GMP) stack.
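As a rough illustration of pulling a product category out of the page path, here is a minimal Python sketch. The URL layout (`/shop/<category>/...`), the `shop` prefix, and the helper name `category_from_path` are assumptions made up for this example; a real site will have its own path structure.

```python
from urllib.parse import urlparse


def category_from_path(page_location, prefix="shop"):
    """Return the path segment right after `prefix`, or None if it is absent.

    Assumes a hypothetical URL layout of /shop/<category>/...
    """
    segments = [s for s in urlparse(page_location).path.split("/") if s]
    if len(segments) >= 2 and segments[0] == prefix:
        return segments[1].lower()
    return None


# Hypothetical page_location values, as they might appear in page_view events.
urls = [
    "https://example.com/shop/basketball/shoes/air-123",
    "https://example.com/shop/running/shorts?color=blue",
    "https://example.com/about-us",
]
categories = [category_from_path(u) for u in urls]
```

Pages that do not belong to a product category (such as the last URL) return `None` and can be dropped before any clustering attributes are built.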

Audience segmentation considerations

We don’t know how many segments we need or how many will be useful.

With some clustering methods like k-means clustering, we have to specify the number of clusters we want to train with. It’s unknown what the best number is, so this will require some experimentation to get right.

There are other methods like hierarchical clustering where the number of clusters is not a required hyperparameter, but we still have to draw a line somewhere later because, in the end, we still want to work with a given set of clusters.
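The experimentation mentioned above often means training with several values of k and comparing how tightly the clusters fit. The sketch below is a plain-Python k-means written only for illustration (it is not the MAJ implementation), run on made-up two-attribute data: share of running views and share of basketball views. It records the within-cluster sum of squares (inertia) for each k, keeping the best of a few random restarts.

```python
import random


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_inertia(points, k, seed, iterations=20):
    """Run plain Lloyd's k-means once; return the within-cluster sum of squares."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # random initialization
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: sq_dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Move each centroid to the mean of its members; keep it if the cluster is empty.
        centroids = [
            tuple(sum(dim) / len(members) for dim in zip(*members)) if members else centroids[i]
            for i, members in enumerate(clusters)
        ]
    return sum(min(sq_dist(p, c) for c in centroids) for p in points)


def best_inertia(points, k, restarts=10):
    """Lowest inertia across several random initializations."""
    return min(kmeans_inertia(points, k, seed) for seed in range(restarts))


# Made-up visitors: (share of running views, share of basketball views).
points = [
    (0.90, 0.10), (0.80, 0.10), (0.85, 0.05),  # mostly running
    (0.10, 0.90), (0.10, 0.80), (0.05, 0.85),  # mostly basketball
    (0.50, 0.50), (0.45, 0.50), (0.50, 0.45),  # cross-over interest
]
inertias = {k: best_inertia(points, k) for k in range(1, 6)}
```

Inertia generally keeps shrinking as k grows, so the lowest value alone does not pick k. A common heuristic is to look for the point where the improvement flattens out, and then check whether the resulting segments make business sense.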

Once you have the segments, they need meaningful business names.

When presenting our findings, we want to use friendly names for segments like "runners" and "basketball lovers" and not just a set of numbers. Doing this programmatically is tough, so a human needs to review the results and suggest segment names. That said, with the arrival of large language models, full automation is within reach.

Segments need to be explainable.

As with naming, it must be clear to the end consumer what each segment means and represents.

The challenge stems from the fact that we can cluster on many different attributes, and each choice produces different clusters. This makes attribute selection very important: the attributes we choose ultimately determine how we explain each cluster.

Collecting every conceivable attribute for analysis may lead to complexity in interpreting what each segment means for the business. In contrast, selecting a concise, focused group of attributes from the same “domain” can yield results that are simpler to understand and potentially more practical.

The classic example is interest-based segmentation vs. engagement-based segmentation. Interest-based attributes describe what visitors have been viewing on the website, that is, the site sections or product categories they browse.

Engagement-based attributes describe how much visitors interact with the site: time spent on the site, number of page views, and counts of specific events and purchases.
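To make the two domains concrete, the sketch below builds both attribute sets from the same small batch of rows. The row layout (`user_id`, `product_category`, `engagement_seconds`) is a simplified, hypothetical flattening of page view events, not the actual GA4 export schema.

```python
from collections import Counter, defaultdict

# Hypothetical flattened page view rows: (user_id, product_category, engagement_seconds)
rows = [
    ("u1", "running", 40),
    ("u1", "running", 25),
    ("u1", "basketball", 10),
    ("u2", "jerseys", 300),
]

views = defaultdict(Counter)  # user -> category -> number of views
seconds = Counter()           # user -> total engagement seconds
for user, category, secs in rows:
    views[user][category] += 1
    seconds[user] += secs

# Interest-based attributes: share of each user's views per product category.
interest = {
    user: {cat: n / sum(counts.values()) for cat, n in counts.items()}
    for user, counts in views.items()
}

# Engagement-based attributes: how much the user did, regardless of category.
engagement = {
    user: {"page_views": sum(views[user].values()), "engagement_seconds": seconds[user]}
    for user in views
}
```

Here `u1` is mostly interested in running but only lightly engaged, while `u2` is interested in jerseys and spent far more time on the site. The two attribute sets answer different questions about the same visitors.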

If we were to mix attributes from both domains, it would become harder and harder to extract business value. This is not to say mixing is never the right approach, but the results can be messy to explain because the resulting clusters might not be what you would expect.

In more technical terms, a clustering algorithm like k-means typically gives every attribute equal weight. So when mixing attributes from different domains, you may hope to get clusters like:

Engaged runners: Very active and spent lots of time on the site as well as interested in all things running
Moderate runners: Moderately engaged, otherwise same as Engaged Runners
Casual runners: Perhaps a session or two on the site spent browsing all things running

But this is rarely the case. You may get one of those segments, for example engaged runners, but not the other two. There are various reasons for that, but generally the other two are too mixed with other segments to "survive" on their own.

Does that mean everyone on the website who is interested in running is highly engaged, and nobody interested in running is more of a casual browser? Almost certainly not; the casual runners most likely exist but were absorbed into other clusters.

The point is that it’s hard to control what comes out of the clustering algorithm, but you can control what attributes go in. Being more specific and intentional about what goes in will give you a more controlled output.
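A small numeric sketch shows how the choice of input attributes changes which visitors look "similar" to a distance-based algorithm such as k-means. The three visitors and their values are made up for illustration, and the engagement attribute is deliberately left unscaled to make the effect obvious.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Made-up visitors: (share of running views, share of basketball views, seconds on site)
casual_runner = (0.9, 0.1, 60.0)
engaged_runner = (0.9, 0.1, 600.0)
casual_basketball = (0.1, 0.9, 70.0)

# Mixed domains: the seconds-on-site attribute dominates the distance, so the
# casual runner sits much closer to the casual basketball fan than to the engaged runner.
mixed_to_engaged_runner = euclidean(casual_runner, engaged_runner)
mixed_to_casual_basketball = euclidean(casual_runner, casual_basketball)

# Interest attributes only: the two runners are identical and the basketball fan is far away.
interest_to_engaged_runner = euclidean(casual_runner[:2], engaged_runner[:2])
interest_to_casual_basketball = euclidean(casual_runner[:2], casual_basketball[:2])
```

Standardizing the attributes would shrink the gap between the two runners, but engagement would still pull visitors with the same interests apart. Restricting the input to one domain keeps the notion of "similar" aligned with the question being asked.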

Retraining the model might generate different segments.

There is no guarantee that the clustering algorithm will produce the same results when retrained on fresher data, because training starts from a random initialization. Even when the clusters themselves are the same, their indexes may differ: what was previously cluster 0 (perhaps engaged runners) might now be cluster 3. The characteristics are unchanged, but the indexes have switched. This confuses the people using the clusters, because the meaning of a given index can change from one training run to the next.
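One possible way to deal with switched indexes, shown here only as an illustrative sketch and not necessarily the approach used in MAJ, is to match each newly trained centroid to the closest centroid from the previous run and reuse the old index. The centroids below are made up, and the helper `align_labels` is hypothetical. It assumes both runs use the same number of clusters and that the centroids have not drifted far.

```python
from itertools import permutations


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def align_labels(old_centroids, new_centroids):
    """Map each new cluster index to the old index with the best overall centroid match.

    Tries every one-to-one assignment, which is fine for a handful of clusters.
    """
    k = len(old_centroids)
    best = min(
        permutations(range(k)),
        key=lambda perm: sum(
            sq_dist(new_centroids[i], old_centroids[perm[i]]) for i in range(k)
        ),
    )
    return {new_idx: old_idx for new_idx, old_idx in enumerate(best)}


# Made-up centroids: (share of running views, share of basketball views).
old_centroids = [(0.9, 0.1), (0.1, 0.9), (0.5, 0.5)]
# Retraining found the same three groups, but in a different order.
new_centroids = [(0.48, 0.52), (0.88, 0.12), (0.12, 0.88)]

mapping = align_labels(old_centroids, new_centroids)
```

With the mapping in hand, predictions from the retrained model can be translated back to the indexes (and business names) that people already know.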

What's next: Solutions in Part 2

The technique we built into Google's open-source MAJ project allows for a very fast path from zero to insights. Clients often find it hard to sign up for a big data science project because they reap the benefits much later. The approach presented here bridges that gap: even in the early stage of a project, the client gets something tangible and useful, along with a taste of what's to come.

These considerations — how many segments to create, how to name them, and how to make them explainable to stakeholders — are the foundation for everything that follows. In Part 2, we get into the specific solutions and the math behind the approach we built into Google's Marketing Analytics Jumpstart.

Explore the open-source code on GitHub

Marketing Analytics Jumpstart (MAJ) is a Terraform-automated, quick-to-deploy marketing solution on Google Cloud that helps customers better understand and use their digital advertising budget. Check out our open-source Audience Segmentation submission on GitHub.
