In this two-part series, we'll explore the challenges we faced and the solutions we used in our open-source submission to Google's Marketing Analytics Jumpstart (MAJ). MAJ has three modeling pipelines: Audience Segmentation, Propensity to Convert, and Predicted Lifetime Value. We plan to submit contributions to all three pipelines, but we started with our novel approach to Audience Segmentation. Please share your thoughts with us on LinkedIn to keep the conversation going!
Segmentation (or clustering) work often brings out unique challenges that typically don’t arise when working on other types of data science tasks like propensity modeling. While the challenges apply generally to this type of work, we want to frame the exploration of those challenges around Google Analytics 4 (GA4) data.
A likely scenario one could encounter in the real world would be a company that sells sports apparel via its website store. They would like to know what their visitors are interested in. In other words, what groups of interest appear on the site? Some examples could be:
The possibilities are endless. Direct examples such as "Basketball Lovers" are easy to spot, but the most intriguing segments are often the less evident cross-over interests. Take, for instance, a group passionate about both basketball and running: these combinations are usually more challenging to find.
Given a sound and accurate GA4 implementation, we can often extract product categories from the page path, which gives easy access to the needed data. Below, we discuss the challenges and offer possible solutions to establish a reliable process for this fictitious company to provide better segmentation options within the GA4 UI and create new audiences to share through the Google Marketing Platform (GMP) stack.
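As a minimal sketch of extracting a product category from the page path, assume a hypothetical URL layout of `/shop/<category>/<product>`. The layout, the `shop` prefix, and the function name are illustrative assumptions; a real site will need its own parsing rule applied to the GA4 `page_location` value.

```python
from urllib.parse import urlparse

def category_from_page_location(page_location, prefix="shop"):
    """Return the category segment of a URL, or None if the path does not match.

    Assumes a hypothetical layout: /<prefix>/<category>/<product-slug>
    """
    path = urlparse(page_location).path
    segments = [s for s in path.split("/") if s]
    if len(segments) >= 2 and segments[0] == prefix:
        return segments[1].lower()
    return None

# Made-up example URLs.
urls = [
    "https://example.com/shop/running/trail-shoe-x1",
    "https://example.com/shop/Basketball/",
    "https://example.com/about-us",
]
categories = [category_from_page_location(u) for u in urls]
print(categories)  # ['running', 'basketball', None]
```

Pages that do not match the pattern return `None`, so they can be dropped or bucketed separately before clustering.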
With some clustering methods like k-means clustering, we have to specify the number of clusters before training. The best number is not known in advance, so finding it requires some experimentation.
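One common way to run that experiment is to train for a range of k values and compare the within-cluster sum of squares (inertia), looking for the point where the improvement flattens out. The sketch below is a small pure-Python k-means on made-up two-attribute data. It uses a deterministic farthest-point initialization so the output is reproducible, which is a simplification; library implementations typically use randomized initialization instead.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def farthest_point_init(points, k):
    """Deterministic init: start at the first point, then repeatedly add the
    point that is farthest from every centroid chosen so far."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(sq_dist(p, c) for c in centroids))
        )
    return centroids

def assign(points, centroids):
    """Index of the nearest centroid for every point."""
    return [
        min(range(len(centroids)), key=lambda i: sq_dist(p, centroids[i]))
        for p in points
    ]

def kmeans_inertia(points, k, max_iter=100):
    """Run Lloyd's algorithm and return the within-cluster sum of squares."""
    centroids = farthest_point_init(points, k)
    for _ in range(max_iter):
        labels = assign(points, centroids)
        updated = []
        for i in range(k):
            members = [p for p, label in zip(points, labels) if label == i]
            if members:
                updated.append(tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                updated.append(centroids[i])  # keep the centroid of an empty cluster
        if updated == centroids:
            break
        centroids = updated
    labels = assign(points, centroids)
    return sum(sq_dist(p, centroids[label]) for p, label in zip(points, labels))

# Made-up visitors described by two attributes, forming three visible groups.
points = [
    (1, 1), (1, 2), (2, 1), (2, 2),
    (8, 1), (8, 2), (9, 1), (9, 2),
    (4, 8), (4, 9), (5, 8), (5, 9),
]

inertias = {k: kmeans_inertia(points, k) for k in range(1, 7)}
for k, inertia in inertias.items():
    print(k, round(inertia, 2))
```

On this toy data the inertia should fall sharply up to k = 3 and only slightly after that, which is the kind of "elbow" one looks for. Real GA4 data is rarely this clean, so the choice usually still needs judgment.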
There are other methods like hierarchical clustering where the number of clusters is not a required hyperparameter, but we still have to draw a line somewhere later because, in the end, we still want to work with a given set of clusters.
When presenting our findings, we want to use friendly names for segments like Runners, Basketball Lovers, etc., and not just a set of numbers. Doing this programmatically is tough, so a human needs to review the results and suggest segment names. It is worth noting that the path to full automation is within reach with the arrival of large language models.
As with the previous challenge, the end consumer needs to understand what each segment means and represents.
The challenge stems from the fact that we can cluster on many attributes and, depending on which ones we choose, end up with very different clusters. This makes attribute selection very important: the attributes that go in will eventually determine how we explain each cluster.
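A simple heuristic can at least draft names for a human to review. The sketch below assumes each cluster centroid is expressed as an average share of page views per product category. The centroid values, the 0.3 threshold, and the naming template are all made-up assumptions, not part of the MAJ submission.

```python
# Hypothetical centroids: average share of page views per product category.
centroids = {
    0: {"running": 0.72, "basketball": 0.08, "yoga": 0.20},
    1: {"running": 0.10, "basketball": 0.81, "yoga": 0.09},
    2: {"running": 0.45, "basketball": 0.44, "yoga": 0.11},
}

def suggest_name(centroid, threshold=0.3):
    """Draft a segment name from the categories whose share exceeds the threshold."""
    dominant = sorted(
        (cat for cat, share in centroid.items() if share >= threshold),
        key=lambda cat: -centroid[cat],
    )
    if not dominant:
        return "Mixed Interests"
    return " & ".join(cat.title() for cat in dominant) + " Fans"

suggestions = {idx: suggest_name(c) for idx, c in centroids.items()}
print(suggestions)
# {0: 'Running Fans', 1: 'Basketball Fans', 2: 'Running & Basketball Fans'}
```

These drafts are only a starting point. A reviewer still has to confirm that each name matches how the business thinks about its customers.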
Collecting every conceivable attribute for analysis may lead to complexity in interpreting what each segment means for the business. In contrast, selecting a concise, focused group of attributes from the same “domain” can yield results that are simpler to understand and potentially more practical.
The classic example would be interest-based segmentation vs. engagement-based segmentation. By interest-based attributes, we mean what the visitors have been viewing on the website, in terms of the website groups or product categories they browse.
Engagement-based attributes are things like time spent on the site, number of page views, and counts of specific events and purchases.
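To make the two domains concrete, the sketch below builds both attribute sets from the same made-up event rows and keeps them in separate structures. The row layout and field names (`visitor`, `event`, `category`) are assumptions for illustration; `page_view` and `purchase` are used here as example event names.

```python
from collections import Counter, defaultdict

# Made-up, already-flattened event rows.
events = [
    {"visitor": "a", "event": "page_view", "category": "running"},
    {"visitor": "a", "event": "page_view", "category": "running"},
    {"visitor": "a", "event": "page_view", "category": "running"},
    {"visitor": "a", "event": "page_view", "category": "basketball"},
    {"visitor": "a", "event": "purchase", "category": "running"},
    {"visitor": "b", "event": "page_view", "category": "basketball"},
]

views = defaultdict(Counter)         # visitor -> page views per category
event_counts = defaultdict(Counter)  # visitor -> count per event name

for e in events:
    event_counts[e["visitor"]][e["event"]] += 1
    if e["event"] == "page_view":
        views[e["visitor"]][e["category"]] += 1

# Interest-based attributes: share of page views per product category.
interest = {
    v: {cat: n / sum(c.values()) for cat, n in c.items()}
    for v, c in views.items()
}

# Engagement-based attributes: raw activity counts.
engagement = {
    v: {"page_views": c["page_view"], "purchases": c["purchase"]}
    for v, c in event_counts.items()
}

print(interest["a"])    # {'running': 0.75, 'basketball': 0.25}
print(engagement["a"])  # {'page_views': 4, 'purchases': 1}
```

Keeping the two sets apart makes it an explicit decision whether a given clustering run uses one domain or both.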
If we were to mix the attributes from both, it would become harder and harder to extract business value. This is not to say that mixing is never the right approach, but explaining the resulting clusters can get messy, as they might not be what you would expect.
In more technical terms, each attribute typically carries the same weight in the eyes of a clustering algorithm like k-means clustering. Thus, when mixing attributes from different domains, you may hope to get clusters like:
But this is rarely the case. You may get one of those segments, for example Engaged Runners, but not the other two. There are various reasons for that, but generally, the other two might be too mixed with other segments to ‘survive’ on their own.
Does that mean all visitors on the website who are interested in running are engaged, and there is no one interested in running who is more of a casual browser?
The point is that it’s hard to control what comes out of the clustering algorithm, but you can control what attributes go in. Being more specific and intentional about what goes in will give you a more controlled output.
There is no guarantee that the clustering algorithm will produce the same results when retrained on fresher data, because training starts from a random initialization. Or the clusters might be the same but have different indexes, meaning what previously was cluster 0 (perhaps Engaged Runners) is now cluster 3. They might have the same characteristics, but due to random initialization, the indexes switched. This would confuse the people using the clusters, because the meaning of each index would change from one retraining to the next.
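One possible way to handle the index switching, not necessarily the approach used in our submission, is to match each newly trained centroid to the closest centroid from the previous run and relabel accordingly. The sketch below tries every permutation, which is only practical for a small number of clusters; the centroid values are made up.

```python
from itertools import permutations

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def align_indexes(old_centroids, new_centroids):
    """Return {new_index: old_index} minimizing the total centroid distance.

    Brute force over all permutations, so only suitable for small k.
    """
    k = len(old_centroids)
    best = min(
        permutations(range(k)),
        key=lambda perm: sum(
            sq_dist(new_centroids[i], old_centroids[perm[i]]) for i in range(k)
        ),
    )
    return dict(enumerate(best))

# Made-up centroids from two training runs: same segments, shuffled indexes.
old_centroids = [(0.70, 0.10), (0.10, 0.80), (0.45, 0.45)]
new_centroids = [(0.12, 0.79), (0.44, 0.47), (0.71, 0.09)]

mapping = align_indexes(old_centroids, new_centroids)
print(mapping)  # {0: 1, 1: 2, 2: 0}

# Relabel the new run's assignments so they use the previous run's indexes.
new_labels = [0, 2, 1, 0]
stable_labels = [mapping[label] for label in new_labels]
print(stable_labels)  # [1, 0, 2, 1]
```

This only repairs the numbering. If the clusters themselves have drifted, the matched centroids will be far apart, and that distance is worth monitoring before reusing the old names.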
In conclusion, we have discussed the challenges of audience segmentation using Google Analytics 4 (GA4) data. These include determining the optimal number of segments, naming the segments meaningfully, ensuring that the segments are explainable, and keeping them stable across retraining. In our next article, we'll provide possible solutions to these challenges and go into much more detail about the math and the approach we used in our open-source submission to Google's Marketing Analytics Jumpstart on GitHub.