Our secure and easy-to-use client portal lets you access hundreds of on-demand training sessions, download our proprietary tools and manage your account and billing information. Not client? Take a look around to see what you can expect when you partner with Adswerve.
SELECT SUM(totals.visits) as Sessions, COUNT(DISTINCT fullVisitorId) as Users, SUM(totals.pageviews) as PageviewsFROM `bigquery-public-data.google_analytics_sample.ga_sessions_201706*`WHERE dateBETWEEN '20170601'AND '20170630'
We get this in BigQuery:And this in Google Analytics:Users is only 1.2% off, but for a perfectionist, it's 100% annoying. You can breathe easy because there is a perfectly rational explanation.
Google Analytics Estimates Users in GA
Google relies heavily on pre-aggregated tables in order to serve you data in a timely, efficient manner. If I'm looking at sessions or pageviews from June 1st – June 30th, Google may serve this data onto standard reports from a pre-aggregated monthly table. But If I change the dates to June 1st – July 1st, GA can simply add the sessions or pageviews from a daily pre-aggregated July 1st table to the sessions and pageviews from the previous June table. However, the users calculation don't work in the same way. It's the count of distinct visitors based on their client ID. One can't simply add users for June and the users for July 1st and get total users. It requires a full recalculation over the full range of dates in order to accommodate for returning visitors.Since calculating unique users is rather resource-intensive, and they don't want to leave their users waiting for reports to load all day, Google uses an improved version of HyperLogLog (HyperLogLog ++), a probabilistic cardinality estimation algorithm. The algorithm can typically approximate distinct elements within 2% error. They use this for many of the primary reports; however you can still get a total accurate user count in unsampled reports.Speed Test HyperLogLog++ in BigQuery You can speed test HLL++ in BigQuery if you like. Try the full calculation first on the Github public data:
You have an approximation with an error of only ~0.5% and in ~55% of the time.
If you're finding a small discrepancy between GA users vs. BQ users, just know that you're probably comparing your perfect BigQuery user calculation to a close, but not exact, approximation in GA. And if you want the warm feeling of reassurance, export an unsampled report in GA and you'll see that the number do in fact match.