Ready to Flex Your BigQuery Muscles?
Welcome to Adswerve's 31 Days of BigQuery, where each day in May brings a new mini challenge to sharpen your skills, stretch your brain, and maybe even spark a little friendly competition. Whether you’re a BigQuery beginner or a seasoned SQL wizard, there’s something here for you.
These challenges were thoughtfully crafted (and solved!) by Adswerve’s Technical Services team—the behind-the-scenes problem solvers who build, architect and scale data solutions for our clients every day. Now, we're sharing that expertise with you.
Each day in May, we will post a new 10-minute BigQuery challenge for you to solve, with the solution provided the following day. The challenges will cover a wide variety of BigQuery features and will become more difficult as the month progresses. Feel free to reach out via our social channels with any questions.
As with many data problems, most of the challenges may have more than one solution; we would be excited to hear if your solution differs from ours.
Day 1 - Create your own BigQuery Instance
Challenge:
For this first challenge, set up your own Google Cloud project, enable BigQuery, and query a public dataset. BigQuery is available in a free "sandbox" environment, which limits you to 10 GB of storage and 1 TB of queries per month.
When ready, use standard SQL in BigQuery Studio to retrieve the top 5 most popular names in the US between 2010 and 2019 using the publicly available table `bigquery-public-data.usa_names.usa_1910_current`.
Tip
BigQuery comes with a large amount of publicly available data. I encourage you to explore these datasets; many have come in handy in the past.
Resources
https://cloud.google.com/bigquery/public-data
https://cloud.google.com/bigquery/docs/sandbox
https://console.cloud.google.com/marketplace/product/social-security-administration/us-names
https://www.w3schools.com/sql/
Solution
The 5 most popular names in the US during the period of 2010 to 2019 are Emma, Olivia, Noah, Sophia and Liam.
A possible SQL approach:
#1 Challenge
SELECT name, sum(number) as number
FROM `bigquery-public-data.usa_names.usa_1910_current`
WHERE year BETWEEN 2010 and 2019
GROUP BY 1
ORDER BY 2 DESC
LIMIT 5
Day 2 - Quickly Chart Query Outputs
On day two, we will explore an underutilized (personal opinion) feature hidden in the query output. While BigQuery allows you to open your query results in Looker Studio, Python notebooks, and data canvas, sometimes the quickest way to visualize is only a couple of clicks away in the "Chart" tab.
Challenge
In this challenge, build a chart using the public dataset (`bigquery-public-data.usa_names.usa_1910_current`) from the first challenge that visualizes the popularity of your name and a few other names throughout the years. See the screenshot below as an example of the expected output.
Tip
You will notice that while the Chart function is easy to use, it comes with limitations, such as allowing only a single dimension (though up to 5 measures).
Solution
Because the built-in charting solution only allows a single dimension and multiple measures, I added each of the observed names as a SUM(IF(...)) measure. Make sure to order by year for the visualization to render as expected.
#2 Challenge
SELECT
year,
sum(if(name = "Luka", number, 0)) as Luka ,
sum(if(name = "Peter", number, 0)) as Peter,
sum(if(name = "Luca", number, 0)) as Luca ,
FROM `bigquery-public-data.usa_names.usa_1910_current`
GROUP BY ALL
ORDER BY year ASC
Day 3 - Query Costs
Understanding query costs is crucial for managing your overall BigQuery expenses. While the primary BQ charges are for storage and analysis (querying), advanced features like streaming inserts, BI Engine, Storage API, Notebook Runtimes, and BigQuery reservations can also contribute to costs as your usage evolves.
Challenge
Estimate the cost of the queries you wrote on days 1 and 2. In the on-demand pricing model (the most common), the cost is determined by the amount of data scanned by a query. For valid queries written in the UI, the query validator below the editor shows an estimate of how much data will be scanned (see the screenshot).
Queries are charged at $6.25 per TiB (roughly $0.006 per GiB) of data scanned.
Explore more
See how the price of queries changes with more or fewer columns selected, and what happens when you add filters and limits. Think about how BigQuery’s columnar storage impacts the price of these queries.
Resources:
https://cloud.google.com/bigquery/docs/storage_overview
https://cloud.google.com/bigquery/pricing
Solution
The queries used in the previous challenges both scan 143.71 MB, which translates to roughly 0.00014371 TB:
0.00014371TB x $6.25/TB = $0.0008981875
Adding more columns increases the amount of data scanned, while LIMIT and WHERE filters do not reduce the cost, because BigQuery still scans every selected column in full.
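You can see the columnar behavior in the validator without running anything. For example, compare the estimates for these two variations: the second scans more data simply because it selects more columns, while the LIMIT changes nothing.
#Compare the validator estimates for these two variations (illustration only):
SELECT name
FROM `bigquery-public-data.usa_names.usa_1910_current`;

SELECT name, gender, year, number
FROM `bigquery-public-data.usa_names.usa_1910_current`
LIMIT 10;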
Day 4 - Table Date Sharding
BigQuery supports table sharding as a way to divide your data, often seen with daily exports like Google Analytics 4 (events_YYYYMMDD). While you can query these daily tables individually, BigQuery lets you query them collectively using a wildcard (*) combined with the _TABLE_SUFFIX pseudo-column to filter specific date ranges.
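For example, the pattern looks like this (an illustrative sketch over one week of shards, not the challenge range):
#Wildcard + _TABLE_SUFFIX pattern (illustration only):
SELECT COUNT(*) AS events
FROM `bigquery-public-data.ga4_obfuscated_sample_ecommerce.events_*`
WHERE _TABLE_SUFFIX BETWEEN "20201101" AND "20201107"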
Challenge
Using the public Google Analytics 4 sample dataset available at `bigquery-public-data.ga4_obfuscated_sample_ecommerce`, run a query that returns the number of sessions (session_start events) between December 1 and December 31, 2020.
Explore more
Keep in mind that while sharding is common in some exports, partitioning is generally the recommended approach in BigQuery for benefits like performance and easier management. We'll explore partitioning soon!
Resources
https://lnkd.in/d6tc5p8G
https://lnkd.in/daftABea
Solution
#4 Challenge
SELECT
COUNTIF(event_name = "session_start") as sessions
FROM
`bigquery-public-data.ga4_obfuscated_sample_ecommerce.events_*`
WHERE
_table_suffix BETWEEN "20201201" AND "20201231"
Day 5 - Repeated Fields
The use of repeated fields in BigQuery is one of the biggest differences when comparing BQ to traditional (relational) SQL databases such as MySQL, PostgreSQL, etc. The array field in BigQuery not only introduces a new way to store data, but also challenges the need to normalize tables.
For example, a table of customers with their phone numbers, where each customer can have multiple phone numbers, would require two separate tables in a traditional database (per first normal form), whereas in BigQuery, customers' phone numbers can be stored in an Array (repeated) field for each customer.
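A minimal sketch of that phone-number example, using an inline array so it runs anywhere (the customer IDs and numbers are made up):
#Hypothetical customers with a repeated phone_numbers field, flattened with UNNEST:
WITH customers AS (
  SELECT 1 AS customer_id, ["555-0100", "555-0101"] AS phone_numbers UNION ALL
  SELECT 2, ["555-0199"]
)
SELECT customer_id, phone
FROM customers, UNNEST(phone_numbers) AS phone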
In the previous challenge, we explored the Google Analytics 4 sharded tables. The events table stores event parameters as an array (repeated field) of structs (containers of multiple fields). In practice, each Google Analytics event has multiple event parameters. An array of structs is almost like a table within each row (event) and can be queried like one, for example with a subquery over UNNEST.
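One way to get a feel for this structure (without solving the challenge below) is to list the parameter keys attached to a single event type:
#List the event parameter keys recorded for page_view events:
SELECT DISTINCT ep.key AS param_key
FROM `bigquery-public-data.ga4_obfuscated_sample_ecommerce.events_20210131`, UNNEST(event_params) AS ep
WHERE event_name = "page_view"
ORDER BY param_key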
Challenge
Complete the following query to produce an output of pageviews per page location for January 31, 2021, from the Google Analytics 4 sample table.
#5 Challenge
SELECT
(SELECT _________ FROM unnest(__________) WHERE key = "page_location") as page_location,
countif(event_name = "page_view") as page_views
FROM
`bigquery-public-data.ga4_obfuscated_sample_ecommerce.events_20210131`
GROUP BY page_location
Explore more
The event table has multiple repeated fields. Can you find them all? Why was an array used to represent those fields?
Resources
Work with arrays - https://lnkd.in/d9EWUi6K
Google Analytics 4 export schema - https://lnkd.in/d_r7dQGw
Solution
We were looking for value.string_value and event_params to fill the blanks. In the Google Analytics export's event parameters, values can come as different data types, which is why, inside the value struct, we specifically reference string_value.
SELECT
(SELECT value.string_value FROM unnest(event_params) WHERE key = "page_location") as page_location,
countif(event_name = "page_view") as page_views
FROM
`bigquery-public-data.ga4_obfuscated_sample_ecommerce.events_20210131`
GROUP BY page_location
Day 6 - Window Functions & CTEs
On day 5 we looked at how repeated fields (Arrays) work in BigQuery. Today, let's tackle Window Functions and Common Table Expressions (CTEs).
- Window Functions: perform calculations across a set of rows that are related to the current row. Unlike aggregate functions such as SUM, COUNT, and AVG, they do not collapse the rows; instead, they return a value for each row based on the “window” it belongs to. Today, we’ll use them to create a new dimension for the Google Analytics 4 sample that numbers the events within a session.
- CTEs, or Common Table Expressions, defined using the WITH clause, allow you to create a temporary, named query that can be referenced within the same statement like any other table. This is incredibly useful for breaking down complex queries into more manageable and readable parts.
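Before the challenge, here is a quick, generic illustration using the Day 1 public table (rather than the GA4 data) of a CTE feeding a window function:
#A CTE plus a window function: rank names by popularity within each year
WITH yearly AS (
  SELECT name, year, SUM(number) AS total
  FROM `bigquery-public-data.usa_names.usa_1910_current`
  GROUP BY name, year
)
SELECT name, year, total,
  RANK() OVER (PARTITION BY year ORDER BY total DESC) AS popularity_rank
FROM yearly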
Challenge
Use a CTE to define a base query. The base query should include event_timestamp, event_name, geo.region, device.web_info.browser, user_pseudo_id (unique user identifier) and ga_session_number (an integer value from the event_params repeated field with the key ga_session_number). The GA session number indicates how many sessions a user has had on your website or app (a counter that increases with each visit).
Then, number the events within each session (starting with 1 for the first event in the session), ordered by event_timestamp.
Tip
If you’re new to window functions, the resource below has great examples to get you started.
Resources
Window Functions in BigQuery - https://cloud.google.com/bigquery/docs/reference/standard-sql/window-function-calls
Solution
WITH base AS (
SELECT
event_timestamp,
geo.region,
device.web_info.browser,
event_name,
user_pseudo_id,
(SELECT value.int_value FROM unnest(event_params) WHERE key = "ga_session_number") as ga_session_number
FROM `bigquery-public-data.ga4_obfuscated_sample_ecommerce.events_20210131`
)
SELECT
*,
ROW_NUMBER() OVER(PARTITION BY user_pseudo_id, ga_session_number ORDER BY event_timestamp) as hit_number
FROM base
Day 7 - Data Definition Language
Data definition language (DDL) allows you to create, delete, and alter resources in BigQuery. Think of DDL as the part of SQL that provides a way to define and manage the structural aspects of your data warehouse (BigQuery). In BigQuery, many resources can be controlled through the UI; however, knowing how to build and manage resources using DDL will help you work faster and allow you to automate some of these processes.
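The challenge and solution below focus on CREATE statements; altering and removing resources follow the same DDL pattern. A quick sketch with hypothetical dataset and table names:
#Altering and dropping resources with DDL (hypothetical names):
ALTER TABLE bqchallenges.demo_table
SET OPTIONS (description = "Table created for the Day 7 challenge");

DROP TABLE IF EXISTS bqchallenges.demo_table;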
Challenge
Create a dataset (schema) and a table using DDL. Create the table in the newly created dataset from the query used in the previous challenge, using the AS keyword in the CREATE TABLE statement.
Tips
You will have to use the following two DDL commands:
- CREATE SCHEMA
- CREATE TABLE AS SELECT
Resources
https://cloud.google.com/bigquery/docs/reference/standard-sql/data-definition-language
Solution
#Create a dataset:
CREATE SCHEMA bqchallenges;
#Create a table based on a query
CREATE TABLE
bqchallenges.ga_data
AS
WITH base AS (
SELECT
event_timestamp,
geo.region,
device.web_info.browser,
event_name,
user_pseudo_id,
(SELECT value.int_value FROM unnest(event_params) WHERE key = "ga_session_number") as ga_session_number
FROM
`bigquery-public-data.ga4_obfuscated_sample_ecommerce.events_20210131`
)
SELECT
*,
ROW_NUMBER() OVER(PARTITION BY user_pseudo_id, ga_session_number ORDER BY event_timestamp) as hit_number
FROM base;
Day 8 - Optimize with Partitioning & Clustering
On Day 3, we saw how query costs relate to the amount of data scanned by your query. Today, let's see how to optimize those costs and boost speed using partitioning and clustering!
Partitioning
Think of a partitioned table as a well-organized filing cabinet. Instead of one giant drawer, your table is divided into smaller, more manageable segments (partitions) based on the values in a specific partition column. When you query data and filter by the partition column (a date range), BigQuery can skip scanning irrelevant partitions entirely, leading to faster queries and lower costs.
In BigQuery, the partition column can be an integer, a time-unit, or the ingestion time. A table can have up to 10,000 partitions.
Clustering
If partitioning is like having separate drawers, clustering is like having alphabetically sorted files within those drawers. BigQuery sorts the data within storage blocks based on the values in one or more clustering columns. When you filter or aggregate on these columns, BigQuery can more efficiently locate the relevant blocks and rows. This is especially beneficial for columns with high cardinality that you frequently filter on.
In BigQuery you can specify up to four clustering columns. Unlike partitioning, there's no limit to the number of unique values in clustering columns. Take advantage of both for optimal performance!
Challenge
Create a partitioned and clustered table by slightly rewriting yesterday’s solution to include all the events_ tables using the wildcard notation. Then transform the table suffix to a date type using parse_date("%Y%m%d", _table_suffix) as table_date so it can be used as the partition column. Use event_name, browser, and region as clustering columns.
Explore More
See how filtering by these columns when querying the newly created table changes the amount of data scanned. How does the bytes-scanned estimate compare when filtering on the partition column versus the clustering columns?
Resources
Partitioning - https://lnkd.in/gNRQwsFm
Clustering - https://lnkd.in/dhsjYDZp
Solution
CREATE TABLE bqchallenges.ga_data_par_clu
PARTITION BY table_date
CLUSTER BY event_name, browser, region
AS
WITH base AS (
SELECT
parse_date("%Y%m%d", _table_suffix) as table_date,
event_timestamp,
geo.region,
device.web_info.browser,
event_name,
user_pseudo_id,
(SELECT value.int_value FROM unnest(event_params) WHERE key = "ga_session_number") as ga_session_number
FROM `bigquery-public-data.ga4_obfuscated_sample_ecommerce.events_*`
)
SELECT
*,
ROW_NUMBER() OVER(PARTITION BY user_pseudo_id, ga_session_number ORDER BY event_timestamp) as hit_number
FROM base
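To see the benefit, compare the bytes-scanned estimate for filtered queries like the sketch below: the date filter lets BigQuery prune partitions, while the event_name filter leans on clustering at execution time.
#Partition filter prunes whole partitions; the clustering column narrows the blocks scanned:
SELECT COUNT(*) AS events
FROM bqchallenges.ga_data_par_clu
WHERE table_date BETWEEN DATE "2021-01-01" AND DATE "2021-01-07"
AND event_name = "page_view"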
Day 9 - Setting up hard limits
Despite careful query cost management and adherence to best practices, unexpected spikes can still occur. To prevent runaway costs, BigQuery lets you set hard limits on the amount of data scanned per day, at both the project and project-user levels.
Challenge
Today's challenge is simple. In the Quotas & System Limits section of your GCP project, set a daily query usage limit.
Solution
- Open your GCP project
- Navigate to Quotas and System Limits
- Filter for query usage
- Check the quota you'd like to implement (per day or per user per day) and click edit
- Update the value (uncheck unlimited) and click submit request
Day 10 - Unleash the Power of External Tables with BigQuery!
BigQuery offers a fantastic way to seamlessly integrate with external data sources, like Google Sheets. This is a very effective way of collaborating with teams who prefer managing data outside of BigQuery and SQL, but want to see the impact of their changes immediately.
Challenge
In this challenge, set up an external Google Sheets table by following the documentation https://cloud.google.com/bigquery/docs/external-data-drive and join it with a public dataset.
An example would be providing a margin for each item category in a Google Sheets document and then joining that data with the public GA4 data to calculate profit, and not just the revenue of items sold (screenshot below).
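One possible shape for that setup, as a hedged sketch: the Sheet URL, dataset, and column names below are placeholders, and the items fields follow the GA4 export schema.
#Define the external table over the Sheet (placeholder URI and schema):
CREATE EXTERNAL TABLE bqchallenges.category_margins (
  item_category STRING,
  margin FLOAT64
)
OPTIONS (
  format = 'GOOGLE_SHEETS',
  uris = ['https://docs.google.com/spreadsheets/d/your-sheet-id'],
  skip_leading_rows = 1
);

#Join it with the GA4 items array to estimate profit per category:
SELECT
  i.item_category,
  SUM(i.item_revenue) AS revenue,
  SUM(i.item_revenue * m.margin) AS profit
FROM `bigquery-public-data.ga4_obfuscated_sample_ecommerce.events_20210131`, UNNEST(items) AS i
JOIN bqchallenges.category_margins AS m
ON i.item_category = m.item_category
GROUP BY i.item_category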
Explore More
Test the live connection and check how changes in the document immediately impact the outputs of your queries.
Solution
Follow the provided documentation to create a new external table. Any query combining the newly created table with a public dataset works; the screenshot above shows an example combining a financial Google Sheet with the GA4 sample dataset.
Day 11 - The Information Schema
The BigQuery Information Schema provides invaluable insights into the metadata and execution history of your BigQuery resources. By querying these system-defined views, you can gain detailed information about your datasets, tables, job execution, and more.
Challenge
Review queries you ran in your project in May so far and sort them from the most to the least expensive.
- You'll primarily need to query the INFORMATION_SCHEMA.JOBS_BY_PROJECT view.
- Look for the total_bytes_billed column to determine query cost
- The query column contains the actual SQL text of the executed job
- The creation_time column can be used to filter to May
Resources
Information Schema Views - https://lnkd.in/dDfpXhca
Information Schema Jobs - https://lnkd.in/dExxvrFP
Solution
SELECT
user_email,
query,
total_bytes_billed
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time BETWEEN "2025-05-01" AND "2025-05-31"
AND job_type = "QUERY"
ORDER BY total_bytes_billed DESC
LIMIT 10
Day 12 - Storage Billing Model
Another underutilized, and possibly massively beneficial, feature in BigQuery is the ability to decide how a dataset is charged for storage. The standard rate is $0.02 per GB per month, dropping to $0.01 per GB per month once a table or partition has gone 90 consecutive days without modification. This pricing applies to the more commonly used logical (uncompressed) storage billing. However, BigQuery also allows a dataset to be billed on physical (compressed) storage, plus time travel bytes, at double those rates.
For analytical and denormalized tables, physical storage can be significantly smaller (often 10-20x!), potentially leading to 5-10x storage cost reductions. As an example, the table in the second screenshot is using 21.94GB of logical storage (uncompressed) but only 1.13GB physical – that's a massive 90% potential saving!
Table id: bigquery-public-data.samples.natality
Dataset storage billing can be changed at any point and does not impact the query execution.
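To get a rough feel for the gap in your own project, the INFORMATION_SCHEMA TABLE_STORAGE view exposes both byte counts; a quick sketch (adjust the region qualifier to match where your data lives):
#Compare logical vs. physical bytes per dataset:
SELECT
  table_schema AS dataset,
  ROUND(SUM(total_logical_bytes) / POW(1024, 3), 2) AS logical_gib,
  ROUND(SUM(total_physical_bytes) / POW(1024, 3), 2) AS physical_gib
FROM `region-us`.INFORMATION_SCHEMA.TABLE_STORAGE
GROUP BY dataset
ORDER BY logical_gib DESC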
The query in the screenshot below is one of the most magical pieces of the BigQuery documentation. Hidden in the INFORMATION_SCHEMA table storage documentation (https://lnkd.in/d3D_TMJc), it examines your datasets, determines which billing model is better for each one, and estimates the savings.
Challenge
Execute the query from the documentation in your GCP project and update the datasets that would benefit from switching the billing model using the query below:
ALTER SCHEMA DATASET_NAME
SET OPTIONS( storage_billing_model = 'BILLING_MODEL');
The billing model can be either PHYSICAL or LOGICAL.
Day 13 - GEO
BigQuery supports the geography data type, enabling you to work with spatial information directly within your data warehouse. This means you can store geographic points, lines, and polygons and apply specialized functions to them. These functions allow you to perform complex spatial operations, such as calculating distances and areas, checking for intersections, and determining containment.
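For a quick feel for these functions before the challenge, here is a one-off example (the Los Angeles coordinates are approximate):
#Distance in meters between Times Square and (roughly) downtown Los Angeles:
SELECT ST_DISTANCE(
  ST_GEOGPOINT(-73.985130, 40.758896),
  ST_GEOGPOINT(-118.243683, 34.052235)
) AS distance_in_meters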
For this example, we'll look into the public dataset "geo_us_boundaries" to test some of the geography functions on the polygons of different zip codes.
Challenge x 2
Start querying the bigquery-public-data.geo_us_boundaries.zip_codes table to solve the following two challenges:
- Using the st_area function on the zip code geography field (zip_code_geom), order zip codes from largest to smallest. What are the top 3 biggest zip codes*?
- Find the zip code farthest away from Times Square
ST_GEOGPOINT(-73.985130, 40.758896)
*The area function returns square meters
Resources
GEO functions: https://lnkd.in/gryAGF2T
Zip Code public data: https://lnkd.in/gX55fEsd
Solution
SELECT *
FROM `bigquery-public-data.geo_us_boundaries.zip_codes`
ORDER BY st_area(zip_code_geom) DESC
LIMIT 5;
SELECT
zip_code,
st_distance(ST_GEOGPOINT(-73.985130, 40.758896), zip_code_geom) as distance
FROM `bigquery-public-data.geo_us_boundaries.zip_codes`
ORDER BY distance DESC;
The top 22 largest zip codes in the USA are all in Alaska! The zip code farthest from Times Square in NYC is 96916.

Day 14 - Time Travel
BigQuery has a built-in "time machine" that allows you to access a snapshot of your table from any point within its time travel window (default is 7 days). This means you can easily recover from accidental or intentional data changes by simply specifying a past timestamp!
How it works in SQL:
Just append FOR SYSTEM_TIME AS OF followed by a timestamp to your table reference.
To reference a table state from 5h ago:
`my_dataset.my_table` FOR SYSTEM_TIME AS OF TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 5 HOUR);
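As a complete statement, the same pattern reads like this (hypothetical table name):
#Count the rows a table had five hours ago:
SELECT COUNT(*) AS row_count
FROM `my_dataset.my_table`
  FOR SYSTEM_TIME AS OF TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 5 HOUR)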
To start this challenge, first create a sample table in your project, for example by copying the Google Analytics 4 public table:
CREATE OR REPLACE TABLE `bqchallenges.my_time_travel_table`
AS
SELECT * FROM `bigquery-public-data.ga4_obfuscated_sample_ecommerce.events_20210131`;
After creating the table, note down the Unix milliseconds value of the current timestamp; this is the time we will be travelling back to.
SELECT unix_millis(current_timestamp());
Now, "accidentally" delete all the page views from the table:
DELETE FROM `bqchallenges.my_time_travel_table`
WHERE event_name = "page_view";
Challenge
1. Verify the change: Write a query using FOR SYSTEM_TIME AS OF TIMESTAMP_MILLIS([your_recorded_unix_millis_value]) to compare the row count of the table at its creation time with its current state.
2. Restore the past: Write a query to retrieve the table data as it was before the deletion.
Tip
- Remember this powerful feature! You can be a data hero by recovering accidentally modified or deleted data.
- For extreme emergencies, BigQuery tables also have a lesser-known fail-safe window, which retains deleted data for an additional 7 days after the time travel window ends; recovery requires contacting Google Cloud support.
Resources
System time as of: https://lnkd.in/d6yVbtGp
Access historical data: https://lnkd.in/d9xaj2dj
Fail safe: https://lnkd.in/ga_aDZRu