There are multiple factors that contribute to BigQuery cost. Generally, storage and processing (querying) tend to be the largest items on your bill at the end of the month. In this article, I will equip you with an easy way to analyze jobs (queries) within your organization.
Jobs Information Schema
Historically, analyzing jobs in BigQuery required several steps: you had to go through the billing export, logging, or the BigQuery API. With the introduction of the jobs information schema, this has become a far easier task.
Depending on your permission levels, you will have access to different information schema scopes:
– by user (view your own jobs)
– by project (view all the jobs in a project)
– by folder (view all the jobs for the projects in the folder)
– by organization (view all the jobs in the organization)
No matter the level of access, the schema of these views stays the same and can be reviewed here. Note that the views are partitioned on creation_time and clustered on project_id and user_email. Take advantage of those columns to make your queries more efficient.
Fields of Interest
Fields that I will use in my “quick-start” queries are:
– creation_time (creation time of the job/query)
– user_email (email of the user who ran the job/query)
– job_type (the type of job, to differentiate between queries, loads, copies, etc.)
– start_time and end_time (start and end time of a job, to calculate the query run time)
– query (the SQL text of the query, to identify which query exactly ran the longest, cost the most, etc.)
– total_bytes_billed (bytes that were billed to calculate the cost of a query*)
* In BigQuery, the processing cost is calculated based on the amount of data scanned (unless a slot reservation is used). After the initial free tier (1 TB per month), on-demand querying costs $5 per TB.
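As a quick sanity check on that arithmetic, here is a minimal sketch of the cost calculation the queries below perform in SQL; the $5/TB rate and the 1 TB = 1e12 bytes convention mirror those queries, and the function name is just illustrative:

```python
def query_cost_dollars(total_bytes_billed: int, rate_per_tb: float = 5.0) -> float:
    """Estimate the on-demand cost of a query from its bytes billed.

    Mirrors ROUND(SUM(total_bytes_billed)/1e12 * 5, 2) from the SQL below.
    """
    return round(total_bytes_billed / 1e12 * rate_per_tb, 2)

# A query that billed 250 GB costs about $1.25 on demand.
print(query_cost_dollars(250 * 10**9))  # → 1.25
```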
BQ Processing Cost Queries
The queries are written on a “project” scope. To access the information schema view you have to reference it with the region (`region-us`) followed by the dataset “INFORMATION_SCHEMA”, followed by the view, which depending on the scope is JOBS_BY_USER, JOBS_BY_PROJECT, JOBS_BY_FOLDER or JOBS_BY_ORGANIZATION.
Query Cost by User Over the Last 30 days
SELECT
  user_email,
  ROUND(SUM(total_bytes_billed) / 1e12 * 5, 2) AS costInDollars
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time BETWEEN TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY) AND CURRENT_TIMESTAMP()
  AND job_type = "QUERY"
GROUP BY 1
ORDER BY 2 DESC
Note that users sometimes authenticate BigQuery access in tools such as Tableau, Data Studio, local scripts, etc. You should be able to identify those by their often very specific SQL (or a different job ID format); however, the user running those queries/reports may not be the same user who authenticated BigQuery as a data source.
Size and SQL Code of the Most Expensive Query in the Last 30 days
SELECT
  user_email,
  query,
  total_bytes_billed
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time BETWEEN TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY) AND CURRENT_TIMESTAMP()
  AND job_type = "QUERY"
ORDER BY total_bytes_billed DESC
LIMIT 1
Queries That Ran the Longest
SELECT
  query,
  TIMESTAMP_DIFF(end_time, start_time, SECOND) AS run_time_in_seconds
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time BETWEEN TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY) AND CURRENT_TIMESTAMP()
  AND job_type = "QUERY"
ORDER BY run_time_in_seconds DESC
LIMIT 5
Other Information Schema Views
The views that provide access to job history are not the only information schema views available in BigQuery. Here is the full list: