Saving money is essential to almost everything we do. So when it comes to BigQuery — a data warehouse that is part of the Google Cloud Platform — it’s not surprising that saving money on BigQuery storage can be a game-changer.
As you may already know, BigQuery’s potential to turn big data into calculated business decisions is seemingly limitless. It allows you to load your data from various platforms into one centralized data warehouse, where you can then discover significant insights through in-depth data analysis.
While these projects allow you to query massive datasets in seconds to uncover user behavior patterns that are hard to notice in standard reporting, they also collect lots of data in storage as time goes on. And while the cost of storage per GB will stay consistent (and inexpensive), the total BigQuery storage cost can increase. To help keep those costs down, here are seven tips for saving money on BigQuery storage.
Tip 1: Estimate
Before changing anything, estimate how much data you store and what it costs each month, so you know where the biggest savings opportunities are.
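One way to make that estimate, sketched below under the assumption of a dataset named `mydataset` (a hypothetical name): the `__TABLES__` metadata view reports each table's size in bytes, which you can convert to GB and multiply by your storage rate.

```python
# Sketch: find the largest tables in a dataset. `mydataset` is a
# hypothetical name -- substitute your own dataset.
SIZE_QUERY = """
SELECT
  table_id,
  ROUND(size_bytes / POW(1024, 3), 2) AS size_gb
FROM mydataset.__TABLES__
ORDER BY size_bytes DESC
"""

def largest_tables():
    # Lazy import so the query text can be reviewed without cloud credentials.
    from google.cloud import bigquery
    client = bigquery.Client()
    return list(client.query(SIZE_QUERY).result())
```

Sorting by `size_bytes` descending surfaces the tables where cleanup, partitioning, or archiving will pay off most.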
Tip 2: Clean
Do some “spring cleaning” in the project and datasets to find any tables or partitions you can delete or archive. Basically, don’t be a digital hoarder. For example, delete temporary tables you created for EDA (exploratory data analysis) if you don’t need them anymore.
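A hedged sketch of that spring cleaning: list tables whose names match a temporary-table convention and delete them. The `tmp_` prefix and the dataset name are assumptions; adjust them to your own naming scheme.

```python
def find_temp_tables(table_ids, prefix="tmp_"):
    """Return the table IDs that look like disposable EDA tables.

    The `tmp_` prefix is an assumed convention, not a BigQuery rule.
    """
    return [t for t in table_ids if t.startswith(prefix)]

def delete_temp_tables(dataset="myproject.mydataset"):
    # Lazy import keeps the helper above testable without credentials.
    from google.cloud import bigquery
    client = bigquery.Client()
    ids = [t.table_id for t in client.list_tables(dataset)]
    for table_id in find_temp_tables(ids):
        client.delete_table(f"{dataset}.{table_id}", not_found_ok=True)
```

Review the matched list before deleting; a dry-run print of `find_temp_tables(...)` is a cheap safeguard.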
Tip 3: Keep and partition
We recommend that you keep your data in BigQuery and partition your tables.
A partitioned table is a special table divided into segments, called partitions, that make it easier to manage and query your data. By dividing a large table into smaller partitions, you can:
- Improve query performance
- Control query costs by reducing the number of bytes read by a query, and
- Save on BigQuery storage cost
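A partitioned table can be created with a short DDL statement, sketched below. The table and column names are hypothetical, and the partition expiration shown is optional.

```python
# Sketch: create a date-partitioned table via DDL. `mydataset.events`
# and `event_ts` are hypothetical names; partition_expiration_days is
# optional and shown only as an example.
PARTITIONED_DDL = """
CREATE TABLE IF NOT EXISTS mydataset.events (
  event_ts TIMESTAMP,
  user_id  STRING,
  action   STRING
)
PARTITION BY DATE(event_ts)
OPTIONS (partition_expiration_days = 365)
"""

def create_partitioned_table():
    # Lazy import so the DDL stays reviewable offline.
    from google.cloud import bigquery
    bigquery.Client().query(PARTITIONED_DDL).result()
```

Because long-term pricing is tracked per partition, partitioning lets old, untouched slices of a table earn the discount even while new data keeps arriving.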
If a table or partition has not been edited for 90 consecutive days, BigQuery's long-term storage pricing automatically cuts the storage price for that table by 50 percent, to $0.01 per GB per month. In addition to savings, long-term storage ensures that:
- There is no degradation of performance, durability, availability, or any other functionality when a table or partition moves to long-term storage.
- Querying the table data, along with a few other actions, does not reset the 90-day timer, so the data continues to be billed at the long-term rate.
See BigQuery storage pricing for details.
- If you have 100 GB of data in BigQuery active storage, it will typically cost you $1.80 per month (the first 10 GB of storage each month are free).
- If your tables are partitioned, you can save on storage.
- If you haven’t modified 2/3 of your data for 90 days, you will save 43% on BigQuery storage.
- If you haven’t modified 4/5 (or 80%) of your data for 90 days, you will save 50%.
Example A: with 67% of your BigQuery data in long-term storage, you save 43%.
Example B: with 80% of your BigQuery data in long-term storage, you save 50%.
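The arithmetic can be sketched with a simplified pricing model. The $0.02/GB active rate, $0.01/GB long-term rate, and a 10 GB monthly free tier applied to active storage first are assumptions consistent with the $1.80 figure above; actual rates vary by region, so treat this as an estimate, not a bill.

```python
# Back-of-the-envelope BigQuery storage pricing model (assumed rates:
# $0.02/GB active, $0.01/GB long-term, first 10 GB free, free tier
# applied to active storage first).
ACTIVE_RATE, LONG_TERM_RATE, FREE_GB = 0.02, 0.01, 10.0

def monthly_storage_cost(active_gb, long_term_gb=0.0):
    free = min(FREE_GB, active_gb)          # free tier consumed by active GB first
    leftover_free = FREE_GB - free          # any remainder offsets long-term GB
    billable_active = active_gb - free
    billable_long_term = max(long_term_gb - leftover_free, 0.0)
    return round(billable_active * ACTIVE_RATE
                 + billable_long_term * LONG_TERM_RATE, 2)

print(monthly_storage_cost(100))       # all active         -> 1.8
print(monthly_storage_cost(20, 80))    # 80% long-term      -> 1.0
```

Plugging in your own active/long-term split shows how much of your bill the 50%-off long-term rate is already covering.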
Taking advantage of BigQuery’s long-term storage should give you significant savings. If you want to take it further, see the following Google Cloud Storage (GCS) archiving steps.
Tip 4: Expire
Use the expiration settings to remove unneeded tables and partitions.
- Configure expiration times for your datasets (default table expiration), tables, and partitions.
- Before expiration, tables can be archived in GCS. We will discuss this below.
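Expirations can be set with DDL, sketched below with hypothetical table names: the first statement makes every partition self-expire after 90 days, the second expires a whole table at a fixed timestamp.

```python
# Sketch: set expirations via DDL. Table names and the dates shown
# are hypothetical examples.
EXPIRE_PARTITIONS_DDL = """
ALTER TABLE mydataset.events
SET OPTIONS (partition_expiration_days = 90)
"""

EXPIRE_TABLE_DDL = """
ALTER TABLE mydataset.tmp_scratch
SET OPTIONS (expiration_timestamp = TIMESTAMP '2030-01-01 00:00:00 UTC')
"""

def apply_expirations():
    # Lazy import keeps the DDL reviewable offline.
    from google.cloud import bigquery
    client = bigquery.Client()
    for ddl in (EXPIRE_PARTITIONS_DDL, EXPIRE_TABLE_DDL):
        client.query(ddl).result()
```

Once set, expiration is enforced automatically, so stale partitions stop accruing storage cost without any manual cleanup.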
Tip 5: Classify
Classify your data into the following groups, based on the frequency of access:
- Data you actively access (at least once a month)
- Data you expect to access less than once a month (30 days)
- Data your business expects to access less than once a quarter (90 days)
- Data accessed less than once a year (365 days)
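These four groups map naturally onto the GCS storage classes used for archiving in the next tip. A small helper makes the mapping explicit (group 1 stays in BigQuery; the class names are real GCS classes, the thresholds are the ones listed above):

```python
def storage_class_for(days_since_last_access):
    """Map days since last access to the recommended storage tier."""
    if days_since_last_access < 30:
        return "BIGQUERY_ACTIVE"   # group 1: accessed at least monthly
    if days_since_last_access < 90:
        return "NEARLINE"          # group 2: less than once a month
    if days_since_last_access < 365:
        return "COLDLINE"          # group 3: less than once a quarter
    return "ARCHIVE"               # group 4: less than once a year

print(storage_class_for(45))   # → NEARLINE
```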
Tip 6: Archive
Before your tables and partitions expire, move data in classification groups #2, #3 and #4 into Nearline, Coldline and Archive GCS buckets.
- Create GCS buckets with default storage classes: Nearline, Coldline and Archive.
- Export BigQuery tables into GCS as CSV or JSON. Compress data with GZIP.
- CSV doesn’t support nested fields, but JSON does.
- Automate uploading your tables and partitions from BigQuery to GCS before the data expires.
- A possible solution is a Cloud Function, which exports compressed data from BigQuery into GCS. We will show you what this solution looks like below.
- Keep the archived data in GCS for at least the class's minimum storage duration (30 days for Nearline, 90 for Coldline, 365 for Archive) to avoid early-deletion charges.
- Access the archived GCS data by loading it into BigQuery.
- You can also query GCS as an external data source without loading it into BigQuery.
- An external data source is a data source that you can query directly even though the data is not stored in BigQuery. Instead of loading the data, you create a table that references the external data source.
- BigQuery offers support for querying data directly from Cloud Storage.
- Supported formats include CSV and JSON.
- You can also maintain and query externally partitioned (Hive-partitioned) tables, which let you retain partitioning in external GCS data and keep your query costs lower.
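An external table over the archived files can be defined with DDL, sketched below with a hypothetical bucket and path. BigQuery reads the gzipped CSV files in place, so nothing is re-loaded into BigQuery storage.

```python
# Sketch: query archived GCS data in place as an external table.
# Bucket name and path are hypothetical.
EXTERNAL_TABLE_DDL = """
CREATE EXTERNAL TABLE IF NOT EXISTS mydataset.events_archive
OPTIONS (
  format = 'CSV',
  compression = 'GZIP',
  uris = ['gs://my-archive-bucket/events/*.csv.gz']
)
"""

def create_external_table():
    # Lazy import keeps the DDL reviewable offline.
    from google.cloud import bigquery
    bigquery.Client().query(EXTERNAL_TABLE_DDL).result()
```

Queries against an external table are generally slower than against native storage, which is an acceptable trade-off for rarely accessed archives.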
Google Cloud Storage (GCS) Costs
Storage costs vary depending on your region; see the GCS pricing page for current rates.
Learn more about GCS:
- Intro to Cloud Storage for data archiving
- Storage and retrieval pricing
Archival BigQuery Storage Solution
Below is a design of a solution that will archive your BigQuery data in GCS.
- Cloud Scheduler sends a daily Pub/Sub message.
- Pub/Sub message triggers a Cloud Function.
- The Cloud Function exports yesterday’s partition to GCS as gzipped CSV or JSON.
- It can also backfill data from before yesterday.
- Once the BigQuery partition is archived in GCS, it expires in BigQuery.
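The steps above can be sketched as a Cloud Function. The bucket, dataset, and table names are hypothetical, and this is a minimal outline rather than a production implementation (it handles one table and skips backfill and error handling):

```python
import datetime

def partition_export_uri(table, day, bucket="my-archive-bucket"):
    """Build the partition decorator and destination URI for one day."""
    suffix = day.strftime("%Y%m%d")
    return f"{table}${suffix}", f"gs://{bucket}/{table}/{suffix}-*.csv.gz"

def archive_yesterday(event, context):
    # Entry point for a Pub/Sub-triggered Cloud Function.
    # Lazy import keeps the helper above testable without credentials.
    from google.cloud import bigquery
    client = bigquery.Client()
    day = datetime.date.today() - datetime.timedelta(days=1)
    source, dest = partition_export_uri("mydataset.events", day)
    job_config = bigquery.ExtractJobConfig(
        destination_format="CSV", compression="GZIP"
    )
    client.extract_table(source, dest, job_config=job_config).result()
```

The `table$YYYYMMDD` partition decorator selects a single day's partition for the extract, so each run archives exactly one slice of the table.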
Tip 7: Monitor
We suggest using the Cloud Billing console to monitor how your storage and access costs change after archiving, and to adjust your savings strategy as needed.
Interested in getting started in BigQuery? Adswerve is offering a $500/month credit to any of our existing or new Google Analytics 360 subscribers interested in leveraging the Google Cloud and Google Marketing Platforms together to meet their marketing goals. Contact us to learn more.