Google Cloud Next is an annual conference focused on Google Cloud Platform (GCP), where Google presents the latest features coming to the cloud. We get announcements of new features, updates on existing ones, and even new public betas that are ready for use. There are also hundreds of sessions, panels, and bootcamps to attend. One of our favorite parts of the conference: there are Googlers everywhere! You can connect directly with the product teams, and there are endless opportunities for interactive demos, discussions, and networking. This year's conference focused on three main topics: Machine Learning & Artificial Intelligence (AI), Data Analytics, and Application Development. We spent three days attending and are excited to share our highlights, which focus primarily on Machine Learning & AI. Here are our top five favorite announcements from Google Cloud Next.
Let's consider an example using stock data. Every stock is essentially one big time series, which is great because we can partition the data by timestamp. Imagine we have a table called “stock” that houses that data; for simplicity, the columns are “timestamp”, “stock_name” and “price”. The “timestamp” column is our partition column, and using it we can efficiently navigate date ranges. Usually we’d want to query data for a single stock, so we’d add a WHERE clause where “stock_name” equals “GOOGL”. That works, but BigQuery will do a full scan of the “stock_name” column (and every other column in the SELECT clause), reading a lot of stock names we don’t want before filtering down to the ones we do. With clustering, we can cluster on the “stock_name” column and, behind the scenes, BigQuery will store the data so that when we run the same query again we read only from the “GOOGL” cluster, avoiding full scans of the columns in the SELECT clause. BigQuery clustering is now in beta; see the BigQuery documentation to learn more.
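To make the difference concrete, here's a toy Python sketch of the stock example. It is only a mental model, not how BigQuery actually stores data: the rows, dates, and prices are made up, and the function names `query_partitioned` and `query_clustered` are ours. It simply counts how many rows each approach would have to read to answer the same "GOOGL on one day" query.

```python
from collections import defaultdict
from datetime import date

# Made-up rows for illustration only: (day, stock_name, price).
ROWS = [
    (date(2018, 1, 1), "GOOGL", 10.0),
    (date(2018, 1, 1), "AAPL", 20.0),
    (date(2018, 1, 1), "MSFT", 30.0),
    (date(2018, 1, 2), "GOOGL", 11.0),
    (date(2018, 1, 2), "AAPL", 21.0),
    (date(2018, 1, 2), "MSFT", 31.0),
]


def query_partitioned(rows, day, stock_name):
    """Partitioning only: read the whole day partition, then filter by name."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[row[0]].append(row)
    scanned = partitions.get(day, [])
    matches = [row for row in scanned if row[1] == stock_name]
    return matches, len(scanned)


def query_clustered(rows, day, stock_name):
    """Partitioning plus clustering: rows in each day partition are grouped
    by stock_name, so only the matching group is read."""
    clusters = defaultdict(list)
    for row in rows:
        clusters[(row[0], row[1])].append(row)
    scanned = clusters.get((day, stock_name), [])
    return list(scanned), len(scanned)


day = date(2018, 1, 2)
part_matches, part_scanned = query_partitioned(ROWS, day, "GOOGL")
clus_matches, clus_scanned = query_clustered(ROWS, day, "GOOGL")
print(part_scanned, clus_scanned)  # 3 rows scanned vs. 1
```

Both functions return the same matching rows; the only thing that changes is how much data gets read along the way, which is where the cost and performance savings come from.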
Number 1: BigQuery ML (Machine Learning)
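Before the details, here's a quick taste of what training and predicting with SQL might look like. This is a minimal sketch, not an official recipe: the helper functions, the model name `mydataset.price_model`, and the tables `mydataset.training_data` and `mydataset.new_data` are all hypothetical, and we wrap the SQL in small Python helpers only so the snippet runs on its own. The `CREATE MODEL` statement and `ML.PREDICT` function are the BigQuery ML pieces doing the real work.

```python
# The two model types available in the public beta, per this post.
SUPPORTED_MODEL_TYPES = {"linear_reg", "logistic_reg"}


def create_model_sql(model, model_type, label_column, training_table):
    """Build a BigQuery ML CREATE MODEL statement (all names are hypothetical)."""
    if model_type not in SUPPORTED_MODEL_TYPES:
        raise ValueError(f"unsupported model type: {model_type}")
    return (
        f"CREATE OR REPLACE MODEL `{model}`\n"
        f"OPTIONS (model_type = '{model_type}', "
        f"input_label_cols = ['{label_column}']) AS\n"
        f"SELECT * FROM `{training_table}`"
    )


def predict_sql(model, input_table):
    """Build a query that scores rows from input_table with a trained model."""
    return (
        "SELECT *\n"
        f"FROM ML.PREDICT(MODEL `{model}`, (SELECT * FROM `{input_table}`))"
    )


train = create_model_sql(
    "mydataset.price_model", "linear_reg", "price", "mydataset.training_data"
)
predict = predict_sql("mydataset.price_model", "mydataset.new_data")
print(train)
print(predict)
```

You could paste the printed statements into the BigQuery console, or hand them to a client library, to run them against your own dataset.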
BigQuery just got a huge update! We now have BigQuery ML, a way for users to create and execute machine learning models directly in BigQuery using standard SQL. We've been using BigQuery ML for a few months and it's awesome! Now that it's in public beta, you can use it, too! This is important because Google has made it very easy to train machine learning models inside BigQuery with just a few simple SQL-like statements. This means no more exporting data back and forth, building out separate TensorFlow models in Python, or trying to run off sample data on your local computer. Anyone with a basic understanding of SQL and machine learning now has a quick and easy way to get started. With everything staying inside BigQuery, tasks that were tedious in the past (like retraining, prediction and result analysis) become that much simpler. If you're getting started, note that BigQuery ML is currently limited to two basic machine learning algorithms: linear regression and logistic regression. So, for any complex custom models, TensorFlow and Cloud ML Engine are still the way to go. For a more in-depth look at BigQuery ML, see the documentation.
Number 2: BigQuery Clustering
The feature is called clustering, which may be a bit confusing because the name suggests the data science technique, but it actually has little to do with it. If you regularly use BigQuery, you know that you can partition tables either by ingestion time or by a partition column. Partitioning is nice because you can limit queries to those time-based partitions, which saves cost and improves performance. BigQuery clustering extends that idea, except now you can organize data by multiple columns that you frequently query. BigQuery sorts the data internally based on those columns and stores it separately. So, at query time, there is no need for a full column scan, only a scan of the cluster you want to read from. This brings big performance and cost benefits, which is a win for all! Below is a comparison of the two features:
Feature | Partitioning | Clustering |
Cardinality | Less than 10k | Unlimited |
Dry Run Pricing | Available | Not available |
Query Pricing | Exact | Best Effort |
Performance Overhead | Small | None |
Data Management | Like a Table | Use DML |
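To tie the two features together, here's a rough sketch of how the “stock” table from our example could be declared as both partitioned and clustered. The dataset name `mydataset` and the helper `clustered_table_ddl` are our own inventions for illustration, and the column types are assumptions; the `PARTITION BY` and `CLUSTER BY` clauses are the parts that matter.

```python
def clustered_table_ddl(table, partition_column, cluster_columns):
    """Build a CREATE TABLE statement for a date-partitioned, clustered table.

    The column list is hard-coded to the hypothetical stock table.
    """
    if not cluster_columns:
        raise ValueError("need at least one clustering column")
    cluster_list = ", ".join(f"`{col}`" for col in cluster_columns)
    return (
        f"CREATE TABLE `{table}` (\n"
        "  `timestamp` TIMESTAMP,\n"
        "  `stock_name` STRING,\n"
        "  `price` FLOAT64\n"
        ")\n"
        f"PARTITION BY DATE(`{partition_column}`)\n"
        f"CLUSTER BY {cluster_list}"
    )


ddl = clustered_table_ddl("mydataset.stock", "timestamp", ["stock_name"])
print(ddl)
```

With a table defined this way, a query that filters on both the date and “stock_name” can benefit from partition pruning and clustering at the same time.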