How does BigQuery work?

BigQuery is Google’s highly scalable, low-cost analytics data warehouse. It lets you run petabyte-scale queries and receive results in seconds, while providing the advantages of a fully managed service. Because it is built on a serverless architecture, you don’t have to worry about infrastructure or hardware administration. Additionally, BigQuery automatically replicates your data across multiple locations, which gives your business data high durability and consistent performance.

Unlike most other cloud-based data warehouses, this Google solution requires little to no upkeep, and you can access it from any location where Google Cloud is available. Another benefit of adopting Google BigQuery is that it integrates seamlessly, at no extra cost, with certain Google services such as Google Analytics and Google Drive. A range of tools has been developed to simplify, streamline, and visualize the ETL (Extract, Transform, Load) process, which plays a critical role in data analysis.

Today, businesses are learning how to use BigQuery ETL best practices to improve their workload processing capacity. In practice, this means applying up-to-date ideas and techniques that leverage the power of Google BigQuery and GCP across the entire pipeline.

BigQuery Optimization

BigQuery Internal Controls for Cost Optimization

To get started with BigQuery, you first load your data, which means you will be paying for storage, and then query and validate it before building models or reports. Let us look at the most essential practices for loading, querying, and storing data.

Storage Management

You can define an expiration time for tables or partitions if you only need current or recently updated data and can let historical data go. For example, you can set a partition expiration of 90 days, which means that any partition older than 90 days is deleted automatically.

Please note that this only applies to tables and partitions created after the policy was defined; if you want to apply it to older tables or partitions, you may have to recreate them with the policy.
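
As a minimal sketch of what this might look like with the Python client library (the table ID is hypothetical, and the table is assumed to already be date-partitioned):

```python
from google.cloud import bigquery

client = bigquery.Client()

# Hypothetical table ID; adjust to your project/dataset/table.
table = client.get_table("my-project.analytics.events")

# Expire each partition 90 days after its partitioning date.
table.time_partitioning.expiration_ms = 90 * 24 * 60 * 60 * 1000
client.update_table(table, ["time_partitioning"])
```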

BigQuery is built from the ground up to save money on stored data that is no longer in active use. By default, any table or partition that has not been modified for 90 consecutive days is moved to long-term storage, which costs $0.01 per GB per month, roughly 50% less than active storage pricing.

You may also export BigQuery data to Google Cloud Storage at a lower cost if you have older data that you need to keep around, whether for occasional reference or for general governance reasons. To estimate your storage costs, you can use the Pricing Calculator.
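
As a rough sketch, an export to Cloud Storage with the Python client might look like this (the table ID and bucket name are placeholders):

```python
from google.cloud import bigquery

client = bigquery.Client()

# Export a rarely used table to a GCS bucket (hypothetical names).
extract_job = client.extract_table(
    "my-project.analytics.events_2019",
    "gs://my-archive-bucket/events_2019/*.avro",
    job_config=bigquery.ExtractJobConfig(destination_format="AVRO"),
)
extract_job.result()  # wait for the export to complete
```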

Automatic View Creation

Generating a BigQuery view is quite similar to creating a table, except that the access-control component is more involved. Unlike tables, views do not hold actual data; they are simply saved queries. As a result, if a view that queries dataset A is created in dataset B, the view must be granted read access to dataset A (making it an authorized view), or the view will not be usable.

Thus, if a data pipeline automates view creation, it is recommended to automate granting those views access to their dependent datasets as part of the same pipeline, instead of doing it manually in the BigQuery UI. This keeps permissions consistent and prevents views from silently breaking.
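
A minimal sketch of automating this with the Python client, assuming hypothetical dataset and view names:

```python
from google.cloud import bigquery

client = bigquery.Client()

# Create (or reuse) the view in dataset B; names are hypothetical.
view = bigquery.Table("my-project.dataset_b.orders_view")
view.view_query = "SELECT order_id, amount FROM `my-project.dataset_a.orders`"
view = client.create_table(view, exists_ok=True)

# Authorize the view on source dataset A, so readers of the view
# do not need direct read access to dataset A itself.
source = client.get_dataset("my-project.dataset_a")
entries = list(source.access_entries)
entries.append(bigquery.AccessEntry(None, "view", view.reference.to_api_repr()))
source.access_entries = entries
client.update_dataset(source, ["access_entries"])
```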

GCP’s newest addition, BigQuery Studio (designed for big data management in organizations working with AI, ML, and LLMs), is an all-in-one analytics workspace that streamlines end-to-end analytics workflows, from data ingestion and transformation to sophisticated predictive analysis.

How to Reduce BigQuery Cost

Partitioning Tables

You can partition BigQuery data by a date column or by load time. When users filter a partitioned table on the partition column, BigQuery reads only the relevant partitions instead of performing a full scan, which improves query performance and lowers query cost.
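
As a sketch, creating a date-partitioned table with the Python client might look like this (the table ID and schema are hypothetical):

```python
from google.cloud import bigquery

client = bigquery.Client()

# Hypothetical table, partitioned by the `event_date` DATE column.
table = bigquery.Table(
    "my-project.analytics.events",
    schema=[
        bigquery.SchemaField("event_date", "DATE"),
        bigquery.SchemaField("user_id", "STRING"),
        bigquery.SchemaField("payload", "STRING"),
    ],
)
table.time_partitioning = bigquery.TimePartitioning(
    type_=bigquery.TimePartitioningType.DAY,
    field="event_date",  # omit `field` to partition by load time instead
)
client.create_table(table)
```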

Clustering Tables

Clustered columns can be used to organize data in BigQuery. For example, you can cluster employee data by department; when a query filters on a department, BigQuery reads only the storage blocks containing that department and skips the rest, avoiding wasteful data scans. Clustering also reduces sorting overhead when a query aggregates by the clustered columns.
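
Continuing the example, a sketch of adding clustering when creating a table (all names hypothetical):

```python
from google.cloud import bigquery

client = bigquery.Client()

table = bigquery.Table(
    "my-project.hr.employees",  # hypothetical table ID
    schema=[
        bigquery.SchemaField("employee_id", "STRING"),
        bigquery.SchemaField("department", "STRING"),
        bigquery.SchemaField("hire_date", "DATE"),
    ],
)
# Colocate rows with the same department so filters on it prune blocks.
table.clustering_fields = ["department"]
client.create_table(table)
```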

Nesting and Repeating Data

This is among the most significant ETL guiding principles for Google BigQuery. BigQuery performs best when the data is denormalized. Instead of preserving relational structure, denormalize the data and take advantage of nested and repeated fields. The Avro, Parquet, ORC, and JSON (newline-delimited) formats support nested and repeated fields. STRUCT is the type used to represent a nested object, while ARRAY is the type used to represent a repeated value.
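
As a minimal sketch of a denormalized schema with nested and repeated fields (table and field names are hypothetical):

```python
from google.cloud import bigquery

client = bigquery.Client()

# One row per order, with a repeated STRUCT holding its line items.
schema = [
    bigquery.SchemaField("order_id", "STRING"),
    bigquery.SchemaField(
        "line_items",
        "RECORD",          # RECORD maps to the STRUCT type
        mode="REPEATED",   # REPEATED maps to the ARRAY type
        fields=[
            bigquery.SchemaField("sku", "STRING"),
            bigquery.SchemaField("quantity", "INTEGER"),
        ],
    ),
]
client.create_table(bigquery.Table("my-project.sales.orders", schema=schema))
```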

Slots

In BigQuery, slots are the units of computing power used to execute a SQL query. Slots matter a great deal for pricing and resource allocation. BigQuery handles slot allocation itself, and the number of slots allotted to a job is determined by two factors:

  • Query Size – the amount of data the query processes.
  • Data Complexity – the amount of data that has to be shuffled.

Reducing query size reduces the amount of data processed and the associated costs. In addition to consumption-based (on-demand) pricing, BigQuery also offers flat-rate pricing, which lets you purchase a set number of slots for a set period of time, giving you more predictability in capacity planning.

Streaming Inserts

When data is inserted into a Google BigQuery table in batch mode, a load job is created to read the data and write it into the table. With streaming inserts, you can query data without having to wait for a load job to finish. Streaming inserts can be performed on any Google BigQuery table using the Cloud SDKs or other GCP APIs; a minimal sketch follows the list below. Ensure the following when performing streaming inserts:

  • Streamed data becomes available for querying within a few seconds of the first streaming insert into the table.
  • It might take up to 90 minutes for data to become accessible for copying and exporting.
  • When streaming to an ingestion-time partitioned table, the _PARTITIONTIME pseudo column is NULL while rows sit in the streaming buffer.
  • When streaming to a table partitioned on a DATE or TIMESTAMP column, the value in that column should be within 1 year in the past and 6 months in the future; rows outside this range are rejected.
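
Here is the promised sketch of a streaming insert with the Python client (table ID and row fields are hypothetical):

```python
from google.cloud import bigquery

client = bigquery.Client()

# Hypothetical table and row shape; rows are queryable within seconds.
rows = [
    {"user_id": "u123", "event_ts": "2024-01-15T10:00:00Z"},
    {"user_id": "u456", "event_ts": "2024-01-15T10:00:02Z"},
]
errors = client.insert_rows_json("my-project.analytics.events_stream", rows)
if errors:
    print(f"Some rows failed to insert: {errors}")
```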

BigQuery Resource Management Best Practices

BigQuery is a columnar database: at the physical level, each column of a table is stored independently, and you are billed for the columns you read. As a result, you should never use SELECT * in your queries; selecting only the columns you need limits the amount of data processed.

  • To look at sample data, use the table preview option instead of running a query. BigQuery’s preview function is completely free.
  • Before sending a query to production, estimate its cost: on-demand queries are priced by the number of bytes read, so a dry run (see the sketch after this list) or our BigQuery Pricing Calculator can tell you the price in advance.
  • User-defined functions written in JavaScript degrade overall performance; avoid them where a native SQL expression will do.
  • Certain joins, such as a Cartesian product, can produce more output rows than input rows and should therefore be avoided.
  • DML statements that update or insert single rows should be avoided; batch such changes instead.
  • Use the most granular prefix possible when querying wildcard tables. If you have tables with names like "TableName_YYYYMMDD" and only require data from 2020, query "TableName_2020*" instead of "TableName_*" followed by a filter on the year.
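
As a rough sketch of the dry run mentioned above, using the Python client (table name and columns are hypothetical); it also shows why selecting specific columns beats SELECT *:

```python
from google.cloud import bigquery

client = bigquery.Client()

def estimate_gb(sql: str) -> float:
    """Dry-run a query and return the gigabytes it would scan."""
    config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
    job = client.query(sql, job_config=config)  # nothing runs, nothing is billed
    return job.total_bytes_processed / 1e9

# Selecting only the columns you need scans far less than SELECT *.
print(estimate_gb("SELECT * FROM `my-project.analytics.events`"))
print(estimate_gb("SELECT user_id FROM `my-project.analytics.events`"))
```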

Avoid joining against the same subquery several times in one query. Instead, save the subquery’s result as an intermediate table and query it from there; storage is substantially cheaper than repeatedly recomputing the same query. When joining two tables, start with the biggest table for optimal performance.
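
For instance, a repeated subquery could be materialized once with a CREATE TABLE AS statement and reused by downstream queries; a rough sketch with hypothetical table names:

```python
from google.cloud import bigquery

client = bigquery.Client()

# Materialize the shared subquery once instead of recomputing it
# in every downstream query (all names here are hypothetical).
sql = """
CREATE OR REPLACE TABLE `my-project.analytics.daily_active_users` AS
SELECT event_date, COUNT(DISTINCT user_id) AS dau
FROM `my-project.analytics.events`
GROUP BY event_date
"""
client.query(sql).result()  # later queries read the much smaller table
```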

Conclusion

BigQuery gives you the ability to run petabyte-scale queries and get responses in seconds, along with the benefits of a fully managed infrastructure. Businesses are learning how to leverage Google BigQuery ETL to increase their workload processing capacity, and doing so comes down to applying up-to-date concepts and techniques, powered by Google BigQuery and GCP, across the whole pipeline.

Hopefully you will get the most out of BigQuery and enjoy the benefits of this contemporary data warehouse. If you don’t track your usage and visualize the results, you have no way to gauge your success. Before taking any action, run a short report of your BigQuery usage for the last month to get a sense of the cost. With that baseline in hand, you will be in a better position to determine which cost-cutting steps to take next.

Adarsh Rai

Adarsh Rai is an author and growth specialist at Economize. He holds a FinOps Certified Practitioner (FOCP) certification and has a passion for explaining complex topics to a rapt audience.