Today, businesses grapple with the challenge of handling massive amounts of data from multiple data sources. Effectively integrating this data is essential to harness its full potential. Google Cloud Platform (GCP) introduces Cloud Data Fusion, a powerful tool designed to simplify the complex data integration process. It allows businesses to efficiently combine, transform, and analyze their data, providing them with the information needed to make strategic decisions and drive innovation.

What is Google Cloud Data Fusion?

Google Cloud Data Fusion is a fully managed, cloud-native data integration service offered by Google Cloud Platform (GCP). It enables organizations to create, manage, and orchestrate data pipelines in a user-friendly environment. It allows users to visually design data workflows, which can ingest, transform, and analyze data from multiple sources in real-time or batch modes. This flexibility helps businesses of all sizes efficiently manage and utilize their data.

At its core, Cloud Data Fusion provides a no-code/low-code, point-and-click interface for designing data workflows visually. This lets you deploy ETL (Extract, Transform, Load) or ELT (Extract, Load, Transform) data pipelines without writing code.

Another significant advantage of Cloud Data Fusion is its seamless integration with other Google Cloud services. It also offers end-to-end data lineage, which is particularly useful for root cause and impact analysis because it lets you trace data flows and transformations throughout your pipelines. Cloud Data Fusion is built on an open-source core, CDAP (Cask Data Application Platform). This foundation provides the flexibility to move and adapt your data pipelines across different environments, so your data integration solutions remain portable and adaptable to changing business needs.

Core Features of Google Cloud Data Fusion

Google Cloud Data Fusion is equipped with a variety of powerful features that cater to the diverse data needs of modern enterprises, making it an indispensable tool for streamlining data operations and enhancing decision-making capabilities. Let us delve into some of the core features that make GCP Cloud Data Fusion a leading choice for data integration and transformation.

  • Collaborative Data Engineering: Cloud Data Fusion facilitates collaborative data engineering with an internal library of custom connectors and transformations. These can be validated, shared, and reused across an organization, promoting efficiency and consistency in data handling.
  • Code-Free Service: Cloud Data Fusion empowers non-technical users with a code-free graphical interface that supports point-and-click data integration. This removes bottlenecks typically associated with data handling, enabling faster and more efficient processes without the need for deep technical knowledge.
  • Real-Time Data Integration: The platform supports real-time data integration by enabling seamless replication of databases such as SQL Server, Oracle, and MySQL into BigQuery. It integrates with Datastream to deliver change streams for continuous analytics and includes tools for quick development and performance monitoring.
  • Enterprise-Grade Security: Integrating Google Cloud Data Fusion with Cloud Identity and Access Management (IAM), Private IP, VPC Service Controls (VPC-SC), and Customer-Managed Encryption Keys (CMEK) provides enterprise-grade security. This helps mitigate risks and supports compliance with data protection standards.
  • Seamless Operations: For seamless operations, Google Cloud Data Fusion offers a range of tools such as REST APIs, time-based scheduling, pipeline state-based triggers, logs, metrics, and monitoring dashboards. These features make it easier to manage and operate data pipelines, even in mission-critical environments.

What is the difference between Google Cloud Data Fusion, Dataflow, and Dataproc?

GCP offers a variety of tools for data integration and processing, each suited to different use cases and requirements. Three popular services in this domain are Cloud Data Fusion, Dataflow, and Dataproc. Here’s a breakdown of the difference between Google Cloud Data Fusion and other data integration tools:

GCP Cloud Data Fusion vs Dataflow

GCP Cloud Data Fusion and Dataflow serve different roles in data integration and processing on Google Cloud Platform. Cloud Data Fusion is a fully managed, no-code/low-code tool that simplifies data integration with a visual interface, perfect for businesses needing easy ETL/ELT pipeline creation. Dataflow, on the other hand, is built for real-time and batch processing using Apache Beam and requires more technical expertise, making it ideal for complex and scalable data workflows. While Cloud Data Fusion is great for ease of use, Dataflow excels in handling more intricate processing tasks.

GCP Cloud Data Fusion vs Dataproc

While Google Cloud Data Fusion simplifies data integration from multiple sources, Dataproc is a powerful platform for heavy-duty big data processing using tools like Spark and Hadoop. Dataproc is ideal for large-scale data analytics, data cleansing, ETL, and machine learning workflows, offering flexibility and efficiency for handling large datasets, though it requires more technical expertise compared to the user-friendly Cloud Data Fusion.

How to build Data Pipelines on Google Cloud Data Fusion: A Step-by-step Tutorial

Here’s a quick guide on setting up a data pipeline using Google Cloud Data Fusion. We’ll create a Cloud Data Fusion instance, deploy a sample pipeline, and process data from Cloud Storage to BigQuery.

Step 1: Set Up Your Environment

  • Go to the project selector page in the Google Cloud console and choose an existing project or create a new one. Creating a new project is useful if you want to easily remove all resources afterward by deleting the project.
  • Go to the API library and enable the Cloud Data Fusion API to use its features.

The GCP Cloud Data Fusion API enables automated management of data pipelines, allowing seamless integration and transformation of data across Google Cloud services.

Step 2: Create a Cloud Data Fusion Instance

  • Click on “Create an instance” from the Instances page. Provide a name, description, and select the region for your instance.
  • Select the Cloud Data Fusion version and edition you need. For newer versions, choose a Dataproc service account for pipeline execution.
  • Click “Create” and wait for the instance to be set up. This may take up to 30 minutes.
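The same instance can also be requested programmatically instead of through the console. The following is a minimal sketch, not a definitive implementation: it only builds the URL and JSON body for the Cloud Data Fusion v1 instances.create REST call and sends nothing. The project ID, region, and instance ID are hypothetical placeholders, and the exact set of body fields your edition and version need should be checked against the API reference.

```python
import json
from urllib.parse import urlencode

# Edition names as used by the REST API's "type" field.
EDITIONS = {"DEVELOPER", "BASIC", "ENTERPRISE"}


def build_create_instance_call(project_id: str, region: str, instance_id: str,
                               edition: str = "BASIC", description: str = ""):
    """Return (url, json_body) for an instances.create call; nothing is sent."""
    if edition not in EDITIONS:
        raise ValueError(f"unknown edition: {edition!r}")
    parent = f"projects/{project_id}/locations/{region}"
    url = (
        f"https://datafusion.googleapis.com/v1/{parent}/instances?"
        + urlencode({"instanceId": instance_id})
    )
    body = {"type": edition, "description": description}
    return url, json.dumps(body)


if __name__ == "__main__":
    # Hypothetical values for illustration only.
    url, body = build_create_instance_call(
        "my-project", "us-central1", "demo-instance", edition="DEVELOPER"
    )
    print("POST", url)
    print(body)
```

Sending the request would additionally require an OAuth access token in an Authorization header; that part is left out so the sketch has no side effects.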

Step 3: Navigate the Cloud Data Fusion Interface

  • Once the instance is ready, go to the Instances page in the Google Cloud console and click “View Instance” to open the Cloud Data Fusion web interface.
  • Use the left navigation panel to explore different pages like Studio for building pipelines and Wrangler for data preparation.

Step 4: Deploy a Sample Pipeline

  • In the Cloud Data Fusion web interface, click on “Hub” in the left panel and then select “Pipelines.”
  • Choose the “Cloud Data Fusion Quickstart” pipeline and click “Create.”
  • In the configuration panel, click “Finish” and then “Customize Pipeline” to see a visual representation of your pipeline in the Studio page.
  • Click “Deploy” in the top-right menu to submit your pipeline for execution.

Step 5: Execute Your Pipeline

  • In the pipeline details view, click “Run” to start the execution. Cloud Data Fusion will provision a temporary Dataproc cluster, run the pipeline, and then delete the cluster once finished.
  • Watch the status change to “Running” and then to “Succeeded” when the process completes.
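Clicking "Run" has a REST equivalent, because Cloud Data Fusion exposes the underlying CDAP API. The sketch below, under stated assumptions, builds (but does not send) the POST request that starts a deployed batch pipeline through its DataPipelineWorkflow. The endpoint URL, pipeline name, and token shown are hypothetical placeholders; the real endpoint is the API endpoint of your own instance.

```python
import urllib.request
from urllib.parse import quote


def build_start_request(cdap_endpoint: str, pipeline_name: str, access_token: str,
                        namespace: str = "default") -> urllib.request.Request:
    """Build the POST request that starts a deployed batch pipeline."""
    url = (
        f"{cdap_endpoint.rstrip('/')}/v3/namespaces/{quote(namespace, safe='')}"
        f"/apps/{quote(pipeline_name, safe='')}/workflows/DataPipelineWorkflow/start"
    )
    return urllib.request.Request(
        url, method="POST", headers={"Authorization": f"Bearer {access_token}"}
    )


if __name__ == "__main__":
    # All three values are placeholders for illustration.
    req = build_start_request(
        "https://example-instance.invalid/api",  # hypothetical instance API endpoint
        "DataFusionQuickStart",                  # hypothetical deployed pipeline name
        "ACCESS_TOKEN",
    )
    print(req.get_method(), req.full_url)
    # urllib.request.urlopen(req) would actually send the request.
```

Keeping request construction separate from sending makes it easy to inspect the URL before any pipeline run, and therefore any Dataproc cluster, is started.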

Step 6: View the Results

  • After a few minutes, the pipeline finishes: its status changes to “Succeeded” and the number of records processed by each node is displayed.
  • Now go to the BigQuery web interface and navigate to the DataFusionQuickstart dataset. Click on the “top_rated_inexpensive” table to view a sample of the processed data.
  • Execute the query
SELECT * FROM `PROJECT_ID.DataFusionQuickstart.top_rated_inexpensive` LIMIT 10

replacing PROJECT_ID with your project ID to see the results.
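If you would rather run the check from a script than from the BigQuery web interface, something like the following sketch could work. It assumes the google-cloud-bigquery package is installed and that application default credentials are configured; the project ID in the usage line is a hypothetical placeholder, and the ID check is a simplified sanity test rather than a complete validator.

```python
import re


def build_preview_query(project_id: str, limit: int = 10) -> str:
    """Return the preview query for the quickstart table in the given project."""
    # Simplified sanity check so an arbitrary string is never spliced into the SQL.
    if not re.fullmatch(r"[a-z][a-z0-9\-]{4,28}[a-z0-9]", project_id):
        raise ValueError(f"unexpected project ID: {project_id!r}")
    table = f"{project_id}.DataFusionQuickstart.top_rated_inexpensive"
    # Backticks let BigQuery parse project IDs that contain hyphens.
    return f"SELECT * FROM `{table}` LIMIT {int(limit)}"


def run_preview(project_id: str) -> list:
    """Run the query with the BigQuery client library (needs credentials)."""
    from google.cloud import bigquery  # imported lazily: optional dependency

    client = bigquery.Client(project=project_id)
    rows = client.query(build_preview_query(project_id)).result()
    return [dict(row.items()) for row in rows]


if __name__ == "__main__":
    # Prints the SQL only; call run_preview() to execute it against BigQuery.
    print(build_preview_query("my-sample-project"))
```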

This is a quick overview of setting up and running a data pipeline with Google Cloud Data Fusion, allowing you to integrate, transform, and analyze data seamlessly.

Google Cloud Data Fusion Pricing Insights

Google Cloud Data Fusion pricing is structured to provide flexibility based on usage and the specific needs of your data integration processes. The costs are primarily determined by the edition of Cloud Data Fusion you choose and the resources consumed during data processing. Here’s an overview of the key components that affect pricing:

Cloud Data Fusion Editions

Developer Edition:

The Developer Edition of Google Cloud Data Fusion is the most affordable option, perfect for small projects or development tasks. It offers essential data integration features but with limited access to advanced functionalities and lower resource limits. This makes it an excellent choice for those looking to get started or experiment with data integration without incurring high costs.

Basic Edition:

The Basic Edition of Google Cloud Data Fusion is a reasonably priced option designed for routine data integration tasks. It offers more connectors and transformation capabilities than the Developer Edition, providing increased flexibility for daily operations. Additionally, users get the first 120 hours of use each month for free, which can significantly reduce costs for regular activities.

Enterprise Edition:

The Enterprise Edition of Google Cloud Data Fusion is the top-tier option, tailored for large-scale and complex data integration requirements. It comes with advanced features such as high availability and robust security, and provides access to an unlimited number of connectors and transformations. This makes it ideal for handling extensive and critical data operations, ensuring reliability and security for enterprise-level projects.

Cloud Data Fusion Edition    Price per instance per hour
Developer                    $0.35 (~$250 per month)
Basic                        $1.80 (~$1100 per month)
Enterprise                   $4.20 (~$3000 per month)
Google Cloud Data Fusion Pricing based on Edition
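
To see how the hourly rates relate to the rounded monthly figures, here is a small illustrative calculation. It is a sketch under two assumptions that are not stated in the table itself: a month is taken as roughly 730 hours, and the Basic Edition's 120 free hours are subtracted before billing. With those assumptions the results land close to the approximate monthly prices listed above.

```python
# Hourly list prices copied from the table above (USD per instance per hour).
HOURLY_PRICE = {"Developer": 0.35, "Basic": 1.80, "Enterprise": 4.20}

# Assumption: an average month has about 730 hours (8,760 hours / 12).
HOURS_PER_MONTH = 730

# The Basic Edition's first 120 hours each month are free.
FREE_HOURS = {"Basic": 120}


def monthly_instance_cost(edition: str, hours: float = HOURS_PER_MONTH) -> float:
    """Estimate the monthly instance charge for one always-on instance."""
    billable_hours = max(0.0, hours - FREE_HOURS.get(edition, 0))
    return round(billable_hours * HOURLY_PRICE[edition], 2)


if __name__ == "__main__":
    for edition in HOURLY_PRICE:
        print(f"{edition}: ~${monthly_instance_cost(edition):,.2f} per month")
```

This covers the instance charge only; the pipeline execution costs described in the following sections come on top of it.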

Resource Consumption

When using Google Cloud Data Fusion, you are billed for the compute resources required to run your data pipelines, which includes the time spent on data transformations and integrations. The total cost depends on factors like the complexity of your pipelines, the amount of data processed, and the type of compute instance you choose. In addition, there are charges for storing data, both temporary and permanent, with costs varying based on the volume of data stored and how long it is kept.

Additional Costs

When using Google Cloud Data Fusion, you may incur data transfer fees, especially when moving data out of Google Cloud services or across regions. The costs vary based on the volume of data transferred and the destinations involved. In addition, there are optional costs for premium support and maintenance services. The total expense depends on the level of support you choose and any additional maintenance services you opt for.

How much does a Data Pipeline with GCP Cloud Data Fusion cost?

Consider the example of running a data pipeline with Google Cloud Data Fusion that processes data every hour by reading from Cloud Storage, transforming it, and writing the results to BigQuery. This setup incurs several costs that depend on your specific usage. You will be charged for the resources used, such as the compute power for data transformation and the storage space for both raw and processed data.

In this scenario, you set up Dataproc clusters that run for 15 minutes each time, processing data every hour. The configuration includes a master node with 4 virtual CPUs and 5 worker nodes, each with 4 virtual CPUs. This gives a total of 24 virtual CPUs for each cluster.

To estimate the cost for Dataproc, you’d calculate it as follows:

Dataproc charge = number of vCPUs × number of clusters × hours per cluster × Dataproc price per vCPU hour

Plugging in the numbers for one day of hourly runs (24 clusters, 0.25 hours each):

Dataproc charge = 24 vCPUs × 24 clusters × 0.25 hours × $0.01
Dataproc charge = $1.44

This calculation shows that the cost for Dataproc is $1.44 for these runs. However, you’ll also incur additional charges for other Google Cloud services. For instance, you’ll pay for the Compute Engine resources used by the Dataproc clusters and for the persistent disk space provisioned. There are also costs for storing data in Cloud Storage and BigQuery, which depend on the volume of data processed by your pipeline.
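
The Dataproc portion of the estimate can be reproduced with a few lines of Python, which makes it easy to try other cluster sizes or run lengths. This is only an illustration of the formula above: the $0.01 per vCPU hour rate is the figure used in this example, and the function and variable names are made up for the sketch.

```python
def dataproc_charge(vcpus_per_cluster: int, clusters: int,
                    hours_per_cluster: float, price_per_vcpu_hour: float) -> float:
    """Dataproc charge = vCPUs x clusters x hours per cluster x price per vCPU hour."""
    return vcpus_per_cluster * clusters * hours_per_cluster * price_per_vcpu_hour


# 1 master node + 5 worker nodes, 4 vCPUs each.
vcpus = (1 + 5) * 4

# 24 hourly runs, each keeping its cluster up for 15 minutes (0.25 hours).
charge = dataproc_charge(vcpus, clusters=24, hours_per_cluster=0.25,
                         price_per_vcpu_hour=0.01)

print(f"vCPUs per cluster: {vcpus}")
print(f"Dataproc charge: ${charge:.2f}")
```

Doubling the run time to 30 minutes, for example, doubles the Dataproc charge, while the Compute Engine, disk, and storage charges mentioned above are billed separately and are not part of this function.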

To get a comprehensive estimate of these additional costs, you can use our GCP pricing catalog, which will help you understand the current rates of each service you use.

Conclusion

In summary, Google Cloud Data Fusion offers a versatile and user-friendly platform for integrating and transforming data from various sources. Whether you are an enterprise managing complex data pipelines or a small business starting with data integration, it provides the necessary tools and flexible pricing to meet your needs. The pricing model aligns with the resource usage and specific needs of your data integration projects, making it a cost-effective solution for your data strategy.

Heera Ravindran

Content Marketer at Economize. An avid writer and a zealous reader who specializes in technical content and has a passion for all things Cloud and FinOps.
