
Last week, Google announced a powerful addition to its Google Cloud Platform (GCP) lineup: Persistent Disk Asynchronous Replication. This isn’t just another feature to explore; it’s a genuine game-changer, especially when it comes to disaster recovery for GCP users.

In this article, we’re diving deep to understand what Persistent Disk Asynchronous Replication is, why it matters to your business, and how to leverage it for maximum benefit.


What is Disaster Recovery in GCP?

Picture disaster recovery as a fire drill, a mock run you conduct in anticipation of a real fire. In the context of GCP, it’s a plan that prepares your data, resources, and applications for swift recovery in case of a catastrophic event, be it a cyber-attack or a natural disaster. It ensures your Compute Engine instances, Cloud Storage buckets, and other components are safe and can bounce back, minimizing downtime.

In your cloud environment, it acts as a safety net, carefully woven to catch your precious data should calamity strike. At its core, it’s a collection of strategies and processes designed to restore normalcy to your cloud environment following a disaster.

Why GCP Disaster Recovery isn’t Optional

Not having a disaster recovery plan can expose your GCP environment to various risks, including data loss and prolonged downtime. These disruptions can greatly impact your operations, potentially leading to severe losses. Additionally, reputational damage, possible regulatory fines, and the added pressure on your IT team are risks that businesses can’t afford to take.

The Hidden Cost of Cloud Downtime

Downtime is more than just a pause; it carries a substantial financial burden. According to a report from Parametrix Solutions, a New York-based insurance company, cloud downtime can cost a staggering $300,000 to $500,000+ per hour.


This cost only covers the tangible financial loss. The additional hidden costs, such as lost business opportunities, resources diverted to damage control, and the impact on the company’s reputation, are not factored into this amount. These hidden costs can escalate the total cost of downtime exponentially.

In essence, disaster recovery isn’t an added feature; it’s a business necessity. It’s an insurance policy for your data and a strategic move to protect your business.


Understanding Google’s Disaster Recovery Plan and Architecture

When it comes to the uninterrupted functioning of your services, a robust, well-tested Disaster Recovery (DR) plan is your business’s safety net. Be it a network outage, a bug in your latest application push, or a natural disaster, it’s important to be prepared for all contingencies. With Google Cloud Platform’s robust, flexible, and cost-effective range of products, you can construct a DR plan tailored to your needs.

The Fundamentals of DR Planning

At its core, Disaster Recovery is a subset of business continuity planning. The process commences with a business impact analysis, which defines two key metrics:

  1. Recovery Time Objective (RTO): This is the maximum acceptable length of time that your application can be offline. It’s usually defined as part of a broader service level agreement (SLA).
  2. Recovery Point Objective (RPO): This refers to the maximum acceptable duration in which data might be lost from your application due to a major incident. The RPO varies depending on the use of the data. For instance, frequently modified user data might have an RPO of a few minutes, while less critical, infrequently modified data might have an RPO of several hours.
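
As a purely illustrative example (the targets and measurements below are invented), this short Python sketch checks a measured incident against hypothetical RTO and RPO targets:

```python
# Illustrative only: compare a measured incident against hypothetical
# RTO/RPO targets from an SLA. No GCP APIs are involved.
from datetime import timedelta

rto = timedelta(hours=1)      # assumed target: at most 1 hour of downtime
rpo = timedelta(minutes=5)    # assumed target: at most 5 minutes of lost data

measured_downtime = timedelta(minutes=42)
measured_data_loss = timedelta(minutes=7)

print("RTO met:", measured_downtime <= rto)   # True: recovered within the hour
print("RPO met:", measured_data_loss <= rpo)  # False: replicate more frequently
```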

The Cost-RTO/RPO Relationship

Typically, the smaller your RTO and RPO values are (i.e., the faster your application must recover from an interruption), the higher the cost of running your application. The graph below illustrates the relationship between cost and RTO/RPO, highlighting that low RTO/RPO targets lead to higher costs.

[Figure: the cost of running an application rises as RTO/RPO targets shrink]

Google has provided an in-depth guide to disaster recovery and planning that you can access here.


What is Persistent Disk Asynchronous Replication in GCP?

Persistent Disk Asynchronous Replication (PDAR) is an innovative feature of Google Cloud Platform (GCP) that operates at the block infrastructure level. With just a few API calls, it manages data replication without the need for VM agents, dedicated replication VMs, or constraints on supported guest operating systems. It also doesn’t exert any performance overhead on your workload.

Let’s consider the task of regular backups. In the past, it might have required the coordination of multiple VMs, potential interruptions to your workload, and careful OS considerations.

Now, with Persistent Disk Asynchronous Replication, it’s like having a set-and-forget robot assistant. You input your replication commands, and PDAR takes care of the rest, working in the background, quietly and efficiently.

How Does Persistent Disk Asynchronous Replication Work?

Persistent Disk Asynchronous Replication (PDAR) is straightforward to set up and manage, working seamlessly in the background. Here’s a step-by-step guide on how PDAR operates:

  1. Enable PDAR: Start by selecting the existing Persistent Disk volume that needs protection. With a couple of quick calls in the API or the Google Cloud console, you can enable PDAR for it.
  2. Create a new disk: Create a new blank disk in the secondary region. This disk should reference the primary disk you want to protect.
  3. Start replication: With a reference to the secondary disk, initiate replication from the primary disk (see the sketch after this list). From this point forward, PDAR automatically replicates data between the disks, typically achieving an RPO of less than a minute depending on the disk’s change rate.
  4. Monitor Replication: Once PDAR is in action, you can track the time since the last replication and the network bytes sent via Cloud Monitoring.
  5. Disaster Management: The decision to declare a disaster event lies with the operations team managing the workload. If they identify a disaster in the primary region, they can initiate failover.
  6. Initiate Failover: To start the failover, you need to stop replication between the disks and attach the secondary disk to a VM in the secondary region. This can be accomplished within minutes, enabling swift recovery.
  7. Post Disaster Management: After the workload in the primary region is restored post-disaster, you can create a new replication pair back to the primary region. This failback process ensures the workload returns to normal in the primary region.
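
To make steps 2 and 3 concrete, here is a minimal sketch using the google-cloud-compute Python client. The project, zones, and disk names are placeholders, and the method and field names follow the public Compute Engine API for asynchronous replication; treat it as an illustration rather than a drop-in script, and verify the calls against your client library version.

```python
# Hedged sketch only: create the secondary disk and start replication.
# Project, zones, and disk names are placeholders; method and field names
# follow the public Compute Engine API for PD Async Replication and should
# be verified against your google-cloud-compute version.
from google.cloud import compute_v1

PROJECT = "my-project"
PRIMARY_ZONE = "us-central1-a"
SECONDARY_ZONE = "us-east1-b"

disks = compute_v1.DisksClient()

# Step 2: create a blank secondary disk that references the primary disk.
primary_url = f"projects/{PROJECT}/zones/{PRIMARY_ZONE}/disks/primary-disk"
secondary = compute_v1.Disk(
    name="secondary-disk",
    size_gb=500,  # placeholder; should match the primary disk's size
    type_=f"zones/{SECONDARY_ZONE}/diskTypes/pd-balanced",
    async_primary_disk=compute_v1.DiskAsyncReplication(disk=primary_url),
)
disks.insert(project=PROJECT, zone=SECONDARY_ZONE, disk_resource=secondary).result()

# Step 3: start replication from the primary disk to the secondary disk.
secondary_url = f"projects/{PROJECT}/zones/{SECONDARY_ZONE}/disks/secondary-disk"
disks.start_async_replication(
    project=PROJECT,
    zone=PRIMARY_ZONE,
    disk="primary-disk",
    disks_start_async_replication_request_resource=compute_v1.DisksStartAsyncReplicationRequest(
        async_secondary_disk=secondary_url
    ),
).result()
```

Each call returns a long-running operation; waiting on .result() keeps the steps sequential. From there, Cloud Monitoring metrics cover step 4, and the failover and failback sections below cover steps 6 and 7.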


By replicating data between data centers in different regions, PDAR creates resilient data replicas that guard against disruptions caused by natural disasters or localized events. The magic of PDAR lies in its simplicity, flexibility, and effectiveness, making disaster recovery manageable and efficient.


Persistent Disk Asynchronous Replication Features for Disaster Recovery in GCP

Reliable disaster recovery is a critical part of modern business. Persistent Disk Asynchronous Replication in GCP provides a variety of tools to aid in this process. This section will delve into how these key PDAR features can enhance your disaster recovery plans and safeguard your critical data.


Consistency Groups – Guaranteeing Data Synchronization

In scenarios where workloads distribute data across multiple disks and VM instances, consistency groups ensure data synchronization. By synchronizing the replication period across all disks, PDAR provides simultaneous and atomic data replication, achieving data consistency between primary and secondary disks—a critical success factor for workload recovery during a disaster.

Key functions of consistency groups include:

  1. Replication: Aligns replication across primary disks, ensuring all disks carry data from a common replication point—vital for a successful DR process.
  2. Cloning: Aligns cloned disks from secondary disks, making sure all clones share data from a specific point in time—critical for DR drills or testing.

Remember, a consistency group can facilitate either replication or cloning at any given time, not both simultaneously. Before starting replication, primary disks need to be added to a consistency group. For cloning, secondary disks can be added at any time.

This careful coordination of replication and cloning makes consistency groups indispensable in DR planning, providing businesses with a way to maintain continuous operations even in the face of unexpected disruptions.
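
As an illustration, the sketch below creates a consistency group (a regional resource policy) and attaches a primary disk to it before replication starts. It assumes the google-cloud-compute Python client; the resource names are placeholders, and the field names mirror the Compute Engine API, so verify them against your client version.

```python
# Hedged sketch: create a consistency group (a regional resource policy)
# and add a primary disk to it before replication starts. Names are
# placeholders; verify field names against your client library version.
from google.cloud import compute_v1

PROJECT = "my-project"
PRIMARY_REGION = "us-central1"
PRIMARY_ZONE = "us-central1-a"

policies = compute_v1.ResourcePoliciesClient()
disks = compute_v1.DisksClient()

# Create the consistency group.
policy = compute_v1.ResourcePolicy(
    name="my-consistency-group",
    disk_consistency_group_policy=compute_v1.ResourcePolicyDiskConsistencyGroupPolicy(),
)
policies.insert(project=PROJECT, region=PRIMARY_REGION, resource_policy_resource=policy).result()

# Attach a primary disk to the group; this must happen before replication starts.
policy_url = (
    f"projects/{PROJECT}/regions/{PRIMARY_REGION}/resourcePolicies/my-consistency-group"
)
disks.add_resource_policies(
    project=PROJECT,
    zone=PRIMARY_ZONE,
    disk="primary-disk",
    disks_add_resource_policies_request_resource=compute_v1.DisksAddResourcePoliciesRequest(
        resource_policies=[policy_url]
    ),
).result()
```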


Disaster Recovery Testing with Failovers and Failbacks

Understanding the role of failover and failback operations in disaster recovery is crucial for ensuring business continuity.

What is a Failover?

In the event of an outage in your primary region, the responsibility lies with you to recognize the issue and initiate a failover—essentially restarting your workload using the secondary disks in the secondary region. It’s important to note that PD Async Replication doesn’t automatically monitor for outages.

The failover process is straightforward:

  1. Stop replication.
  2. Attach the secondary disks to VMs in the secondary region.

Post failover, you need to validate and restart your application workload in the secondary region. You’ll also need to reconfigure the network addresses that access your application to point to the secondary region.
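
A hedged sketch of those two failover steps, again with the google-cloud-compute Python client and placeholder resource names, might look like the following:

```python
# Hedged failover sketch: stop replication, then attach the secondary disk
# to a recovery VM in the secondary region. All resource names are
# placeholders; verify the calls against your client library version.
from google.cloud import compute_v1

PROJECT = "my-project"
PRIMARY_ZONE = "us-central1-a"
SECONDARY_ZONE = "us-east1-b"

disks = compute_v1.DisksClient()
instances = compute_v1.InstancesClient()

# 1. Stop replication. If the primary region is unreachable, the stop call
#    can be issued against the secondary disk instead.
disks.stop_async_replication(project=PROJECT, zone=PRIMARY_ZONE, disk="primary-disk").result()

# 2. Attach the secondary disk to a standby VM in the secondary region.
secondary_url = f"projects/{PROJECT}/zones/{SECONDARY_ZONE}/disks/secondary-disk"
instances.attach_disk(
    project=PROJECT,
    zone=SECONDARY_ZONE,
    instance="recovery-vm",
    attached_disk_resource=compute_v1.AttachedDisk(source=secondary_url),
).result()
```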


What is a Failback?

Once the primary region’s outage or disaster is resolved, the secondary region (now the acting primary region) can initiate a failback. This means starting replication back to the original primary region. Optionally, you can repeat this process to shift the workload back to the original primary region.

The failback process involves:

  1. Configuring replication between the new primary region and the original primary region. This means that the original secondary disk, now the new primary disk, is set to replicate to a new secondary disk in the original primary region.
  2. Optionally, creating a new consistency group resource policy in the new primary region. This ensures that the new primary disks (originally the secondary disks) can replicate consistently to a new set of secondary disks in the original primary region.
  3. Optionally, repeating the failover process to return the workload to the original primary region after the initial replication.

Regular Testing for Disaster Recovery

For assurance that your recovery procedures will work during an actual disaster, Google recommends regularly conducting tests in the secondary region. This can be done without disrupting or disconnecting PD Async Replication, by bulk-cloning the secondary disks (with a consistency group applied) even while they are still receiving new data.
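
One way to run such a drill, sketched below with the google-cloud-compute Python client, is to bulk-clone every secondary disk attached to a consistency group in the secondary region. The project, policy, and zone names are placeholders, and the bulk-insert call should be checked against your client library version.

```python
# Hedged sketch: bulk-clone all secondary disks in a consistency group for
# a DR drill, without stopping replication. The project, policy, and zone
# names are placeholders; check the bulk-insert call against your client.
from google.cloud import compute_v1

PROJECT = "my-project"
SECONDARY_REGION = "us-east1"
SECONDARY_ZONE = "us-east1-b"

disks = compute_v1.DisksClient()

group_url = (
    f"projects/{PROJECT}/regions/{SECONDARY_REGION}/resourcePolicies/my-consistency-group"
)
disks.bulk_insert(
    project=PROJECT,
    zone=SECONDARY_ZONE,
    bulk_insert_disk_resource_resource=compute_v1.BulkInsertDiskResource(
        source_consistency_group_policy=group_url
    ),
).result()
```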


Using Regional Persistent Disks for High Availability and Disaster Recovery

Leveraging both PD Async Replication (PDAR) and regional persistent disks can ensure continuous high availability (HA) and effective disaster recovery (DR), particularly for crucial data that requires frequent access.

The Benefit of Using Regional Persistent Disks with PDAR

By incorporating regional persistent disks into your DR strategy, you’re enhancing the resilience and accessibility of your data. Regional persistent disks can serve as either the primary or secondary disk in a PDAR disk pair. A disk pair consists of a primary disk, which replicates to a secondary disk.

  • When you use a regional disk as the primary disk in the pair, the replication process remains undisturbed even if one of the primary disk’s zones experiences an outage. This is due to the regional primary disk’s ability to continue replication from the unaffected zone to the secondary disk.
  • On the other hand, when a regional disk is employed as a secondary disk, replication pauses if one of the secondary disk’s zones encounters an outage. In this case, the replication doesn’t resume to the secondary disk’s unaffected zone.
  • However, preparing your workload for cross-zone HA becomes possible when using regional disks as secondary disks, particularly during a failover when the secondary disk transitions into the new primary disk.
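
To illustrate the first scenario above, the following hedged sketch creates a regional persistent disk, replicated across two zones, which could then serve as the primary disk in a PDAR pair. It assumes the google-cloud-compute Python client, and all names are placeholders.

```python
# Hedged sketch: create a regional persistent disk replicated across two
# zones, which could then serve as the primary disk in a PDAR pair.
# Names are placeholders.
from google.cloud import compute_v1

PROJECT = "my-project"
PRIMARY_REGION = "us-central1"

region_disks = compute_v1.RegionDisksClient()

regional_primary = compute_v1.Disk(
    name="regional-primary-disk",
    size_gb=500,  # placeholder; PDAR disks are limited to 2 TB
    type_=f"regions/{PRIMARY_REGION}/diskTypes/pd-balanced",
    replica_zones=[
        f"projects/{PROJECT}/zones/{PRIMARY_REGION}-a",
        f"projects/{PROJECT}/zones/{PRIMARY_REGION}-b",
    ],
)
region_disks.insert(project=PROJECT, region=PRIMARY_REGION, disk_resource=regional_primary).result()
```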

Supported Region Pairs

PDAR supports replication between specific Google Cloud regions, permitting replication to and from disks in each region within a region pair. Here are the currently supported PDAR region pairs:

| Region A | Region B |
| --- | --- |
| London (europe-west2) | Belgium (europe-west1) |
| Iowa (us-central1) | South Carolina (us-east1) |
| Iowa (us-central1) | Oregon (us-west1) |
| Iowa (us-central1) | N. Virginia (us-east4) |
| Taiwan (asia-east1) | Singapore (asia-southeast1) |
| Sydney (australia-southeast1) | Melbourne (australia-southeast2) |

Secure Disk Encryption

Google-managed and Customer-managed Encryption Keys

When employing PDAR, it’s essential to understand that neither primary nor secondary disks support customer-supplied encryption keys (CSEK). This restriction is placed to ensure a uniform encryption standard that maintains data security during the replication process.

In contrast, you can use Google-managed encryption keys or customer-managed encryption keys (CMEK) for disk encryption. Each type of key has its own benefits.

  • Google-managed keys are maintained by Google Cloud, relieving you of the responsibility of key management, while CMEK gives you the control and responsibility of managing your encryption keys.
  • If you opt to use CMEK on the primary disk, you must also use CMEK on the secondary disk. You have the flexibility to use different CMEKs for the two disks, allowing a unique encryption configuration for each disk and offering an added layer of security (a sketch follows below).
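
For illustration, this minimal sketch specifies a CMEK when creating the secondary disk with the google-cloud-compute Python client. The Cloud KMS key ring, key, and disk names are placeholders.

```python
# Hedged sketch: specify a customer-managed encryption key (CMEK) when
# creating the secondary disk. The Cloud KMS key ring, key, and disk names
# are placeholders; remember the primary disk must also use CMEK.
from google.cloud import compute_v1

PROJECT = "my-project"
SECONDARY_ZONE = "us-east1-b"

kms_key = f"projects/{PROJECT}/locations/us-east1/keyRings/dr-ring/cryptoKeys/dr-key"

secondary = compute_v1.Disk(
    name="secondary-disk",
    type_=f"zones/{SECONDARY_ZONE}/diskTypes/pd-balanced",
    disk_encryption_key=compute_v1.CustomerEncryptionKey(kms_key_name=kms_key),
)
compute_v1.DisksClient().insert(
    project=PROJECT, zone=SECONDARY_ZONE, disk_resource=secondary
).result()
```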

Limitations and Considerations

For all its value in disaster recovery (DR) planning, Persistent Disk Asynchronous Replication (PDAR) comes with restrictions that are important to understand.

  • PDAR supports only balanced (pd-balanced) and SSD (pd-ssd) persistent disks.
  • The maximum capacity for a replicated disk is 2 TB.
  • Read-only and multi-writer disks are not supported.
  • A project is limited to 100 disk pairs per region pair.

The Recovery Point Objective (RPO), which is the measure of potential data loss, is an essential aspect to consider. PDAR targets a one-minute RPO for replicating up to 100 MB of compressed changed blocks per minute.

However, certain scenarios can cause the RPO to exceed one minute: the initial replication phase, periods where the disk change rate surpasses 100 MB per minute, or a disk being detached from its virtual machine (VM) during replication.

On the other hand, the Recovery Time Objective (RTO), the time it takes to restore operations, hinges on the speed of failover tasks. RTO can be optimized by ensuring VMs are ready in the secondary region.

Grasping these limitations and considerations is crucial for an effective DR strategy when utilizing PDAR.


Conclusion

Persistent Disk Asynchronous Replication plays a pivotal role in creating a robust disaster recovery (DR) plan within the Google Cloud Platform environment. As we’ve discussed, this service offers a range of features, such as straightforward replication setup, consistency groups, and support for failover and failback testing.

Each of these contributes to a more resilient, reliable, and versatile backup system. Integrating Persistent Disk Asynchronous Replication enables businesses to enjoy a host of complementary benefits across their disaster recovery strategy, business continuity plans, and overall GCP cost optimization.

Adarsh Rai

Adarsh Rai is an author and growth specialist at Economize. He holds a FinOps Certified Practitioner (FOCP) license and has a passion for explaining complex topics to a rapt audience.