How do you determine RTO and RPO?

RTO (Recovery Time Objective) and RPO (Recovery Point Objective) are the key targets that define and measure an organization’s disaster recovery strategy. RTO is the maximum acceptable length of time that a business process can be disrupted before the impact on the business becomes unacceptable. RPO is the maximum period of time, counted back from the disruption, for which data can be lost. Determining appropriate RTO and RPO is essential for developing effective disaster recovery plans that align with business needs.

Defining RTO and RPO provides concrete targets for disaster recovery, helps identify critical systems and data, and allows businesses to select disaster recovery solutions to meet their goals. RTO and RPO analysis enables organizations to balance the costs of disaster recovery solutions against the potential impacts of downtime. Tracking and reporting on RTOs and RPOs also provides metrics to evaluate the performance of disaster recovery plans.

Why Determine RTO and RPO?

Determining RTO and RPO is crucial for business continuity and for avoiding prolonged downtime during a disruption or disaster. Together they set clear expectations for how quickly systems must be recovered after an outage and how much recent data may be lost in the process.

Setting RTO and RPO targets ensures proper business impact analysis, risk assessment, and resource allocation for disaster recovery plans. Neglecting RTO and RPO can lead to excessive downtime costs from lost revenue, productivity, and reputation. Proactively defining RTO and RPO demonstrates due diligence in safeguarding operations.

Assess Critical Business Functions

The first step in determining appropriate RTO and RPO is to conduct a business impact analysis to identify critical systems and data. This involves working with business owners and process experts to catalogue key business functions, processes, and supporting IT systems. The goal is to understand dependencies and the impacts of potential disruptions.

As part of the business impact analysis, identify business processes that would be severely impacted by a disruption, along with the supporting applications, data, and infrastructure. Also analyze how much downtime can be tolerated before major impacts occur. Quantify potential financial losses, reputational damage, regulatory non-compliance, and other consequences.

Prioritize the most critical systems and data that have the lowest tolerance for disruption. These will drive more aggressive RTO and RPO targets. Also consider critical dependencies across systems and business processes. The full extent of disruption may be greater than the sum of individual components.

Conducting a thorough business impact analysis provides the foundation for setting meaningful and informed RTO and RPO objectives. Understanding the potential impacts of various disruption scenarios allows appropriate resiliency plans to be established. This step is crucial to align IT recovery capabilities with true business needs.

For more details see: Element Critical’s 4 Steps to Conducting a Business Impact Analysis

Define RTO

RTO, or Recovery Time Objective, refers to the maximum tolerable downtime after a disruption. It is the duration of time within which a business must restore its critical systems and applications after an outage. The RTO represents the longest time a company can afford to be without its key operations before severe impacts begin.

Setting appropriate RTO targets requires analyzing business processes to determine maximum allowable downtime. Shorter RTOs lessen business impacts but require greater investments in resilient IT infrastructure with built-in redundancy and failover capabilities. According to best practices, RTOs are often categorized into tiers based on criticality, such as:

  • Tier 1 – Less than 4 hours
  • Tier 2 – 4 to 24 hours
  • Tier 3 – 24 to 72 hours
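The tiers above can be expressed as a small lookup. This is a minimal sketch, not a standard: the function name `rto_tier` and the "Untiered" label are hypothetical, and because the list leaves the exact boundaries ambiguous, the sketch assumes 4 and 24 hours both fall in Tier 2 and 72 hours falls in Tier 3.

```python
from datetime import timedelta


def rto_tier(rto: timedelta) -> str:
    """Map an RTO to one of the tiers listed above.

    Boundary handling is an assumption: exactly 4 h and exactly 24 h are
    placed in Tier 2, and exactly 72 h is placed in Tier 3.
    """
    if rto < timedelta(hours=4):
        return "Tier 1"
    if rto <= timedelta(hours=24):
        return "Tier 2"
    if rto <= timedelta(hours=72):
        return "Tier 3"
    # Anything longer than 72 hours is outside the listed tiers.
    return "Untiered"


# Example: a payment system with a 2-hour RTO lands in Tier 1.
payment_tier = rto_tier(timedelta(hours=2))
```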

Factors that influence RTO include data and application criticality, cost of downtime, recovery priorities, and availability of redundancy mechanisms. RTO is a vital metric that helps businesses implement disaster recovery plans focused on restoring mission-critical systems within predetermined timeframes.

Define RPO

RPO stands for Recovery Point Objective and is the maximum amount of data loss that is acceptable in the event of a disruption or disaster. The RPO defines the point in time to which systems and data must be recovered after an outage.

For example, if the RPO is set to 24 hours, the systems and data should be recoverable to a version from no more than 24 hours prior to the disruption. A shorter RPO of, say, 1 hour would require recovering data and systems to a state much closer to the time of the disruption, and would therefore limit potential data loss.

The RPO is a business decision that balances the cost of more frequent backups against the potential data loss. Typically, an RPO of 24 hours or less is recommended as a best practice for mission-critical systems and data. Financial services and healthcare organizations often have RPOs of less than 1 hour.
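To make the link between backup frequency and RPO concrete, here is a minimal sketch under a simplified model. It assumes each backup captures the state at the moment it starts and only becomes usable once it finishes, so the worst case is a failure just before the next backup completes. The function names are illustrative only.

```python
from datetime import timedelta


def worst_case_data_loss(
    backup_interval: timedelta,
    backup_duration: timedelta = timedelta(0),
) -> timedelta:
    """Worst-case data loss window for periodic backups (simplified model).

    If a failure hits just before the next backup finishes, the newest
    usable backup reflects the state from one full interval plus one
    backup duration ago.
    """
    return backup_interval + backup_duration


def meets_rpo(
    backup_interval: timedelta,
    rpo: timedelta,
    backup_duration: timedelta = timedelta(0),
) -> bool:
    """True if the backup schedule keeps worst-case loss within the RPO."""
    return worst_case_data_loss(backup_interval, backup_duration) <= rpo


# Nightly backups satisfy a 24-hour RPO only if the backup itself is
# treated as instantaneous; they cannot satisfy a 1-hour RPO.
nightly_ok_for_24h = meets_rpo(timedelta(hours=24), timedelta(hours=24))
nightly_ok_for_1h = meets_rpo(timedelta(hours=24), timedelta(hours=1))
```

Under this model a 1-hour RPO needs backups or replication running more often than hourly, which is where the cost trade-off described above comes from.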

Gather Data on Downtime Impacts

A key step is to gather data on the potential impacts of downtime to important business functions. This helps set realistic RTO and RPO targets. Consider impacts across three key areas:

  • Financial – What are the direct and indirect financial costs of downtime? Lost revenue, wages paid during outages, and costs to restore systems should be considered. Recent headlines have highlighted how prolonged outages can cost millions per day.
  • Reputational – How does downtime impact customer satisfaction, future sales, and brand reputation? Even short outages can damage trust and result in lost business.
  • Regulatory – Are there legal or regulatory requirements around uptime and availability? Failing to meet obligations can lead to fines, lawsuits, and increased scrutiny.

Analyzing downtime impacts across these dimensions provides crucial data to define appropriate RTO and RPO targets.
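The financial dimension is the easiest of the three to quantify. The sketch below covers only the cost items named above (lost revenue, wages paid during the outage, and restoration costs); the class name, field names, and figures are hypothetical, and reputational and regulatory impacts would need to be estimated separately.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DowntimeCostModel:
    """Simple linear estimate of the direct financial cost of an outage."""

    revenue_per_hour: float  # revenue lost for each hour of downtime
    wages_per_hour: float    # wages paid to staff who cannot work
    restore_cost: float      # one-time cost to restore systems

    def financial_impact(self, outage_hours: float) -> float:
        """Estimated cost of an outage lasting `outage_hours`."""
        hourly = self.revenue_per_hour + self.wages_per_hour
        return outage_hours * hourly + self.restore_cost


# Illustrative figures only: 4 h * (1000 + 200) + 5000 = 9800
model = DowntimeCostModel(revenue_per_hour=1000.0, wages_per_hour=200.0, restore_cost=5000.0)
four_hour_cost = model.financial_impact(4)
```

Plotting this estimate for several outage lengths shows where costs become unacceptable, which is a useful input for the RTO.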

Set Initial RTO and RPO Targets

Based on the data gathered on downtime impacts and costs, initial RTO and RPO targets can be established. Setting realistic targets involves balancing business needs, costs, and technical feasibility. According to Azure Disaster Recovery Services, “The typical RTO and RPO targets are set accordingly within the organizational limits depending on the requirements, which are best-in-class limits.”

Some key factors to consider when setting initial RTO/RPO targets include:

  • Maximum acceptable downtime for critical systems before major business impact
  • Potential revenue loss and other financial costs due to downtime
  • Customer service and reputational damage if systems are unavailable
  • Regulatory compliance requirements for uptime and data recovery
  • Current data backup and restoration capabilities
  • Costs of improving infrastructure to meet lower RTO/RPO targets

Initial targets should balance desired business continuity with feasibility and cost. Aggressive targets may require significant investment. The targets can be refined later after testing capabilities.
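One way to weigh cost against business need is to compare candidate recovery options by their expected yearly cost. The following is a rough sketch, not a prescribed method: the option names, prices, and the assumption that every outage lasts exactly as long as the option's RTO are all hypothetical simplifications.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class RecoveryOption:
    name: str
    rto_hours: float    # recovery time the option is expected to achieve
    annual_cost: float  # yearly cost of running the option


def expected_annual_cost(
    option: RecoveryOption,
    downtime_cost_per_hour: float,
    outages_per_year: float,
) -> float:
    """Solution cost plus expected downtime cost (assumes outage length == RTO)."""
    downtime = outages_per_year * option.rto_hours * downtime_cost_per_hour
    return option.annual_cost + downtime


def pick_option(
    options: Sequence[RecoveryOption],
    downtime_cost_per_hour: float,
    outages_per_year: float,
    max_rto_hours: float,
) -> Optional[RecoveryOption]:
    """Cheapest option whose RTO stays within the maximum acceptable downtime."""
    feasible = [o for o in options if o.rto_hours <= max_rto_hours]
    if not feasible:
        return None  # no option meets the business requirement
    return min(
        feasible,
        key=lambda o: expected_annual_cost(o, downtime_cost_per_hour, outages_per_year),
    )


options = [
    RecoveryOption("cold standby", rto_hours=72, annual_cost=10_000),
    RecoveryOption("warm standby", rto_hours=8, annual_cost=50_000),
    RecoveryOption("hot standby", rto_hours=1, annual_cost=200_000),
]
choice = pick_option(options, downtime_cost_per_hour=5_000, outages_per_year=1, max_rto_hours=24)
```

With these illustrative numbers the warm standby is selected: the cold standby exceeds the 24-hour limit, and the hot standby costs more than the downtime it avoids.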

Test and Refine

Once initial RTO and RPO targets have been set, it is crucial to test and refine them through disaster recovery testing and exercises. This involves simulating outages and disruptions to evaluate if recovery procedures can meet the established RTO and RPO goals. Testing may reveal that certain applications or systems take longer than expected to restore, requiring adjustments to the targets. Regular disaster recovery testing, such as failover drills or tabletop exercises, provides the opportunity to identify gaps and fine-tune recovery plans.

Common tests related to refining RTO/RPO include:

  • Recovery testing – Simulate a disruption and test restoration of systems and data to a secondary site or cloud.
  • Backup testing – Restore data from backups to ensure recovery point objectives can be met.
  • Stress testing – Introduce abnormal workloads to evaluate system performance during recovery.

Each test provides data points that can be used to reevaluate and optimize the established RTO and RPO targets. For example, longer than expected recovery times may indicate the need for additional redundancy or migration to cloud-based disaster recovery services. The costs associated with more aggressive RTO/RPO targets should also be weighed against the potential impact of prolonged downtime. Ongoing testing and refinement are key to developing recovery plans that match business continuity needs.
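Test results become actionable once they are compared against the targets system by system. A minimal sketch, assuming recovery times measured during a drill are recorded per system; the system names and the `rto_gaps` helper are made up for illustration.

```python
from datetime import timedelta
from typing import Dict


def rto_gaps(
    targets: Dict[str, timedelta],
    measured: Dict[str, timedelta],
) -> Dict[str, timedelta]:
    """Return, for each system that missed its RTO, how far over it went.

    Systems without a measurement are skipped rather than treated as passing
    or failing.
    """
    return {
        system: measured[system] - target
        for system, target in targets.items()
        if system in measured and measured[system] > target
    }


targets = {"orders-db": timedelta(hours=4), "intranet": timedelta(hours=24)}
drill_results = {"orders-db": timedelta(hours=6), "intranet": timedelta(hours=10)}

gaps = rto_gaps(targets, drill_results)
# orders-db overshot its 4-hour target by 2 hours; intranet passed.
```

A non-empty result points to either a recovery procedure that needs work or a target that needs revisiting.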

Monitor and Report

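Ongoing monitoring and reporting are critical for maintaining RTO and RPO compliance. Organizations should track key performance metrics, such as the recovery time and recovery point actually achieved in tests and real outages, against their targets.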

Tools like Nutanix’s management platform (https://next.nutanix.com/community-blog-154/disaster-recovery-monitoring-and-reporting-in-a-hybrid-it-world-39928) allow teams to monitor RTO, RPO, and outages to ensure regulatory and business compliance.
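Independent of any particular tool, the achieved recovery time and data loss window for an incident can be derived from three timestamps and compared with the targets. This is a generic sketch with hypothetical names; it does not reflect the data model of any specific monitoring product.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict


@dataclass(frozen=True)
class Incident:
    outage_start: datetime
    service_restored: datetime
    last_recoverable_point: datetime  # newest data state that could be restored

    @property
    def recovery_time(self) -> timedelta:
        """Time the service was actually unavailable."""
        return self.service_restored - self.outage_start

    @property
    def data_loss_window(self) -> timedelta:
        """Span of data written before the outage that could not be recovered."""
        return self.outage_start - self.last_recoverable_point


def check_compliance(incident: Incident, rto: timedelta, rpo: timedelta) -> Dict[str, bool]:
    """Report whether the incident stayed within the RTO and RPO targets."""
    return {
        "rto_met": incident.recovery_time <= rto,
        "rpo_met": incident.data_loss_window <= rpo,
    }


incident = Incident(
    outage_start=datetime(2024, 1, 1, 10, 0),
    service_restored=datetime(2024, 1, 1, 13, 0),
    last_recoverable_point=datetime(2024, 1, 1, 9, 30),
)
report = check_compliance(incident, rto=timedelta(hours=4), rpo=timedelta(minutes=15))
# Recovery took 3 hours (within the 4-hour RTO), but 30 minutes of data
# were lost, which misses the 15-minute RPO.
```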

Regular Review

It’s important to reassess RTO and RPO targets on a periodic basis to ensure they continue meeting the organization’s recovery objectives. Requirements can change over time as the business, technology, and threats evolve. A regular review process allows organizations to:

  • Evaluate if current RTO and RPO targets still align with business needs and downtime tolerances.
  • Assess the costs and benefits of adjusting RTO/RPO targets based on changing business requirements.
  • Validate that recovery plans can actually meet RTO/RPO expectations based on tests or actual recovery events.
  • Identify new systems, applications, or business processes that require defined RTO/RPO targets.
  • Adjust RTO/RPO targets to account for technology improvements that enable faster recovery.
  • Update RTO/RPO targets to address increased downtime risks from evolving cyber threats or vulnerabilities.
  • Review recovery plan testing results to find opportunities to improve RTO/RPO attainment.

Organizations should review RTO and RPO at least annually, or whenever there are major changes to systems, processes, or threats. This helps verify recovery objectives match current business needs and remain achievable.
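The annual review cadence can be enforced with a simple check over a record of when each system's targets were last reviewed. A minimal sketch, assuming such a record exists; the 365-day interval approximates "at least annually" and the system names are hypothetical.

```python
from datetime import date, timedelta
from typing import Dict, List

# "At least annually", approximated as 365 days.
REVIEW_INTERVAL = timedelta(days=365)


def reviews_due(last_reviewed: Dict[str, date], today: date) -> List[str]:
    """Systems whose RTO/RPO targets have not been reviewed within the interval."""
    return sorted(
        system
        for system, reviewed_on in last_reviewed.items()
        if today - reviewed_on >= REVIEW_INTERVAL
    )


last_reviewed = {
    "orders-db": date(2023, 1, 15),
    "intranet": date(2024, 3, 1),
}
overdue = reviews_due(last_reviewed, today=date(2024, 6, 1))
```

Major changes to systems, processes, or threats should still trigger a review regardless of what this date check reports.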