Author: Shaun Hummel
Disaster recovery planning (DRP) starts with a discussion that involves key management employees, since their support is important for any disaster recovery initiative. Explain what disaster recovery is and why it is required for business continuity, cost reduction, revenue generation and productivity. Disaster scenarios such as fire, flood, earthquake, cold weather and employee sabotage should be discussed. Alternate vendors should also be discussed, since a primary vendor's own outage is a potential business continuity issue.
The Risk Assessment is a "what if" analysis that describes the amount of risk associated with the current state of the network. Consider the following before formulating any disaster recovery strategy:
• Average cost per minute that your network is unavailable.
• Cost of replacing servers, applications, circuits and devices.
• Whether any disaster recovery plan exists and, if so, how extensive it is.
• Whether alternate vendors have been identified in case primary vendors have their own disaster recovery problems.
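The first two questions combine into a simple exposure estimate: downtime minutes multiplied by the cost per minute, plus the cost of replacing equipment. The sketch below ranks "what if" scenarios that way so the most expensive ones get attention first. It assumes you already have per-minute cost and replacement figures from your own assessment; the `Scenario` class, the function names and every number in the example are hypothetical placeholders, not data from this article.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Scenario:
    """One 'what if' disaster scenario. All figures are hypothetical inputs."""
    name: str
    outage_minutes: float     # expected time the network is unavailable
    replacement_cost: float   # servers, applications, circuits and devices to replace


def scenario_cost(s: Scenario, cost_per_minute: float) -> float:
    """Downtime cost plus replacement cost for a single scenario."""
    return s.outage_minutes * cost_per_minute + s.replacement_cost


def rank_scenarios(scenarios: List[Scenario], cost_per_minute: float) -> List[Tuple[str, float]]:
    """Return (name, cost) pairs sorted from most to least expensive."""
    costs = [(s.name, scenario_cost(s, cost_per_minute)) for s in scenarios]
    return sorted(costs, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    # Illustrative numbers only; substitute figures from your own risk assessment.
    scenarios = [
        Scenario("fire", outage_minutes=4 * 24 * 60, replacement_cost=250_000),
        Scenario("flood", outage_minutes=2 * 24 * 60, replacement_cost=120_000),
        Scenario("employee sabotage", outage_minutes=8 * 60, replacement_cost=5_000),
    ]
    for name, cost in rank_scenarios(scenarios, cost_per_minute=100.0):
        print(f"{name}: ${cost:,.0f}")
```

A single cost per minute is a simplification; if some offices or services cost more to lose than others, split the scenario per service and sum the results.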
Disaster Recovery Strategy
The disaster recovery strategy describes operational changes, design changes and failover strategies for business continuity. An action plan document is created that describes all of these strategies, along with a detailed escalation procedure to follow if the network becomes unavailable. It should document employees, responsibilities, time frames, event sequence, vendors and processes.
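Keeping the escalation procedure in a structured form makes it possible to check it for gaps and to see, during an outage, which steps are running late. The sketch below is one hypothetical way to model it, with one record per step holding the sequence number, owner, optional vendor and time frame listed above. The class names, fields and the rule that time frames should not shrink later in the sequence are assumptions for illustration, not part of any particular DRP tool.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class EscalationStep:
    order: int              # position in the event sequence
    action: str             # what must be done
    owner: str              # responsible employee or role
    vendor: Optional[str]   # vendor to engage, if any
    deadline_minutes: int   # time frame, measured from when the outage is declared


@dataclass
class ActionPlan:
    steps: List[EscalationStep] = field(default_factory=list)

    def validate(self) -> List[str]:
        """Report gaps: duplicate sequence numbers, steps without an owner,
        and time frames that shrink later in the sequence."""
        problems = []
        orders = [s.order for s in self.steps]
        if len(orders) != len(set(orders)):
            problems.append("duplicate step numbers")
        for s in self.steps:
            if not s.owner:
                problems.append(f"step {s.order} has no owner")
        ordered = sorted(self.steps, key=lambda s: s.order)
        for earlier, later in zip(ordered, ordered[1:]):
            if later.deadline_minutes < earlier.deadline_minutes:
                problems.append(f"step {later.order} is due before step {earlier.order}")
        return problems

    def overdue(self, elapsed_minutes: int, completed: Set[int]) -> List[EscalationStep]:
        """Incomplete steps whose time frame has already passed, in sequence order."""
        late = [s for s in self.steps
                if s.order not in completed and elapsed_minutes > s.deadline_minutes]
        return sorted(late, key=lambda s: s.order)
```

For example, an hour into an outage with only step 1 finished, `plan.overdue(60, {1})` lists every later step whose time frame was under 60 minutes, which tells you whom to escalate to next.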
The following describes recommended operational changes:
1. Network Documentation
Automate the network documentation process. It is difficult to restore a network without current documentation of how it was configured before it became unavailable. Running a network assessment will collect some of this information; however, you need application and device configurations as well. Find a tool that will automate this process!
Document these items:
• Current Topology
• Security Policies
• Management Strategy
• Application Configurations, Versions and Patches
• Device Configurations, IOS Versions and Firmware
2. Regular Backups rotated off-site and tested for data integrity
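Both operational changes above depend on knowing whether a stored copy still matches the original. A common technique is to record a SHA-256 digest of each device configuration and each backup file, then compare digests later: a changed configuration digest means the documentation is out of date, and a mismatched backup digest means the off-site copy is corrupt. This minimal sketch uses Python's standard `hashlib`; how the configuration text is collected from devices is not shown and is left as an assumption, and all function names are hypothetical.

```python
import hashlib
from pathlib import Path
from typing import Dict


def fingerprint(data: bytes) -> str:
    """SHA-256 hex digest used to detect any change in a config or backup file."""
    return hashlib.sha256(data).hexdigest()


def snapshot_configs(configs: Dict[str, str]) -> Dict[str, str]:
    """Map each device name to the fingerprint of its current configuration text."""
    return {device: fingerprint(text.encode("utf-8")) for device, text in configs.items()}


def config_drift(previous: Dict[str, str], current: Dict[str, str]) -> Dict[str, str]:
    """Compare two snapshots and report devices that were added, removed or changed."""
    drift = {}
    for device in previous.keys() | current.keys():
        if device not in current:
            drift[device] = "removed"
        elif device not in previous:
            drift[device] = "added"
        elif previous[device] != current[device]:
            drift[device] = "changed"
    return drift


def verify_backup(backup_path: Path, expected_digest: str) -> bool:
    """True if the off-site copy still matches the digest recorded when it was taken.
    The file is read in 1 MiB chunks so large backups do not need to fit in memory."""
    h = hashlib.sha256()
    with backup_path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest() == expected_digest
```

A matching digest only proves the copy is bit-for-bit intact. It does not prove the data restores correctly, so periodic test restores are still needed.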
The following describes recommended design changes:
Review and modify design, infrastructure, configuration, security and management for improved network resiliency and availability. It is my contention that running a network assessment is an effective strategy for determining what changes should be made to your network. The argument could be made that every type of assessment has some effect on network availability and resiliency. The availability assessment will collect most of the key information; however, the security assessment must also be considered, since weaknesses in company security will expose your network to attacks. When your network is being attacked, it isn't available!
Management strategy assessments are key as well, since the absence of effective management policies and applications creates a tenuous situation. For instance, without change management policies, employees will change application and device configurations (assuming they have security authorization) without prior approval and at any time of day. Suppose a configuration change doesn't work as expected, and it is 10 am while employees are starting their day. Guess what: your day just got longer.
Proactive fault and performance monitoring strategies will indicate when a device or server is not operational or is near capacity, and those situations obviously affect network availability. The performance assessment describes how well the network is performing, whether there are any capacity issues and which offices are affected. The infrastructure assessment focuses on issues such as media mismatches, switch port capacity, IOS version problems, router memory shortages, application software versions and protocols. The availability assessment also considers facilities, focusing on rack space, temperature controls, power availability and raised floors.
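The proactive monitoring described above comes down to two checks per device: is it responding, and is it near capacity? The sketch below applies both to a list of polled readings and returns alerts. It assumes the readings were already collected (for example by ping or SNMP polling, which is not shown), and the 80% capacity threshold, field names and function name are hypothetical values chosen for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DeviceStatus:
    name: str
    reachable: bool            # result of a reachability poll (collection not shown)
    cpu_percent: float
    link_util_percent: float   # utilization of the busiest interface


def check(devices: List[DeviceStatus], capacity_threshold: float = 80.0) -> List[str]:
    """Return fault alerts for unreachable devices and capacity alerts for
    devices at or above the (hypothetical) capacity threshold."""
    alerts = []
    for d in devices:
        if not d.reachable:
            alerts.append(f"FAULT {d.name}: not responding")
            continue  # capacity readings are meaningless for an unreachable device
        if d.cpu_percent >= capacity_threshold:
            alerts.append(f"CAPACITY {d.name}: CPU at {d.cpu_percent:.0f}%")
        if d.link_util_percent >= capacity_threshold:
            alerts.append(f"CAPACITY {d.name}: link utilization at {d.link_util_percent:.0f}%")
    return alerts
```

Alerting on capacity before a device fails is what makes the monitoring proactive: the performance assessment can then say which offices sit behind the congested links.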
Select Failover Strategies
1. Online data synchronization between the production Data Center and a remote Data Center facility. The cutover (convergence) time should be transparent to employees, and all current data would be available. This requires the expense of a remote facility with routers, switches, and matching servers and applications to keep the two Data Centers synchronized. Cisco DistributedDirector technology can be used to configure both Data Centers for concurrent operation if that is required.
2. Configure DistributedDirector to redirect sessions to the alternate Data Center once a specified percentage of TCP sessions is running at the primary Data Center. It is still a good idea to consider the standby sites described below, since both online Data Centers could become unavailable.
3. Configure a 48-hour standby site for the company: a remote facility that has all the equipment necessary for restoring a specified service level within 48 hours. This is a temporary strategy for continuing network service for a short time until the problems are fixed or service is cut over to a 10-day site. It can be provisioned by company employees or contracted to a third-party DRP vendor.
4. Configure a 10-day standby site for the company: a remote facility that has all the equipment necessary for restoring all specified services within 10 days. This would be used in a situation where restoring Data Center services would take months. It can be provisioned by company employees or contracted to a third-party DRP vendor.
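The four failover strategies form a decision sequence: balance sessions between the online Data Centers, fail over entirely if one is down, and fall back to standby sites only when both are down. The sketch below models that logic in plain Python. It is not DistributedDirector configuration; it only illustrates the policy. The 60% redirect share, the 30-day cutoff standing in for "months", the reading of "a certain percentage" as a share of all active sessions, and all function names are assumptions.

```python
from typing import List


def choose_data_center(primary_sessions: int, alternate_sessions: int,
                       redirect_share: float = 0.6) -> str:
    """Send new sessions to the alternate Data Center once the primary carries
    more than redirect_share of all active TCP sessions (hypothetical threshold)."""
    total = primary_sessions + alternate_sessions
    if total == 0:
        return "primary"
    return "alternate" if primary_sessions / total > redirect_share else "primary"


def recovery_actions(primary_up: bool, alternate_up: bool,
                     expected_restore_days: float,
                     long_outage_days: float = 30.0) -> List[str]:
    """Pick failover actions in the order described above. long_outage_days is an
    assumed cutoff for an outage that will take 'months' to repair."""
    if primary_up:
        return ["serve from primary Data Center"]
    if alternate_up:
        return ["redirect all sessions to alternate Data Center"]
    # Both online Data Centers are down: the 48-hour site restores a specified
    # service level first, and the 10-day site takes over for long outages.
    actions = ["activate 48-hour standby site"]
    if expected_restore_days > long_outage_days:
        actions.append("cut over to 10-day standby site")
    return actions
```

For example, with 70 sessions at the primary and 30 at the alternate, `choose_data_center(70, 30)` returns `"alternate"`, because 70% exceeds the assumed 60% share.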
Test your disaster recovery (business continuity) strategy using the action plan document from the strategy phase. Hold a meeting with specific employees and vendors to discuss responsibilities, time frames, the test event sequence and processes. Change the company strategy and action plan as the testing phase identifies problems or as company requirements change. Plan on testing the disaster recovery plan regularly, 3 to 4 times per year.
The results from contingency testing will be used to make sound recommendations for improving the disaster recovery strategy and the testing process. The complexity of your organization will affect how difficult it is to build a workable disaster recovery plan. The recommendations will streamline your DRP and help ensure it works when it is required. The on-demand circuit is homed to the remote DR facility router, where it converges with the company network so that employees can use the mainframe applications. The DR mainframe should be synchronized with the company mainframe for transactions during the period before service is restored.
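Turning test results into recommendations is easier when every contingency test is recorded the same way: the recovery time the plan promises, the time actually measured, and whether the restored data checked out. The sketch below flags each service that missed its target or failed verification. The `TestResult` fields, the function name and the sample services are hypothetical; use whatever targets your action plan defines.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TestResult:
    service: str
    target_minutes: float     # recovery time the action plan promises for this service
    measured_minutes: float   # time observed during the contingency test
    data_verified: bool       # restored data passed integrity checks


def findings(results: List[TestResult]) -> List[str]:
    """List the problems a test uncovered, one line per missed target or failed check."""
    out = []
    for r in results:
        if r.measured_minutes > r.target_minutes:
            out.append(f"{r.service}: recovery took {r.measured_minutes:g} min, "
                       f"target {r.target_minutes:g} min")
        if not r.data_verified:
            out.append(f"{r.service}: restored data failed verification")
    return out
```

Comparing the findings from successive tests shows whether changes to the strategy and action plan are actually shortening recovery.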