CCAMP Working Group                                P. Czezowski (FLA)
Internet Draft                                       T. Soumiya (FLL)
draft-czezowski-optical-recovery-reqs-01.txt                (Editors)
Expires: August 2003                                    February 2003


            Optical Network Failure Recovery Requirements
            draft-czezowski-optical-recovery-reqs-01.txt

Status of this Memo

   This document is an Internet-Draft and is in full conformance with
   all provisions of Section 10 of RFC2026 [1].

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups. Note that
   other groups may also distribute working documents as
   Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other
   documents at any time. It is inappropriate to use Internet-Drafts
   as reference material or to cite them other than as "work in
   progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html.

Abstract

   This draft presents requirements for control plane-based recovery
   from data plane failures in pre-OTN networks. A pre-OTN network is
   a transport network that has a GMPLS-based control plane and
   various transport plane technologies (such as Optical
   Cross-Connects and Optical Add/Drop Multiplexers). An important
   feature of these networks is timely recovery from failures, using
   either a protection or a restoration scheme. However, achieving
   recovery under strict time constraints is a difficult problem.
   Shared mesh-based recovery is especially desirable because it
   reduces spare capacity and allows more flexible recovery scenarios
   than ring-based networks. Following a brief overview and
   discussion, the requirements are presented as an itemized list in
   Section 3.4 of this document.

Czezowski & Soumiya (Eds.)
Expires - August 2003                                        [Page 1]
draft-czezowski-optical-recovery-reqs-01.txt            February 2003

Conventions used in this document

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in
   this document are to be interpreted as described in RFC-2119 [2].

Table of Contents

   1. Introduction
   2. Glossary of Terms Used
   3. Failure Recovery Requirements
      3.1 Overview of Recovery Requirements
      3.2 Shared Mesh-based Recovery
      3.3 Failure Notification Mechanisms
      3.4 pre-OTN Network Failure Recovery Requirements
   4. Security Considerations
   5. Conclusions
   References
   Acknowledgments
   Editors' Addresses
   Contributing Authors

1. Introduction

   This draft describes requirements for control plane-based recovery
   from data plane failures in pre-OTN networks. A pre-OTN network is
   a transport network that has a GMPLS-based [3] control plane and
   various transport plane technologies, such as Optical
   Cross-Connects (OXC) and Optical Add/Drop Multiplexers (OADM).
   Service recovery from failures, using either a protection or a
   restoration scheme, is an important feature of these networks for
   ensuring high availability and uninterrupted service. Achieving
   service recovery under strict time constraints is a difficult
   problem. Several mechanisms for recovery in mesh and ring
   topologies have been devised.
   Protection and restoration algorithms can be used for local repair
   (around failed spans or nodes) or edge-to-edge recovery of an LSP.
   Shared mesh-based recovery is especially desirable for reducing
   spare capacity requirements and achieving flexible service
   recovery scenarios. While edge-to-edge recovery has the potential
   to lower redundancy requirements, it also entails the potentially
   lengthy delay incurred in notifying all nodes along the recovery
   path of the failure of a remote resource. For some applications,
   recovery paths must be chosen carefully to meet strict recovery
   time requirements (e.g., 50 ms).

   There are currently several Internet Drafts in the Sub-IP Area
   related to recovery in GMPLS networks. They cover terminology [4],
   functional specification [5], and analysis of mechanisms [6] for
   recovery in GMPLS-based networks, as well as survivability
   requirements and considerations for traffic engineered or
   hierarchical networks [7,8]. As a set, these documents provide
   detailed descriptions of the concepts and mechanisms used in
   network recovery. However, the requirements for control
   plane-based recovery have not been listed in any one document.

2. Glossary of Terms Used

   The following acronyms are used in this document:

   o GMPLS:   Generalized Multi-Protocol Label Switching [3]
   o LMP:     Link Management Protocol [9]
   o LSP:     Label Switched Path
   o LSR:     Label Switching Router
   o OADM:    Optical Add/Drop Multiplexer
   o OTN:     Optical Transport Network
   o OXC:     Optical Cross-Connect
   o RSVP-TE: Resource Reservation Protocol - Traffic Engineering [10]

   The terminology for GMPLS-based recovery is documented in [4].
   These terms are borrowed from a work in progress at the ITU-T [11].
   Here, we use the following terms from that document:

   o Detecting Entity (Failure Detection): An entity that detects a
     failure or group of failures, thus providing a non-correlated
     list of failures.

   o Reporting Entity (Failure Correlation and Notification): An
     entity that can make an intelligent decision on fault
     correlation and report the failure to the deciding entity.

   o Deciding Entity (part of the failure recovery decision process):
     An entity that makes the recovery decision or selects the
     recovery resources. This entity communicates the decision
     regarding the recovery actions to be performed to the impacted
     LSPs/spans.

   o Recovery Entity (part of the failure recovery activation
     process): Any entity that participates in the recovery of the
     LSPs/spans.

   o Bridge: The function that connects the normal traffic and extra
     traffic to the working and recovery LSP/span, respectively.
     There are three types of bridges (Permanent Bridge, Broadcast
     Bridge, and Selector Bridge).

   o Selector: The function that extracts the normal traffic from
     either the working or the recovery LSP/span. There are two types
     of selectors (Selective Selector and Merging Selector).

   o Recovery phases:
     1. Failure Detection
     2. Failure Localization and Isolation
     3. Failure Notification
     4. Recovery (Protection or Restoration)
     5. Reversion (Normalization)

3. Failure Recovery Requirements

   Even though some requirements for fault recovery have been
   discussed in working groups of the Sub-IP Area, several additional
   aspects of recovery in pre-OTN networks should be examined. In
   this section, we describe the fault recovery requirements that we
   see. For completeness, we do not try to avoid restating
   requirements listed in other drafts.
3.1 Overview of Recovery Requirements

   This subsection summarizes the survivability requirements for
   pre-OTN networks. Greater detail on the requirements is provided
   in the subsequent subsections.

   The following classes (types) of recovery are required for span,
   LSP segment, and LSP recovery:

   o Protection
     - pre-computed route and pre-selected (i.e., cross-connected)
       resources

   o Restoration
     - pre-computed route and on-demand selection of resources
     - on-demand route and on-demand selection of resources

   A recovery scheme uses either protection or restoration (or both),
   together with failure detection and notification mechanisms and
   protocols. Depending on the service specification, the timing
   bounds for the recovery schemes range from 50 ms (for local repair
   of services carrying voice calls) to seconds (for low priority
   path-based repair). For multi-layered networks, hold-off timers
   are required to allow recovery at lower layers, and escalation
   must be supported. Support for horizontal hierarchy must also be
   included, because large networks are usually segmented [7].

   In general, recovery schemes are required to operate in a stable
   and cooperative manner to maximize the network's reliability and
   availability. Such requirements entail that the recovery schemes
   also be resource efficient and as flexible as possible with
   respect to types of failures, service classes, and the network
   operator's policies.

   A temporal model of fault recovery is shown in Figure 1 below. The
   diagram is adapted from [11].

      +-Network Impairment
      |    +-Fault Detection
      |    |    +-Start of Fault Notification
      |    |    |    +-Start of Traffic Switching
      |    |    |    |    +-Recovery Operation Complete
      |    |    |    |    |    +-Traffic Recovered
      |    |    |    |    |    |
      v    v    v    v    v    v
      ----------------------------------------------->
      | T1 | T2 | T3 | T4 | T5 |                 time

                 Figure 1. Recovery temporal model.
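   The temporal model in Figure 1 can be read as simple arithmetic on
   event timestamps: each interval T1 through T5 is the difference
   between two consecutive events, and the total recovery time is
   their sum. The following Python sketch is illustrative only. The
   event names are taken from Figure 1; the function names, the
   sample timestamps, and the use of 50 ms as the default budget are
   assumptions made for this example.

```python
# Events of Figure 1, in the order in which they occur.
EVENTS = (
    "network_impairment",
    "fault_detection",
    "start_of_fault_notification",
    "start_of_traffic_switching",
    "recovery_operation_complete",
    "traffic_recovered",
)


def intervals_ms(timestamps_ms):
    """Return [T1, T2, T3, T4, T5] from six event timestamps (in ms)."""
    if len(timestamps_ms) != len(EVENTS):
        raise ValueError("expected one timestamp per event in Figure 1")
    pairs = list(zip(timestamps_ms, timestamps_ms[1:]))
    if any(later < earlier for earlier, later in pairs):
        raise ValueError("event timestamps must be non-decreasing")
    return [later - earlier for earlier, later in pairs]


def meets_budget(timestamps_ms, budget_ms=50):
    """True if impairment-to-recovered time fits within the budget."""
    return sum(intervals_ms(timestamps_ms)) <= budget_ms


# Hypothetical measurements, in milliseconds after the impairment.
sample = [0, 5, 8, 30, 42, 45]
print(intervals_ms(sample))  # [5, 3, 22, 12, 3]
print(meets_budget(sample))  # True: the intervals sum to 45 ms
```

   In this made-up sample the largest interval is T3, the time between
   the start of fault notification and the start of traffic switching.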
   The five recovery phases shown in the figure are (using the terms
   from [4]):

   1. Failure Detection - The time between the network impairment and
      its detection at the control plane (via a technology-dependent
      interface of the transport plane).

   2. Failure Localization and Isolation - The time between when the
      detecting entity has detected a fault and when the reporting
      entity starts the fault-recovery process. This time accounts
      for the possibility that the fault-recovery process at a given
      layer waits for restoration or recovery to occur at another
      layer. The reporting entity also performs failure correlation
      to reduce the number of notifications to be sent to the
      deciding entity.

   3. Failure Notification - The time between when the reporting
      entity starts the notifications and when all the necessary
      deciding and recovering entities have received the failure
      notifications.

   4. Recovery (Protection or Restoration) - The time between the
      first and last recovery actions, after which the recovery path
      is carrying traffic.

   5. Reversion (Normalization) - The time (after recovery) until the
      original working path has been repaired and begins to carry the
      traffic again.

   Together, phases 1 and 2 are called Fault Management. The critical
   component in meeting the time constraints for service recovery is
   the Failure Notification phase.

   A recovery scheme should follow these steps. The scheme should
   also allow the network operator to choose whether or not reversion
   is performed.

3.2 Shared Mesh-based Recovery

   In non-WDM optical networks, such as Synchronous Optical Network /
   Synchronous Digital Hierarchy (SONET/SDH), conventional protection
   techniques are currently the most commonly used. These techniques
   are based on linear and ring network topologies. Linear protection
   can be categorized as 1+1 and 1:N protection.
   Ring protection can be categorized as uni-directional path
   switched ring (UPSR) and bi-directional line switched ring (BLSR).
   However, linear 1+1 protection requires 100% redundancy in the
   spare resources for every working path. For ring-based protection,
   the available topology is restricted to a ring, and it requires
   100% redundancy in the spare resources for every working path.
   Even with 1:N based link protection, it is difficult to select
   different routes flexibly. From this point of view, 1+1 and 1:N
   protection are costly in resource usage and have low flexibility,
   even though the level and speed of recovery from a failure can be
   assured.

   For reasons of efficiency and flexibility, pre-OTN network
   recovery schemes should support shared mesh-based recovery. Shared
   mesh recovery saves resources by sharing recovery capacity among
   multiple working paths. This approach increases system
   flexibility, because sharing recovery resources allows more
   options when routing working paths and recovery paths.
   Furthermore, this flexibility facilitates fast recovery because
   the shared mesh provides more suitable intermediate nodes for
   routing the recovery paths. Having more candidates increases the
   chances of finding shorter recovery paths, which reduces the
   notification time.

3.3 Failure Notification Mechanisms

   In general, there are two alternatives for control plane-based
   failure notification:

   o Failure notification messages based on modified GMPLS signaling

   o Controlled flooding of failure notification messages

   The GMPLS signaling protocol, RSVP-TE [10], supports notification
   using a Notify message. Under this scheme, the deciding entity
   pre-arranges to receive the notifications by sending a Notify
   Request object in the Path or Resv messages.
   Since additional (extra) Notify Request objects in an RSVP-TE
   message are ignored, a detecting (or reporting) entity sends
   Notify messages to only one deciding entity per LSP. The recovery
   process uses a two- or three-phase method. In the first phase, the
   reporting entity sends the notification to the deciding entity.
   The deciding entity then begins one- or two-phase signaling down
   (or down and back) the recovery LSP.

   The controlled flooding of fiber link failure notification
   messages on the control plane, perhaps by extending LMP [9], is
   another alternative for failure notification. Flooding the
   notifications in one shot to an appropriate portion of the network
   ensures their timely delivery. This supports recovery schemes that
   require policy or priority-based decisions at multiple deciding
   entities that may be distributed within the network, off the
   working path.

   To meet the time constraints for recovery, the failure
   correlation/aggregation time for the computations performed at the
   reporting entity must be minimized, and the time that elapses
   before all entities involved in the recovery have received a
   failure notification (or recovery action) signal must also be
   minimized. The flooded messages will take the shortest available
   paths to all these entities.

                       +---+
                  .....| E |..............
                  :    +---+             :
                  :                      :
       +---+    +---+   \ /   +---+    +---+
    ===| A |====| B |====X====| C |====| D |===
       +---+    +---+   / \   +---+    +---+
         :                               :
         :      +---+         +---+      :
         :......| F |.........| G |......:
                +---+         +---+

   Figure 2. Multiple (partial) recovery paths protecting against the
             failure of link BC.

   Figure 2 above shows a network in which a failure has occurred on
   link BC. The working LSPs follow the route ABCD, and two (dotted)
   recovery paths have been reserved, but not activated. Recovery
   paths BED and AFGD are each responsible for recovering a portion
   of the working capacity on link BC.
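   As an illustration of flooding over the topology of Figure 2, the
   following Python sketch counts each control-plane hop as one unit
   of delay and computes the hop at which a flooded notification
   first reaches each node. It assumes that control adjacencies
   mirror the links drawn in Figure 2, that the B-C adjacency is lost
   together with the link, and that node B originates the flood.
   These choices and all names in the code are hypothetical, and real
   per-hop delays would differ from link to link.

```python
from collections import deque

# Assumed control-plane adjacencies, mirroring the links of Figure 2
# with the failed link B-C removed (an assumption of this sketch).
ADJACENCY = {
    "A": ["B", "F"],
    "B": ["A", "E"],
    "C": ["D"],
    "D": ["C", "E", "G"],
    "E": ["B", "D"],
    "F": ["A", "G"],
    "G": ["F", "D"],
}


def flood_hops(adjacency, origin):
    """Return the hop count at which a flood first reaches each node."""
    hops = {origin: 0}
    queue = deque([origin])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency[node]:
            # A node keeps only the first copy of the notification it
            # receives; later duplicates are ignored.
            if neighbor not in hops:
                hops[neighbor] = hops[node] + 1
                queue.append(neighbor)
    return hops


hops = flood_hops(ADJACENCY, "B")
recovery_nodes = ["A", "B", "D", "E", "F", "G"]
worst_case = max(hops[node] for node in recovery_nodes)
print(hops)
print(worst_case)  # 3: node G is three hops from B on either side
```

   Under these assumptions, every node on recovery paths BED and AFGD
   is reached within three hops of the flood's origin.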
   In the scenario of Figure 2, nodes A, B, D, E, F, and G must all
   receive a notification of the failure and perform reconfiguration
   actions.

   A flooding-based approach to fault notification not only has the
   benefit of reaching all recovery nodes in the shortest time
   possible, but also has the beneficial side effect that all nodes
   in the vicinity of the failure receive the notification.
   Therefore, other nodes in the neighborhood of the failure (say,
   Node H and Node I, not shown in Figure 2) can use this information
   in making policy or priority-based decisions, such as dynamically
   rerouting low-priority LSPs around the neighborhood to free up
   capacity, or blocking new LSP requests that do not have a high
   enough priority value.

3.4 pre-OTN Network Failure Recovery Requirements

   This is our list of recovery requirements:

   o Requirements on the efficiency of working and recovery bandwidth

     (1)  A recovery scheme SHOULD allow efficient use of working LSP
          bandwidth using such measures as route optimization, taking
          into account route dependencies between a working path and
          its recovery path.

     (2)  A recovery scheme SHOULD allow efficient use of recovery
          LSP bandwidth using such measures as route optimization,
          taking into account route dependencies between a working
          path and its recovery path.

     (3)  A recovery scheme SHOULD, when possible, allow sharing of
          recovery bandwidth among multiple recovery paths to enable
          efficient use of recovery bandwidth.

   o Requirements on recovery actions

     (4)  A recovery scheme SHOULD allow suppression of fault
          notification messages, so that spurious fault notification
          and recovery action messages are not broadcast within the
          network, which ensures the scalability of the fault
          recovery mechanism.

     (5)  A recovery scheme SHOULD ensure reliable transmission of
          fault recovery messages, provided that the control plane is
          connected.
     (6)  A recovery scheme SHOULD allow fallback operations of its
          recovery actions. For example, when the system encounters a
          fault class (e.g., multiple simultaneous failures) that was
          not anticipated, the system should execute a best-effort
          recovery, such that as many working paths as possible are
          restored under the circumstances.

     (7)  A recovery scheme SHOULD allow the network operator to
          choose whether or not the reversion actions are to be
          performed.

     (8)  A recovery scheme SHOULD support recovery within bounded
          time constraints and MAY be compliant with generally used
          recovery times, such as 50 ms for SONET/SDH protection.

     (9)  A recovery scheme SHOULD allow testing and verification of
          the availability of the recovery path before its actual
          use. This testing may occur when the recovery path is
          provisioned, or after it is provisioned but before a
          recovery action causes the path to be used.

     (10) A recovery scheme SHOULD guarantee that recovery actions
          correctly deliver traffic from working paths to the
          respective recovery paths, such that the recovery actions
          do not result in any unintended connections or unintended
          diversion of traffic.

   o Requirements on recovery schemes

     (11) A recovery scheme SHOULD support and be compliant with
          generally used protection schemes such as 1+1, 1:1, 1:N,
          M:N, and unprotected.

     (12) A recovery scheme SHOULD support recovery of failed LSPs
          even if the LSPs have different endpoints.

     (13) A recovery scheme SHOULD support priority-based recovery of
          failed LSPs. This means that path restoration should be
          ordered according to each LSP's recovery priority.
   o Requirements on recovery priority of service classes

     (14) A recovery scheme SHOULD allow recovery of service classes
          based on their recovery priority, which is a continuous
          spectrum from the lowest priority (best effort) to the
          highest priority (guaranteed), based on the service class
          usage and a carrier's agreements with its customers.

     (15) A recovery scheme SHOULD allow support of service classes
          with different recovery time guarantees. For example, the
          authors estimate that a service class carrying voice calls
          requires a recovery time of less than 50 ms to avoid loss
          of connections, whereas a service class carrying private
          lines requires a recovery time on the order of several
          seconds.

   o Requirements on recovery granularity

     (16) A recovery scheme SHOULD allow recovery of traffic on an
          aggregated basis, ensuring scalability.

   o Requirements on failure notification delivery

     (17) A recovery scheme SHOULD be equipped with a failure
          notification mechanism that guarantees prompt and reliable
          delivery of notification of faults in the data plane to a
          deciding entity that is in charge of recovering the fault.

4. Security Considerations

   This draft does not introduce any new security issues.

5. Conclusions

   This draft describes requirements for control plane-based recovery
   from data plane failures in pre-OTN networks. While there are
   currently several Internet Drafts in the Sub-IP Area related to
   service recovery in GMPLS networks, the requirements for control
   plane-based recovery have not been listed in any one document. We
   identify the most important requirements as meeting the
   potentially strict timing bounds, enabling flexible recovery
   schemes, and facilitating the efficient use of resources.
   Seventeen requirements are listed in Section 3.4.

References

   [1]  Bradner, S., "The Internet Standards Process -- Revision 3",
        BCP 9, RFC 2026, October 1996.
   [2]  Bradner, S., "Key words for use in RFCs to Indicate
        Requirement Levels", BCP 14, RFC 2119, March 1997.

   [3]  Mannie, E. (Ed.), "Generalized Multi-Protocol Label Switching
        (GMPLS) Architecture", Internet Draft, work in progress,
        draft-ietf-ccamp-gmpls-architecture-03.txt, August 2002.

   [4]  Mannie, E. and D. Papadimitriou (Eds.), "Recovery (Protection
        and Restoration) Terminology for GMPLS", Internet Draft, work
        in progress,
        draft-ietf-ccamp-gmpls-recovery-terminology-01.txt,
        November 2002.

   [5]  Lang, J.P. and B. Rajagopalan (Eds.), "Generalized MPLS
        Recovery Functional Specification", Internet Draft, work in
        progress, draft-ietf-ccamp-gmpls-recovery-functional-00.txt,
        January 2003.

   [6]  Papadimitriou, D. and E. Mannie (Eds.), "Analysis of
        Generalized MPLS-based Recovery Mechanisms (including
        Protection and Restoration)", Internet Draft, work in
        progress, draft-ietf-ccamp-gmpls-recovery-analysis-00.txt,
        January 2003.

   [7]  Lai, W.S. and D. McDysan (Eds.), "Network Hierarchy and
        Multilayer Survivability", RFC 3386, November 2002.

   [8]  Owens, K., et al., "Network Survivability Considerations for
        Traffic Engineered IP Networks", Internet Draft, work in
        progress, draft-owens-te-network-survivability-03.txt,
        May 2002.

   [9]  Lang, J. (Ed.), "Link Management Protocol (LMP)", Internet
        Draft, work in progress, draft-ietf-ccamp-lmp-07.txt,
        November 2002.

   [10] Berger, L. (Ed.), "Generalized MPLS Signaling - RSVP-TE
        Extensions", Internet Draft, work in progress,
        draft-ietf-mpls-generalized-rsvp-te-09.txt, September 2002.

   [11] ITU-T Draft Recommendation G.gps, "Generic Protection
        Switching", work in progress, April 2002.

Acknowledgments

   The following individuals provided valuable input to this draft:
   Richard Rabbat, Ching-Fong Su, and Takafumi Chujo of Fujitsu Labs
   of America, Inc.; Norihiko Shinomiya and Akira Chugo of Fujitsu
   Laboratories, Ltd.
Editors' Addresses

   Peter Czezowski                   Toshio Soumiya
   Fujitsu Labs of America, Inc.     Fujitsu Laboratories Ltd.
   595 Lawrence Expressway           1-1, Kamikodanaka 4-Chome
   Sunnyvale, CA 94085               Nakahara-ku, Kawasaki
   United States of America          211-8588, Japan
   Phone: +1-408-530-4516            Phone: +81-44-754-2765
   Email: peterc@fla.fujitsu.com     Email: soumiya.toshio@jp.fujitsu.com

Contributing Authors

   Peter Czezowski
   (see address information above)

   Toshio Soumiya
   (see address information above)

   Kohei Shiomoto
   NTT Network Innovation Laboratories
   Midori-machi 3-9-11, Musashino-shi
   Tokyo, Japan 180-8585
   Phone: +81-422-59-4402
   Email: Shiomoto.Kohei@lab.ntt.co.jp

   Shoichiro Seno
   Mitsubishi Electric Corporation
   5-1-1 Ofuna, Kamakura
   Kanagawa, Japan 247-8501
   Phone: +81-467-41-2430
   Email: senos@isl.melco.co.jp