Pre-configured polyhedron based protection against multi-link failures in optical mesh networks

Shanguo Huang; Bingli Guo; Xin Li; Jie Zhang; Yongli Zhao; Wanyi Gu

doi:10.1364/OE.22.002386

1. Introduction

In wavelength division multiplexing (WDM) optical network, pre-assigned capacity is usually exploited to protect end-to-end traffic against network link failures, referred as protection approach. Such proactive approach could achieve higher survivability guarantee and faster recovery time through simple switching over operation in the frontend node, compared with the reactive re-routing based restoration technique. According to the protected entity in different optical management layers, in general, there are two alternatives: link protection and path protection. Specifically, lightpath is protected in optical channel sublayer (referred as path protection) and physical disjoint lightpaths are reserved as the primary and backup channels of end-to-end traffic. Once a link is failed, the path protection entails the rerouting of all working lightpaths that use the failed link along their pre-computed backup lightpaths. This flexibility in recovery could lead to lower protection capacity but requires that all failed lightpaths implement their recovery individually which need more hardware operations. On the other hand, in the optical multiplex section layer, multiplexed WDM channels transmitted in one fiber are monitored. Thus, in this layer, fault recovery regards each fiber link individually (referred as link protection). Without the flexibility in individual lightpath rerouting, link protection may require more protection capacity, but requires only local knowledge around the failed link to complete the recovery which is critical in practical development. So, in this paper, we focus on link protection approach.

Regarding to the failure model, a lot of studies have been devoted to protection problems in IP over WDM optical networks [1–3] which are mostly designed for single link failures. However, network deployments across expanded geographic domains are mandating services recovery under highly-challenging multi-failure conditions.

First, recovery from the failure of a link is completed within a few milliseconds to a few seconds depending on the mechanism used for recovery. Usually, it would take several hours to weeks to repair the physical faults. During this period, it is conceivable that other failures would happen, which results two or even more links fail at the same time. Another consideration of the multiple link failures arise from practical implementations since fibers connected different node pairs may share same physical risk. For instance, fibers are laid in the same conduits or pipelines, referred as common risk group, and a failure of such shared resources would lead to multiple link failures. Moreover, there is much interest in handling multiple failures arising from large-scale “stressors” such as natural disasters (earthquakes, hurricanes, floods, snow storms, etc), massive power outages, and malicious weapons of mass destruction attacks.

How to survive from multiple link failures has been a critical problem for optical networks [4–6], for which schemes like redundant capacity, capacity reprovision and ring re-configurations [7] have been proposed. These solutions are limited in scope and generally treat these failures independently. Therefore it is inevitable that they will yield high computation overhead, high resource consumption (inefficiency), and increased blocking and reduced revenues.

In light of the above, there is a pressing need to develop improved multi-failure recovery solutions which also incorporate resource provision efficiency. This paper addresses this critical area and is organized as follows. First, Section II presents an overview of some existing work in dual and multi-failure recovery. Next, a novel pre-configured polyhedron (PCP) protection structure based on k-regular and k-edge connected graph is proposed in Section III and heuristic algorithm to construct such structure in Section IV. Section V then presents detailed simulation results and compares the performance of the proposed scheme with various existing solutions. Finally, conclusions and directions for future work are presented.

2. Related work

As discussed above, researchers have investigated dual failure and multi-failure recovery strategies in WDM network and in this paper we restrict ourselves to the case of link based protection against multiple link failures in mesh topology. Some existing approaches can be found in the recent researches and are briefly reviewed here.

One is ring based protection strategies which is attractive because of its simple automatic protection switching (APS) mechanisms, which automatically switches traffic over to protection fibers in the event of failure. For example, P-cycle and cycle double covers are well studied topics [7,8], which achieve both fast restoration speed and high capacity efficiency. The ring based protection used in mesh networks first decompose the topology into a set of rings, and then restores the traffic whenever a failure occurs at a ring. So, ring construction problem is critical in this scenario and various methods have been proposed [7,9]. Obviously, each ring or cycle could not afford more than one link failures. Also, ring based solutions pose difficult optimization problems of finding ring covers, and could not provide sufficient connectivity between all pairs of nodes under multiple link failures without complex extensions to enable protection paths to hop between rings.

Another method is subgraph protection proposed in [10,11], e.g., loopback subgraph protection and star-block protection. In this method, instead of forming cycles, a directed subgraph that covers all nodes is obtained. In generalized loopback approach, upstream node detects the failure, then it would perform a loopback of traffic destined for the failed link, and would flood all possible routes on the protection capacity available. The protection network mimics the live network, and is called conjugate diagraph. Eventually, the intended destination will receive the signal on one of the flooded routes. There are several optimized loopback recovery schemes in terms of capacity efficiency and computational complexities, and please refer to [10,11]. However, lacking of protection structure connectivity consideration leaves subgraph protection bears longer recovery path and limited recovery capability under multi-link failures.

On the other hand, to against dual or sequential link failures in mesh network, protection reconfiguration can be used in conjunction with most protection schemes to achieve the maximum robustness for a given topology leaving the network vulnerable only to complete partitioning [7,11–13]. For example, p-cycle reconfiguration [7] or mutual exclusion [11] which need more signaling and management overhead. Also, some combination of the protection and restoration may be implemented [12,13], but the increase in protocol, management and hardware complexity in implementing two different recovery protocols render the solution less attractive. Also in very recently, there are several works investigate path protection under multiple link failures [14,15] which is a complex optimization problem.

So, with the discussion above, we could conclude that: (1) previous work mainly focus on single or dual link failures, and sequential failures, where failures could be fixed one by one, but seldom on random probabilistic multi-link failures; also it is generally difficult to extend single or dual-failure schemes to handle multiple random failures; (2) some existing multiple link failures researches address maximizing the reliability of connections against failure events, nevertheless discuss the spare resource efficiency objectives. Hence, the design of probabilistic recovery schemes for link protection is very germane here. Recently, we proposed the PCP based protection scheme which is formulated as an ILP problem, and its efficiency are primarily demonstrated in COST 239 network [16]. Also, in [17], greedy algorithms are implemented to search existing PCP structure in a given topology. In this work, heuristics are developed to implement link protection for various topologies by constructing PCP structure, especially for large size network which could not be protected by a single PCP. Also, probabilistic link failures are considered for the first time.

3. k-regular and k-connected structures and pre-configured polyhedron

As discussed above, there are two important objectives of survivable design for optical networks: failure robustness and spare resource efficiency. In this section, we first present the definition of k-regular and k-edge connected graph and prove it is the optimal protection structure for multi-link failures through examples and theoretical analysis. Then, PCP scheme is elaborated and some designing concerns are discussed.

3.1 k-regular and k-edge connected graph and PCP schemes

Definition 1: k-regular and k-edge connected graph: k-regular represents the degree of each node in a graph is k and k-edge connected graph means there does not exist a set of fewer than k edges whose removal disconnects the graph. The proposed k-regular and k-edge connected structure is based on the Menger's theorem which is one of the cornerstones of graph theory.

Theorem 3.1. (Menger Theorem) [18,19]

(1) Let G = (V, E) be a graph and {A, B} $\subseteq$ V. Then the minimum number of edges separating A from B in G is equal to the maximum number of edge-disjoint A−B paths in G.
(2) A graph is k-edge-connected if and only if it contains k edge-disjoint paths between any two vertices.

It is worth noting that, although k-edge connected is robust enough against multi-link failures, k-regular is still necessary here to keep the resource efficiency of protection structure.

With the definition above, an example is given to show the resource efficiency and failure robustness of k-regular and k-edge connected graph. For a completely connected physical network with 8 nodes in Fig. 1(a), we assume the working capacity to be protected in each link is one fiber. One type of Ring protection structure is presented in Fig. 1(b), which includes 6 separate cycles (shown with the orange line in each surface of the cube) and 3 in the rear are omitted here. Obviously, this Ring structure with 2 spare fibers reserved in each link could provide 100% recovery guarantee for one or dual failures of on-structure link (which is on the protection structure/ring, e.g. AB, AF, FG et.al.) and span link (those links connected two nodes on the protection structure/ring but not on-structure links, e.g. AG, BF, CE et.al.). So, we could conclude that, to protect the topology in Fig. 1(a) under one or dual link failures, the given Ring protection with two spare fiber (i.e., 2 spare fibers × 12 links = 24 spare fibers in total) could protect almost all the physical links except those straddle links (i.e. FC, AH, BE and DG) which require additional rings (e.g., B-D-E-G to protect BE) to be protected and thus consume more spare resources. On the other hand, a 3-regular and 3-edge connected graph based protection structure (a cube-like Polyhedron) is provided in Fig. 1(c), which could provide more recovery capability with same spare capacity (SC): (1) could protect four straddle links while Ring could not; (2) could protect more types of random multi-link failures, as shown in Table 1. First, one notion is defined as following to evaluate the recovery capability of a protection structure.

Fig. 1 (a) A 8 nodes complete graph to be protected; (b) Ring protection (6 cycles in each surface where 3 in the rear are omitted here); (c) 3-regular and 3-edge connected graph protection.

Download Full Size | PDF

Table 1. Recovery Capability Comparison

View Table | View all tables in this article

Definition 2: Failure Recovery Probability (FRP) FRP is the ratio of failure scenarios that could be recovered by the protection structure and all possible failure scenarios.

Specifically, the FRP in Table 1 is calculated as following. For example, for random 3 link failures of topology in Fig. 1(a), there are $(\begin{matrix} 28 \\ 3 \end{matrix})$ = 3276 possible failures in total. With 2 spare fiber allocated for each on-(protection) structure link, the 3-regular and 3-edge connected protection could not provide 100% recovery of 8 + 96( = 8 × 3 × 4) + 48( = 12 × 4) + 12(3 × 4) = 164 failure scenarios, where 8 is the failures of 3 on-structure links associated with each node (e.g., AB, AD and AF), 96 is the failure scenarios of two on-structure link and one span link associated one node (e.g., AB, AD and AG), 48 is the 3 link failures of every 4 four links correlated with each on-structure link (e.g., AB, AD and FG) and 12 is failure scenarios that leave a bridge connect two cube faces (e.g., AB, CD and EH). So, its FRP under 3 random link failures is (3276-164)/3276≈95%, while the FRP of Ring is 54%. Also, its FRP could be improved with sufficient SC which implies that the existing of those failed protected failure scenarios are caused by two reasons: it could not be protected with the given protection structure or there is no sufficient SC.

With the discussion above, we could observe that k-regular and k-edge connected structure could provide higher recovery capability than Ring against multi-link failures and achieve better resource efficiency. Furthermore, we would formally prove this observation through investigating the lower bound of spare resource required by link-restorable optical network and that of k-regular and k-edge connected protection structure. Now, the definition of Pre-Configured Polyhedron (PCP) protection is given as following.

Definition 2: Pre-Configured Polyhedron Protection For a given physical topology, if a k-regular and k-edge connected subgraph could be constructed on it and all the links are marked as on-structure links or span links, this k-regular and k-edge connected subgraph is referred as a Pre-Configured Polyhedron protection.

3.2 Analysis of spare resources required for link-restorable optical mesh network and k-regular and k-edge connected protection structure

(a) Lower bound of spare resource required by link-restorable optical mesh networks

A link-restorable mesh network is modeled with graph G = (V,E), each link associated with working capacity w_ij, and SC s_ij with ∀ij∈E. The Spare Capacity Efficiency (SCE or logical redundancy defined in [20]) is employed to indicate the spare resource utilization efficiency, which is the ratio of the total spare capacity to the total working capacity which could be properly protected in the physical network. In a worst-case scenario, all failed links are incident to one node i, and {e_i₁…e_if } are failed links where f denotes the number of failed links_. While d_i is the degree of node i, we should have f<d_i (otherwise this failure could not be recovered with 100% guarantee). Also, it is necessary that the surviving links have enough SC to carry the interrupted working capacity, i.e.:

\sum_{1 \leq x \leq f} w_{i x} \leq \sum_{\begin{array}{l} i y \in E_{i} \\ y \neq 1, ..., f \end{array}} s_{i y}

\sum_{1 \leq x \leq f} c_{i x} \leq \sum_{i y \in E_{i}} s_{i y}

\frac{1}{2} | V | \times f \times c \leq \sum_{i j \in E} s_{i j}

SCE = \frac{\sum_{i j \in E} s_{i j}}{\sum_{i j \in E} w_{i j}} = \frac{\sum_{i j \in E} s_{i j}}{| E | c - \sum_{i j \in E} s_{i j}} \geq \frac{\frac{1}{2} | V | \times f \times c}{| E | c - \frac{1}{2} | V | \times f \times c} = \frac{f}{\bar{d} - f}

where E_i is the edge set incident to node i,

\bar{d}

is the average node degree, c is the total capacity in each link (we assume every link have same working capacity).

In the ideal case, every node reserves the same SC independent of its degree. Summing Eq. (2) over all nodes yields and we have Eq. (3), then we get the lower bound of the SCE under the multi-link failures as Eq. (4): $f / (\bar{d} - f)$ .

(b) Upper bound of SCE of k-regular and k-edge connected protection structures

The expected number of links (in an N* nodes network), $\bar{Ε}$ , deriving protection from an N nodes k-regular and k-edge connected protection structure could be estimated as Eq. (5):

\bar{Ε} = k \times N / 2 + [(\begin{matrix} N \\ 2 \end{matrix}) - k \times N / 2] \times k \times p_{e}

p_{e} = \frac{\bar{d} \times N^{*} / 2 - k \times N / 2}{(\begin{matrix} N^{*} \\ 2 \end{matrix}) - k \times N / 2}

where k × N/2 is the number of the links in the protection structure, and the P_e is the probability that a link exists between any two nodes on the network to be protected excluding those already present to form the protection structure. On average in a network of N* nodes that is stipulated to have average node degree

\bar{d}

, it follows Eq. (6), where k = f + 1 suggests the largest working capacity in span link is f + 1 times the spare capacity in on-structure link. Thus, substituting P_e, dividing by the number of spare links used in the polyhedron, we get the SCE of k-regular and k-edge connected protection structure as Eq. (7):

\underset{N \to N *}{\lim SCE} = \frac{f \times k \times N / 2}{\bar{Ε}} = \frac{f}{\bar{d} - f}

Accordingly, we can find that k-regular and k-edge connected structure is the optimal protection structure against multi-link failures, since it reaches lower bound on SCE of link-restorable mesh networks.

In a word, the proposed k-regular and k-edge connected graph based protection structure has following advantages: 1) better spare capacity efficiency; 2) robustness to survive multiple link failures; 3) able to handle both sequential failures and simultaneous failures.

3.3 Some designing concerns of PCP

In this section, some PCP designing concerns are analyzed, and it is shown that the size of PCP structure and the selection of k would have impact on its SCE and recovery capability, which is important to survivable network design. Following are some notions will be used and two observations to be discussed in details.

Definition 3: Link Spare Ratio (LSR, α) LSR is the ratio of spare capacity allocated for each link of protection structure and the working capacity protected in each link.
Observation 1: FRP performance is closely related with the value of SCE, as well as the failure scenario considered and the link failure probability.
Observation 2: For PCP protection structures with same connectivity (i.e., k), generally larger size protection structure could protect link failures more efficiently. On the other hand, for PCP structures with same size, proper k would lead to higher SCE.

Specifically, three PCP structures are introduced in Fig. 2, i.e., 6 nodes 3-regular and 3-edge connected structure (6 3&3) (Fig. 2(a)), 8 nodes 3&3 (Fig. 2(b)) and 8 nodes 4&4 (Fig. 2(c)), and they are employed to protect completed connected physical topologies of same size respectively. In Fig. 3, FRP of a set number of failures in a 8 nodes completed connected topology (as in Fig. 1(a)) are shown as function of LSR and it follows worst case design. For example, when random one on-structure link failure happens, upstream node of the failed fiber could split the traffic on other two associated alive fibers, so if LSR≥0.5, we could provide 100% failure recovery guarantee. Otherwise, FRP is linearly increased with LSR. In the same way, for FRP₃ in Fig. 3(c), we could provide at most 96% failure recovery, and for recoverable failures, the worst case is LSR = 3. And formally, FRP function of LSR for probabilistic link failures is formulated as following. The basic idea is weighted sum of FRP for all possible multi-link failures, as well as considering the recovery capability of k-regular and k-edge connected structure itself. Following are some notions to be used.

Fig. 2 Different PCP structures.

Download Full Size | PDF

Fig. 3 FRP vs. LSR in different link failure scenarios.

Download Full Size | PDF

$p_{l}$ is the failure probability of a link;
$P (f)$ is the probability of random f link failures happening;
$C_{| E |}^{f}$ is all possible f link failures scenarios among $| E |$ links;
${\bar{R}}_{f}$ is the amount of failure scenarios that could not be recovered among $C_{| E |}^{f}$ ;
$P_{n} (f)$ is normalized $P (f)$ among all possible failures;
$P_{r} (f)$ is the recovery probability of random f link failures.

For a network G(N, E), we have:

\begin{array}{l} P (f) = C_{| E |}^{f} (^{p_{l}) f} (^{1 - p_{l}) | E | - f}, where 0 \leq f \leq | E | - | N | + 1 \\ P_{n} (f) = \frac{P (f)}{\sum_{x \in F} P (x)}, where F is set of all possible f \\ P_{r} (f) = \frac{C_{| E |}^{f} - {\bar{R}}_{f}}{C_{| E |}^{f}} \end{array}

FRP for probabilistic multi-link failures is defined as function of LSR:

F R P (α) = P_{n} (1) F R P_{1} (α) + ... + P_{n} (f) F R P_{f} (α)

where

{FRP}_{f} (α)

is the FRP function of LSR for random f link failures and its formulation for k-regular and k-edge connected structure will be elaborated in next section.

It is worth noting that, k-regular and k-edge connected structure could recover at most $| E | - | N | + 1$ link failures and normalized failure probability is employed since only the recoverable failure events are considered. Also, considering protecting those span links would not change the FRP function since it follows the worst case design.

Finally, FRP of these three structures are evaluated in Fig. 4 as a function of SCE, as well the impact of link failure probability. First, it shows that, generally, 8 nodes 3-regular and 3-edge connected structure has better FRP than 6 nodes 3-regular and 3-edge connected structure when they have same SCE and $p_{l}$ , which partially is the result of fact that 6 nodes 3-regular and 3-edge connected structure protects fewer links, i.e., at most 15 links (including span links), while 3-regular and 3-edge connected structure could protect at most 28 links. Furthermore, FRP performance of 8 nodes 3-regular and 3-edge connected structure and 8 nodes 4-regular and 4-edge connected structure take turns to lead, but 8 nodes 3-regular and 3-edge connected structure performs better when FRP is required larger than 95%. This is because that, although 8 nodes 4-regular and 4-edge connected structure could protect more links through including 4 more links in its protection structure, it would have lower LSR when these two structures have same SCE. Regarding to the impact of link failure probability, Fig. 4 reveals that larger link failure probability would lead to higher SCE requirement to achieve the same FRP.

Fig. 4 FRP vs. SCE for probabilistic link failure.

Download Full Size | PDF

4. Heuristics algorithm

4.1 Polyhedron structure construction

Regarding to the PCP structure construction, one method is to extend a cycle to a k-edge-connected graph which has been studied by Mader in 1972 on the connectivity of circulant graphs:

Theorem 4.1 (Mader in 1972) [19,21]:
- (1) The circulant graph C_n(l₁, l₂, ..., l_k) with n nodes, degree k, and links from vertex i to vertices i + l_j (l_j is positive integer) for j in [1…k] is connected if and only if gcd(n, l₁, l₂, ..., l_k) = 1. Circulant graph is a graph the n vertices of which can be numbered from 0 to n–1 in such a way that, if some two vertices numbered x and y are adjacent, then every two vertices numbered z and (z−x + y) mod n are adjacent.
- (2) Every connected point-symmetric (vertex transitive) graph of degree k has edge-connectivity k.

Accordingly, the PCP could be designed as following. After finding a cycle in the physical topology, selecting specified links/paths in the physical topology and adding it to the cycle to form a circulant graph, which leaves each node pair in cycle separated by l_j nodes (where $l_{j} \leq ⌊ | V | / 2 - 1 ⌋$ [16]). If the amount of links/paths added for each node pair is k-2, this circulant graph is a k-regular and k-edge connected structure.

However, it is generally impracticable to construct one huge PCP protection structure for a given physical topology, because it needs higher graph connectivity and would lead to longer recovery path, no matter to say its computation complexity (e.g., Hamilton cycle searching has high computation and time complexity even for a large planar graph). So, two compromise approaches are introduced, graph abstraction based graph decomposing approach and a remediation approach to form a holistic polyhedron.

First, in graph decomposing approach, separated and small PCP structures are constructed first and then abstract or contract them as one node to be part of other protection structure, by merging nodes in PCP into one node and keep their connections with nodes not included in this subgraph as shown in Fig. 5(a). This is helpful to provide further protection for links belonging to any single structure when their failure could not be recovered in their own PCP, as well as reduce the computation complexity.

Fig. 5 (a) Graph Abstraction; (b) A protection structure with remediation approach.

Download Full Size | PDF

Also, another problem in polyhedron construction approach is that such a k-regular and k-edge connected subgraph does not always exist (i.e., some specific links do not exist in the physical topology). In order to fix this problem, a link disjoint alternative path is found as remediation. For example in Fig. 5(b), to form an 8 nodes 3-regular and 3-edge connected structure, a path d-i-h is employed to connect d and h, and to provide same survivability for link d-i and h-i, i is connected with other nodes in this structure (e.g., f).

With the discussion above, a heuristic algorithm is proposed, and the basic idea is: (1) find a highly connected subgraph of the physical topology and the longest Hamiltonian cycle within it; (2) then construct PCP based on this cycle by gradually adding links/paths to it with remediation approach and mark the on-structure and span links of this PCP; (3) abstract/contract this subgraph; (4) return to step (1) and continue these approaches till all the links are marked which indicates that all the physical links are protected. Specifically, cluster coefficient is employed to evaluate the connectivity of a subgraph. For a given physical topology, its adjacency matrix (a_ij) and local cluster coefficient (cc_i), $s u m_{c c} (n_{i})$ is calculated with Eq. (10) which is used to show the connectivity of n_i and its adjacent nodes. The local clustering coefficient of a node in a graph, which quantifies how close its neighbors are to being a clique (complete graph), is computed as the proportion of links between the nodes within its neighborhood divided by the number of links that could possibly exist between them. And the principle of longest Hamiltonian cycle searching algorithm utilized in this paper is: (1) one tree T of the physical graph G is found and transformed to a weighted complete graph K_T, where the vertices of K_T are nodes of T and weight (i; j) is the distance between i, j in T; (2) the node S has minimal separability over all vertices in T is selected (referred as the centroid of T) according the size of maximum component of T-{S}; (3) S-coloring of T, which is a coloring such that the color of each node v≠S corresponds to the subtree of S containing v and the color of S is S itself; (4) two adjacent nodes in G are selected and an alternating sequence (colored sequences in which adjacent elements are of different colors) of these nodes is the longest Hamiltonian cycle if the adjacent node in this sequence is connected in G.

| \begin{matrix} a_{12} & ... & a_{1 | N |} \\ ⋮ & a_{i j} & ⋮ \\ a_{| N | 1} & \dots & a_{| N | | N |} \end{matrix} | \times | \begin{matrix} c c_{1} \\ ⋮ \\ c c_{| N |} \end{matrix} | = | \begin{matrix} s u m_{c c} (n_{1}) \\ ⋮ \\ s u m_{c c} (n_{| N |}) \end{matrix} |

The proposed Polyhedron Structure Construction algorithm is described as following:

Algorithm 1 Polyhedron Structure Construction
Input: Adjacency matrix of physical topology {a_ij}; βis a parameter defined to limit the size of PCP structure; λ is weight coefficient of cc_i; $\bar{w}$ ; $γ$ is a coefficient related with k;
Output: Polyhedron protection structure;
	1: Sorting the nodes according their cc_i = λcc_i + sum_cc(n_i); E = Ø; N* = Ø;
	2: while E* ¹ E do
	3: for node n_i with max cc_i* do
	4: N* = N∪n_i*
	for each n_j with a_ij or a_ji = 1 do //cover the adjacent nodes of n_i and their adjacent nodes
		N* = N∪n_j*
		for n_k with a_jk or a_kj = 1 do
			N* = N∪n_k*
			for $\| N^{} \| < β \| N \|$ do // to control the amount of involving nodes*
				for each $n_{x} \in {\bar{N}}^{}$ do // ${\bar{N}}^{} = N - N^{*}$
					for each $n_{y} \in N^{*}$ do
						$w_{x} = \sum a_{x y}$
					end for
				if $w_{x} > w$ then $n = n_{x}$ , $w = w_{x}$ // initialization w = 0;
				end for
				if $w_{x} < \bar{w}$ then continue
					else $N^{} = N^{} \cup n_{x}$
				end for
			end for
		end for
	end for
	find the longest Hamilton cycle G^† (N^†, E^†) in subgraph G* with nÎN* and edges between them;
	construct the polyhedron structure by adding links separated by $γ \leq ⌊ \| N^{†} \| / 2 - 1 ⌋$ nodes with the *remediation approach*;
			E* = E^†∪{e_ij and e_ji\|a_ij or a_ji = 1, where i, jÎN^†}
5: *run graph abstraction;*
6: go to step 2;
7: end while.

4.2 Spare capacity allocation

With the protection structure generated by Algorithm 1, spare capacity needed to satisfy specified FRP should be properly dimensioned. Also, it is worth noting that the node/link symmetry of k-regular and k-edge connected based PCP makes the formulation of FRP as a function of LSR feasible, and it is also related with the node size of the protection structure, and link failure probability.

F R P_{f} (α) = P_{r} (f) \times {\begin{array}{l} α \frac{k - f}{f}, f < k \\ \frac{α (2 k - 2 - f)}{f}, k \leq f \leq 2 k - 3 \\ \frac{α}{f}, 2 k - 3 < f \leq ⌊ (\frac{k}{2} - 1) | N | + 1 ⌋ \\ 0, f > ⌊ (\frac{k}{2} - 1) | N | + 1 ⌋ \end{array} where k \geq 3

where

P_{r} (f)

could be got from Eq. (8) where

{\bar{R}}_{f}

could be obtained through extending cutset of each protection structure/graph. Particularly, for each protection structure, all its cutsets could be obtained with ring sum of its fundamental cutset [22], and then extend those cutsets with smaller than f links (e.g., x<f) through randomly including other f-x links in the protection structure, as well as removing the same one.

According Algorithm 1, the PCP protection structure of a graph is generated gradually with graph abstraction approach, so it is possible that a link is involved in two or more different PCPs and generally this link would be assigned with sum of spare capacity needed for each sub-PCP it belongs to. However, there is a slim possibility that all these spare resources are needed simultaneously which happens when simultaneous failures occur in all these PCPs. In this case, SC sharing between structures which involve same links could be considered to further save the spare resources needed. So, it is reasonable that we allocate maximum spare capacity needed by these sub-PCPs to this link in this paper, instead of sum of them. In other words, PCP structure with SC sharing would have higher LSR than that without SC sharing if they have same spare capacity. Performance of PCP with or without SC sharing is evaluated in the following section.

4.3 Recovery approach in PCP after link failures

Regarding to the reconfiguration strategy of node after failure, new protocol should be developed since end nodes of the failed link could direct the traffic onto several different routes, instead of a simple APS to the opposite direction in the ring protection. What’s more, the route selection in PCP is critical if more than one failures are occurred in single PCP, because resource collision would happen if their recovery routes are not properly designed.

However, a direct and greedy approach to implement the recovery in PCP is optical broadcasting the traffic of failed link to those survived on-structure links in the end node and the intermediate node, till it reaches the destination. This broadcasting based recovery approach could achieve the same time efficiency as APS in ring protection. But obviously, it requires optical broadcasting component in the node (could be implemented with ROADM including splitter and wavelength selective switch) and would degrade the recovery probability under simultaneous multi-link failures.

An alternative approach is dynamic path selection in the end node based on a locally maintained PCP structure connectivity and spare capacity utilization information. This approach is resource efficient and could provide sufficient recovery capability if there are enough spare resources. Also, its recovery path computation is extremely simple than the dynamic restoration approach since it is implemented within the PCP structure. Specifically, in this paper, this approach is employed to achieve better failure recovery efficiency.

More specifically, the multi-link failures may happen sequentially (sequential failures) or simultaneously (simultaneous failures). For sequential failures, the available shortest path is computed for each failure employing Dijkstra's algorithm, which implies that the recovery path of current failure may be affected by following failures. This would cause more unsuccessful recovery due to insufficient resources.

For simultaneous failures, following algorithm in Fig. 6 is implemented to achieve better spare resource utilization and higher recovery rate. In this algorithm, shortest recovery path is preferred (if there are multiple, it is randomly selected), and if spare resources needed by the recovery path on some on-structure links exceed the resource allocated, possible alternative path would be iteratively searched till no such route exists.

Fig. 6 Flowchart of algorithm for simultaneous failure recovery.

Download Full Size | PDF

5. Performance evaluation and discussion

In this section, the performance of the proposed protection scheme is evaluated and compared with several previous strategies, e.g., ring based protection (Ring, allowing dynamic ring reconfiguration for sequential failures), subgraph protection (Subgraph) schemes. Specifically, SCE, recovery capability (i.e., Recov. Rate) and average recovery path length (ARPL) of proposed scheme are discussed in terms of topology size and link failure probability.

In the simulations, it is assumed that the capacity of each edge is unlimited (i.e., sufficient fibers) and only one working fiber needs to be protected in each link. The physical network topologies used are randomly generated with 80-160 nodes (5 different topologies are evaluated for each size and the average results are recorded) using the GT-ITM tool and the node degree distribution follows exponential distribution with an average of 4. Also, in order to guarantee the connectivity of physical topology under multiple link failures, all the nodes have degree at least 2.

5.1 Spare capacity consuming

With the knowledge that the SC allocated is directly related with the Recov. Rate claimed, so in this section, the spare resources needed to satisfy specific Recov. Rate (i.e., >90% in this simulation) are examined for Ring, Subgraph, as well as PCP. Specifically, ring cover [7] is searched for Ring to make sure every link belongs to at least two cycles.

In Table 2, the amount of SC needed for Ring, Subgraph and PCP against simultaneous failures (Simul.) and sequential failures (Seq.) are presented with $p_{l} = 0.01$ , as well as their resources saving compared to Ring. To compare fairly, we assume that Seq. and Simul. Suffer same amount link failures through the simulation. We can conclude that: (1) PCP has better resources efficiency than other two schemes under simultaneous and sequential failures; or in other words, same recovery guarantee could be achieved by PCP with fewer spare resources; (2) PCP could save more spare resources in larger physical topology (e.g., save 25% in topology with 80 nodes while 36% saving is achieved within 140 nodes topology), while Subgraph has a almost consistent saving (e.g., 11%-14%); (3) in Seq., more spare capacity are needed than that in Simul., which is the result of fact that all failures in Simul. are jointly and optimally recovered while in Seq. the former recovering approach may leave the following failures recovering consume more resources. Also, same results could be observed in Table 3. Except that, it is observed that the higher recovery rate requirement and link failure probability would consume more spare capacity for all protection strategy. However, PCP could save more spare resources under higher recovery rate requirement and link failure probability, which is the result of fact the polyhedron structure could be extended easily to meet the higher recovery requirement.

Table 2. Amount of Spare Capacities Needed for Different Size Physical Topology (Recov. Rate>90%)

View Table | View all tables in this article

Table 3. Amount of Spare Capacities Needed for COST 239

View Table | View all tables in this article

5.2. Recovery capability

In this section, recovery capability of these schemes is investigated in terms of the SCE. Figure 7 implies that Recov. Rate of failures could be improved dramatically with the increase of SC for all protection schemes, and PCP with SC sharing could achieve at most 40% and 30% Recov. Rate improvement comparing with Ring and Subgraph respectively. Also, such performance improvement is sensitive to the value of SCE, e.g., when SCE is 0.5 to 2, PCP achieves higher Recov. Rate improvement. With a given SCE, PCP without considering SC sharing suffers up to 6% Recov. Rate degradation. Figure 7 also shows that allowing dynamic reconfiguration in Ring based protection could raise the Recov. Rate substantially at the cost of higher computation complexity and dynamic control overhead.

Fig. 7 SCE comparison of random Simul. multi-link failure (Polyhedron* is PCP with SC sharing).

Download Full Size | PDF

In addition, as discussed above, networks with higher link failure probability would consume more spare capacity to achieve specific Recov. Rate. Figure 8 depicts that Recov. Rate reduces sharply with the increase of link failure probability, and the higher the link failure probability is, the larger SC increase induced Recov. Rate improvement could achieve, e.g., especially for SCE increasing from 1 to 2 in this case. Specifically, the Recov. Rate decrease sharply when p_l changes from 0.0008 to 0.001 and from 0.008 to 0.01 compared with other cases (e.g., the case from 0.001 to 0.003), which dues to the feature of FRP function. This could be explained by the fact that the required spare resources changes quickly when p_l changes from 0.0008 to 0.001 and from 0.008 to 0.01, but in this simulation, the spare resources are fixed which would lead to a sharp decrease of Recov. Rate.

Fig. 8 Recovery capability of PCP under probability link failures.

Download Full Size | PDF

5.3 Average recovery path length (ARPL)

Intuitively, in Ring and Subgraph, it suffers from longer recovery path which partially lead to higher SC consumption and longer recovery time. The proposed PCP is supposed to tackle this problem through a mesh like protection structure which could provide high connectivity even under multiple link failures. So, in this section, the average recovery path length of these schemes is investigated, as well as the impact of global cluster coefficient (GCC) and size of topology.

Figure 9 shows that the ARPL of all these schemes are reduced with the increase of GCC (the size of topology is 80), which is the result of fact that the increase of GCC improves the topology connectivity. However, the ARPL of PCP is more susceptible to the variation of GCC, while that of other two schemes could not be reduced further especially when GCC>0.5. For PCP itself, GCC contributes more to the reduction of ARPL when GCC<0.4, but SCE plays an important role afterwards.

Fig. 9 ARPL vs. global cluster coefficient.

Download Full Size | PDF

Regarding to the impact of topology size (GCC = 0.4), Fig. 10 indicates that generally the bigger the topology is, the longer the recovery path turns to be. Also, ARPL of PCP increase slowly with topology size, but that of other two schemes increases sharply. This is also because PCP could provide multiple path options for link recovery after failures and generally the shortest one would be chosen.

Fig. 10 ARPL vs. size of topology.

Download Full Size | PDF

At the same time, the average of k in the PCP of different size topologies is presented in Table 4. Generally, the larger the size of topology and its GCC is, the bigger the average of k is. In Fig. 11, for various link failure probabilities, the distribution of k in each sub-PCPs of a 120 nodes topology’s PCP structure is presented. Generally, the sub-PCPs require larger k when the link failure probability is larger, which is mainly because to achieve specified Recov. Rate higher k could improve the survivability of PCP under multi-link failures.

Table 4. Average of k of different size topology (Recov. Rate>90%)

View Table | View all tables in this article

Fig. 11 Distribution of k under different link failure probability.

Download Full Size | PDF

6. Concluding remarks

Protecting network against multi-link failures is important since such failures would cause unbearable traffic interruption. The proposed PCP structure performs better than previously proposed ring based and subgraph protection in terms of improved capacity efficiency and failure robustness (e.g., higher link recovery rate). Moreover, the proposed schemes are helpful for network operators to estimate the vulnerability of their network.

Acknowledgments

The authors would like to thank the anonymous reviewers and the editors for their valuable comments and suggestions which are helpful to improve the paper. Part of this work has been published in OFC 2013. This work is supported partly by the National Basic Research Program of China (973 Program) (Nos. 2010CB328204 and 2010CB328202), the National Natural Science Foundation of China (Nos. 61205058 and 61331008), the Hi-Tech Research and Development Program of China (863 Program) (No.2012AA011302), the Program for New Century Excellent Talents in University (NCET-12-0793), the Beijing Nova Program (No. 2011065), the Beijing Youth Elite Project for Universities (No. YETP0479), and the Fundamental Research Funds for the Central Universities.

References and links

1. L. Guo, J. Cao, H. Yu, and L. Li, “Path-based routing provisioning with mixed shared protection in WDM mesh networks,” J. Lightwave Technol. 24(3), 1129–1141 (2006). [CrossRef]

2. C. Ou, H. Zang, N. K. Singhal, K. Zhu, L. H. Sahasrabuddhe, R. A. MacDonald, and B. Mukherjee, “Subpath Protection for Scalability and Fast Recovery in Optical WDM Mesh Networks,” IEEE J.. Sel. Areas Commun. 22(9), 1859–1875 (2004). [CrossRef]

3. H.-W. Lee, E. Modiano, and K. Lee, “Diverse routing in networks with probabilistic failures,” IEEE/ACM Trans. Networking 18(6), 1895–1907 (2010). [CrossRef]

4. P. Agarwal, A. Efrat, S. Ganjugunte, D. Hay, S. Sankararaman, and G. Zussman, “The resilience of WDM networks to probabilistic geographical failures,” Proc. IEEE INFOCOM1521–1529 (2011).

5. Y. Zhao, X. Li, H. Li, X. Wang, J. Zhang, and S. Huang, “Multi-link faults localization and restoration based on fuzzy fault set for dynamic optical networks,” Opt. Express 21(2), 1496–1511 (2013). [CrossRef] [PubMed]

6. R. Asthana, Y. N. Singh, and W. D. Grover, “p-Cycles: An overview,” IEEE Commun. Surveys Tutorials 12(1), 97–111 (2010). [CrossRef]

7. G. Ellinas, A. G. Hailemariam, and T. E. Stern, “Protection cycles in mesh WDM networks,” IEEE J. Sel. Areas Commun. 18(10), 1924–1937 (2000). [CrossRef]

8. S. Huang, B. Li, B. Guo, J. Zhang, P. Luo, D. Tan, and W. Gu, “Distributed Protocol for Removal of Loop Backs with Asymmetric Digraph Using GMPLS in P-Cycle Based Optical Networks,” IEEE Trans. Commun. 59(2), 541–551 (2011). [CrossRef]

9. J.-S. Li, C.-F. Yang, and J.-H. Chen, “Star-Block Design in Two-Level Survivable Optical Networks,” IEEE/ACM Trans. Networking 19(2), 526–539 (2011). [CrossRef]

10. M. T. Frederick, P. Datta, and A. K. Somani, “Sub-Graph Routing: A generalized fault-tolerant strategy for link failures in WDM Optical Networks,” Comput. Netw. 50(2), 181–199 (2006). [CrossRef]

11. S. Ramasubramanian and A. Chandak, “Dual-Link Failure Resiliency through Backup Link Mutual Exclusion,” IEEE/ACM Trans. Netw. 16(1), 157–169 (2008). [CrossRef]

12. L. Ruan and T. Feng, “A hybrid protection/restoration scheme for two-link failure in WDM mesh networks,” in Proc. IEEE GLOBECOM 1–5 (2010).

13. L. Guo, “Hybrid survivable configuration for optical wavelength-division-multiplexing mesh networks,” Opt. Express 15(3), 834–838 (2007). [CrossRef] [PubMed]

14. D. Oscar, X. Feng, N. Min-Allah, K. Mahmoud, P. Min, K. Samee, and N. Ghani, “Network Survivability for Multiple Probabilistic Failures,” IEEE Commun. Lett. 16(8), 1320–1323 (2012).

15. X. Cheng, X. Shao, and Y. Wang, “Multiple link failure recovery in survivable optical networks,” Photonic Netw. Commun. 14(2), 159–164 (2007). [CrossRef]

16. X. Li, S. Huang, J. Zhang, Y. Zhao, and W. Gu, “k-regular and k-(edge)-connected Protection Structures in Optical Transport Networks,” OFC JW2A.03 (2013).

17. S. Huang, J. Zhang, X. Li, Y. Zhao, and W. Gu, “Pre-configured polyhedron (p-poly) based protection structure against multi-link failures in optical networks,” in CHINACOM 277–283 (2012).

18. F. T. Boesch and J. F. Wang, “Super line connectivity properties of circulant graphs,” SIAM J. Alg. Disc. Math. 7(1), 89–98 (1986). [CrossRef]

19. A. Frank, Connectivity and Network Flows. Chapter 2 of Handbook of Combinatorics. R.L. Graham, M. Grotschel, and L. Lovasz editors. Elsevier and MIT Press (1995).

20. D. A. Schupke, “Analysis of p-Cycle Capacity in WDM Networks,” Photonic Netw. Commun. 9(8), 756–758 (2006).

21. W. Mader, “Eine Eigenschaft der Atome endlicher Graphen,” Arch. Math. 22(1), 257–262 (1971). [CrossRef]

22. B. Guo, S. Huang, P. Luo, H. Huang, J. Zhang, and W. Gu, “Dynamic Survivable Mapping in IP Over WDM Network,” J. Lightwave Technol. 29(9), 1274–1284 (2011). [CrossRef]

Spare Capacity in Each Link of Protection Structure	Protection Schemes	3 random failures
2 spare fiber	3-regular and 3-edge connected	95%
2 spare fiber	Ring	54%
adequate spare capacity	3-regular and 3-edge connected	99.7%
adequate spare capacity	Ring	62%

Size Strategy	80		100		120		140
Size Strategy	Simul.	Seq.	Simul.	Seq.	Simul.	Seq.	Simul.	Seq.
Polyhedron^†	182(25%)*	216(21%)	210	252	244	288	302(36%)	378(34%)
Ring	244	274	304	336	372	426	470	574
Subgraph	217(11%)	240(12%)	260	288	316	370	404(14%)	494(14%)

Recov. Rate Strategy	$p_{l} = 0.01$				$p_{l} = 0.05$
	>85%		>90%		>85%		>90%
	Simul.	Seq.	Simul.	Seq.	Simul.	Seq.	Simul.	Seq.
Polyhedron^†	31(16%)	36(14%)	39(17%)	45(18%)	34(16%)	39(19%)	45(19%)	53(21%)
Ring	37	42	48	55	41	49	56	68
Subgraph	33(9%)	37(10%)	44(8%)	50(9%)	35(13%)	43(12%)	50(11%)	59(12%)

Size GCC	80	120	160
0.2	3.3	3.5	3.6
0.3	3.5	3.8	4
0.4	3.6	4	4.2
0.5	3.7	4.1	4.2
0.6	3.8	4.3	4.5

Spare Capacity in Each Link of Protection Structure	Protection Schemes	3 random failures
2 spare fiber	3-regular and 3-edge connected	95%
2 spare fiber	Ring	54%
adequate spare capacity	3-regular and 3-edge connected	99.7%
adequate spare capacity	Ring	62%

Pre-configured polyhedron based protection against multi-link failures in optical mesh networks

Abstract

1. Introduction

2. Related work

3. k-regular and k-connected structures and pre-configured polyhedron

3.1 k-regular and k-edge connected graph and PCP schemes

3.2 Analysis of spare resources required for link-restorable optical mesh network and k-regular and k-edge connected protection structure

(a) Lower bound of spare resource required by link-restorable optical mesh networks

(b) Upper bound of SCE of k-regular and k-edge connected protection structures

3.3 Some designing concerns of PCP

4. Heuristics algorithm

4.1 Polyhedron structure construction

4.2 Spare capacity allocation

4.3 Recovery approach in PCP after link failures

5. Performance evaluation and discussion

5.1 Spare capacity consuming

5.2. Recovery capability

5.3 Average recovery path length (ARPL)

6. Concluding remarks

Acknowledgments

References and links

Cited By

Figures (11)

Tables (4)

Equations (11)

Optics Express