Troubleshooting for Network Operators - Van Van Tong - E-Book

Troubleshooting for Network Operators E-Book

Van Van Tong

Description

Nowadays, the Internet is becoming more and more complex due to an ever-increasing number of network devices, various multimedia services and a prevalence of encrypted traffic. In this context, this book presents a novel, efficient multi-modular troubleshooting architecture to overcome limitations related to encrypted traffic and high time complexity. This architecture contains five main modules: data collection, anomaly detection, temporary remediation, root cause analysis and definitive remediation. Data collection has two submodules: parameter measurement and traffic classification. This architecture is implemented and validated in a software-defined networking (SDN) environment.

Page count: 249

Year of publication: 2023

Table of Contents

Cover

Table of Contents

Dedication Page

Title Page

Copyright Page

Preface

Introduction

1 State of the Art on Network Troubleshooting

1.1. Network troubleshooting

1.2. Background on encryption protocols

1.3. Drawbacks of troubleshooting with encrypted traffic

1.4. Conclusion

2 Novel Global Troubleshooting Framework for Encrypted Traffic

2.1. Novel network troubleshooting architecture for encrypted traffic

2.2. Proof of concept of novel troubleshooting architecture in SDN

2.3. Data collection

2.4. Troubleshooting dataset

2.5. Conclusion

3 Traffic Classification: Novel QUIC Traffic Classifier Based on Convolutional Neural Network

3.1. Introduction

3.2. Background

3.3. Traffic classification approaches

3.4. Novel traffic classification method for QUIC traffic

3.5. Experimental results

3.6. Conclusion

4 Anomaly Detection

4.1. Introduction

4.2. Anomaly detection approaches

4.3. Anomaly detection approach using machine learning

4.4. Experimental results

4.5. Conclusion

5 Temporary Remediation: SDN-based Application-aware Segment Routing for Large-scale Networks

5.1. Introduction

5.2. Application-aware routing mechanisms

5.3. Adaptive segment routing mechanism for encrypted traffic

5.4. Experimental results

5.5. Conclusion

6 Root Cause Analysis and Definitive Remediation

6.1. Root cause analysis: machine learning based root cause analysis for SDN network

6.2. Definitive remediation: adaptive QUIC BBR algorithm using reinforcement learning for dynamic networks

Conclusions and Prospects

References

Index

Other titles from ISTE in Networks and Telecommunications

End User License Agreement

List of Tables

Chapter 2

Table 2.1. Comparison between NetFlow, sFlow and OpenFlow-based monitoring a...

Table 2.2. Existing troubleshooting dataset

Table 2.3. Considered network conditions

Table 2.4. Datasets for root cause analysis.

Table 2.5. Classification dataset

Chapter 3

Table 3.1. Registered port numbers by IANA for several applications

Table 3.2. Signatures for several P2P applications

Table 3.3. Dataset specification

Table 3.4. Performance metrics of ML algorithms in the first stage of classi...

Table 3.5. Time complexity of ML algorithms in the first stage of classifica...

Chapter 4

Table 4.1. Considered network conditions

Table 4.2. Anomaly detection datasets

Table 4.3. Performance metrics of ML algorithms in anomaly detection for the...

Table 4.4. Time complexity of ML algorithms in anomaly detection for the dat...

Table 4.5. Performance metrics of ML algorithms in anomaly detection for the...

Table 4.6. Time complexity of ML algorithms in anomaly detection for the dat...

Chapter 5

Table 5.1. Configuration of the PC used in the testbed

Table 5.2. Scenarios

Table 5.3. Summarization of average optimal MOS, median and 95% confidence i...

Table 5.4. Summarization of average overhead in the SR mechanisms

Chapter 6

Table 6.1. Considered network conditions

Table 6.2. Troubleshooting datasets

Table 6.3. Performance metrics of the considered ML algorithms for the datas...

Table 6.4. Time complexity of ML algorithms in RCA for the dataset in a stat...

Table 6.5. F1-score of two feature sets in the RCA for the dataset in a stat...

Table 6.6. Performance metrics of the considered ML algorithms for the datas...

Table 6.7. Time complexity of ML algorithms in RCA for the dataset in a dyna...

Table 6.8. Some important results of the considered congestion control algor...

List of Illustrations

Chapter 1

Figure 1.1. Unidirectional link discovery in LLDP. For a color version of this...

Figure 1.2. Overall traditional troubleshooting architecture

Figure 1.3. The global growth of the encrypted traffic e-Security (n.d.). For ...

Figure 1.4. Difference between TCP + TLS and QUIC architecture

Figure 1.5. Comparison of QUIC packet format with TCP + TLS. For a color versi...

Figure 1.6. Comparison of connection establishment between QUIC and TCP + TLS....

Figure 1.7. Multiplexing comparison between HTTP1.1 and HTTP/2 over TCP and QU...

Figure 1.8. IPsec packet structure. For a color version of this figure, see ww...

Figure 1.9. TLS record packet

Chapter 2

Figure 2.1. The novel troubleshooting architecture in the context of encrypted...

Figure 2.2. The novel troubleshooting framework in the SDN environment. For a ...

Figure 2.3. NetFlow architecture (Suárez-Varela and Barlet-Ros 2017). For a co...

Figure 2.4. sFlow architecture. For a color version of this figure, see www.is...

Figure 2.5. The link discovery in LLDP. For a color version of this figure, se...

Chapter 3

Figure 3.1. Byte in payload of QUIC packets for different applications. For a ...

Figure 3.2. Percentage of small and large packets in flows. For a color versio...

Figure 3.3. Novel traffic classification approach for QUIC traffic

Figure 3.4. Macro-averaging precision, macro-averaging recall and macro-averag...

Figure 3.5. Macro-averaging precision, macro-averaging recall and macro-averag...

Figure 3.6. Precision, recall and F1-score of the traffic classification metho...

Chapter 4

Figure 4.1. Overall architecture of ML-based anomaly detection mechanism in th...

Figure 4.2. The ML-based anomaly detection method.

Chapter 5

Figure 5.1. The SDN-based adaptive SR framework issued from the global trouble...

Figure 5.2. The novel traffic classification approach for encrypted traffic

Figure 5.3. The QoE estimator for encrypted traffic

Figure 5.4. The RL-based SR mechanism

Figure 5.5. Average MOS score and standard deviation of three selection algori...

Figure 5.6. The MOS score of three SR mechanisms. For a color version of this ...

Figure 5.7. The average CPU usage and overhead of three SR mechanisms. For a c...

Chapter 6

Figure 6.1. Overall architecture of ML-based RCA in SDN environment. For a col...

Figure 6.2. The ML-based RCA method

Figure 6.3. The accuracy against the number of features in the feature selecti...

Figure 6.4. Congestion control operating point: delivery rate and RTT against ...

Figure 6.5. Adaptive BBR algorithm. For a color version of this figure, see ww...

Figure 6.6. Number of network conditions in which each congestion control algo...

Figure 6.7. Average reward and standard deviation of A-BBR and benchmarks. For...

Figure 6.8. Fairness of A-BBR and benchmarks in dynamic network conditions. Fo...

To my sweet love Ly Ly and beloved family
Van Van Tong

To all my family, my wife, my dear children Sarah and Sinan, with all my love and infinite gratitude for your presence in my life
Sami Souihi

To my dearest family, thank you for being my constant source of love and support throughout my life
Hai-Anh Tran

To my beloved, sublime and thoughtful daughter, Kenza Insafe, on her 20th birthday this year
Abdelhamid Mellouk

New Generation Networks Set

coordinated by Abdelhamid Mellouk

Volume 3

Troubleshooting for Network Operators

The Road to a New Paradigm with Encrypted Traffic

Van Van Tong, Sami Souihi, Hai-Anh Tran, Abdelhamid Mellouk

First published 2023 in Great Britain and the United States by ISTE Ltd and John Wiley & Sons, Inc.

Apart from any fair dealing for the purposes of research or private study, or criticism or review, as permitted under the Copyright, Designs and Patents Act 1988, this publication may only be reproduced, stored or transmitted, in any form or by any means, with the prior permission in writing of the publishers, or in the case of reprographic reproduction in accordance with the terms and licenses issued by the CLA. Enquiries concerning reproduction outside these terms should be sent to the publishers at the undermentioned address:

ISTE Ltd 27-37 St George’s Road London SW19 4EU UK www.iste.co.uk

John Wiley & Sons, Inc. 111 River Street Hoboken, NJ 07030 USA www.wiley.com

© ISTE Ltd 2023

The rights of Van Van Tong, Sami Souihi, Hai-Anh Tran and Abdelhamid Mellouk to be identified as the authors of this work have been asserted by them in accordance with the Copyright, Designs and Patents Act 1988.

Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s), contributor(s) or editor(s) and do not necessarily reflect the views of ISTE Group.

Library of Congress Control Number: 2023938907

British Library Cataloguing-in-Publication Data
A CIP record for this book is available from the British Library
ISBN 978-1-78630-867-2

Preface

The Internet is continuously growing in terms of size and usage. This expansion results in an increase in complexity, emphasized by a prevalence of encrypted traffic. This leads to complex problems in the network, with many negative economic impacts for network operators. In the past, network troubleshooting solutions were extensively studied to troubleshoot network problems and deal with them completely. However, conventional solutions are unmanageable because of their high time consumption, their inadaptability to encrypted traffic, and shortcomings in scalability, overhead, automation and efficiency.

In this book, we provide insight into network troubleshooting in the context of encrypted traffic, and present a solution to these problems, which consists of an adaptive mechanism. The objective is to detect network anomalies, diagnose their root causes, and address them definitively. The book highlights the necessity of network troubleshooting and its economic impacts on network operators. Additionally, it provides a state of the art on network problems (e.g. rule failure, link failure and so on) as well as existing approaches. Moreover, the book thoroughly analyzes encryption protocols before analyzing the limitations of conventional troubleshooting solutions. Furthermore, it contains a novel troubleshooting architecture in the context of encrypted traffic and presents a proof of concept of this architecture in the software-defined networking (SDN) environment. This architecture contains five main modules: data collection, anomaly detection, temporary remediation, root cause analysis and definitive remediation.

Giving a complete bibliography and a historical account of the research that led to the present form of the subject would be impossible. Thus, some topics are discussed in less detail than others. The choices made reflect, in part, personal taste and expertise and, in part, a preference for up-and-coming research and recent developments in network troubleshooting for network operators.

This book is a preliminary study of network troubleshooting in the context of encrypted traffic. We hope that it will inspire other researchers and network operators on the road of network troubleshooting, and that you will enjoy reading it and find many helpful ideas and overviews for your own studies.

Van VAN TONG

Sami SOUIHI

Hai-Anh TRAN

Abdelhamid MELLOUK

June 2023

Introduction

“We are all now connected by the Internet, like neurons in a giant brain”

Stephen Hawking

Network troubleshooting plays an important part in network operation. Computer networks have rapidly evolved along with the significant growth of the Internet of Things (IoT), which increases not only network coverage but also network complexity, and with it the risk of problems such as server disruptions, cyberattacks and link failures. Therefore, network troubleshooting, which is the process of detecting an anomaly, identifying its root causes and implementing remediation approaches to solve it definitively, has been studied thoroughly by the research community (Fonseca and Mota 2017; Yu et al. 2018; Cherrared et al. 2019).

Regarding server disruptions, Table I.1 lists the total downtime and the corresponding financial losses for several service providers (Gagnaire et al. 2012). Losses ranged from $34,000 for YouTube (0.17 h of downtime) to over $6,700,000 for PayPal (30.2 h) because of disruptions on their cloud servers due to failures.

Besides, many cloud services nowadays are disrupted by cyberattacks (e.g. distributed denial of service (DDoS) attacks, etc.). A DDoS attack is a kind of cyberattack designed to overload and disrupt network services by exhausting them with access requests. In February 2020, Amazon announced that their AWS Shield services mitigated the largest recorded DDoS attack with 2.3 Tbps of network traffic (Felter 2021). This attack, which caused three days of “elevated threat” for AWS Shield services, was carried out using hijacked Connection-less Lightweight Directory Access Protocol (CLDAP) web servers.

Table I.1. Downtime of service providers and their economic impacts

Service provider    Total downtime (h)    Cost (USD)
YouTube             0.17                  34,000
CloudFlare          1                     168,000
Zoho                33.5                  600,000
Cisco               5.33                  1,066,000
eBay                6.25                  1,406,250
Facebook            8.5                   1,700,000
PayPal              30.2                  6,795,000
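As a small worked example, the figures in Table I.1 can be normalized to a cost per hour of downtime. The sketch below only divides the reported cost by the reported downtime; the variable and function names are illustrative and do not come from the cited study.

```python
# Values copied from Table I.1: (total downtime in hours, cost in USD).
outages = {
    "YouTube": (0.17, 34_000),
    "CloudFlare": (1, 168_000),
    "Zoho": (33.5, 600_000),
    "Cisco": (5.33, 1_066_000),
    "eBay": (6.25, 1_406_250),
    "Facebook": (8.5, 1_700_000),
    "PayPal": (30.2, 6_795_000),
}


def cost_per_hour(downtime_h: float, cost_usd: float) -> float:
    """Average loss per hour of downtime."""
    return cost_usd / downtime_h


# Round to whole dollars to absorb floating-point noise.
rates = {name: round(cost_per_hour(h, c)) for name, (h, c) in outages.items()}

for name, rate in sorted(rates.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<11}{rate:>10,} USD/h")
```

Under this normalization, five of the seven providers come out at 200,000 or 225,000 USD per hour, CloudFlare at 168,000 USD per hour and Zoho at roughly 17,910 USD per hour.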

To deal with network problems, network troubleshooting has been extensively studied for over 30 years. However, conventional solutions are not effective because of high time consumption, inadaptability for encrypted traffic, overheads, scalability, etc.

Regarding the high time consumption, root cause analysis and remediation can take from 1 h to more than 5 h depending on the status of anomalies in the network (Zeng et al. 2012b). As a result, network systems can suffer from negative impacts (e.g. high latency, high loss, etc.), which can cause frequent connection interruptions. Depending on the nature of the anomaly, there are two possible cases. If the root cause of the anomaly is identified and solved quickly, temporary remediation is not necessary. Otherwise, temporary remediation is required to guarantee the availability of the network. It is therefore essential to design network troubleshooting frameworks that guarantee the network’s availability during root cause analysis and definitive remediation.

As for the inadaptability, traditional troubleshooting mechanisms were not designed for encrypted traffic. However, many service providers today encrypt network traffic to prevent attackers from inspecting data packets for illegal activities. Concretely, 80% of web traffic was encrypted by 2019 compared to 40% by 2016 (Cisco 2021a). From the point of view of network operators (NOs), information in the packets, such as the sequence number, acknowledgment number and payload signatures, is hidden. This introduces several limitations related to network performance monitoring approaches (e.g. estimation of quality of experience, application identification, etc.) and intrusion detection systems (Kühlewind et al. 2018; Moriarty and Morton 2018). Therefore, encrypted traffic creates many obstacles for troubleshooting, particularly in data collection (e.g. collecting performance metrics, etc.) and in remediation approaches that rely on deep packet inspection (e.g. application-aware traffic engineering, signature-based intrusion detection systems, etc.).

Concerning the overhead, carrying out data collection without influencing the network performance is a challenging task. In fact, the continuous monitoring of network data and network traffic can generate a huge overhead on the network and thus influence the network performance. Many network monitoring proposals have developed different approaches to balance data accuracy and monitoring overhead.

As for the scalability, in traditional network architectures, control logic is distributed in network devices, so updated policies in network troubleshooting are implemented separately in each network device. Nowadays, the number of network devices increases rapidly due to the rapid growth of the Internet network and IoT. This leads to the scalability issue for network troubleshooting, as well as network management.

In this book, we put a special focus on the high time consumption and inadaptability issues. Concretely, we present an analysis and comments on the network troubleshooting mechanisms before elaborating our vision of a troubleshooting framework for network operators in the context of encrypted traffic. This framework is composed of five modules: data collection, anomaly detection, temporary remediation, root cause analysis and definitive remediation. In addition to the troubleshooting framework, this book focuses on four essential points:

– in addition to the traditional data collection module, it is necessary to think about a novel traffic classification approach to classify encrypted traffic into different kinds of applications (e.g. video streaming, file transfer, etc.). In fact, the application class plays an important role in the remediation approaches (e.g. application-aware mechanisms, etc.) in network troubleshooting;

– a temporary remediation approach to assure the availability of the network as well as meet strict SLA requirements during the root cause analysis and definitive remediation;

– a proof of concept for the root cause analysis and definitive remediation in network troubleshooting that allows us to automatically identify the root cause of anomalies and address it completely;

– troubleshooting datasets: we build and contribute the troubleshooting datasets that contain a dataset for encrypted traffic classification approaches and two datasets for the root cause analysis, in order to facilitate the network troubleshooting.

In this book, each chapter is dedicated to one module of the troubleshooting framework. The remainder of the book is organized as follows:

Chapter 1: provides related work on several network problems (e.g. link failure, switch failure, etc.). In addition to explaining the fundamental parts of traditional troubleshooting architecture, we explain how network traffic is encrypted and provide an analysis on the limitations of network troubleshooting for encrypted traffic.

Chapter 2: presents fundamental parts of novel troubleshooting architecture in the context of encrypted traffic and shows a proof of concept of this architecture in a software-defined networking (SDN) environment. We also present a parameter measurement module to collect data in order to build troubleshooting datasets. Moreover, the chapter thoroughly describes the troubleshooting datasets, which are composed of datasets for root cause analysis and encrypted traffic classification.

Chapter 3: describes a novel encrypted traffic classification method to identify different kinds of applications. The purpose is to provide information about application classes for application-aware mechanisms in network troubleshooting.

Chapter 4: presents related work on anomaly detection. Moreover, this chapter takes into account congestion to generate anomalies and presents an anomaly detection approach using machine learning to detect these anomalies in the network.

Chapter 5: presents an application-aware segment routing mechanism for temporary remediation. This mechanism identifies application classes according to traffic classification. For each application class, it applies a specific routing strategy based on a reinforcement learning algorithm to meet strict SLA requirements.

Chapter 6: considers congestion as a use case for the root cause analysis and definitive remediation. This chapter presents a root cause analysis using machine learning to identify the root cause of congestion. It also presents an adaptive congestion control algorithm to solve it completely.

Conclusions and Prospects: this last chapter concludes this book and provides an insight into the future work and prospects in the area of network troubleshooting.

1 State of the Art on Network Troubleshooting

“A protocol approach to troubleshooting”

Ed Wilson

Chapter 1 presents the state of the art on network troubleshooting and a traditional troubleshooting architecture for non-encrypted traffic. We then discuss its limitations when traffic is encrypted.

1.1. Network troubleshooting

In the early 19th century, technicians were dispatched to find and repair problems in telegraph and phone line infrastructure. Historically, a troubleshooter refers to a skilled worker who finds and solves technical problems. Nowadays, troubleshooting is a form of problem-solving that aims to repair failed processes in a machine or a system. According to related work (Morris and Rouse 1985; Jonassen and Hung 2006), there are several existing conceptions of the troubleshooting process. The basic concept of troubleshooting is finding the faulty components in a device in order to repair or replace them (Perez 1991). Schaafstal et al. (2000) designed the troubleshooting process with four subtasks: formulating the problem description, cause generation, test and evaluation. Similarly, troubleshooting is considered an iterative process with four subprocesses: problem space construction, problem space reduction, fault diagnosis and solution verification (Johnson et al. 1993).

Network troubleshooting is an iterative process with three subtasks: identifying, diagnosing and solving problems in the network. In the past, network operators (NOs) relied on manual troubleshooting tools such as ping and traceroute. ping is a network administration utility designed to check the reachability between a source and a destination and to measure the round-trip time of packets in the network. traceroute is a network diagnostic utility used to display possible routes between a source and a destination and to measure the transit delay of packets in the network. These tools are used to diagnose complex problems such as loops caused by undefined interaction between spanning tree protocols (Heller et al. 2013). However, these approaches are not effective with a huge number of network devices. Besides, 24.6% of administrators reported that diagnosing and solving an anomaly takes more than 1 h on average (Zeng et al. 2012a). Therefore, an automated troubleshooting process is needed to detect an anomaly, locate its causes and solve it. Consequently, network troubleshooting has received considerable attention from the research community (Fonseca and Mota 2017; Yu et al. 2018; Cherrared et al. 2019). In the following section, we present the state of the art of network troubleshooting.
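To make the reachability and round-trip-time check concrete, here is a minimal sketch of a ping-like probe. It is not the ping utility itself: real ping uses ICMP echo messages, which usually require raw-socket privileges, so this sketch substitutes a TCP connection attempt and times the handshake. The function name `tcp_probe` and its parameters are illustrative assumptions.

```python
import socket
import time
from typing import Optional, Tuple


def tcp_probe(host: str, port: int, timeout: float = 1.0) -> Tuple[bool, Optional[float]]:
    """Return (reachable, rtt_ms) for a single TCP connection attempt.

    Assumption: a completed TCP handshake stands in for an ICMP echo reply.
    """
    start = time.perf_counter()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            rtt_ms = (time.perf_counter() - start) * 1000.0
            return True, rtt_ms
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return False, None
```

A call such as `tcp_probe("192.0.2.10", 443)` (a documentation address used here purely as a placeholder) returns `(False, None)` when the destination cannot be reached within the timeout. Probing each device by hand in this way is exactly what stops scaling as the number of devices grows.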

1.1.1. State of the art

According to the related work on network troubleshooting (Yu et al. 2018; Fonseca and Mota 2017; Van et al. 2018), problems can be classified into several categories according to the locations where they happen or the factors that cause them. Yu et al. (2018) and Fonseca and Mota (2017) categorize problems by layer: application, control and infrastructure. Similarly, problems can be classified into problems in application service providers (ASP) or Internet service providers (ISP) (Van et al. 2018). Besides, problems can be classified into problems caused by administrators (e.g. router misconfiguration, server misconfiguration, etc.) or problems that are not caused by administrators (e.g. link failure, switch failure, buffer overload, etc.). Following a survey of NOs (Zeng et al. 2012b), we present in the following sections several problems that are not caused by administrators.

1.1.1.1. Rule failure

Bu et al. (2016) categorized rule failures in the network into missing faults and priority faults. A missing fault occurs when a rule is not executed as expected, whereas a priority fault occurs when overlapping rules violate a priority order.
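The distinction between the two fault types can be illustrated with a toy flow table. The sketch below is an assumption-laden simplification and not the method of Bu et al. (2016): rules match on a destination-address string prefix, and the `Rule` class, `classify_fault` function and action labels are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Rule:
    name: str
    priority: int
    dst_prefix: str  # simplified match: destination string must start with this
    action: str


def matching_rules(table: List[Rule], dst: str) -> List[Rule]:
    return [r for r in table if dst.startswith(r.dst_prefix)]


def expected_rule(table: List[Rule], dst: str) -> Optional[Rule]:
    """The rule that should fire: the highest-priority rule matching dst."""
    matches = matching_rules(table, dst)
    return max(matches, key=lambda r: r.priority) if matches else None


def classify_fault(table: List[Rule], dst: str, observed_action: str) -> str:
    """Compare the action a probe experienced with what the table prescribes."""
    expected = expected_rule(table, dst)
    if expected is None:
        return "no matching rule"
    if observed_action == expected.action:
        return "ok"
    others = [r for r in matching_rules(table, dst) if r is not expected]
    if any(r.action == observed_action for r in others):
        return "priority fault"  # an overlapping lower-priority rule won
    return "missing fault"       # the expected rule was not executed


table = [
    Rule("r1", 200, "10.0.1.", "drop"),
    Rule("r2", 100, "10.0.", "fwd:2"),
]
print(classify_fault(table, "10.0.1.5", "fwd:2"))
```

Here `r1` and `r2` overlap for destination `10.0.1.5`. Observing the action of `r2` instead of `r1` is reported as a priority fault, while observing an action that no matching rule prescribes is reported as a missing fault.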

Several research studies concentrate on the missing fault, including ATPG (Zeng et al. 2012a) and Monocle (Perešíni et al. 2015). These approaches verify the rules by generating probe packets to exercise every rule. ATPG uses header space analysis (Kazemian et al. 2012) to check the reachability between all test hosts. Then, the reachability result is transferred to a probe packet generator, which computes a minimal set of probe packets via a greedy algorithm (Slavík 1997). Next, these probe packets are sent into the network to check the correctness of the rules. If an error is detected, a fault localization algorithm is applied to narrow down and identify the root cause. However, ATPG generates probe packets for all rules, which is inefficient when only a few rules have been updated. Monocle was proposed to overcome this drawback. This approach only verifies recently installed rules and reports misbehaviors. Besides, Monocle formulates knowledge from the flow tables in the switches as constraints and applies an SAT solver (Biere 2008) to generate a set of probe packets.

Probing is an intrusive method that generates significant overheads and increases link utilization in the network. Consequently, it is necessary to minimize the number of probe packets. This is a minimum set cover problem, which is an NP-Complete problem (Zeng et al. 2012a). Therefore, Bu et al. (2016) proposed RuleScope, a framework for detecting rule failures in the network. RuleScope divides flow tables into solvable subsets of rules to minimize probe scale. Then, this approach creates a directed acyclic graph for each subset and generates a set of probe packets for each subset. As a result, this approach processes the probe packet generation more quickly due to a small scale of rule subsets.
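The minimum set cover formulation mentioned above can be sketched with the classic greedy heuristic: repeatedly pick the candidate probe that exercises the most still-uncovered rules. This is a generic illustration and not the ATPG or RuleScope implementation; the mapping from probes to the rules they exercise is assumed to be given, and all names are hypothetical.

```python
from typing import Dict, Hashable, Iterable, List, Set, Tuple


def greedy_probe_cover(
    rules: Iterable[Hashable],
    probes: Dict[str, Set[Hashable]],
) -> Tuple[List[str], Set[Hashable]]:
    """Greedy approximation of minimum set cover.

    rules  : the rules that must be exercised at least once
    probes : candidate probe id -> set of rules that probe exercises
    Returns the chosen probe ids and any rules no candidate can cover.
    """
    uncovered = set(rules)
    chosen: List[str] = []
    while uncovered:
        # Candidate that exercises the most rules still uncovered.
        best = max(probes, key=lambda p: len(probes[p] & uncovered), default=None)
        if best is None:
            break
        gained = probes[best] & uncovered
        if not gained:
            break  # remaining rules are not reachable by any candidate
        chosen.append(best)
        uncovered -= gained
    return chosen, uncovered


candidates = {
    "p1": {"r1", "r2", "r3"},
    "p2": {"r2", "r4"},
    "p3": {"r3", "r4"},
    "p4": {"r4", "r5"},
}
selected, missed = greedy_probe_cover({"r1", "r2", "r3", "r4", "r5"}, candidates)
print(selected, missed)
```

In this toy instance two probes (`p1` and `p4`) are enough to exercise all five rules. The greedy choice does not guarantee an optimal cover in general, which is why the problem size matters and why splitting the flow table into smaller subsets speeds up generation.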

Although RuleScope minimizes the number of probe packets, it suffers from a drawback related to the separation of the flow tables. Splitting the flow tables into small subsets can result in overlooking two overlapping rules that fall into two different subsets, which leaves priority faults in the switches undetected. Zhao et al. (2018a) proposed SERVE, a rule verification framework that identifies rule failures in the switches automatically. Firstly, SERVE extracts all rules for each device and builds a multi-rooted tree that considers rule connections. Next, SERVE analyzes the multi-rooted tree to generate the minimum number of probe packets. The minimum set cover problem is an NP-Complete problem, so SERVE applies the depth-first search (DFS) algorithm to generate the probe packets. Zhao et al. (2018b) extended the previous study of Zhao et al. (2018a) to present a complete framework. After generating the probe packets, SERVE injects these packets into the network using an out-of-band channel. Besides, SERVE also computes the desired network behavior using the multi-rooted trees. By comparing the feedback from the out-of-band channel for every rule with the desired network behavior, SERVE can detect faulty rules and send notifications to administrators. SERVE’s performance is evaluated against benchmarks in terms of processing time, number of probe packets and overhead. Concerning the number of probe packets, SERVE decreases the number of probe packets by up to 75% in comparison with Monocle. Regarding the processing time, SERVE’s processing time is one third of that of ATPG. As for the overhead, in-band bandwidth is not affected because the probe packets are injected through the out-of-band channel. Besides, the out-of-band bandwidth used is far less than the link capacity.

1.1.1.2. Link failure

Link failure refers to unreachability between two switches. It can lead to high packet loss and performance degradation in the network. Link failure can be detected with probe packets in active monitoring approaches. ping is a simple troubleshooting tool that sends probe packets to check the reachability between two end-points. If probe packets are lost, there is a faulty link between these end-points. Similarly, Cascone et al. (2017) proposed a fast failure detection mechanism that detects link failure based on the exchange of bidirectional “heartbeat” packets. When the packet rate drops below a threshold, a node sends heartbeat packets to its neighbors. If there is no response from a neighbor after a given time, the link is considered to have failed. However, this mechanism requires a strict consumption related to the backup solutions that cannot be utilized to guarantee the short failover delays (1 ms).
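The heartbeat logic described above can be sketched as a small state machine. This is an illustrative simplification and not the mechanism of Cascone et al. (2017): the class name, the threshold and timeout values, and the three state labels are all assumptions, and time is passed in explicitly so that the example stays deterministic.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LinkMonitor:
    rate_threshold: float  # packets/s below which heartbeats are sent (assumed)
    timeout: float         # seconds to wait for a heartbeat reply (assumed)
    probing_since: Optional[float] = None  # time of first unanswered heartbeat

    def on_sample(self, now: float, packet_rate: float) -> str:
        """Process one traffic-rate sample and return 'up', 'probing' or 'down'."""
        if packet_rate >= self.rate_threshold:
            self.probing_since = None  # regular traffic already proves liveness
            return "up"
        if self.probing_since is None:
            self.probing_since = now   # rate dropped: start sending heartbeats
            return "probing"
        if now - self.probing_since >= self.timeout:
            return "down"              # no reply within the timeout
        return "probing"

    def on_heartbeat_reply(self) -> None:
        """A neighbor answered: the link is alive, so reset the timer."""
        self.probing_since = None


monitor = LinkMonitor(rate_threshold=10.0, timeout=0.5)
print(monitor.on_sample(0.0, 100.0))  # plenty of traffic
print(monitor.on_sample(1.0, 0.0))    # traffic stopped, heartbeats start
print(monitor.on_sample(1.6, 0.0))    # still silent after the timeout
```

Heartbeats are only sent when regular traffic is too sparse to confirm that the link is alive, which keeps the probing overhead low on busy links.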

Moreover, this problem can be detected by using the Link Layer Discovery Protocol (LLDP) in software-defined networking (SDN) (Khan et al. 2016; Tarnaras et al. 2015). Using the topology discovery protocol, the SDN controller can detect a link failure and remove the link from the network topology. Firstly, an OpenFlow (OF) switch connects to the controller so that the controller knows its active ports. Next, the controller generates a Packet-out message for each active port of the switch to discover the topology. The LLDP exchange between switches s1 and s2 is depicted in Figure 1.1. Firstly, the controller encapsulates an LLDP packet in a Packet-out message and sends it to switch s1. When switch s1 receives the Packet-out message, it forwards the LLDP packet to switch s2. After receiving the LLDP packet, switch s2 encapsulates this packet in a Packet-in message and sends it back to the controller. The controller receives this message and creates a link from switch s1 to s2. The same process is performed to identify the link in the opposite direction. When link s1–s2 is faulty, the controller does not receive the Packet-in message from switch s2 and removes this link from the network topology. In a network with S switches interconnected by a set of L links, the total numbers of Packet-out and Packet-in messages are given by equations [1.1] and [1.2], respectively, where P_i is the number of active ports in switch S_i.

\text{Total Packet-out} = \sum_{i=1}^{S} P_i \qquad [1.1]

\text{Total Packet-in} = 2L \qquad [1.2]

Figure 1.1. Unidirectional link discovery in LLDP. For a color version of this figure, see www.iste.co.uk/tong/troubleshooting.zip
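The message counts of the discovery procedure can be worked through on a small topology. The sketch below follows the procedure as described in the text, under the assumption that the controller sends one Packet-out per active port and receives one Packet-in per link direction, so that the totals are the sum of the P_i and 2L. The function name and the example topology are hypothetical.

```python
from typing import Dict, Iterable, Tuple


def discovery_message_counts(
    active_ports: Dict[str, int],
    links: Iterable[Tuple[str, str]],
) -> Tuple[int, int]:
    """Return (total Packet-out, total Packet-in) for one discovery round.

    active_ports : switch id -> number of active ports P_i
    links        : undirected inter-switch links (s_a, s_b)
    Assumes one Packet-out per active port and one Packet-in per link
    direction, as in the procedure described in the text.
    """
    packet_out = sum(active_ports.values())
    # Deduplicate so that (s1, s2) and (s2, s1) count as the same link.
    unique_links = {frozenset(link) for link in links}
    packet_in = 2 * len(unique_links)
    return packet_out, packet_in


# Three switches in a line: s1 - s2 - s3, with a few host-facing ports.
ports = {"s1": 3, "s2": 2, "s3": 2}
topology = [("s1", "s2"), ("s2", "s3")]
print(discovery_message_counts(ports, topology))
```

In this example, seven Packet-out messages produce only four Packet-in messages, because ports that face hosts rather than other switches never return an LLDP packet to the controller.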

Unlike the SDN environments, a hybrid SDN contains OF switches and traditional switches that LLDP cannot discover. Therefore, SDN controllers