DATA EXFILTRATION THREATS AND PREVENTION TECHNIQUES
A comprehensive resource covering threat prevention techniques for data exfiltration and the application of machine learning to aid in identification and prevention.
Data Exfiltration Threats and Prevention Techniques gives readers the knowledge needed to prevent and protect against malware attacks by introducing existing and recently developed methods in malware protection using AI, memory forensics, and pattern matching; presenting various data exfiltration attack vectors and advanced memory‐based data leakage detection; and discussing ways in which machine learning methods have a positive impact on malware detection. Providing detailed descriptions of recent advances in data exfiltration detection methods and technologies, the authors also discuss data breach countermeasures and attack scenarios to show how the reader may identify a potential cyberattack in the real world. Composed of eight chapters, this book offers a better understanding of the core issues related to cyberattacks as well as the recent methods that have been developed in the field.
In Data Exfiltration Threats and Prevention Techniques, readers can expect to find detailed information on:
* Sensitive data classification, covering text pre-processing, supervised text classification, automated text clustering, and other sensitive text detection approaches
* Supervised machine learning technologies for intrusion detection systems, covering taxonomy and benchmarking of supervised machine learning techniques
* Behavior-based malware detection using API-call sequences, covering API-call extraction techniques and detecting data stealing behavior based on API-call sequences
* Memory-based sensitive data monitoring for real-time data exfiltration detection and advanced time delay data exfiltration attack and detection
Aimed at professionals and students alike, Data Exfiltration Threats and Prevention Techniques highlights a range of machine learning methods that can be used to detect potential data theft and identifies research gaps and the potential to make change in the future as technology continues to grow.
Page count: 444
Year of publication: 2023
Cover
Title Page
Copyright
Dedication
About the Authors
Acknowledgments
Acronyms
Abstract
1 Introduction
1.1 Data Exfiltration Methods
1.2 Important Questions
1.3 Book Scope
1.4 Book Summary
1.5 Book Structure
References
2 Background
2.1 Hidden Markov Model
2.2 Memory Forensics
2.3 Bag‐of‐Words Model
2.4 Sparse Distributed Representation
2.5 Summary
References
3 Data Security Threats
3.1 Data Security
3.2 Security vs. Protection vs. Privacy
3.3 Advanced Persistent Threats Attacks
3.4 Cybersecurity Threats
3.5 Conclusion
References
Notes
4 Use Cases Data Leakage Attacks
4.1 Most Significant Attacks
4.2 Top Infection Vectors
4.3 Top Threats of Recent Years
4.4 Malware Development Trends
4.5 Geographic Trends
4.6 Industry Trends
4.7 Conclusion
References
Notes
5 Survey on Building Block Technologies
5.1 Motivation
5.2 Background
5.3 Taxonomy
5.4 Supervised Learning Methods
5.5 Systematic Literature Review
5.6 Evaluation of Supervised Learning Methods
5.7 Key Open Problems
5.8 Summary
References
6 Behavior‐Based Data Exfiltration Detection Methods
6.1 Motivation
6.2 Existing Methods
6.3 Sub‐Curve HMM Method
6.4 Evaluation
6.5 Experimental Results
6.6 Discussion
6.7 Summary
References
7 Memory‐Based Data Exfiltration Detection Methods
7.1 Motivation
7.2 Existing Methods
7.3 Concepts
7.4 Fast Lookup Bag‐of‐Words (FBoW)
7.5 Evaluation
7.6 Summary
References
8 Temporal‐Based Data Exfiltration Detection Methods
8.1 Motivation
8.2 Existing Methods
8.3 Definitions
8.4 Temporary Memory Bag‐of‐Words (TMBoW)
8.5 Experimental Results
8.6 Summary
References
9 Conclusion
9.1 Summary
9.2 What Is Innovative in the Described Methods?
9.3 What Is Next?
Index
End User License Agreement
Chapter 2
Table 2.1 HMM notation.
Table 2.2 Examples of Volatility plugins.
Chapter 3
Table 3.1 Types of malware.
Table 3.2 Trojan malware.
Table 3.3 Difference between virus and worm.
Chapter 4
Table 4.1 Top 10 vulnerabilities of 2021.
Chapter 5
Table 5.1 The number of papers published in each category.
Table 5.2 Evaluation criteria for SCADA‐based IDSs.
Table 5.3 Comparison of methods for SCADA‐based IDS groups by classificatio...
Table 5.4 Description of dataset.
Table 5.5 Description of testbed.
Chapter 6
Table 6.1 Malware detection sources used by existing methods and their limi...
Table 6.2 Example of first eight INS observed from Process ID 2428.
Table 6.3 Malware dataset.
Table 6.4 Benign dataset.
Table 6.5 A comparison of malware detection accuracy (%) of proposed method
Table 6.6 Comparison of precision and recall of malware detection methods....
Table 6.7 Keyloggers detection rates (%) of Sub‐Curve HMM, IG, and 10 antiv...
Chapter 7
Table 7.1 Frequently used notations.
Table 7.2 Characteristics of a dataset.
Table 7.3 Benchmark the state‐of‐the‐art and the FBoW method on the fixed f...
Table 7.4 Benchmark the state‐of‐the‐art and the FBoW on the whole text d...
Table 7.5 Robustness comparison on the memory dump dataset.
Table 7.6 Features comparison between state‐of‐the‐art and the FBoW method....
Chapter 8
Table 8.1 Terminology.
Table 8.2 Detection result using naïve method.
Table 8.3 Detection result using TMBoW method.
Table 8.4 Probability of false alarm: small, medium, and large based on spa...
Chapter 1
Figure 1.1 Overview of data exfiltration problems.
Figure 1.2 Scope of the work for Q2–Q4 compared to related work based on thr...
Figure 1.3 Book structure.
Chapter 2
Figure 2.1 Relationship of HMM hidden state and observable event in the ...
Figure 2.2 The illustration of functions (so‐called Di‐gammas).
Figure 2.3 The decision flowchart of choosing memory acquisition strategies....
Chapter 3
Figure 3.1 CIA Triad.
Figure 3.2 CIA Triad relations.
Figure 3.3 Data security, privacy, protection.
Figure 3.4 How advanced persistent threats work.
Figure 3.5 Advanced persistent threats life cycle.
Figure 3.6 Types of data security threats.
Figure 3.7 Ransomware life cycle.
Figure 3.8 Infection flow of fileless malware.
Figure 3.9 ZeuS malware.
Figure 3.10 The Emotet attack chain.
Figure 3.11 A comparison between DoS and DDoS attacks.
Figure 3.12 Man‐in‐the‐Middle flow.
Figure 3.13 Successful MITM execution steps.
Figure 3.14 Social engineering life cycle.
Figure 3.15 How a phishing attack happens.
Figure 3.16 SQL injection attack.
Chapter 4
Figure 4.1 Top attack types, 2021 vs. 2020.
Figure 4.2 Percentage of IR incidents that were ransomware, by month, 2020 v...
Figure 4.3 Ransomware groups that have shut down, 2017–2021.
Figure 4.4 Types of ransomware observed in 2021.
Figure 4.5 Percentage of incidents that were BEC, 2021.
Figure 4.6 Top infection vectors, 2021 vs. 2020.
Figure 4.7 Vulnerabilities discovered by year, 2011–2021.
Figure 4.8 Linux malware with unique code, 2021 vs. 2020.
Figure 4.9 Malware with unique code, Linux vs. Windows, 2021.
Figure 4.10 Breakdown of attacks by region, 2021 vs. 2020.
Figure 4.11 Breakdown of attacks on the top 10 industries, 2021 vs. 2020.
Chapter 5
Figure 5.1 The number of vulnerabilities identified in various industries, p...
Figure 5.2 Evolution of IDS schemes based on ML methods (between 2007 and 20...
Figure 5.3 The first generation of SCADA architecture.
Figure 5.4 The second generation of SCADA architecture.
Figure 5.5 The third generation of SCADA architecture.
Figure 5.6 A taxonomy of IDSs for SCADA systems.
Figure 5.7 The process of the supervised learning‐based IDS approach.
Figure 5.8 Categories of classification algorithms used in SCADA‐based IDSs....
Figure 5.9 Paper selection process.
Chapter 6
Figure 6.1 Data exfiltration malware.
Figure 6.2 Cuckoo sandbox's API calls monitoring report of the sample progra...
Figure 6.3 Architecture of Sub‐Curve HMM.
Figure 6.4 Hidden Markov model: states and observations.
Figure 6.5 API sequence matching with log‐likelihood at different time perio...
Figure 6.6 Flowchart of Sub‐Curve extraction.
Figure 6.7 Comparison the detection accuracy of feature extraction methods w...
Figure 6.8 Comparison the detection accuracy of feature extraction methods w...
Figure 6.9 Distribution of detection accuracy over six malware families of S...
Chapter 7
Figure 7.1 The Fast lookup BoW conceptual diagram.
Figure 7.2 Example of text data extracted from the memory image compared to ...
Figure 7.3 Example of noise from: string and other variables contained in th...
Figure 7.4 Example of noise from: program binary header (a); and other rando...
Figure 7.5 The Fast lookup BoW method.
Figure 7.6 Example of grouping and sorting of the term‐frequency pairs.
Figure 7.7 Average runtime comparison between Aho–Corasick and FBoW.
Figure 7.8 The average searching time per vs. accuracy on variation of thr...
Figure 7.9 The data structure size comparison.
Figure 7.10 The data structure size vs. accuracy on the variation of thresho...
Figure 7.11 The accuracy on news articles dataset.
Figure 7.12 The accuracy dataset product reviews dataset.
Figure 7.13 The data structure size on variation of and threshold.
Figure 7.14 The average searching time per ...
Figure 7.15 The accuracy on variation of and threshold.
Chapter 8
Figure 8.1 Example of original text document A lines 1–7 (a) and text docume...
Figure 8.2 Decoded text from a memory snapshot shows two discoveries of the ...
Figure 8.3 State diagram of the temporal data exfiltration detection system....
Figure 8.4 Attack timeline of the temporal‐based ransomware.
Figure 8.5 Document ID discovered from multiple time steps.
Figure 8.6 Data structure size comparison between original string and .
Figure 8.7 Size memory snapshot (a) and number of the extracted SDR from mem...
Figure 8.8 of each document when the sensitive database contains 100 files...
Figure 8.9 SDR vector size (a) over time step and memory image size (b).
Zahir Tari
RMIT University, Melbourne, Australia
Nasrin Sohrabi
RMIT University, Melbourne, Australia
Yasaman Samadi
RMIT University, Melbourne, Australia
Jakapan Suaboot
Prince of Songkla University, Phuket, Thailand
Copyright © 2023 by John Wiley & Sons, Inc. All rights reserved.
Published by John Wiley & Sons, Inc., Hoboken, New Jersey.
Published simultaneously in Canada.
No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning, or otherwise, except as permitted under Section 107 or 108 of the 1976 United States Copyright Act, without either the prior written permission of the Publisher, or authorization through payment of the appropriate per‐copy fee to the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, (978) 750‐8400, fax (978) 750‐4470, or on the web at www.copyright.com. Requests to the Publisher for permission should be addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, (201) 748‐6011, fax (201) 748‐6008, or online at http://www.wiley.com/go/permission.
Trademarks: Wiley and the Wiley logo are trademarks or registered trademarks of John Wiley & Sons, Inc. and/or its affiliates in the United States and other countries and may not be used without written permission. All other trademarks are the property of their respective owners. John Wiley & Sons, Inc. is not associated with any product or vendor mentioned in this book.
Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best efforts in preparing this book, they make no representations or warranties with respect to the accuracy or completeness of the contents of this book and specifically disclaim any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives or written sales materials. The advice and strategies contained herein may not be suitable for your situation. You should consult with a professional where appropriate. Further, readers should be aware that websites listed in this work may have changed or disappeared between when this work was written and when it is read. Neither the publisher nor authors shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.
For general information on our other products and services or for technical support, please contact our Customer Care Department within the United States at (800) 762‐2974, outside the United States at (317) 572‐3993 or fax (317) 572‐4002.
Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic formats. For more information about Wiley products, visit our web site at www.wiley.com.
Library of Congress Cataloging‐in‐Publication Data applied for:
ISBN: 9781119898870 (HB); ePDF: 9781119898887; epub: 9781119898894
Cover design by Wiley
Cover image: © jijomathaidesigners/Shutterstock
To our dear parents.
Zahir Tari is a full professor of distributed systems at RMIT University (Australia) and research director of the Centre for Cyber Security Research and Innovation (CCSRI). His expertise is in the areas of system's performance (e.g. P2P, Cloud, IoT) as well as system's security (e.g. SCADA, SmartGrid, Cloud, IoT). He is the co‐author of eight books (John Wiley, Springer), and he has edited over 25 conference proceedings. Zahir is or has been an associate editor of ACM Computing Surveys, IEEE Transactions on Computers (TC), IEEE Transactions on Parallel and Distributed Systems (TPDS), and IEEE Magazine on Cloud Computing.
Nasrin Sohrabi received a PhD degree in Computer Science from RMIT University in early 2022. She is currently a Postdoctoral Research Fellow in Cloud, Systems and Security discipline at the School of Computing Technologies, RMIT University (Australia). She is also a core member of the RMIT Centre for Cyber Security Research and Innovation (CCSRI). She is an expert in distributed ledger (e.g. blockchain) and cybersecurity (e.g. network and system security) technologies, as well as in scalability and performance of large‐scale distributed systems.
Yasaman Samadi is a PhD student in computer science at RMIT University, Australia, and a researcher in quantum cybersecurity. Yasaman has a Master's in Computer Architecture and worked as a quantum engineer at QBee. Her research interests include security, low‐level control electronic systems, FPGAs, and micro‐architecture for intelligent devices.
Jakapan Suaboot received BEng and MEng (research) degrees in Computer Engineering from Prince of Songkla University (Thailand) and a PhD degree in Computer Science from RMIT University (Australia) in 2007, 2010, and 2021, respectively. He is a lecturer at the College of Computing, Prince of Songkla University. His research interests include malware detection, data breach prevention, machine learning technologies, and digital financial security.
The enormous work behind this book could not have been completed without the continual support of our families, friends, and colleagues.
Zahir Tari, Nasrin Sohrabi, Yasaman Samadi, and Jakapan Suaboot
ACC: accuracy
AMAL: AutoMal and MalLabel
ANN: artificial neural network
API: application programming interface
APT: advanced persistent threat
ARP: address resolution protocol
BEC: business email compromise
C2: command and control
CBA: classification based on associations
CFG: control‐flow graph
CHS: Community Health Systems
CI: critical infrastructure
CIA: confidentiality, integrity, availability
CMAR: classification based on multiple association rules
COTS: commercial‐off‐the‐shelf
CRA: Canada Revenue Agency
CRF: conditional random field
CRT: classification and regression tree
DFA: deterministic finite automaton
DLP: data leakage prevention
DNS: domain name service
DoS: Denial of Service
FBoW: Fast lookup Bag‐of‐Words
FPG: frequent pattern growth
FN: false negative
FP: false positive
GSP: generalized sequential pattern
HID: host‐based intrusion detection
HMM: hidden Markov model
HMI: human machine interface
HTM: hierarchical temporal memory
ICS: industrial control system
ICT: information and communication technologies
IDS: intrusion detection system
IED: intelligent electronic device
IoT: Internet of Things
IR: incident response
IS: information system
IT: information technology
LAN: local area network
MFA: multi‐factor authentication
MIST: Malware Instruction Set
MITM: Man‐in‐the‐Middle
NB: naïve Bayes
NID: network‐based intrusion detection
OCSVM: one class support vector machine
OT: operational technology
PLC: programmable logic controller
PR: precision
RAM: random access memory
RAT: remote administration Trojan
RC: recall
RF: random forest
RIPPER: repeated incremental pruning to produce error reduction
RTU: remote terminal unit
SCADA: supervisory control and data acquisition
SDR: sparse distributed representation
SLQ: supervised learning in quest
SQL: structured query language
SVM: support vector machine
TLS: transport layer security
TP: true positive
TMBoW: Temporary Memory Bag‐of‐Words
TN: true negative
UTF: Unicode Transformation Format
VM: virtual machine
WAN: wide area network
Data exfiltration is a type of cyberattack that causes breaches of sensitive information. It is undoubtedly a critical issue for the modern world of data‐centric services. In particular, the critical infrastructure (CI), information technology (IT), and mobile computing sectors are targets of advanced persistent threats (APTs). Data breaches cause huge losses every year to a wide range of industries, including large enterprises such as Google, Facebook, and Microsoft, to name a few. Furthermore, such breaches can have major impacts on national security if government departments or the military are targeted. Since the adversary constantly attacks the target using various system vulnerabilities (e.g. unknown or zero‐day exploits), prevention‐based measures alone are not sufficient to thwart the adversary. To address this problem, this book discusses in detail a holistic approach to data exfiltration detection that combines three complementary detection methods.
We begin the technical content with basic background that enables readers to understand the fundamental concepts and models covered in the remaining chapters of the book. The background covers, for example, the hidden Markov model (HMM), memory forensics, the bag‐of‐words (BoW) model, and sparse distributed representations (SDRs). Cybersecurity threats are also covered, as these can cause a wide range of damage, including the physical destruction of an entire information systems facility. Recognizing the different types of data security threats and the ways in which they steal sensitive information from individuals and organizations gives readers a clear understanding of how to protect their data. Hence, data security threats are explained, and various attacks are discussed, such as malware, denial of service (DoS), SQL injection, Emotet (malspam), social engineering and phishing, and man‐in‐the‐middle (MITM) attacks. These attacks often target high‐value victims, such as nation states and major corporations, to steal crucial data.
To better convey the importance of the topics addressed in this book, a separate chapter discusses several use cases (examples) of data leakage attacks that occurred over the last three years across all continents. This helps readers understand real‐world cyberattacks and the damage they caused. According to various reports, several attack types (e.g. ransomware, server access, business email compromise (BEC), data theft, credential harvesting, remote administration Trojan (RAT), misconfiguration, and malicious insider) were among the best‐known attacks that occurred over the period 2020–2021. This chapter provides detailed information, including statistical information, about these use cases.
We later examine the major technologies that have a strong potential to be used as the basis for data leakage prevention technologies. The literature review revealed numerous advanced intrusion detection methods that use similar core technologies for the data exfiltration detection solution. Industrial control systems (ICSs), especially the ones with supervisory control and data acquisition (SCADA), are the most challenging ones for researchers and practitioners as they are constantly subject to cyberattacks. In the systematic literature review we carried out, over a hundred peer‐reviewed articles were examined. This study illustrated the development of supervised‐based learning approaches from the industrial perspective that target various data auditing sources and attacking methods. Based on this extensive review, both qualitative and quantitative benchmarks for several machine learning methods were conducted in 10 different categories. Furthermore, future directions for the development of new algorithms/methods and facilitating the selection of algorithms/methods for industrial‐based intrusion detection system (IDS) were identified.
Following the extensive review, the technical aspects of a holistic data exfiltration detection approach are considered. First, the data exfiltration problem caused by malicious software is examined in detail, as malware is a critical tool used by attackers to steal sensitive information. For the real‐time detection of suspicious data‐stealing behaviors, an innovative method called Sub‐Curve HMM is described. It uses the HMM to extract the sub‐contained malicious behavior from a long application programming interface (API) call sequence, and it is intended to detect malicious activities that occur only over a short period. By projecting a series of matching scores into a curve, this approach distinguishes malignant actions from other system activities using discontinuities in the slope of the curve. When a long API call sequence is tested, malicious and benign activities obtain different matching scores for an adjoining set of API calls. Experimental results show that the Sub‐Curve HMM method outperforms existing solutions in detecting six families of malware: the detection accuracy of Sub‐Curve HMM is over 94% compared to 83% for the baseline HMM approach and 73% for information gain.
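The idea of scoring adjoining API calls and looking for breaks in the resulting curve can be illustrated with a minimal sketch. It is not the authors' Sub‐Curve HMM: as a loudly stated simplification, a first‐order Markov chain stands in for the HMM, and the API names, transition probabilities, window size, and jump threshold are all hypothetical values chosen for illustration.

```python
import math

def window_scores(seq, trans, window=3, floor=1e-6):
    """Average log-probability of each sliding window of API calls.

    `trans` maps (previous_call, next_call) to a probability learned from
    known data-stealing traces. A plain Markov chain is used here as a
    stand-in for the HMM (an assumption of this sketch).
    """
    scores = []
    for i in range(len(seq) - window + 1):
        w = seq[i:i + window]
        logp = sum(math.log(trans.get((a, b), floor)) for a, b in zip(w, w[1:]))
        scores.append(logp / (window - 1))
    return scores

def slope_breaks(scores, jump=2.0):
    """Indices where the score curve changes abruptly between adjacent windows."""
    return [i for i in range(1, len(scores))
            if abs(scores[i] - scores[i - 1]) > jump]

# Hypothetical model of a "read then send" exfiltration pattern.
trans = {("ReadFile", "Send"): 0.9, ("Send", "ReadFile"): 0.9}
seq = ["Open", "Draw", "Draw", "ReadFile", "Send",
       "ReadFile", "Send", "Draw", "Draw"]

scores = window_scores(seq, trans)
print(slope_breaks(scores))
```

The breaks bracket the windows whose scores are highest under the malicious model, which is the short sub‐sequence a detector would extract from the longer trace.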
After that, we move away from behavior‐based methods to sensitive‐data‐oriented methods. In particular, the common belief of this research field is challenged here by discussing new ways of monitoring the physical memory for sensitive information instead of checking for malicious activities or scanning the network traffic. Essentially, the main memory is a single point of data flow in the computer system; hence, the adversary cannot evade the detection system by using other channels. This approach helps to shortlist processes that involve sensitive data; combined with anomaly detection systems, it can detect advanced hackers who use a legitimate program to commit data exfiltration. To efficiently monitor sensitive text‐based files in the memory space of the running processes, the Fast lookup Bag‐of‐Words (FBoW) method will be discussed in detail. This method transforms a text document into a BoW sequence and then builds a Deterministic Finite Automaton (DFA) to match content in the random access memory (RAM) with the database of sensitive text documents. Experimental results show that FBoW is the most scalable method compared to other state‐of‐the‐art pattern‐matching methods as the size of the sensitive data increases. Specifically, it uses 31–400 times less memory than the Aho–Corasick method, with a trade‐off of less than a 2% drop in the detection accuracy for the non‐memory‐based dataset. When tested with memory‐based datasets, FBoW distinctively outperformed the state‐of‐the‐art methods in terms of memory efficiency, runtime, and robustness.
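A minimal sketch can make the bag‐of‐words matching step more concrete. It is deliberately simpler than FBoW: an inverted index (a plain dictionary) replaces the DFA described above, and the document names, sample contents, and the 0.5 threshold are hypothetical.

```python
import re
from collections import Counter, defaultdict

def bag_of_words(text):
    """Lower-case the text and count its alphanumeric tokens."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def build_index(docs):
    """Map each term to the sensitive documents that contain it.

    An inverted index is used here instead of a DFA (an assumption of
    this sketch, not the FBoW data structure).
    """
    index = defaultdict(set)
    sizes = {}
    for doc_id, text in docs.items():
        bow = bag_of_words(text)
        sizes[doc_id] = len(bow)  # number of distinct terms
        for term in bow:
            index[term].add(doc_id)
    return index, sizes

def match(memory_text, index, sizes, threshold=0.5):
    """Return documents whose distinct terms are covered at or above `threshold`."""
    hits = Counter()
    for term in bag_of_words(memory_text):
        for doc_id in index.get(term, ()):
            hits[doc_id] += 1
    return {d: hits[d] / sizes[d] for d in hits
            if hits[d] / sizes[d] >= threshold}

# Hypothetical sensitive documents and a noisy string pulled from memory.
docs = {
    "salary": "alice salary 90000 bonus 5000",
    "plan": "merger plan with acme next quarter",
}
index, sizes = build_index(docs)
print(match("xx\x00 alice salary 90000 zz bonus qq", index, sizes))
```

Because word order and surrounding noise are ignored, the match survives the fragmentation and stray bytes that are typical of text recovered from a memory image.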
Finally, this book will look at one of the most challenging data exfiltration problems, namely temporal data exfiltration. A sophisticated adversary could delay the data‐stealing activity by exfiltrating tiny pieces of information over a long period instead of transferring a lot of information at once. Although one can detect small fractions of sensitive data in the RAM, if a piece of information is too small, the detection could return a false‐negative result. This research is the very first attempt to mitigate temporal data exfiltration by proposing a novel Temporary Memory Bag‐of‐Words (TMBoW) method, which combines SDR and BoW representation to efficiently detect the temporal patterns of the time‐delayed data exfiltration. Experimental results show that TMBoW has 100% accuracy when the minimum detection threshold is only 0.5, and the analytical result shows that the probability of TMBoW reporting a false alarm is approximately zero.
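The time‐delayed scenario can be sketched as follows. The sketch only illustrates why evidence has to be accumulated across memory snapshots; it does not implement SDR or the TMBoW method itself. The class name, the sample document, and the coverage rule are assumptions, and the 0.5 threshold simply mirrors the value quoted above.

```python
import re

def terms(text):
    """Set of lower-cased alphanumeric tokens in the text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

class TemporalMonitor:
    """Accumulates sightings of sensitive terms across memory snapshots.

    Hypothetical helper for illustration; plain Python sets are used in
    place of sparse distributed representations.
    """

    def __init__(self, docs, threshold=0.5):
        self.doc_terms = {d: terms(t) for d, t in docs.items()}
        self.seen = {d: set() for d in docs}
        self.threshold = threshold

    def observe(self, snapshot_text):
        """Record one snapshot and return documents whose coverage reached the threshold."""
        found = terms(snapshot_text)
        alerts = []
        for doc_id, vocab in self.doc_terms.items():
            self.seen[doc_id] |= vocab & found
            if vocab and len(self.seen[doc_id]) / len(vocab) >= self.threshold:
                alerts.append(doc_id)
        return alerts

# Each snapshot alone holds too little of the document to raise an alert.
monitor = TemporalMonitor({"design": "wing span engine thrust radar range"})
print(monitor.observe("wing span"))      # coverage 2/6
print(monitor.observe("noise engine"))   # coverage 3/6
```

A per‐snapshot detector with the same threshold would miss both fragments, which is the false‐negative case described above.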
The high penetration of data‐centric services has markedly increased the risk of exposing sensitive customer and corporate data. In particular, the critical infrastructure (CI), information technology (IT), and mobile computing sectors are constantly targeted by sophisticated adversaries, insiders, or bribed workers who launch attacks using advanced malware and hacking tools [1]. The main purpose of these attackers is to gain long‐term access to a system and steal critical sensitive data from the enterprise. This causes a data breach or data leakage, also known as data exfiltration, which inflicts huge losses every year on a wide range of industries, including technology giants such as Google, Facebook, and Tesla. Google alone stores an enormous volume of sensitive data derived from sources worldwide [2]. Despite the Covid‐19 pandemic, in 2020, of the reported 32,002 incidents, 3950 involved data breaches [3], of which 86% were financially motivated. Recent outbreaks of ransomware are good examples of new data exfiltration‐based attacks for the purpose of financial gain [4]. Not only are attack methods becoming increasingly sophisticated, but most of the advanced hacking is conducted by state‐sponsored hackers [5]. Moreover, the consequences of data leakage could pose a critical threat to the security and privacy of users, particularly those who work for the government or the military. For instance, the recent leakage of the subcontractor database of the Australian Defence Force had severe implications for national security as the design of military combat aircraft was compromised [6].
Intruders take the opportunity to exploit the unknown vulnerabilities of security systems in order to penetrate an organization. The Stuxnet worm [7] is one of the most well‐known attack tools; it was created by the intelligence agencies of the United States and Israel and was intended to thwart the Iranian nuclear development program. Indeed, this worm could be classified as a military‐grade weapon considering its origin and sophistication. Stuxnet exploited several undiscovered bugs in the Windows operating system to wirelessly spread itself and install a rootkit on Siemens programmable logic controllers (PLCs), which were manipulated and used to destroy the delicate equipment at the targeted nuclear power plant. Another example of a widespread vulnerability in security systems is the OpenSSL Heartbleed bug [8]. There is no clear evidence indicating whether the bug was discovered and used by hackers before 2014, or for how long. Although the protocol design is mathematically robust, an implementation bug was accidentally introduced into the widely used OpenSSL cryptography library and remained there for about two years before its disclosure in 2014. Since OpenSSL is at the core of Transport Layer Security (TLS) encryption, vital services such as secure web pages, email, and the secure shell protocol were affected. Unfortunately, the security patching process took several months or more to fix the vulnerability in several million devices worldwide. This resulted in several data breach incidents, such as the breach at Community Health Systems (CHS), a giant private hospital chain in the United States, and the Canada Revenue Agency (CRA) incident that impacted millions of Canadian taxpayers.
Since attack prevention measures might not be adequate, the detection‐based approach plays a crucial role in minimizing damages resulting from data breaches. Generally, a malicious program (known as malware) is the primary tool employed by adversaries to help them gain access to a system or even automatically exfiltrate sensitive information. In spite of numerous malware detection approaches proposed in the literature, few works focus on data exfiltration. On the other hand, the studies that propose solutions for data breach detection focus only on a single data leakage channel: network monitoring, unsafe data exportation, or user authorization, to name a few. Such atomistic solutions give attackers opportunities to exfiltrate sensitive data via alternative channels. This suggests the need for a holistic data exfiltration detection approach. Hence, this research investigates several methods for detecting data exfiltration, which is crucial for various corporate sectors as the modern world is evolving toward the adoption of data‐centric services.
Generally, off‐the‐shelf antivirus scanners primarily use known signatures to detect malicious programs. The signature‐based solution has a very low false alarm rate, as it uses a hash of the program or signature strings contained in the malware binary to match with the virus signature. However, advanced hackers could develop a new unseen malware for the well‐protected target or even use the benign program to steal the data. Furthermore, some sophisticated malware does not need to be installed on the disk storage at all (e.g. Code Red worm [9]). Instead, this malware could run in the memory to perform the malicious activity for the entire time. This is where the real‐time behavior detection method plays an important role. However, detecting a nefarious purpose from a series of the program's actions is challenging, as there are too many possibilities that the program's actions infer the malicious transaction, especially in a real‐time context. Hence, behavior‐based data exfiltration is one issue that needs a careful and systematic investigation.
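The signature‐matching logic described above, and the reason a previously unseen variant slips past it, can be shown in a short sketch. The function names, the sample bytes, and the signature string are hypothetical; only the standard‐library `hashlib` call is real.

```python
import hashlib

def sha256_hex(data: bytes) -> str:
    """Hex digest used here as the whole-file signature."""
    return hashlib.sha256(data).hexdigest()

def is_known_malware(binary: bytes, known_hashes, signature_strings=()):
    """Match a binary against known file hashes, then against byte-string signatures."""
    if sha256_hex(binary) in known_hashes:
        return True
    return any(sig in binary for sig in signature_strings)

# Hypothetical sample and a variant that differs by a single appended byte.
sample = b"MZ...evil_payload_v1"
known_hashes = {sha256_hex(sample)}
variant = sample + b"\x00"

print(is_known_malware(sample, known_hashes))                      # exact copy
print(is_known_malware(variant, known_hashes))                     # hash no longer matches
print(is_known_malware(variant, known_hashes, [b"evil_payload"]))  # string signature still hits
```

Exact matching keeps false alarms low, but any change to the binary defeats the hash, and a benign program misused to steal data carries no signature at all; this is the gap that behavior‐based detection is meant to cover.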
Apart from using malware to steal sensitive data, some attackers might simply exploit the benign program to exfiltrate the sensitive information. In some cases, the monitoring system cannot detect the unforeseen malware. Hence, the malicious behavior‐monitoring approach alone might not be sufficient. Therefore, by taking different perspectives, we may be able to obtain a list of processes used to access sensitive data, which could cause data leakage. To do so, one will need to search for sensitive data in the memory space of those processes. This will allow sensitive files to be read, and other inputs such as keystrokes containing sensitive keywords, to be detected. Ultimately, all data in the computer system needs to be loaded into the main memory before it can be processed; hence, the adversary cannot avoid having the sensitive data in the main memory. Moreover, the data can remain in the memory even if the process has already been terminated [10]. Hence, if those processes can be listed, it will be easier to narrow down and identify the root cause of the data breach regardless of whether the program is classified as malicious or benign. However, to the best of the authors' knowledge, this idea has not been closely examined by previous studies, and therefore this book will elucidate new ways of addressing such issues.
A sophisticated adversary could obfuscate the data exfiltration even more by using a temporal attack [11] to evade the detection system. Here, the data‐stealing activity is delayed so as to trick the monitoring system into “thinking” that it is just a false alarm (false positive). In other words, the hacker could minimize the chance of being discovered by the monitoring system by stealing a minuscule amount of sensitive data at a time. Over a period of time, the attacker could reassemble those small amounts to form the original sensitive file. Even though these time‐delay attacks have been reported for over a decade, few researchers have attempted to address the issue (e.g. [11, 12]). If this method is used to penetrate the critical systems of government departments or the military, the consequences will be devastating. Hence, this book looks at the data exfiltration detection issue by holistically approaching the problem from several perspectives, namely by: (i) examining the program's behavior, (ii) monitoring the sensitive data that a program is accessing, and (iii) monitoring the collective activities of a process related to fractions of sensitive information being collected over a period of time.
The prevention of data exfiltration is a broad and complex issue. To examine this issue more closely, current methods can be categorized into four areas: (i) state‐of‐the‐art survey, (ii) behavior‐based data exfiltration solution, (iii) memory‐based data exfiltration solution, and (iv) temporal‐based data exfiltration prevention. The shortcomings of each of these methods are discussed here.
To begin with, this book surveys technologies that have a high potential to be used as fundamental building blocks for data leakage prevention (DLP) methods; data exfiltration prevention solutions share core technologies with intrusion detection systems (IDSs). While an IDS is a standard measure to protect computer systems from outsider and insider attacks, DLP is a more specialized and advanced security solution that can provide better protection against security breaches. DLP aims to detect abnormal access to sensitive data, and this is based on the use of either machine learning (ML) or temporal reasoning algorithms.
Industrial control systems (ICSs), especially supervisory control and data acquisition (SCADA) systems, are a constant target of cyberattacks, as their inability to function can have a serious impact on people's daily lives. This poses a difficult problem for the security community: finding solutions that can protect the essential services of ICSs so that they continue to provide the required functionalities. IDSs, when tailored to deal with specific requirements (e.g. real‐time detection and reliability), can provide appropriate protection for ICSs.
Despite several attempts to address SCADA security, only a few of the studies reported in the literature have focused on the development of appropriate ML solutions for SCADA systems. Many researchers have examined, from different perspectives, the design of specific IDSs for CIs. For example, the survey conducted in [13] focused on the design of IDS architecture for SCADA systems. On the other hand, the work in [14] looked at classifying IDS solutions based on detection methods and audit materials. In [15], a review is conducted to gather information about the testbeds used for ICSs. Therefore, it is crucial to conduct a survey of supervised ML methods with a focus on CI, as such a survey has the potential to contribute to the research on efficient methods and technologies for the prevention of data leakage.
Malware has become one of the most effective tools for searching and stealing users' sensitive information. Over several decades, numerous malware detection methods have been proposed in the literature. However, only a few current malware detection solutions focus directly on the detection of data exfiltration; instead, they are used to detect different types of virus/malicious programs. Intuitively, if a Trojan horse software accesses sensitive data, it is highly likely to cause data leakage. Hence, observing the behavior of all running processes (to identify potential malware) is the key to detecting data breaches.
Nevertheless, in most cases [16], data is stolen by malware but remains undetected for two reasons [17]: the signature of the malware is new to antivirus programs, and the anomaly‐based systems cannot detect variations in the malware behavior and therefore cannot differentiate between legitimate and exfiltration activities. Based on a sequence of application programming interface (API) calls invoked by a process, the hidden Markov model (HMM) has been used extensively to discriminate between malicious and benign behavior (e.g. [18–20]). However, the dynamic behavior analysis of malware can generate a very long sequence of API calls. For example, in the experiments carried out by our team, Keylogger malware (MD5 hash value d4259130a53dae8ddce89ebc8e80b560) generated more than 300,000 series of API calls in less than two minutes. Also, HMM seems to perform poorly with such long sequences of API calls. This allows exfiltration‐based malware to pose as benign software for most of the time and perform malicious activities only for a very short period of time to avoid detection. Hence, the effect of the sequence length on the detection accuracy, which has not been fully studied in the context of malware detection, will be examined in more detail in this book.
The detection of suspicious activities is one approach used to prevent data exfiltration. However, the behavior‐monitoring approach has two significant limitations. (i) It covers only certain types of malicious activity (e.g. searches for user identity/credentials, installation of a keystroke‐capturing service, and attempts to contact or export the user's documents to the malicious server), whereas the adversary could pass on the data through several legitimate services by, for instance, attaching the sensitive file to an email or placing it in the email content, inserting the sensitive data into an untrusted database server, or storing the sensitive data in files that will be exported later. (ii) The behavior‐monitoring system cannot totally guarantee the detection of all suspicious incidents. Therefore, one needs a detection method that monitors the system from different angles. Specifically, the idea of monitoring the random‐access memory (RAM) for the sensitive data has great potential as a solution that protects the system from attacks coming from nearly all channels. Intuitively, any computer program needs to load the stored sensitive data into the memory before any processing activity can be conducted. In other words, RAM is the sole gateway for the processing of computer data, including the sensitive data that needs protection.
The monitoring of sensitive data in the memory is, however, a challenging task, since the size of the sensitive data to be monitored (i.e. patterns) could be large, in some cases even larger than the size of the main memory itself. Existing text‐searching methods, e.g. Smith–Waterman [21], Aho–Corasick [22], and regular expressions [23], are designed for searching a larger database using small patterns. Unfortunately, existing string‐matching methods are not designed for the monitoring of a large set of sensitive documents. In the context of this book, the searching patterns could be bigger than the size of the memory. Therefore, a new search approach will be provided here to allow a large database of searching patterns to be summarized/compressed into a smaller representation. This will enable more sensitive files to be monitored in real time, particularly in the memory.
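The idea of summarizing a large pattern set into a smaller representation can be sketched with a Bloom filter over word n‐grams. This is a generic technique chosen purely for illustration and is not the search approach developed in this book; the class `BloomFilter`, the helpers `shingles` and `match_ratio`, the filter size, and the sample document are all assumptions.

```python
import hashlib


class BloomFilter:
    """Fixed-size bit array that records set membership approximately."""

    def __init__(self, num_bits: int = 1 << 16, num_hashes: int = 4):
        # num_bits is assumed to be a multiple of 8.
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, item: str):
        # Derive several bit positions by salting SHA-256 with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


def shingles(text: str, n: int = 3) -> list[str]:
    """Split text into overlapping word n-grams."""
    words = text.lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def match_ratio(memory_text: str, bloom: BloomFilter, n: int = 3) -> float:
    """Fraction of n-grams in memory_text that the filter reports as sensitive."""
    grams = shingles(memory_text, n)
    if not grams:
        return 0.0
    return sum(g in bloom for g in grams) / len(grams)


# Hypothetical sensitive document, summarized into an 8 KiB filter.
sensitive_doc = "patient john doe was diagnosed with condition x on admission"
bloom = BloomFilter()
for gram in shingles(sensitive_doc):
    bloom.add(gram)

print(match_ratio(sensitive_doc, bloom))
print(match_ratio("the weather today is sunny and mild outside", bloom))
```

The filter keeps a fixed size regardless of how many n‐grams are added, at the cost of occasional false positives; it never reports an added n‐gram as absent.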
There are several challenges to be addressed to build such a system. Firstly, sensitive data can include various kinds of information and is therefore not limited to personal information. It can also include organizational information such as hospital patient medical records, details of government CI sites, corporate intellectual property, and internal organizational information. Sensitive data can also come in various formats, for example, a corpus of text documents (e.g. ASCII, Unicode UTF‐8/16/32) and database files (e.g. comma separated values, spreadsheet, and JavaScript Object Notation). Dealing with different formats of sensitive data indeed requires different monitoring methods. As the type of data searched is directly related to the design of a monitoring algorithm, the discussion here will first focus on sensitive data that is a corpus of text documents, given that the text is pre‐decoded (i.e. using standard Unicode) from the raw memory data/snapshot.
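Because the same text can appear in memory under different Unicode encodings, even a simple keyword search has to account for each of them. The sketch below searches a simulated byte buffer for a keyword encoded as UTF‐8, UTF‐16, and UTF‐32 (little‐endian). The buffer contents, the `ENCODINGS` tuple, and the `find_keyword` function are assumptions for illustration; no real process memory is read.

```python
# Encodings to try; little-endian variants are assumed for this sketch.
ENCODINGS = ("utf-8", "utf-16-le", "utf-32-le")


def find_keyword(snapshot: bytes, keyword: str) -> list[tuple[str, int]]:
    """Return (encoding, byte offset) for every occurrence of keyword."""
    hits = []
    for enc in ENCODINGS:
        needle = keyword.encode(enc)
        pos = snapshot.find(needle)
        while pos != -1:
            hits.append((enc, pos))
            pos = snapshot.find(needle, pos + 1)
    return hits


# Simulated memory snapshot: noise, a UTF-16 string, more noise, an ASCII string.
snapshot = (
    b"\x00\x07noise"
    + "secret".encode("utf-16-le")
    + b"\xff\xfe"
    + b"top secret plan"
    + b"\x00" * 4
)

for encoding, offset in find_keyword(snapshot, "secret"):
    print(encoding, offset)
```

Searching for the encoded needle avoids decoding the whole snapshot, which would fail on the non‐text bytes that surround the strings of interest.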
Much research work has focused on monitoring sensitive data using various methods, such as creating and tracking data signatures of sensitive documents [24, 25], tagging sensitive files with additional meta‐data [26, 27], or classifying/detecting sensitive documents using advanced ML methods [12, 28, 29]. However, sophisticated espionage might use highly advanced methods to hide malicious software from being detected while still being able to spy on a victim.
Temporal‐based data exfiltration is one of the hardest attack detection problems and remains one of the most challenging threats [11], because a spy software keeps collecting small fragments of sensitive information over an interval of time (hence, a temporal attack). Consequently, a malicious process could avoid being detected by the security system. Only a few solutions have dealt with temporal data exfiltration attacks by, for example, monitoring the network traffic over a period of time [12]. Interestingly, the hierarchical temporal memory (HTM) technology, which mimics the behavior of neocortex cells in the human brain [30], has been able to recognize a pattern in temporal input data. This technology has not yet been investigated fully to determine whether it can be applied to memory‐based data leakage patterns to detect temporal‐based data exfiltration. At this stage, related work attempting to detect temporal data exfiltration attacks cannot be found, especially by monitoring the RAM's sensitive data, which is also the main focus of this book.
Figure 1.1 Overview of data exfiltration problems.
This book comprehensively addresses the data exfiltration issues by answering the following four important questions, as depicted in Figure 1.1.
What is a survey and benchmark of existing methods?
How does a process behave, or how can the malicious behavior of a process be detected?
What information does the process have access to, or how can one efficiently detect whether any process has access to the sensitive data?
What is the meaning of a series of a process's actions over time, or how can temporal data exfiltration be detected efficiently?
A detailed description of each data exfiltration problem is given in the following section.
(Q1) What are the limitations of existing supervised learning methods for data leakage prevention (DLP) and/or intrusion detection system (IDS) for industrial control systems (ICSs)?
To address the first problem, an extensive literature survey is conducted to find and investigate the state‐of‐the‐art technologies that have been used in data leakage prevention and/or IDSs. Over the past decade, various ML methods have been used to train artificial intelligence to detect abnormal behaviors, including data exfiltration. Supervised machine learning (supervised ML) methods are more common in anomaly detection systems than unsupervised methods (known as clustering). This is because the patterns of the attack vector are known. To contribute to the current knowledge regarding supervised ML for DLP/intrusion detection, a survey is conducted to examine existing supervised ML technologies that were proposed for anomaly detection systems, including data exfiltration and other cyberattacks. Specifically, one will need to: (i) categorize existing supervised ML methods based on detection methods and auditing material; (ii) identify system‐specific requirements of supervised ML; and (iii) benchmark supervised ML methods from a holistic perspective.
(Q2) How to effectively detect data‐stealing behaviors from the sequence of API instructions of a process?
Detecting the malicious activities of a specific running process is a challenging task, particularly when we need to differentiate between patterns of malicious and legitimate processes using a sequence of API calls. Unlike text or image processing, where all features are observed and extracted by examining the whole picture, a running process can act differently depending on the underlying stage or task it is performing. Therefore, the program behavior, which changes expeditiously, could easily hide fractions of malicious instructions over a brief period of execution time. Hence, it is very challenging for malware detection technology to detect malicious behavior that is carried out within seconds or milliseconds. Therefore, detecting data exfiltration behavior will require a new, tailored method and algorithm. This will focus on the feature extraction problem, in particular for the long sequence of API calls that indicates the stream of ongoing activities currently being executed by the program.
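One generic way to keep a short burst of activity from being diluted by a long trace is to extract features per sliding window rather than over the whole sequence. The sketch below counts API‐call bigrams in overlapping windows. It is an illustrative assumption rather than the feature extraction method proposed in this book; the function `windowed_ngram_counts`, the window sizes, and the hypothetical trace are made up for the example (the call names are real Windows API functions, but the trace itself is not taken from any malware sample).

```python
from collections import Counter
from typing import Iterable, Iterator


def windowed_ngram_counts(calls: Iterable[str], window: int = 6,
                          step: int = 3, n: int = 2) -> Iterator[Counter]:
    """Yield one Counter of API-call n-grams for each sliding window."""
    buf = list(calls)
    # At least one window is produced, even for traces shorter than `window`.
    for start in range(0, max(len(buf) - window + 1, 1), step):
        chunk = buf[start:start + window]
        yield Counter(tuple(chunk[i:i + n])
                      for i in range(len(chunk) - n + 1))


# Hypothetical trace: routine file access with a brief keylogging-like burst.
benign = ["CreateFileW", "ReadFile", "CloseHandle"]
trace = benign * 4 + ["GetAsyncKeyState", "WriteFile", "send"] + benign * 4

suspicious = ("GetAsyncKeyState", "WriteFile")
windows = list(windowed_ngram_counts(trace))
flagged = [i for i, counts in enumerate(windows) if counts[suspicious] > 0]
print(len(windows), flagged)
```

Over the whole trace the suspicious bigram is one of 26, whereas inside a flagged window it is one of five, so a per‐window classifier sees a much stronger signal.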
(Q3) How to efficiently inspect the sensitive information on the random‐access memory space to which a process has access?
This problem is related to the monitoring of data in the memory space of a process in order to detect the presence of sensitive information. In a computer system, data needs to be loaded into the RAM before it can be processed or transferred; hence, malicious software will not be able to hide from the detection system if it is accessing sensitive information. Thing et al. [10] suggest that attackers have even tried to evade the memory forensic tool itself to prevent their malicious software from being detected. Even though data in the RAM is changing all the time, the unused data will stay in the RAM until the address space is recycled by the operating system. For instance, Thing et al. [10] captured up to 75–100% of instant messaging conversations by analyzing the mobile phone memory. Similarly, Wang et al. [31] identified up to 98.6% of kernel rootkits by using memory images. The monitoring of sensitive data in the RAM will therefore be a more robust way of mitigating various data exfiltration incidents. However, monitoring memory‐based data is challenging in terms of scalability, especially when there is a large number of sensitive documents. The robustness of the method is also an important issue, particularly when the monitoring system directly accesses data from the program's memory.
(Q4) How to efficiently mitigate sophisticated temporal‐based data exfiltration?
The last problem relates to a particularly sophisticated attack, namely temporal‐based data exfiltration. This threat might occur when the target information is highly protected, such as military or business databases. Despite the tight surveillance, the hacking bot steals sensitive documents from the compromised machine by splitting them into chunks. The bot then exports pieces of the sensitive data one by one through single or multiple data leakage channels. Even though the small fraction of sensitive data can still be detected, the matching probability could be lower than the detection threshold and considered as a false negative (i.e. not detected). This enables attackers to evade the detection system by collecting several pieces of sensitive data over a period. The hackers then reconstruct the pieces of data to obtain the complete original sensitive file. Furthermore, the delay of data exfiltration action could also return a false positive (i.e. false alarm) when the small part of sensitive data is detected repeatedly. For instance, if a fraction from the sensitive text file is found twice, it could be the same instance that has been detected previously or the new data that has just been loaded into the memory. Unfortunately, the current DLP technology is not adequate in preventing temporal data exfiltration.
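The evasion described above, and one possible countermeasure, can be sketched as follows. Each observed fragment is scored on its own and also accumulated per process, so that fragments that stay below the threshold individually can still trigger an alert collectively, while a repeated fragment is not counted twice. This is a minimal sketch under stated assumptions and not the detection method developed in this book; the class `CumulativeLeakTracker`, the word‐trigram matching, the threshold, and the sample text are all hypothetical.

```python
def word_ngrams(text: str, n: int = 3) -> set[tuple[str, ...]]:
    """Return the set of overlapping word n-grams in text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


class CumulativeLeakTracker:
    """Track how much of one sensitive document each process has touched."""

    def __init__(self, sensitive_text: str, threshold: float = 0.5, n: int = 3):
        self.n = n
        self.threshold = threshold
        self.reference = word_ngrams(sensitive_text, n)
        self.seen: dict[int, set] = {}  # process id -> matched n-grams so far

    def observe(self, pid: int, fragment: str) -> tuple[float, float, bool]:
        """Return (single-fragment score, cumulative score, alert flag)."""
        matched = word_ngrams(fragment, self.n) & self.reference
        single = len(matched) / len(self.reference)
        seen = self.seen.setdefault(pid, set())
        seen |= matched  # set union: repeated fragments add nothing new
        cumulative = len(seen) / len(self.reference)
        return single, cumulative, cumulative >= self.threshold


# Hypothetical sensitive document (12 words, 10 trigrams).
doc = "merger with acme corp will be announced on the first of june"
tracker = CumulativeLeakTracker(doc, threshold=0.5)

print(tracker.observe(42, "merger with acme corp will"))  # first chunk
print(tracker.observe(42, "merger with acme corp will"))  # same chunk again
print(tracker.observe(42, "corp will be announced on"))   # next chunk
```

No single chunk covers more than three of the ten trigrams, so a per‐fragment threshold of 0.5 never fires; the cumulative score crosses it once the second distinct chunk arrives.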
The main purpose of this book is to address the problems stated in Section 1.2: exhaustive survey (Q1) and specialized data exfiltration detection methods (Q2–Q4). For (Q1), state‐of‐the‐art supervised ML methods are surveyed, and the intrusion detection methods/systems proposed in the literature over the past decade are described. Existing supervised‐ML‐based methods are categorized and evaluated against specific requirements, namely those of CI systems (e.g. SCADA), as such systems are used in our daily life (i.e. they are critical) and therefore need to be fully, strongly, and properly protected. Ultimately, such an assessment and evaluation of (critical) systems will help the reader understand the key research challenges and ideas about the use of supervised ML methods in IDSs, in particular, for critical systems. On the other hand, Q2–Q4 relate to data exfiltration detection methods from three different perspectives, namely behavior‐based (Q2), memory‐based (Q3), and temporal‐based (Q4). Since they are all technical methods that share similarities with existing data leakage/intrusion/malware detection methods, the scope of the Q2–Q4 questions is summarized by comparing them with the mentioned methods based on the following categorization criteria: (i) the detection source, (ii) detection method, and (iii) input characteristics. Figure 1.2 depicts such a categorization.
Figure 1.2 Scope of the work for Q2–Q4 compared to related work based on three different listed aspects.
To address Q2, the goal is to detect the program's suspicious behavior, in particular, the data scraping and stealing actions. The sequence of API calls is chosen as the detection source because it indicates the ongoing activities of a program. This is known as a heuristic approach and is very flexible compared to the static method. In the behavior‐based model, the number of command sequences issued by a program to steal sensitive data could be very large, making it impossible to find all possible signatures of the malicious command sequences. Also, the signature‐based approach suffers from zero‐day attacks and from variants of polymorphic malware whose signatures are unknown. Indeed, generating a new signature takes considerable time and budget. Therefore, the heuristic‐based approach is being developed, facilitated by artificial intelligence and data mining technology.
Q3 and Q4, on the other hand, refer to signature‐based methods. The explanation is as follows: to detect sophisticated malware or a corrupt employee exporting sensitive data using ordinary software, the sensitive data will be monitored regardless of whether or not the program is classified as malicious. The main advantage of the signature‐based method is that it has a very high true positive rate compared to the heuristic‐based method. Here, the static signature or fingerprint is used instead of the generic pattern used by the heuristic approach. However, because sensitive data is text‐based (i.e. unformatted text), the signature‐based method is not scalable. For instance, when the size of the signature database is large, the memory footprint and runtime could be affected. Moreover, the signature‐based method becomes less effective when the input data contains too much noise. In this case, Q3 and Q4, which relate to the monitoring of data leakage by examining the physical memory, will definitely need to deal with noise in the data extracted from the RAM (Chapter 7). These scalability and robustness issues are primary targets of the research addressing Q3 and Q4.
By addressing the four problems, this book describes new methods that address various limitations of current state‐of‐the‐art methods that prevent them from detecting data exfiltration incidents from the four different aspects mentioned earlier.
Data security refers to preserving data against unwanted access, corruption, or theft across all stages of its life cycle. It covers every aspect of security, including the logical security of applications, administrative and access controls, and the physical security of hardware and storage devices. Specific policies and procedures are used to guarantee data security. However, data can still be vulnerable to various attacks and threats. Data security threats refer to activities that have the potential to compromise the confidentiality, integrity, and availability of the data and can therefore cause considerable damage and harm to organizations. Confidentiality refers to ensuring that the data is kept secret or private, and access to it must be controlled to prevent any malicious or accidental unauthorized sharing of data. Integrity refers to ensuring that the data is reliable and immutable. Availability means that the data should be available to users throughout its entire life cycle. Chapter 3 aims to cover data security threats in detail, and more specifically it will cover the best‐known cyberattacks, e.g. Malware, denial of service (DoS), SQL Injection, Emotet (malspam), Man in the Middle (MITM), Password Attacks, and Social Engineering & Phishing. This will provide readers with a good understanding of existing cybersecurity threats.
Studying data security can be seen as “boring” (and thus not convincing) when the focus is only on the theoretical aspects and concepts of cybersecurity. Through the description of some real‐world use cases, this study will hopefully convince readers of the importance of properly understanding and dealing with cyberattacks. Thus, to make this book more interesting to readers, Chapter 4 will study several use cases of data leakage attacks that occurred over the last three years across all the continents. More specifically, it first introduces the cyberattack types and categorizes them based on the objectives that attackers intend to accomplish. These cyberattacks (e.g. ransomware, server access, business email compromise (BEC), data theft, credential harvesting, remote administration Trojan (RAT), misconfiguration, and malicious insider) are all grouped based on the attackers' ultimate goals. Ransomware is considered the most common attack of the last three years. Server access attacks, which occur when an attacker gains unauthorized access to a server, rank second in frequency. The third most frequently used attack is BEC. After discussing the cyberattack types in detail, the chapter explains the initial infection vectors, i.e. the methods through which a network is breached and compromised. It is important to note that cyberattack types and initial infection vectors are two different concepts and should not be confused. Phishing, stealing credentials, and exploiting vulnerabilities are considered initial infection vectors. Reports show that in 2021, phishing alone reached its highest recorded figure of 222,127. The effect of this attack vector can be reduced by monitoring suspicious connections.