Data Analytics and Big Data - Soraya Sedkaoui - E-Book

Soraya Sedkaoui

Description

The main purpose of this book is to investigate, explore and describe approaches and methods that facilitate data understanding through analytics solutions, based on their principles, concepts and applications. Analyzing data also involves the use of software. For this reason, and in order to cover some aspects of data analytics, this book uses software (Excel, SPSS, Python, etc.) that can help readers better understand the analytics process in simple terms and that supports useful methods in its application.

Page count: 245

Year of publication: 2018

Table of Contents

Cover

Preface

Introduction

Why this book?

Whom is this book for?

Organization of the book

Part 1: Towards an Understanding of Big Data: Are You Ready?

1 From Data to Big Data: You Must Walk Before You Can Run

1.1. Introduction

1.2. No analytics without data

1.3. From bytes to yottabytes: the data revolution

1.4. Big data: definition

1.5. The 3Vs model

1.6. Why now and what does it bring?

1.7. Conclusions

2 Big Data: A Revolution that Changes the Game

2.1. Introduction

2.2. Beyond the 3Vs

2.3. From understanding data to knowledge

2.4. Improving decision-making

2.5. Things to take into account

2.6. Big data and businesses

2.7. Conclusions

Part 2: Big Data Analytics: A Compilation of Advanced Analytics Techniques that Covers a Wide Range of Data

3 Building an Understanding of Big Data Analytics

3.1. Introduction

3.2. Before breaking down the process… What is data analytics?

3.3. Before and after big data analytics

3.4. Traditional versus advanced analytics: What is the difference?

3.5. Advanced analytics: new paradigm

3.6. New statistical and computational paradigm within the big data context

3.7. Conclusions

4 Why Data Analytics and When Can We Use It?

4.1. Introduction

4.2. Understanding the changes in context

4.3. When real time makes the difference

4.4. What should data analytics address?

4.5. Analytics culture within companies

4.6. Big data analytics application: examples

4.7. Conclusions

5 Data Analytics Process: There’s Great Work Behind the Scenes

5.1. Introduction

5.2. More data, more questions for better answers

5.3. Next steps: do you have an idea about a “secret sauce”?

5.4. Disciplines that support the big data analytics process

5.5. Wait, it’s not so simple: what to avoid when building a model?

5.6. Conclusions

Part 3: Data Analytics and Machine Learning: the Relevance of Algorithms

6 Machine Learning: a Method of Data Analysis that Automates Analytical Model Building

6.1. Introduction

6.2. From simple descriptive analysis to predictive and prescriptive analyses: what are the different steps?

6.3. Artificial intelligence: algorithms and techniques

6.4. ML: what is it?

6.5. Why is it important?

6.6. How does ML work?

6.7. Data scientist: the new alchemist

6.8. Conclusion

7 Supervised versus Unsupervised Algorithms: a Guided Tour

7.1. Introduction

7.2. Supervised and unsupervised learning

7.3. Regression versus classification

7.4. Clustering gathers data

7.5. Conclusion

8 Applications and Examples

8.1. Introduction

8.2. Which algorithm to use?

8.3. The duo big data/ML: examples of use

8.4. Conclusions

Bibliography

Index

End User License Agreement

List of Tables

1 From Data to Big Data: You Must Walk Before You Can Run

Table 1.1. Data: from byte to yottabyte

2 Big Data: A Revolution that Changes the Game

Table 2.1. The seven additional Vs of big data

Table 2.2. Structured, semistructured and unstructured data

3 Building an Understanding of Big Data Analytics

Table 3.1. Traditional versus advanced analytics

4 Why Data Analytics and When Can We Use It?

Table 4.1. The leaders in the big data analytics field

6 Machine Learning: a Method of Data Analysis that Automates Analytical Model Building

Table 6.1. ML examples as illustrated by Mitchell

Table 6.2. Matrix in a table form

8 Applications and Examples

Table 8.1. Supervised versus unsupervised algorithms

Table 8.2. Algorithm application examples

List of Illustrations

1 From Data to Big Data: You Must Walk Before You Can Run

Figure 1.1. Example of structured data (in Excel table)

Figure 1.2. Example of text analytics

Figure 1.3. Data generated by the IoT

Figure 1.4. Data life cycle

Figure 1.5. Knowledge pyramid

2 Big Data: A Revolution that Changes the Game

Figure 2.1. The 3Vs that characterize big data

Figure 2.2. Valuating data to extract knowledge

Figure 2.3. Costs of storage and data availability (2009-2017)

3 Building an Understanding of Big Data Analytics

Figure 3.1. Types of analytics

Figure 3.2. The difference from another point of view

Figure 3.3. Big data analytics: the road for knowledge

4 Why Data Analytics and When Can We Use It?

Figure 4.1. Big data pipeline

5 Data Analytics Process: There’s Great Work Behind the Scenes

Figure 5.1. Loss of information

Figure 5.2. Evolution of the US presidential polls

6 Machine Learning: a Method of Data Analysis that Automates Analytical Model Building

Figure 6.1. Supervised and unsupervised ML algorithms

Figure 6.2. The skill set of a data scientist

7 Supervised versus Unsupervised Algorithms: a Guided Tour

Figure 7.1. Illustration of classification and regression

Figure 7.2. An example of clustering

Figure 7.3. The internal representation of the concepts: “face” and “cat” learned by an unsupervised algorithm

Figure 7.4. Savings level by income

Figure 7.5. Regression model

Figure 7.6. Classification types

Figure 7.7. Sigmoid function

Figure 7.8. Example of SVM. For a color version of this figure, see www.iste.co.uk/sedkaoui/data.zip

Figure 7.9. Decision tree example

Figure 7.10. Random Forest scheme

Figure 7.11. Example of a neural network

Figure 7.12. Principle clustering algorithm

Figure 7.13. Example of K-means clustering. For a color version of this figure, see www.iste.co.uk/sedkaoui/data.zip

8 Applications and Examples

Figure 8.1. Traditional learning versus transfer learning

Figure 8.2. Profile personalization example for “Good Will Hunting”

Figure 8.3. Evolution of the number of Netflix subscribers

Figure 8.4. Netflix net earnings

Figure 8.5. Amazon versus the major U.S. players (December 2016)

To “Ben M’hidi”

My idol and the soul of my homeland

Data Analytics and Big Data

Soraya Sedkaoui

First published 2018 in Great Britain and the United States by ISTE Ltd and John Wiley & Sons, Inc.

Apart from any fair dealing for the purposes of research or private study, or criticism or review, as permitted under the Copyright, Designs and Patents Act 1988, this publication may only be reproduced, stored or transmitted, in any form or by any means, with the prior permission in writing of the publishers, or in the case of reprographic reproduction in accordance with the terms and licenses issued by the CLA. Enquiries concerning reproduction outside these terms should be sent to the publishers at the undermentioned address:

ISTE Ltd, 27-37 St George’s Road, London SW19 4EU, UK, www.iste.co.uk

John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, USA, www.wiley.com

© ISTE Ltd 2018

The rights of Soraya Sedkaoui to be identified as the author of this work have been asserted by her in accordance with the Copyright, Designs and Patents Act 1988.

Library of Congress Control Number: 2018936255

British Library Cataloguing-in-Publication Data

A CIP record for this book is available from the British Library

ISBN 978-1-78630-326-4

Acknowledgments

“No guide, no realization”.

It is true that writing a book needs time, patience and motivation in equal measures. However, the use of analytics, the application of algorithms and uncovering the hidden patterns behind the data available today have always excited me. When we consider the opportunities offered by the big data universe, the power of analytics and what may be revealed by each byte of data, the effort involved to write this book must be doubled.

I would be remiss if I did not mention the excellent advice and additional motivation that I received from Professor Hans-Werner Gottinger and Professor Jean-Louis Monino, who helped me to shape my ideology on how big data analytics can be applied to generate value. Their guidance and useful advice helped me to pursue my ultimate dream of writing a book. Thank you for everything!

I must also acknowledge my beloved family: my mother, as I would not be doing this if it was not for her and the drive to make her proud of me; my sisters and brother (Saliha, Nadia, Zahra and Kamel) and, with special attention, Manel and Zaki, for their continuous encouragement, support and help in every step that I take. They provide me with the strength that I need to go forward. I am very grateful to have such a wonderful and supportive family; they are great people and without them, this book may not have been written.

Also, my sincere thanks go to my friends, who support me and understand that I do not have much time, and whose love and support I have been able to count on throughout my career and the development of this book.

Soraya

Preface

“If you can look into the seeds of time,

And say which grain will grow and which will not,

Speak then to me, who neither beg nor fear

Your favors nor your hate”.

Shakespeare, Macbeth, Act I, Scene III, 59–62.

This book treats the roots and the fruits of the movement that marks, affects and transforms any part of business and society. It is about the large amounts of data (the seeds of our time) that we are sowing and creating by simple contact with our connected objects or simple use of advanced IT tools and the value generation that we have to derive and reap, as Shakespeare suggests, through sophisticated methods and advanced tools.

By the time you read this book, even more data, of more different types, will have been produced. The issue is no longer the word “big” itself, but how to handle this “big” amount of structured and unstructured data, which cannot be managed with traditional tools, and how to deal with its diversity and velocity in order to generate value.

Therefore, this book is about “big data analytics”, which is probably nothing new in reality but has become one of the most exciting fields of our time. This field opens the way to new opportunities that have significantly changed the business playground.

We have probably noticed that “big” companies such as Google, Facebook, Apple, Amazon, IBM, Netflix and many other companies invest continuously in big data and analytics applications in order to take advantage of every data byte. Many companies have realized that knowledge is power, and in order to get this power they have to gather its source, which is data, and make sense of it.

However, with great power comes great responsibility! Thus, the mission of this book is to provide the reader with the different concepts and applications behind big data analytics, those that are necessary and most important in order to be familiar with the ways in which the data analytics process and its algorithms work, and how to use them.

Every chapter of this book is meant for readers who are looking to discover the importance of analytics tools and the pertinence of algorithm applications, and who have a critical view of how knowledge, or this “power”, is derived from data.

So, if you want to become a data analysis practitioner or a better problem solver, or even if you are considering a career in big data and joining the analytics arena, then this book is for you! If you are familiar with big data analytics techniques and Machine Learning (ML) algorithm applications and you want to enrich your knowledge and gain more insights into how they work, then this book will help you to put your knowledge into practice.

Also, if you are a novice in this field and you are seeking to develop your analytics ability, then this book is for you, too! It will provide you with a complete overview of the field. So, do not worry, because even if you are completely new to the big data universe, analytics techniques and ML algorithm applications, this book will change the way that you think about them. You will realize at the end of this book that it can be an exciting field for you, too.

By writing this book, I want to share my knowledge in the hope that the reader will embrace the opportunity offered by this practical and exciting context and focus on its applications. The necessary theoretical concepts behind big data analytics and ML will be simplified in order for the reader to understand how to make sense of data.

Before we dive into this universe, I say: “may the big data analytics power and ML algorithms’ relevance be with you”!

Dr. Soraya SEDKAOUI

March 2018

Introduction

It is quite natural for academics, who are continuously passionate about publishing and sharing their knowledge, to want to create something of their own from scratch.

It is true that writing a book is a huge investment of time and energy, but the most essential thing is to do great work. This book is an experiment in not starting from scratch; it is instead a “redesign” of my previous works related to the data analytics field.

The idea for this book was born in early 2017, after I was lucky enough to be part of many teaching programs, research endeavors and conferences. At that time, I told myself that it was time to write a book focused on “big data analytics”.

While writing this book, I assumed that the reader is familiar with some basic concepts and methods from statistics, linear algebra and mathematics. However, you do not have to worry: even if you have forgotten some or most of them, this book will help you to refresh your understanding of these concepts and methods.

So, if you want to understand big data analytics, its complexity, its promises and the applications of its models and mechanisms, as well as machine learning algorithms, then, whoever you are (student, manager, academic, etc.), welcome to this book!

But, remember that “I can only show you the door. You’re the one that has to walk through it”. (Morpheus, The Matrix)

Why this book?

Because data analytics has emerged as a trend in the business context, a first reflex is to think of it as a fast and furious phenomenon, or even as a kind of crystal ball that can predict all kinds of things with extraordinary precision. In the case of Google, Facebook and Amazon, as well as banks and insurers, the building of huge databases gives an increasingly central place to “big data analytics”.

Big data analytics has become an extremely important and challenging problem in disciplines such as computer science, biology, medicine, finance and homeland security. As massive amounts of data are available for analysis, scalable integration techniques become important.

Nowadays, companies are starting to realize the importance of using more data to support their strategic decisions. It has been said, and shown through case studies, that “more data usually beats better algorithms”.

Data sizes have been growing exponentially within many companies. Facing data of this size (meta-tagged piecemeal, produced in real time and arriving in continuous streams from multiple sources) is hard; analyzing it to spot patterns and extract useful information is harder still.

This includes the ever-changing landscape of data and their associated characteristics, evolving data analysis paradigms, challenges of computational infrastructure, data sharing and data access, and – crucially – our ability to integrate datasets and their analysis toward an improved understanding.

New methods and technologies are required to analyze and process these data. This need has motivated the treatment of big data analytics and machine learning algorithms in this book.

The objective is to give anyone who is curious an overview of big data analytics as a tool for applying new analytics methods and machine learning algorithms in order to process data and make more intelligent decisions.

Whom is this book for?

This book provides a basic introduction to big data analytics, data science and machine learning algorithms, which are being adopted and used more frequently, especially in businesses that are looking for new methods to develop smarter capabilities and tackle challenges in dynamic processes.

It will help those who are interested in developing a broad picture of the current context characterized by big data analytics and machine learning, and enable them to recognize the possible trajectories of future developments. It is also intended for those seeking to build a common set of concepts, terms, references, methods, applications and approaches in this area.

Organization of the book

“Paths are made by walking”.

Franz Kafka

The concepts behind big data analytics are actually nothing new. Organizations have always used descriptive, predictive and prescriptive analytics (business intelligence), and academics and researchers have been using data to analyze phenomena for many years. However, the amount of data available today and the emergence of the big data age in the early 2010s, which pose many challenges, are changing the data analytics arena.

The challenge, therefore, lies in the ability to extract value from the volume of data produced in real-time continuous streams, in multiple forms and from multiple sources. In other words, the key to exploring data and uncovering its secrets is to find and develop applicable ways to extract knowledge that can guide decision-making processes and business strategies.

This is what this book will explore by highlighting the contents in three parts.

The first part discusses the general context of the big data area and presents the corresponding state of the art. It offers, in Chapters 1 and 2, the general theoretical background and framework necessary to understand the rest of this book. This first part will cover the key challenges and benefits of big data. It provides a platform from which to proceed to different big data-related concepts and to how this phenomenon is changing business opportunities.

The second part contains three chapters (Chapters 3–5) dedicated to the data analytics process, which mainly focuses on how we can make sense of data, and on the essential tools and technologies for organizing, analyzing and benefiting from big data. It illustrates the power of advanced analytics and its wide range of applications by showing how it can be applied in order to solve fundamental data analysis tasks.

The three chapters of the third part (Chapters 6–8) introduce the main subareas of artificial intelligence (AI) and machine learning (ML). They discuss the essential ML algorithm families that can be used to tackle various problem tasks by giving a machine the ability to learn from data in order to better guide the model building paths.

Glossary

In order to attain a basic understanding of what big data analytics entails, it is necessary to provide and review the terms that shape a framework related to this field. This section introduces the concepts that are most associated with “big data analytics”.

Algorithm:

A set of computational rules to be followed to solve a mathematical problem. More recently, the term has been adopted to refer to a process to be followed, often by a computer.

Amazon Web Services (AWS):

This is a comprehensive, evolving cloud computing platform provided by Amazon.com. Web services are sometimes called cloud services or remote computing services. The first AWS offerings were launched in 2006 to provide online services for websites and client-side applications.

Analytics:

This has emerged as a catch-all term for a variety of different business intelligence (BI) and application-related initiatives. For some, it is the process of analyzing information from a particular domain, such as Website Analytics. For others, it is applying the breadth of BI capabilities to a specific content area (for example sales, service, supply chain and so on). In particular, BI vendors use the “analytics” moniker to differentiate their products from the competition. Increasingly, “analytics” is used to describe statistical and mathematical data analysis that clusters, segments, scores and predicts what scenarios are most likely to happen. Whatever the use cases, “analytics” has moved deeper into the business vernacular.

Analytics has garnered a burgeoning interest from business and IT professionals looking to exploit huge amounts of internally generated and externally available data.

Artificial intelligence:

The theory and development of computer systems able to perform tasks that traditionally have required human intelligence.

Big data:

A generic term that designates the massive volume of data that is generated by the increased use of digital tools and information systems. The term big data is used when the amount of data that an organization has to manage reaches a critical volume that requires new technological approaches in terms of storage, processing and usage. Volume, velocity and variety are usually the three criteria used to qualify a database as “big data”.

Business intelligence (BI):

This is an umbrella term that includes the applications, infrastructure and tools, and best practices that enable access to and analysis of information to improve and optimize decisions and performance.

Cloud computing:

The National Institute of Standards and Technology (NIST) definition of cloud computing: “Cloud Computing is a model for enabling convenient, on-demand network access to a shared pool of configurable computing resources (e.g. networks, servers, storage, applications), and services that can be rapidly provisioned and released with minimal management effort or service provider interaction”. This term designates a set of processes that use computational and/or storage capacities from remote servers connected through a network, usually the Internet. This model allows access to the network on demand. Resources are shared and computational power is configured according to requirements.

Cluster analysis:

A statistical technique whereby data or objects are classified into groups (clusters) that are similar to one another but different from data or objects in other clusters.
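As a rough illustration of this definition, the sketch below groups one-dimensional values with a simplified K-means procedure (the book covers K-means in Chapter 7). The function name `k_means_1d`, the sample `incomes` values and the starting centroids are all hypothetical, chosen only for the example; this is not the author's implementation.

```python
def k_means_1d(values, centroids, iterations=10):
    """Group 1-D values around the nearest centroid (hypothetical helper)."""
    clusters = []
    for _ in range(iterations):
        # Assignment step: attach each value to its closest centroid.
        clusters = [[] for _ in centroids]
        for v in values:
            nearest = min(range(len(centroids)),
                          key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # Update step: move each centroid to the mean of its cluster.
        # An empty cluster keeps its previous centroid.
        centroids = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters


# Made-up sample data: two visibly separate groups of values.
incomes = [20, 22, 25, 80, 85, 90]
centroids, clusters = k_means_1d(incomes, centroids=[20.0, 90.0])
print(clusters)
```

Values inside each resulting group are close to one another and far from the values in the other group, which is the property the definition describes. Real datasets usually have several dimensions and need a distance measure chosen for the problem.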

Computer science:

Computer science is the study of how to manipulate, manage, transform and encode information.

Customer relationship management (CRM):

This is a business strategy that optimizes revenue and profitability while promoting customer satisfaction and loyalty. CRM technologies facilitate the implementation of a strategy, and make it possible to identify and manage customer relationships, in person or virtually. CRM software provides functionality to companies in four segments: sales, marketing, customer service and digital commerce.

Cyber security:

This is also known as computer security or IT security; it is involved in the protection of computer systems from the theft or damage of hardware, software or the information on them, as well as from disruption or misdirection of the services they provide.

Data:

This term comprises facts, observations and raw information. Data in itself has little meaning if it is not processed.

Data analysis:

This is a class of statistical methods that makes it possible to process a very large volume of data and identify the most interesting aspects of its structure. Some methods help to extract relations between different sets of data, and thus draw statistical information that makes it possible to describe the most important information contained in the data in the most succinct manner possible. Other techniques make it possible to group data in order to identify its common denominators clearly, and thereby understand them better.

Data mining:

This practice consists of extracting information from data with the objective of drawing knowledge from large quantities of data through automatic or semiautomatic methods. Data mining uses algorithms drawn from disciplines as diverse as statistics, artificial intelligence and computer science in order to develop models from data, that is, in order to find interesting structures or recurrent themes according to criteria determined beforehand, and to extract the largest possible amount of knowledge useful to companies. It groups together all technologies capable of analyzing database information in order to find useful information and possible significant and useful relationships within the data.

Data science:

This is a new discipline that combines elements of mathematics, statistics, computer science and data visualization. The objective is to extract information from data sources. In this sense, data science is devoted to database exploration and analysis. This discipline has recently received much attention due to the growing interest in big data.

Deep learning:

This is also known as deep structured learning or hierarchical learning; it is part of a broader family of machine learning methods based on learning data representations, as opposed to task-specific algorithms.

Exploratory data analysis (EDA):

In statistics, EDA is an approach to analyzing datasets in order to summarize their main characteristics, often with visual methods.
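A minimal sketch of what such a summary might look like, using only Python's standard `statistics` module. The helper names `summarize` and `text_histogram` and the sample `ages` values are assumptions made for this example, and the text histogram merely stands in for the visual methods mentioned above.

```python
import statistics


def summarize(values):
    """Return basic descriptive statistics for a list of numbers."""
    ordered = sorted(values)
    return {
        "count": len(ordered),
        "min": ordered[0],
        "max": ordered[-1],
        "mean": statistics.mean(ordered),
        "median": statistics.median(ordered),
        "stdev": statistics.stdev(ordered),  # sample standard deviation
    }


def text_histogram(values, bin_width):
    """Count integer values per bin and draw one '#' per observation."""
    bins = {}
    for v in values:
        start = (v // bin_width) * bin_width
        bins[start] = bins.get(start, 0) + 1
    return [
        f"{start:>4}-{start + bin_width - 1:<4} {'#' * n}"
        for start, n in sorted(bins.items())
    ]


# Made-up sample data for illustration only.
ages = [23, 25, 31, 35, 35, 41, 52]
summary = summarize(ages)
lines = text_histogram(ages, bin_width=10)
print(summary)
print("\n".join(lines))
```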

Garbage in, garbage out (GIGO):

In the field of computer science or information and communications technology, this refers to the fact that computers, since they operate through logical processes, will unquestioningly process unintended, even nonsensical, input data (“garbage in”) and produce undesired, often nonsensical, output (“garbage out”). The principle applies to other fields as well.

Hadoop:

Big data software infrastructure that includes a storage system and a distributed processing tool.

Information:

This consists of interpreted data and has discernible meaning. It lies in descriptions and answers questions like “Who?”, “What?”, “When?” and “How many?”

Innovation:

Innovation can refer to something new or to a change made to an existing product, idea or field.

Internet of Things (IoT):

The internetworking of physical devices, vehicles, buildings and other items embedded with electronics, software, sensors, actuators and network connectivity that enable these objects to collect and exchange data and send, receive and execute commands. According to the Gartner group, IoT is the network of physical objects that contain embedded technology to communicate and sense or interact with their internal states or the external environment.

Knowledge:

This is a type of know-how that makes it possible to transform information into instructions. Knowledge can either be obtained through transmission from those who possess it or by extraction from experience.

Machine learning:

A method of designing a sequence of actions to solve a problem that automatically optimizes through experience and with limited or no human intervention.

Machine-to-machine (M2M):

M2M communications are used for automated data transmission and measurement between mechanical or electronic devices. The key components of an M2M system are field-deployed wireless devices with embedded sensors or RFID, and wireless communication networks with complementary wireline access. These networks include, but are not limited to, cellular communication, Wi-Fi, ZigBee, WiMAX, wireless LAN (WLAN), generic DSL (xDSL) and fiber to the x (FTTx).

MapReduce:

This is a programming model or algorithm for the processing of data using a parallel programming implementation and was originally used for academic purposes associated with parallel programming techniques.
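To make the programming model more concrete, here is a single-process sketch of the classic word-count example. It only imitates the map, shuffle and reduce phases with ordinary Python functions; the function names and the sample `documents` are assumptions for illustration, and a real framework such as Hadoop would distribute these phases across many machines.

```python
from collections import defaultdict


def map_phase(document):
    """Map step: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(pairs):
    """Shuffle step: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce step: combine the values of each key into one count."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(documents):
    """Run the three phases in sequence on a list of documents."""
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return reduce_phase(shuffle_phase(pairs))


# Made-up sample input for illustration only.
documents = ["big data needs analytics", "analytics turns data into value"]
counts = word_count(documents)
print(counts)
```

Because each document is mapped independently and each key is reduced independently, both steps can in principle run in parallel, which is what makes the model suitable for large volumes of data.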

Natural language processing (NLP):

An interdisciplinary field of computer science, artificial intelligence and computational linguistics that focuses on programming computers and algorithms to parse, process and understand human language.

Nowcasting:

Nowcasting is the prediction of the present, the very near future and the very recent past in economics. The term is a contraction of “now” and “forecasting”