Data Analysis Using Big Data & Hadoop Framework

Dr. Amit Wadhwa

Description

This book is an introduction to data analytics using Big Data and the Hadoop framework. It covers the basics of Big Data technology and the Hadoop framework as they are used to achieve the goal of data analytics.

The initial chapter covers the basics of Big Data and its background in relation to data analytics. The book then describes some of the tools and technologies used for data analytics, followed by the requirement specification and dataset representations. Finally, implementation and result analysis are covered using an airlines data set as an example. The book is authored by Dr. Amit Wadhwa, Assistant Professor, Amity University Haryana (India).


Year of publication: 2019




Dr. Amit Wadhwa

Data Analysis Using Big Data & Hadoop Framework

BookRix GmbH & Co. KG, 81371 Munich


CHAPTER 1

 

INTRODUCTION

 

1.1  Introduction to Big Data

 

Big data analytics is the complex process of analysing large and varied data sets (referred to as big data). It is used to uncover information such as hidden patterns, unknown correlations and customer preferences that helps business organizations in decision making.

 

With its techniques and technologies, data analytics provides a means to analyse large data sets and draw conclusions that help organizations make decisions about their business and its growth.

 

Big data analytics is an advanced form of analytics involving complex applications, such as predictive models and statistical algorithms, run on high-performance analytics systems.

 

Big data analytics is important because it provides business benefits such as new revenue and marketing opportunities, better customer service and more efficient operations. It also provides competitive advantages over other players in the market.

 

Data growth has increased enormously due to advances in computation, communication mechanisms, sensors and storage devices, whose capacity has grown even as the devices themselves have shrunk. Search engines such as Google, Yahoo and Bing have created a large market by collecting the data and information freely available on the web and providing it to users through their platforms, creating a separate economy for these market players.

 

Huge amounts of data are added to their storage servers day in and day out through new services such as satellite images, driving directions, videos and other imagery content. The benefit that this data and information brings to society can hardly be measured or ignored. Search engines have changed the way such information is accessed in its different forms, and it is now used effectively for Big Data computing operations. This helps not only information users but also provider companies, researchers, defence forces, medical systems and others.

 

The increased importance of big data and its computing capabilities originates from rapid advances in several digital technologies:

 

Digital Sources: Digital data is generated by many different sources, such as evolving digital imaging mechanisms (digital cameras and medical equipment like X-ray, CT-scan and MRI machines), chemical and biological sensors such as environmental monitors, and the millions of individuals and organizations creating their own web interfaces to reach customers.

 

Computer networks: These also generate massive data sets, whether from dedicated local sensor networks or from the Internet.

 

Storage: Storage devices are getting cheaper day by day, which also leads to the generation of huge amounts of digital information that must be managed by effective and evolving data handling technologies and frameworks such as Hadoop.

 

Cluster Computers: These are a newer form of computer system consisting of thousands of nodes, each with several processors and storage devices, connected by high-speed local-area networks. Such configurations of systems and networks are common these days.

 

These devices and networks provide storage capacity for large data sets, as well as the computing power to organize, analyse and respond to queries about the data from distant users.

 

Cloud Computing and Data Analysis Algorithms: The growth of data centers and cluster computing environments has created a new business architecture that lets businesses and individuals use technology on a pay-per-use model rather than owning computing facilities through large up-front investments. A few market players provide these technology solutions and have built huge markets out of them.

 

 

Because the data sets are huge, working with them requires automated or semi-automated analysis algorithms and techniques for different data processing requirements.

 

Many organisations have designed such advanced algorithms to satisfy their own business-oriented data processing requirements.

 

1.2  Technology and Application Challenges

 

The technology required for big-data computing is still developing at an enormous pace, driven by technological evolution and other market forces. Disk drives, for example, are growing in capacity while their prices drop, a result of the evolution of magnetic storage technology and the huge economies of scale provided by large data centers and PCs. Some other focus points require more attention:

 

Increased Networking Speed: Although 1 TB of data can be stored on disk for a few hundred dollars, transferring it takes roughly an hour or more within a cluster and around a day over a high-speed internet connection. This makes it challenging to use the computing and storage resources of a cluster efficiently, because bandwidth requirements keep growing and there is a widening gap between the amount of data that can practically be stored and the amount that can practically be communicated.
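As a rough illustration of that gap, the short calculation below estimates how long moving 1 TB takes at two sustained bandwidths; the 2.5 Gbit/s cluster link and 100 Mbit/s internet link are illustrative assumptions, not figures from this text.

# Rough transfer-time arithmetic for 1 TB of data.
# The bandwidth figures are illustrative assumptions, not measurements.
TB_IN_BITS = 1e12 * 8            # 1 TB (decimal) expressed in bits

def transfer_hours(bits_per_second):
    """Hours needed to move 1 TB at the given sustained bandwidth."""
    return TB_IN_BITS / bits_per_second / 3600

print(round(transfer_hours(2.5e9), 1))   # ~0.9 h on an assumed 2.5 Gbit/s cluster link
print(round(transfer_hours(100e6), 1))   # ~22.2 h on an assumed 100 Mbit/s internet link

The numbers line up with the estimate above: roughly an hour inside a cluster and close to a day over a typical internet connection.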

 

Cluster Computing and Its Programming: Programming large-scale, highly distributed computing servers has always been a challenge, and it has become important in order to process huge data sets within the limited time available. The software must distribute data and computation across the nodes of a cluster, and must also detect and recover from the hardware and software faults that are impossible to avoid in such large systems.

 

Over the years, methods have been designed to organize and program such systems. One such approach in the cloud computing environment is the MapReduce programming model developed by Google Inc.

 

Increase in Cloud Computing Technologies: Tasks requiring extensive computation over large amounts of data are still not easy to handle, even on the servers of a big market player such as Amazon. This is partly due to technological challenges and shared bandwidth, which impose a heavy burden in time and money on getting a cloud user's data into and out of a cloud environment. To make cloud environments less vulnerable to events such as earthquakes, they should be geographically dispersed, which in turn requires high levels of interoperability and data mobility. A new cost-effective model also needs to be designed to handle such scenarios.

 

Machine Learning and Other Data Analysis Techniques: Machine learning can help solve large computational and scientific problems, but many of its algorithms still have scalability issues when data sets become huge or unmanageable. More sophisticated algorithms will be required in future to handle this exponential growth of digital data, and Big Data therefore plays an important role across industries.

 

 

Security Issues: Security is another issue that creates hurdles for the implementation and handling of big data sets. The data may be highly sensitive or less sensitive, but the tools required to extract and use such huge volumes of data have made it possible to access sensitive and non-sensitive information without the owner's permission. Much of the preservation of privacy in society has so far relied on such access being inefficient.

 

The cloud environment also gives attackers open access to launch attacks on systems carrying data, since technology has reduced the cost of doing so. To handle this situation, more safeguards are needed for the effective security of big data.

 

1.3 Emergence and Growth of Big Data Analytics

 

The term Big Data was coined in the mid-1990s to refer to the exponential increase in digital data. In 2001, Doug Laney broadened the focus of big data by relating it to increases in the volume, variety and velocity of data being generated by organizations. These three factors later became known as the 3 V's of Big Data: increasing volume, increasing velocity (frequency) and increasing variety. The term gained popularity after Gartner acquired Meta Group, and with it Laney's work, in 2005. Later, in 2006, a distributed processing framework known as Hadoop was launched as part of an Apache open source project.

 

It proved useful for running big data applications with ease. The Hadoop framework for processing Big Data became popular by 2011, after it had been adopted and recognised by many organisations, and many other Big Data processing frameworks and technologies have evolved around it.

 

In its initial years, the Hadoop framework grew to become a commonly used framework for Big Data applications. These applications were mainly developed for major internet and e-commerce players such as Google, Facebook and Yahoo, along with other analytics and marketing service providers. It was later adopted by many other sectors, including financial services, retail, healthcare, suppliers and manufacturers, and other enterprises.

 

1.4 Tools and Technologies for Big Data Analytics

 

Data can be categorised in many ways, but generically it is of two types, structured and unstructured, with semi-structured data as a further category in between. Commonly used relational databases cannot handle or process unstructured and semi-structured data well, and they also struggle with the real-time processing requirements of big data systems.

 

This led to the development of NoSQL databases, used mainly by organisations involved in collecting, processing and analysing Big Data. In addition, Hadoop and several related tools evolved and are used by organisations working on Big Data, including:

 

Pig: An open source technology that offers a high-level mechanism for the parallel programming of MapReduce jobs executed on Hadoop clusters.

 

Hive: An open source data warehousing system designed for querying and analysing large unstructured or semi-structured data sets stored in HDFS-based files.

 

 

 

MapReduce: A software framework that allows developers to write applications that process unstructured data in parallel across a distributed cluster of processors or stand-alone computers.

 

HBase: A key/value data store designed to run on top of the Hadoop Distributed File System (HDFS).

 

Spark: An open source parallel processing framework that allows users to run large-scale data analytics applications across clustered systems.

 

In this work, Apache Hive is used for analysing large data sets stored in Hadoop's HDFS and compatible file systems; it provides an SQL-like language called HiveQL. Hadoop itself allows Big Data to be stored and processed in a distributed environment across groups of computers using simple programming models, and it is designed to scale from a single machine to many machines. The primary goal of Hive here is to provide answers about business functions, system performance and user activity. Previously such needs were met by dumping the data into a MySQL database, but with data now running into terabytes it is instead injected into Hadoop Distributed File System files and processed with the Hive tool.
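A minimal sketch of that workflow is shown below, assuming a Cloudera-style environment where the hdfs and hive command-line tools are available on the PATH; the HDFS path, table name and column list are illustrative assumptions, not the book's actual schema.

# Sketch: copy a local airline file into HDFS and expose it to Hive as an external table.
# Paths, table name and columns are illustrative assumptions.
import subprocess

subprocess.run(["hdfs", "dfs", "-mkdir", "-p", "/user/cloudera/airlines"], check=True)
subprocess.run(["hdfs", "dfs", "-put", "-f", "Final_airlines", "/user/cloudera/airlines/"], check=True)

ddl = """
CREATE EXTERNAL TABLE IF NOT EXISTS airlines (
  airline_id INT, name STRING, alias STRING, iata STRING,
  icao STRING, callsign STRING, country STRING, active STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/user/cloudera/airlines';
"""
subprocess.run(["hive", "-e", ddl], check=True)   # run the DDL through the Hive command-line client

Once the external table is defined, HiveQL queries can be run against the file without moving the data again.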

 

1.5 Goal of Work

 

The best way to analyse Big Data and its technologies is to demonstrate the analysis on an example drawn from a real system. Every airport holds huge amounts of data: the number of flights, dates and times of arrival and departure, flight routes, the number of airports operating in each country and the list of active airlines in each country. The problem so far has been that only data sets of limited size can be analysed using relational database processing techniques. The proposed model is therefore built around airline data and provides a platform for data analytics based on HDFS (Hadoop Distributed File System) and Hive, so that queries can be executed and large data sets analysed more efficiently and effectively.

 

 


CHAPTER 2

 

BACKGROUND STUDY

 

Big Data is not only a broad term but also a recent approach to analysing complex, huge amounts of data. There is no single accepted definition of Big Data; researchers have defined it in different ways. One common approach characterises it by the widely used 4 V's. The first V is Volume, the sheer amount of data, which is difficult to handle with conventional data analytics and is where big data techniques come to the aid of analysis. The second V is Velocity, the high speed at which data is created, processed and analysed. The third V is Variety, the mix of data types to be analysed, such as Facebook data containing text messages, attachments, images, photos and so on. The fourth V is Veracity, the cleanliness and accuracy of the huge amount of data being used for processing.

 

In the telecommunications industry today, companies that make use of data analytics tools gain a competitive advantage in the modern business environment. To do so, they need to harness different kinds of data arriving at different velocities from multiple, disparate sources.

 

According to research published in 2006, companies are now extending their research and development (R&D) mechanisms to access external sources of knowledge. The differing structures and formats of this data make telecommunications companies struggle to understand the information in the context of gaining the much-needed competitive advantage.

 

In 2016, a researcher asserted that the key ingredient for value creation in the telecommunications industry, with its rapidly evolving business environment, is the capability to make use of data analytics technologies to obtain information for decision making and innovation.

 

Researchers used to working with structured data face many challenges in analysing such data, for instance the data created through social media, in blogs, in Facebook posts or on Snapchat. These types of data have different structures and formats and are harder to store in a traditional business database. Big data comes in all shapes and formats, including structured data, so working with big data means handling a variety of data formats and structures.

 

Big data can also be data created by sensors that track the movement of objects or changes in the environment, such as temperature fluctuations, or astronomy data. In the world of the internet of things, connected devices and wearables create huge volumes of data, and big data approaches are used to manage and analyse it.

 

Big Data includes data from a whole range of fields, such as flight data, population data, financial data and health data. Such data brings us to another V, Value, which has been proposed by a number of researchers.

 

Most of the time social media is analysed by advertisers and used to promote products and events, but big data has many other uses. It can also be used to assess risk in the insurance industry and to track reactions to products in real time. Big Data is also used to monitor things as diverse as wave movements, flight data, traffic data, financial transactions, health and crime.

 

The challenge of Big Data is how to use it to create something of value to the user: how to gather, store, process and analyse it so that the raw data is turned into information that supports decision making.

 

The U.S. Department of Transportation's (DOT) Bureau of Transportation Statistics (BTS) tracks the on-time performance of domestic flights operated by large air carriers. Summary information on the number of on-time, delayed, cancelled and diverted flights appears in DOT's monthly Air Travel Consumer Report, published about 30 days after the month's end, as well as in summary tables posted on the BTS website. Summary statistics and raw data are made available to the public when the Air Travel Consumer Report is released.

 

Apache Hive and Pig are used to perform aviation data analysis and gain insights from this U.S. airline data. 


CHAPTER 3

 

TECHNOLOGIES USED

 

 

3.1 Hadoop

 

Hadoop is an open source, Java-based programming framework that supports the processing and storage of extremely large data sets in a distributed computing environment. It is part of the Apache project sponsored by the Apache Software Foundation.

 

Hadoop makes it possible to run applications on systems with thousands of commodity hardware nodes, and to handle thousands of terabytes of data. Its distributed filesystem facilitates rapid data transfer rates among nodes and allows the system to continue operating in case of a node failure.

 

This approach lowers the risk of catastrophic system failure and unexpected data loss, even if a significant number of nodes become inoperative. Consequently, Hadoop quickly emerged as a foundation for big data processing tasks, such as scientific analytics, business and sales planning, and processing enormous volumes of sensor data, including from internet of things sensors.
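As a small illustration of that fault tolerance, the sketch below stores a file in HDFS and asks for three replicas, so that the loss of a single node does not lose the data. It assumes an hdfs client on the PATH; the paths are placeholders, and a real multi-node cluster is needed for three replicas to actually be placed.

# Sketch: store a file in HDFS and request 3 replicas so a node failure loses no data.
# Paths are placeholder assumptions.
import subprocess

subprocess.run(["hdfs", "dfs", "-put", "-f", "routes.dat", "/user/cloudera/routes.dat"], check=True)
subprocess.run(["hdfs", "dfs", "-setrep", "3", "/user/cloudera/routes.dat"], check=True)  # replication factor 3
subprocess.run(["hdfs", "dfs", "-ls", "/user/cloudera"], check=True)                      # verify the file landed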

 

Hadoop was created by computer scientists Doug Cutting and Mike Cafarella in 2006 to support distribution for the Nutch search engine. It was inspired by Google's MapReduce, a software framework in which an application is broken down into numerous small parts.

 

Any of these parts, which are also called fragments or blocks, can be run on any node in the cluster. After years of development within the open source community, Hadoop 1.0 became publicly available in November 2012 as part of the Apache project sponsored by the Apache Software Foundation.

 

 

Figure 3.1- Hadoop Distributed File System (HDFS) Architecture

 

3.2 MapReduce

 

MapReduce is a processing technique and a programming model for distributed computing based on Java. The MapReduce algorithm contains two important tasks, namely Map and Reduce.

 

The map task takes a set of data and converts it into another set of data in which individual elements are broken down into tuples (key/value pairs). The reduce task then takes the output of a map as its input and combines those tuples into a smaller set of tuples. As the order of the name MapReduce implies, the reduce task is always performed after the map task.
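The sketch below shows the two phases in the style of a Hadoop Streaming job, counting flights per carrier from comma-separated records. The record layout (carrier code in the first column) and the tiny local driver that stands in for Hadoop's shuffle-and-sort step are illustrative assumptions, not the framework's own API.

# Sketch of the Map and Reduce phases in the Hadoop Streaming style.
# Record layout (carrier code in column 0) is an illustrative assumption.
from itertools import groupby

def mapper(lines):
    """Map phase: emit a (carrier, 1) pair for every flight record."""
    for line in lines:
        carrier = line.split(",")[0]
        yield carrier, 1

def reducer(pairs):
    """Reduce phase: sum the counts for each carrier key."""
    for carrier, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
        yield carrier, sum(count for _, count in group)

# Tiny local run standing in for the shuffle/sort step Hadoop performs between phases.
records = ["AA,JFK,LAX", "DL,ATL,ORD", "AA,ORD,SFO"]
print(list(reducer(mapper(records))))   # [('AA', 2), ('DL', 1)]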

 

3.3 Hive

 

Apache Hive is an open-source data warehouse system for querying and analyzing large datasets stored in Hadoop files. Hadoop is a framework for handling large datasets in a distributed computing environment.

 

Hive has three main functions: data summarization, querying and analysis. It supports queries expressed in a language called HiveQL, which automatically translates SQL-like queries into MapReduce jobs executed on Hadoop. In addition, HiveQL allows custom MapReduce scripts to be plugged into queries.
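A small sketch of how such a query might be submitted from a script through the Hive command-line client (hive -e) is shown below; the table and column names refer to the illustrative airlines table sketched in Chapter 1 and are assumptions rather than the book's final schema.

# Sketch: count active airlines per country with a HiveQL query run via the Hive CLI.
# Table and column names are illustrative assumptions.
import subprocess

query = """
SELECT country, COUNT(*) AS active_airlines
FROM airlines
WHERE active = 'Y'
GROUP BY country
ORDER BY active_airlines DESC
LIMIT 10;
"""
result = subprocess.run(["hive", "-e", query], capture_output=True, text=True, check=True)
print(result.stdout)   # Hive prints the query result as tab-separated rows

Behind this single SQL-like statement, Hive generates and runs the corresponding MapReduce jobs on the cluster.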

 

 

3.4 Pig

 

Apache Pig is a platform for analysing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. The salient property of Pig programs is that their structure is amenable to substantial parallelization, which in turn enables them to handle very large data sets.

 

At the present time, Pig's infrastructure layer consists of a compiler that produces sequences of Map-Reduce programs, for which large-scale parallel implementations already exist (e.g., the Hadoop subproject). Pig's language layer currently consists of a textual language called Pig Latin, which has the following key properties (a short illustrative sketch follows the list below):

 

 1. Ease of programming- It is trivial to achieve parallel execution of simple, "embarrassingly parallel" data analysis tasks. Complex tasks comprised of multiple interrelated data transformations are explicitly encoded as data flow sequences, making them easy to write, understand, and maintain.

 2. Optimization opportunities- The way in which tasks are encoded permits the system to optimize their execution automatically, allowing the user to focus on semantics rather than efficiency.

 3. Extensibility- Users can create their own functions to do special-purpose processing.
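A minimal sketch of such a data flow is shown below: a short Pig Latin script that counts routes per source airport, written out to a file and run in Pig's local mode. The file name and field positions are illustrative assumptions.

# Sketch: a tiny Pig Latin data flow (routes per source airport), run in local mode.
# File name and field positions are illustrative assumptions.
import subprocess

pig_script = """
routes = LOAD 'routes.dat' USING PigStorage(',')
         AS (airline:chararray, airline_id:chararray,
             src:chararray, src_id:chararray,
             dst:chararray, dst_id:chararray);
by_src = GROUP routes BY src;
counts = FOREACH by_src GENERATE group AS airport, COUNT(routes) AS num_routes;
STORE counts INTO 'routes_per_airport' USING PigStorage(',');
"""
with open("routes_count.pig", "w") as f:
    f.write(pig_script)                                    # write the script to disk

subprocess.run(["pig", "-x", "local", "routes_count.pig"], check=True)  # run it in local mode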

 

3.5 Sqoop

 

Sqoop is a tool designed to transfer data between Hadoop and relational database servers. It is used to import data from relational databases such as MySQL or Oracle into Hadoop HDFS, and to export data from the Hadoop file system back to relational databases.
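A hedged sketch of a typical Sqoop import invoked from a script is shown below; the JDBC connection string, credentials, table name and target directory are placeholders, not values from this book.

# Sketch: import a MySQL table into HDFS with Sqoop.
# Connection details, table name and target directory are placeholder assumptions.
import subprocess

subprocess.run([
    "sqoop", "import",
    "--connect", "jdbc:mysql://localhost/flightsdb",
    "--username", "cloudera",
    "--password", "cloudera",
    "--table", "airlines",
    "--target-dir", "/user/cloudera/airlines_import",
    "--num-mappers", "1",   # a single mapper is the simplest setting for a small table
], check=True)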

 

3.6 Tableau

 

Tableau is a data visualization tool which helps to create visually appealing, interactive visualizations such as graphs, charts, reports and dashboards on existing data.

Tableau provides an easy-to-use, drag-and-drop interface with which interactive visualizations of the data can be created in minutes rather than months.

Using Tableau, we can connect to a wide variety of data sources, including files, SQL databases, web data and cube (multidimensional) databases. Tableau is also designed to support and visualize data stored in Hadoop platforms (Hadoop Cloudera Hive and Hadoop MapR Hive). 


CHAPTER 4

 

REQUIREMENT SPECIFICATION AND DATASET

 

 

Hardware Specification:

 

Processor: 2.0 GHz.

 

RAM: 4 GB

 

Free Space required on hard disk: 5 GB

 

 

Software Specification:

 

CDH (Cloudera Distribution Including Apache Hadoop)

 

Cloudera Manager 5

 

Oracle VM VirtualBox

 

 

4.1 Required Dataset

 

In this use case we are using three data sets: Final_airlines, routes.dat and airports_mod.dat.

 

Airports Data Set:

 

It contains the following fields as shown in Table 4.1:

 

Table 4.1 Airports Data Set Representation

 

Airport ID: Unique OpenFlights identifier for this airport.
Name: Name of the airport. May or may not contain the city name.
City: Main city served by the airport. May be spelled differently from Name.
Country: Country or territory where the airport is located.
IATA/FAA: 3-letter FAA code for airports located in the country "United States of America"; 3-letter IATA code for all other airports. Blank if not assigned.
ICAO: 4-letter ICAO code. Blank if not assigned.
Latitude: Decimal degrees. Negative is South, positive is North.
Longitude: Decimal degrees. Negative is West, positive is East.
Altitude: In feet.
TimeZone: Hours offset from UTC. Fractional hours are expressed as decimals.
DST: Daylight savings time. One of E (Europe), A (US/Canada), S (South America), O (Australia), Z (New Zealand), N (None) or U (Unknown).
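As a quick sanity check of this layout, the snippet below reads airports_mod.dat with Python's csv module and counts airports per country. The assumption that the file is comma-separated with columns in the order of Table 4.1 is illustrative and should be verified against the actual file.

# Sketch: count airports per country from airports_mod.dat.
# Assumes a comma-separated file with columns in the order of Table 4.1.
import csv
from collections import Counter

countries = Counter()
with open("airports_mod.dat", newline="", encoding="utf-8") as f:
    for row in csv.reader(f):
        if len(row) >= 4:            # skip malformed rows
            countries[row[3]] += 1   # column 4 is Country in the assumed layout

for country, count in countries.most_common(10):
    print(country, count)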

 

Airlines Data Set: