Practical Data Analysis - Second Edition

Hector Cuesta
Description

A practical guide to obtaining, transforming, exploring, and analyzing data using Python, MongoDB, and Apache Spark

About This Book

  • Learn to use various data analysis tools and algorithms to classify, cluster, visualize, simulate, and forecast your data
  • Apply Machine Learning algorithms to different kinds of data such as social networks, time series, and images
  • A hands-on guide to understanding the nature of data and how to turn it into insight

Who This Book Is For

This book is for developers who want to implement data analysis and data-driven algorithms in a practical way. It is also suitable for those without a background in data analysis or data processing. Basic knowledge of Python programming, statistics, and linear algebra is assumed.

What You Will Learn

  • Acquire, format, and visualize your data
  • Build an image-similarity search engine
  • Generate meaningful visualizations anyone can understand
  • Get started with analyzing social network graphs
  • Find out how to implement sentiment text analysis
  • Install data analysis tools such as Pandas, MongoDB, and Apache Spark
  • Get to grips with Apache Spark
  • Implement machine learning algorithms such as classification or forecasting

In Detail

Beyond buzzwords like Big Data or Data Science, there are great opportunities to innovate in many businesses by using data analysis to build data-driven products. Data analysis involves asking many questions about data in order to discover insights and generate value for a product or a service.

This book explains the basic data algorithms without theoretical jargon, and you'll get hands-on experience turning data into insight using machine learning techniques. We will work through data-driven projects for several types of data, such as text, images, social network graphs, documents, and time series, showing you how to implement large-scale data processing with MongoDB and Apache Spark.

Style and approach

This is a hands-on guide to data analysis and data processing. The concrete examples are explained with simple code and accessible data.

Table of Contents

Practical Data Analysis - Second Edition
Credits
About the Authors
About the Reviewers
www.PacktPub.com
eBooks, discount offers, and more
Why subscribe?
Free access for Packt account holders
Preface
What this book covers
What you need for this book
Who this book is for
Conventions
Reader feedback
Customer support
Downloading the example code
Downloading the color images of this book
Errata
Piracy
Questions
1. Getting Started
Computer science
Artificial intelligence
Machine learning
Statistics
Mathematics
Knowledge domain
Data, information, and knowledge
Inter-relationship between data, information, and knowledge
The nature of data
The data analysis process
The problem
Data preparation
Data exploration
Predictive modeling
Visualization of results
Quantitative versus qualitative data analysis
Importance of data visualization
What about big data?
Quantified self
Sensors and cameras
Social network analysis
Tools and toys for this book
Why Python?
Why mlpy?
Why D3.js?
Why MongoDB?
Summary
2. Preprocessing Data
Data sources
Open data
Text files
Excel files
SQL databases
NoSQL databases
Multimedia
Web scraping
Data scrubbing
Statistical methods
Text parsing
Data transformation
Data formats
Parsing a CSV file with the CSV module
Parsing CSV file using NumPy
JSON
Parsing JSON file using the JSON module
XML
Parsing XML in Python using the XML module
YAML
Data reduction methods
Filtering and sampling
Binned algorithm
Dimensionality reduction
Getting started with OpenRefine
Text facet
Clustering
Text filters
Numeric facets
Transforming data
Exporting data
Operation history
Summary
3. Getting to Grips with Visualization
What is visualization?
Working with web-based visualization
Exploring scientific visualization
Visualization in art
The visualization life cycle
Visualizing different types of data
HTML
DOM
CSS
JavaScript
SVG
Getting started with D3.js
Bar chart
Pie chart
Scatter plots
Single line chart
Multiple line chart
Interaction and animation
Data from social networks
An overview of visual analytics
Summary
4. Text Classification
Learning and classification
Bayesian classification
Naïve Bayes
E-mail subject line tester
The data
The algorithm
Classifier accuracy
Summary
5. Similarity-Based Image Retrieval
Image similarity search
Dynamic time warping
Processing the image dataset
Implementing DTW
Analyzing the results
Summary
6. Simulation of Stock Prices
Financial time series
Random Walk simulation
Monte Carlo methods
Generating random numbers
Implementation in D3.js
Quantitative analyst
Summary
7. Predicting Gold Prices
Working with time series data
Components of a time series
Smoothing time series
Linear regression
The data - historical gold prices
Nonlinear regressions
Kernel Ridge Regressions
Smoothing the gold prices time series
Predicting in the smoothed time series
Contrasting the predicted value
Summary
8. Working with Support Vector Machines
Understanding the multivariate dataset
Dimensionality reduction
Linear Discriminant Analysis (LDA)
Principal Component Analysis (PCA)
Getting started with SVM
Kernel functions
The double spiral problem
SVM implemented on mlpy
Summary
9. Modeling Infectious Diseases with Cellular Automata
Introduction to epidemiology
The epidemiology triangle
The epidemic models
The SIR model
Solving the ordinary differential equation for the SIR model with SciPy
The SIRS model
Modeling with Cellular Automaton
Cell, state, grid, neighborhood
Global stochastic contact model
Simulation of the SIRS model in CA with D3.js
Summary
10. Working with Social Graphs
Structure of a graph
Undirected graph
Directed graph
Social network analysis
Acquiring the Facebook graph
Working with graphs using Gephi
Statistical analysis
Male to female ratio
Degree distribution
Histogram of a graph
Centrality
Transforming GDF to JSON
Graph visualization with D3.js
Summary
11. Working with Twitter Data
The anatomy of Twitter data
Tweet
Followers
Trending topics
Using OAuth to access Twitter API
Getting started with Twython
Simple search using Twython
Working with timelines
Working with followers
Working with places and trends
Working with user data
Streaming API
Summary
12. Data Processing and Aggregation with MongoDB
Getting started with MongoDB
Database
Collection
Document
Mongo shell
Insert/Update/Delete
Queries
Data preparation
Data transformation with OpenRefine
Inserting documents with PyMongo
Group
Aggregation framework
Pipelines
Expressions
Summary
13. Working with MapReduce
An overview of MapReduce
Programming model
Using MapReduce with MongoDB
Map function
Reduce function
Using mongo shell
Using Jupyter
Using PyMongo
Filtering the input collection
Grouping and aggregation
Counting the most common words in tweets
Summary
14. Online Data Analysis with Jupyter and Wakari
Getting started with Wakari
Creating an account in Wakari
Getting started with IPython notebook
Data visualization
Introduction to image processing with PIL
Opening an image
Working with an image histogram
Filtering
Operations
Transformations
Getting started with pandas
Working with Time Series
Working with multivariate datasets with DataFrame
Grouping, Aggregation, and Correlation
Sharing your Notebook
The data
Summary
15. Understanding Data Processing using Apache Spark
Platform for data processing
The Cloudera platform
Installing Cloudera VM
An introduction to the distributed file system
First steps with Hadoop Distributed File System - HDFS
File management with HUE - web interface
An introduction to Apache Spark
The Spark ecosystem
The Spark programming model
An introductory working example of Apache Spark
Summary

Practical Data Analysis - Second Edition

Copyright © 2016 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the authors, nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

First published: October 2013

Second edition published: September 2016

Production reference: 1260916

Published by Packt Publishing Ltd.

Livery Place

35 Livery Street

Birmingham B3 2PB, UK.

ISBN 978-1-78528-971-2

www.packtpub.com

Credits

Authors

Hector Cuesta

Dr. Sampath Kumar

Copy Editor

Safis Editing

Reviewers

Chandana N. Athauda

Mark Kerzner

Project Coordinator

Ritika Manoj

Commissioning Editor

Amarabha Banarjee

Proofreader

Safis Editing

Acquisition Editor

Denim Pinto

Indexer

Tejal Daruwale Soni

Content Development Editor

Divij Kotian

Production Coordinator

Melwyn Dsa

Technical Editor

Rutuja Vaze

Cover Work

Melwyn Dsa

About the Authors

Hector Cuesta is the founder and Chief Data Scientist at Dataxios, a machine intelligence research company. He holds a BA in Informatics and an M.Sc. in Computer Science. He provides consulting services for data-driven product design, with experience in a variety of industries including financial services, retail, fintech, e-learning, and human resources. He is a robotics enthusiast in his spare time.

You can follow him on Twitter at https://twitter.com/hmCuesta.

I would like to dedicate this book to my wife Yolanda, and to my wonderful children Damian and Isaac for all the joy they bring into my life. To my parents Elena and Miguel for their constant support and love. 

Dr. Sampath Kumar works as an assistant professor and head of the Department of Applied Statistics at Telangana University. He has completed an M.Sc., M.Phil., and Ph.D. in statistics. He has five years of postgraduate teaching experience and more than four years of experience in the corporate sector. His expertise is in statistical data analysis using SPSS, SAS, R, Minitab, MATLAB, and so on. He is an advanced programmer in SAS and MATLAB. He has taught a range of applied and pure statistics subjects, such as forecasting models, applied regression analysis, multivariate data analysis, and operations research, to M.Sc. students. He is currently supervising Ph.D. scholars.

About the Reviewers

Chandana N. Athauda is currently employed at BAG (Brunei Accenture Group) Networks, Brunei, where he serves as a technical consultant. He mainly focuses on business intelligence, big data, and data visualization tools and technologies.

He has been working professionally in the IT industry for more than 15 years (he is an ex-Microsoft Most Valuable Professional (MVP) and Microsoft Ranger for TFS). His roles in the IT industry have spanned the entire spectrum, from programmer to technical consultant. Technology has always been a passion for him.

If you would like to talk to Chandana about this book, feel free to write to him at info@inzeek.net or tweet him at @inzeek.

Mark Kerzner is a Big Data architect and trainer. He is a founder and principal at Elephant Scale, which offers Big Data training and consulting. Mark has written HBase Design Patterns for Packt.

I would like to acknowledge my co-founder Sujee Maniyam and his colleague Tim Fox, as well as all the students and teachers. Last but not least, thanks to my multi-talented family.

www.PacktPub.com

For support files and downloads related to your book, please visit www.PacktPub.com.

eBooks, discount offers, and more

Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and, as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at customercare@packtpub.com for more details.

At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters and receive exclusive discounts and offers on Packt books and eBooks.

https://www2.packtpub.com/books/subscription/packtlib

Do you need instant solutions to your IT questions? PacktLib is Packt's online digital book library. Here, you can search, access, and read Packt's entire library of books.

Why subscribe?

  • Fully searchable across every book published by Packt
  • Copy and paste, print, and bookmark content
  • On demand and accessible via a web browser

Free access for Packt account holders

Get notified! Find out when new books are published by following @PacktEnterprise on Twitter or the Packt Enterprise Facebook page.

Preface

Practical Data Analysis provides a series of practical projects in order to turn data into insight. It covers a wide range of data analysis tools and algorithms for classification, clustering, visualization, simulation, and forecasting. The goal of this book is to help you understand your data and find patterns, trends, relationships, and insight in it.

This book contains practical projects that take advantage of MongoDB, D3.js, and the Python language and its ecosystem to present the concepts using code snippets and detailed descriptions.

What this book covers

Chapter 1, Getting Started, discusses the principles of data analysis and the data analysis process.

Chapter 2, Preprocessing Data, explains how to scrub and prepare your data for analysis, and introduces OpenRefine, a data cleansing tool.

Chapter 3, Getting to Grips with Visualization, shows how to visualize different kinds of data using D3.js, a JavaScript visualization framework.

Chapter 4, Text Classification, introduces binary classification using a Naïve Bayes algorithm to classify spam.

Chapter 5, Similarity-Based Image Retrieval, presents a project to find the similarity between images using a dynamic time warping approach.

Chapter 6, Simulation of Stock Prices, explains how to simulate stock prices using the random walk algorithm, visualized with a D3.js animation.

Chapter 7, Predicting Gold Prices, introduces how Kernel Ridge Regression works, and how to use it to predict the gold price using time series.

Chapter 8, Working with Support Vector Machines, describes how to use Support Vector Machines as a classification method.

Chapter 9, Modeling Infectious Diseases with Cellular Automata, introduces the basic concepts of Computational Epidemiology simulation and explains how to implement a cellular automaton to simulate an epidemic outbreak using D3.js and JavaScript.

Chapter 10, Working with Social Graphs, explains how to obtain and visualize your social media graph from Facebook using Gephi.

Chapter 11, Working with Twitter Data, explains how to use the Twitter API to retrieve data from Twitter. We also see how to extend text classification to perform sentiment analysis using the Naïve Bayes algorithm implemented in the Natural Language Toolkit (NLTK).

Chapter 12, Data Processing and Aggregation with MongoDB, introduces the basic operations in MongoDB as well as methods for grouping, filtering, and aggregation.

Chapter 13, Working with MapReduce, illustrates how to use the MapReduce programming model implemented in MongoDB.

Chapter 14, Online Data Analysis with Jupyter and Wakari, explains how to use the Wakari platform and introduces the basic use of Pandas and PIL with IPython.

Chapter 15, Understanding Data Processing using Apache Spark, explains how to use a distributed file system along with the Cloudera VM, and how to get started with a data environment. Finally, we describe the main features of Apache Spark with a practical example.

What you need for this book

The basic requirements for this book are as follows:

  • Python
  • OpenRefine
  • D3.js
  • mlpy
  • Natural Language Toolkit (NLTK)
  • Gephi
  • MongoDB

Who this book is for

This book is for software developers, analysts, and computer scientists who want to implement data analysis and visualization in a practical way. It is also intended to provide a self-contained set of practical projects to gain insight from different kinds of data, such as time series, numerical, multidimensional, social media graphs, and text.

You are not required to have previous knowledge about data analysis, but some basic knowledge about statistics and a general understanding of Python programming is assumed.

Conventions

In this book, you will find a number of text styles that distinguish between different kinds of information. Here are some examples of these styles and an explanation of their meaning. Code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles are shown as follows: "For this example, we will use the BeautifulSoup library version 4."

A block of code is set as follows:

from bs4 import BeautifulSoup
import urllib.request
from time import sleep
from datetime import datetime

Any command-line input or output is written as follows:

>>> [email protected]
>>> readers
>>> packt.com

New terms and important words are shown in bold. Words that you see on the screen, for example, in menus or dialog boxes, appear in the text like this: "Now, just click on the OK button to apply the transformation."

Note

Warnings or important notes appear in a box like this.

Tip

Tips and tricks appear like this.

Reader feedback

Feedback from our readers is always welcome. Let us know what you think about this book—what you liked or disliked. Reader feedback is important for us as it helps us develop titles that you will really get the most out of.

To send us general feedback, simply e-mail feedback@packtpub.com, and mention the book's title in the subject of your message.

If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide at www.packtpub.com/authors.

Customer support

Now that you are the proud owner of a Packt book, we have a number of things to help you to get the most from your purchase.

Downloading the example code

You can download the example code files for this book from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.

You can download the code files by following these steps:

  1. Log in or register to our website using your e-mail address and password.
  2. Hover the mouse pointer on the SUPPORT tab at the top.
  3. Click on Code Downloads & Errata.
  4. Enter the name of the book in the Search box.
  5. Select the book for which you're looking to download the code files.
  6. Choose from the drop-down menu where you purchased this book from.
  7. Click on Code Download.

You can also download the code files by clicking on the Code Files button on the book's webpage at the Packt Publishing website. This page can be accessed by entering the book's name in the Search box. Please note that you need to be logged in to your Packt account.

Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:

  • WinRAR / 7-Zip for Windows
  • Zipeg / iZip / UnRarX for Mac
  • 7-Zip / PeaZip for Linux

The code bundle for the book is also hosted on GitHub at https://github.com/PacktPublishing/Practical-Data-Analysis-Second-Edition. We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!

Downloading the color images of this book

We also provide you with a PDF file that has color images of the screenshots/diagrams used in this book. The color images will help you better understand the changes in the output. You can download this file from https://www.packtpub.com/sites/default/files/downloads/B04227_PracticalDataAnalysisSecondEdition_ColorImages.pdf.

Errata

Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books—maybe a mistake in the text or the code—we would be grateful if you could report this to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this book. If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting your book, clicking on the Errata Submission Form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded to our website or added to any list of existing errata under the Errata section of that title.

To view the previously submitted errata, go to https://www.packtpub.com/books/content/support and enter the name of the book in the search field. The required information will appear under the Errata section.

Piracy

Piracy of copyrighted material on the Internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works in any form on the Internet, please provide us with the location address or website name immediately so that we can pursue a remedy.

Please contact us at copyright@packtpub.com with a link to the suspected pirated material.

We appreciate your help in protecting our authors and our ability to bring you valuable content.

Questions

If you have a problem with any aspect of this book, you can contact us at questions@packtpub.com, and we will do our best to address the problem.

Chapter 1. Getting Started

Data analysis is the process in which raw data is ordered and organized so it can be used in methods that help to evaluate and explain the past and predict the future. Data analysis is not about the numbers; it is about asking questions, developing explanations, and testing hypotheses based on logical and analytical methods. It is a multidisciplinary field that combines computer science, artificial intelligence, machine learning, statistics, mathematics, and knowledge of the business domain, as shown in the following figure:

All of these skills are important for gaining a good understanding of the problem and its optimal solutions, so let's define those fields.

Computer science

Computer science creates the tools for data analysis. The vast amount of data generated has made computational analysis critical and has increased the demand for skills like programming, database administration, network administration, and high-performance computing. Some programming experience in Python (or any high-level programming language) is needed to follow the chapters in this book.

Artificial intelligence

According to Stuart Russell and Peter Norvig:

"Artificial intelligence has to do with smart programs, so let's get on and write some".

In other words, Artificial Intelligence (AI) studies algorithms that can simulate intelligent behavior. In data analysis, we use AI to perform activities that require intelligence, such as inference, similarity search, or unsupervised classification. Fields like deep learning rely on artificial intelligence algorithms; some of their current uses are chatbots, recommendation engines, image classification, and so on.

Machine learning

Machine learning (ML) is the study of computer algorithms that learn how to react in a certain situation or recognize patterns. According to Arthur Samuel (1959):

"Machine Learning is a field of study that gives computers the ability to learn without being explicitly programmed".

ML has a large number of algorithms, generally split into three groups depending on how the algorithms are trained. They are as follows:

  • Supervised learning
  • Unsupervised learning
  • Reinforcement learning

A relevant number of these algorithms are used throughout the book, combined with practical examples that lead the reader from the initial data problem to its programming solution.
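To make the distinction between the first two groups concrete, here is a minimal sketch (not one of the book's projects; the data is invented) that contrasts supervised and unsupervised learning on a toy one-dimensional dataset using only NumPy:

import numpy as np

# Labeled training data: one feature, two classes (the supervised setting)
X_train = np.array([1.0, 1.2, 0.9, 5.0, 5.3, 4.8])
y_train = np.array([0, 0, 0, 1, 1, 1])

def predict_1nn(x):
    # Supervised learning: reuse known labels via the nearest training point
    return y_train[np.argmin(np.abs(X_train - x))]

print(predict_1nn(1.1))  # prints 0
print(predict_1nn(4.9))  # prints 1

# The same points without labels (the unsupervised setting):
# discover two groups with a tiny k-means loop
X = X_train.copy()
centers = np.array([X.min(), X.max()])  # naive initialization
for _ in range(10):
    labels = np.abs(X[:, None] - centers[None, :]).argmin(axis=1)
    centers = np.array([X[labels == k].mean() for k in (0, 1)])
print(labels)  # two clusters found without any labels being given

In the supervised half the labels drive the prediction; in the unsupervised half the structure of the data alone determines the groups.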

Statistics

In January 2009, Google's Chief Economist Hal Varian said:

"I keep saying the sexy job in the next ten years will be statisticians. People think I'm joking, but who would've guessed that computer engineers would've been the sexy job of the 1990s?"

Statistics is the development and application of methods to collect, analyze, and interpret data. Data analysis encompasses a variety of statistical techniques such as simulation, Bayesian methods, forecasting, regression, time-series analysis, and clustering.

Mathematics

Data analysis makes use of many mathematical techniques, such as linear algebra (vectors and matrices, factorization, eigenvalues), numerical methods, and conditional probability, in its algorithms. In this book, all the chapters are self-contained and include the necessary math involved.

Knowledge domain

One of the most important activities in data analysis is asking questions, and a good understanding of the knowledge domain can give you the expertise and intuition needed to ask good questions. Data analysis is used in almost every domain, including finance, administration, business, social media, government, and science.

Data, information, and knowledge

Data is facts of the world. Data represents a fact or statement of an event without relation to other things. Data comes in many forms, such as web pages, sensors, devices, audio, video, networks, log files, social media, transactional applications, and much more. Most of this data is generated in real time and on a very large scale. Although it is generally alphanumeric (text, numbers, and symbols), it can consist of images or sound. Data consists of raw facts and figures, and it does not have any meaning until it is processed. For example, financial transactions, age, temperature, and the number of steps from my house to my office are simply numbers. Information appears when we work with those numbers and can find value and meaning.

Information can be considered as an aggregation of data. Information usually has some meaning and purpose, and it can help us make decisions more easily. After processing the data, we can put the information within a context in order to give it proper meaning. In computer jargon, a relational database makes information from the data stored within it.

Knowledge is information with meaning. Knowledge happens only when human experience and insight are applied to data and information. We can talk about knowledge when the data and the information turn into a set of rules that assist decisions. In fact, we can't store knowledge, because it implies the theoretical or practical understanding of a subject. The ultimate purpose of knowledge is value creation.

Inter-relationship between data, information, and knowledge

We can observe that the relationship between data, information, and knowledge is cyclical. The following diagram demonstrates the relationship between them, and the transformation of data into information and information into knowledge. If we apply valuable information based on context and purpose, it becomes knowledge; at the same time, processed and analyzed data gives us information. When looking at the transformation of data to information and information to knowledge, we should concentrate on the context, purpose, and relevance of the task.

Now I would like to discuss these relationships with a real-life example:

Our students conducted a survey for a project, with the purpose of collecting data on customer satisfaction with a product and evaluating whether to reduce its price. As it was a real project, our students got to make the final recommendation for satisfying the customers. The data collected by the survey was processed and a final report was prepared, and based on that report the manufacturer of the product has since reduced its price. Let's take a look at the following:

Data: Facts from the survey.
For example: the number of customers who purchased the product, satisfaction levels, competitor information, and so on.
Information: The project report.
For example: the satisfaction level related to price, based on the competitor product.
Knowledge: What the manufacturer learned about improving customer satisfaction and increasing product sales.
For example: the manufacturing cost of the product, transportation cost, quality of the product, and so on.

Finally, we can say that the data-information-knowledge hierarchy seemed like a great idea. However, by using predictive analytics, we can simulate intelligent behavior and provide a good approximation. The following image shows an example of how to turn data into knowledge:

The nature of data

Data is the plural of datum, so it is always treated as plural. We can find data in all situations of the world around us, whether structured or unstructured, in continuous or discrete conditions: in weather records, stock market logs, photo albums, music playlists, or our Twitter accounts. In fact, data can be seen as the essential raw material for any kind of human activity. According to the Oxford English Dictionary, data are

"known facts or things used as basis for inference or reckoning".

As shown in the following image, we can see data in two distinct ways: categorical and numerical:

Categorical data are values or observations that can be sorted into groups or categories. There are two types of categorical variables, nominal and ordinal. A nominal variable has no intrinsic ordering to its categories; for example, housing is a categorical variable with two categories (own and rent). An ordinal variable has an established ordering; for example, age as a variable with three ordered categories (young, adult, and elder).

Numerical data are values or observations that can be measured. There are two kinds of numerical values, discrete and continuous. Discrete data are values or observations that can be counted and are distinct and separate; for example, the number of lines in a piece of code. Continuous data are values or observations that may take on any value within a finite or infinite interval; for example, an economic time series like historic gold prices.
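As an illustration, the following minimal sketch (assuming pandas is installed; the column names and values are invented) encodes the four kinds of data described above:

import pandas as pd

df = pd.DataFrame({
    "housing": ["own", "rent", "own"],          # nominal categorical
    "age_group": ["young", "adult", "elder"],   # ordinal categorical
    "lines_of_code": [120, 87, 340],            # discrete numerical
    "gold_price": [1280.5, 1275.9, 1291.3],     # continuous numerical
})

# Declare the categorical columns explicitly; the ordinal one gets an order
df["housing"] = pd.Categorical(df["housing"])
df["age_group"] = pd.Categorical(
    df["age_group"], categories=["young", "adult", "elder"], ordered=True)

print(df.dtypes)                  # pandas now records each column's kind
print(df["age_group"] < "elder")  # ordering comparisons work for ordinals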

The kinds of datasets used in this book are the following:

  • E-mails (unstructured, discrete)
  • Digital images (unstructured, discrete)
  • Stock market logs (structured, continuous)
  • Historic gold prices (structured, continuous)
  • Credit approval records (structured, discrete)
  • Social media friends relationships (unstructured, discrete)
  • Tweets and trending topics (unstructured, continuous)
  • Sales records (structured, continuous)

For each of the projects in this book, we try to use a different kind of data. This book aims to give the reader the ability to address different kinds of data problems.

The data analysis process

When you have a good understanding of a phenomenon, it is possible to make predictions about it. Data analysis helps us make this possible by exploring the past and creating predictive models.

The data analysis process is composed of the following steps:

  • The statement of the problem
  • Collecting your data
  • Cleaning the data
  • Normalizing the data
  • Transforming the data
  • Exploratory statistics
  • Exploratory visualization
  • Predictive modeling
  • Validating your model
  • Visualizing and interpreting your results
  • Deploying your solution

All of these activities can be grouped as shown in the following image:

The problem

The problem definition starts with high-level business domain questions, such as how to track differences in behavior between groups of customers, or what the gold price will be next month. Understanding the objectives and requirements from a domain perspective is the key to a successful data analysis project.

Types of data analysis questions include:

  • Inferential
  • Predictive
  • Descriptive
  • Exploratory
  • Causal
  • Correlational

Data preparation

Data preparation is about how to obtain, clean, normalize, and transform the data into an optimal dataset, trying to avoid any possible data quality issues such as invalid, ambiguous, out-of-range, or missing values. This process can take up a lot of time. In Chapter 2, Preprocessing Data, we will go into more detail about working with data, using OpenRefine to address complicated tasks. Analyzing data that has not been carefully prepared can lead you to highly misleading results; a small sketch of typical cleaning steps follows the list below.

The characteristics of good data are as follows:

  • Complete
  • Coherent
  • Ambiguity elimination
  • Countable
  • Correct
  • Standardized
  • Redundancy elimination
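The following minimal sketch (invented data, assuming pandas is installed) shows typical preparation steps for out-of-range, missing, and non-standardized values:

import numpy as np
import pandas as pd

raw = pd.DataFrame({
    "age": [25, -3, 41, np.nan, 230],            # -3 and 230 are out of range
    "city": ["NYC", "nyc", " New York ", "NYC", None],
})

clean = raw.copy()
# Mark out-of-range ages as missing, then impute with the median
clean["age"] = clean["age"].where(clean["age"].between(0, 120))
clean["age"] = clean["age"].fillna(clean["age"].median())
# Standardize the text column: trim, uppercase, and unify spellings
clean["city"] = (clean["city"].str.strip().str.upper()
                 .replace({"NEW YORK": "NYC"}))

print(clean)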

Data exploration

Data exploration is essentially looking at the processed data in a graphical or statistical form and trying to find patterns, connections, and relations in the data. Visualization is used to provide overviews in which meaningful patterns may be found. In Chapter 3, Getting to Grips with Visualization, we will present a JavaScript visualization framework (D3.js) and implement some examples of how to use visualization as a data exploration tool.

Predictive modeling

From the galaxy of information, we have to extract usable hidden patterns and trends using relevant algorithms. To anticipate the future behavior of these hidden patterns, we can use predictive modeling. Predictive modeling is a statistical technique to predict future behavior by analyzing existing information, that is, historical data. We have to use proper statistical models that best forecast the hidden patterns of the data or information.

Predictive modeling is a process used in data analysis to create or choose a statistical model that best predicts the probability of an outcome. Using predictive modeling, we can assess the future behavior of a customer; for this, we require past performance data for that customer. For example, in the retail sector, predictive analysis can play an important role in achieving better profitability. Retailers can store galaxies of historical data; after developing different predictive models using this data, we can forecast in order to improve promotional planning, optimize sales channels, optimize store areas, and enhance demand planning.

Initially, building predictive models requires expert input. After building relevant predictive models, we can use them automatically for forecasts. Predictive models give better forecasts when we concentrate on a careful combination of predictors. In fact, if the data size increases, we get more precise prediction results.
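As a toy illustration of the core idea (invented numbers, pure NumPy, not one of the book's models): fit a model to a short historical series, then use it to forecast the next, unseen value.

import numpy as np

# Historical observations: month index versus an invented price series
months = np.arange(1, 9)
prices = np.array([100.0, 102.1, 103.9, 106.2, 108.0, 110.3, 111.9, 114.2])

# Fit a straight line (a degree-1 polynomial) by least squares
slope, intercept = np.polyfit(months, prices, 1)

# Use the fitted model to forecast the next, unseen month
next_month = 9
forecast = slope * next_month + intercept
print("forecast for month %d: %.2f" % (next_month, forecast))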

In this book, we will use a variety of these models, and we can group them into three categories based on their outcomes:

Model                                   Chapter   Algorithm
Categorical outcome (Classification)    4         Naïve Bayes Classifier
                                        11        Natural Language Toolkit and Naïve Bayes Classifier
Numerical outcome (Regression)          6         Random walk
                                        8         Support vector machines
                                        8         Distance-based approach and k-nearest neighbor
                                        9         Cellular automata
Descriptive modeling (Clustering)       5         Fast Dynamic Time Warping (FDTW) + distance metrics
                                        10        Force layout and Fruchterman-Reingold layout

Another important task we need to accomplish in this step is evaluating the model we chose as optimal for the particular problem.

Model assumptions are important for the quality of a predictive model: better predictions will result from a model that satisfies its underlying assumptions. However, assumptions can never be fully met in empirical data, so evaluation preferably focuses on the validity of the predictions, where the evidence is usually considered to be stronger.

The no free lunch theorem, proposed by Wolpert in 1996, states:

"No Free Lunch theorems have shown that learning algorithms cannot be universally good".

But extracting valuable information from the data requires the predictive model to be accurate. There are many different tests to determine whether the predictive models we create are accurate, meaningful representations that will provide valuable information.

Model evaluation helps us ensure that our analysis is not overoptimistic or overfitted. In this book, we are going to present two different ways of validating the model:

  • Cross-validation: Here, we divide the data into subsets of equal size and test the predictive model on each in order to estimate how it is going to perform in practice. We will implement cross-validation in order to validate the robustness of our model, as well as to evaluate multiple models and identify the best one based on their performance (a minimal sketch follows this list).
  • Hold-out: Here, a large dataset is arbitrarily divided into three subsets: a training set, a validation set, and a test set.
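As a toy illustration of k-fold cross-validation (pure NumPy, invented data; this is not the book's implementation), the following sketch splits the data into k folds, trains on k-1 of them, tests on the held-out fold, and averages the error:

import numpy as np

rng = np.random.default_rng(0)
X = np.linspace(0, 10, 40)
y = 2.0 * X + 1.0 + rng.normal(scale=0.5, size=X.size)  # a noisy line

k = 5
indices = rng.permutation(X.size)
folds = np.array_split(indices, k)

errors = []
for i in range(k):
    test_idx = folds[i]
    train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
    # Train on k-1 folds, test on the held-out fold
    slope, intercept = np.polyfit(X[train_idx], y[train_idx], 1)
    pred = slope * X[test_idx] + intercept
    errors.append(np.mean((pred - y[test_idx]) ** 2))  # fold MSE

print("mean cross-validated MSE over %d folds: %.3f" % (k, np.mean(errors)))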

Visualization of results

This is the final step in our analysis process. When we present the model's output results, visualization tools play an important role. The visualization of results is an important piece of our technological architecture, and since the database is the core of that architecture, various technologies and methods for visualizing data can be employed.

In an exploratory data analysis process, simple visualization techniques are very useful for discovering patterns, since the human eye plays an important role. Sometimes we have to generate a three-dimensional plot to find a visual pattern, but for getting better visual patterns we can also use a scatter plot matrix instead of a three-dimensional plot. In practice, the hypothesis of the study, the dimensionality of the feature space, and the data all play important roles in ensuring a good visualization technique.

In this book, we will focus on univariate and multivariate graphical models, using a variety of visualization tools such as bar charts, pie charts, scatter plots, line charts, and multiple line charts, all implemented in D3.js. We will also learn how to do standalone plotting in Python with Matplotlib.
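As a small preview of standalone plotting (the series below is invented; this is not one of the book's examples), a multiple line chart in Matplotlib looks like this:

import matplotlib.pyplot as plt

months = list(range(1, 13))
series_a = [3, 4, 4, 5, 7, 8, 8, 9, 8, 7, 6, 5]
series_b = [2, 2, 3, 4, 4, 5, 6, 6, 7, 7, 8, 8]

# Plot two series on the same axes, with markers and a legend
plt.plot(months, series_a, marker="o", label="series A")
plt.plot(months, series_b, marker="s", label="series B")
plt.xlabel("month")
plt.ylabel("value")
plt.title("Multiple line chart")
plt.legend()
plt.show()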