Practical Data Analysis

Hector Cuesta

Description

Plenty of small businesses face big amounts of data but lack the internal skills to support quantitative analysis. Understanding how to harness the power of data analysis using the latest open source technology can lead them to provide better customer service, visualize customer needs, or even obtain fresh insights about the performance of previous products. Practical Data Analysis is an ideal book for home and small business users who want to slice and dice the data they have on hand with minimum hassle.

Practical Data Analysis is a hands-on guide to understanding the nature of your data and turning it into insight. It will introduce you to the use of machine learning techniques, social network analytics, and econometrics to help your clients get insights from the pool of data they have at hand. Data preparation and processing over several kinds of data, such as text, images, graphs, documents, and time series, will also be covered.

Practical Data Analysis presents a detailed exploration of the current work in data analysis through self-contained projects. First you will explore the basics of data preparation and transformation through OpenRefine. Then you will get started with exploratory data analysis using the D3.js visualization framework. You will also be introduced to machine learning techniques such as classification, regression, and clustering through practical projects such as spam classification, predicting gold prices, and finding clusters in your Facebook friends' network. You will learn how to solve problems in text classification, simulation, time series forecasting, social media, and MapReduce through detailed projects. Finally you will work with large amounts of Twitter data using MapReduce to perform a sentiment analysis implemented in Python and MongoDB. Practical Data Analysis combines carefully selected algorithms and data scrubbing to turn your data into insight.




Table of Contents

Practical Data Analysis
Credits
Foreword
About the Author
Acknowledgments
About the Reviewers
www.PacktPub.com
Support files, eBooks, discount offers and more
Why Subscribe?
Free Access for Packt account holders
Preface
What this book covers
What you need for this book
Who this book is for
Conventions
Reader feedback
Customer support
Downloading the example code
Errata
Piracy
Questions
1. Getting Started
Computer science
Artificial intelligence (AI)
Machine Learning (ML)
Statistics
Mathematics
Knowledge domain
Data, information, and knowledge
The nature of data
The data analysis process
The problem
Data preparation
Data exploration
Predictive modeling
Visualization of results
Quantitative versus qualitative data analysis
Importance of data visualization
What about big data?
Sensors and cameras
Social networks analysis
Tools and toys for this book
Why Python?
Why mlpy?
Why D3.js?
Why MongoDB?
Summary
2. Working with Data
Datasource
Open data
Text files
Excel files
SQL databases
NoSQL databases
Multimedia
Web scraping
Data scrubbing
Statistical methods
Text parsing
Data transformation
Data formats
CSV
Parsing a CSV file with the csv module
Parsing a CSV file using NumPy
JSON
Parsing a JSON file using json module
XML
Parsing an XML file in Python using xml module
YAML
Getting started with OpenRefine
Text facet
Clustering
Text filters
Numeric facets
Transforming data
Exporting data
Operation history
Summary
3. Data Visualization
Data-Driven Documents (D3)
HTML
DOM
CSS
JavaScript
SVG
Getting started with D3.js
Bar chart
Pie chart
Scatter plot
Single line chart
Multi-line chart
Interaction and animation
Summary
4. Text Classification
Learning and classification
Bayesian classification
Naïve Bayes algorithm
E-mail subject line tester
The algorithm
Classifier accuracy
Summary
5. Similarity-based Image Retrieval
Image similarity search
Dynamic time warping (DTW)
Processing the image dataset
Implementing DTW
Analyzing the results
Summary
6. Simulation of Stock Prices
Financial time series
Random walk simulation
Monte Carlo methods
Generating random numbers
Implementation in D3.js
Summary
7. Predicting Gold Prices
Working with the time series data
Components of a time series
Smoothing the time series
The data – historical gold prices
Nonlinear regression
Kernel ridge regression
Smoothing the gold prices time series
Predicting in the smoothed time series
Contrasting the predicted value
Summary
8. Working with Support Vector Machines
Understanding the multivariate dataset
Dimensionality reduction
Linear Discriminant Analysis
Principal Component Analysis
Getting started with support vector machine
Kernel functions
Double spiral problem
SVM implemented on mlpy
Summary
9. Modeling Infectious Disease with Cellular Automata
Introduction to epidemiology
The epidemiology triangle
The epidemic models
The SIR model
Solving ordinary differential equation for the SIR model with SciPy
The SIRS model
Modeling with cellular automata
Cell, state, grid, and neighborhood
Global stochastic contact model
Simulation of the SIRS model in CA with D3.js
Summary
10. Working with Social Graphs
Structure of a graph
Undirected graph
Directed graph
Social Networks Analysis
Acquiring my Facebook graph
Using Netvizz
Representing graphs with Gephi
Statistical analysis
Male to female ratio
Degree distribution
Histogram of a graph
Centrality
Transforming GDF to JSON
Graph visualization with D3.js
Summary
11. Sentiment Analysis of Twitter Data
The anatomy of Twitter data
Tweet
Followers
Trending topics
Using OAuth to access Twitter API
Getting started with Twython
Simple search
Working with timelines
Working with followers
Working with places and trends
Sentiment classification
Affective Norms for English Words
Text corpus
Getting started with Natural Language Toolkit (NLTK)
Bag of words
Naive Bayes
Sentiment analysis of tweets
Summary
12. Data Processing and Aggregation with MongoDB
Getting started with MongoDB
Database
Collection
Document
Mongo shell
Insert/Update/Delete
Queries
Data preparation
Data transformation with OpenRefine
Inserting documents with PyMongo
Group
The aggregation framework
Pipelines
Expressions
Summary
13. Working with MapReduce
MapReduce overview
Programming model
Using MapReduce with MongoDB
The map function
The reduce function
Using mongo shell
Using UMongo
Using PyMongo
Filtering the input collection
Grouping and aggregation
Word cloud visualization of the most common positive words in tweets
Summary
14. Online Data Analysis with IPython and Wakari
Getting started with Wakari
Creating an account in Wakari
Getting started with IPython Notebook
Data visualization
Introduction to image processing with PIL
Opening an image
Image histogram
Filtering
Operations
Transformations
Getting started with Pandas
Working with time series
Working with multivariate dataset with DataFrame
Grouping, aggregation, and correlation
Multiprocessing with IPython
Pool
Sharing your Notebook
The data
Summary
A. Setting Up the Infrastructure
Installing and running Python 3
Installing and running Python 3.2 on Ubuntu
Installing and running IDLE on Ubuntu
Installing and running Python 3.2 on Windows
Installing and running IDLE on Windows
Installing and running NumPy
Installing and running NumPy on Ubuntu
Installing and running NumPy on Windows
Installing and running SciPy
Installing and running SciPy on Ubuntu
Installing and running SciPy on Windows
Installing and running mlpy
Installing and running mlpy on Ubuntu
Installing and running mlpy on Windows
Installing and running OpenRefine
Installing and running OpenRefine on Linux
Installing and running OpenRefine on Windows
Installing and running MongoDB
Installing and running MongoDB on Ubuntu
Installing and running MongoDB on Windows
Connecting Python with MongoDB
Installing and running UMongo
Installing and running UMongo on Ubuntu
Installing and running UMongo on Windows
Installing and running Gephi
Installing and running Gephi on Linux
Installing and running Gephi on Windows
Index

Practical Data Analysis

Practical Data Analysis

Copyright © 2013 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

First published: October 2013

Production Reference: 1151013

Published by Packt Publishing Ltd.

Livery Place

35 Livery Street

Birmingham B3 2PB, UK.

ISBN 978-1-78328-099-5

www.packtpub.com

Cover Image by Hector Cuesta (<[email protected]>)

Credits

Author

Hector Cuesta

Reviewers

Dr. Sampath Kumar Kanthala

Mark Kerzner

Ricky J. Sethi, PhD

Dr. Suchita Tripathi

Dr. Jarrell Waggoner

Acquisition Editors

Edward Gordon

Erol Staveley

Lead Technical Editor

Neeshma Ramakrishnan

Technical Editors

Pragnesh Bilimoria

Arwa Manasawala

Manal Pednekar

Project Coordinator

Anugya Khurana

Proofreaders

Jenny Blake

Bridget Braund

Indexer

Hemangini Bari

Graphics

Rounak Dhruv

Abhinash Sahu

Sheetal Aute

Production Coordinator

Arvindkumar Gupta

Cover Work

Arvindkumar Gupta

Foreword

The phrase From Data to Information, and from Information to Knowledge has become a cliché, but it has never been as fitting as it is today. With the emergence of Big Data and the need to make sense of massive amounts of disparate collections of individual datasets, practitioners of data-driven domains must employ a rich set of analytic methods. Whether during data preparation and cleaning or during data exploration, the use of computational tools has become imperative. However, the complexity of the underlying theories represents a challenge for users who wish to apply these methods to exploit the potentially rich contents of available data in their domain. In some domains, text-based data may hold the secret of running a successful business. For others, the analysis of social networks and the classification of sentiments may reveal new strategies for the dissemination of information or the formulation of policy.

My own research and that of my students falls in the domain of computational epidemiology. Designing and implementing tools that facilitate the study of the progression of diseases in a large population is the main focus of this domain. Complex simulation models are expected to predict, or at least suggest, the most likely trajectory of an epidemic. The development of such models depends on the availability of data from which population and disease specific parameters can be extracted. Whether it is census data, which holds information about the makeup of the population, or medical texts, which describe the progression of disease in individuals, data exploration represents a challenging task. Like many areas that employ data analytics, computational epidemiology is intrinsically multi-disciplinary. While the analysis of some data sources may reveal the number of eggs deposited by a mosquito, other sources may indicate the rate at which mosquitoes are likely to interact with the human population to cause a Dengue or West Nile Virus epidemic. To convert information to knowledge, computational scientists, biologists, biostatisticians, and public health practitioners must collaborate. It is the availability of sophisticated visualization tools that allows these diverse groups of scientists and practitioners to explore the data and share their insight.

I first met Hector Cuesta during the Fall Semester of 2011, when he joined my Computational Epidemiology Research Laboratory as a visiting scientist. I soon realized that Hector is not just an outstanding programmer, but also a practitioner who can readily apply computational paradigms to problems from different contexts. His expertise in a multitude of computational languages and tools, including Python, CUDA, Hadoop, SQL, and MPI, allows him to construct solutions to complex problems from different domains. In this book, Hector Cuesta demonstrates the application of a variety of data analysis tools on a diverse set of problem domains. Different types of datasets are used to motivate and explore the use of powerful computational methods that are readily applicable to other problem domains. This book serves both as a reference and as a tutorial for practitioners who want to conduct data analysis and move From Data to Information, and from Information to Knowledge.

Armin R. Mikler

Professor of Computer Science and Engineering

Director of the Center for Computational Epidemiology and Response Analysis

University of North Texas

About the Author

Hector Cuesta holds a B.A. in Informatics and an M.Sc. in Computer Science. He provides consulting services in software engineering and data analysis, with experience in a variety of industries including financial services, social networking, e-learning, and human resources.

He is a lecturer in the Department of Computer Science at the Autonomous University of Mexico State (UAEM). His main research interests lie in computational epidemiology, machine learning, computer vision, high-performance computing, big data, simulation, and data visualization.

He helped in the technical review of the books Raspberry Pi Networking Cookbook by Rick Golden and Hadoop Operations and Cluster Management Cookbook by Shumin Guo, both for Packt Publishing. He is also a columnist at Software Guru magazine and has published several scientific papers in international journals and conferences. In his spare time, he is an enthusiast of Lego Robotics and the Raspberry Pi.

You can follow him on Twitter at https://twitter.com/hmCuesta.

Acknowledgments

I would like to dedicate this book to my wife Yolanda, my wonderful children Damian and Isaac for all the joy they bring into my life, and to my parents Elena and Miguel for their constant support and love.

I would like to thank my great team at Packt Publishing; particular thanks go to Anurag Banerjee, Erol Staveley, Edward Gordon, Anugya Khurana, Neeshma Ramakrishnan, Arwa Manasawala, Manal Pednekar, Pragnesh Bilimoria, and Unnati Shah.

Thanks to my friends, Abel Valle, Oscar Manso, Ivan Cervantes, Agustin Ramos, Dr. Rene Cruz, Dr. Adrian Trueba, and Sergio Ruiz for their helpful suggestions and improvements to my drafts. I would also like to thank the technical reviewers for taking the time to send detailed feedback for the drafts.

I would also like to thank Dr. Armin Mikler for his encouragement and for agreeing to write the foreword of this book. Finally, as an important source of inspiration I would like to mention my mentor and former advisor Dr. Jesus Figueroa-Nazuno.

About the Reviewers

Mark Kerzner holds degrees in Law, Math, and Computer Science. He has been designing software for many years, and Hadoop-based systems since 2008. He is the President of SHMsoft, a provider of Hadoop applications for various verticals, and a co-author of the Hadoop Illuminated book/project. He has authored and co-authored books and patents.

I would like to acknowledge the help of my colleagues, in particular Sujee Maniyam, and last but not least I would acknowledge the help of my multi-talented family.

Dr. Sampath Kumar works as an assistant professor and head of the Department of Applied Statistics at Telangana University. He has completed an M.Sc., M.Phil., and Ph.D. in Statistics. He has five years of experience teaching PG courses and more than four years of experience in the corporate sector. His expertise is in statistical data analysis using SPSS, SAS, R, Minitab, MATLAB, and so on. He is an advanced programmer in SAS and MATLAB. He has taught a range of applied and pure statistics subjects, such as forecasting models, applied regression analysis, multivariate data analysis, and operations research, to M.Sc. students. He is currently supervising Ph.D. scholars.

Ricky J. Sethi is currently the Director of Research for The Madsci Network and a research scientist at the University of Massachusetts Medical Center and UMass Amherst. Dr. Sethi's research tends to be interdisciplinary in nature, relying on machine-learning methods and physics-based models to examine issues in computer vision, social computing, and science learning. He received his B.A. in Molecular and Cellular Biology (Neurobiology)/Physics from the University of California, Berkeley, his M.S. in Physics/Business (Information Systems) from the University of Southern California, and his Ph.D. in Computer Science (Artificial Intelligence/Computer Vision) from the University of California, Riverside. He has authored or co-authored over 30 peer-reviewed papers or book chapters and was also chosen as an NSF Computing Innovation Fellow at both UCLA and USC's Information Sciences Institute.

Dr. Suchita Tripathi earned her Ph.D. and M.Sc. in Anthropology at Allahabad University. She also has skills in computer applications and the SPSS data analysis software. She is proficient in Hindi, English, and Japanese, having learned primary and intermediate level Japanese at the ICAS Japanese language training school, Sendai, Japan, where she received various certificates. She is the author of six articles and one book. She has two years of teaching experience in the Department of Anthropology and Tribal Development, GGV Central University, Bilaspur (C.G.). Her major areas of research are Urban Anthropology, Anthropology of Disasters, and Linguistic and Archeological Anthropology.

I would like to acknowledge my parents and my lovely family for their moral support, and well wishes.

Dr. Jarrell Waggoner is a software engineer at Groupon, working on internal tools to perform sales analytics and demand forecasting. He received his Ph.D. in Computer Science and Engineering from the University of South Carolina and has worked on numerous projects in the areas of computer vision and image processing, including an NEH-funded document image processing project, a DARPA competition to build an event recognition system, and an interdisciplinary AFOSR-funded materials science image processing project. He is an ardent supporter of free software, having used a variety of open source languages, operating systems, and frameworks in his research. His open source projects and contributions, along with his research work, can be found on GitHub (https://github.com/malloc47) and on his website (http://www.malloc47.com).

www.PacktPub.com

Support files, eBooks, discount offers and more

You might want to visit www.PacktPub.com for support files and downloads related to your book.

Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at <[email protected]> for more details.

At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters and receive exclusive discounts and offers on Packt books and eBooks.

http://PacktLib.PacktPub.com

Do you need instant solutions to your IT questions? PacktLib is Packt's online digital book library. Here, you can access, read and search across Packt's entire library of books.

Why Subscribe?

Fully searchable across every book published by Packt
Copy and paste, print, and bookmark content
On demand and accessible via web browser

Free Access for Packt account holders

If you have an account with Packt at www.PacktPub.com, you can use this to access PacktLib today and view nine entirely free books. Simply use your login credentials for immediate access.

Preface

Practical Data Analysis provides a series of practical projects in order to turn data into insight. It covers a wide range of data analysis tools and algorithms for classification, clustering, visualization, simulation, and forecasting. The goal of this book is to help you understand your data to find patterns, trends, relationships, and insight.

This book contains practical projects that take advantage of the MongoDB, D3.js, and Python language and its ecosystem to present the concepts using code snippets and detailed descriptions.

What this book covers

Chapter 1, Getting Started, discusses the principles of data analysis and the data analysis process.

Chapter 2, Working with Data, explains how to scrub and prepare your data for the analysis and also introduces the use of OpenRefine which is a data cleansing tool.

Chapter 3, Data Visualization, shows how to visualize different kinds of data using D3.js, which is a JavaScript Visualization Framework.

Chapter 4, Text Classification, introduces binary classification using the Naïve Bayes algorithm to classify spam.

Chapter 5, Similarity-based Image Retrieval, presents a project to find the similarity between images using a dynamic time warping approach.

Chapter 6, Simulation of Stock Prices, explains how to simulate stock prices using the random walk algorithm, visualized with a D3.js animation.

Chapter 7, Predicting Gold Prices, introduces how Kernel Ridge Regression works and how to use it to predict the gold price using time series.

Chapter 8, Working with Support Vector Machines, describes how to use support vector machines as a classification method.

Chapter 9, Modeling Infectious Disease with Cellular Automata, introduces the basic concepts of computational epidemiology simulation and explains how to implement a cellular automaton to simulate an epidemic outbreak using D3.js and JavaScript.

Chapter 10, Working with Social Graphs, explains how to obtain and visualize your social media graph from Facebook using Gephi.

Chapter 11, Sentiment Analysis of Twitter Data, explains how to use the Twitter API to retrieve data from Twitter. We also see how to improve the text classification to perform a sentiment analysis using the Naïve Bayes algorithm implemented in the Natural Language Toolkit (NLTK).

Chapter 12, Data Processing and Aggregation with MongoDB, introduces the basic operations in MongoDB as well as methods for grouping, filtering, and aggregation.

Chapter 13, Working with MapReduce, illustrates how to use the MapReduce programming model implemented in MongoDB.

Chapter 14, Online Data Analysis with IPython and Wakari, explains how to use the Wakari platform and introduces the basic use of Pandas and PIL with IPython.

Appendix, Setting Up the Infrastructure, provides detailed information on installation of the software tools used in this book.

What you need for this book

The basic requirements for this book are as follows:

Python
OpenRefine
D3.js
mlpy
Natural Language Toolkit (NLTK)
Gephi
MongoDB

Who this book is for

This book is for software developers, analysts, and computer scientists who want to implement data analysis and visualization in a practical way. It is also intended to provide a self-contained set of practical projects that yield insight into different kinds of data, such as time series, numerical data, multidimensional data, social media graphs, and texts. You are not required to have previous knowledge of data analysis, but some basic knowledge of statistics and a general understanding of Python programming are essential.

Reader feedback

Feedback from our readers is always welcome. Let us know what you think about this book—what you liked or may have disliked. Reader feedback is important for us to develop titles that you really get the most out of.

To send us general feedback, simply send an e-mail to <[email protected]>, and mention the book title via the subject of your message.

If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide on www.packtpub.com/authors.

Customer support

Now that you are the proud owner of a Packt book, we have a number of things to help you to get the most from your purchase.

Downloading the example code

You can download the example code files for all Packt books you have purchased from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.

Errata

Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books—maybe a mistake in the text or the code—we would be grateful if you would report this to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this book. If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting your book, clicking on the Errata Submission Form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded to our website, or added to the list of existing errata under the Errata section of that title. Any existing errata can be viewed by selecting your title from http://www.packtpub.com/support.

Piracy

Piracy of copyright material on the Internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works, in any form, on the Internet, please provide us with the location address or website name immediately so that we can pursue a remedy.

Please contact us at <[email protected]> with a link to the suspected pirated material.

We appreciate your help in protecting our authors, and our ability to bring you valuable content.

Questions

You can contact us at <[email protected]> if you are having a problem with any aspect of the book, and we will do our best to address it.

Chapter 1. Getting Started

Data analysis is the process by which raw data is ordered and organized so that it can be used in methods that help explain the past and predict the future. Data analysis is not just about numbers; it is about asking questions, developing explanations, and testing hypotheses. Data analysis is a multidisciplinary field that combines Computer Science, Artificial Intelligence & Machine Learning, Statistics & Mathematics, and Knowledge Domain, as shown in the following figure:

Computer science

Computer science creates the tools for data analysis. The vast amount of data generated has made computational analysis critical and has increased the demand for skills such as programming, database administration, network administration, and high-performance computing. Some programming experience in Python (or any high-level programming language) is needed to understand the chapters.

Artificial intelligence (AI)

According to Stuart Russell and Peter Norvig:

"[AI] has to do with smart programs, so let's get on and write some."

In other words, AI studies algorithms that can simulate intelligent behavior. In data analysis, we use AI to perform activities that require intelligence, such as inference, similarity search, or unsupervised classification.

Machine Learning (ML)

Machine learning is the study of computer algorithms that learn how to react in a given situation or how to recognize patterns. According to Arthur Samuel (1959),

"Machine Learning is a field of study that gives computers the ability to learn without being explicitly programmed."

ML comprises a large number of algorithms, generally split into three groups according to how the algorithm is trained (a minimal sketch of the first two settings follows the list):

Supervised learning
Unsupervised learning
Reinforcement learning
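To make the distinction concrete, the following is a minimal sketch of the supervised and unsupervised settings on toy data. It uses scikit-learn purely as an illustrative assumption (the projects in this book use mlpy and NLTK instead); reinforcement learning, which learns from rewards obtained by interacting with an environment, is not shown.

import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.cluster import KMeans

# Four points forming two well-separated groups (toy data).
X = np.array([[1.0, 1.1], [1.2, 0.9], [8.0, 8.2], [7.9, 8.1]])
y = np.array([0, 0, 1, 1])

# Supervised learning: the algorithm is trained on labeled pairs (X, y).
classifier = GaussianNB().fit(X, y)
print(classifier.predict([[1.1, 1.0]]))  # predicts class 0

# Unsupervised learning: the algorithm sees only X and must find structure.
clusters = KMeans(n_clusters=2, n_init=10).fit_predict(X)
print(clusters)  # two clusters, e.g. [0 0 1 1] (cluster numbering is arbitrary)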

A number of relevant algorithms are used throughout the book, combined with practical examples that lead the reader through the process from the data problem to its programming solution.

Statistics

In January 2009, Google's Chief Economist, Hal Varian said,

"I keep saying the sexy job in the next ten years will be statisticians. People think I'm joking, but who would've guessed that computer engineers would've been the sexy job of the 1990s?"

Statistics is the development and application of methods to collect, analyze, and interpret data.

Data analysis encompasses a variety of statistical techniques such as simulation, Bayesian methods, forecasting, regression, time-series analysis, and clustering.

Mathematics

Data analysis makes use of many mathematical techniques, such as linear algebra (vectors and matrices, factorization, and eigenvalues), numerical methods, and conditional probability. In this book, all the chapters are self-contained and include the necessary math involved.

Knowledge domain

One of the most important activities in data analysis is asking questions, and a good understanding of the knowledge domain can give you the expertise and intuition needed to ask good questions. Data analysis is used in almost all domains, including finance, administration, business, social media, government, and science.

Data, information, and knowledge

Data are facts of the world. For example, financial transactions, ages, temperatures, and the number of steps from my house to my office are simply numbers. Information appears when we work with those numbers and find value and meaning. Information can help us make informed decisions.

We can talk about knowledge when the data and the information turn into a set of rules that assist decision making. In fact, we can't store knowledge, because it implies theoretical or practical understanding of a subject. However, using predictive analytics, we can simulate intelligent behavior and provide a good approximation. An example of how to turn data into knowledge is shown in the following figure:

The nature of data

Data is the plural of datum, so it is always treated as plural. We can find data in every situation of the world around us, whether structured or unstructured, continuous or discrete: in weather records, stock market logs, photo albums, music playlists, or our Twitter accounts. In fact, data can be seen as the essential raw material of any kind of human activity. According to the Oxford English Dictionary:

Data are known facts or things used as basis for inference or reckoning.

As shown in the following figure, we can see Data in two distinct ways: Categorical and Numerical:

Categorical data are values or observations that can be sorted into groups or categories. There are two types of categorical values: nominal and ordinal. A nominal variable has no intrinsic ordering of its categories. For example, housing is a categorical variable with two categories (own and rent). An ordinal variable has an established ordering. For example, age as a variable with three ordered categories (young, adult, and elder).

Numerical data are values or observations that can be measured. There are two kinds of numerical values: discrete and continuous. Discrete data are values or observations that can be counted and are distinct and separate, for example, the number of lines in a piece of code. Continuous data are values or observations that may take on any value within a finite or infinite interval, for example, an economic time series such as historic gold prices.
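As a minimal illustration in Python (the values are hypothetical), the four types described above might look like this:

housing = "own"                          # nominal: a category with no intrinsic order
age_group = ["young", "adult", "elder"]  # ordinal: categories with an established order
lines_of_code = 1250                     # discrete: countable, distinct and separate
gold_price = 1377.45                     # continuous: any value within an interval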

The kinds of datasets used in this book are as follows:

E-mails (unstructured, discrete)
Digital images (unstructured, discrete)
Stock market logs (structured, continuous)
Historic gold prices (structured, continuous)
Credit approval records (structured, discrete)
Social media friends and relationships (unstructured, discrete)
Tweets and trending topics (unstructured, continuous)
Sales records (structured, continuous)

Each project in this book uses a different kind of data, so as to give the reader the ability to address different kinds of data problems.

The data analysis process

When you have a good understanding of a phenomenon, it is possible to make predictions about it. Data analysis helps make this possible by exploring the past and creating predictive models.

The data analysis process is composed of the following steps:

The statement of the problem
Obtain your data
Clean the data
Normalize the data
Transform the data
Exploratory statistics
Exploratory visualization
Predictive modeling
Validate your model
Visualize and interpret your results
Deploy your solution

All these activities can be grouped as shown in the following figure:

The problem

The problem definition starts with high-level questions such as how to track differences in behavior between groups of customers, or what's going to be the gold price in the next month. Understanding the objectives and requirements from a domain perspective is the key to a successful data analysis project.

Types of data analysis questions are listed as follows:

Inferential
Predictive
Descriptive
Exploratory
Causal
Correlational

Data preparation

Data preparation is about how to obtain, clean, normalize, and transform the data into an optimal dataset, trying to avoid any possible data quality issues such as invalid, ambiguous, out-of-range, or missing values. This process can take a lot of your time. In Chapter 2, Working with Data, we go into more detail about working with data, using OpenRefine to address the complicated tasks. Analyzing data that has not been carefully prepared can lead you to highly misleading results.

The characteristics of good data are listed as follows:

Complete
Coherent
Unambiguous
Countable
Correct
Standardized
Non-redundant

Data exploration

Data exploration is essentially looking at the data in a graphical or statistical form trying to find patterns, connections, and relations in the data. Visualization is used to provide overviews in which meaningful patterns may be found.

In Chapter 3, Data Visualization, we present a visualization framework (D3.js) and we implement some examples on how to use visualization as a data exploration tool.

Predictive modeling

Predictive modeling is a process used in data analysis to create or choose a statistical model that best predicts the probability of an outcome. In this book, we use a variety of such models, which we can group into three categories based on their outcome:

 

Categorical outcome (Classification):
Chapter 4: Naïve Bayes Classifier
Chapter 11: Natural Language Toolkit + Naïve Bayes Classifier

Numerical outcome (Regression):
Chapter 6: Random Walk
Chapter 8: Support Vector Machines
Chapter 9: Cellular Automata
Chapter 8: Distance Based Approach + k-nearest neighbor

Descriptive modeling (Clustering):
Chapter 5: Fast Dynamic Time Warping (FDTW) + Distance Metrics
Chapter 10: Force Layout and Fruchterman-Reingold layout

Another important task in this step is evaluating whether the model we chose is optimal for the particular problem.

The No Free Lunch Theorem proposed by Wolpert in 1996 stated:

"No Free Lunch theorems have shown that learning algorithms cannot be universally good."

The model evaluation helps us to ensure that our analysis is not over-optimistic or over-fitted. In this book, we are going to present two different ways to validate the model:

Cross-validation: We divide the data into subsets of equal size and test the predictive model on each subset in order to estimate how it will perform in practice. We will implement cross-validation to validate the robustness of our models, as well as to evaluate multiple models and identify the best one based on their performance.
Hold-out: Mostly used with large datasets, the data is randomly divided into three subsets: a training set, a validation set, and a test set. A minimal sketch of both strategies follows.
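To make both strategies concrete, here is a minimal sketch using only NumPy. The train and accuracy callables, the choice of k=5 folds, and the 60/20/20 hold-out split are illustrative assumptions, not code from the book's projects:

import numpy as np

def k_fold_cross_validation(X, y, train, accuracy, k=5, seed=0):
    # Shuffle the sample indices and split them into k folds of (nearly) equal size.
    np.random.seed(seed)
    folds = np.array_split(np.random.permutation(len(X)), k)
    scores = []
    for i, test_idx in enumerate(folds):
        # Train on the other k-1 folds, test on the held-out fold.
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        model = train(X[train_idx], y[train_idx])      # hypothetical training callable
        scores.append(accuracy(model, X[test_idx], y[test_idx]))  # hypothetical scorer
    # The mean score estimates how the model will perform in practice.
    return np.mean(scores)

def hold_out_split(n_samples, train_frac=0.6, validation_frac=0.2, seed=0):
    # A single random split of the sample indices into training,
    # validation, and test sets (the remainder goes to the test set).
    np.random.seed(seed)
    idx = np.random.permutation(n_samples)
    a = int(train_frac * n_samples)
    b = int((train_frac + validation_frac) * n_samples)
    return idx[:a], idx[a:b], idx[b:]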

Visualization of results

This is the final step in our analysis process and we need to answer the following questions:

How are we going to present the results?

For example, in tabular reports, 2D plots, dashboards, or infographics.

Where are the results going to be deployed?

For example, as printed hard copy, a poster, mobile devices, a desktop interface, or the Web.

Each choice will depend on the kind of analysis and the particular data. In the following chapters, we will learn how to use standalone plotting in Python with matplotlib and web visualization with D3.js.

Quantitative versus qualitative data analysis

Quantitative and qualitative analysis can be defined as follows:

Quantitative data: Numerical measurements expressed in terms of numbers
Qualitative data: Categorical measurements expressed in terms of natural language descriptions

As shown in the following figure, we can observe the differences between quantitative and qualitative analysis:

Quantitative analytics involves the analysis of numerical data. The type of analysis will depend on the level of measurement. There are four kinds of measurements:

Nominal: Data has no logical order and is used as classification data
Ordinal: Data has a logical order, but the differences between values are not constant
Interval: Data is continuous and has a logical order, with standardized differences between values, but no natural zero
Ratio: Data is continuous, with a logical order as well as regular differences between values, and may include a natural zero

Qualitative analysis can explore the complexity and meaning of social phenomena. Data for qualitative study may include written texts (for example, documents or email) and/or audible and visual data (for example, digital images or sounds). In Chapter 11, Sentiment Analysis of Twitter Data, we present a sentiment analysis from Twitter data as an example of qualitative analysis.

Importance of data visualization

The goal of the