Plenty of small businesses face big amounts of data but lack the internal skills to support quantitative analysis. Understanding how to harness the power of data analysis using the latest open source technology can lead them to provide better customer service, visualize customer needs, or even obtain fresh insights about the performance of previous products. Practical Data Analysis is an ideal book for home and small business users who want to slice and dice the data they have on hand with minimum hassle.

Practical Data Analysis is a hands-on guide to understanding the nature of your data and turning it into insight. It will introduce you to the use of machine learning techniques, social network analytics, and econometrics to help your clients get insights about the pool of data they have at hand. Data preparation and processing for several kinds of data, such as text, images, graphs, documents, and time series, will also be covered.

Practical Data Analysis presents a detailed exploration of current work in data analysis through self-contained projects. First you will explore the basics of data preparation and transformation with OpenRefine. Then you will get started with exploratory data analysis using the D3.js visualization framework. You will also be introduced to machine learning techniques such as classification, regression, and clustering through practical projects such as spam classification, predicting gold prices, and finding clusters in your Facebook friends' network. You will learn how to solve problems in text classification, simulation, time series forecasting, social media, and MapReduce through detailed projects. Finally, you will work with large amounts of Twitter data using MapReduce to perform a sentiment analysis implemented in Python and MongoDB. Practical Data Analysis combines carefully selected algorithms and data scrubbing to turn your data into insight.
You can read this e-book in Legimi apps or in any app that supports the following format:
Page count: 294
Year of publication: 2013
Copyright © 2013 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing, nor its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
First published: October 2013
Production Reference: 1151013
Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham B3 2PB, UK.
ISBN 978-1-78328-099-5
www.packtpub.com
Cover Image by Hector Cuesta (<[email protected]>)
Author
Hector Cuesta
Reviewers
Dr. Sampath Kumar Kanthala
Mark Kerzner
Ricky J. Sethi, PhD
Dr. Suchita Tripathi
Dr. Jarrell Waggoner
Acquisition Editors
Edward Gordon
Erol Staveley
Lead Technical Editor
Neeshma Ramakrishnan
Technical Editors
Pragnesh Bilimoria
Arwa Manasawala
Manal Pednekar
Project Coordinator
Anugya Khurana
Proofreaders
Jenny Blake
Bridget Braund
Indexer
Hemangini Bari
Graphics
Rounak Dhruv
Abhinash Sahu
Sheetal Aute
Production Coordinator
Arvindkumar Gupta
Cover Work
Arvindkumar Gupta
The phrase From Data to Information, and from Information to Knowledge has become a cliché, but it has never been as fitting as it is today. With the emergence of Big Data and the need to make sense of massive amounts of disparate individual datasets, practitioners of data-driven domains are required to employ a rich set of analytic methods. Whether during data preparation and cleaning or during data exploration, the use of computational tools has become imperative. However, the complexity of the underlying theories represents a challenge for users who wish to apply these methods to exploit the potentially rich contents of available data in their domain. In some domains, text-based data may hold the secret of running a successful business. For others, the analysis of social networks and the classification of sentiments may reveal new strategies for the dissemination of information or the formulation of policy.
My own research and that of my students falls in the domain of computational epidemiology. Designing and implementing tools that facilitate the study of the progression of diseases in a large population is the main focus in this domain. Complex simulation models are expected to predict, or at least suggest, the most likely trajectory of an epidemic. The development of such models depends on the availability of data from which population- and disease-specific parameters can be extracted. Whether it is census data, which holds information about the makeup of the population, or medical texts, which describe the progression of disease in individuals, data exploration represents a challenging task. Like many areas that employ data analytics, computational epidemiology is intrinsically multidisciplinary. While the analysis of some data sources may reveal the number of eggs deposited by a mosquito, other sources may indicate the rate at which mosquitoes are likely to interact with the human population to cause a Dengue or West Nile Virus epidemic. To convert information to knowledge, computational scientists, biologists, biostatisticians, and public health practitioners must collaborate. It is the availability of sophisticated visualization tools that allows these diverse groups of scientists and practitioners to explore the data and share their insights.
I first met Hector Cuesta during the Fall Semester of 2011, when he joined my Computational Epidemiology Research Laboratory as a visiting scientist. I soon realized that Hector is not just an outstanding programmer, but also a practitioner who can readily apply computational paradigms to problems from different contexts. His expertise in a multitude of computational languages and tools, including Python, CUDA, Hadoop, SQL, and MPI, allows him to construct solutions to complex problems from different domains. In this book, Hector Cuesta demonstrates the application of a variety of data analysis tools to a diverse set of problem domains. Different types of datasets are used to motivate and explore the use of powerful computational methods that are readily applicable to other problem domains. This book serves both as a reference and as a tutorial for practitioners who want to conduct data analysis and move from Data to Information, and from Information to Knowledge.
Armin R. Mikler
Professor of Computer Science and Engineering
Director of the Center for Computational Epidemiology and Response Analysis
University of North Texas
Hector Cuesta holds a B.A. in Informatics and an M.Sc. in Computer Science. He provides consulting services for software engineering and data analysis, with experience in a variety of industries, including financial services, social networking, e-learning, and human resources.
He is a lecturer in the Department of Computer Science at the Autonomous University of Mexico State (UAEM). His main research interests lie in computational epidemiology, machine learning, computer vision, high-performance computing, big data, simulation, and data visualization.
He helped in the technical review of the books Raspberry Pi Networking Cookbook by Rick Golden and Hadoop Operations and Cluster Management Cookbook by Shumin Guo for Packt Publishing. He is also a columnist at Software Guru magazine, and he has published several scientific papers in international journals and conferences. In his spare time, he is an enthusiast of Lego Robotics and the Raspberry Pi.
You can follow him on Twitter at https://twitter.com/hmCuesta.
I would like to dedicate this book to my wife Yolanda, my wonderful children Damian and Isaac for all the joy they bring into my life, and to my parents Elena and Miguel for their constant support and love.
I would like to thank my great team at Packt Publishing; particular thanks go to Anurag Banerjee, Erol Staveley, Edward Gordon, Anugya Khurana, Neeshma Ramakrishnan, Arwa Manasawala, Manal Pednekar, Pragnesh Bilimoria, and Unnati Shah.
Thanks to my friends, Abel Valle, Oscar Manso, Ivan Cervantes, Agustin Ramos, Dr. Rene Cruz, Dr. Adrian Trueba, and Sergio Ruiz for their helpful suggestions and improvements to my drafts. I would also like to thank the technical reviewers for taking the time to send detailed feedback for the drafts.
I would also like to thank Dr. Armin Mikler for his encouragement and for agreeing to write the foreword of this book. Finally, as an important source of inspiration I would like to mention my mentor and former advisor Dr. Jesus Figueroa-Nazuno.
Mark Kerzner holds degrees in Law, Math, and Computer Science. He has been designing software for many years, and Hadoop-based systems since 2008. He is the President of SHMsoft, a provider of Hadoop applications for various verticals, and a co-author of the Hadoop Illuminated book/project. He has authored and co-authored books and patents.
I would like to acknowledge the help of my colleagues, in particular Sujee Maniyam, and last but not least I would acknowledge the help of my multi-talented family.
Dr. Sampath Kumar works as an assistant professor and head of the Department of Applied Statistics at Telangana University. He has completed an M.Sc., an M.Phil., and a Ph.D. in Statistics. He has five years of experience teaching postgraduate courses and more than four years of experience in the corporate sector. His expertise is in statistical data analysis using SPSS, SAS, R, Minitab, MATLAB, and so on. He is an advanced programmer in SAS and MATLAB. He has experience teaching M.Sc. students different applied and pure statistics subjects, such as forecasting models, applied regression analysis, multivariate data analysis, and operations research. He is currently supervising Ph.D. scholars.
Ricky J. Sethi is currently the Director of Research for The Madsci Network and a research scientist at University of Massachusetts Medical Center and UMass Amherst. Dr. Sethi's research tends to be interdisciplinary in nature, relying on machine-learning methods and physics-based models to examine issues in computer vision, social computing, and science learning. He received his B.A. in Molecular and Cellular Biology (Neurobiology)/Physics from the University of California, Berkeley, M.S. in Physics/Business (Information Systems) from the University of Southern California, and Ph.D. in Computer Science (Artificial Intelligence/Computer Vision) from the University of California, Riverside. He has authored or co-authored over 30 peer-reviewed papers or book chapters and was also chosen as an NSF Computing Innovation Fellow at both UCLA and USC's Information Sciences Institute.
Dr. Suchita Tripathi did her Ph.D. and M.Sc. in Anthropology at Allahabad University. She also has skills in computer applications and the SPSS data analysis software. She has language proficiency in Hindi, English, and Japanese. She learned primary and intermediate level Japanese at the ICAS Japanese language training school in Sendai, Japan, and received various certificates. She is the author of six articles and one book. She had two years of teaching experience in the Department of Anthropology and Tribal Development, GGV Central University, Bilaspur (C.G.). Her major areas of research are Urban Anthropology, the Anthropology of Disasters, and Linguistic and Archeological Anthropology.
I would like to acknowledge my parents and my lovely family for their moral support and good wishes.
Dr. Jarrell Waggoner is a software engineer at Groupon, working on internal tools to perform sales analytics and demand forecasting. He completed his Ph.D. in Computer Science and Engineering from the University of South Carolina and has worked on numerous projects in the areas of computer vision and image processing, including an NEH-funded document image processing project, a DARPA competition to build an event recognition system, and an interdisciplinary AFOSR-funded materials science image processing project. He is an ardent supporter of free software, having used a variety of open source languages, operating systems, and frameworks in his research. His open source projects and contributions, along with his research work, can be found on GitHub (https://github.com/malloc47) and on his website (http://www.malloc47.com).
You might want to visit www.PacktPub.com for support files and downloads related to your book.
Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and, as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at <[email protected]> for more details.
At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters and receive exclusive discounts and offers on Packt books and eBooks.
http://PacktLib.PacktPub.com
Do you need instant solutions to your IT questions? PacktLib is Packt's online digital book library. Here, you can access, read and search across Packt's entire library of books.
If you have an account with Packt at www.PacktPub.com, you can use this to access PacktLib today and view nine entirely free books. Simply use your login credentials for immediate access.
Practical Data Analysis provides a series of practical projects in order to turn data into insight. It covers a wide range of data analysis tools and algorithms for classification, clustering, visualization, simulation, and forecasting. The goal of this book is to help you understand your data to find patterns, trends, relationships, and insight.
This book contains practical projects that take advantage of MongoDB, D3.js, and the Python language and its ecosystem to present the concepts using code snippets and detailed descriptions.
Chapter 1, Getting Started, discusses the principles of data analysis and the data analysis process.
Chapter 2, Working with Data, explains how to scrub and prepare your data for analysis, and introduces the use of OpenRefine, a data cleansing tool.
Chapter 3, Data Visualization, shows how to visualize different kinds of data using D3.js, which is a JavaScript Visualization Framework.
Chapter 4, Text Classification, introduces binary classification, using a Naïve Bayes algorithm to classify spam.
Chapter 5, Similarity-based Image Retrieval, presents a project to find the similarity between images using a dynamic time warping approach.
Chapter 6, Simulation of Stock Prices, explains how to simulate stock prices using the Random Walk algorithm, visualized with a D3.js animation.
Chapter 7, Predicting Gold Prices, introduces how Kernel Ridge Regression works and how to use it to predict gold prices from a time series.
Chapter 8, Working with Support Vector Machines, describes how to use support vector machines as a classification method.
Chapter 9, Modeling Infectious Disease with Cellular Automata, introduces the basic concepts of computational epidemiology simulation and explains how to implement a cellular automaton to simulate an epidemic outbreak using D3.js and JavaScript.
Chapter 10, Working with Social Graphs, explains how to obtain and visualize your social media graph from Facebook using Gephi.
Chapter 11, Sentiment Analysis of Twitter Data, explains how to use the Twitter API to retrieve data from Twitter. We also see how to improve the text classification to perform a sentiment analysis using the Naïve Bayes algorithm implemented in the Natural Language Toolkit (NLTK).
Chapter 12, Data Processing and Aggregation with MongoDB, introduces the basic operations in MongoDB as well as methods for grouping, filtering, and aggregation.
Chapter 13, Working with MapReduce, illustrates how to use the MapReduce programming model implemented in MongoDB.
Chapter 14, Online Data Analysis with IPython and Wakari, explains how to use the Wakari platform and introduces the basic use of Pandas and PIL with IPython.
Appendix, Setting Up the Infrastructure, provides detailed information on installation of the software tools used in this book.
The basic requirements for this book are Python and its ecosystem of libraries, MongoDB, and D3.js; detailed installation instructions for all the software tools used are given in the Appendix, Setting Up the Infrastructure.
This book is for software developers, analysts, and computer scientists who want to implement data analysis and visualization in a practical way. It is also intended to provide a self-contained set of practical projects that yield insight into different kinds of data, such as time series, numerical data, multidimensional data, social media graphs, and texts. You are not required to have previous knowledge of data analysis, but some basic knowledge of statistics and a general understanding of Python programming are essential.
Feedback from our readers is always welcome. Let us know what you think about this book—what you liked or may have disliked. Reader feedback is important for us to develop titles that you really get the most out of.
To send us general feedback, simply send an e-mail to <[email protected]>, and mention the book title via the subject of your message.
If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide on www.packtpub.com/authors.
Now that you are the proud owner of a Packt book, we have a number of things to help you to get the most from your purchase.
You can download the example code files for all Packt books you have purchased from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.
Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books, maybe a mistake in the text or the code, we would be grateful if you would report this to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this book. If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting your book, clicking on the errata submission form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded to our website, or added to any list of existing errata, under the Errata section of that title. Any existing errata can be viewed by selecting your title from http://www.packtpub.com/support.
Piracy of copyright material on the Internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works, in any form, on the Internet, please provide us with the location address or website name immediately so that we can pursue a remedy.
Please contact us at <[email protected]> with a link to the suspected pirated material.
We appreciate your help in protecting our authors, and our ability to bring you valuable content.
You can contact us at <[email protected]> if you are having a problem with any aspect of the book, and we will do our best to address it.
Data analysis is the process in which raw data is ordered and organized so that it can be used in methods that help to explain the past and predict the future. Data analysis is not about the numbers; it is about asking questions, developing explanations, and testing hypotheses. Data analysis is a multidisciplinary field that combines Computer Science, Artificial Intelligence & Machine Learning, Statistics & Mathematics, and Knowledge Domain, as shown in the following figure:
Computer science creates the tools for data analysis. The vast amount of data generated has made computational analysis critical and has increased the demand for skills such as programming, database administration, network administration, and high-performance computing. Some programming experience in Python (or any high-level programming language) is needed to understand the chapters.
According to Stuart Russell and Peter Norvig:
"[AI] has to do with smart programs, so let's get on and write some."
In other words, AI studies algorithms that can simulate intelligent behavior. In data analysis, we use AI to perform activities that require intelligence, such as inference, similarity search, or unsupervised classification.
Machine learning (ML) is the study of computer algorithms that learn how to react in a certain situation or to recognize patterns. According to Arthur Samuel (1959):
"Machine Learning is a field of study that gives computers the ability to learn without being explicitly programmed."
ML has a large number of algorithms, generally split into three groups according to how the algorithm is trained: supervised learning, unsupervised learning, and reinforcement learning.
A relevant number of these algorithms is used throughout the book, combined with practical examples that lead the reader from the data problem to its programming solution.
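As a quick illustration of the first two groups, here is a minimal sketch that assumes the scikit-learn library (an assumption for illustration only; the book builds its algorithms step by step in later chapters):

# Supervised versus unsupervised learning in a few lines.
# Assumes scikit-learn is installed; not required by this chapter.
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB
from sklearn.cluster import KMeans

X, y = load_iris(return_X_y=True)

# Supervised: the algorithm is trained on labeled examples (X, y)
clf = GaussianNB().fit(X, y)
print(clf.predict(X[:3]))        # predicted class labels

# Unsupervised: the algorithm sees only X and finds structure by itself
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(km.labels_[:3])            # cluster assignments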
In January 2009, Google's Chief Economist, Hal Varian said,
"I keep saying the sexy job in the next ten years will be statisticians. People think I'm joking, but who would've guessed that computer engineers would've been the sexy job of the 1990s?"
Statistics is the development and application of methods to collect, analyze, and interpret data.
Data analysis encompasses a variety of statistical techniques such as simulation, Bayesian methods, forecasting, regression, time-series analysis, and clustering.
Data analysis makes use of many mathematical techniques, such as linear algebra (vectors and matrices, factorization, and eigenvalues), numerical methods, and conditional probability, in its algorithms. In this book, all the chapters are self-contained and include the necessary math involved.
One of the most important activities in data analysis is asking questions, and a good understanding of the knowledge domain can give you the expertise and intuition needed to ask good questions. Data analysis is used in almost all domains, such as finance, administration, business, social media, government, and science.
Data are facts of the world. For example, financial transactions, ages, temperatures, and the number of steps from my house to my office are simply numbers. Information appears when we work with those numbers and find value and meaning in them. Information can help us to make informed decisions.
We can talk about knowledge when the data and the information turn into a set of rules that assist decision making. In fact, we can't store knowledge, because it implies a theoretical or practical understanding of a subject. However, using predictive analytics, we can simulate intelligent behavior and provide a good approximation. An example of how to turn data into knowledge is shown in the following figure:
Data is the plural of datum, so it is always treated as plural. We can find data in all the situations of the world around us, whether structured or unstructured, continuous or discrete: in weather records, stock market logs, photo albums, music playlists, or our Twitter accounts. In fact, data can be seen as the essential raw material of any kind of human activity. According to the Oxford English Dictionary:
"Data are known facts or things used as basis for inference or reckoning."
As shown in the following figure, we can see Data in two distinct ways: Categorical and Numerical:
Categorical data are values or observations that can be sorted into groups or categories. There are two types of categorical values: nominal and ordinal. A nominal variable has no intrinsic ordering to its categories; for example, housing is a categorical variable with two categories (own and rent). An ordinal variable has an established ordering; for example, age as a variable with three ordered categories (young, adult, and elder).
Numerical data are values or observations that can be measured. There are two kinds of numerical values: discrete and continuous. Discrete data are values or observations that can be counted and are distinct and separate; for example, the number of lines of code in a program. Continuous data are values or observations that may take on any value within a finite or infinite interval; for example, an economic time series such as historic gold prices.
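These four flavors of data can be represented explicitly in Python, for example with the pandas library (a sketch only; the values are made up, and pandas is an assumption rather than a requirement of this chapter):

# Nominal, ordinal, discrete, and continuous data in pandas.
import pandas as pd

df = pd.DataFrame({
    "housing": pd.Categorical(["own", "rent", "own"]),        # nominal
    "age_group": pd.Categorical(
        ["young", "adult", "elder"],
        categories=["young", "adult", "elder"],
        ordered=True),                                        # ordinal
    "lines_of_code": [120, 340, 95],                          # discrete
    "gold_price": [1650.25, 1612.80, 1598.40],                # continuous
})
print(df.dtypes)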
The kinds of datasets used in this book are as follows:
For each of the projects in this book, we try to use a different kind of data. This book aims to give the reader the ability to address different kinds of data problems.
When you have a good understanding of a phenomenon, it is possible to make predictions about it. Data analysis helps us to make this possible through exploring the past and creating predictive models.
The data analysis process is composed of the following steps: problem definition, data preparation, data exploration, predictive modeling, and visualization of results.
All these activities can be grouped as shown in the following figure:
The problem definition starts with high-level questions, such as how to track differences in behavior between groups of customers, or what the gold price will be next month. Understanding the objectives and requirements from a domain perspective is the key to a successful data analysis project.
Types of data analysis questions are listed as follows:
Data preparation is about how to obtain, clean, normalize, and transform the data into an optimal dataset, trying to avoid any possible data quality issues such as invalid, ambiguous, out-of-range, or missing values. This process can take up a lot of your time. In Chapter 2, Working with Data, we go into more detail about working with data, using OpenRefine to address the complicated tasks. Analyzing data that has not been carefully prepared can lead you to highly misleading results.
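For instance, a few of these scrubbing steps might look as follows in Python with pandas; the filename and column names are hypothetical, and the book itself uses OpenRefine for this stage:

# A minimal data-scrubbing sketch with pandas.
# "customers.csv" and its columns are hypothetical examples.
import pandas as pd

df = pd.read_csv("customers.csv")
df = df.drop_duplicates()                               # remove duplicate rows
df["age"] = pd.to_numeric(df["age"], errors="coerce")   # invalid values -> NaN
df = df[df["age"].between(0, 120)]                      # drop out-of-range ages
df["city"] = df["city"].str.strip().str.title()         # normalize text fields
df = df.dropna(subset=["age", "city"])                  # discard missing values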
The characteristics of good data are listed as follows:
Data exploration is essentially looking at the data in a graphical or statistical form, trying to find patterns, connections, and relations in the data. Visualization is used to provide overviews in which meaningful patterns may be found.
In Chapter 3, Data Visualization, we present a visualization framework (D3.js) and we implement some examples on how to use visualization as a data exploration tool.
Predictive modeling is a process used in data analysis to create or choose a statistical model that tries to best predict the probability of an outcome. In this book, we use a variety of such models, and we can group them into three categories based on their outcome:
Categorical outcome (Classification):
  Chapter 4: Naïve Bayes Classifier
  Chapter 11: Natural Language Toolkit + Naïve Bayes Classifier
Numerical outcome (Regression):
  Chapter 6: Random Walk
  Chapter 8: Support Vector Machines
  Chapter 9: Cellular Automata
  Chapter 8: Distance-Based Approach + k-nearest neighbor
Descriptive modeling (Clustering):
  Chapter 5: Fast Dynamic Time Warping (FDTW) + Distance Metrics
  Chapter 10: Force Layout and Fruchterman-Reingold layout
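As a small taste of the classification category, the following sketch trains a Naïve Bayes spam classifier using scikit-learn; the training sentences are made up, and the book implements its own classifier in Chapter 4, Text Classification:

# A categorical-outcome (classification) sketch: Naïve Bayes spam detection.
# Assumes scikit-learn; the toy training data is made up.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

texts = ["win money now", "cheap pills online",
         "meeting at noon", "lunch with the team"]
labels = ["spam", "spam", "ham", "ham"]

vec = CountVectorizer()
X = vec.fit_transform(texts)                 # bag-of-words features
clf = MultinomialNB().fit(X, labels)
print(clf.predict(vec.transform(["win cheap pills"])))  # -> ['spam']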
Another important task we need to accomplish in this step is evaluating whether the model we chose is optimal for the particular problem.
The No Free Lunch Theorem proposed by Wolpert in 1996 stated:
"No Free Lunch theorems have shown that learning algorithms cannot be universally good."
Model evaluation helps us to ensure that our analysis is not over-optimistic or over-fitted. In this book, we are going to present two different ways to validate the model.
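One common validation scheme is a hold-out split: train the model on one part of the data and measure its performance on data it has never seen. A minimal sketch, assuming scikit-learn (the book details its own validation methods later):

# Hold-out validation: compare training accuracy with test accuracy
# to detect over-fitting. Assumes scikit-learn.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

clf = GaussianNB().fit(X_train, y_train)
print("train accuracy:", clf.score(X_train, y_train))
print("test accuracy: ", clf.score(X_test, y_test))   # the honest estimate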
This is the final step in our analysis process, and we need to answer the following questions:
How are we going to present the results?
For example, in tabular reports, 2D plots, dashboards, or infographics.
Where are they going to be deployed?
For example, on printed hard copies, posters, mobile devices, desktop interfaces, or the Web.
Each choice will depend on the kind of analysis and the particular data. In the following chapters, we will learn how to use standalone plotting in Python with matplotlib and web visualization with D3.js.
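As a first taste, a standalone matplotlib plot can be as short as the following sketch; the monthly price values are made up:

# A minimal standalone plot with matplotlib; the data is illustrative only.
import matplotlib.pyplot as plt

months = ["Jan", "Feb", "Mar", "Apr", "May"]
prices = [1650, 1610, 1590, 1580, 1620]      # hypothetical gold prices (USD)

plt.plot(months, prices, marker="o")
plt.title("Monthly gold price (illustrative data)")
plt.xlabel("Month")
plt.ylabel("USD per ounce")
plt.show()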
Quantitative and qualitative analysis can be defined as follows:
As shown in the following figure, we can observe the differences between quantitative and qualitative analysis:
Quantitative analytics involves the analysis of numerical data. The type of analysis will depend on the level of measurement. There are four kinds of measurement: nominal, ordinal, interval, and ratio.
Qualitative analysis can explore the complexity and meaning of social phenomena. Data for qualitative study may include written texts (for example, documents or email) and/or audible and visual data (for example, digital images or sounds). In Chapter 11, Sentiment Analysis of Twitter Data, we present a sentiment analysis from Twitter data as an example of qualitative analysis.
The goal of the