Python Data Analysis Cookbook

Ivan Idris

Description

Over 140 practical recipes to help you make sense of your data with ease and build production-ready data apps

About This Book

  • Analyze Big Data sets, create attractive visualizations, and manipulate and process various data types
  • Packed with rich recipes to help you learn and explore amazing algorithms for statistics and machine learning
  • Authored by Ivan Idris, an expert in Python programming and the proud author of eight highly reviewed books

Who This Book Is For

This book teaches Python data analysis at an intermediate level with the goal of transforming you from journeyman to master. Basic Python skills and an affinity for data analysis are assumed.

What You Will Learn

  • Set up reproducible data analysis
  • Clean and transform data
  • Apply advanced statistical analysis
  • Create attractive data visualizations
  • Web scrape and work with databases, Hadoop, and Spark
  • Analyze images and time series data
  • Mine text and analyze social networks
  • Use machine learning and evaluate the results
  • Take advantage of parallelism and concurrency

In Detail

Data analysis is a rapidly evolving field, and Python is a multi-paradigm programming language suitable for object-oriented application development and functional design patterns. Because Python offers a range of tools and libraries for all purposes, it has steadily evolved into the primary language for data science, covering data analysis, visualization, and machine learning.

Python Data Analysis Cookbook focuses on reproducibility and creating production-ready systems. You will start with recipes that set the foundation for data analysis with libraries such as matplotlib, NumPy, and pandas. You will learn to create visualizations by choosing color maps and palettes, then dive into statistical data analysis using distribution algorithms and correlations. The book will then help you find your way around different data and numerical problems, get to grips with Spark and HDFS, and set up migration scripts for web mining.

In this book, you will dive deeper into recipes on spectral analysis, smoothing, and bootstrapping methods. Moving on, you will learn to rank stocks and check market efficiency, then work with metrics and clusters. Finally, you will use parallelism with multiple threads to improve system performance and speed up your code.

By the end of the book, you will be capable of handling various data analysis techniques in Python and devising solutions for problem scenarios.

Style and Approach

The book is written in a "cookbook" style that strives for high realism in data analysis. Thanks to the recipe-based format, you can read each recipe separately as required and immediately apply the knowledge gained.

Page count: 397

Year of publication: 2016




Table of Contents

Python Data Analysis Cookbook
Credits
About the Author
About the Reviewers
www.PacktPub.com
eBooks, discount offers, and more
Why subscribe?
Preface
Why do you need this book?
Data analysis, data science, big data – what is the big deal?
A brief history of data analysis with Python
A conjecture about the future
What this book covers
What you need for this book
Who this book is for
Sections
Getting ready
How to do it…
How it works…
There's more…
See also
Conventions
Reader feedback
Customer support
Downloading the example code
Errata
Piracy
Questions
1. Laying the Foundation for Reproducible Data Analysis
Introduction
Setting up Anaconda
Getting ready
How to do it...
There's more...
See also
Installing the Data Science Toolbox
Getting ready
How to do it...
How it works...
See also
Creating a virtual environment with virtualenv and virtualenvwrapper
Getting ready
How to do it...
See also
Sandboxing Python applications with Docker images
Getting ready
How to do it...
How it works...
See also
Keeping track of package versions and history in IPython Notebook
Getting ready
How to do it...
How it works...
See also
Configuring IPython
Getting ready
How to do it...
See also
Learning to log for robust error checking
Getting ready
How to do it...
How it works...
See also
Unit testing your code
Getting ready
How to do it...
How it works...
See also
Configuring pandas
Getting ready
How to do it...
Configuring matplotlib
Getting ready
How to do it...
How it works...
See also
Seeding random number generators and NumPy print options
Getting ready
How to do it...
See also
Standardizing reports, code style, and data access
Getting ready
How to do it...
See also
2. Creating Attractive Data Visualizations
Introduction
Graphing Anscombe's quartet
How to do it...
See also
Choosing seaborn color palettes
How to do it...
See also
Choosing matplotlib color maps
How to do it...
See also
Interacting with IPython Notebook widgets
How to do it...
See also
Viewing a matrix of scatterplots
How to do it...
Visualizing with d3.js via mpld3
Getting ready
How to do it...
Creating heatmaps
Getting ready
How to do it...
See also
Combining box plots and kernel density plots with violin plots
How to do it...
See also
Visualizing network graphs with hive plots
Getting ready
How to do it...
Displaying geographical maps
Getting ready
How to do it...
Using ggplot2-like plots
Getting ready
How to do it...
Highlighting data points with influence plots
How to do it...
See also
3. Statistical Data Analysis and Probability
Introduction
Fitting data to the exponential distribution
How to do it...
How it works…
See also
Fitting aggregated data to the gamma distribution
How to do it...
See also
Fitting aggregated counts to the Poisson distribution
How to do it...
See also
Determining bias
How to do it...
See also
Estimating kernel density
How to do it...
See also
Determining confidence intervals for mean, variance, and standard deviation
How to do it...
See also
Sampling with probability weights
How to do it...
See also
Exploring extreme values
How to do it...
See also
Correlating variables with Pearson's correlation
How to do it...
See also
Correlating variables with the Spearman rank correlation
How to do it...
See also
Correlating a binary and a continuous variable with the point biserial correlation
How to do it...
See also
Evaluating relations between variables with ANOVA
How to do it...
See also
4. Dealing with Data and Numerical Issues
Introduction
Clipping and filtering outliers
How to do it...
See also
Winsorizing data
How to do it...
See also
Measuring central tendency of noisy data
How to do it...
See also
Normalizing with the Box-Cox transformation
How to do it...
How it works
See also
Transforming data with the power ladder
How to do it...
Transforming data with logarithms
How to do it...
Rebinning data
How to do it...
Applying logit() to transform proportions
How to do it...
Fitting a robust linear model
How to do it...
See also
Taking variance into account with weighted least squares
How to do it...
See also
Using arbitrary precision for optimization
Getting ready
How to do it...
See also
Using arbitrary precision for linear algebra
Getting ready
How to do it...
See also
5. Web Mining, Databases, and Big Data
Introduction
Simulating web browsing
Getting ready
How to do it…
See also
Scraping the Web
Getting ready
How to do it…
Dealing with non-ASCII text and HTML entities
Getting ready
How to do it…
See also
Implementing association tables
Getting ready
How to do it…
Setting up database migration scripts
Getting ready
How to do it…
See also
Adding a table column to an existing table
Getting ready
How to do it…
Adding indices after table creation
Getting ready
How to do it…
How it works…
See also
Setting up a test web server
Getting ready
How to do it…
Implementing a star schema with fact and dimension tables
How to do it…
See also
Using HDFS
Getting ready
How to do it…
See also
Setting up Spark
Getting ready
How to do it…
See also
Clustering data with Spark
Getting ready
How to do it…
How it works…
There's more…
See also
6. Signal Processing and Timeseries
Introduction
Spectral analysis with periodograms
How to do it...
See also
Estimating power spectral density with the Welch method
How to do it...
See also
Analyzing peaks
How to do it...
See also
Measuring phase synchronization
How to do it...
See also
Exponential smoothing
How to do it...
See also
Evaluating smoothing
How to do it...
See also
Using the Lomb-Scargle periodogram
How to do it...
See also
Analyzing the frequency spectrum of audio
How to do it...
See also
Analyzing signals with the discrete cosine transform
How to do it...
See also
Block bootstrapping time series data
How to do it...
See also
Moving block bootstrapping time series data
How to do it...
See also
Applying the discrete wavelet transform
Getting started
How to do it...
See also
7. Selecting Stocks with Financial Data Analysis
Introduction
Computing simple and log returns
How to do it...
See also
Ranking stocks with the Sharpe ratio and liquidity
How to do it...
See also
Ranking stocks with the Calmar and Sortino ratios
How to do it...
See also
Analyzing returns statistics
How to do it...
Correlating individual stocks with the broader market
How to do it...
Exploring risk and return
How to do it...
See also
Examining the market with the non-parametric runs test
How to do it...
See also
Testing for random walks
How to do it...
See also
Determining market efficiency with autoregressive models
How to do it...
See also
Creating tables for a stock prices database
How to do it...
Populating the stock prices database
How to do it...
Optimizing an equal weights two-asset portfolio
How to do it...
See also
8. Text Mining and Social Network Analysis
Introduction
Creating a categorized corpus
Getting ready
How to do it...
See also
Tokenizing news articles in sentences and words
Getting ready
How to do it...
See also
Stemming, lemmatizing, filtering, and TF-IDF scores
Getting ready
How to do it...
How it works
See also
Recognizing named entities
Getting ready
How to do it...
How it works
See also
Extracting topics with non-negative matrix factorization
How to do it...
How it works
See also
Implementing a basic terms database
How to do it...
How it works
See also
Computing social network density
Getting ready
How to do it...
See also
Calculating social network closeness centrality
Getting ready
How to do it...
See also
Determining the betweenness centrality
Getting ready
How to do it...
See also
Estimating the average clustering coefficient
Getting ready
How to do it...
See also
Calculating the assortativity coefficient of a graph
Getting ready
How to do it...
See also
Getting the clique number of a graph
Getting ready
How to do it...
See also
Creating a document graph with cosine similarity
How to do it...
See also
9. Ensemble Learning and Dimensionality Reduction
Introduction
Recursively eliminating features
How to do it...
How it works
See also
Applying principal component analysis for dimension reduction
How to do it...
See also
Applying linear discriminant analysis for dimension reduction
How to do it...
See also
Stacking and majority voting for multiple models
How to do it...
See also
Learning with random forests
How to do it...
There's more…
See also
Fitting noisy data with the RANSAC algorithm
How to do it...
See also
Bagging to improve results
How to do it...
See also
Boosting for better learning
How to do it...
See also
Nesting cross-validation
How to do it...
See also
Reusing models with joblib
How to do it...
See also
Hierarchically clustering data
How to do it...
See also
Taking a Theano tour
Getting ready
How to do it...
See also
10. Evaluating Classifiers, Regressors, and Clusters
Introduction
Getting classification straight with the confusion matrix
How to do it...
How it works
See also
Computing precision, recall, and F1-score
How to do it...
See also
Examining a receiver operating characteristic and the area under a curve
How to do it...
See also
Visualizing the goodness of fit
How to do it...
See also
Computing MSE and median absolute error
How to do it...
See also
Evaluating clusters with the mean silhouette coefficient
How to do it...
See also
Comparing results with a dummy classifier
How to do it...
See also
Determining MAPE and MPE
How to do it...
See also
Comparing with a dummy regressor
How to do it...
See also
Calculating the mean absolute error and the residual sum of squares
How to do it...
See also
Examining the kappa of classification
How to do it...
How it works
See also
Taking a look at the Matthews correlation coefficient
How to do it...
See also
11. Analyzing Images
Introduction
Setting up OpenCV
Getting ready
How to do it...
How it works
There's more
Applying Scale-Invariant Feature Transform (SIFT)
Getting ready
How to do it...
See also
Detecting features with SURF
Getting ready
How to do it...
See also
Quantizing colors
Getting ready
How to do it...
See also
Denoising images
Getting ready
How to do it...
See also
Extracting patches from an image
Getting ready
How to do it...
See also
Detecting faces with Haar cascades
Getting ready
How to do it...
See also
Searching for bright stars
Getting ready
How to do it...
See also
Extracting metadata from images
Getting ready
How to do it...
See also
Extracting texture features from images
Getting ready
How to do it...
See also
Applying hierarchical clustering on images
How to do it...
See also
Segmenting images with spectral clustering
How to do it...
See also
12. Parallelism and Performance
Introduction
Just-in-time compiling with Numba
Getting ready
How to do it...
How it works
See also
Speeding up numerical expressions with Numexpr
How to do it...
How it works
See also
Running multiple threads with the threading module
How to do it...
See also
Launching multiple tasks with the concurrent.futures module
How to do it...
See also
Accessing resources asynchronously with the asyncio module
How to do it...
See also
Distributed processing with execnet
Getting ready
How to do it...
See also
Profiling memory usage
Getting ready
How to do it...
See also
Calculating the mean, variance, skewness, and kurtosis on the fly
Getting ready
How to do it...
See also
Caching with a least recently used cache
Getting ready
How to do it...
See also
Caching HTTP requests
Getting ready
How to do it...
See also
Streaming counting with the Count-min sketch
How to do it...
See also
Harnessing the power of the GPU with OpenCL
Getting ready
How to do it...
See also
A. Glossary
B. Function Reference
IPython
Matplotlib
NumPy
pandas
Scikit-learn
SciPy
Seaborn
Statsmodels
C. Online Resources
IPython notebooks and open data
Mathematics and statistics
Presentations
D. Tips and Tricks for Command-Line and Miscellaneous Tools
IPython notebooks
Command-line tools
The alias command
Command-line history
Reproducible sessions
Docker tips
Index

Python Data Analysis Cookbook

Copyright © 2016 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing, nor its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

First published: July 2016

Production reference: 1150716

Published by Packt Publishing Ltd.

Livery Place

35 Livery Street

Birmingham B3 2PB, UK.

ISBN 978-1-78528-228-7

www.packtpub.com

Credits

Author

Ivan Idris

Reviewers

Bill Chambers

Alexey Grigorev

Dr. Vahid Mirjalili

Michele Usuelli

Commissioning Editor

Akram Hussain

Acquisition Editor

Prachi Bisht

Content Development Editor

Rohit Singh

Technical Editor

Vivek Pala

Copy Editor

Pranjali Chury

Project Coordinator

Izzat Contractor

Proofreader

Safis Editing

Indexer

Rekha Nair

Graphics

Jason Monteiro

Production Coordinator

Aparna Bhagat

Cover Work

Aparna Bhagat

About the Author

Ivan Idris was born in Bulgaria to Indonesian parents. He moved to the Netherlands and graduated in experimental physics. His graduation thesis had a strong emphasis on applied computer science. After graduating, he worked for several companies as a software developer, data warehouse developer, and QA analyst.

His professional interests are business intelligence, big data, and cloud computing. He enjoys writing clean, testable code and interesting technical articles. He is the author of NumPy Beginner's Guide, NumPy Cookbook, Learning NumPy Array, and Python Data Analysis, all by Packt Publishing.

About the Reviewers

Bill Chambers is a data scientist from the UC Berkeley School of Information. He's focused on building technical systems and performing large-scale data analysis. At Berkeley, he has worked with everything from data science with Scala and Apache Spark to creating online Python courses for UC Berkeley's master of data science program. Prior to Berkeley, he was a business analyst at a software company where he was charged with the task of integrating multiple software systems and leading internal analytics and reporting. He contributed as a technical reviewer to the book Learning Pandas by Packt Publishing.

Alexey Grigorev is a skilled data scientist and software engineer with more than 5 years of professional experience. Currently, he works as a data scientist at Searchmetrics Inc. In his day-to-day job, he actively uses R and Python for data cleaning, data analysis, and modeling. He has contributed as a technical reviewer to other books on data analysis by Packt Publishing, such as Test-Driven Machine Learning and Mastering Data Analysis with R.

Dr. Vahid Mirjalili is a data scientist with a diverse background in engineering, mathematics, and computer science. Currently, he is working toward his graduate degree in computer science at Michigan State University. With his specialty in data mining, he is very interested in predictive modeling and getting insights from data. As a Python developer, he likes to contribute to the open source community. He has developed Python packages, such as PyClust, for data clustering. Furthermore, he is also focused on creating tutorials on different areas of data science, which can be found at his GitHub repository at http://github.com/mirjalil/DataScience.

The other books that he has reviewed include Python Machine Learning by Sebastian Raschka and Python Machine Learning Cookbook by Prateek Joshi. Furthermore, he is currently working on a book focused on big data analysis, covering the algorithms specifically suited to analyzing massive datasets.

Michele Usuelli is a data scientist, writer, and R enthusiast specializing in the fields of big data and machine learning. He currently works for Microsoft and joined through the acquisition of Revolution Analytics, the leading R-based company that builds a big data package for R. Michele graduated in mathematical engineering, and before Revolution, he worked with a big data start-up and a big publishing company. He is the author of R Machine Learning Essentials and Building a Recommendation System with R.

www.PacktPub.com

eBooks, discount offers, and more

Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at <[email protected]> for more details.

At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters and receive exclusive discounts and offers on Packt books and eBooks.

https://www2.packtpub.com/books/subscription/packtlib

Do you need instant solutions to your IT questions? PacktLib is Packt's online digital book library. Here, you can search, access, and read Packt's entire library of books.

Why subscribe?

  • Fully searchable across every book published by Packt
  • Copy and paste, print, and bookmark content
  • On demand and accessible via a web browser

Preface

 

"Data analysis is Python's killer app"

  --Unknown

This book is the follow-up to Python Data Analysis. The obvious question is, "what does this new book add?" as Python Data Analysis is pretty great (or so I like to believe) already. This book, Python Data Analysis Cookbook, is targeted at slightly more experienced Pythonistas. A year has passed, so we are using newer versions of software and software libraries that I didn't cover in Python Data Analysis. Also, I've had time to rethink and research, and as a result I decided the following:

  • I need to have a toolbox in order to make my life easier and increase reproducibility. I called the toolbox dautil and made it available via PyPi (which can be installed with pip/easy_install).
  • My soul-searching exercise led me to believe that I need to make it easier to obtain and install the required software. I published a Docker container (pydacbk) with some of the software we need via DockerHub. You can read more about the setup in Chapter 1, Laying the Foundation for Reproducible Data Analysis, and the online chapter. The Docker container is not ideal because it grew quite large, so I had to make some tough decisions. Since the container is not really part of the book, I think it will be appropriate if you contact me directly if you have any issues. However, please keep in mind that I can't change the image drastically.
  • This book uses the IPython Notebook, which has become a standard tool for analysis. I have given some related tips in the online chapter and other books I have written.
  • I am using Python 3 with very few exceptions because Python 2 will not be maintained after 2020.

Why do you need this book?

Some people will tell you that you don't need books, just get yourself an interesting project and figure out the rest as you go along. Although there are plenty of resources out there, this may be a very frustrating road. If you want to make a delicious soup, for example, you can of course ask friends and family, search the Internet, or watch cooking shows. However, your friends and family are not available full time for you and the quality of Internet content varies. And in my humble opinion, Packt Publishing, the reviewers, and I have spent so much time and energy on this book, that I will be surprised if you don't get any value out of it.

Data analysis, data science, big data – what is the big deal?

You probably have seen Venn diagrams depicting data science as the intersection of mathematics/statistics, computer science, and domain expertise. Data analysis is timeless and was there before data science and even before computer science. You could do data analysis with a pen and paper and, in more modern times, with a pocket calculator.

Data analysis has many aspects, with goals such as making decisions or coming up with new hypotheses and questions. The hype, status, and financial rewards surrounding data science and big data remind me of the time when data warehousing and business intelligence were the buzzwords. The ultimate goal of business intelligence and data warehousing was to build dashboards for management. This involved a lot of politics and organizational aspects, but on the technical side, it was mostly about databases. Data science, on the other hand, is not database-centric and leans heavily on machine learning. Machine learning techniques have become necessary because of the bigger volumes of data. The data growth is caused by the growth of the world population and the rise of new technologies, such as social media and mobile devices. The data growth is, in fact, probably the only trend that we can be sure of continuing. The difference between constructing dashboards and applying machine learning is analogous to the way search engines evolved.

Search engines (if you can call them that) were initially nothing more than well-organized collections of links created manually. Eventually, the automated approach won. Since, in time, more data will be created (and not destroyed), we can expect an increase in automated data analysis.

A brief history of data analysis with Python

The history of the various Python software libraries is quite interesting. I am not a historian, so the following notes are written from my own perspective:

  • 1989: Guido van Rossum implements the very first version of Python at the CWI in the Netherlands as a Christmas "hobby" project.
  • 1995: Jim Hugunin creates Numeric, the predecessor to NumPy.
  • 1999: Pearu Peterson writes f2py as a bridge between Fortran and Python.
  • 2000: Python 2.0 is released.
  • 2001: The SciPy library is released. Also, Numarray, a library competing with Numeric, is created. Fernando Perez releases IPython, which starts out as an "afternoon hack". NLTK is released as a research project.
  • 2002: John Hunter creates the Matplotlib library.
  • 2005: NumPy is released by Travis Oliphant. Initially, NumPy is Numeric extended with features inspired by Numarray.
  • 2006: NumPy 1.0 is released. The first version of SQLAlchemy is released.
  • 2007: The scikit-learn project is initiated as a Google Summer of Code project by David Cournapeau. Cython is forked from Pyrex. Cython is later used intensively in pandas and scikit-learn to improve performance.
  • 2008: Wes McKinney starts working on pandas. Python 3.0 is released.
  • 2011: The IPython 0.12 release introduces the IPython notebook. Packt Publishing releases NumPy 1.5 Beginner's Guide.
  • 2012: Packt Publishing releases NumPy Cookbook.
  • 2013: Packt Publishing releases NumPy Beginner's Guide, Second Edition.
  • 2014: Fernando Perez announces Project Jupyter, which aims to make a language-agnostic notebook. Packt Publishing releases Learning NumPy Array and Python Data Analysis.
  • 2015: Packt Publishing releases NumPy Beginner's Guide, Third Edition and NumPy Cookbook, Second Edition.

A conjecture about the future

The future is a bright place, where an incredible amount of data lives in the Cloud and software runs on any imaginable device with an intuitive customizable interface. (I know young people who can't stop talking about how awesome their phone is and how one day we will all be programming on tablets by dragging and dropping). It seems there is a certain angst in the Python community about not being relevant in the future. Of course, the more you have invested in Python, the more it matters.

To figure out what to do, we need to know what makes Python special. A school of thought claims that Python is a glue language gluing C, Fortran, R, Java, and other languages; therefore, we just need better glue. This probably also means "borrowing" features from other languages. Personally, I like the way Python works, its flexible nature, its data structures, and the fact that it has so many libraries and features. I think the future is in more delicious syntactic sugar and just-in-time compilers. Somehow, we should be able to continue writing Python code that is automatically converted for us into concurrent (machine) code. Unseen machinery under the hood manages lower-level details and sends data and instructions to CPUs, GPUs, or the Cloud. The code should be able to easily communicate with whatever storage backend we are using. Ideally, all of this magic will be just as convenient as automatic garbage collection. It may sound like an impossible "click of a button" dream, but I think it is worth pursuing.

What this book covers

Chapter 1, Laying the Foundation for Reproducible Data Analysis, is a pretty important chapter, and I recommend that you do not skip it. It explains Anaconda, Docker, unit testing, logging, and other essential elements of reproducible data analysis.

Chapter 2, Creating Attractive Data Visualizations, demonstrates how to visualize data and mentions frequently encountered pitfalls.

Chapter 3, Statistical Data Analysis and Probability, discusses statistical probability distributions and correlation between two variables.

Chapter 4, Dealing with Data and Numerical Issues, is about outliers and other common data issues. Data is almost never perfect, so a large portion of the analysis effort goes into dealing with data imperfections.

Chapter 5, Web Mining, Databases, and Big Data, is light on mathematics, but more focused on technical topics, such as databases, web scraping, and big data.

Chapter 6, Signal Processing and Timeseries, is about time series data, which is abundant and requires special techniques. Usually, we are interested in trends and seasonality or periodicity.

Chapter 7, Selecting Stocks with Financial Data Analysis, focuses on stock investing because stock price data is abundant. This is the only chapter on finance, and its content should be at least partially relevant even if stocks don't interest you.

Chapter 8, Text Mining and Social Network Analysis, helps you cope with the floods of textual and social media information.

Chapter 9, Ensemble Learning and Dimensionality Reduction, covers ensemble learning, classification and regression algorithms, as well as hierarchical clustering.

Chapter 10, Evaluating Classifiers, Regressors, and Clusters, evaluates the classifiers and regressors from the preceding chapter, Chapter 9, Ensemble Learning and Dimensionality Reduction.

Chapter 11, Analyzing Images, uses the OpenCV library quite a lot to analyze images.

Chapter 12, Parallelism and Performance, is about software performance, and I discuss various options to improve performance, including caching and just-in-time compilers.

Appendix A, Glossary, is a brief glossary of technical concepts used throughout the book. The goal is to have a reference that is easy to look up.

Appendix B, Function Reference, is a short reference of functions meant as an extra aid in case you are temporarily unable to look up documentation.

Appendix C, Online Resources, lists resources including presentations, links to documentation, and freely available IPython notebooks and data. This appendix is available as an online chapter.

Appendix D, Tips and Tricks for Command-Line and Miscellaneous Tools, covers the various tools used in this book, such as the IPython notebook, Docker, and Unix shell commands. I give a short list of tips that is not meant to be exhaustive. This appendix is also available as an online chapter.

What you need for this book

First, you need a Python 3 distribution. I recommend the full Anaconda distribution as it comes with the majority of the software we need. I tested the code with Python 3.4 and the following packages:

  • joblib 0.8.4
  • IPython 3.2.1
  • NetworkX 1.9.1
  • NLTK 3.0.2
  • Numexpr 2.3.1
  • pandas 0.16.2
  • SciPy 0.16.0
  • seaborn 0.6.0
  • sqlalchemy 0.9.9
  • statsmodels 0.6.1
  • matplotlib 1.5.0
  • NumPy 1.10.1
  • scikit-learn 0.17
  • dautil 0.0.1a29
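
If you want to verify your setup against this list, you can query each package's version from Python. Here is a minimal sketch (note that import names occasionally differ from package names, for example, sklearn for scikit-learn):

import matplotlib
import numpy as np
import pandas as pd
import sklearn

# Most scientific Python libraries expose their version as a __version__ string
print('matplotlib', matplotlib.__version__)
print('NumPy', np.__version__)
print('pandas', pd.__version__)
print('scikit-learn', sklearn.__version__)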

For some recipes, you need to install extra software, but this is explained whenever the software is required.

Who this book is for

This book is hands-on and low on theory. You should have better-than-beginner Python knowledge and some knowledge of linear algebra, calculus, machine learning, and statistics. Ideally, you would have read Python Data Analysis, but this is not a requirement. I also recommend the following books:

  • Building Machine Learning Systems with Python by Willi Richert and Luis Pedro Coelho, 2013
  • Learning NumPy Array by Ivan Idris, 2014
  • Learning scikit-learn: Machine Learning in Python by Guillermo Moncecchi, 2013
  • Learning SciPy for Numerical and Scientific Computing by Francisco J. Blanco-Silva, 2013
  • Matplotlib for Python Developers by Sandro Tosi, 2009
  • NumPy Beginner's Guide – Third Edition by Ivan Idris, 2015
  • NumPy Cookbook – Second Edition by Ivan Idris, 2015
  • Parallel Programming with Python by Jan Palach, 2014
  • Python Data Visualization Cookbook by Igor Milovanović, 2013
  • Python for Finance by Yuxing Yan, 2014
  • Python Text Processing with NLTK 2.0 Cookbook by Jacob Perkins, 2010

Sections

In this book, you will find several headings that appear frequently (Getting ready, How to do it, How it works, There's more, and See also).

To give clear instructions on how to complete a recipe, we use these sections as follows:

Getting ready

This section tells you what to expect in the recipe, and describes how to set up any software or any preliminary settings required for the recipe.

How to do it…

This section contains the steps required to follow the recipe.

How it works…

This section usually consists of a detailed explanation of what happened in the previous section.

There's more…

This section consists of additional information about the recipe in order to make the reader more knowledgeable about the recipe.

See also

This section provides helpful links to other useful information for the recipe.

Reader feedback

Feedback from our readers is always welcome. Let us know what you think about this book—what you liked or disliked. Reader feedback is important for us as it helps us develop titles that you will really get the most out of.

To send us general feedback, simply e-mail <[email protected]>, and mention the book's title in the subject of your message.

If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide at www.packtpub.com/authors.

Customer support

Now that you are the proud owner of a Packt book, we have a number of things to help you to get the most from your purchase.

Downloading the example code

You can download the example code files for this book from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.

You can download the code files by following these steps:

  • Log in or register to our website using your e-mail address and password.
  • Hover the mouse pointer on the SUPPORT tab at the top.
  • Click on Code Downloads & Errata.
  • Enter the name of the book in the Search box.
  • Select the book for which you're looking to download the code files.
  • Choose from the drop-down menu where you purchased this book from.
  • Click on Code Download.

Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:

  • WinRAR / 7-Zip for Windows
  • Zipeg / iZip / UnRarX for Mac
  • 7-Zip / PeaZip for Linux

The code bundle for the book is also hosted on GitHub at https://github.com/PacktPublishing/PythonDataAnalysisCookbook. We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!

Errata

Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books—maybe a mistake in the text or the code—we would be grateful if you could report this to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this book. If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting your book, clicking on the Errata Submission Form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded to our website or added to any list of existing errata under the Errata section of that title.

To view the previously submitted errata, go to https://www.packtpub.com/books/content/support and enter the name of the book in the search field. The required information will appear under the Errata section.

Piracy

Piracy of copyrighted material on the Internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works in any form on the Internet, please provide us with the location address or website name immediately so that we can pursue a remedy.

Please contact us at <[email protected]> with a link to the suspected pirated material.

We appreciate your help in protecting our authors and our ability to bring you valuable content.

Questions

If you have a problem with any aspect of this book, you can contact us at <[email protected]>, and we will do our best to address the problem.

Chapter 1. Laying the Foundation for Reproducible Data Analysis

In this chapter, we will cover the following recipes:

  • Setting up Anaconda
  • Installing the Data Science Toolbox
  • Creating a virtual environment with virtualenv and virtualenvwrapper
  • Sandboxing Python applications with Docker images
  • Keeping track of package versions and history in IPython Notebook
  • Configuring IPython
  • Learning to log for robust error checking
  • Unit testing your code
  • Configuring pandas
  • Configuring matplotlib
  • Seeding random number generators and NumPy print options
  • Standardizing reports, code style, and data access

Introduction

Reproducible data analysis is a cornerstone of good science. In today's rapidly evolving world of science and technology, reproducibility is a hot topic. Reproducibility is about lowering barriers for other people. It may seem strange or unnecessary, but reproducible analysis is essential to get your work acknowledged by others. If a lot of people confirm your results, it will have a positive effect on your career. However, reproducible analysis is hard. It has important economic consequences, as you can read in Freedman LP, Cockburn IM, Simcoe TS (2015) The Economics of Reproducibility in Preclinical Research. PLoS Biol 13(6): e1002165. doi:10.1371/journal.pbio.1002165.

So reproducibility is important for society and for you, but how does it apply to Python users? Well, we want to lower barriers for others by:

  • Giving information about the software and hardware we used, including versions.
  • Sharing virtual environments.
  • Logging program behavior.
  • Unit testing the code. This also serves as documentation of sorts.
  • Sharing configuration files.
  • Seeding random generators and making sure program behavior is as deterministic as possible (see the short example after this list).
  • Standardizing reporting, data access, and code style.
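
As a small taste of the seeding point above, making NumPy's random draws repeatable takes a single call. This is a minimal sketch; the full recipe, Seeding random number generators and NumPy print options, follows later in this chapter:

import numpy as np

np.random.seed(42)  # a fixed seed makes the draws below deterministic
print(np.random.rand(3))  # prints the same three numbers on every run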

I created the dautil package for this book, which you can install with pip or from the source archive provided in this book's code bundle. If you are in a hurry, run $ python install_ch1.py to install most of the software for this chapter, including dautil. I created a test Docker image, which you can use if you don't want to install anything except Docker (see the recipe, Sandboxing Python applications with Docker images).
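
For example, installing dautil from PyPi comes down to a single command (pinning the exact version is optional):
$ pip install dautil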

Setting up Anaconda

Anaconda is a free Python distribution for data analysis and scientific computing. It has its own package manager, conda. The distribution includes more than 200 Python packages, which makes it very convenient. For casual users, the Miniconda distribution may be the better choice. Miniconda contains the conda package manager and Python. The technical editors use Anaconda, and so do I. But don't worry, I will describe alternative installation instructions in this book for readers who are not using Anaconda. In this recipe, we will install Anaconda and Miniconda and create a virtual environment.

Getting ready

The procedures to install Anaconda and Miniconda are similar. Obviously, Anaconda requires more disk space. Follow the instructions on the Anaconda website at http://conda.pydata.org/docs/install/quick.html (retrieved Mar 2016). First, you have to download the appropriate installer for your operating system and Python version. Sometimes, you can choose between a GUI and a command-line installer. I used the Python 3.4 installer, although my system Python version is v2.7. This is possible because Anaconda comes with its own Python. On my machine, the Anaconda installer created an anaconda directory in my home directory and required about 900 MB. The Miniconda installer installs a miniconda directory in your home directory.

How to do it...

Now that Anaconda or Miniconda is installed, list the packages with the following command:
$ conda list
For reproducibility, it is good to know that we can export packages:
$ conda list --export
The preceding command prints packages and versions on the screen, which you can save in a file. You can install these packages with the following command:
$ conda create -n ch1env --file <export file>

This command also creates an environment named ch1env.
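
For instance, you can capture the export in a file and rebuild the environment from it like this (the filename ch1env.txt is just an illustrative choice):
$ conda list --export > ch1env.txt
$ conda create -n ch1env --file ch1env.txt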

The following command creates a simple testenv environment:
$ conda create --name testenv python=3
On Linux and Mac OS X, switch to this environment with the following command:
$ source activate testenv
On Windows, we don't need source. The syntax to switch back is similar:
$ [source] deactivate
The following command prints export information for the environment in the YAML (explained in the following section) format:
$ conda env export -n testenv
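You can redirect this YAML to a file and recreate the environment from it elsewhere (environment.yml is the conventional filename, not a requirement):
$ conda env export -n testenv > environment.yml
$ conda env create -f environment.yml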
To remove the environment, type the following (note that even after removing, the name of the environment still exists in ~/.conda/environments.txt):
$ conda remove -n testenv --all
Search for a package as follows:
$ conda search numpy

In this example, we searched for the NumPy package. If NumPy is already present, Anaconda shows an asterisk in the output at the corresponding entry.

Update the distribution as follows:
$ conda update conda

There's more...

The .condarc configuration file follows the YAML syntax.

Note

YAML is a human-readable configuration file format with the extension .yaml or .yml. YAML was initially released in 2001, with the latest release in 2009. The YAML homepage is at http://yaml.org/ (retrieved July 2015).

You can find a sample configuration file at http://conda.pydata.org/docs/install/sample-condarc.html (retrieved July 2015). The related documentation is at http://conda.pydata.org/docs/install/config.html (retrieved July 2015).
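
As an illustration, a minimal .condarc could look as follows (illustrative values; channels and envs_dirs are standard keys described in the documentation linked above):

channels:
  - defaults
envs_dirs:
  - ~/envs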

See also

  • Martins, L. Felipe (November 2014). IPython Notebook Essentials (1st Edition). Packt Publishing. p. 190. ISBN 1783988347
  • The conda user cheat sheet at http://conda.pydata.org/docs/_downloads/conda-cheatsheet.pdf (retrieved July 2015)

Installing the Data Science Toolbox

The Data Science Toolbox (DST) is a virtual environment based on Ubuntu for data analysis using Python and R. Since DST is a virtual environment, we can install it on various operating systems. We will install DST locally, which requires VirtualBox and Vagrant. VirtualBox is a virtual machine application originally created by Innotek GmbH in 2007. Vagrant is a wrapper around virtual machine applications such as VirtualBox created by Mitchell Hashimoto.

Getting ready

You need on the order of 2 to 3 GB free for VirtualBox, Vagrant, and DST itself. This may vary by operating system.

How to do it...

Installing DST requires the following steps:

Install VirtualBox by downloading an installer for your operating system and architecture from https://www.virtualbox.org/wiki/Downloads (retrieved July 2015) and running it. I installed VirtualBox 4.3.28-100309 myself, but you can just install the most recent VirtualBox version available at the time.

Install Vagrant by downloading an installer for your operating system and architecture from https://www.vagrantup.com/downloads.html (retrieved July 2015). I installed Vagrant 1.7.2, and again you can install a more recent version if available.

Create a directory to hold the DST and navigate to it with a terminal. Run the following commands:
$ vagrant init data-science-toolbox/dst
$ vagrant up

The first command creates a VagrantFile configuration file. Most of the content is commented out, but the file does contain links to documentation that might be useful. The second command creates the DST and initiates a download that could take a couple of minutes.

Connect to the virtual environment as follows (on Windows, use PuTTY):
$ vagrant ssh
View the preinstalled Python packages with the following command:
vagrant@data-science-toolbox:~$ pip freeze

The list is quite long; in my case it contained 32 packages. The DST Python version as of July 2015 was 2.7.6.

When you are done with the DST, log out and suspend the VM (you can also halt it completely):
vagrant@data-science-toolbox:~$ logout
Connection to 127.0.0.1 closed.
$ vagrant suspend
==> default: Saving VM state and suspending execution...

How it works...

Virtual machines (VMs) emulate computers in software. VirtualBox is an application that creates and manages VMs. VirtualBox stores its VMs in your home folder, and this particular VM takes about 2.2 GB of storage.

Ubuntu is an open source Linux operating system, and we are allowed by its license to create virtual machines. Ubuntu has several versions; we can get more info with the lsb_release command:

vagrant@data-science-toolbox:~$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 14.04 LTS
Release: 14.04
Codename: trusty

Vagrant used to only work with VirtualBox, but currently it also supports VMware, KVM, Docker, and Amazon EC2. Vagrant calls virtual machines boxes. Some of these boxes are available for everyone at http://www.vagrantbox.es/ (retrieved July 2015).

See also

  • Run Ubuntu Linux Within Windows Using VirtualBox at http://linux.about.com/od/howtos/ss/Run-Ubuntu-Linux-Within-Windows-Using-VirtualBox.htm#step11 (retrieved July 2015)
  • VirtualBox manual chapter 10, Technical Information, at https://www.virtualbox.org/manual/ch10.html (retrieved July 2015)

Creating a virtual environment with virtualenv and virtualenvwrapper

Virtual environments provide dependency isolation for small projects. They also keep your site-packages directory small. Since Python 3.3, comparable functionality has been part of the standard Python distribution in the form of the venv module and the pyvenv script. The virtualenvwrapper Python project has some extra convenient features for virtual environment management. I will demonstrate virtualenv and virtualenvwrapper functionality in this recipe.

Getting ready

You need Python 3.3 or later. You can install virtualenvwrapper with the pip command as follows:

$ [sudo] pip install virtualenvwrapper

On Linux and Mac, it's necessary to do some extra work—specifying a directory for the virtual environments and sourcing a script:

$ export WORKON_HOME=/tmp/envs
$ source /usr/local/bin/virtualenvwrapper.sh

Windows has a separate version, which you can install with the following command:

$ pip install virtualenvwrapper-win

How to do it...

Create a virtual environment for a given directory with the pyvenv script, which is part of your Python distribution:
$ pyvenv /tmp/testenv
$ ls
bin include lib pyvenv.cfg

In this example, we created a testenv directory in the /tmp directory with several directories and a configuration file. The configuration file pyvenv.cfg contains the Python version and the home directory of the Python distribution.

Activate the environment on Linux or Mac by sourcing the activate script, for example, with the following command:
$ source bin/activate

On Windows, use the activate.bat file.

You can now install packages in this environment in isolation. When you are done with the environment, switch back on Linux or Mac with the following command:
$ deactivate

On Windows, use the deactivate.bat file.

Alternatively, you could use virtualenvwrapper. Create and switch to a virtual environment with the following command:
vagrant@data-science-toolbox:~$ mkvirtualenv env2
Deactivate the environment with the deactivate command:
(env2)vagrant@data-science-toolbox:~$ deactivate
Delete the environment with the rmvirtualenv command:
vagrant@data-science-toolbox:~$ rmvirtualenv env2
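Throughout this workflow, virtualenvwrapper's workon command is also handy: called without arguments, it lists the available environments, and given a name, it switches to that environment (a quick illustration, assuming an environment named env2 exists):
$ workon
$ workon env2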

See also

  • The Python standard library documentation for virtual environments at https://docs.python.org/3/library/venv.html#creating-virtual-environments (retrieved July 2015)
  • The virtualenvwrapper documentation at https://virtualenvwrapper.readthedocs.org/en/latest/index.html (retrieved July 2015)

Sandboxing Python applications with Docker images

Docker uses Linux kernel features to provide an extra virtualization layer. Docker was created in 2013 by Solomon Hykes. Boot2Docker allows us to install Docker on Windows and Mac OS X too. Boot2Docker uses a VirtualBox VM that contains a Linux environment with Docker. In this recipe, we will set up Docker and download the continuumio/miniconda3 Docker image.

Getting ready

The Docker installation docs can be found at https://docs.docker.com/index.html (retrieved July 2015). I installed Docker 1.7.0 with Boot2Docker. The installer requires about 133 MB. However, if you want to follow the whole recipe, you will need several gigabytes.

How to do it...

Once Boot2Docker is installed, you need to initialize the environment. This is only necessary once, and Linux users don't need this step:
$ boot2docker init
Latest release for github.com/boot2docker/boot2docker is v1.7.0
Downloading boot2docker ISO image...
Success: downloaded https://github.com/boot2docker/boot2docker/releases/download/v1.7.0/boot2docker.iso
In the preceding step, you downloaded a VirtualBox VM to a directory such as /VirtualBox\ VMs/boot2docker-vm/.

The next step for Mac OS X and Windows users is to start the VM:

$ boot2docker start
Check the Docker environment by starting a sample container:
$ docker run hello-world

Note

Some people reported a hopefully temporary issue of not being able to connect. The issue can be resolved by issuing commands with an extra argument, for instance:

$ docker [--tlsverify=false] run hello-world
Docker images can be made public. We can search for such images and download them. In Setting up Anaconda, we installed Anaconda; however, Anaconda and Miniconda Docker images also exist. Use the following command:
$ docker search continuumio
The preceding command shows a list of Docker images from Continuum Analytics – the company that developed Anaconda and Miniconda. Download the Miniconda 3 Docker image as follows (if you prefer using my container, skip this):
$ docker pull continuumio/miniconda3
Start the image with the following command:
$ docker run -t -i continuumio/miniconda3 /bin/bash

We start out as root in the image.
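
From here, you can check what the image ships with, for example (the exact versions depend on the image you pulled):
# conda --version
# python --version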

The command $ docker images should list the continuumio/miniconda3 image as well. If you prefer not to install too much software (possibly only Docker and Boot2Docker) for this book, you should use the image I created. It uses the continuumio/miniconda3 image as a template. This image allows you to execute Python scripts in the current working directory on your computer, while using installed software from the Docker image:
$ docker run -it -p 8888:8888 -v $(pwd):/usr/data -w /usr/data "ivanidris/pydacbk:latest" python <somefile>.py
You can also run an IPython notebook in your current working directory with the following command: