E-Book
41,99 €

Python Data Analysis Cookbook E-Book

Ivan Idris

0,0

41,99 €

Sammeln Sie Punkte in unserem Gutscheinprogramm und kaufen Sie E-Books und Hörbücher mit bis zu 100% Rabatt.

Mehr erfahren.

Herausgeber: Packt Publishing
Kategorie: Wissenschaft und neue Technologien
Sprache: Englisch

Beschreibung

Over 140 practical recipes to help you make sense of your data with ease and build production-ready data apps

About This Book

Analyze Big Data sets, create attractive visualizations, and manipulate and process various data types
Packed with rich recipes to help you learn and explore amazing algorithms for statistics and machine learning
Authored by Ivan Idris, expert in python programming and proud author of eight highly reviewed books

Who This Book Is For

This book teaches Python data analysis at an intermediate level with the goal of transforming you from journeyman to master. Basic Python and data analysis skills and affinity are assumed.

What You Will Learn

Set up reproducible data analysis
Clean and transform data
Apply advanced statistical analysis
Create attractive data visualizations
Web scrape and work with databases, Hadoop, and Spark
Analyze images and time series data
Mine text and analyze social networks
Use machine learning and evaluate the results
Take advantage of parallelism and concurrency

In Detail

Data analysis is a rapidly evolving field and Python is a multi-paradigm programming language suitable for object-oriented application development and functional design patterns. As Python offers a range of tools and libraries for all purposes, it has slowly evolved as the primary language for data science, including topics on: data analysis, visualization, and machine learning.

Python Data Analysis Cookbook focuses on reproducibility and creating production-ready systems. You will start with recipes that set the foundation for data analysis with libraries such as matplotlib, NumPy, and pandas. You will learn to create visualizations by choosing color maps and palettes then dive into statistical data analysis using distribution algorithms and correlations. You'll then help you find your way around different data and numerical problems, get to grips with Spark and HDFS, and then set up migration scripts for web mining.

In this book, you will dive deeper into recipes on spectral analysis, smoothing, and bootstrapping methods. Moving on, you will learn to rank stocks and check market efficiency, then work with metrics and clusters. You will achieve parallelism to improve system performance by using multiple threads and speeding up your code.

By the end of the book, you will be capable of handling various data analysis techniques in Python and devising solutions for problem scenarios.

Style and Approach

The book is written in “cookbook” style striving for high realism in data analysis. Through the recipe-based format, you can read each recipe separately as required and immediately apply the knowledge gained.

Details

Sie lesen das E-Book in den Legimi-Apps auf:

Android

iOS

von Legimi
zertifizierten E-Readern

Seitenzahl: 397

Veröffentlichungsjahr: 2016

Bewertungen

0,0

Rezensionen(0 Rezensionen)

Leseprobe

Python Data Analysis Cookbook

Credits

About the Author

About the Reviewers

www.PacktPub.com

eBooks, discount offers, and more

Why subscribe?

Preface

Why do you need this book?

Data analysis, data science, big data – what is the big deal?

A brief of history of data analysis with Python

A conjecture about the future

What this book covers

What you need for this book

Who this book is for

Sections

Getting ready

How to do it…

How it works…

There's more…

See also

Conventions

Reader feedback

Customer support

Downloading the example code

Errata

Piracy

Questions

1. Laying the Foundation for Reproducible Data Analysis

Introduction

Setting up Anaconda

Getting ready

How to do it...

There's more...

See also

Installing the Data Science Toolbox

Getting ready

How to do it...

How it works...

See also

Creating a virtual environment with virtualenv and virtualenvwrapper

Getting ready

How to do it...

See also

Sandboxing Python applications with Docker images

Getting ready

How to do it...

How it works...

See also

Keeping track of package versions and history in IPython Notebook

Getting ready

How to do it...

How it works...

See also

Configuring IPython

Getting ready

How to do it...

See also

Learning to log for robust error checking

Getting ready

How to do it...

How it works...

See also

Unit testing your code

Getting ready

How to do it...

How it works...

See also

Configuring pandas

Getting ready

How to do it...

Configuring matplotlib

Getting ready

How to do it...

How it works...

See also

Seeding random number generators and NumPy print options

Getting ready

How to do it...

See also

Standardizing reports, code style, and data access

Getting ready

How to do it...

See also

2. Creating Attractive Data Visualizations

Introduction

Graphing Anscombe's quartet

How to do it...

See also

Choosing seaborn color palettes

How to do it...

See also

Choosing matplotlib color maps

How to do it...

See also

Interacting with IPython Notebook widgets

How to do it...

See also

Viewing a matrix of scatterplots

How to do it...

Visualizing with d3.js via mpld3

Getting ready

How to do it...

Creating heatmaps

Getting ready

How to do it...

See also

Combining box plots and kernel density plots with violin plots

How to do it...

See also

Visualizing network graphs with hive plots

Getting ready

How to do it...

Displaying geographical maps

Getting ready

How to do it...

Using ggplot2-like plots

Getting ready

How to do it...

Highlighting data points with influence plots

How to do it...

See also

3. Statistical Data Analysis and Probability

Introduction

Fitting data to the exponential distribution

How to do it...

How it works…

See also

Fitting aggregated data to the gamma distribution

How to do it...

See also

Fitting aggregated counts to the Poisson distribution

How to do it...

See also

Determining bias

How to do it...

See also

Estimating kernel density

How to do it...

See also

Determining confidence intervals for mean, variance, and standard deviation

How to do it...

See also

Sampling with probability weights

How to do it...

See also

Exploring extreme values

How to do it...

See also

Correlating variables with Pearson's correlation

How to do it...

See also

Correlating variables with the Spearman rank correlation

How to do it...

See also

Correlating a binary and a continuous variable with the point biserial correlation

How to do it...

See also

Evaluating relations between variables with ANOVA

How to do it...

See also

4. Dealing with Data and Numerical Issues

Introduction

Clipping and filtering outliers

How to do it...

See also

Winsorizing data

How to do it...

See also

Measuring central tendency of noisy data

How to do it...

See also

Normalizing with the Box-Cox transformation

How to do it...

How it works

See also

Transforming data with the power ladder

How to do it...

Transforming data with logarithms

How to do it...

Rebinning data

How to do it...

Applying logit() to transform proportions

How to do it...

Fitting a robust linear model

How to do it...

See also

Taking variance into account with weighted least squares

How to do it...

See also

Using arbitrary precision for optimization

Getting ready

How to do it...

See also

Using arbitrary precision for linear algebra

Getting ready

How to do it...

See also

5. Web Mining, Databases, and Big Data

Introduction

Simulating web browsing

Getting ready

How to do it…

See also

Scraping the Web

Getting ready

How to do it…

Dealing with non-ASCII text and HTML entities

Getting ready

How to do it…

See also

Implementing association tables

Getting ready

How to do it…

Setting up database migration scripts

Getting ready

How to do it…

See also

Adding a table column to an existing table

Getting ready

How to do it…

Adding indices after table creation

Getting ready

How to do it…

How it works…

See also

Setting up a test web server

Getting ready

How to do it…

Implementing a star schema with fact and dimension tables

How to do it…

See also

Using HDFS

Getting ready

How to do it…

See also

Setting up Spark

Getting ready

How to do it…

See also

Clustering data with Spark

Getting ready

How to do it…

How it works…

There's more…

See also

6. Signal Processing and Timeseries

Introduction

Spectral analysis with periodograms

How to do it...

See also

Estimating power spectral density with the Welch method

How to do it...

See also

Analyzing peaks

How to do it...

See also

Measuring phase synchronization

How to do it...

See also

Exponential smoothing

How to do it...

See also

Evaluating smoothing

How to do it...

See also

Using the Lomb-Scargle periodogram

How to do it...

See also

Analyzing the frequency spectrum of audio

How to do it...

See also

Analyzing signals with the discrete cosine transform

How to do it...

See also

Block bootstrapping time series data

How to do it...

See also

Moving block bootstrapping time series data

How to do it...

See also

Applying the discrete wavelet transform

Getting started

How to do it...

See also

7. Selecting Stocks with Financial Data Analysis

Introduction

Computing simple and log returns

How to do it...

See also

Ranking stocks with the Sharpe ratio and liquidity

How to do it...

See also

Ranking stocks with the Calmar and Sortino ratios

How to do it...

See also

Analyzing returns statistics

How to do it...

Correlating individual stocks with the broader market

How to do it...

Exploring risk and return

How to do it...

See also

Examining the market with the non-parametric runs test

How to do it...

See also

Testing for random walks

How to do it...

See also

Determining market efficiency with autoregressive models

How to do it...

See also

Creating tables for a stock prices database

How to do it...

Populating the stock prices database

How to do it...

Optimizing an equal weights two-asset portfolio

How to do it...

See also

8. Text Mining and Social Network Analysis

Introduction

Creating a categorized corpus

Getting ready

How to do it...

See also

Tokenizing news articles in sentences and words

Getting ready

How to do it...

See also

Stemming, lemmatizing, filtering, and TF-IDF scores

Getting ready

How to do it...

How it works

See also

Recognizing named entities

Getting ready

How to do it...

How it works

See also

Extracting topics with non-negative matrix factorization

How to do it...

How it works

See also

Implementing a basic terms database

How to do it...

How it works

See also

Computing social network density

Getting ready

How to do it...

See also

Calculating social network closeness centrality

Getting ready

How to do it...

See also

Determining the betweenness centrality

Getting ready

How to do it...

See also

Estimating the average clustering coefficient

Getting ready

How to do it...

See also

Calculating the assortativity coefficient of a graph

Getting ready

How to do it...

See also

Getting the clique number of a graph

Getting ready

How to do it...

See also

Creating a document graph with cosine similarity

How to do it...

See also

9. Ensemble Learning and Dimensionality Reduction

Introduction

Recursively eliminating features

How to do it...

How it works

See also

Applying principal component analysis for dimension reduction

How to do it...

See also

Applying linear discriminant analysis for dimension reduction

How to do it...

See also

Stacking and majority voting for multiple models

How to do it...

See also

Learning with random forests

How to do it...

There's more…

See also

Fitting noisy data with the RANSAC algorithm

How to do it...

See also

Bagging to improve results

How to do it...

See also

Boosting for better learning

How to do it...

See also

Nesting cross-validation

How to do it...

See also

Reusing models with joblib

How to do it...

See also

Hierarchically clustering data

How to do it...

See also

Taking a Theano tour

Getting ready

How to do it...

See also

10. Evaluating Classifiers, Regressors, and Clusters

Introduction

Getting classification straight with the confusion matrix

How to do it...

How it works

See also

Computing precision, recall, and F1-score

How to do it...

See also

Examining a receiver operating characteristic and the area under a curve

How to do it...

See also

Visualizing the goodness of fit

How to do it...

See also

Computing MSE and median absolute error

How to do it...

See also

Evaluating clusters with the mean silhouette coefficient

How to do it...

See also

Comparing results with a dummy classifier

How to do it...

See also

Determining MAPE and MPE

How to do it...

See also

Comparing with a dummy regressor

How to do it...

See also

Calculating the mean absolute error and the residual sum of squares

How to do it...

See also

Examining the kappa of classification

How to do it...

How it works

See also

Taking a look at the Matthews correlation coefficient

How to do it...

See also

11. Analyzing Images

Introduction

Setting up OpenCV

Getting ready

How to do it...

How it works

There's more

Applying Scale-Invariant Feature Transform (SIFT)

Getting ready

How to do it...

See also

Detecting features with SURF

Getting ready

How to do it...

See also

Quantizing colors

Getting ready

How to do it...

See also

Denoising images

Getting ready

How to do it...

See also

Extracting patches from an image

Getting ready

How to do it...

See also

Detecting faces with Haar cascades

Getting ready

How to do it...

See also

Searching for bright stars

Getting ready

How to do it...

See also

Extracting metadata from images

Getting ready

How to do it...

See also

Extracting texture features from images

Getting ready

How to do it...

See also

Applying hierarchical clustering on images

How to do it...

See also

Segmenting images with spectral clustering

How to do it...

See also

12. Parallelism and Performance

Introduction

Just-in-time compiling with Numba

Getting ready

How to do it...

How it works

See also

Speeding up numerical expressions with Numexpr

How to do it...

How it works

See also

Running multiple threads with the threading module

How to do it...

See also

Launching multiple tasks with the concurrent.futures module

How to do it...

See also

Accessing resources asynchronously with the asyncio module

How to do it...

See also

Distributed processing with execnet

Getting ready

How to do it...

See also

Profiling memory usage

Getting ready

How to do it...

See also

Calculating the mean, variance, skewness, and kurtosis on the fly

Getting ready

How to do it...

See also

Caching with a least recently used cache

Getting ready

How to do it...

See also

Caching HTTP requests

Getting ready

How to do it...

See also

Streaming counting with the Count-min sketch

How to do it...

See also

Harnessing the power of the GPU with OpenCL

Getting ready

How to do it...

See also

A. Glossary

B. Function Reference

IPython

Matplotlib

NumPy

pandas

Scikit-learn

SciPy

Seaborn

Statsmodels

C. Online Resources

IPython notebooks and open data

Mathematics and statistics

Presentations

D. Tips and Tricks for Command-Line and Miscellaneous Tools

IPython notebooks

Command-line tools

The alias command

Command-line history

Reproducible sessions

Docker tips

Index

Python Data Analysis Cookbook

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

First published: July 2016

Production reference: 1150716

Published by Packt Publishing Ltd.

Livery Place

35 Livery Street

Birmingham B3 2PB, UK.

ISBN 978-1-78528-228-7

www.packtpub.com

Credits

Author

Ivan Idris

Reviewers

Bill Chambers

Alexey Grigorev

Dr. Vahid Mirjalili

Michele Usuelli

Commissioning Editor

Akram Hussain

Acquisition Editor

Prachi Bisht

Content Development Editor

Rohit Singh

Technical Editor

Vivek Pala

Copy Editor

Pranjali Chury

Project Coordinator

Izzat Contractor

Proofreader

Safis Editing

Indexer

Rekha Nair

Graphics

Jason Monteiro

Production Coordinator

Aparna Bhagat

Cover Work

Aparna Bhagat

About the Author

Ivan Idris was born in Bulgaria to Indonesian parents. He moved to the Netherlands and graduated in experimental physics. His graduation thesis had a strong emphasis on applied computer science. After graduating, he worked for several companies as a software developer, data warehouse developer, and QA analyst.

His professional interests are business intelligence, big data, and cloud computing. He enjoys writing clean, testable code and interesting technical articles. He is the author of NumPy Beginner's Guide, NumPy Cookbook, Learning NumPy, and Python Data Analysis, all by Packt Publishing.

About the Reviewers

Bill Chambers is a data scientist from the UC Berkeley School of Information. He's focused on building technical systems and performing large-scale data analysis. At Berkeley, he has worked with everything from data science with Scala and Apache Spark to creating online Python courses for UC Berkeley's master of data science program. Prior to Berkeley, he was a business analyst at a software company where he was charged with the task of integrating multiple software systems and leading internal analytics and reporting. He contributed as a technical reviewer to the book Learning Pandas by Packt Publishing.

Alexey Grigorev is a skilled data scientist and software engineer with more than 5 years of professional experience. Currently, he works as a data scientist at Searchmetrics Inc. In his day-to-day job, he actively uses R and Python for data cleaning, data analysis, and modeling. He has contributed as a technical reviewer to other books on data analysis by Packt Publishing, such as Test-Driven Machine Learning and Mastering Data Analysis with R.

Dr. Vahid Mirjalili is a data scientist with a diverse background in engineering, mathematics, and computer science. Currently, he is working toward his graduate degree in computer science at Michigan State University. With his specialty in data mining, he is very interested in predictive modeling and getting insights from data. As a Python developer, he likes to contribute to the open source community. He has developed Python packages, such as PyClust, for data clustering. Furthermore, he is also focused on making tutorials for different directions of data science, which can be found at his Github repository at http://github.com/mirjalil/DataScience.

The other books that he has reviewed include Python Machine Learning by Sebastian Raschka and Python Machine Learning Cookbook by Parteek Joshi. Furthermore, he is currently working on a book focused on big data analysis, covering the algorithms specifically suited to analyzing massive datasets.

Michele Usuelli is a data scientist, writer, and R enthusiast specializing in the fields of big data and machine learning. He currently works for Microsoft and joined through the acquisition of Revolution Analytics, the leading R-based company that builds a big data package for R. Michele graduated in mathematical engineering, and before Revolution, he worked with a big data start-up and a big publishing company. He is the author of R Machine Learning Essentials and Building a Recommendation System with R.

www.PacktPub.com

eBooks, discount offers, and more

Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at <[email protected]> for more details.

At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters and receive exclusive discounts and offers on Packt books and eBooks.

https://www2.packtpub.com/books/subscription/packtlib

Do you need instant solutions to your IT questions? PacktLib is Packt's online digital book library. Here, you can search, access, and read Packt's entire library of books.

Why subscribe?

Fully searchable across every book published by PacktCopy and paste, print, and bookmark contentOn demand and accessible via a web browser

Preface

"Data analysis is Python's killer app"

--Unknown

This book is the follow-up to Python Data Analysis. The obvious question is, "what does this new book add?" as Python Data Analysis is pretty great (or so I like to believe) already. This book, Python Data Analysis Cookbook, is targeted at slightly more experienced Pythonistas. A year has passed, so we are using newer versions of software and software libraries that I didn't cover in Python Data Analysis. Also, I've had time to rethink and research, and as a result I decided the following:

I need to have a toolbox in order to make my life easier and increase reproducibility. I called the toolbox dautil and made it available via PyPi (which can be installed with pip/easy_install). My soul-searching exercise led me to believe that I need to make it easier to obtain and install the required software. I published a Docker container (pydacbk) with some of the software we need via DockerHub. You can read more about the setup in Chapter 1, Laying the Foundation for Reproducible Data Analysis, and the online chapter. The Docker container is not ideal because it grew quite large, so I had to make some tough decisions. Since the container is not really part of the book, I think it will be appropriate if you contact me directly if you have any issues. However, please keep in mind that I can't change the image drastically.This book uses the IPython Notebook, which has become a standard tool for analysis. I have given some related tips in the online chapter and other books I have written.I am using Python 3 with very few exceptions because Python 2 will not be maintained after 2020.

Why do you need this book?

Some people will tell you that you don't need books, just get yourself an interesting project and figure out the rest as you go along. Although there are plenty of resources out there, this may be a very frustrating road. If you want to make a delicious soup, for example, you can of course ask friends and family, search the Internet, or watch cooking shows. However, your friends and family are not available full time for you and the quality of Internet content varies. And in my humble opinion, Packt Publishing, the reviewers, and I have spent so much time and energy on this book, that I will be surprised if you don't get any value out of it.

Data analysis, data science, big data – what is the big deal?

You probably have seen Venn diagrams depicting data science as the intersection of mathematics/statistics, computer science, and domain expertise. Data analysis is timeless and was there before data science and even before computer science. You could do data analysis with a pen and paper and, in more modern times, with a pocket calculator.

Data analysis has many aspects, with goals such as making decisions or coming up with new hypotheses and questions. The hype, status, and financial rewards surrounding data science and big data remind me of the time when datawarehousing and business intelligence were the buzz words. The ultimate goal of business intelligence and datawarehousing was to build dashboards for management. This involved a lot of politics and organizational aspects, but on the technical side, it was mostly about databases. Data science, on the other hand, is not database-centric and leans heavily on machine learning. Machine learning techniques have become necessary because of the bigger volumes of data. The data growth is caused by the growth of the world population and the rise of new technologies, such as social media and mobile devices. The data growth is, in fact, probably the only trend that we can be sure of continuing. The difference between constructing dashboards and applying machine learning is analogous to the way search engines evolved.

Search engines (if you can call them that) were initially nothing more than well-organized collections of links created manually. Eventually, the automated approach won. Since, in time, more data will be created (and not destroyed), we can expect an increase in automated data analysis.

A brief of history of data analysis with Python

The history of the various Python software libraries is quite interesting. I am not a historian, so the following notes are written from my own perspective:

1989: Guido van Rossum implements the very first version of Python at the CWI in the Netherlands as a Christmas "hobby" project.1995: Jim Hugunin creates Numeric—the predecessor to NumPy.1999: Pearu Peterson wrote f2py as a bridge between Fortran and Python.2000: Python 2.0 is released.2001: The SciPy library is released. Also, Numarray, a competing library of Numeric is created. Fernando Perez releases IPython, which starts out as an "afternoon hack". NLTK is released as a research project.2002: John Hunter creates the Matplotlib library.2005: NumPy is released by Travis Oliphant. NumPy, initially, is Numeric extended with features inspired by Numarray.2006: NumPy 1.0 is released. The first version of SQLAlchemy is released.2007: The scikit-learn project is initiated as a Google Summer of Code project by David Cournapeau. Cython was forked from Pyrex. Cython is later intensively used in pandas and scikit-learn to improve performance.2008: Wes McKinney starts working on pandas. Python 3.0 is released.2011: The IPython 0.12 release introduces the IPython notebook. Packt Publishing releases NumPy 1.5 Beginner's Guide.2012: Packt Publishing releases NumPy Cookbook.2013: Packt Publishing releases NumPy Beginner's Guide, Second Edition.2014: Fernando Perez announces Project Jupyter, which aims to make a language-agnostic notebook. Packt Publishing releases Learning NumPy Array and Python Data Analysis.2015: Packt Publishing releases NumPy Beginner's Guide, Third Edition and NumPy Cookbook, Second Edition.

A conjecture about the future

The future is a bright place, where an incredible amount of data lives in the Cloud and software runs on any imaginable device with an intuitive customizable interface. (I know young people who can't stop talking about how awesome their phone is and how one day we will all be programming on tablets by dragging and dropping). It seems there is a certain angst in the Python community about not being relevant in the future. Of course, the more you have invested in Python, the more it matters.

To figure out what to do, we need to know what makes Python special. A school of thought claims that Python is a glue language gluing C, Fortran, R, Java, and other languages; therefore, we just need better glue. This probably also means "borrowing" features from other languages. Personally, I like the way Python works, its flexible nature, its data structures, and the fact that it has so many libraries and features. I think the future is in more delicious syntactic sugar and just-in-time compilers. Somehow we should be able to continue writing Python code, which automatically is converted for us in concurrent (machine) code. Unseen machinery under the hood manages lower level details and sends data and instructions to CPUs, GPUs, or the Cloud. The code should be able to easily communicate with whatever storage backend we are using. Ideally, all of this magic will be just as convenient as automatic garbage collection. It may sound like an impossible "click of a button" dream, but I think it is worth pursuing.

What this book covers

Chapter 1, Laying the Foundation for Reproducible Data Analysis, is a pretty important chapter, and I recommend that you do not skip it. It explains Anaconda, Docker, unit testing, logging, and other essential elements of reproducible data analysis.

Chapter 2, Creating Attractive Data Visualizations, demonstrates how to visualize data and mentions frequently encountered pitfalls.

Chapter 3, Statistical Data Analysis and Probability, discusses statistical probability distributions and correlation between two variables.

Chapter 4, Dealing with Data and Numerical Issues, is about outliers and other common data issues. Data is almost never perfect, so a large portion of the analysis effort goes into dealing with data imperfections.

Chapter 5, Web Mining, Databases, and Big Data, is light on mathematics, but more focused on technical topics, such as databases, web scraping, and big data.

Chapter 6, Signal Processing and Timeseries, is about time series data, which is abundant and requires special techniques. Usually, we are interested in trends and seasonality or periodicity.

Chapter 7, Selecting Stocks with Financial Data Analysis, focuses on stock investing because stock price data is abundant. This is the only chapter on finance and the content should be at least partially relevant if stocks don't interest you.

Chapter 8, Text Mining and Social Network Analysis, helps you cope with the floods of textual and social media information.

Chapter 9, Ensemble Learning and Dimensionality Reduction, covers ensemble learning, classification and regression algorithms, as well as hierarchical clustering.

Chapter 10, Evaluating Classifiers, Regressors, and Clusters, evaluates the classifiers and regressors from Chapter 9, Ensemble Learning and Dimensionality Reduction, the preceding chapter.

Chapter 11, Analyzing Images, uses the OpenCV library quite a lot to analyze images.

Chapter 12, Parallelism and Performance, is about software performance and I discuss various options to improve performance, including caching and just-in-time compilers.

Appendix A, Glossary, is a brief glossary of technical concepts used throughout the book. The goal is to have a reference that is easy to look up.

Appendix B, Function Reference, is a short reference of functions meant as an extra aid in case you are temporarily unable to look up documentation.

Appendix C, Online Resources, lists resources including presentations, links to documentation, and freely available IPython notebooks and data. This appendix is available as an online chapter.

Appendix D, Tips and Tricks for Command-Line and Miscellaneous Tools, in this book we use various tools such as the IPython notebook, Docker, and Unix shell commands. I give a short list of tips that is not meant to be exhaustive. This appendix is also available as online chapter.

What you need for this book

First, you need a Python 3 distribution. I recommend the full Anaconda distribution as it comes with the majority of the software we need. I tested the code with Python 3.4 and the following packages:

joblib 0.8.4IPython 3.2.1NetworkX 1.9.1NLTK 3.0.2Numexpr 2.3.1pandas 0.16.2SciPy 0.16.0 seaborn 0.6.0sqlalchemy 0.9.9statsmodels 0.6.1matplotlib 1.5.0NumPy 1.10.1scikit-learn 0.17dautil 0.0.1a29

For some recipes, you need to install extra software, but this is explained whenever the software is required.

Who this book is for

This book is hands-on and low on theory. You should have better than beginner Python knowledge and have some knowledge of linear algebra, calculus, machine learning and statistics. Ideally, you would have read Python Data Analysis, but this is not a requirement. I also recommend the following books:

Building Machine Learning Systems with Python by Willi Richert and Luis Pedro Coelho, 2013Learning NumPy Array by Ivan Idris, 2014Learning scikit-learn: Machine Learning in Python by Guillermo Moncecchi, 2013Learning SciPy for Numerical and Scientific Computing by Francisco J. Blanco-Silva, 2013Matplotlib for Python Developers by Sandro Tosi, 2009NumPy Beginner's Guide - Third Edition by Ivan Idris, 2015NumPy Cookbook – Second Edition by Ivan Idris, 2015Parallel Programming with Python by Jan Palach, 2014Python Data Visualization Cookbook by Igor Milovanović, 2013Python for Finance by Yuxing Yan, 2014Python Text Processing with NLTK 2.0 Cookbook by Jacob Perkins, 2010

Sections

In this book, you will find several headings that appear frequently (Getting ready, How to do it, How it works, There's more, and See also).

To give clear instructions on how to complete a recipe, we use these sections as follows:

Getting ready

This section tells you what to expect in the recipe, and describes how to set up any software or any preliminary settings required for the recipe.

How to do it…

This section contains the steps required to follow the recipe.

How it works…

This section usually consists of a detailed explanation of what happened in the previous section.

There's more…

This section consists of additional information about the recipe in order to make the reader more knowledgeable about the recipe.

Reader feedback

Feedback from our readers is always welcome. Let us know what you think about this book—what you liked or disliked. Reader feedback is important for us as it helps us develop titles that you will really get the most out of.

To send us general feedback, simply e-mail <[email protected]>, and mention the book's title in the subject of your message.

If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide at www.packtpub.com/authors.

Customer support

Now that you are the proud owner of a Packt book, we have a number of things to help you to get the most from your purchase.

Downloading the example code

You can download the example code files for this book from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.

You can download the code files by following these steps:

Log in or register to our website using your e-mail address and password.Hover the mouse pointer on the SUPPORT tab at the top.Click on Code Downloads & Errata.Enter the name of the book in the Search box.Select the book for which you're looking to download the code files.Choose from the drop-down menu where you purchased this book from.Click on Code Download.

Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:

WinRAR / 7-Zip for WindowsZipeg / iZip / UnRarX for Mac7-Zip / PeaZip for Linux

The code bundle for the book is also hosted on GitHub at https://github.com/PacktPublishing/PythonDataAnalysisCookbook. We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!

Errata

Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books—maybe a mistake in the text or the code—we would be grateful if you could report this to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this book. If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting your book, clicking on the Errata Submission Form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded to our website or added to any list of existing errata under the Errata section of that title.

To view the previously submitted errata, go to https://www.packtpub.com/books/content/support and enter the name of the book in the search field. The required information will appear under the Errata section.

Piracy

Piracy of copyrighted material on the Internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works in any form on the Internet, please provide us with the location address or website name immediately so that we can pursue a remedy.

Please contact us at <[email protected]> with a link to the suspected pirated material.

We appreciate your help in protecting our authors and our ability to bring you valuable content.

Questions

If you have a problem with any aspect of this book, you can contact us at <[email protected]>, and we will do our best to address the problem.

Chapter 1. Laying the Foundation for Reproducible Data Analysis

In this chapter, we will cover the following recipes:

Setting up AnacondaInstalling the Data Science ToolboxCreating a virtual environment with virtualenv and virtualenvwrapperSandboxing Python applications with Docker imagesKeeping track of package versions and history in IPython NotebooksConfiguring IPythonLearning to log for robust error checkingUnit testing your codeConfiguring pandasConfiguring matplotlibSeeding random number generators and NumPy print optionsStandardizing reports, code style, and data access

Introduction

Reproducible data analysis is a cornerstone of good science. In today's rapidly evolving world of science and technology, reproducibility is a hot topic. Reproducibility is about lowering barriers for other people. It may seem strange or unnecessary, but reproducible analysis is essential to get your work acknowledged by others. If a lot of people confirm your results, it will have a positive effect on your career. However, reproducible analysis is hard. It has important economic consequences, as you can read in Freedman LP, Cockburn IM, Simcoe TS (2015) The Economics of Reproducibility in Preclinical Research. PLoS Biol 13(6): e1002165. doi:10.1371/journal.pbio.1002165.

So reproducibility is important for society and for you, but how does it apply to Python users? Well, we want to lower barriers for others by:

Giving information about the software and hardware we used, including versions.Sharing virtual environments.Logging program behavior.Unit testing the code. This also serves as documentation of sorts.Sharing configuration files.Seeding random generators and making sure program behavior is as deterministic as possible.Standardizing reporting, data access, and code style.

I created the dautil package for this book, which you can install with pip or from the source archive provided in this book's code bundle. If you are in a hurry, run $ python install_ch1.py to install most of the software for this chapter, including dautil. I created a test Docker image, which you can use if you don't want to install anything except Docker (see the recipe, Sandboxing Python applications with Docker images).

Setting up Anaconda

Anaconda is a free Python distribution for data analysis and scientific computing. It has its own package manager, conda. The distribution includes more than 200 Python packages, which makes it very convenient. For casual users, the Miniconda distribution may be the better choice. Miniconda contains the conda package manager and Python. The technical editors use Anaconda, and so do I. But don't worry, I will describe in this book alternative installation instructions for readers who are not using Anaconda. In this recipe, we will install Anaconda and Miniconda and create a virtual environment.

Getting ready

The procedures to install Anaconda and Miniconda are similar. Obviously, Anaconda requires more disk space. Follow the instructions on the Anaconda website at http://conda.pydata.org/docs/install/quick.html (retrieved Mar 2016). First, you have to download the appropriate installer for your operating system and Python version. Sometimes, you can choose between a GUI and a command-line installer. I used the Python 3.4 installer, although my system Python version is v2.7. This is possible because Anaconda comes with its own Python. On my machine, the Anaconda installer created an anaconda directory in my home directory and required about 900 MB. The Miniconda installer installs a miniconda directory in your home directory.

How to do it...

Now that Anaconda or Miniconda is installed, list the packages with the following command:

$ conda list

For reproducibility, it is good to know that we can export packages:

$ conda list --export

The preceding command prints packages and versions on the screen, which you can save in a file. You can install these packages with the following command:

$ conda create -n ch1env --file <export file>

This command also creates an environment named ch1env.

The following command creates a simple testenv environment:

$ conda create --name testenv python=3

On Linux and Mac OS X, switch to this environment with the following command:

$ source activate testenv

On Windows, we don't need source. The syntax to switch back is similar:

$ [source] deactivate

The following command prints export information for the environment in the YAML (explained in the following section) format:

$ conda env export -n testenv

To remove the environment, type the following (note that even after removing, the name of the environment still exists in ~/.conda/environments.txt):

$ conda remove -n testenv --all

Search for a package as follows:

$ conda search numpy

In this example, we searched for the NumPy package. If NumPy is already present, Anaconda shows an asterisk in the output at the corresponding entry.

Update the distribution as follows:

$ conda update conda

There's more...

The .condarc configuration file follows the YAML syntax.

Note

YAML is a human-readable configuration file format with the extension .yaml or .yml. YAML was initially released in 2011, with the latest release in 2009. The YAML homepage is at http://yaml.org/ (retrieved July 2015).

You can find a sample configuration file at http://conda.pydata.org/docs/install/sample-condarc.html (retrieved July 2015). The related documentation is at http://conda.pydata.org/docs/install/config.html (retrieved July 2015).

Installing the Data Science Toolbox

The Data Science Toolbox (DST) is a virtual environment based on Ubuntu for data analysis using Python and R. Since DST is a virtual environment, we can install it on various operating systems. We will install DST locally, which requires VirtualBox and Vagrant. VirtualBox is a virtual machine application originally created by Innotek GmbH in 2007. Vagrant is a wrapper around virtual machine applications such as VirtualBox created by Mitchell Hashimoto.

Getting ready

You need to have in the order of 2 to 3 GB free for VirtualBox, Vagrant, and DST itself. This may vary by operating system.

How to do it...

Installing DST requires the following steps:

Install VirtualBox by downloading an installer for your operating system and architecture from https://www.virtualbox.org/wiki/Downloads (retrieved July 2015) and running it. I installed VirtualBox 4.3.28-100309 myself, but you can just install whatever the most recent VirtualBox version at the time is.Install Vagrant by downloading an installer for your operating system and architecture from https://www.vagrantup.com/downloads.html (retrieved July 2015). I installed Vagrant 1.7.2 and again you can install a more recent version if available.Create a directory to hold the DST and navigate to it with a terminal. Run the following command:

$ vagrant init data-science-toolbox/dst$ vagrant up

The first command creates a VagrantFile configuration file. Most of the content is commented out, but the file does contain links to documentation that might be useful. The second command creates the DST and initiates a download that could take a couple of minutes.

Connect to the virtual environment as follows (on Windows use putty):

$ vagrant ssh

View the preinstalled Python packages with the following command:

vagrant@data-science-toolbox:~$ pip freeze

The list is quite long; in my case it contained 32 packages. The DST Python version as of July 2015 was 2.7.6.

When you are done with the DST, log out and suspend (you can also halt it completely) the VM:

vagrant@data-science-toolbox:~$ logoutConnection to 127.0.0.1 closed.$ vagrant suspend==> default: Saving VM state and suspending execution...

How it works...

Virtual machines (VMs) emulate computers in software. VirtualBox is an application that creates and manages VMs. VirtualBox stores its VMs in your home folder, and this particular VM takes about 2.2 GB of storage.

Ubuntu is an open source Linux operating system, and we are allowed by its license to create virtual machines. Ubuntu has several versions; we can get more info with the lsb_release command:

vagrant@data-science-toolbox:~$ lsb_release -aNo LSB modules are available.Distributor ID: UbuntuDescription: Ubuntu 14.04 LTSRelease: 14.04Codename: trusty

Vagrant used to only work with VirtualBox, but currently it also supports VMware, KVM, Docker, and Amazon EC2. Vagrant calls virtual machines boxes. Some of these boxes are available for everyone at http://www.vagrantbox.es/ (retrieved July 2015).

Creating a virtual environment with virtualenv and virtualenvwrapper

Virtual environments provide dependency isolation for small projects. They also keep your site-packages directory small. Since Python 3.3, virtualenv has been part of the standard Python distribution. The virtualenvwrapper Python project has some extra convenient features for virtual environment management. I will demonstrate virtualenv and virtualenvwrapper functionality in this recipe.

Getting ready

You need Python 3.3 or later. You can install virtualenvwrapper with pip command as follows:

$ [sudo] pip install virtualenvwrapper

On Linux and Mac, it's necessary to do some extra work—specifying a directory for the virtual environments and sourcing a script:

$ export WORKON_HOME=/tmp/envs$ source /usr/local/bin/virtualenvwrapper.sh

Windows has a separate version, which you can install with the following command:

$ pip install virtualenvwrapper-win

How to do it...

Create a virtual environment for a given directory with the pyvenv script part of your Python distribution:

$ pyvenv /tmp/testenv$ lsbin include lib pyvenv.cfg

In this example, we created a testenv directory in the /tmp directory with several directories and a configuration file. The configuration file pyvenv.cfg contains the Python version and the home directory of the Python distribution.Activate the environment on Linux or Mac by sourcing the activate script, for example, with the following command:

$ source bin/activate

On Windows, use the activate.bat file.

You can now install packages in this environment in isolation. When you are done with the environment, switch back on Linux or Mac with the following command:

$ deactivate

On Windows, use the deactivate.bat file.

Alternatively, you could use virtualenvwrapper. Create and switch to a virtual environment with the following command:

vagrant@data-science-toolbox:~$ mkvirtualenv env2

Deactivate the environment with the deactivate command:

(env2)vagrant@data-science-toolbox:~$ deactivate

Delete the environment with the rmvirtualenv command:

vagrant@data-science-toolbox:~$ rmvirtualenv env2

Sandboxing Python applications with Docker images

Docker uses Linux kernel features to provide an extra virtualization layer. Docker was created in 2013 by Solomon Hykes. Boot2Docker allows us to install Docker on Windows and Mac OS X too. Boot2Docker uses a VirtualBox VM that contains a Linux environment with Docker. In this recipe, we will set up Docker and download the continuumio/miniconda3 Docker image.

Getting ready

The Docker installation docs are saved at https://docs.docker.com/index.html (retrieved July 2015). I installed Docker 1.7.0 with Boot2Docker. The installer requires about 133 MB. However, if you want to follow the whole recipe, you will need several gigabytes.

How to do it...

Once Boot2Docker is installed, you need to initialize the environment. This is only necessary once, and Linux users don't need this step:

$ boot2docker initLatest release for github.com/boot2docker/boot2docker is v1.7.0Downloading boot2docker ISO image...Success: downloaded https://github.com/boot2docker/boot2docker/releases/download/v1.7.0/boot2docker.iso

In the preceding step, you downloaded a VirtualBox VM to a directory such as /VirtualBox\ VMs/boot2docker-vm/.

The next step for Mac OS X and Windows users is to start the VM:

$ boot2docker start

Check the Docker environment by starting a sample container:

$ docker run hello-world

Note

Some people reported a hopefully temporary issue of not being able to connect. The issue can be resolved by issuing commands with an extra argument, for instance:

$ docker [--tlsverify=false] run hello-world

Docker images can be made public. We can search for such images and download them. In Setting up Anaconda, we installed Anaconda; however, Anaconda and Miniconda Docker images also exist. Use the following command:

$ docker search continuumio

The preceding command shows a list of Docker images from Continuum Analytics – the company that developed Anaconda and Miniconda. Download the Miniconda 3 Docker image as follows (if you prefer using my container, skip this):

$ docker pull continuumio/miniconda3

Start the image with the following command:

$ docker run -t -i continuumio/miniconda3 /bin/bash

We start out as root in the image.

The command $ docker images should list the continuumio/miniconda3 image as well. If you prefer not to install too much software (possibly only Docker and Boot2Docker) for this book, you should use the image I created. It uses the continuumio/miniconda3 image as template. This image allows you to execute Python scripts in the current working directory on your computer, while using installed software from the Docker image:

$ docker run -it -p 8888:8888 -v $(pwd):/usr/data -w /usr/data "ivanidris/pydacbk:latest" python <somefile>.py

You can also run a IPython notebook in your current working directory with the following command:

Tausende von E-Books und Hörbücher

Ihre Zahl wächst ständig und Sie haben eine Fixpreisgarantie.

Sie haben über uns geschrieben:

Python Data Analysis Cookbook E-Book

Ivan Idris

About This Book

Who This Book Is For

What You Will Learn

In Detail

Style and Approach

Table of Contents

Python Data Analysis Cookbook

Python Data Analysis Cookbook

Credits

About the Author

About the Reviewers

www.PacktPub.com

eBooks, discount offers, and more

Why subscribe?

Preface

Why do you need this book?

Data analysis, data science, big data – what is the big deal?

A brief of history of data analysis with Python

A conjecture about the future

What this book covers

What you need for this book

Who this book is for

Sections

Getting ready

How to do it…

How it works…

There's more…

See also

Reader feedback

Customer support

Downloading the example code

Errata

Piracy

Questions

Chapter 1. Laying the Foundation for Reproducible Data Analysis

Introduction

Setting up Anaconda

Getting ready

How to do it...

There's more...

Note

See also

Installing the Data Science Toolbox

Getting ready

How to do it...

How it works...

See also

Creating a virtual environment with virtualenv and virtualenvwrapper

Getting ready

How to do it...

See also

Sandboxing Python applications with Docker images

Getting ready

How to do it...

Note