Description

This book will provide you with unique, idiomatic, and fun recipes for both fundamental and advanced data manipulation tasks with pandas 0.20. Some recipes focus on achieving a deeper understanding of basic principles, or comparing and contrasting two similar operations. Other recipes will dive deep into a particular dataset, uncovering new and unexpected insights along the way.

The pandas library is massive, and it's common for frequent users to be unaware of many of its more impressive features. The official pandas documentation, while thorough, does not contain many useful examples of how to piece together multiple commands like one would do during an actual analysis. This book guides you, as if you were looking over the shoulder of an expert, through practical situations that you are highly likely to encounter.

Many advanced recipes combine several different features across the pandas 0.20 library to generate results.


Pandas Cookbook
Recipes for Scientific Computing, Time Series Analysis and Data Visualization using Python
Theodore Petrou

BIRMINGHAM - MUMBAI

Pandas Cookbook

Copyright © 2017 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

First published: October 2017

Production reference: 2171219

Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham
B3 2PB, UK.

ISBN 978-1-78439-387-8

www.packtpub.com

Credits

Author

Theodore Petrou

Copy Editor

Tasneem Fatehi

Reviewers

Sonali Dayal

Kuntal Ganguly

Shilpi Saxena

Project Coordinator

Manthan Patel

Commissioning Editor

Veena Pagare

Proofreader

Safis Editing

Acquisition Editor

Tushar Gupta

Indexer

Tejal Daruwale Soni

Content Development Editor

Snehal Kolte

Graphics

Tania Dutta

Technical Editor

Sayli Nikalje

Production Coordinator

Deepika Naik

About the Author

Theodore Petrou is a data scientist and the founder of Dunder Data, a professional educational company focusing on exploratory data analysis. He is also the head of Houston Data Science, a meetup group with more than 2,000 members that has the primary goal of getting local data enthusiasts together in the same room to practice data science. Before founding Dunder Data, Ted was a data scientist at Schlumberger, a large oil services company, where he spent the vast majority of his time exploring data.

Some of his projects included using targeted sentiment analysis to discover the root cause of part failure from engineer text, developing customized client/server dashboarding applications, and building real-time web services to avoid the mispricing of sales items. Ted received his master's degree in statistics from Rice University, and used his analytical skills to play poker professionally and teach math before becoming a data scientist. Ted is a strong supporter of learning through practice and can often be found answering questions about pandas on Stack Overflow.

Acknowledgement

I would first like to thank my wife, Eleni, and two young children, Penelope and Niko, who endured extended periods of time without me as I wrote.

I’d also like to thank Sonali Dayal, whose constant feedback helped immensely in structuring the content of the book to improve its effectiveness. Thank you to Roy Keyes, who is the most exceptional data scientist I know and whose collaboration made Houston Data Science possible. Thank you to Scott Boston, an extremely skilled pandas user, for developing ideas for recipes. Thank you very much to Kim Williams, Randolph Adami, Kevin Higgins, and Vishwanath Avasarala, who took a chance on me during my professional career when I had little to no experience. Thanks to my coworker at Schlumberger, Micah Miller, for his critical, honest, and instructive feedback on anything that we developed together and his constant pursuit to move toward Python.

Thank you to Phu Ngo, who critically challenges and sharpens my thinking more than anyone. Thank you to my brother, Dean Petrou, for being right by my side as we developed our analytical skills through poker and again through business. Thank you to my sister, Stephanie Burton, for always knowing what I’m thinking and making sure that I’m aware of it. Thank you to my mother, Sofia Petrou, for her ceaseless love, support, and endless math puzzles that challenged me as a child. And thank you to my father, Steve Petrou, who, although no longer here, remains close to my heart and continues to encourage me every day.

About the Reviewers

Sonali Dayal is a masters candidate in biostatistics at the University of California, Berkeley. Previously, she has worked as a freelance software and data science engineer for early stage start-ups, where she built supervised and unsupervised machine learning models as well as data pipelines and interactive data analytics dashboards. She received her bachelor of science (B.S.) in biochemistry from Virginia Tech in 2011.

Kuntal Ganguly is a big data machine learning engineer focused on building large-scale data-driven systems using big data frameworks and machine learning. He has around 7 years of experience building several big data and machine learning applications.

Kuntal provides solutions to AWS customers in building real-time analytics systems using managed cloud services and open source Hadoop ecosystem technologies such as Spark, Kafka, Storm, and Solr, along with machine learning and deep learning frameworks such as scikit-learn, TensorFlow, Keras, and BigDL. He enjoys hands-on software development, and has single-handedly conceived, architected, developed, and deployed several large-scale distributed applications. He is a machine learning and deep learning practitioner and is very passionate about building intelligent applications.

Kuntal is the author of the books Learning Generative Adversarial Network and R Data Analysis Cookbook - Second Edition, both published by Packt Publishing.

Shilpi Saxena is a seasoned professional who leads in management with an edge of being a technology evangelist--she is an engineer who has had exposure to a variety of domains (machine-to-machine space, healthcare, telecom, hiring, and manufacturing). She has experience in all aspects of the conception and execution of enterprise solutions. She has been architecting, managing, and delivering solutions in the big data space for the last 3 years, handling high-performance, geographically distributed teams of elite engineers. Shilpi has 12+ years of experience (3 years in the big data space) in the development and execution of various facets of enterprise solutions, in both the product and services dimensions of the software industry. An engineer by degree and profession, she has worn various hats--developer, technical leader, product owner, tech manager--and has seen all the flavors that the industry has to offer. She has architected and worked through some of the pioneering production implementations of big data on Storm and Impala with auto-scaling in AWS. LinkedIn: http://in.linkedin.com/pub/shilpi-saxena/4/552/a30

www.PacktPub.com

For support files and downloads related to your book, please visit www.PacktPub.com. Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and, as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at [email protected] for more details. At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters, and receive exclusive discounts and offers on Packt books and eBooks.

Get the most in-demand software skills with Mapt (https://www.packtpub.com/mapt). Mapt gives you full access to all Packt books and video courses, as well as industry-leading tools to help you plan your personal development and advance your career.

Why subscribe?

Fully searchable across every book published by Packt

Copy and paste, print, and bookmark content

On demand and accessible via a web browser

Customer Feedback

Thanks for purchasing this Packt book. At Packt, quality is at the heart of our editorial process. To help us improve, please leave us an honest review on this book's Amazon page at https://www.amazon.com/dp/1784393878. If you'd like to join our team of regular reviewers, you can email us at [email protected]. We award our regular reviewers with free eBooks and videos in exchange for their valuable feedback. Help us be relentless in improving our products!

Table of Contents

Title Page

Copyright

Pandas Cookbook

Credits

About the Author

Acknowledgement

About the Reviewers

www.PacktPub.com

Why subscribe?

Customer Feedback

Preface

What this book covers

What you need for this book

Running a Jupyter Notebook

Who this book is for

How to get the most out of this book

Conventions

Assumptions for every recipe

Dataset Descriptions

Sections

Getting ready

How to do it...

How it works...

There's more...

See also

Reader feedback

Customer support

Downloading the example code

Downloading the color images of this book

Errata

Piracy

Questions

Pandas Foundations

Introduction

Dissecting the anatomy of a DataFrame

Getting ready

How to do it...

How it works...

There's more...

See also

Accessing the main DataFrame components

Getting ready

How to do it...

How it works...

There's more...

See also

Understanding data types

Getting ready

How to do it...

How it works...

There's more...

See also

Selecting a single column of data as a Series

Getting ready

How to do it...

How it works...

There's more...

See also

Calling Series methods

Getting ready

How to do it...

How it works...

There's more...

See also

Working with operators on a Series

Getting ready

How to do it...

How it works...

There's more...

See also

Chaining Series methods together

Getting ready

How to do it...

How it works...

There's more...

Making the index meaningful

Getting ready

How to do it...

How it works...

There's more...

See also

Renaming row and column names

Getting ready

How to do it...

How it works...

There's more...

Creating and deleting columns

Getting ready

How to do it...

How it works...

There's more...

See also

Essential DataFrame Operations

Introduction

Selecting multiple DataFrame columns

Getting ready

How to do it...

How it works...

There's more...

Selecting columns with methods

Getting ready

How to do it...

How it works...

There's more...

See also

Ordering column names sensibly

Getting ready

How to do it...

How it works...

There's more...

See also

Operating on the entire DataFrame

Getting ready

How to do it...

How it works...

There's more...

Chaining DataFrame methods together

Getting ready

How to do it...

How it works...

There's more...

See also

Working with operators on a DataFrame

Getting ready

How to do it...

How it works...

There's more...

See also

Comparing missing values

Getting ready

How to do it...

How it works...

There's more...

Transposing the direction of a DataFrame operation

Getting ready

How to do it...

How it works...

There's more...

See also

Determining college campus diversity

Getting ready

How to do it...

How it works...

There's more...

See also

Beginning Data Analysis

Introduction

Developing a data analysis routine

Getting ready

How to do it...

How it works...

There's more...

Data dictionaries

See also

Reducing memory by changing data types

Getting ready

How to do it...

How it works...

There's more...

See also

Selecting the smallest of the largest

Getting ready

How to do it...

How it works...

There's more...

Selecting the largest of each group by sorting

Getting ready

How to do it...

How it works...

There's more...

Replicating nlargest with sort_values

Getting ready

How to do it...

How it works...

There's more...

Calculating a trailing stop order price

Getting ready

How to do it...

How it works...

There's more...

See also

Selecting Subsets of Data

Introduction

Selecting Series data

Getting ready

How to do it...

How it works...

There's more...

See also

Selecting DataFrame rows

Getting ready

How to do it...

How it works...

There's more...

See also

Selecting DataFrame rows and columns simultaneously

Getting ready

How to do it...

How it works...

There's more...

Selecting data with both integers and labels

Getting ready

How to do it...

How it works...

There's more...

See also

Speeding up scalar selection

Getting ready

How to do it...

How it works...

There's more...

Slicing rows lazily

Getting ready

How to do it...

How it works...

There's more...

Slicing lexicographically

Getting ready

How to do it...

How it works...

There's more...

Boolean Indexing

Introduction

Calculating boolean statistics

Getting ready

How to do it...

How it works...

There's more...

See also

Constructing multiple boolean conditions

Getting ready

How to do it...

How it works...

There's more...

See also

Filtering with boolean indexing

Getting ready

How to do it...

How it works...

There's more...

See also

Replicating boolean indexing with index selection

Getting ready

How to do it...

How it works...

There's more...

Selecting with unique and sorted indexes

Getting ready

How to do it...

How it works...

There's more...

See also

Gaining perspective on stock prices

Getting ready

How to do it...

How it works...

There's more...

See also

Translating SQL WHERE clauses

Getting ready

How to do it...

How it works...

There's more...

See also

Determining the normality of stock market returns

Getting ready

How to do it...

How it works...

There's more...

See also

Improving readability of boolean indexing with the query method

Getting ready

How to do it...

How it works...

There's more...

See also

Preserving Series with the where method

Getting ready

How to do it...

How it works...

There's more...

See also

Masking DataFrame rows

Getting ready

How to do it...

How it works...

There's more...

See also

Selecting with booleans, integer location, and labels

Getting ready

How to do it...

How it works...

There's more...

See also

Index Alignment

Introduction

Examining the Index object

Getting ready

How to do it...

How it works...

There's more...

See also

Producing Cartesian products

Getting ready

How to do it...

How it works...

There's more...

See also

Exploding indexes

Getting ready

How to do it...

How it works...

There's more...

Filling values with unequal indexes

Getting ready

How to do it...

How it works...

There's more...

Appending columns from different DataFrames

Getting ready

How to do it...

How it works...

There's more...

See also

Highlighting the maximum value from each column

Getting ready

How to do it...

How it works...

There's more...

See also

Replicating idxmax with method chaining

Getting ready

How to do it...

How it works...

There's more...

Finding the most common maximum

Getting ready

How to do it...

How it works...

There's more...

Grouping for Aggregation, Filtration, and Transformation

Introduction

Defining an aggregation

Getting ready

How to do it...

How it works...

There's more...

See also

Grouping and aggregating with multiple columns and functions

Getting ready

How to do it...

How it works...

There's more...

Removing the MultiIndex after grouping

Getting ready

How to do it...

How it works...

There's more...

Customizing an aggregation function

Getting ready

How to do it...

How it works...

There's more...

Customizing aggregating functions with *args and **kwargs

Getting ready

How to do it...

How it works...

There's more...

See also

Examining the groupby object

Getting ready

How to do it...

How it works...

There's more...

See also

Filtering for states with a minority majority

Getting ready

How to do it...

How it works...

There's more...

See also

Transforming through a weight loss bet

Getting ready

How to do it...

How it works...

There's more...

See also

Calculating weighted mean SAT scores per state with apply

Getting ready

How to do it...

How it works...

There's more...

See also

Grouping by continuous variables

Getting ready

How to do it...

How it works...

There's more...

See also

Counting the total number of flights between cities

Getting ready

How to do it...

How it works...

There's more...

See also

Finding the longest streak of on-time flights

Getting ready

How to do it...

How it works...

There's more...

See also

Restructuring Data into a Tidy Form

Introduction

Tidying variable values as column names with stack

Getting ready

How to do it...

How it works...

There's more...

See also

Tidying variable values as column names with melt

Getting ready

How to do it...

How it works...

There's more...

See also

Stacking multiple groups of variables simultaneously

Getting ready

How to do it...

How it works...

There's more...

See also

Inverting stacked data

Getting ready

How to do it...

How it works...

There's more...

See also

Unstacking after a groupby aggregation

Getting ready

How to do it...

How it works...

There's more...

See also

Replicating pivot_table with a groupby aggregation

Getting ready

How to do it...

How it works...

There's more...

Renaming axis levels for easy reshaping

Getting ready

How to do it...

How it works...

There's more...

Tidying when multiple variables are stored as column names

Getting ready

How to do it...

How it works...

There's more...

See also

Tidying when multiple variables are stored as column values

Getting ready

How to do it...

How it works...

There's more...

See also

Tidying when two or more values are stored in the same cell

Getting ready

How to do it...

How it works...

There's more...

Tidying when variables are stored in column names and values

Getting ready

How to do it...

How it works...

There's more...

Tidying when multiple observational units are stored in the same table

Getting ready

How to do it...

How it works...

There's more...

See also

Combining Pandas Objects

Introduction

Appending new rows to DataFrames

Getting ready

How to do it...

How it works...

There's more...

Concatenating multiple DataFrames together

Getting ready

How to do it...

How it works...

There's more...

Comparing President Trump's and Obama's approval ratings

Getting ready

How to do it...

How it works...

There's more...

See also

Understanding the differences between concat, join, and merge

Getting ready

How to do it...

How it works...

There's more...

See also

Connecting to SQL databases

Getting ready

How to do it...

How it works...

There's more...

See also

Time Series Analysis

Introduction

Understanding the difference between Python and pandas date tools

Getting ready

How to do it...

How it works...

There's more...

See also

Slicing time series intelligently

Getting ready

How to do it...

How it works...

There's more...

See also

Using methods that only work with a DatetimeIndex

Getting ready

How to do it...

How it works...

There's more...

See also

Counting the number of weekly crimes

Getting ready

How to do it...

How it works...

There's more...

See also

Aggregating weekly crime and traffic accidents separately

Getting ready

How to do it...

How it works...

There's more...

Measuring crime by weekday and year

Getting ready

How to do it...

How it works...

There's more...

See also

Grouping with anonymous functions with a DatetimeIndex

Getting ready

How to do it...

How it works...

There's more...

See also

Grouping by a Timestamp and another column

Getting ready

How to do it...

How it works...

There's more...

Finding the last time crime was 20% lower with merge_asof

Getting ready

How to do it...

How it works...

There's more...

Visualization with Matplotlib, Pandas, and Seaborn

Introduction

Getting started with matplotlib

Getting ready

Object-oriented guide to matplotlib

How to do it...

How it works...

There's more...

See also

Visualizing data with matplotlib

Getting ready

How to do it...

How it works...

There's more...

See also

Plotting basics with pandas

Getting ready

How to do it...

How it works...

There's more...

See also

Visualizing the flights dataset

Getting ready

How to do it...

How it works...

See also

Stacking area charts to discover emerging trends

Getting ready

How to do it...

How it works...

There's more...

Understanding the differences between seaborn and pandas

Getting ready

How to do it...

How it works...

See also

Doing multivariate analysis with seaborn Grids

Getting ready

How to do it...

How it works...

There's more...

Uncovering Simpson's paradox in the diamonds dataset with seaborn

Getting ready

How to do it...

How it works...

There's more...

Preface

The popularity of data science has skyrocketed since it was called "The Sexiest Job of the 21st Century" by the Harvard Business Review in 2012. It was ranked as the number one job by Glassdoor in both 2016 and 2017. Fueling this skyrocketing popularity for data science is the demand from industry. Several applications have made big splashes in the news, such as Netflix making better movie recommendations, IBM Watson defeating humans at Jeopardy, Tesla building self-driving cars, Major League Baseball teams finding undervalued prospects, and Google learning to identify cats on the internet.

Nearly every industry is finding ways to use data science to build new technology or provide deeper insights. Due to such noteworthy successes, an ever-present aura of hype seems to encapsulate data science. Most of the scientific progress backing this hype stems from the field of machine learning, which produces the algorithms that make the predictions responsible for artificial intelligence.

The fundamental building block for all machine learning algorithms is, of course, data. Now that companies have realized this, there is no shortage of it. The business intelligence company Domo estimates that 90% of the world's data has been created in just the last two years. Although machine learning gets all the attention, it is completely reliant on the quality of the data that it is fed. Before data ever reaches the input layers of a machine learning algorithm, it must be prepared, and for data to be prepared properly, it needs to be explored thoroughly for basic understanding and to identify inaccuracies. Before data can be explored, it needs to be captured.

To summarize, we can cast the data science pipeline into three stages--data capturing, data exploration, and machine learning. There is a vast array of tools available to complete each stage of the pipeline. Pandas is the dominant tool in the scientific Python ecosystem for data exploration and analysis. It is tremendously capable of inspecting, cleaning, tidying, filtering, transforming, aggregating, and even visualizing (with some help) all types of data. It is not a tool for initially capturing the data, nor is it a tool to build machine learning models.

For many data analysts and scientists who use Python, the vast majority of their work will be done using pandas. This is likely because the initial data exploration and preparation tend to take the most time. Some entire projects consist only of data exploration and have no machine learning component. Data scientists spend so much time on this stage that a timeless lore has arisen--Data scientists spend 80% of their time cleaning the data and the other 20% complaining about cleaning the data.

Although there is an abundance of open source and free programming languages available to do data exploration, the field is currently dominated by just two players, Python and R. The two languages have vastly different syntax but are both very capable of doing data analysis and machine learning. One measure of popularity is the number of questions asked on the popular Q&A site, Stack Overflow (https://insights.stackoverflow.com/trends):

While this is not a true measure of usage, it is clear that both Python and R have become increasingly popular, likely due to their data science capabilities. It is interesting to note that the percentage of Python questions remained constant until the year 2012, when data science took off. What is probably most astonishing about this graph is that pandas questions now make up a whopping one percent of all the newest questions on Stack Overflow.

One of the reasons why Python has become a language of choice for data science is that it is a fairly easy language to learn and develop in, and so it has a low barrier to entry. It is also free and open source, able to run on a variety of hardware and software, and a breeze to get up and running. It has a very large and active community with a substantial amount of free resources online. In my opinion, Python is one of the most fun languages to develop programs with. The syntax is clear, concise, and intuitive but, like all languages, takes quite a long time to master.

As Python, unlike R, was not built specifically for data analysis, the pandas syntax may not come as naturally as that of some other Python libraries. This might actually be part of the reason why there are so many Stack Overflow questions on it. Despite its tremendous capabilities, pandas code can often be poorly written. One of the main aims of this book is to show performant and idiomatic pandas code.

For all its greatness, Stack Overflow unfortunately perpetuates misinformation and is a source of lots of poorly written pandas code. This is actually not the fault of Stack Overflow or its community. Pandas is an open source project and has had numerous major changes, even recently, as it approaches its tenth year of existence in 2018. The upside of open source, though, is that new features get added to it all the time.

The recipes in this book were formulated through my experience working as a data scientist, building and hosting several week-long data exploration bootcamps, answering several hundred questions on Stack Overflow, and building tutorials for my local meetup group. The recipes not only offer idiomatic solutions to common data problems, but also take you on journeys through many real-world datasets, where surprising insights are often discovered. These recipes will also help you master the pandas library, which will give you a gigantic boost in productivity. There is a huge difference between those who have only cursory knowledge of pandas and those who have it mastered. There are so many interesting and fun tricks to solve your data problems that only become apparent if you truly know the library inside and out. Personally, I find pandas to be a delightful and fun tool to analyze data with, and I hope you enjoy your journey along with me. If you have questions, please feel free to reach out to me on Twitter: @TedPetrou.

What this book covers

Chapter 1, Pandas Foundations, covers the anatomy and vocabulary used to identify the components of the two main pandas data structures, the Series and the DataFrame. Each column must have exactly one type of data, and each of these data types is covered. You will learn how to unleash the power of the Series and the DataFrame by calling and chaining together their methods.

Chapter 2, Essential DataFrame Operations, focuses on the most crucial and common operations that you will perform during data analysis.

Chapter 3, Beginning Data Analysis, helps you develop a routine to get started after reading in your data. Other interesting discoveries will be made.

Chapter 4, Selecting Subsets of Data, covers the many varied and potentially confusing ways of selecting different subsets of data.

Chapter 5, Boolean Indexing, covers the process of querying your data to select subsets of it based on Boolean conditions.

Chapter 6, Index Alignment, targets the very important and often misunderstood index object. Misuse of the Index is responsible for lots of erroneous results, and these recipes show you how to use it correctly to deliver powerful results.

Chapter 7, Grouping for Aggregation, Filtration, and Transformation, covers the powerful grouping capabilities that are almost always necessary during a data analysis. You will build customized functions to apply to your groups.

Chapter 8, Restructuring Data into Tidy Form, explains what tidy data is and why it’s so important, and then it shows you how to transform many different forms of messy datasets into tidy ones.

Chapter 9, Combining Pandas Objects, covers the many available methods to combine DataFrames and Series vertically or horizontally. We will also do some web-scraping to compare President Trump's and Obama's approval ratings and connect to an SQL relational database.

Chapter 10, Time Series Analysis, covers advanced and powerful time series capabilities to dissect by any dimension of time possible.

Chapter 11, Visualization with Matplotlib, Pandas, and Seaborn, introduces the matplotlib library, which is responsible for all of the plotting in pandas. We will then shift focus to the pandas plot method and, finally, to the seaborn library, which is capable of producing aesthetically pleasing visualizations not directly available in pandas.

What you need for this book

Pandas is a third-party package for the Python programming language and, as of the printing of this book, is on version 0.20. Currently, Python has two major supported releases, versions 2.7 and 3.6. Python 3 is the future, and it is now highly recommended that all scientific computing users of Python use it, as Python 2 will no longer be supported in 2020. All examples in this book have been run and tested with pandas 0.20 on Python 3.6.

In addition to pandas, you will need to have the matplotlib (version 2.0) and seaborn (version 0.8) visualization libraries installed. A major dependency of pandas is the NumPy library, which forms the basis of most of the popular Python scientific computing libraries.
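
If you want to confirm which versions are installed on your machine, a quick check from Python itself is enough (a minimal sketch; your version numbers may of course differ from the ones targeted by this book):

>>> import pandas as pd
>>> import numpy as np
>>> import matplotlib
>>> import seaborn as sns
>>> pd.__version__, np.__version__, matplotlib.__version__, sns.__version__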

There are a wide variety of ways in which you can install pandas and the rest of the libraries mentioned on your computer, but by far the simplest method is to install the Anaconda distribution. Created by Continuum Analytics, it packages together all the popular libraries for scientific computing in a single downloadable file available on Windows, Mac OSX, and Linux. Visit the download page to get the Anaconda distribution (https://www.anaconda.com/download).

In addition to all the scientific computing libraries, the Anaconda distribution comes with Jupyter Notebook, which is a browser-based program for developing in Python, among many other languages. All of the recipes for this book were developed inside of a Jupyter Notebook and all of the individual notebooks for each chapter will be available for you to use.

It is possible to install all the necessary libraries for this book without the use of the Anaconda distribution. For those who are interested, visit the pandas Installation page (http://pandas.pydata.org/pandas-docs/stable/install.html).

Running a Jupyter Notebook

The suggested method to work through the content of this book is to have a Jupyter Notebook up and running so that you can run the code while reading through the recipes. This allows you to go exploring on your own and gain a deeper understanding than by just reading the book alone.

Assuming that you have installed the Anaconda distribution on your machine, you have two options available to start the Jupyter Notebook:

Use the Anaconda Navigator program

Run the jupyter notebook command from the Terminal/Command Prompt

The Anaconda Navigator is a GUI-based tool that allows you to find all the different software provided by Anaconda with ease. Running the program will give you a screen like this:

As you can see, there are many programs available to you. Click Launch to open the Jupyter Notebook. A new tab will open in your browser, showing you a list of folders and files in your home directory:

Instead of using the Anaconda Navigator, you can launch Jupyter Notebook by opening up your Terminal/Command Prompt and running the jupyter notebook command like this:

It is not necessary to run this command from your home directory. You can run it from any location, and the contents in the browser will reflect that location.

Although we have now started the Jupyter Notebook program, we haven't actually launched a single individual notebook where we can start developing in Python. To do so, you can click on the New button on the right-hand side of the page, which will drop down a list of all the possible kernels available for you to use. If you just downloaded Anaconda, then you will only have a single kernel available to you (Python 3). After selecting the Python 3 kernel, a new tab will open in the browser, where you can start writing Python code:

You can, of course, open previously created notebooks instead of beginning a new one. To do so, simply navigate through the filesystem provided in the Jupyter Notebook browser home page and select the notebook you want to open. All Jupyter Notebook files end in .ipynb. For instance, when you navigate to the location of the notebook files for this book, you will see all of them like this:

Who this book is for

This book contains nearly 100 recipes, ranging from very simple to advanced. All recipes strive to be written in clear, concise, and modern idiomatic pandas code. The How it works... sections contain extremely detailed descriptions of the intricacies of each step of the recipe. Often, in the There's more... section, you will get what may seem like an entirely new recipe. This book is densely packed with an extraordinary amount of pandas code.

As a generalization, the recipes in the first six chapters tend to be simpler and more focused on the fundamental and essential operations of pandas than the last five chapters, which focus on more advanced operations and are more project-driven. Due to the wide range of complexity, this book can be useful to novice and everyday users alike. It has been my experience that even those who use pandas regularly will not master it without being exposed to idiomatic pandas code. This is somewhat fostered by the breadth that pandas offers. There are almost always multiple ways of completing the same operation, which can lead users to the result they want, but in a very inefficient manner. It is not uncommon to see an order of magnitude or more performance difference between two sets of pandas solutions to the same problem.
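
As a rough, hypothetical illustration of how large that gap can be (this example is mine, not one of the book's recipes), compare summing a column with a Python-level loop against the vectorized method:

# In a Jupyter Notebook cell
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': np.random.rand(100000)})

%timeit sum(row['a'] for _, row in df.iterrows())   # row-by-row Python loop
%timeit df['a'].sum()                               # vectorized; typically orders of magnitude faster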

The only real prerequisite for this book is fundamental knowledge of Python. It is assumed that the reader is familiar with all the common built-in data containers in Python, such as lists, sets, dictionaries, and tuples.

How to get the most out of this book

There are a couple of things you can do to get the most out of this book. First, and most importantly, you should download all the code, which will be stored in Jupyter Notebooks. While reading through each recipe, run each step of code in the notebook. Make sure you explore on your own as you run through the code. Second, have the pandas official documentation open (http://pandas.pydata.org/pandas-docs/stable/) in one of your browser tabs. The pandas documentation is an excellent resource containing over 1,000 pages of material. There are examples for most of the pandas operations in the documentation, and they will often be directly linked from the See also section. While it covers the basics of most operations, it does so with trivial examples and fake data that don't reflect situations that you are likely to encounter when analyzing datasets from the real world.

Dataset Descriptions

There are about two dozen datasets that are used throughout this book. It can be very helpful to have background information on each dataset as you complete the steps in the recipes. A detailed description of each dataset may be found in the dataset_descriptions Jupyter Notebook found at https://github.com/PacktPublishing/Pandas-Cookbook. For each dataset, there will be a list of the columns, information about each column, and notes on how the data was procured.

Sections

In this book, you will find several headings that appear frequently (Getting ready, How to do it…, How it works…, There's more…, and See also).

To give clear instructions on how to complete a recipe, we use these sections as follows:

Getting ready

This section tells you what to expect in the recipe, and describes how to set up any software or any preliminary settings required for the recipe.

How to do it...

This section contains the steps required to follow the recipe.

How it works...

This section usually consists of a detailed explanation of what happened in the previous section.

There's more...

This section consists of additional information about the recipe in order to make the reader more knowledgeable about the recipe.

See also

This section provides helpful links to other useful information for the recipe.

Reader feedback

Feedback from our readers is always welcome. Let us know what you think about this book--what you liked or disliked. Reader feedback is important to us as it helps us develop titles that you will really get the most out of.

To send us general feedback, simply email [email protected], and mention the book's title in the subject of your message.

If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide at www.packtpub.com/authors.

Customer support

Now that you are the proud owner of a Packt book, we have a number of things to help you to get the most from your purchase.

Downloading the example code

You can download the example code files for this book from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.

You can download the code files by following these steps:

Log in or register to our website using your e-mail address and password.

Hover the mouse pointer on the SUPPORT tab at the top.

Click on Code Downloads & Errata.

Enter the name of the book in the Search box.

Select the book for which you're looking to download the code files.

Choose from the drop-down menu where you purchased this book from.

Click on Code Download.

You can also download the code files by clicking on the Code Files button on the book's webpage at the Packt Publishing website. This page can be accessed by entering the book's name in the search box. Please note that you need to be logged in to your Packt account.

Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:

WinRAR / 7-Zip for Windows

Zipeg / iZip / UnRarX for Mac

7-Zip / PeaZip for Linux

The code bundle for the book is also hosted on GitHub at https://github.com/PacktPublishing/Pandas-Cookbook. We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!

Downloading the color images of this book

We also provide you with a PDF file that has color images of the screenshots/diagrams used in this book. The color images will help you better understand the changes in the output. You can download this file from https://www.packtpub.com/sites/default/files/downloads/PandasCookbook_ColorImages.pdf.

Errata

Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books--maybe a mistake in the text or the code--we would be grateful if you could report this to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this book. If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting your book, clicking on the Errata Submission Form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded to our website or added to any list of existing errata under the Errata section of that title.

To view the previously submitted errata, go to https://www.packtpub.com/books/content/support and enter the name of the book in the search field. The required information will appear under the Errata section.

Piracy

Piracy of copyrighted material on the internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works in any form on the Internet, please provide us with the location address or website name immediately so that we can pursue a remedy.

Please contact us at [email protected] with a link to the suspected pirated material.

We appreciate your help in protecting our authors and our ability to bring you valuable content.

Questions

If you have a problem with any aspect of this book, you can contact us at [email protected], and we will do our best to address the problem.

Pandas Foundations

In this chapter, we will cover the following:

Dissecting the anatomy of a DataFrame

Accessing the main DataFrame components

Understanding data types

Selecting a single column of data as a Series

Calling Series methods

Working with operators on a Series

Chaining Series methods together

Making the index meaningful

Renaming row and column names

Creating and deleting columns

Introduction

The goal of this chapter is to introduce a foundation of pandas by thoroughly inspecting the Series and DataFrame data structures. It is vital for pandas users to know each component of the Series and the DataFrame, and to understand that each column of data in pandas holds precisely one data type.

In this chapter, you will learn how to select a single column of data from a DataFrame, which is returned as a Series. Working with this one-dimensional object makes it easy to show how different methods and operators work. Many Series methods return another Series as output. This leads to the possibility of calling further methods in succession, which is known as method chaining.
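
As a tiny preview of method chaining (using a made-up Series rather than one of the book's datasets), each method below returns a new Series on which the next method is called:

>>> import pandas as pd
>>> s = pd.Series([1.0, None, 3.5, None, 5.0])
>>> s.fillna(0).astype('int64').head(3)
0    1
1    0
2    3
dtype: int64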

The Index component of the Series and DataFrame is what separates pandas from most other data analysis libraries and is the key to understanding how many operations work. We will get a glimpse of this powerful object when we use it as a meaningful label for Series values. The final two recipes contain simple tasks that frequently occur during a data analysis.

Dissecting the anatomy of a DataFrame

Before diving deep into pandas, it is worth knowing the components of the DataFrame. Visually, the outputted display of a pandas DataFrame (in a Jupyter Notebook) appears to be nothing more than an ordinary table of data consisting of rows and columns. Hiding beneath the surface are the three components--the index, the columns, and the data (also known as the values)--that you must be aware of in order to maximize the DataFrame's full potential.

Getting ready

This recipe reads in the movie dataset into a pandas DataFrame and provides a labeled diagram of all its major components.
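
In code, the setup amounts to importing pandas and reading the movie file; a minimal sketch, assuming the CSV ships with the book's code bundle under data/movie.csv:

>>> import pandas as pd
>>> movie = pd.read_csv('data/movie.csv')   # path assumed from the book's code bundle
>>> movie.head()                            # display the first five rows with their index and columns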

How it works...

Pandas first reads the data from disk into memory and into a DataFrame using the excellent and versatile read_csv function. The output for both the columns and the index is in bold font, which makes them easy to identify. By convention, the terms index label and column name refer to the individual members of the index and columns, respectively. The term index refers to all the index labels as a whole just as the term columns refers to all the column names as a whole.

The columns and the index serve a particular purpose, and that is to provide labels for the columns and rows of the DataFrame. These labels allow for direct and easy access to different subsets of data. When multiple Series or DataFrames are combined, the indexes align first before any calculation occurs. Collectively, the columns and the index are known as the axes.

A DataFrame has two axes--a vertical axis (the index) and a horizontal axis (the columns). Pandas borrows convention from NumPy and uses the integers 0/1 as another way of referring to the vertical/horizontal axis.

DataFrame data (values) is always in regular font and is an entirely separate component from the columns or index. Pandas uses NaN (not a number) to represent missing values. Notice that even though the color column has only string values, it uses NaN to represent a missing value.

The three consecutive dots in the middle of the columns indicate that there is at least one column that exists but is not displayed due to the number of columns exceeding the predefined display limits.
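
If you want more (or fewer) columns shown before the display is truncated, the relevant display option can be adjusted; a sketch:

>>> pd.get_option('display.max_columns')      # inspect the current column display limit
>>> pd.set_option('display.max_columns', 50)  # allow up to 50 columns before the ... truncation appears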

The Python standard library contains the csv module, which can be used to parse and read in data. The pandas read_csv function offers a powerful increase in performance and functionality over this module.

There's more...

The head method accepts a single parameter, n, which controls the number of rows displayed. Similarly, the tail method returns the last n rows.
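
For example, with the movie DataFrame:

>>> movie.head(3)   # the first three rows
>>> movie.tail(7)   # the last seven rows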

See also

Pandas official documentation of the read_csv function (http://bit.ly/2vtJQ9A)

Accessing the main DataFrame components

Each of the three DataFrame components--the index, columns, and data--may be accessed directly from a DataFrame. Each of these components is itself a Python object with its own unique attributes and methods. It will often be the case that you would like to perform operations on the individual components and not on the DataFrame as a whole.

Getting ready

This recipe extracts the index, columns, and the data of the DataFrame into separate variables, and then shows how the columns and index are inherited from the same object.

How it works...

You may access the three main components of a DataFrame with the index, columns, and values attributes. The output of the columns attribute appears to be just a sequence of the column names. This sequence of column names is technically an Index object. The output of the function type is the fully qualified class name of the object.
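
In code, pulling out the components and inspecting the type of the columns looks roughly like this (continuing with the movie DataFrame from the previous recipe):

>>> index = movie.index
>>> columns = movie.columns
>>> data = movie.values
>>> type(columns)
<class 'pandas.core.indexes.base.Index'>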

The fully qualified class name of the object for the variable columns is pandas.core.indexes.base.Index. It begins with the package name, which is followed by a path of modules and ends with the name of the type. A common way of referring to objects is to include the package name followed by the name of the object type. In this instance, we would refer to the columns as a pandas Index object.

The built-in issubclass function checks whether the first argument inherits from the second. The Index and RangeIndex objects are very similar, and in fact, pandas has a number of similar objects reserved specifically for either the index or the columns. The index and the columns must both be some kind of Index object. Essentially, the index and the columns represent the same thing, but along different axes. They’re occasionally referred to as the row index and column index.
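
For example, a quick check shows that RangeIndex inherits from Index:

>>> issubclass(pd.RangeIndex, pd.Index)
True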

In this context, the Index objects refer to all the possible objects that can be used for the index or columns. They are all subclasses of pd.Index. Here is the complete list of the Index objects: CategoricalIndex, MultiIndex, IntervalIndex, Int64Index, UInt64Index, Float64Index, RangeIndex, TimedeltaIndex, DatetimeIndex, PeriodIndex.

A RangeIndex is a special type of Index object that is analogous to Python's range object. Its entire sequence of values is not loaded into memory until it is necessary to do so, thereby saving memory. It is completely defined by its start, stop, and step values.
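
A RangeIndex can be built directly from those three values, and it only materializes the underlying integers when asked to (a small sketch):

>>> ri = pd.RangeIndex(start=0, stop=10, step=2)
>>> ri
RangeIndex(start=0, stop=10, step=2)
>>> ri.values    # the integers are generated here
array([0, 2, 4, 6, 8])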

There's more...

When possible, Index objects are implemented using hash tables that allow for very fast selection and data alignment. They are similar to Python sets in that they support operations such as intersection and union, but are dissimilar because they are ordered with duplicates allowed.
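
For instance, union and intersection behave much like they do for Python sets (a toy example with hand-made indexes):

>>> i1 = pd.Index(['a', 'b', 'c'])
>>> i2 = pd.Index(['b', 'c', 'd'])
>>> i1.union(i2)
Index(['a', 'b', 'c', 'd'], dtype='object')
>>> i1.intersection(i2)
Index(['b', 'c'], dtype='object')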

Python dictionaries and sets are also implemented with hash tables that allow for membership checking to happen very fast in constant time, regardless of the size of the object.

Notice how the values DataFrame attribute returned a NumPy n-dimensional array, or ndarray. Most of pandas relies heavily on the ndarray. Beneath the index, columns, and data are NumPy ndarrays. They could be considered the base object for pandas that many other objects are built upon. To see this, we can look at the values of the index and columns:

>>> index.values
array([   0,    1,    2, ..., 4913, 4914, 4915])

>>> columns.values
array(['color', 'director_name', 'num_critic_for_reviews', ...,
       'imdb_score', 'aspect_ratio', 'movie_facebook_likes'], dtype=object)

See also

Pandas official documentation of Indexing and Selecting data (http://bit.ly/2vm8f12)

A look inside pandas design and development, a slide deck from pandas author Wes McKinney (http://bit.ly/2u4YVLi)

Understanding data types

In very broad terms, data may be classified as either continuous or categorical. Continuous data is always numeric and represents some kind of measurement, such as height, wage, or salary. Continuous data can take on an infinite number of possibilities. Categorical data, on the other hand, represents discrete, finite amounts of values such as car color, type of poker hand, or brand of cereal.

Pandas does not broadly classify data as either continuous or categorical. Instead, it has precise technical definitions for many distinct data types. The following table contains all pandas data types, with their string equivalents, and some notes on each type:

Common data type name | NumPy/pandas object | Pandas string name | Notes
Boolean | np.bool | bool | Stored as a single byte.
Integer | np.int | int | Defaulted to 64 bits. Unsigned ints are also available - np.uint.
Float | np.float | float | Defaulted to 64 bits.
Complex | np.complex | complex | Rarely seen in data analysis.
Object | np.object | O, object | Typically strings, but a catch-all for columns with multiple different types or other Python objects (tuples, lists, dicts, and so on).
Datetime | np.datetime64, pd.Timestamp | datetime64 | Specific moment in time with nanosecond precision.
Timedelta | np.timedelta64, pd.Timedelta | timedelta64 | An amount of time, from days to nanoseconds.
Categorical | pd.Categorical | category | Specific only to pandas. Useful for object columns with relatively few unique values.
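
A small, hand-built DataFrame (my own example, not one of the book's datasets) shows how these string names surface when you inspect column types:

>>> import pandas as pd
>>> df = pd.DataFrame({'flag': [True, False],
...                    'count': [1, 2],
...                    'price': [1.5, 2.5],
...                    'name': ['a', 'b'],
...                    'when': pd.to_datetime(['2017-01-01', '2017-06-15'])})
>>> df.dtypes    # the columns come out as bool, int64, float64, object, and datetime64[ns]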

Getting ready

In this recipe, we display the data type of each column in a DataFrame. It is crucial to know the type of data held in each column as it fundamentally changes the kind of operations that are possible with it.

How it works...

Each DataFrame column must be exactly one type. For instance, every value in the column aspect_ratio is a 64-bit float, and every value in movie_facebook_likes is a 64-bit integer. Pandas defaults its core numeric types, integers, and floats to 64 bits regardless of the size necessary for all data to fit in memory. Even if a column consists entirely of the integer value 0, the data type will still be int64. get_dtype_counts is a convenience method for directly returning the count of all the data types in the DataFrame.
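
On the movie DataFrame, the two calls referred to above look like this (get_dtype_counts existed in pandas 0.20 but has since been removed from newer pandas versions, where df.dtypes.value_counts() serves the same purpose):

>>> movie.dtypes              # the data type of each column
>>> movie.get_dtype_counts()  # how many columns hold each data type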

Homogeneous data is another term for referring to columns that all have the same type. DataFrames as a whole may contain heterogeneous data, as their columns may each hold a different data type.