This book will provide you with unique, idiomatic, and fun recipes for both fundamental and advanced data manipulation tasks with pandas 0.20. Some recipes focus on achieving a deeper understanding of basic principles, or comparing and contrasting two similar operations. Other recipes will dive deep into a particular dataset, uncovering new and unexpected insights along the way.
The pandas library is massive, and it's common for frequent users to be unaware of many of its more impressive features. The official pandas documentation, while thorough, does not contain many useful examples of how to piece together multiple commands like one would do during an actual analysis. This book guides you, as if you were looking over the shoulder of an expert, through practical situations that you are highly likely to encounter.
Many advanced recipes combine several different features across the pandas 0.20 library to generate results.
You can read this e-book in Legimi apps or in any other app that supports the following format:
Page count: 475
Year of publication: 2017
BIRMINGHAM - MUMBAI
Copyright © 2017 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
First published: October 2017
Production reference: 2171219
ISBN 978-1-78439-387-8
www.packtpub.com
Author
Theodore Petrou
Copy Editor
Tasneem Fatehi
Reviewers
Sonali Dayal
Kuntal Ganguly
Shilpi Saxena
Project Coordinator
Manthan Patel
Commissioning Editor
Veena Pagare
Proofreader
Safis Editing
Acquisition Editor
Tushar Gupta
Indexer
Tejal Daruwale Soni
Content Development Editor
Snehal Kolte
Graphics
Tania Dutta
Technical Editor
Sayli Nikalje
Production Coordinator
Deepika Naik
Theodore Petrou is a data scientist and the founder of Dunder Data, a professional educational company focusing on exploratory data analysis. He is also the head of Houston Data Science, a meetup group with more than 2,000 members that has the primary goal of getting local data enthusiasts together in the same room to practice data science. Before founding Dunder Data, Ted was a data scientist at Schlumberger, a large oil services company, where he spent the vast majority of his time exploring data.
Some of his projects included using targeted sentiment analysis to discover the root cause of part failure from engineer text, developing customized client/server dashboarding applications, and building real-time web services to avoid the mispricing of sales items. Ted received his master's degree in statistics from Rice University, and used his analytical skills to play poker professionally and teach math before becoming a data scientist. Ted is a strong supporter of learning through practice and can often be found answering questions about pandas on Stack Overflow.
I would first like to thank my wife, Eleni, and two young children, Penelope and Niko, who endured extended periods of time without me as I wrote.
I’d also like to thank Sonali Dayal, whose constant feedback helped immensely in structuring the content of the book to improve its effectiveness. Thank you to Roy Keyes, who is the most exceptional data scientist I know and whose collaboration made Houston Data Science possible. Thank you to Scott Boston, an extremely skilled pandas user, for developing ideas for recipes. Thank you very much to Kim Williams, Randolph Adami, Kevin Higgins, and Vishwanath Avasarala, who took a chance on me during my professional career when I had little to no experience. Thanks to my fellow coworker at Schlumberger, Micah Miller, for his critical, honest, and instructive feedback on anything that we developed together and his constant pursuit to move toward Python.
Thank you to Phu Ngo, who critically challenges and sharpens my thinking more than anyone. Thank you to my brother, Dean Petrou, for being right by my side as we developed our analytical skills through poker and again through business. Thank you to my sister, Stephanie Burton, for always knowing what I’m thinking and making sure that I’m aware of it. Thank you to my mother, Sofia Petrou, for her ceaseless love, support, and endless math puzzles that challenged me as a child. And thank you to my father, Steve Petrou, who, although no longer here, remains close to my heart and continues to encourage me every day.
Sonali Dayal is a master's candidate in biostatistics at the University of California, Berkeley. Previously, she has worked as a freelance software and data science engineer for early stage start-ups, where she built supervised and unsupervised machine learning models as well as data pipelines and interactive data analytics dashboards. She received her bachelor of science (B.S.) in biochemistry from Virginia Tech in 2011.
Kuntal Ganguly is a big data machine learning engineer focused on building large-scale data-driven systems using big data frameworks and machine learning. He has around 7 years of experience building several big data and machine learning applications.
Kuntal provides solutions to AWS customers in building real-time analytics systems using managed cloud services and open source Hadoop ecosystem technologies such as Spark, Kafka, Storm, and Solr, along with machine learning and deep learning frameworks such as scikit-learn, TensorFlow, Keras, and BigDL. He enjoys hands-on software development, and has single-handedly conceived, architected, developed, and deployed several large-scale distributed applications. He is a machine learning and deep learning practitioner and very passionate about building intelligent applications.
Kuntal is the author of the books Learning Generative Adversarial Network and R Data Analysis Cookbook - Second Edition, both published by Packt.
Shilpi Saxena is a seasoned professional who leads in management with an edge as a technology evangelist--she is an engineer with exposure to a variety of domains (machine-to-machine space, healthcare, telecom, hiring, and manufacturing). She has experience in all aspects of the conception and execution of enterprise solutions. For the last 3 years, she has been architecting, managing, and delivering solutions in the big data space, handling high-performance, geographically distributed teams of elite engineers. Shilpi has more than 12 years of experience (3 years in the big data space) in the development and execution of various facets of enterprise solutions, in both the product and services dimensions of the software industry. An engineer by degree and profession, she has worn various hats--developer, technical leader, product owner, tech manager--and has seen all the flavors that the industry has to offer. She architected and worked through some of the pioneering production implementations of big data on Storm and Impala with auto-scaling in AWS. LinkedIn: http://in.linkedin.com/pub/shilpi-saxena/4/552/a30
For support files and downloads related to your book, please visit www.PacktPub.com. Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at [email protected] for more details. At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters and receive exclusive discounts and offers on Packt books and eBooks.
https://www.packtpub.com/mapt
Get the most in-demand software skills with Mapt. Mapt gives you full access to all Packt books and video courses, as well as industry-leading tools to help you plan your personal development and advance your career.
Fully searchable across every book published by Packt
Copy and paste, print, and bookmark content
On demand and accessible via a web browser
Thanks for purchasing this Packt book. At Packt, quality is at the heart of our editorial process. To help us improve, please leave us an honest review on this book's Amazon page at https://www.amazon.com/dp/1784393878. If you'd like to join our team of regular reviewers, you can email us at [email protected]. We award our regular reviewers with free eBooks and videos in exchange for their valuable feedback. Help us be relentless in improving our products!
Title Page
Copyright
Pandas Cookbook
Credits
About the Author
Acknowledgement
About the Reviewers
www.PacktPub.com
Why subscribe?
Customer Feedback
Preface
What this book covers
What you need for this book
Running a Jupyter Notebook
Who this book is for
How to get the most out of this book
Conventions
Assumptions for every recipe
Dataset Descriptions
Sections
Getting ready
How to do it...
How it works...
There's more...
See also
Reader feedback
Customer support
Downloading the example code
Downloading the color images of this book
Errata
Piracy
Questions
Pandas Foundations
Introduction
Dissecting the anatomy of a DataFrame
Getting ready
How to do it...
How it works...
There's more...
See also
Accessing the main DataFrame components
Getting ready
How to do it...
How it works...
There's more...
See also
Understanding data types
Getting ready
How to do it...
How it works...
There's more...
See also
Selecting a single column of data as a Series
Getting ready
How to do it...
How it works...
There's more...
See also
Calling Series methods
Getting ready
How to do it...
How it works...
There's more...
See also
Working with operators on a Series
Getting ready
How to do it...
How it works...
There's more...
See also
Chaining Series methods together
Getting ready
How to do it...
How it works...
There's more...
Making the index meaningful
Getting ready
How to do it...
How it works...
There's more...
See also
Renaming row and column names
Getting ready
How to do it...
How it works...
There's more...
Creating and deleting columns
Getting ready
How to do it...
How it works...
There's more...
See also
Essential DataFrame Operations
Introduction
Selecting multiple DataFrame columns
Getting ready
How to do it...
How it works...
There's more...
Selecting columns with methods
Getting ready
How to do it...
How it works...
There's more...
See also
Ordering column names sensibly
Getting ready
How to do it...
How it works...
There's more...
See also
Operating on the entire DataFrame
Getting ready
How to do it...
How it works...
There's more...
Chaining DataFrame methods together
Getting ready
How to do it...
How it works...
There's more...
See also
Working with operators on a DataFrame
Getting ready
How to do it...
How it works...
There's more...
See also
Comparing missing values
Getting ready
How to do it...
How it works...
There's more...
Transposing the direction of a DataFrame operation
Getting ready
How to do it...
How it works...
There's more...
See also
Determining college campus diversity
Getting ready
How to do it...
How it works...
There's more...
See also
Beginning Data Analysis
Introduction
Developing a data analysis routine
Getting ready
How to do it...
How it works...
There's more...
Data dictionaries
See also
Reducing memory by changing data types
Getting ready
How to do it...
How it works...
There's more...
See also
Selecting the smallest of the largest
Getting ready
How to do it...
How it works...
There's more...
Selecting the largest of each group by sorting
Getting ready
How to do it...
How it works...
There's more...
Replicating nlargest with sort_values
Getting ready
How to do it...
How it works...
There's more...
Calculating a trailing stop order price
Getting ready
How to do it...
How it works...
There's more...
See also
Selecting Subsets of Data
Introduction
Selecting Series data
Getting ready
How to do it...
How it works...
There's more...
See also
Selecting DataFrame rows
Getting ready
How to do it...
How it works...
There's more...
See also
Selecting DataFrame rows and columns simultaneously
Getting ready
How to do it...
How it works...
There's more...
Selecting data with both integers and labels
Getting ready
How to do it...
How it works...
There's more...
See also
Speeding up scalar selection
Getting ready
How to do it...
How it works...
There's more...
Slicing rows lazily
Getting ready
How to do it...
How it works...
There's more...
Slicing lexicographically
Getting ready
How to do it...
How it works...
There's more...
Boolean Indexing
Introduction
Calculating boolean statistics
Getting ready
How to do it...
How it works...
There's more...
See also
Constructing multiple boolean conditions
Getting ready
How to do it...
How it works...
There's more...
See also
Filtering with boolean indexing
Getting ready
How to do it...
How it works...
There's more...
See also
Replicating boolean indexing with index selection
Getting ready
How to do it...
How it works...
There's more...
Selecting with unique and sorted indexes
Getting ready
How to do it...
How it works...
There's more...
See also
Gaining perspective on stock prices
Getting ready
How to do it...
How it works...
There's more...
See also
Translating SQL WHERE clauses
Getting ready
How to do it...
How it works...
There's more...
See also
Determining the normality of stock market returns
Getting ready
How to do it...
How it works...
There's more...
See also
Improving readability of boolean indexing with the query method
Getting ready
How to do it...
How it works...
There's more...
See also
Preserving Series with the where method
Getting ready
How to do it...
How it works...
There's more...
See also
Masking DataFrame rows
Getting ready
How to do it...
How it works...
There's more...
See also
Selecting with booleans, integer location, and labels
Getting ready
How to do it...
How it works...
There's more...
See also
Index Alignment
Introduction
Examining the Index object
Getting ready
How to do it...
How it works...
There's more...
See also
Producing Cartesian products
Getting ready
How to do it...
How it works...
There's more...
See also
Exploding indexes
Getting ready
How to do it...
How it works...
There's more...
Filling values with unequal indexes
Getting ready
How to do it...
How it works...
There's more...
Appending columns from different DataFrames
Getting ready
How to do it...
How it works...
There's more...
See also
Highlighting the maximum value from each column
Getting ready
How to do it...
How it works...
There's more...
See also
Replicating idxmax with method chaining
Getting ready
How to do it...
How it works...
There's more...
Finding the most common maximum
Getting ready
How to do it...
How it works...
There's more...
Grouping for Aggregation, Filtration, and Transformation
Introduction
Defining an aggregation
Getting ready
How to do it...
How it works...
There's more...
See also
Grouping and aggregating with multiple columns and functions
Getting ready
How to do it...
How it works...
There's more...
Removing the MultiIndex after grouping
Getting ready
How to do it...
How it works...
There's more...
Customizing an aggregation function
Getting ready
How to do it...
How it works...
There's more...
Customizing aggregating functions with *args and **kwargs
Getting ready
How to do it...
How it works...
There's more...
See also
Examining the groupby object
Getting ready
How to do it...
How it works...
There's more...
See also
Filtering for states with a minority majority
Getting ready
How to do it...
How it works...
There's more...
See also
Transforming through a weight loss bet
Getting ready
How to do it...
How it works...
There's more...
See also
Calculating weighted mean SAT scores per state with apply
Getting ready
How to do it...
How it works...
There's more...
See also
Grouping by continuous variables
Getting ready
How to do it...
How it works...
There's more...
See also
Counting the total number of flights between cities
Getting ready
How to do it...
How it works...
There's more...
See also
Finding the longest streak of on-time flights
Getting ready
How to do it...
How it works...
There's more...
See also
Restructuring Data into a Tidy Form
Introduction
Tidying variable values as column names with stack
Getting ready
How to do it...
How it works...
There's more...
See also
Tidying variable values as column names with melt
Getting ready
How to do it...
How it works...
There's more...
See also
Stacking multiple groups of variables simultaneously
Getting ready
How to do it...
How it works...
There's more...
See also
Inverting stacked data
Getting ready
How to do it...
How it works...
There's more...
See also
Unstacking after a groupby aggregation
Getting ready
How to do it...
How it works...
There's more...
See also
Replicating pivot_table with a groupby aggregation
Getting ready
How to do it...
How it works...
There's more...
Renaming axis levels for easy reshaping
Getting ready
How to do it...
How it works...
There's more...
Tidying when multiple variables are stored as column names
Getting ready
How to do it...
How it works...
There's more...
See also
Tidying when multiple variables are stored as column values
Getting ready
How to do it...
How it works...
There's more...
See also
Tidying when two or more values are stored in the same cell
Getting ready
How to do it...
How it works...
There's more...
Tidying when variables are stored in column names and values
Getting ready
How to do it...
How it works...
There's more...
Tidying when multiple observational units are stored in the same table
Getting ready
How to do it...
How it works...
There's more...
See also
Combining Pandas Objects
Introduction
Appending new rows to DataFrames
Getting ready
How to do it...
How it works...
There's more...
Concatenating multiple DataFrames together
Getting ready
How to do it...
How it works...
There's more...
Comparing President Trump's and Obama's approval ratings
Getting ready
How to do it...
How it works...
There's more...
See also
Understanding the differences between concat, join, and merge
Getting ready
How to do it...
How it works...
There's more...
See also
Connecting to SQL databases
Getting ready
How to do it...
How it works...
There's more...
See also
Time Series Analysis
Introduction
Understanding the difference between Python and pandas date tools
Getting ready
How to do it...
How it works...
There's more...
See also
Slicing time series intelligently
Getting ready
How to do it...
How it works...
There's more...
See also
Using methods that only work with a DatetimeIndex
Getting ready
How to do it...
How it works...
There's more...
See also
Counting the number of weekly crimes
Getting ready
How to do it...
How it works...
There's more...
See also
Aggregating weekly crime and traffic accidents separately
Getting ready
How to do it...
How it works...
There's more...
Measuring crime by weekday and year
Getting ready
How to do it...
How it works...
There's more...
See also
Grouping with anonymous functions with a DatetimeIndex
Getting ready
How to do it...
How it works...
There's more...
See also
Grouping by a Timestamp and another column
Getting ready
How to do it...
How it works...
There's more...
Finding the last time crime was 20% lower with merge_asof
Getting ready
How to do it...
How it works...
There's more...
Visualization with Matplotlib, Pandas, and Seaborn
Introduction
Getting started with matplotlib
Getting ready
Object-oriented guide to matplotlib
How to do it...
How it works...
There's more...
See also
Visualizing data with matplotlib
Getting ready
How to do it...
How it works...
There's more...
See also
Plotting basics with pandas
Getting ready
How to do it...
How it works...
There's more...
See also
Visualizing the flights dataset
Getting ready
How to do it...
How it works...
See also
Stacking area charts to discover emerging trends
Getting ready
How to do it...
How it works...
There's more...
Understanding the differences between seaborn and pandas
Getting ready
How to do it...
How it works...
See also
Doing multivariate analysis with seaborn Grids
Getting ready
How to do it...
How it works...
There's more...
Uncovering Simpson's paradox in the diamonds dataset with seaborn
Getting ready
How to do it...
How it works...
There's more...
The popularity of data science has skyrocketed since it was called The Sexiest Job of the 21st Century by the Harvard Business Review in 2012. It was ranked as the number one job by Glassdoor in both 2016 and 2017. Fueling this skyrocketing popularity for data science is the demand from industry. Several applications have made big splashes in the news, such as Netflix making better movie recommendations, IBM Watson defeating humans at Jeopardy, Tesla building self-driving cars, Major League Baseball teams finding undervalued prospects, and Google learning to identify cats on the internet.
Nearly every industry is finding ways to use data science to build new technology or provide deeper insights. Due to such noteworthy successes, an ever-present aura of hype seems to encapsulate data science. Most of the scientific progress backing this hype stems from the field of machine learning, which produces the algorithms that make the predictions responsible for artificial intelligence.
The fundamental building block for all machine learning algorithms is, of course, data. Companies have realized this, and there is no shortage of it. The business intelligence company Domo estimates that 90% of the world's data has been created in just the last two years. Although machine learning gets all the attention, it is completely reliant on the quality of the data that it is fed. Before data ever reaches the input layers of a machine learning algorithm, it must be prepared, and for data to be prepared properly, it needs to be explored thoroughly for basic understanding and to identify inaccuracies. Before data can be explored, it needs to be captured.
To summarize, we can cast the data science pipeline into three stages--data capturing, data exploration, and machine learning. There is a vast array of tools available to complete each stage of the pipeline. Pandas is the dominant tool in the scientific Python ecosystem for data exploration and analysis. It is tremendously capable of inspecting, cleaning, tidying, filtering, transforming, aggregating, and even visualizing (with some help) all types of data. It is not a tool for initially capturing the data, nor is it a tool to build machine learning models.
For many data analysts and scientists who use Python, the vast majority of their work will be done using pandas. This is likely because the initial data exploration and preparation tend to take the most time. Some entire projects consist only of data exploration and have no machine learning component. Data scientists spend so much time on this stage that a timeless lore has arisen--Data scientists spend 80% of their time cleaning the data and the other 20% complaining about cleaning the data.
Although there is an abundance of open source and free programming languages available to do data exploration, the field is currently dominated by just two players, Python and R. The two languages have vastly different syntax but are both very capable of doing data analysis and machine learning. One measure of popularity is the number of questions asked on the popular Q&A site, Stack Overflow (https://insights.stackoverflow.com/trends):
While this is not a true measure of usage, it is clear that both Python and R have become increasingly popular, likely due to their data science capabilities. It is interesting to note that the percentage of Python questions remained constant until the year 2012, when data science took off. What is probably most astonishing about this graph is that pandas questions now make up a whopping one percent of all the newest questions on Stack Overflow.
One of the reasons why Python has become a language of choice for data science is that it is a fairly easy language to learn and develop, and so it has a low barrier to entry. It is also free and open source, able to run on a variety of hardware and software, and a breeze to get up and running. It has a very large and active community with a substantial amount of free resources online. In my opinion, Python is one of the most fun languages to develop programs with. The syntax is so clear, concise, and intuitive but like all languages, takes quite a long time to master.
As Python, unlike R, was not built for data analysis, pandas syntax may not come as naturally as it does in some other Python libraries. This might actually be part of the reason why there are so many Stack Overflow questions about it. Despite its tremendous capabilities, pandas code can often be poorly written. One of the main aims of this book is to show performant and idiomatic pandas code.
For all its greatness, Stack Overflow unfortunately perpetuates misinformation and is a source of lots of poorly written pandas code. This is actually not the fault of Stack Overflow or its community. Pandas is an open source project and has had numerous major changes, even recently, as it approaches its tenth year of existence in 2018. The upside of open source, though, is that new features get added to it all the time.
The recipes in this book were formulated through my experience working as a data scientist, building and hosting several week-long data exploration bootcamps, answering several hundred questions on Stack Overflow, and building tutorials for my local meetup group. The recipes not only offer idiomatic solutions to common data problems, but also take you on journeys through many real-world datasets, where surprising insights are often discovered. These recipes will also help you master the pandas library, which will give you a gigantic boost in productivity. There is a huge difference between those who have only cursory knowledge of pandas and those who have it mastered. There are so many interesting and fun tricks to solve your data problems that only become apparent if you truly know the library inside and out. Personally, I find pandas to be a delightful and fun tool to analyze data with, and I hope you enjoy your journey along with me. If you have questions, please feel free to reach out to me on Twitter: @TedPetrou.
Chapter 1, Pandas Foundations, covers the anatomy and vocabulary used to identify the components of the two main pandas data structures, the Series and the DataFrame. Each column must have exactly one type of data, and each of these data types is covered. You will learn how to unleash the power of the Series and the DataFrame by calling and chaining together their methods.
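As a tiny taste of Chapter 1, the sketch below uses made-up data (the names and labels are purely illustrative) to show the main DataFrame components, the one-data-type-per-column rule, and chaining Series methods together:

```python
import pandas as pd

# A minimal, hypothetical DataFrame for illustration only
df = pd.DataFrame(
    {"name": ["Niko", "Penelope", "Dean"], "age": [4, 7, 32]},
    index=["r1", "r2", "r3"],
)

print(df.index)    # the row labels
print(df.columns)  # the column labels
print(df.dtypes)   # exactly one data type per column

# Selecting a single column returns a Series, whose methods can be chained
mean_age = df["age"].add(1).mul(2).mean()
print(mean_age)
```

The chained expression reads left to right as a pipeline: add 1 to each age, double it, then take the mean.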
Chapter 2, Essential DataFrame Operations, focuses on the most crucial and common operations that you will perform during data analysis.
Chapter 3, Beginning Data Analysis, helps you develop a routine to get started after reading in your data. Other interesting discoveries will be made.
Chapter 4, Selecting Subsets of Data, covers the many varied and potentially confusing ways of selecting different subsets of data.
Chapter 5, Boolean Indexing, covers the process of querying your data to select subsets of it based on Boolean conditions.
Chapter 6, Index Alignment, targets the very important and often misunderstood index object. Misuse of the Index is responsible for lots of erroneous results, and these recipes show you how to use it correctly to deliver powerful results.
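To hint at why the Index is so often misunderstood, here is a small sketch with hypothetical data: arithmetic between two Series aligns on index labels first, not on position, which silently produces missing values for non-matching labels.

```python
import pandas as pd

# Two Series with partially overlapping index labels
s1 = pd.Series([10, 20, 30], index=["a", "b", "c"])
s2 = pd.Series([1, 2, 3], index=["b", "c", "d"])

# Alignment happens before the addition: only labels present in
# both Series get a value; 'a' and 'd' become NaN.
total = s1 + s2
print(total)
```

This automatic alignment is powerful when used deliberately, but it is a frequent source of the erroneous results the chapter warns about.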
Chapter 7, Grouping for Aggregation, Filtration, and Transformation, covers the powerful grouping capabilities that are almost always necessary during a data analysis. You will build customized functions to apply to your groups.
Chapter 8, Restructuring Data into a Tidy Form, explains what tidy data is and why it’s so important, and then it shows you how to transform many different forms of messy datasets into tidy ones.
Chapter 9, Combining Pandas Objects, covers the many available methods to combine DataFrames and Series vertically or horizontally. We will also do some web-scraping to compare President Trump's and Obama's approval rating and connect to an SQL relational database.
Chapter 10, Time Series Analysis, covers advanced and powerful time series capabilities to dissect by any dimension of time possible.
Chapter 11, Visualization with Matplotlib, Pandas, and Seaborn, introduces the matplotlib library, which is responsible for all of the plotting in pandas. We will then shift focus to the pandas plot method and, finally, to the seaborn library, which is capable of producing aesthetically pleasing visualizations not directly available in pandas.
Pandas is a third-party package for the Python programming language and, as of the printing of this book, is on version 0.20. Currently, Python has two major supported releases, versions 2.7 and 3.6. Python 3 is the future, and it is now highly recommended that all scientific computing users of Python use it, as Python 2 will no longer be supported in 2020. All examples in this book have been run and tested with pandas 0.20 on Python 3.6.
In addition to pandas, you will need to have the matplotlib version 2.0 and seaborn version 0.8 visualization libraries installed. A major dependency of pandas is the NumPy library, which forms the basis of most of the popular Python scientific computing libraries.
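The environment described above can be spot-checked with a short sketch like the following. The helper function is not from the book, just an illustration; it only tests whether each package is importable, without actually importing the heavy libraries.

```python
import sys
from importlib import util

def installed(packages=("pandas", "numpy", "matplotlib", "seaborn")):
    """Map each package name to whether it can be found in this environment."""
    return {name: util.find_spec(name) is not None for name in packages}

print("Python:", sys.version_info.major, sys.version_info.minor)
print(installed())
```

If any entry is False, install the missing package before working through the recipes.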
There are a wide variety of ways in which you can install pandas and the rest of the libraries mentioned on your computer, but by far the simplest method is to install the Anaconda distribution. Created by Continuum Analytics, it packages together all the popular libraries for scientific computing in a single downloadable file available on Windows, macOS, and Linux. Visit the download page to get the Anaconda distribution (https://www.anaconda.com/download).
In addition to all the scientific computing libraries, the Anaconda distribution comes with Jupyter Notebook, which is a browser-based program for developing in Python, among many other languages. All of the recipes for this book were developed inside of a Jupyter Notebook and all of the individual notebooks for each chapter will be available for you to use.
It is possible to install all the necessary libraries for this book without the use of the Anaconda distribution. For those that are interested, visit the pandas Installation page (http://pandas.pydata.org/pandas-docs/stable/install.html).
The suggested method to work through the content of this book is to have a Jupyter Notebook up and running so that you can run the code while reading through the recipes. This allows you to go exploring on your own and gain a deeper understanding than by just reading the book alone.
Assuming that you have installed the Anaconda distribution on your machine, you have two options available to start the Jupyter Notebook:
Use the program Anaconda Navigator
Run the jupyter notebook command from the Terminal/Command Prompt
The Anaconda Navigator is a GUI-based tool that allows you to find all the different software provided by Anaconda with ease. Running the program will give you a screen like this:
As you can see, there are many programs available to you. Click Launch to open the Jupyter Notebook. A new tab will open in your browser, showing you a list of folders and files in your home directory:
Instead of using the Anaconda Navigator, you can launch Jupyter Notebook by opening up your Terminal/Command Prompt and running the jupyter notebook command like this:
It is not necessary to run this command from your home directory. You can run it from any location, and the contents in the browser will reflect that location.
Although we have now started the Jupyter Notebook program, we haven't actually launched a single individual notebook where we can start developing in Python. To do so, you can click on the New button on the right-hand side of the page, which will drop down a list of all the possible kernels available for you to use. If you just downloaded Anaconda, then you will only have a single kernel available to you (Python 3). After selecting the Python 3 kernel, a new tab will open in the browser, where you can start writing Python code:
You can, of course, open previously created notebooks instead of beginning a new one. To do so, simply navigate through the filesystem provided in the Jupyter Notebook browser home page and select the notebook you want to open. All Jupyter Notebook files end in .ipynb. For instance, when you navigate to the location of the notebook files for this book, you will see all of them like this:
This book contains nearly 100 recipes, ranging from very simple to advanced. All recipes strive to be written in clear, concise, and modern idiomatic pandas code. The How it works... sections contain extremely detailed descriptions of the intricacies of each step of the recipe. Often, in the There's more... section, you will get what may seem like an entirely new recipe. This book is densely packed with an extraordinary amount of pandas code.
As a generalization, the recipes in the first six chapters tend to be simpler and more focused on the fundamental and essential operations of pandas than the last five chapters, which cover more advanced operations and are more project-driven. Due to the wide range of complexity, this book can be useful to novice and everyday users alike. It has been my experience that even those who use pandas regularly will not master it without being exposed to idiomatic pandas code. This is partly fostered by the breadth that pandas offers. There are almost always multiple ways of completing the same operation, which can lead users to get the result they want in a very inefficient manner. It is not uncommon to see an order of magnitude or more performance difference between two pandas solutions to the same problem.
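To make the performance gap concrete, here is a minimal sketch (using hypothetical random data, not one of the book's datasets) comparing a Python-level loop against the equivalent vectorized call:

```python
import numpy as np
import pandas as pd

# Hypothetical numeric data, large enough for the difference to matter.
s = pd.Series(np.random.rand(100_000))

# Inefficient: summing the values with a Python-level loop.
def loop_sum(series):
    total = 0.0
    for value in series:
        total += value
    return total

# Idiomatic: a single vectorized call that runs in compiled code.
vectorized_total = s.sum()

# Both give the same answer, but timing them (for example with %timeit
# in a Jupyter Notebook) typically shows the loop to be far slower.
assert np.isclose(loop_sum(s), vectorized_total)
```

Timing both versions with %timeit in a notebook is an easy way to see the gap for yourself.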
The only real prerequisite for this book is fundamental knowledge of Python. It is assumed that the reader is familiar with all the common built-in data containers in Python, such as lists, sets, dictionaries, and tuples.
There are a couple of things you can do to get the most out of this book. First, and most importantly, you should download all the code, which will be stored in Jupyter Notebooks. While reading through each recipe, run each step of code in the notebook. Make sure you explore on your own as you run through the code. Second, have the pandas official documentation open (http://pandas.pydata.org/pandas-docs/stable/) in one of your browser tabs. The pandas documentation is an excellent resource containing over 1,000 pages of material. There are examples for most of the pandas operations in the documentation, and they will often be directly linked from the See also section. While it covers the basics of most operations, it does so with trivial examples and fake data that don't reflect situations that you are likely to encounter when analyzing datasets from the real world.
There are about two dozen datasets that are used throughout this book. It can be very helpful to have background information on each dataset as you complete the steps in the recipes. A detailed description of each dataset may be found in the dataset_descriptions Jupyter Notebook found at https://github.com/PacktPublishing/Pandas-Cookbook. For each dataset, there will be a list of the columns, information about each column, and notes on how the data was procured.
In this book, you will find several headings that appear frequently (Getting ready, How to do it…, How it works…, There's more…, and See also).
To give clear instructions on how to complete a recipe, we use these sections as follows:
This section tells you what to expect in the recipe, and describes how to set up any software or any preliminary settings required for the recipe.
This section contains the steps required to follow the recipe.
This section usually consists of a detailed explanation of what happened in the previous section.
This section consists of additional information about the recipe in order to make the reader more knowledgeable about the recipe.
This section provides helpful links to other useful information for the recipe.
Feedback from our readers is always welcome. Let us know what you think about this book, what you liked or disliked. Reader feedback is important to us as it helps us develop titles that you will really get the most out of.
To send us general feedback, simply email [email protected], and mention the book's title in the subject of your message.
If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide at www.packtpub.com/authors.
Now that you are the proud owner of a Packt book, we have a number of things to help you to get the most from your purchase.
You can download the example code files for this book from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.
You can download the code files by following these steps:
Log in or register to our website using your e-mail address and password.
Hover the mouse pointer on the SUPPORT tab at the top.
Click on Code Downloads & Errata.
Enter the name of the book in the Search box.
Select the book for which you're looking to download the code files.
Choose from the drop-down menu where you purchased this book from.
Click on Code Download.
You can also download the code files by clicking on the Code Files button on the book's webpage at the Packt Publishing website. This page can be accessed by entering the book's name in the search box. Please note that you need to be logged in to your Packt account.
Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:
WinRAR / 7-Zip for Windows
Zipeg / iZip / UnRarX for Mac
7-Zip / PeaZip for Linux
The code bundle for the book is also hosted on GitHub at https://github.com/PacktPublishing/Pandas-Cookbook. We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!
We also provide you with a PDF file that has color images of the screenshots/diagrams used in this book. The color images will help you better understand the changes in the output. You can download this file from https://www.packtpub.com/sites/default/files/downloads/PandasCookbook_ColorImages.pdf.
Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books (maybe a mistake in the text or the code), we would be grateful if you could report this to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this book. If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting your book, clicking on the Errata Submission Form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded to our website or added to any list of existing errata under the Errata section of that title.
To view the previously submitted errata, go to https://www.packtpub.com/books/content/support and enter the name of the book in the search field. The required information will appear under the Errata section.
Piracy of copyrighted material on the internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works in any form on the Internet, please provide us with the location address or website name immediately so that we can pursue a remedy.
Please contact us at [email protected] with a link to the suspected pirated material.
We appreciate your help in protecting our authors and our ability to bring you valuable content.
If you have a problem with any aspect of this book, you can contact us at [email protected], and we will do our best to address the problem.
In this chapter, we will cover the following:
Dissecting the anatomy of a DataFrame
Accessing the main DataFrame components
Understanding data types
Selecting a single column of data as a Series
Calling Series methods
Working with operators on a Series
Chaining Series methods together
Making the index meaningful
Renaming row and column names
Creating and deleting columns
The goal of this chapter is to introduce a foundation of pandas by thoroughly inspecting the Series and DataFrame data structures. It is vital for pandas users to know each component of the Series and the DataFrame, and to understand that each column of data in pandas holds precisely one data type.
In this chapter, you will learn how to select a single column of data from a DataFrame, which is returned as a Series. Working with this one-dimensional object makes it easy to show how different methods and operators work. Many Series methods return another Series as output. This leads to the possibility of calling further methods in succession, which is known as method chaining.
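As a small illustration of method chaining (using made-up data rather than the book's movie dataset), each call below returns a new Series that the next method operates on:

```python
import pandas as pd

# A small made-up Series with a missing value.
director = pd.Series(['James Cameron', None, 'Christopher Nolan',
                      'James Cameron'], name='director_name')

# Each method returns a new Series, so calls can be chained together:
top = (director
       .dropna()         # drop the missing value
       .value_counts()   # count occurrences of each director
       .head(1))         # keep only the most frequent

print(top)   # James Cameron appears twice
```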
The Index component of the Series and DataFrame is what separates pandas from most other data analysis libraries and is the key to understanding how many operations work. We will get a glimpse of this powerful object when we use it as a meaningful label for Series values. The final two recipes contain simple tasks that frequently occur during a data analysis.
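As a preview of using the index as a meaningful label (with a tiny hypothetical DataFrame, not the book's data), set_index promotes a column to the row labels so that rows can be selected by name:

```python
import pandas as pd

# Hypothetical movie data; movie_title will become the row labels.
df = pd.DataFrame({'movie_title': ['Avatar', 'Titanic'],
                   'imdb_score': [7.9, 7.7]})

# Promote a column to the index so each row gets a meaningful label.
df2 = df.set_index('movie_title')
print(df2.loc['Avatar', 'imdb_score'])   # 7.9
```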
Before diving deep into pandas, it is worth knowing the components of the DataFrame. Visually, the outputted display of a pandas DataFrame (in a Jupyter Notebook) appears to be nothing more than an ordinary table of data consisting of rows and columns. Hiding beneath the surface are the three components (the index, the columns, and the data, also known as the values) that you must be aware of in order to maximize the DataFrame's potential.
This recipe reads in the movie dataset into a pandas DataFrame and provides a labeled diagram of all its major components.
Pandas first reads the data from disk into memory and into a DataFrame using the excellent and versatile read_csv function. The output for both the columns and the index is in bold font, which makes them easy to identify. By convention, the terms index label and column name refer to the individual members of the index and columns, respectively. The term index refers to all the index labels as a whole just as the term columns refers to all the column names as a whole.
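A minimal sketch of this step follows, reading a tiny stand-in for the movie dataset from an in-memory CSV (the real movie.csv file ships with the book's code bundle):

```python
import io

import pandas as pd

# A small stand-in for the movie dataset, kept in memory so the
# example is self-contained.
csv_data = io.StringIO(
    "color,director_name,imdb_score\n"
    "Color,James Cameron,7.9\n"
    ",Doug Walker,7.1\n"
)

movie = pd.read_csv(csv_data)

# The index labels the rows and the columns label the columns.
print(movie.index)     # RangeIndex(start=0, stop=2, step=1)
print(movie.columns)   # Index(['color', 'director_name', 'imdb_score'], dtype='object')
```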
The columns and the index serve a particular purpose, and that is to provide labels for the columns and rows of the DataFrame. These labels allow for direct and easy access to different subsets of data. When multiple Series or DataFrames are combined, the indexes align first before any calculation occurs. Collectively, the columns and the index are known as the axes.
DataFrame data (values) is always in regular font and is an entirely separate component from the columns or index. Pandas uses NaN (not a number) to represent missing values. Notice that even though the color column has only string values, it uses NaN to represent a missing value.
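The snippet below illustrates this with a small hypothetical string Series; the missing entry shows up as NaN and is detected by isnull:

```python
import numpy as np
import pandas as pd

# Even in a string (object) column, pandas marks missing values with NaN.
color = pd.Series(['Color', np.nan, 'Black and White'])
print(color.isnull())   # False, True, False
```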
The three consecutive dots in the middle of the columns indicate that there is at least one column that exists but is not displayed due to the number of columns exceeding the predefined display limits.
The head method accepts a single parameter, n, which controls the number of rows displayed. Similarly, the tail method returns the last n rows.
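For example, with a small hypothetical DataFrame:

```python
import pandas as pd

df = pd.DataFrame({'n': range(10)})

print(df.head(3))   # rows 0, 1, and 2
print(df.tail(2))   # rows 8 and 9
```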
Pandas official documentation of the read_csv function (http://bit.ly/2vtJQ9A)
Each of the three DataFrame components (the index, the columns, and the data) may be accessed directly from a DataFrame. Each of these components is itself a Python object with its own unique attributes and methods. It will often be the case that you would like to perform operations on the individual components and not on the DataFrame as a whole.
This recipe extracts the index, columns, and the data of the DataFrame into separate variables, and then shows how the columns and index are inherited from the same object.
You may access the three main components of a DataFrame with the index, columns, and values attributes. The output of the columns attribute appears to be just a sequence of the column names. This sequence of column names is technically an Index object. The output of the function type is the fully qualified class name of the object.
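A short sketch with a hypothetical two-column DataFrame shows the three attributes and the types they return:

```python
import pandas as pd

movie = pd.DataFrame({'color': ['Color', 'Color'],
                      'imdb_score': [7.9, 7.1]})

index = movie.index        # a RangeIndex by default
columns = movie.columns    # an Index holding the column names
data = movie.values        # the underlying NumPy ndarray

print(type(index).__name__)    # RangeIndex
print(type(columns).__name__)  # Index
print(type(data).__name__)     # ndarray
```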
The built-in issubclass function checks whether the first argument inherits from the second. The Index and RangeIndex objects are very similar, and in fact, pandas has a number of similar objects reserved specifically for either the index or the columns. The index and the columns must both be some kind of Index object. Essentially, the index and the columns represent the same thing, but along different axes. They are occasionally referred to as the row index and column index.
A RangeIndex is a special type of Index object that is analogous to Python's range object. Its entire sequence of values is not loaded into memory until it is necessary to do so, thereby saving memory. It is completely defined by its start, stop, and step values.
When possible, Index objects are implemented using hash tables that allow for very fast selection and data alignment. They are similar to Python sets in that they support operations such as intersection and union, but are dissimilar because they are ordered with duplicates allowed.
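These properties can be checked directly; a brief sketch with made-up values:

```python
import pandas as pd

# RangeIndex inherits from Index, as issubclass confirms.
print(issubclass(pd.RangeIndex, pd.Index))   # True

# A RangeIndex is fully defined by its start, stop, and step values.
ri = pd.RangeIndex(start=0, stop=6, step=2)
print(list(ri))   # [0, 2, 4]

# Index objects support set-like operations such as intersection and union...
i1 = pd.Index([1, 2, 3])
i2 = pd.Index([2, 3, 4])
print(i1.intersection(i2))   # an Index containing 2 and 3
print(i1.union(i2))          # an Index containing 1, 2, 3, and 4

# ...but unlike Python sets, they are ordered and may contain duplicates.
print(pd.Index(['a', 'a', 'b']))
```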
Notice how the values DataFrame attribute returned a NumPy n-dimensional array, or ndarray. Most of pandas relies heavily on the ndarray. Beneath the index, columns, and data are NumPy ndarrays. They could be considered the base object for pandas that many other objects are built upon. To see this, we can look at the values of the index and columns:
>>> index.values
array([   0,    1,    2, ..., 4913, 4914, 4915])
>>> columns.values
array(['color', 'director_name', 'num_critic_for_reviews', ...
       'imdb_score', 'aspect_ratio', 'movie_facebook_likes'], dtype=object)
Pandas official documentation of Indexing and Selecting data (http://bit.ly/2vm8f12)
A look inside pandas design and development, a slide deck from pandas author Wes McKinney (http://bit.ly/2u4YVLi)
In very broad terms, data may be classified as either continuous or categorical. Continuous data is always numeric and represents some kind of measurement, such as height, wage, or salary. Continuous data can take on an infinite number of possibilities. Categorical data, on the other hand, represents discrete, finite amounts of values such as car color, type of poker hand, or brand of cereal.
Pandas does not broadly classify data as either continuous or categorical. Instead, it has precise technical definitions for many distinct data types. The following table contains all pandas data types, with their string equivalents, and some notes on each type:
| Common data type name | NumPy/pandas object | Pandas string name | Notes |
|---|---|---|---|
| Boolean | np.bool | bool | Stored as a single byte. |
| Integer | np.int | int | Defaults to 64 bits. Unsigned ints are also available (np.uint). |
| Float | np.float | float | Defaults to 64 bits. |
| Complex | np.complex | complex | Rarely seen in data analysis. |
| Object | np.object | O, object | Typically strings, but a catch-all for columns with multiple different types or other Python objects (tuples, lists, dicts, and so on). |
| Datetime | np.datetime64, pd.Timestamp | datetime64 | A specific moment in time with nanosecond precision. |
| Timedelta | np.timedelta64, pd.Timedelta | timedelta64 | An amount of time, from days to nanoseconds. |
| Categorical | pd.Categorical | category | Specific to pandas. Useful for object columns with relatively few unique values. |
In this recipe, we display the data type of each column in a DataFrame. It is crucial to know the type of data held in each column as it fundamentally changes the kind of operations that are possible with it.
Each DataFrame column must be exactly one type. For instance, every value in the column aspect_ratio is a 64-bit float, and every value in movie_facebook_likes is a 64-bit integer. Pandas defaults its core numeric types, integers and floats, to 64 bits even when a smaller size would hold the data. Even if a column consists entirely of the integer value 0, the data type will still be int64. get_dtype_counts is a convenience method for directly returning the count of all the data types in the DataFrame.
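A sketch with a small made-up DataFrame shows one dtype per column. Note that get_dtype_counts dates from the pandas 0.20 era covered by this book; the expression dtypes.value_counts() below produces the same counts:

```python
import pandas as pd

# A hypothetical DataFrame with one column per common type.
df = pd.DataFrame({'color': ['Color', None],
                   'imdb_score': [7.9, 7.1],
                   'num_votes': [886204, 471220]})

# One dtype per column: color -> object, imdb_score -> float64,
# num_votes -> int64 (integers default to 64 bits).
print(df.dtypes)

# Count the occurrence of each data type in the DataFrame
# (the same counts that get_dtype_counts returned in pandas 0.20).
print(df.dtypes.value_counts())
```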