Learning pandas - Second Edition - Michael Heydt - E-Book

Learning pandas - Second Edition E-Book

Michael Heydt

Description

Get to grips with pandas—a versatile and high-performance Python library for data manipulation, analysis, and discovery

About This Book

  • Get comfortable using pandas and Python as an effective data exploration and analysis tool
  • Explore pandas through a framework of data analysis, with an explanation of how pandas is well suited for the various stages in a data analysis process
  • A comprehensive guide to pandas with many clear and practical examples to help you get up and running with pandas

Who This Book Is For

This book is ideal for data scientists, data analysts, Python programmers who want to plunge into data analysis using pandas, and anyone with a curiosity about analyzing data. Some knowledge of statistics and programming will be helpful to get the most out of this book but not strictly required. Prior exposure to pandas is also not required.

What You Will Learn

  • Understand how data analysts and scientists think about the processes of gathering and understanding data
  • Learn how pandas can be used to support the end-to-end process of data analysis
  • Use pandas Series and DataFrame objects to represent single and multivariate data
  • Slicing and dicing data with pandas, as well as combining, grouping, and aggregating data from multiple sources
  • How to access data from external sources such as files, databases, and web services
  • Represent and manipulate time-series data and the many intricacies involved with this type of data
  • How to visualize statistical information
  • How to use pandas to solve several common data representation and analysis problems within finance

In Detail

You will learn how to use pandas to perform data analysis in Python. You will start with an overview of data analysis and iteratively progress from modeling data, to accessing data from remote sources, performing numeric and statistical analysis, through indexing and performing aggregate analysis, and finally to visualizing statistical data and applying pandas to finance.

With the knowledge you gain from this book, you will quickly learn pandas and how it can empower you in the exciting world of data manipulation, analysis and science.

Style and approach

  • Step-by-step instruction on using pandas within an end-to-end framework of performing data analysis
  • Practical demonstration of using Python and pandas using interactive and incremental examples


Number of pages: 287

Year of publication: 2017




 

Learning pandas

Second Edition

High-performance data manipulation and analysis in Python

Michael Heydt

BIRMINGHAM - MUMBAI

 

Learning pandas

Second Edition

 

Copyright © 2017 Packt Publishing

 

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

 

First published: April 2015

Second edition: June 2017

 

Production reference: 1300617

 

Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham
B3 2PB, UK.

 

ISBN 978-1-78712-313-7

www.packtpub.com

Credits

Authors

 

Michael Heydt

Copy Editors

 

Safis Editing

Reviewers

 

Sonali Dayal

Nicola Rainiero

Project Coordinator

 

Nidhi Joshi

Commissioning Editor

 

Amey Varangaonkar

Proofreader

 

Safis Editing

Acquisition Editor

 

Tushar Gupta

 

Indexer

 

Aishwarya Gangawane

 

Content Development Editor

 

Aishwarya Pandere

Graphics

 

Tania Dutta

Technical Editor

 

Prasad Ramesh

Production Coordinator

 

Melwyn Dsa

 

About the Author

Michael Heydt is a technologist, entrepreneur, and educator with decades of professional software development and financial and commodities trading experience. He has worked extensively on Wall Street specializing in the development of distributed, actor-based, high-performance, and high-availability trading systems. He is currently founder of Micro Trading Services, a company that focuses on creating cloud and micro service-based software solutions for finance and commodities trading. He holds a master's in science in mathematics and computer science from Drexel University, and an executive master's of technology management from the University of Pennsylvania School of Applied Science and the Wharton School of Business.

 

I would really like to thank the team at Packt for continuously pushing me to create and revise this and my other books. I would also like to greatly thank my family for putting up with me disappearing for months on end during my sparse free time to indulge in creating this content. They are my true inspiration.

About the Reviewers

Sonali Dayal is a freelance data scientist in the San Francisco Bay Area. Her work on building analytical models and data pipelines influences major product and financial decisions for clients. Previously, she has worked as a freelance software and data science engineer for early stage startups, where she built supervised and unsupervised machine learning models, as well as interactive data analytics dashboards. She received her BS in biochemistry from Virginia Tech in 2011.

I'd like to thank the team at Packt for the opportunity to review this book and their support throughout the process.

 

Nicola Rainiero is a civil geotechnical engineer with a background in the construction industry as a self-employed design engineer. He also specializes in renewable energy and has collaborated with the Sant Anna University of Pisa on two European projects, REGEOCITIES and PRISCA, using qualitative and quantitative data analysis techniques.

He aims to simplify his work with open software, both using existing tools and developing new ones, sometimes with good results and sometimes with less good ones.

A special thanks to Packt Publishing for this opportunity to participate in the review of this book. I thank my family, especially my parents, for their physical and moral support.

www.PacktPub.com

For support files and downloads related to your book, please visit www.PacktPub.com. Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com, and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at [email protected] for more details. At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters, and receive exclusive discounts and offers on Packt books and eBooks.

https://www.packtpub.com/mapt

Get the most in-demand software skills with Mapt. Mapt gives you full access to all Packt books and video courses, as well as industry-leading tools to help you plan your personal development and advance your career.

Why subscribe?

Fully searchable across every book published by Packt

Copy and paste, print, and bookmark content

On demand and accessible via a web browser

Customer Feedback

Thanks for purchasing this Packt book. At Packt, quality is at the heart of our editorial process. To help us improve, please leave us an honest review on this book's Amazon page at https://www.amazon.com/dp/1787123138.

If you'd like to join our team of regular reviewers, you can e-mail us at [email protected]. We award our regular reviewers with free eBooks and videos in exchange for their valuable feedback. Help us be relentless in improving our products!

Table of Contents

Preface

What this book covers

What you need for this book

Who this book is for

Conventions

Reader feedback

Customer support

Downloading the example code

Errata

Piracy

Questions

pandas and Data Analysis

Introducing pandas

Data manipulation, analysis, science, and pandas

Data manipulation

Data analysis

Data science

Where does pandas fit?

The process of data analysis

The process

Ideation

Retrieval

Preparation

Exploration

Modeling

Presentation

Reproduction

A note on being iterative and agile

Relating the book to the process

Concepts of data and analysis in our tour of pandas

Types of data

Structured

Unstructured

Semi-structured

Variables

Categorical

Continuous

Discrete

Time series data

General concepts of analysis and statistics

Quantitative versus qualitative data/analysis

Single and multivariate analysis

Descriptive statistics

Inferential statistics

Stochastic models

Probability and Bayesian statistics

Correlation

Regression

Other Python libraries of value with pandas

Numeric and scientific computing - NumPy and SciPy

Statistical analysis – StatsModels

Machine learning – scikit-learn

PyMC - stochastic Bayesian modeling

Data visualization - matplotlib and seaborn

Matplotlib

Seaborn

Summary

Up and Running with pandas

Installation of Anaconda

IPython and Jupyter Notebook

IPython

Jupyter Notebook

Introducing the pandas Series and DataFrame

Importing pandas

The pandas Series

The pandas DataFrame

Loading data from files into a DataFrame

Visualization

Summary

Representing Univariate Data with the Series

Configuring pandas

Creating a Series

Creating a Series using Python lists and dictionaries

Creation using NumPy functions

Creation using a scalar value

The .index and .values properties

The size and shape of a Series

Specifying an index at creation

Heads, tails, and takes

Retrieving values in a Series by label or position

Lookup by label using the [] operator and the .ix[] property

Explicit lookup by position with .iloc[]

Explicit lookup by labels with .loc[]

Slicing a Series into subsets

Alignment via index labels

Performing Boolean selection

Re-indexing a Series

Modifying a Series in-place

Summary

Representing Tabular and Multivariate Data with the DataFrame

Configuring pandas

Creating DataFrame objects

Creating a DataFrame using NumPy function results

Creating a DataFrame using a Python dictionary and pandas Series objects

Creating a DataFrame from a CSV file

Accessing data within a DataFrame

Selecting the columns of a DataFrame

Selecting rows of a DataFrame

Scalar lookup by label or location using .at[] and .iat[]

Slicing using the [ ] operator

Selecting rows using Boolean selection

Selecting across both rows and columns

Summary

Manipulating DataFrame Structure

Configuring pandas

Renaming columns

Adding new columns with [] and .insert()

Adding columns through enlargement

Adding columns using concatenation

Reordering columns

Replacing the contents of a column

Deleting columns

Appending new rows

Concatenating rows

Adding and replacing rows via enlargement

Removing rows using .drop()

Removing rows using Boolean selection

Removing rows using a slice

Summary

Indexing Data

Configuring pandas

The importance of indexes

The pandas index types

The fundamental type - Index

Integer index labels using Int64Index and RangeIndex

Floating-point labels using Float64Index

Representing discrete intervals using IntervalIndex

Categorical values as an index - CategoricalIndex

Indexing by date and time using DatetimeIndex

Indexing periods of time using PeriodIndex

Working with Indexes

Creating and using an index with a Series or DataFrame

Selecting values using an index

Moving data to and from the index

Reindexing a pandas object

Hierarchical indexing

Summary

Categorical Data

Configuring pandas

Creating Categoricals

Renaming categories

Appending new categories

Removing categories

Removing unused categories

Setting categories

Descriptive information of a Categorical

Munging school grades

Summary

Numerical and Statistical Methods

Configuring pandas

Performing numerical methods on pandas objects

Performing arithmetic on a DataFrame or Series

Getting the counts of values

Determining unique values (and their counts)

Finding minimum and maximum values

Locating the n-smallest and n-largest values

Calculating accumulated values

Performing statistical processes on pandas objects

Retrieving summary descriptive statistics

Measuring central tendency: mean, median, and mode

Calculating the mean

Finding the median

Determining the mode

Calculating variance and standard deviation

Measuring variance

Finding the standard deviation

Determining covariance and correlation

Calculating covariance

Determining correlation

Performing discretization and quantiling of data

Calculating the rank of values

Calculating the percent change at each sample of a series

Performing moving-window operations

Executing random sampling of data

Summary

Accessing Data

Configuring pandas

Working with CSV and text/tabular format data

Examining the sample CSV data set

Reading a CSV file into a DataFrame

Specifying the index column when reading a CSV file

Data type inference and specification

Specifying column names

Specifying specific columns to load

Saving DataFrame to a CSV file

Working with general field-delimited data

Handling variants of formats in field-delimited data

Reading and writing data in Excel format

Reading and writing JSON files

Reading HTML data from the web

Reading and writing HDF5 format files

Accessing CSV data on the web

Reading and writing from/to SQL databases

Reading data from remote data services

Reading stock data from Yahoo! and Google Finance

Retrieving options data from Google Finance

Reading economic data from the Federal Reserve Bank of St. Louis

Accessing Kenneth French's data

Reading from the World Bank

Summary

Tidying Up Your Data

Configuring pandas

What is tidying your data?

How to work with missing data

Determining NaN values in pandas objects

Selecting out or dropping missing data

Handling of NaN values in mathematical operations

Filling in missing data

Forward and backward filling of missing values

Filling using index labels

Performing interpolation of missing values

Handling duplicate data

Transforming data

Mapping data into different values

Replacing values

Applying functions to transform data

Summary

Combining, Relating, and Reshaping Data

Configuring pandas

Concatenating data in multiple objects

Understanding the default semantics of concatenation

Switching axes of alignment

Specifying join type

Appending versus concatenation

Ignoring the index labels

Merging and joining data

Merging data from multiple pandas objects

Specifying the join semantics of a merge operation

Pivoting data to and from value and indexes

Stacking and unstacking

Stacking using non-hierarchical indexes

Unstacking using hierarchical indexes

Melting data to and from long and wide format

Performance benefits of stacked data

Summary

Data Aggregation

Configuring pandas

The split, apply, and combine (SAC) pattern

Data for the examples

Splitting data

Grouping by a single column's values

Accessing the results of a grouping

Grouping using multiple columns

Grouping using index levels

Applying aggregate functions, transforms, and filters

Applying aggregation functions to groups

Transforming groups of data

The general process of transformation

Filling missing values with the mean of the group

Calculating normalized z-scores with a transformation

Filtering groups from aggregation

Summary

Time-Series Modelling

Setting up the IPython notebook

Representation of dates, time, and intervals

The datetime, day, and time objects

Representing a point in time with a Timestamp

Using a Timedelta to represent a time interval

Introducing time-series data

Indexing using DatetimeIndex

Creating time-series with specific frequencies

Calculating new dates using offsets

Representing data intervals with date offsets

Anchored offsets

Representing durations of time using Period

Modelling an interval of time with a Period

Indexing using the PeriodIndex

Handling holidays using calendars

Normalizing timestamps using time zones

Manipulating time-series data

Shifting and lagging

Performing frequency conversion on a time-series

Up and down resampling of a time-series

Time-series moving-window operations

Summary

Visualization

Configuring pandas

Plotting basics with pandas

Creating time-series charts

Adorning and styling your time-series plot

Adding a title and changing axes labels

Specifying the legend content and position

Specifying line colors, styles, thickness, and markers

Specifying tick mark locations and tick labels

Formatting axes' tick date labels using formatters

Common plots used in statistical analyses

Showing relative differences with bar plots

Picturing distributions of data with histograms

Depicting distributions of categorical data with box and whisker charts

Demonstrating cumulative totals with area plots

Relationships between two variables with scatter plots

Estimates of distribution with the kernel density plot

Correlations between multiple variables with the scatter plot matrix

Strengths of relationships in multiple variables with heatmaps

Manually rendering multiple plots in a single chart

Summary

Historical Stock Price Analysis

Setting up the IPython notebook

Obtaining and organizing stock data from Google

Plotting time-series prices

Plotting volume-series data

Calculating the simple daily percentage change in closing price

Calculating simple daily cumulative returns of a stock

Resampling data from daily to monthly returns

Analyzing distribution of returns

Performing a moving-average calculation

Comparison of average daily returns across stocks

Correlation of stocks based on the daily percentage change of the closing price

Calculating the volatility of stocks

Determining risk relative to expected returns

Summary

Preface

pandas is a popular Python package used for practical, real-world data analysis. It provides efficient, high-performance data structures that make data exploration and analysis very easy. This learner's guide will take you through a comprehensive set of features provided by the pandas library to perform efficient data manipulation and analysis.

What this book covers

Chapter 1, pandas and Data Analysis, is a hands-on introduction to the key features of pandas. The idea of this chapter is to provide some context for using pandas within statistics and data science. The chapter will get into several concepts in data science and show how they are supported by pandas. This will set a context for each of the subsequent chapters, mentioning how each chapter relates to both data science and data science processes.

Chapter 2, Up and Running with pandas, instructs the reader on how to obtain and install pandas, and introduces a few of the basic concepts in pandas. We will also look at how the examples are presented using IPython and Jupyter Notebook.

Chapter 3, Representing Univariate Data with the Series, walks the reader through the use of the pandas Series, which provides 1-dimensional, indexed data representations. The reader will learn about how to create Series objects and how to manipulate data held within. They will also learn about indexes and alignment of data, and about how the Series can be used to slice data.

Chapter 4, Representing Tabular and Multivariate Data with the DataFrame, walks the reader through the basic use of the pandas DataFrame, which provides indexed, multivariate data representations. This chapter will instruct the reader on how to create DataFrame objects using various sets of static data, and how to perform selection of specific columns and rows within. Complex queries, manipulation, and indexing are handled in the following chapters.

Chapter 5, Manipulation and Indexing of DataFrame objects, expands on the previous chapter and instructs you on how to perform more complex manipulations of a DataFrame. We start by learning how to add and delete columns and rows; modify data within a DataFrame (or create a modified copy); perform calculations on data within; create hierarchical indexes; and also calculate common statistical results upon DataFrame contents.

Chapter 6, Indexing Data, shows how data can be loaded and saved from external sources into both Series and DataFrame objects. The chapter also covers data access from multiple sources such as files, http servers, database systems, and web services. Also covered is the processing of data in CSV, HTML, and JSON formats.

Chapter 7, Categorical Data, instructs the reader on how to use the various tools provided by pandas for managing dirty and missing data.

Chapter 8, Numerical and Statistical Methods, covers various techniques for combining, splitting, joining, and merging of data located in multiple pandas objects, and then demonstrates how to reshape data using concepts such as pivots, stacking, and melting.

Chapter 9, Accessing Data, talks about grouping and performing aggregate data analysis. In pandas, this is often referred to as the split-apply-combine pattern. The reader will learn about using this pattern to group data in various configurations and also apply aggregate functions to calculate results upon each group of data.

Chapter 10, Tidying Up Your Data, explains how to organize data in a tidy form that is usable for data analysis.

Chapter 11, Combining, Relating and Reshaping Data, tells the readers how they can take data in multiple pandas objects and combine them, through concepts such as joins, merges and concatenation.

Chapter 12, Data Aggregation, dives into the integration of pandas with matplotlib to visualize pandas data. The chapter will demonstrate how to present many common statistical and financial data visualizations including bar charts, histograms, scatter plots, area plots, density plots, and heat maps.

Chapter 13, Time-Series Modeling, covers representing time series data in pandas. This chapter will cover the extensive capabilities provided by pandas for facilitating analysis of time series data.

Chapter 14, Visualization, teaches you how to create data visualizations based upon data stored in pandas data structures. We start with the basics, learning how to create a simple chart from data and control several of the attributes of the chart (such as legends, labels, and colors). We examine the creation of several common types of plot used to represent different types of data, and use those plot types to convey meaning in the underlying data. We also learn how to integrate pandas with D3.js so that we can create rich web-based visualizations.

Chapter 15, Historical Stock Price Analysis, shows you how to apply pandas to basic financial problems. It will focus on data obtained from Yahoo! Finance, and will demonstrate a number of concepts in financial data such as calculating returns, moving averages, volatility, and several others. The reader will also learn how to apply data visualization to these financial concepts.

What you need for this book

This book assumes some familiarity with programming concepts, but those without programming experience, or specifically Python programming experience, will be comfortable with the examples as they focus on pandas constructs more than Python or programming. The examples are based on Anaconda Python 2.7 and pandas 0.15.1. If you do not have either installed, guidance is given in Chapter 2, Up and Running with pandas, on installing both on Windows, OS X, and Ubuntu systems. For those not interested in installing any software, instruction is also given on using the Wakari.io online Python data analysis service.

Who this book is for

This book is ideal for data scientists, data analysts, and Python programmers who want to plunge into data analysis using pandas, and anyone curious about analyzing data. Some knowledge of statistics and programming will help you to get the most out of this book but that's not strictly required. Prior exposure to pandas is also not required.

Reader feedback

Feedback from our readers is always welcome. Let us know what you think about this book: what you liked or disliked. Reader feedback is important for us as it helps us develop titles that you will really get the most out of. To send us general feedback, simply e-mail [email protected], and mention the book's title in the subject of your message. If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide at www.packtpub.com/authors.

Customer support

Now that you are the proud owner of a Packt book, we have a number of things to help you to get the most from your purchase.

Downloading the example code

You can download the example code files for this book from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you. You can download the code files by following these steps:

Log in or register to our website using your e-mail address and password.

Hover the mouse pointer on the SUPPORT tab at the top.

Click on Code Downloads & Errata.

Enter the name of the book in the Search box.

Select the book for which you're looking to download the code files.

Choose from the drop-down menu where you purchased this book from.

Click on Code Download.

Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:

WinRAR / 7-Zip for Windows

Zipeg / iZip / UnRarX for Mac

7-Zip / PeaZip for Linux

The code bundle for the book is also hosted on GitHub at https://github.com/PacktPublishing/Learning-Pandas-Second-Edition. We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!

Errata

Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books (maybe a mistake in the text or the code), we would be grateful if you could report this to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this book. If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting your book, clicking on the Errata Submission Form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded to our website or added to any list of existing errata under the Errata section of that title. To view the previously submitted errata, go to https://www.packtpub.com/books/content/support and enter the name of the book in the search field. The required information will appear under the Errata section.

Piracy

Piracy of copyrighted material on the Internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works in any form on the Internet, please provide us with the location address or website name immediately so that we can pursue a remedy. Please contact us at [email protected] with a link to the suspected pirated material. We appreciate your help in protecting our authors and our ability to bring you valuable content.

Questions

If you have a problem with any aspect of this book, you can contact us at [email protected], and we will do our best to address the problem.

pandas and Data Analysis

Welcome to Learning pandas! In this book, we will go on a journey that will see us learning pandas, an open source data analysis library for the Python programming language. The pandas library provides high-performance and easy-to-use data structures and analysis tools built with Python. pandas brings to Python many good things from the statistical programming language R, specifically data frame objects and R packages such as plyr and reshape2, and places them in a single library that you can use from within Python.

In this first chapter, we will take the time to understand pandas and how it fits into the bigger picture of data analysis. This will give the reader a feeling for the place of pandas in that picture, instead of focusing entirely on the details of using it. The goal is that while learning pandas you also learn why its features exist in support of performing data analysis tasks.

So, let's jump in. In this chapter, we will cover:

What pandas is, why it was created, and what it gives you

How pandas relates to data analysis and data science

The processes involved in data analysis and how they are supported by pandas

General concepts of data and analytics

Basic concepts of data analysis and statistical analysis

Types of data and their applicability to pandas

Other libraries in the Python ecosystem that you will likely use with pandas

Introducing pandas

pandas is a Python library containing high-level data structures and tools that have been created to help Python programmers to perform powerful data analysis. The ultimate purpose of pandas is to help you quickly discover information in data, with information being defined as an underlying meaning.

The development of pandas was begun in 2008 by Wes McKinney; it was open sourced in 2009. pandas is currently supported and actively developed by various organizations and contributors.

pandas was initially designed with finance in mind, specifically for manipulating time series data and processing historical stock information. The processing of financial information poses many challenges, including the following:

Representing security data, such as a stock's price, as it changes over time

Matching the measurement of multiple streams of data at identical times

Determining the relationship (correlation) of two or more streams of data

Representing times and dates as first-class entities

Converting the period of samples of data, either up or down
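The challenges listed above can be made concrete with a small sketch. The two securities, their names (AAA and BBB), and all prices below are invented for illustration, and the 2-day resampling period is an arbitrary choice. This is only one plausible way to approach each item, not the book's own example.

```python
import pandas as pd

# Hypothetical daily closing prices for two made-up securities.
# Dates are first-class values held in a DatetimeIndex.
dates = pd.date_range("2017-01-02", periods=6, freq="D")
aaa = pd.Series([10.0, 11.0, 12.0, 13.0, 14.0, 15.0], index=dates, name="AAA")

# BBB deliberately lacks an observation for the third date.
bbb = pd.Series([20.0, 22.0, 26.0, 28.0, 30.0], index=dates.delete(2), name="BBB")

# Matching streams at identical times: concat aligns both series on the
# date index and marks the missing BBB observation as NaN.
prices = pd.concat([aaa, bbb], axis=1)

# Relationship between the streams: correlation over the dates both share.
corr = prices["AAA"].corr(prices["BBB"])

# Converting the sampling period: downsample daily data to 2-day means.
two_day = prices.resample("2D").mean()

print(prices)
print(corr)
print(two_day)
```

The alignment step is the one worth noticing: nothing in the code matches rows by position, so the gap in BBB does not shift its later values onto the wrong dates.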

To do this processing, a tool was needed that allows us to retrieve, index, clean and tidy, reshape, combine, slice, and perform various analyses on both single- and multidimensional data, including heterogeneous-typed data that is automatically aligned along a set of common index labels. This is where pandas comes in, having been created with many useful and powerful features such as the following:

Fast and efficient Series and DataFrame objects for data manipulation with integrated indexing

Intelligent data alignment using indexes and labels

Integrated handling of missing data

Facilities for converting messy data into orderly data (tidying)

Built-in tools for reading and writing data between in-memory data structures and files, databases, and web services

The ability to process data stored in many common formats such as CSV, Excel, HDF5, and JSON

Flexible reshaping and pivoting of sets of data

Smart label-based slicing, fancy indexing, and subsetting of large datasets

Columns can be inserted and deleted from data structures for size mutability

Aggregating or transforming data with a powerful data grouping facility to perform split-apply-combine on datasets

High-performance merging and joining of datasets

Hierarchical indexing facilitating working with high-dimensional data in a lower-dimensional data structure

Extensive features for time series data, including date range generation and frequency conversion, moving window statistics, moving window linear regressions, date shifting, and lagging

Highly optimized for performance, with critical code paths written in Cython or C
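A few of the features listed above (label-based alignment, missing-data handling, and frequency conversion) can be illustrated with a minimal sketch. The series names, dates, and values below are invented for illustration and are not taken from the book; the example assumes pandas is installed.

```python
import pandas as pd

# Two hypothetical daily price series whose date labels only partly overlap
dates_a = pd.date_range("2017-01-02", periods=4, freq="D")
dates_b = pd.date_range("2017-01-03", periods=4, freq="D")
a = pd.Series([10.0, 11.0, 12.0, 13.0], index=dates_a)
b = pd.Series([20.0, 21.0, 22.0, 23.0], index=dates_b)

# Arithmetic aligns on index labels; a label present in only one
# series yields NaN (missing data) instead of raising an error
total = a + b

# Integrated missing-data handling: drop the gaps
total_clean = total.dropna()

# Heterogeneous columns share one aligned index in a DataFrame
prices = pd.DataFrame({"a": a, "b": b})

# Relationship (correlation) between the two streams, computed
# over the dates where both have values
corr = prices["a"].corr(prices["b"])

# Frequency conversion: downsample the daily values to 2-day means
two_day = a.resample("2D").mean()

print(total)
print(two_day)
```

Printing `total` shows NaN on the first and last dates, where only one of the two series has a value, which is the automatic alignment behavior described in the list.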

The robust feature set, combined with seamless integration with Python and other tools within the Python ecosystem, has led to the wide adoption of pandas. It is used in a wide variety of academic and commercial domains, including finance, neuroscience, economics, statistics, advertising, and web analytics. It has become one of the most preferred tools for data scientists to represent data for manipulation and analysis.

Python has long been exceptional for data munging and preparation, but less so for data analysis and modeling. pandas helps fill this gap, enabling you to carry out your entire data analysis workflow in Python without having to switch to a more domain-specific language such as R. This is very important: those familiar with Python, which is a more general-purpose programming language than R (which is more of a statistical package), gain many of the data representation and manipulation features of R while remaining entirely within an incredibly rich Python ecosystem.

Combined with IPython, Jupyter notebooks, and a wide range of other libraries, the environment for performing data analysis in Python excels in performance, productivity, and the ability to collaborate, compared to many other tools. This has led to the widespread adoption of pandas by many users in many industries.

Data manipulation, analysis, science, and pandas

We live in a world in which massive amounts of data are produced and stored every day. This data comes from a plethora of information systems, devices, and sensors. Almost everything you do, and the items you use to do it, produce data that can be, or is, captured.

This has been greatly enabled by the ubiquitous nature of services that are connected to networks, and by the great increases in data storage facilities. Combined with the ever-decreasing cost of storage, this has made it practical to capture and store even the most trivial data.

This has led to massive amounts of data being piled up and ready for access. But this data is spread out all over cyberspace and cannot actually be referred to as information. It tends to be a collection of recorded events, whether financial, from your interactions with social networks, or from your personal health monitor tracking your heartbeat throughout the day. This data is stored in all kinds of formats, is located in scattered places, and beyond its raw nature does not give much insight.

Logically, the overall process of turning this data into information can be broken into three major disciplines:

Data manipulation

Data analysis

Data science

These three disciplines can and do overlap a great deal. Where each ends and the others begin is open to interpretation. For the purposes of this book, we will define each in the following sections.

Data manipulation

Data is distributed all over the planet. It is stored in different formats. It has widely varied levels of quality. Because of this, there is a need for tools and processes that pull data together into a form that can be used for decision making. A tool that manipulates data in preparation for analysis must support many different tasks and capabilities. The features needed from such a tool include:

Programmability for reuse and sharing

Access to data from external sources

Storing data locally

Indexing data for efficient retrieval

Alignment of data in different sets based upon attributes

Combining data in different sets

Transformation of data into other representations

Cleaning cruft from data

Effective handling of bad data

Grouping data into common baskets

Aggregation of data of like characteristics

Application of functions to calculate meaning or perform transformations

Query and slicing to explore pieces of the whole

Restructuring into other forms

Modeling distinct categories of data such as categorical, continuous, discrete, and time series

Resampling data to different frequencies

There are many data manipulation tools in existence. Each differs in its support for the items on this list, in how it is deployed, and in how its users work with it. These tools include relational databases (SQL Server, Oracle), spreadsheets (Excel), event processing systems (such as Spark), and more generic tools such as R and pandas.
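As one hedged illustration of how pandas covers several items on this list (handling bad data, applying functions, combining sets, grouping, aggregating, and slicing), consider the following minimal sketch. The trade records, symbols, and sector labels are hypothetical and exist only for this example.

```python
import pandas as pd

# Hypothetical trade records; the symbols and values are made up
trades = pd.DataFrame({
    "symbol": ["AAA", "BBB", "AAA", "BBB", "AAA"],
    "shares": [100, 50, 200, None, 25],
    "price": [10.0, 20.0, 11.0, 21.0, 12.0],
})

# Hypothetical reference data held in a second data set
sectors = pd.DataFrame({
    "symbol": ["AAA", "BBB"],
    "sector": ["Tech", "Energy"],
})

# Handling bad data: drop rows with a missing share count
clean = trades.dropna(subset=["shares"])

# Applying a function to derive a new column
clean = clean.assign(value=clean["shares"] * clean["price"])

# Combining data in different sets on a shared attribute
combined = clean.merge(sectors, on="symbol", how="left")

# Grouping and aggregating rows with like characteristics
by_sector = combined.groupby("sector")["value"].sum()

# Query and slicing: keep only the larger trades
large = combined[combined["value"] > 1000]

print(by_sector)
print(large)
```

Each step returns a new object rather than modifying `trades` in place, so the intermediate results can be inspected or reused independently.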