Get to grips with pandas—a versatile and high-performance Python library for data manipulation, analysis, and discovery
This book is ideal for data scientists, data analysts, and Python programmers who want to plunge into data analysis using pandas, and anyone with a curiosity about analyzing data. Some knowledge of statistics and programming will be helpful to get the most out of this book but not strictly required. Prior exposure to pandas is also not required.
You will learn how to use pandas to perform data analysis in Python. You will start with an overview of data analysis and iteratively progress from modeling data, to accessing data from remote sources, performing numeric and statistical analysis, through indexing and performing aggregate analysis, and finally to visualizing statistical data and applying pandas to finance.
With the knowledge you gain from this book, you will quickly learn pandas and how it can empower you in the exciting world of data manipulation, analysis and science.
BIRMINGHAM - MUMBAI
Copyright © 2017 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing, nor its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
First published: April 2015
Second edition: June 2017
Production reference: 1300617
ISBN 978-1-78712-313-7
www.packtpub.com
Authors
Michael Heydt
Copy Editors
Safis Editing
Reviewers
Sonali Dayal
Nicola Rainiero
Project Coordinator
Nidhi Joshi
Commissioning Editor
Amey Varangaonkar
Proofreader
Safis Editing
Acquisition Editor
Tushar Gupta
Indexer
Aishwarya Gangawane
Content Development Editor
Aishwarya Pandere
Graphics
Tania Dutta
Technical Editor
Prasad Ramesh
Production Coordinator
Melwyn Dsa
Michael Heydt is a technologist, entrepreneur, and educator with decades of professional software development and financial and commodities trading experience. He has worked extensively on Wall Street, specializing in the development of distributed, actor-based, high-performance, and high-availability trading systems. He is currently the founder of Micro Trading Services, a company that focuses on creating cloud and microservice-based software solutions for finance and commodities trading. He holds a master's degree in mathematics and computer science from Drexel University, and an executive master's in technology management from the University of Pennsylvania School of Applied Science and the Wharton School of Business.
Sonali Dayal is a freelance data scientist in the San Francisco Bay Area. Her work on building analytical models and data pipelines influences major product and financial decisions for clients. Previously, she has worked as a freelance software and data science engineer for early stage startups, where she built supervised and unsupervised machine learning models, as well as interactive data analytics dashboards. She received her BS in biochemistry from Virginia Tech in 2011.
Nicola Rainiero is a civil geotechnical engineer with a background in the construction industry as a self-employed design engineer. He also specializes in renewable energy and has collaborated with the Sant'Anna University of Pisa on two European projects, REGEOCITIES and PRISCA, using qualitative and quantitative data analysis techniques.
He aspires to simplify his work with open source software, using existing tools and developing new ones, sometimes with good results and other times less so.
For support files and downloads related to your book, please visit www.PacktPub.com. Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and, as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at [email protected] for more details. At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters, and receive exclusive discounts and offers on Packt books and eBooks.
https://www.packtpub.com/mapt
Get the most in-demand software skills with Mapt. Mapt gives you full access to all Packt books and video courses, as well as industry-leading tools to help you plan your personal development and advance your career.
Fully searchable across every book published by Packt
Copy and paste, print, and bookmark content
On demand and accessible via a web browser
Thanks for purchasing this Packt book. At Packt, quality is at the heart of our editorial process. To help us improve, please leave us an honest review on this book's Amazon page at https://www.amazon.com/dp/1787123138.
If you'd like to join our team of regular reviewers, you can e-mail us at [email protected]. We award our regular reviewers with free eBooks and videos in exchange for their valuable feedback. Help us be relentless in improving our products!
Preface
What this book covers
What you need for this book
Who this book is for
Conventions
Reader feedback
Customer support
Downloading the example code
Errata
Piracy
Questions
pandas and Data Analysis
Introducing pandas
Data manipulation, analysis, science, and pandas
Data manipulation
Data analysis
Data science
Where does pandas fit?
The process of data analysis
The process
Ideation
Retrieval
Preparation
Exploration
Modeling
Presentation
Reproduction
A note on being iterative and agile
Relating the book to the process
Concepts of data and analysis in our tour of pandas
Types of data
Structured
Unstructured
Semi-structured
Variables
Categorical
Continuous
Discrete
Time series data
General concepts of analysis and statistics
Quantitative versus qualitative data/analysis
Single and multivariate analysis
Descriptive statistics
Inferential statistics
Stochastic models
Probability and Bayesian statistics
Correlation
Regression
Other Python libraries of value with pandas
Numeric and scientific computing - NumPy and SciPy
Statistical analysis – StatsModels
Machine learning – scikit-learn
PyMC - stochastic Bayesian modeling
Data visualization - matplotlib and seaborn
Matplotlib
Seaborn
Summary
Up and Running with pandas
Installation of Anaconda
IPython and Jupyter Notebook
IPython
Jupyter Notebook
Introducing the pandas Series and DataFrame
Importing pandas
The pandas Series
The pandas DataFrame
Loading data from files into a DataFrame
Visualization
Summary
Representing Univariate Data with the Series
Configuring pandas
Creating a Series
Creating a Series using Python lists and dictionaries
Creation using NumPy functions
Creation using a scalar value
The .index and .values properties
The size and shape of a Series
Specifying an index at creation
Heads, tails, and takes
Retrieving values in a Series by label or position
Lookup by label using the [] operator and the .ix[] property
Explicit lookup by position with .iloc[]
Explicit lookup by labels with .loc[]
Slicing a Series into subsets
Alignment via index labels
Performing Boolean selection
Re-indexing a Series
Modifying a Series in-place
Summary
Representing Tabular and Multivariate Data with the DataFrame
Configuring pandas
Creating DataFrame objects
Creating a DataFrame using NumPy function results
Creating a DataFrame using a Python dictionary and pandas Series objects
Creating a DataFrame from a CSV file
Accessing data within a DataFrame
Selecting the columns of a DataFrame
Selecting rows of a DataFrame
Scalar lookup by label or location using .at[] and .iat[]
Slicing using the [ ] operator
Selecting rows using Boolean selection
Selecting across both rows and columns
Summary
Manipulating DataFrame Structure
Configuring pandas
Renaming columns
Adding new columns with [] and .insert()
Adding columns through enlargement
Adding columns using concatenation
Reordering columns
Replacing the contents of a column
Deleting columns
Appending new rows
Concatenating rows
Adding and replacing rows via enlargement
Removing rows using .drop()
Removing rows using Boolean selection
Removing rows using a slice
Summary
Indexing Data
Configuring pandas
The importance of indexes
The pandas index types
The fundamental type - Index
Integer index labels using Int64Index and RangeIndex
Floating-point labels using Float64Index
Representing discrete intervals using IntervalIndex
Categorical values as an index - CategoricalIndex
Indexing by date and time using DatetimeIndex
Indexing periods of time using PeriodIndex
Working with Indexes
Creating and using an index with a Series or DataFrame
Selecting values using an index
Moving data to and from the index
Reindexing a pandas object
Hierarchical indexing
Summary
Categorical Data
Configuring pandas
Creating Categoricals
Renaming categories
Appending new categories
Removing categories
Removing unused categories
Setting categories
Descriptive information of a Categorical
Munging school grades
Summary
Numerical and Statistical Methods
Configuring pandas
Performing numerical methods on pandas objects
Performing arithmetic on a DataFrame or Series
Getting the counts of values
Determining unique values (and their counts)
Finding minimum and maximum values
Locating the n-smallest and n-largest values
Calculating accumulated values
Performing statistical processes on pandas objects
Retrieving summary descriptive statistics
Measuring central tendency: mean, median, and mode
Calculating the mean
Finding the median
Determining the mode
Calculating variance and standard deviation
Measuring variance
Finding the standard deviation
Determining covariance and correlation
Calculating covariance
Determining correlation
Performing discretization and quantiling of data
Calculating the rank of values
Calculating the percent change at each sample of a series
Performing moving-window operations
Executing random sampling of data
Summary
Accessing Data
Configuring pandas
Working with CSV and text/tabular format data
Examining the sample CSV data set
Reading a CSV file into a DataFrame
Specifying the index column when reading a CSV file
Data type inference and specification
Specifying column names
Specifying specific columns to load
Saving DataFrame to a CSV file
Working with general field-delimited data
Handling variants of formats in field-delimited data
Reading and writing data in Excel format
Reading and writing JSON files
Reading HTML data from the web
Reading and writing HDF5 format files
Accessing CSV data on the web
Reading and writing from/to SQL databases
Reading data from remote data services
Reading stock data from Yahoo! and Google Finance
Retrieving options data from Google Finance
Reading economic data from the Federal Reserve Bank of St. Louis
Accessing Kenneth French's data
Reading from the World Bank
Summary
Tidying Up Your Data
Configuring pandas
What is tidying your data?
How to work with missing data
Determining NaN values in pandas objects
Selecting out or dropping missing data
Handling of NaN values in mathematical operations
Filling in missing data
Forward and backward filling of missing values
Filling using index labels
Performing interpolation of missing values
Handling duplicate data
Transforming data
Mapping data into different values
Replacing values
Applying functions to transform data
Summary
Combining, Relating, and Reshaping Data
Configuring pandas
Concatenating data in multiple objects
Understanding the default semantics of concatenation
Switching axes of alignment
Specifying join type
Appending versus concatenation
Ignoring the index labels
Merging and joining data
Merging data from multiple pandas objects
Specifying the join semantics of a merge operation
Pivoting data to and from value and indexes
Stacking and unstacking
Stacking using non-hierarchical indexes
Unstacking using hierarchical indexes
Melting data to and from long and wide format
Performance benefits of stacked data
Summary
Data Aggregation
Configuring pandas
The split, apply, and combine (SAC) pattern
Data for the examples
Splitting data
Grouping by a single column's values
Accessing the results of a grouping
Grouping using multiple columns
Grouping using index levels
Applying aggregate functions, transforms, and filters
Applying aggregation functions to groups
Transforming groups of data
The general process of transformation
Filling missing values with the mean of the group
Calculating normalized z-scores with a transformation
Filtering groups from aggregation
Summary
Time-Series Modelling
Setting up the IPython notebook
Representation of dates, time, and intervals
The datetime, day, and time objects
Representing a point in time with a Timestamp
Using a Timedelta to represent a time interval
Introducing time-series data
Indexing using DatetimeIndex
Creating time-series with specific frequencies
Calculating new dates using offsets
Representing data intervals with date offsets
Anchored offsets
Representing durations of time using Period
Modelling an interval of time with a Period
Indexing using the PeriodIndex
Handling holidays using calendars
Normalizing timestamps using time zones
Manipulating time-series data
Shifting and lagging
Performing frequency conversion on a time-series
Up and down resampling of a time-series
Time-series moving-window operations
Summary
Visualization
Configuring pandas
Plotting basics with pandas
Creating time-series charts
Adorning and styling your time-series plot
Adding a title and changing axes labels
Specifying the legend content and position
Specifying line colors, styles, thickness, and markers
Specifying tick mark locations and tick labels
Formatting axes' tick date labels using formatters
Common plots used in statistical analyses
Showing relative differences with bar plots
Picturing distributions of data with histograms
Depicting distributions of categorical data with box and whisker charts
Demonstrating cumulative totals with area plots
Relationships between two variables with scatter plots
Estimates of distribution with the kernel density plot
Correlations between multiple variables with the scatter plot matrix
Strengths of relationships in multiple variables with heatmaps
Manually rendering multiple plots in a single chart
Summary
Historical Stock Price Analysis
Setting up the IPython notebook
Obtaining and organizing stock data from Google
Plotting time-series prices
Plotting volume-series data
Calculating the simple daily percentage change in closing price
Calculating simple daily cumulative returns of a stock
Resampling data from daily to monthly returns
Analyzing distribution of returns
Performing a moving-average calculation
Comparison of average daily returns across stocks
Correlation of stocks based on the daily percentage change of the closing price
Calculating the volatility of stocks
Determining risk relative to expected returns
Summary
pandas is a popular Python package used for practical, real-world data analysis. It provides efficient, fast, and high-performance data structures that make data exploration and analysis very easy. This learner's guide will take you through a comprehensive set of features provided by the pandas library for performing efficient data manipulation and analysis.
Chapter 1, pandas and Data Analysis, is a hands-on introduction to the key features of pandas. The idea of this chapter is to provide some context for using pandas within the worlds of statistics and data science. The chapter gets into several concepts in data science and shows how they are supported by pandas. This sets a context for each of the subsequent chapters, mentioning how each chapter relates to both data science and data science processes.
Chapter 2, Up and Running with pandas, instructs the reader on how to obtain and install pandas, and introduces a few of the basic concepts in pandas. We also look at how the examples are presented using IPython and Jupyter Notebook.
Chapter 3, Representing Univariate Data with the Series, walks the reader through the use of the pandas Series, which provides a one-dimensional, indexed data representation. The reader will learn how to create Series objects and how to manipulate the data held within. They will also learn about indexes and alignment of data, and about how the Series can be used to slice data.
Chapter 4, Representing Tabular and Multivariate Data with the DataFrame, walks the reader through the basic use of the pandas DataFrame, which provides an indexed, multivariate data representation. This chapter instructs the reader on how to create DataFrame objects from various sets of static data, and how to select specific columns and rows within them. More complex queries, manipulation, and indexing are handled in the chapters that follow.
Chapter 5, Manipulating DataFrame Structure, expands on the previous chapter and instructs you on how to perform more complex manipulations of a DataFrame. We learn how to rename columns; add, replace, and delete columns and rows; reorder columns; and modify data within a DataFrame (or create a modified copy).
Chapter 6, Indexing Data, shows how pandas uses indexes to efficiently retrieve, align, and select data. The chapter covers the various pandas index types, including integer, range, floating-point, interval, categorical, datetime, and period indexes, and demonstrates creating and using an index with a Series or DataFrame, selecting values using an index, moving data to and from the index, reindexing, and hierarchical indexing.
Chapter 7, Categorical Data, instructs the reader on how to use pandas Categoricals to model categorical variables, including creating Categoricals, renaming, appending, removing, and setting categories, and retrieving descriptive information about categorical data.
Chapter 8, Numerical and Statistical Methods, covers performing numerical and statistical operations on pandas objects, including arithmetic, value counts and unique values, minimum and maximum values, accumulations, summary descriptive statistics, measures of central tendency, variance, covariance and correlation, discretization, ranking, percent changes, moving-window operations, and random sampling.
Chapter 9, Accessing Data, shows how data can be loaded into, and saved from, Series and DataFrame objects. The chapter covers working with data in CSV, field-delimited, Excel, JSON, HTML, and HDF5 formats, reading from and writing to SQL databases, and retrieving data from remote data services such as stock and economic data providers and the World Bank.
Chapter 10, Tidying Up Your Data, explains how to organize data in a tidy form that is usable for data analysis, covering the handling of missing data, duplicate data, and the transformation of values.
Chapter 11, Combining, Relating, and Reshaping Data, tells readers how they can take data in multiple pandas objects and combine, relate, and reshape it through concepts such as concatenation, merges and joins, pivoting, stacking and unstacking, and melting.
Chapter 12, Data Aggregation, talks about grouping and performing aggregate data analysis. In pandas, this is often referred to as the split-apply-combine pattern. The reader will learn how to use this pattern to group data in various configurations, and to apply aggregate functions, transformations, and filters to each group of data.
Chapter 13, Time-Series Modelling, covers the representation of dates, times, and time-series data in pandas, and the extensive capabilities the library provides for facilitating the analysis of time-series data, such as frequency conversion, resampling, and moving-window operations.
Chapter 14, Visualization, teaches you how to create data visualizations based upon data stored in pandas data structures. We start with the basics, learning how to create a simple chart from data and control several of the attributes of the chart (such as legends, labels, and colors). We then examine the creation of several common types of plots used to represent different types of data, and how those plot types convey meaning about the underlying data.
Chapter 15, Historical Stock Price Analysis, shows you how to apply pandas to basic financial problems. It focuses on stock data obtained from Google Finance, and demonstrates a number of financial concepts, such as calculating returns, moving averages, volatility, and risk relative to expected returns. The reader will also learn how to apply data visualization to these financial concepts.
This book assumes some familiarity with programming concepts, but those without programming experience, or specifically Python programming experience, will be comfortable with the examples, as they focus on pandas constructs more than Python or programming. The examples are based on Anaconda Python 2.7 and pandas 0.15.1. If you do not have either installed, guidance is given in Chapter 2, Up and Running with pandas, on installing both on Windows, OSX, and Ubuntu systems. For those not interested in installing any software, instruction is also given on using the Wakari.io online Python data analysis service.
This book is ideal for data scientists, data analysts, and Python programmers who want to plunge into data analysis using pandas, and anyone curious about analyzing data. Some knowledge of statistics and programming will help you to get the most out of this book but that's not strictly required. Prior exposure to pandas is also not required.
Feedback from our readers is always welcome. Let us know what you think about this book: what you liked or disliked. Reader feedback is important for us as it helps us develop titles that you will really get the most out of. To send us general feedback, simply e-mail [email protected], and mention the book's title in the subject of your message. If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide at www.packtpub.com/authors.
Now that you are the proud owner of a Packt book, we have a number of things to help you to get the most from your purchase.
You can download the example code files for this book from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you. You can download the code files by following these steps:
1. Log in or register to our website using your e-mail address and password.
2. Hover the mouse pointer on the SUPPORT tab at the top.
3. Click on Code Downloads & Errata.
4. Enter the name of the book in the Search box.
5. Select the book for which you're looking to download the code files.
6. Choose from the drop-down menu where you purchased this book from.
7. Click on Code Download.
Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:
WinRAR / 7-Zip for Windows
Zipeg / iZip / UnRarX for Mac
7-Zip / PeaZip for Linux
The code bundle for the book is also hosted on GitHub at https://github.com/PacktPublishing/Learning-Pandas-Second-Edition. We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!
Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books, maybe a mistake in the text or the code, we would be grateful if you could report this to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this book. If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting your book, clicking on the Errata Submission Form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded to our website or added to any list of existing errata under the Errata section of that title. To view the previously submitted errata, go to https://www.packtpub.com/books/content/support and enter the name of the book in the search field. The required information will appear under the Errata section.
Piracy of copyrighted material on the Internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works in any form on the Internet, please provide us with the location address or website name immediately so that we can pursue a remedy. Please contact us at [email protected] with a link to the suspected pirated material. We appreciate your help in protecting our authors and our ability to bring you valuable content.
If you have a problem with any aspect of this book, you can contact us at [email protected], and we will do our best to address the problem.
Welcome to Learning pandas! In this book, we will go on a journey that will see us learning pandas, an open source data analysis library for the Python programming language. The pandas library provides high-performance and easy-to-use data structures and analysis tools built with Python. pandas brings to Python many good things from the statistical programming language R, specifically data frame objects and R packages such as plyr and reshape2, and places them in a single library that you can use from within Python.
In this first chapter, we will take the time to understand pandas and how it fits into the bigger picture of data analysis. This will give the reader who is interested in pandas a feeling for its place in that bigger picture, instead of focusing entirely on the details of using it. The goal is that, while learning pandas, you also learn why its features exist in support of performing data analysis tasks.
So, let's jump in. In this chapter, we will cover:
What pandas is, why it was created, and what it gives you
How pandas relates to data analysis and data science
The processes involved in data analysis and how they are supported by pandas
General concepts of data and analytics
Basic concepts of data analysis and statistical analysis
Types of data and their applicability to pandas
Other libraries in the Python ecosystem that you will likely use with pandas
pandas is a Python library containing high-level data structures and tools that have been created to help Python programmers to perform powerful data analysis. The ultimate purpose of pandas is to help you quickly discover information in data, with information being defined as an underlying meaning.
The development of pandas was begun in 2008 by Wes McKinney; it was open sourced in 2009. pandas is currently supported and actively developed by various organizations and contributors.
pandas was initially designed with finance in mind, specifically around the manipulation of time series data and the processing of historical stock information. The processing of financial information presents many challenges, the following being a few (a brief sketch after this list shows the kind of code involved):
Representing security data, such as a stock's price, as it changes over time
Matching the measurement of multiple streams of data at identical times
Determining the relationship (correlation) of two or more streams of data
Representing times and dates as first-class entities
Converting the period of samples of data, either up or down
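As a taste of what such processing looks like in pandas, the following minimal sketch uses invented closing prices (the stock names and values are made up for illustration, not real market data) to align two daily price series on their dates, measure the correlation of their daily percentage changes, and resample them to a lower frequency:

import pandas as pd

# two hypothetical daily closing-price series, indexed by date
dates = pd.date_range('2017-01-02', periods=5, freq='D')
stock_a = pd.Series([62.5, 62.3, 62.8, 63.1, 62.9], index=dates)
stock_b = pd.Series([116.1, 116.0, 116.6, 117.1, 117.0], index=dates)

# pandas automatically aligns the two streams on their common dates
prices = pd.DataFrame({'StockA': stock_a, 'StockB': stock_b})

# the relationship (correlation) of the two streams of daily changes
returns = prices.pct_change()
print(returns.corr())

# converting the period of the samples, in this case down to every two days
print(prices.resample('2D').mean())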
To do this processing, a tool was needed that allows us to retrieve, index, clean and tidy, reshape, combine, slice, and perform various analyses on both single- and multidimensional data, including heterogeneously typed data that is automatically aligned along a set of common index labels. This is where pandas comes in, having been created with many useful and powerful features such as the following (a short sketch after this list shows a few of them in action):
Fast and efficient Series and DataFrame objects for data manipulation with integrated indexing
Intelligent data alignment using indexes and labels
Integrated handling of missing data
Facilities for converting messy data into orderly data (tidying)
Built-in tools for reading and writing data between in-memory data structures and files, databases, and web services
The ability to process data stored in many common formats such as CSV, Excel, HDF5, and JSON
Flexible reshaping and pivoting of sets of data
Smart label-based slicing, fancy indexing, and subsetting of large datasets
Columns can be inserted and deleted from data structures for size mutability
Aggregating or transforming data with a powerful data grouping facility to perform split-apply-combine on datasets
High-performance merging and joining of datasets
Hierarchical indexing facilitating working with high-dimensional data in a lower-dimensional data structure
Extensive features for time series data, including date range generation and frequency conversion, moving window statistics, moving window linear regressions, date shifting, and lagging
Highly optimized for performance, with critical code paths written in Cython or C
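The following is a minimal sketch, using small made-up values, of several of these features in action: integrated indexing on Series and DataFrame objects, intelligent alignment of data by label, built-in handling of missing data, and label-based slicing:

import pandas as pd

# Series objects with explicit index labels
s1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
s2 = pd.Series([10, 20, 30], index=['b', 'c', 'd'])

# arithmetic aligns values by label; labels present in only one
# Series produce NaN, pandas' marker for missing data
total = s1 + s2
print(total)            # a: NaN, b: 12.0, c: 23.0, d: NaN

# integrated handling of missing data
print(total.fillna(0))

# a DataFrame with label-based slicing using .loc[]
df = pd.DataFrame({'x': [1, 2, 3], 'y': [4, 5, 6]},
                  index=['r1', 'r2', 'r3'])
print(df.loc['r1':'r2', 'y'])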
The robust feature set, combined with its seamless integration with Python and other tools within the Python ecosystem, has given pandas wide adoption in a variety of academic and commercial domains, including finance, neuroscience, economics, statistics, advertising, and web analytics. It has become one of the most popular tools for data scientists to represent data for manipulation and analysis.
Python has long been exceptional for data munging and preparation, but less so for data analysis and modeling. pandas helps fill this gap, enabling you to carry out your entire data analysis workflow in Python without having to switch to a more domain-specific language such as R. This is very important, as those familiar with Python, a more general-purpose programming language than R (which is more of a statistical package), gain many of R's data representation and manipulation features while remaining entirely within an incredibly rich Python ecosystem.
Combined with IPython, Jupyter notebooks, and a wide range of other libraries, the environment for performing data analysis in Python excels in performance, productivity, and the ability to collaborate, compared to many other tools. This has led to the widespread adoption of pandas by many users in many industries.
We live in a world in which massive amounts of data are produced and stored every day. This data comes from a plethora of information systems, devices, and sensors. Almost everything you do, and the items you use to do it, produce data that can be, or is, captured.
This has been greatly enabled by the ubiquitous nature of services that are connected to networks, and by the great increases in data storage facilities; this, combined with the ever-decreasing cost of storage, has made capturing and storing even the most trivial of data effective.
This has led to massive amounts of data being piled up and ready for access. But this data is spread out all over cyberspace, and cannot actually be referred to as information. It tends to be a collection of recordings of events, whether financial, of your interactions with social networks, or of your personal health monitor tracking your heartbeat throughout the day. This data is stored in all kinds of formats, is located in scattered places, and, beyond its raw nature, does not give much insight.
Logically, the overall process can be broken into three major areas of discipline:
Data manipulation
Data analysis
Data science
These three disciplines can and do have a lot of overlap. Where each ends and the others begin is open to interpretation. For the purposes of this book we will define each as in the following sections.
Data is distributed all over the planet. It is stored in different formats. It has widely varied levels of quality. Because of this, there is a need for tools and processes for pulling data together into a form that can be used for decision making. This requires many different tasks and capabilities from a tool that manipulates data in preparation for analysis. The features needed from such a tool include the following (a short sketch after this list demonstrates a few of them):
Programmability for reuse and sharing
Access to data from external sources
Storing data locally
Indexing data for efficient retrieval
Alignment of data in different sets based upon attributes
Combining data in different sets
Transformation of data into other representations
Cleaning data from cruft
Effective handling of bad data
Grouping data into common baskets
Aggregation of data of like characteristics
Application of functions to calculate meaning or perform transformations
Query and slicing to explore pieces of the whole
Restructuring into other forms
Modeling distinct categories of data such as categorical, continuous, discrete, and time series
Resampling data to different frequencies
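To make a few of these capabilities concrete, here is a minimal sketch (the basket names and values are invented purely for illustration) showing pandas handling a bad value, grouping data into common baskets, aggregating like characteristics, and slicing out a piece of the whole:

import pandas as pd
import numpy as np

# a small, made-up data set with a grouping column and one missing value
df = pd.DataFrame({'basket': ['a', 'a', 'b', 'b'],
                   'value': [1.0, np.nan, 3.0, 4.0]})

# effective handling of bad data: fill the missing value
clean = df.fillna({'value': 0.0})

# group into common baskets and aggregate like characteristics
print(clean.groupby('basket')['value'].sum())

# query/slice to explore a piece of the whole
print(clean[clean['value'] > 1.0])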
There are many data manipulation tools in existence. Each differs in support for the items on this list, how they are deployed, and how they are utilized by their users. These tools include relational databases (SQL Server, Oracle), spreadsheets (Excel), event processing systems (such as Spark), and more generic tools such as R and pandas.