Hands-On Data Analysis with Pandas - Stefanie Molin - E-Book

Description

Extracting valuable business insights is no longer a ‘nice-to-have’, but an essential skill for anyone who handles data in their enterprise. Hands-On Data Analysis with Pandas is here to help beginners and those who are migrating their skills into data science get up to speed in no time.
This book will show you how to analyze your data, get started with machine learning, and work effectively with the Python libraries often used for data science, such as pandas, NumPy, matplotlib, seaborn, and scikit-learn.
Using real-world datasets, you will learn how to use the pandas library to perform data wrangling to reshape, clean, and aggregate your data. Then, you will learn how to conduct exploratory data analysis by calculating summary statistics and visualizing the data to find patterns. In the concluding chapters, you will explore some applications of anomaly detection, regression, clustering, and classification using scikit-learn to make predictions based on past data.
This updated edition will equip you with the skills you need to use pandas 1.x to efficiently perform various data manipulation tasks, reliably reproduce analyses, and visualize your data for effective decision making – valuable knowledge that can be applied across multiple domains.

You can read this e-book in Legimi apps or in any app that supports the following formats:

EPUB
MOBI

Page count: 807

Year of publication: 2021


Hands-On Data Analysis with Pandas – Second Edition

A Python data science handbook for data collection, wrangling, analysis, and visualization

Stefanie Molin

BIRMINGHAM—MUMBAI

Hands-On Data Analysis with Pandas – Second Edition

Copyright © 2021 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author(s), nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

Group Product Manager: Kunal Parikh

Publishing Product Manager: Sunith Shetty

Senior Editor: Roshan Ravikumar

Content Development Editor: Athikho Sapuni Rishana

Technical Editor: Sonam Pandey

Copy Editor: Safis Editing

Project Coordinator: Aishwarya Mohan

Proofreader: Safis Editing

Indexer: Pratik Shirodkar

Production Designer: Shankar Kalbhor

First published: July 2019

Second edition: April 2021

Production reference: 1270421

Published by Packt Publishing Ltd.

Livery Place

35 Livery Street

Birmingham

B3 2PB, UK.

ISBN 978-1-80056-345-2

www.packt.com

To everyone that made the first edition such a success.

Foreword to the Second Edition

As educators, we are inclined to teach across the medium that we best learn from. I personally gravitated towards video content early in my career. As I produce more online content, surprisingly, one of the most frequently asked questions I get is: What book would you recommend for someone getting started in data science?

Initially, I was baffled at why people would turn to books when there are so many great online resources out there. However, after reading Hands-On Data Analysis with Pandas, my perception of books for learning data science began to change.

The first thing I loved about Hands-On Data Analysis with Pandas was the structure. The book gives you just the right amount of information at the right time to keep you progressing at a natural pace. Starting with light foundations in statistics and concepts gives the perfect amount of cognitive glue to keep theory and practice comfortably bound together.

After the foundations, you are introduced to the star of the show: pandas. Stefanie uses practical examples (not the same old datasets you have used before) to bring the module to life. I use pandas almost every day, and I still learned quite a few tricks across these sections.

As a software engineer, Stefanie knows the importance of quality documentation. She has all of the data, examples, and more in a tidy GitHub repo. Through these examples, the book truly earns the "Hands-On" moniker in its title.

The latter portion of the book gives the reader a taste of what is possible with a strong foundation in pandas. Stefanie leads you just a little bit deeper into the more advanced machine learning concepts. Once again, she provides just enough information to get you excited about taking the next step in your learning journey without inundating you with overly technical jargon.

I could sense the pride Stefanie took in this work through our conversations. While the book is a great resource for people looking to learn data science tools, it was also a way for her to solidify her own knowledge and push her boundaries. In my opinion, you want to learn from people that are creating not only for the community but also for their own learning. People with intrinsic motivation like this are willing to go the extra mile to make that extra revision or get the wording perfect.

I hope you enjoy learning from this book as much as I did. To those who asked me the question above, I have a simple answer: This one.

Ken Jee

YouTuber & Head of Data Science @ Scouts Consulting Group

Honolulu, HI (03/09/2021)

Foreword to the First Edition

Recent advancements in computing and artificial intelligence have completely changed the way we understand the world. Our current ability to record and analyze data has already transformed industries and inspired big changes in society.

Stefanie Molin's Hands-On Data Analysis with Pandas is much more than an introduction to the subject of data analysis or the pandas Python library; it's a guide to help you become part of this transformation.

Not only will this book teach you the fundamentals of using Python to collect, analyze, and understand data, but it will also expose you to important software engineering, statistical, and machine learning concepts that you will need to be successful.

Using examples based on real data, you will be able to see firsthand how to apply these techniques to extract value from data. In the process, you will learn important software development skills, including writing simulations, creating your own Python packages, and collecting data from APIs.

Stefanie possesses a rare combination of skills that makes her uniquely qualified to guide you through this process. Being both an expert data scientist and a strong software engineer, she can not only talk authoritatively about the intricacies of the data analysis workflow but also about how to implement it correctly and efficiently in Python.

Whether you are a Python programmer interested in learning more about data analysis, or a data scientist learning how to work in Python, this book will get you up to speed fast, so you can begin to tackle your own data analysis projects right away.

Felipe Moreno

New York, June 10, 2019

Felipe Moreno has been working in information security for the last two decades. He currently works for Bloomberg LP, where he leads the Security Data Science team within the Chief Information Security Office and focuses on applying statistics and machine learning to security problems.

Contributors

About the author

Stefanie Molin is a data scientist and software engineer at Bloomberg LP in NYC, tackling tough problems in information security, particularly revolving around anomaly detection, building tools for gathering data, and knowledge sharing. She has extensive experience in data science, designing anomaly detection solutions, and utilizing machine learning in both R and Python in the AdTech and FinTech industries. She holds a B.S. in operations research from Columbia University's Fu Foundation School of Engineering and Applied Science, with minors in economics, and entrepreneurship and innovation. In her free time, she enjoys traveling the world, inventing new recipes, and learning new languages spoken among both people and computers.

Writing this book was a tremendous amount of work, but I have grown a lot through the experience: as a writer, as a technologist, and as a person. This wouldn't have been possible without the help of my friends, family, and colleagues. I'm very grateful to you all. In particular, I want to thank Aliki Mavromoustaki, Felipe Moreno, Suphannee Sivakorn, Lucy Hao, Javon Thompson, and Ken Jee. (The full version of my acknowledgments can be found in the code repository; see the preface for the link.)

About the reviewer

Aliki Mavromoustaki is the lead data scientist at Tasman Analytics. She works with direct-to-consumer companies to deliver scalable infrastructure and implement event-driven analytics. Previously, she worked at Criteo, an AdTech company that employs machine learning to help digital commerce companies target valuable customers. Aliki has worked on optimizing marketing campaigns and designed statistical experiments comparing Criteo products. Aliki holds a PhD in fluid dynamics from Imperial College London and was an assistant adjunct professor in applied mathematics at UCLA.

Table of Contents

Preface

Section 1: Getting Started with Pandas

Chapter 1: Introduction to Data Analysis

Chapter materials

The fundamentals of data analysis

Data collection

Data wrangling

Exploratory data analysis

Drawing conclusions

Statistical foundations

Sampling

Descriptive statistics

Prediction and forecasting

Inferential statistics

Setting up a virtual environment

Virtual environments

Installing the required Python packages

Why pandas?

Jupyter Notebooks

Summary

Exercises

Further reading

Chapter 2: Working with Pandas DataFrames

Chapter materials

Pandas data structures

Series

Index

DataFrame

Creating a pandas DataFrame

From a Python object

From a file

From a database

From an API

Inspecting a DataFrame object

Examining the data

Describing and summarizing the data

Grabbing subsets of the data

Selecting columns

Slicing

Indexing

Filtering

Adding and removing data

Creating new data

Deleting unwanted data

Summary

Exercises

Further reading

Section 2: Using Pandas for Data Analysis

Chapter 3: Data Wrangling with Pandas

Chapter materials

Understanding data wrangling

Data cleaning

Data transformation

Data enrichment

Exploring an API to find and collect temperature data

Cleaning data

Renaming columns

Type conversion

Reordering, reindexing, and sorting data

Reshaping data

Transposing DataFrames

Pivoting DataFrames

Melting DataFrames

Handling duplicate, missing, or invalid data

Finding the problematic data

Mitigating the issues

Summary

Exercises

Further reading

Chapter 4: Aggregating Pandas DataFrames

Chapter materials

Performing database-style operations on DataFrames

Querying DataFrames

Merging DataFrames

Using DataFrame operations to enrich data

Arithmetic and statistics

Binning

Applying functions

Window calculations

Pipes

Aggregating data

Summarizing DataFrames

Aggregating by group

Pivot tables and crosstabs

Working with time series data

Time-based selection and filtering

Shifting for lagged data

Differenced data

Resampling

Merging time series

Summary

Exercises

Further reading

Chapter 5: Visualizing Data with Pandas and Matplotlib

Chapter materials

An introduction to matplotlib

The basics

Plot components

Additional options

Plotting with pandas

Evolution over time

Relationships between variables

Distributions

Counts and frequencies

The pandas.plotting module

Scatter matrices

Lag plots

Autocorrelation plots

Bootstrap plots

Summary

Exercises

Further reading

Chapter 6: Plotting with Seaborn and Customization Techniques

Chapter materials

Utilizing seaborn for advanced plotting

Categorical data

Correlations and heatmaps

Regression plots

Faceting

Formatting plots with matplotlib

Titles and labels

Legends

Formatting axes

Customizing visualizations

Adding reference lines

Shading regions

Annotations

Colors

Textures

Summary

Exercises

Further reading

Section 3: Applications – Real-World Analyses Using Pandas

Chapter 7: Financial Analysis – Bitcoin and the Stock Market

Chapter materials

Building a Python package

Package structure

Overview of the stock_analysis package

UML diagrams

Collecting financial data

The StockReader class

Collecting historical data from Yahoo! Finance

Exploratory data analysis

The Visualizer class family

Visualizing a stock

Visualizing multiple assets

Technical analysis of financial instruments

The StockAnalyzer class

The AssetGroupAnalyzer class

Comparing assets

Modeling performance using historical data

The StockModeler class

Time series decomposition

ARIMA

Linear regression with statsmodels

Comparing models

Summary

Exercises

Further reading

Chapter 8: Rule-Based Anomaly Detection

Chapter materials

Simulating login attempts

Assumptions

The login_attempt_simulator package

Simulating from the command line

Exploratory data analysis

Implementing rule-based anomaly detection

Percent difference

Tukey fence

Z-score

Evaluating performance

Summary

Exercises

Further reading

Section 4: Introduction to Machine Learning with Scikit-Learn

Chapter 9: Getting Started with Machine Learning in Python

Chapter materials

Overview of the machine learning landscape

Types of machine learning

Common tasks

Machine learning in Python

Exploratory data analysis

Red wine quality data

White and red wine chemical properties data

Planets and exoplanets data

Preprocessing data

Training and testing sets

Scaling and centering data

Encoding data

Imputing

Additional transformers

Building data pipelines

Clustering

k-means

Evaluating clustering results

Regression

Linear regression

Evaluating regression results

Classification

Logistic regression

Evaluating classification results

Summary

Exercises

Further reading

Chapter 10: Making Better Predictions – Optimizing Models

Chapter materials

Hyperparameter tuning with grid search

Feature engineering

Interaction terms and polynomial features

Dimensionality reduction

Feature unions

Feature importances

Ensemble methods

Random forest

Gradient boosting

Voting

Inspecting classification prediction confidence

Addressing class imbalance

Under-sampling

Over-sampling

Regularization

Summary

Exercises

Further reading

Chapter 11: Machine Learning Anomaly Detection

Chapter materials

Exploring the simulated login attempts data

Utilizing unsupervised methods of anomaly detection

Isolation forest

Local outlier factor

Comparing models

Implementing supervised anomaly detection

Baselining

Logistic regression

Incorporating a feedback loop with online learning

Creating the PartialFitPipeline subclass

Stochastic gradient descent classifier

Summary

Exercises

Further reading

Section 5: Additional Resources

Chapter 12: The Road Ahead

Data resources

Python packages

Searching for data

APIs

Websites

Practicing working with data

Python practice

Summary

Exercises

Further reading

Solutions

Appendix

Other Books You May Enjoy

Section 1: Getting Started with Pandas

Our journey begins with an introduction to data analysis and statistics, which will lay a strong foundation for the concepts we will cover throughout the book. Then, we will set up our Python data science environment, which contains everything we will need to work through the examples, and get started with learning the basics of pandas.

This section comprises the following chapters:

Chapter 1, Introduction to Data Analysis

Chapter 2, Working with Pandas DataFrames