Dealing With Data Pocket Primer - O Campesato - E-Book

Description

This book introduces the basic concepts of managing data using various computer languages and applications. It is designed as a fast-paced introduction to key features of data management, including statistical concepts, data-related techniques, Pandas, RDBMS, SQL, NLP topics, Matplotlib, and data visualization. The companion files with source code and color figures enhance the learning experience.
Understanding these concepts is crucial for anyone looking to manage data effectively. The book covers the fundamentals of probability and statistics, working with data using Pandas, managing databases with SQL and MySQL, and cleaning data using NLP techniques. It also delves into data visualization, providing practical insights and numerous code samples.
The journey begins with an introduction to probability and statistics, moving on to working with data and Pandas. It then covers RDBMS and SQL, focusing on practical SQL and MySQL usage. The book concludes with NLP, data cleaning, and visualization techniques, equipping readers with a comprehensive understanding of data management.

The e-book can be read in Legimi apps or in any app that supports the following formats:

EPUB
MOBI

Page count: 315

Publication year: 2024




DEALING WITH DATA

Pocket Primer

LICENSE, DISCLAIMER OF LIABILITY, AND LIMITED WARRANTY

By purchasing or using this book and companion files (the “Work”), you agree that this license grants permission to use the contents contained herein, including the disc, but does not give you the right of ownership to any of the textual content in the book/disc or ownership to any of the information or products contained in it. This license does not permit uploading of the Work onto the Internet or on a network (of any kind) without the written consent of the Publisher. Duplication or dissemination of any text, code, simulations, images, etc. contained herein is limited to and subject to licensing terms for the respective products, and permission must be obtained from the Publisher or the owner of the content, etc., in order to reproduce or network any portion of the textual material (in any media) that is contained in the Work.

MERCURY LEARNING AND INFORMATION (“MLI” or “the Publisher”) and anyone involved in the creation, writing, or production of the companion disc, accompanying algorithms, code, or computer programs (“the software”), and any accompanying Web site or software of the Work, cannot and do not warrant the performance or results that might be obtained by using the contents of the Work. The author, developers, and the Publisher have used their best efforts to ensure the accuracy and functionality of the textual material and/or programs contained in this package; we, however, make no warranty of any kind, express or implied, regarding the performance of these contents or programs. The Work is sold “as is” without warranty (except for defective materials used in manufacturing the book or due to faulty workmanship).

The author, developers, and the publisher of any accompanying content, and anyone involved in the composition, production, and manufacturing of this work will not be liable for damages of any kind arising out of the use of (or the inability to use) the algorithms, source code, computer programs, or textual material contained in this publication. This includes, but is not limited to, loss of revenue or profit, or other incidental, physical, or consequential damages arising out of the use of this Work.

The sole remedy in the event of a claim of any kind is expressly limited to replacement of the book and/or disc, and only at the discretion of the Publisher. The use of “implied warranty” and certain “exclusions” vary from state to state, and might not apply to the purchaser of this product.

Companion files for this title are available by writing to the publisher at [email protected].

DEALING WITH DATA

Pocket Primer

Oswald Campesato

MERCURY LEARNING AND INFORMATION

Dulles, Virginia

Boston, Massachusetts

New Delhi

Copyright ©2022 by MERCURY LEARNING AND INFORMATION LLC. All rights reserved.

This publication, portions of it, or any accompanying software may not be reproduced in any way, stored in a retrieval system of any type, or transmitted by any means, media, electronic display or mechanical display, including, but not limited to, photocopy, recording, Internet postings, or scanning, without prior permission in writing from the publisher.

Publisher: David Pallai

MERCURY LEARNING AND INFORMATION

22841 Quicksilver Drive

Dulles, VA 20166

[email protected]

www.merclearning.com

800-232-0223

O. Campesato. Dealing with Data Pocket Primer.

ISBN: 978-1-683928-201

The publisher recognizes and respects all marks used by companies, manufacturers, and developers as a means to distinguish their products. All brand names and product names mentioned in this book are trademarks or service marks of their respective companies. Any omission or misuse (of any kind) of service marks or trademarks, etc. is not an attempt to infringe on the property of others.

Library of Congress Control Number: 2022934840

22 23 24  3 2 1

This book is printed on acid-free paper in the United States of America.

Our titles are available for adoption, license, or bulk purchase by institutions, corporations, etc. For additional information, please contact the Customer Service Dept. at 800-232-0223 (toll free).

All of our titles are available in digital format at academiccourseware.com and other digital vendors. Companion files (figures and code listings) for this title are available by contacting [email protected]. The sole obligation of MERCURY LEARNING AND INFORMATION to the purchaser is to replace the disc, based on defective materials or faulty workmanship, but not based on the operation or functionality of the product.

I’d like to dedicate this book to my parents; may this bring joy and happiness into their lives.

CONTENTS

Preface

Chapter 1: Introduction to Probability and Statistics

What Is a Probability?

Calculating the Expected Value

Random Variables

Discrete versus Continuous Random Variables

Well-Known Probability Distributions

Fundamental Concepts in Statistics

The Mean

The Median

The Mode

The Variance and Standard Deviation

Population, Sample, and Population Variance

Chebyshev’s Inequality

What Is a P-Value?

The Moments of a Function (Optional)

What Is Skewness?

What Is Kurtosis?

Data and Statistics

The Central Limit Theorem

Correlation versus Causation

Statistical Inferences

Statistical Terms RSS, TSS, R^2, and F1 Score

What Is an F1 Score?

Gini Impurity, Entropy, and Perplexity

What Is Gini Impurity?

What Is Entropy?

Calculating Gini Impurity and Entropy Values

Multidimensional Gini Index

What Is Perplexity?

Cross-Entropy and KL Divergence

What Is Cross-Entropy?

What Is KL Divergence?

What Is Their Purpose?

Covariance and Correlation Matrices

The Covariance Matrix

Covariance Matrix: An Example

The Correlation Matrix

Eigenvalues and Eigenvectors

Calculating Eigenvectors: A Simple Example

Gauss Jordan Elimination (Optional)

PCA (Principal Component Analysis)

The New Matrix of Eigenvectors

Well-Known Distance Metrics

Pearson Correlation Coefficient

Jaccard Index (or Similarity)

Locality-Sensitive Hashing (Optional)

Types of Distance Metrics

What Is Bayesian Inference?

Bayes’ Theorem

Some Bayesian Terminology

What Is MAP?

Why Use Bayes’ Theorem?

Summary

Chapter 2: Working with Data

Dealing With Data: What Can Go Wrong?

What Is Data Drift?

What Are Datasets?

Data Preprocessing

Data Types

Preparing Datasets

Discrete Data versus Continuous Data

“Binning” Continuous Data

Scaling Numeric Data via Normalization

Scaling Numeric Data via Standardization

Scaling Numeric Data via Robust Standardization

What to Look for in Categorical Data

Mapping Categorical Data to Numeric Values

Working With Dates

Working With Currency

Working With Outliers and Anomalies

Outlier Detection/Removal

Finding Outliers With Numpy

Finding Outliers With Pandas

Calculating Z-Scores to Find Outliers

Finding Outliers With SkLearn (Optional)

Working With Missing Data

Imputing Values: When Is Zero a Valid Value?

Dealing With Imbalanced Datasets

What Is SMOTE?

SMOTE Extensions

The Bias-Variance Tradeoff

Types of Bias in Data

Analyzing Classifiers (Optional)

What Is LIME?

What Is ANOVA?

Summary

Chapter 3: Introduction to Pandas

What Is Pandas?

Pandas DataFrames

Pandas Operations: In-place or Not?

Data Frames and Data Cleaning Tasks

A Pandas DataFrame Example

Describing a Pandas Data Frame

Pandas Boolean Data Frames

Transposing a Pandas Data Frame

Pandas Data Frames and Random Numbers

Converting Categorical Data to Numeric Data

Merging and Splitting Columns in Pandas

Combining Pandas DataFrames

Data Manipulation With Pandas DataFrames

Pandas DataFrames and CSV Files

Useful Options for the Pandas read_csv() Function

Reading Selected Rows From CSV Files

Pandas DataFrames and Excel Spreadsheets

Useful Options for Reading Excel Spreadsheets

Select, Add, and Delete Columns in Data Frames

Handling Outliers in Pandas

Pandas DataFrames and Simple Statistics

Finding Duplicate Rows in Pandas

Finding Missing Values in Pandas

Missing Values in Iris-Based Dataset

Sorting Data Frames in Pandas

Working With groupby() in Pandas

Aggregate Operations With the titanic.csv Dataset

Working With apply() and applymap() in Pandas

Useful One-Line Commands in Pandas

Working With JSON-Based Data

Python Dictionary and JSON

Python, Pandas, and JSON

Summary

Chapter 4: Introduction to RDBMS and SQL

What Is an RDBMS?

What Relationships Do Tables Have in an RDBMS?

Features of an RDBMS

What Is ACID?

When Do We Need an RDBMS?

The Importance of Normalization

A Four-Table RDBMS

Detailed Table Descriptions

The customers Table

The purchase_orders Table

The line_items Table

The item_desc Table

What Is SQL?

DCL, DDL, DQL, DML, and TCL

SQL Privileges

Properties of SQL Statements

The CREATE Keyword

What Is MySQL?

What About MariaDB?

Installing MySQL

Data Types in MySQL

The CHAR and VARCHAR Data Types

String-Based Data Types

FLOAT and DOUBLE Data Types

BLOB and TEXT Data Types

MySQL Database Operations

Creating a Database

Display a List of Databases

Display a List of Database Users

Dropping a Database

Exporting a Database

Renaming a Database

The INFORMATION_SCHEMA Table

The PROCESSLIST Table

SQL Formatting Tools

Summary

Chapter 5: Working with SQL and MySQL

Create Database Tables

Manually Creating Tables for mytools.com

Creating Tables via an SQL Script for mytools.com

Creating Tables With Japanese Text

Creating Tables From the Command Line

Drop Database Tables

Dropping Tables via a SQL Script for mytools.com

Altering Database Tables With the ALTER Keyword

Add a Column to a Database Table

Drop a Column From a Database Table

Change the Data Type of a Column

What Are Referential Constraints?

Combining Data for a Table Update (Optional)

Merging Data for a Table Update

Appending Data to a Table From a CSV File

Appending Table Data from CSV Files via SQL

Inserting Data Into Tables

Populating Tables From Text Files

Working With Simple SELECT Statements

Duplicate versus Distinct Rows

Unique Rows

The EXISTS Keyword

The LIMIT Keyword

DELETE, TRUNCATE, and DROP in SQL

More Options for the DELETE Statement in SQL

Creating Tables From Existing Tables in SQL

Working With Temporary Tables in SQL

Creating Copies of Existing Tables in SQL

What Is an SQL Index?

Types of Indexes

Creating an Index

Disabling and Enabling an Index

View and Drop Indexes

Overhead of Indexes

Considerations for Defining Indexes

Selecting Columns for an Index

Finding Columns Included in Indexes

Export Data From MySQL

Export the Result Set of a SQL Query

Export a Database or Its Contents

Using LOAD DATA in MySQL

Data Cleaning in SQL

Replace NULL With 0

Replace NULL Values With Average Value

Replace Multiple Values With a Single Value

Handle Mismatched Attribute Values

Convert Strings to Date Values

Data Cleaning From the Command Line (Optional)

Working With the sed Utility

Working With the awk Utility

Summary

Chapter 6: NLP and Data Cleaning

NLP Tasks in ML

NLP Steps for Training a Model

Text Normalization and Tokenization

Word Tokenization in Japanese

Text Tokenization With Unix Commands

Handling Stop Words

What Is Stemming?

Singular versus Plural Word Endings

Common Stemmers

Stemmers and Word Prefixes

Over Stemming and Under Stemming

What Is Lemmatization?

Stemming/Lemmatization Caveats

Limitations of Stemming and Lemmatization

Working With Text: POS

POS Tagging

POS Tagging Techniques

Cleaning Data With Regular Expressions

Cleaning Data With the cleantext Library

Handling Contracted Words

What Is BeautifulSoup?

Web Scraping With Pure Regular Expressions

What Is Scrapy?

Summary

Chapter 7: Data Visualization

What Is Data Visualization?

Types of Data Visualization

What Is Matplotlib?

Lines in a Grid in Matplotlib

A Colored Grid in Matplotlib

Randomized Data Points in Matplotlib

A Histogram in Matplotlib

A Set of Line Segments in Matplotlib

Plotting Multiple Lines in Matplotlib

Trigonometric Functions in Matplotlib

Display IQ Scores in Matplotlib

Plot a Best-Fitting Line in Matplotlib

The Iris Dataset in Sklearn

Sklearn, Pandas, and the Iris Dataset

Working With Seaborn

Features of Seaborn

Seaborn Built-In Datasets

The Iris Dataset in Seaborn

The Titanic Dataset in Seaborn

Extracting Data From the Titanic Dataset in Seaborn (1)

Extracting Data from Titanic Dataset in Seaborn (2)

Visualizing a Pandas Dataset in Seaborn

Data Visualization in Pandas

What Is Bokeh?

Summary

Index

PREFACE

What Is the Value Proposition for This Book?

This book contains a fast-paced introduction to as much relevant information about dealing with data as can reasonably be included in a book of this size. You will be exposed to statistical concepts, data-related techniques, features of Pandas, SQL, NLP topics, and data visualization.

Keep in mind that some topics are presented in a cursory manner, which is for two main reasons. First, it’s important that you be exposed to these concepts. In some cases, you will find topics that might pique your interest, and hence motivate you to learn more about them through self-study; in other cases, you will probably be satisfied with a brief introduction. In other words, you will decide whether or not to delve into more detail regarding the topics in this book.

Second, a full treatment of all the topics that are covered in this book would significantly increase the size of this book, and few people are interested in reading technical tomes.

The Target Audience

This book is intended primarily for people who plan to become data scientists, as well as anyone who needs to perform data cleaning tasks. It is also intended to reach an international audience of readers with highly diverse backgrounds in various age groups. Hence, this book uses standard English rather than colloquial expressions that might be confusing to those readers. Many people learn through different modes of imitation, which include reading, writing, and hearing new material. This book takes these points into consideration in order to provide a comfortable and meaningful learning experience for the intended readers.

What Will I Learn From This Book?

The first chapter briefly introduces basic probability and then discusses basic concepts in statistics, such as the mean, variance, and standard deviation, as well as other concepts. Then you will learn about more advanced concepts, such as Gini impurity, entropy, cross entropy, and KL divergence. You will also learn about different types of distance metrics and Bayesian inference.
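
As a preview of these Chapter 1 topics, the following minimal sketch computes the mean, population variance, and standard deviation of a small list of numbers, along with the Gini impurity and entropy of a two-class split; the data values and function names are illustrative, not taken from the book:

```python
import math

# A small hypothetical sample of numeric values
values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

mean = sum(values) / len(values)                               # 5.0
variance = sum((x - mean) ** 2 for x in values) / len(values)  # population variance: 4.0
std_dev = math.sqrt(variance)                                  # 2.0

# Gini impurity and entropy for a two-class split with probabilities p and 1-p
def gini(p):
    return 1.0 - (p ** 2 + (1.0 - p) ** 2)

def entropy(p):
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))

print(mean, variance, std_dev)   # 5.0 4.0 2.0
print(gini(0.5), entropy(0.5))   # 0.5 1.0
```

Notice that a 50/50 split maximizes both impurity measures, which is one reason these quantities are useful for evaluating splits in decision trees.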

Chapter 2 delves into processing different data types in a dataset, along with normalization, standardization, and handling missing data. You will learn about outliers and how to detect them via z-scores and quantile transformation. You will also learn about SMOTE for handling imbalanced datasets.
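
To illustrate one Chapter 2 technique, the following sketch flags outliers via z-scores using only the Python standard library; the data values and the threshold of 2.0 are hypothetical choices:

```python
import statistics

data = [10, 12, 11, 13, 12, 95, 11, 12]  # 95 looks suspicious

mean = statistics.mean(data)
stdev = statistics.stdev(data)  # sample standard deviation

# Flag values whose absolute z-score exceeds the chosen threshold
outliers = [x for x in data if abs((x - mean) / stdev) > 2.0]
print(outliers)  # [95]
```

In practice, the threshold (commonly 2.0 or 3.0) is a judgment call that depends on the dataset and on how aggressively you want to remove points.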

Chapter 3 introduces Pandas, which is a very powerful Python library that enables you to read the contents of CSV files (and other text files) into data frames (somewhat analogous to Excel spreadsheets), where you can programmatically slice-and-dice the data to conform to your requirements.
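
As a taste of Chapter 3, the following sketch builds a small Pandas data frame from an in-memory dictionary (standing in for a CSV file that would normally be loaded via pd.read_csv()) and then filters rows and selects columns; the column names and values are hypothetical:

```python
import pandas as pd

# Hypothetical data standing in for the contents of a CSV file
df = pd.DataFrame({
    "name": ["Alice", "Bob", "Carol", "Dave"],
    "age":  [34, 28, 45, 39],
    "city": ["Chicago", "Boston", "Denver", "Boston"],
})

# Slice-and-dice: keep only the Boston rows and two of the columns
boston = df[df["city"] == "Boston"][["name", "age"]]
print(boston)
print(df["age"].mean())  # 36.5
```

The same filtering and column-selection idioms apply unchanged to data frames read from CSV files or Excel spreadsheets.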

Since large quantities of data are stored as structured data in relational databases, Chapter 4 introduces you to SQL concepts and shows how to perform basic operations in MySQL, such as working with databases.

Chapter 5 covers database topics such as managing database tables and illustrates how to populate them with data. You will also see examples of SQL statements that select rows of data from a collection of database tables.
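
The flavor of Chapters 4 and 5 (creating tables, populating them, and selecting rows) can be sketched with standard SQL statements; the snippet below uses Python's built-in sqlite3 module rather than MySQL so that it is self-contained, and the table name and rows are hypothetical:

```python
import sqlite3

# An in-memory SQLite database stands in for MySQL; the SQL shown is standard
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

cur.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")
cur.executemany(
    "INSERT INTO customers (name, city) VALUES (?, ?)",
    [("Alice", "Chicago"), ("Bob", "Boston"), ("Carol", "Boston")],
)

cur.execute("SELECT name FROM customers WHERE city = ? ORDER BY name", ("Boston",))
rows = [r[0] for r in cur.fetchall()]
print(rows)  # ['Bob', 'Carol']
conn.close()
```

Apart from vendor-specific data types and administrative commands, the CREATE, INSERT, and SELECT statements here would look essentially the same in MySQL.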

Chapter 6 introduces you to NLP and how to perform tasks such as tokenization and removing stop words and punctuation, followed by stemming and lemmatization.
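
As a preview of Chapter 6, the following minimal sketch performs text normalization, tokenization, and stop-word removal with a regular expression; the stop-word list is a small hypothetical sample rather than the full list from a library such as NLTK:

```python
import re

# A tiny illustrative stop-word list (real lists are much longer)
STOP_WORDS = {"the", "a", "an", "is", "and", "of"}

def tokenize(text):
    # Normalize to lowercase, then split on runs of non-alphabetic characters
    return [t for t in re.split(r"[^a-z]+", text.lower()) if t]

def remove_stop_words(tokens):
    return [t for t in tokens if t not in STOP_WORDS]

tokens = tokenize("The quick brown fox is a friend of the hound.")
print(remove_stop_words(tokens))  # ['quick', 'brown', 'fox', 'friend', 'hound']
```

Stemming and lemmatization would then further reduce the surviving tokens to their root forms.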

The final chapter delves into data visualization with Matplotlib and Seaborn, and includes an example of rendering graphics effects in Bokeh.
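
As a preview of Chapter 7, the following sketch plots two trigonometric functions with Matplotlib and saves the chart to a file; the Agg backend is selected so the code runs without a display, and the output filename is arbitrary:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; no display required
import matplotlib.pyplot as plt
import numpy as np

# Sample the interval [0, 2*pi] and plot sin(x) and cos(x) on one set of axes
x = np.linspace(0, 2 * np.pi, 200)
fig, ax = plt.subplots()
ax.plot(x, np.sin(x), label="sin(x)")
ax.plot(x, np.cos(x), label="cos(x)")
ax.legend()
ax.set_title("Trigonometric Functions")
fig.savefig("trig.png")  # write the chart to an image file
```

Seaborn and Bokeh build on the same idea, adding statistical plot types and interactive, browser-based rendering, respectively.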

Why Are the Code Samples Primarily in Python?

Most of the code samples are short (usually less than one page and sometimes less than half a page), and if need be, you can easily and quickly copy/paste the code into a new Jupyter notebook. For the Python code samples that reference a CSV file, you do not need any additional code in the corresponding Jupyter notebook to access the CSV file. Moreover, the code samples execute quickly, so you won’t need to avail yourself of the free GPU that is provided in Google Colaboratory.

If you do decide to use Google Colaboratory, you can easily copy/paste the Python code into a notebook, and also use the upload feature to upload existing Jupyter notebooks. Keep in mind the following point: if the Python code references a CSV file, make sure that you include the appropriate code snippet (as explained in Chapter 1) to access the CSV file in the corresponding Jupyter notebook in Google Colaboratory.

Do I Need to Learn the Theory Portions of This Book?

Once again, the answer depends on the extent to which you plan to become involved in data analytics. For example, if you plan to study machine learning, then you will probably learn how to create and train a model, which is a task that is performed after data cleaning tasks. In general, you will probably need to learn everything that you encounter in this book if you are planning to become a machine learning engineer.

Getting the Most From This Book

Some programmers learn well from prose; others learn well from sample code (and lots of it). This means that there is no single style that works for everyone.

Moreover, some programmers want to run the code first, see what it does, and then return to the code to delve into the details (and others use the opposite approach).

Consequently, there are various types of code samples in this book: some are short, some are long, and other code samples “build” from earlier code samples.

What Do I Need to Know for This Book?

Current knowledge of Python 3.x is the most helpful skill. Knowledge of other programming languages (such as Java) can also be helpful because of the exposure to programming concepts and constructs. The less technical knowledge that you have, the more diligence will be required in order to understand the various topics that are covered.

If you want to be sure that you can grasp the material in this book, glance through some of the code samples to get an idea of how much is familiar to you and how much is new.

Does This Book Contain Production-Level Code Samples?

The primary purpose of the code samples in this book is to show you Python-based libraries for solving a variety of data-related tasks in conjunction with acquiring a rudimentary understanding of statistical concepts. Clarity has higher priority than writing more compact code that is more difficult to understand (and possibly more prone to bugs). If you decide to use any of the code in this book in a production website, you ought to subject that code to the same rigorous analysis as the other parts of your code base.

What Are the Nontechnical Prerequisites for This Book?

Although the answer to this question is more difficult to quantify, it's very important to have a strong desire to learn about data analytics, along with the motivation and discipline to read and understand the code samples.

How Do I Set Up a Command Shell?

If you are a Mac user, there are three ways to do so. The first method is to use Finder to navigate to Applications > Utilities and then double click on the Terminal application. Next, if you already have a command shell available, you can launch a new command shell by typing the following command:

open /Applications/Utilities/Terminal.app

A second method for Mac users is to open a new command shell on a MacBook from a command shell that is already visible, simply by pressing command+n in that command shell; your Mac will launch another command shell.

If you are a PC user, you can install Cygwin (open source, https://cygwin.com/), which simulates bash commands, or use another toolkit such as MKS (a commercial product). Please read the online documentation that describes the download and installation process. Note that custom aliases are not automatically set if they are defined in a file other than the main start-up file (such as .bash_login).

Companion Files

All the code samples and figures in this book may be obtained by writing to the publisher at [email protected].

What Are the “Next Steps” After Finishing This Book?

The answer to this question varies widely, mainly because the answer depends heavily on your objectives. If you are interested primarily in NLP, then you can learn more advanced concepts, such as attention, transformers, and the BERT-related models.

If you are primarily interested in machine learning, there are subfields of machine learning, such as deep learning and reinforcement learning (and deep reinforcement learning), that might appeal to you. Fortunately, there are many resources available, and you can perform an internet search for them. Keep in mind that the aspects of machine learning that pertain to you will vary, because the needs of a machine learning engineer, data scientist, manager, student, and software developer are all different.