Data Science Algorithms in a Week

David Natingga
Description

Machine learning applications are highly automated and self-modifying, and they continue to improve over time with minimal human intervention as they learn from the training data. To address the complex nature of various real-world data problems, specialized machine learning algorithms have been developed. Through algorithmic and statistical analysis, these models can be leveraged to gain new knowledge from existing data.
Data Science Algorithms in a Week addresses problems related to accurate and efficient data classification and prediction. Over the course of seven days, you will be introduced to seven algorithms, along with exercises that will help you understand different aspects of machine learning. You will see how to pre-cluster your data to optimize and classify it for large datasets. The book also guides you in predicting data based on existing trends in your dataset. It covers algorithms such as k-nearest neighbors, Naive Bayes, decision trees, random forests, k-means, regression, and time-series analysis.
By the end of this book, you will understand how to choose machine learning algorithms for clustering, classification, and regression, and know which is best suited to your problem.




Data Science Algorithms in a Week
Second Edition
Top 7 algorithms for scientific computing, data analysis, and machine learning


Dávid Natingga


BIRMINGHAM - MUMBAI

Data Science Algorithms in a Week Second Edition

Copyright © 2018 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author(s), nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

Commissioning Editor: Amey Varangaonkar
Acquisition Editor: Joshua Nadar
Content Development Editor: Ronnel Mathew
Technical Editor: Sneha Hanchate
Copy Editor: Safis Editing
Project Coordinator: Nidhi Joshi
Proofreader: Safis Editing
Indexer: Rekha Nair
Graphics: Tom Scaria
Production Coordinator: Aparna Bhagat

First published: August 2017
Second edition: October 2018

Production reference: 1311018

Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham B3 2PB, UK.

ISBN 978-1-78980-607-6

www.packtpub.com

mapt.io

Mapt is an online digital library that gives you full access to over 5,000 books and videos, as well as industry-leading tools to help you plan your personal development and advance your career. For more information, please visit our website.

Why subscribe?

Spend less time learning and more time coding with practical eBooks and Videos from over 4,000 industry professionals

Improve your learning with Skill Plans built especially for you

Get a free eBook or video every month

Mapt is fully searchable

Copy and paste, print, and bookmark content

Packt.com

Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.packt.com and, as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at [email protected] for more details.

At www.packt.com, you can also read a collection of free technical articles, sign up for a range of free newsletters, and receive exclusive discounts and offers on Packt books and eBooks. 

Contributors

About the author

Dávid Natingga graduated with a master's degree in engineering from Imperial College London in 2014, specializing in artificial intelligence. In 2011, he worked at Infosys Labs in Bangalore, India, undertaking research into the optimization of machine learning algorithms. In 2012 and 2013, while at Palantir Technologies in the USA, he developed algorithms for big data. In 2014, while working as a data scientist at Pact Coffee, London, he created an algorithm suggesting products based on the taste preferences of customers and the structures of the coffees. He is a PhD candidate in computability theory at the University of Leeds, UK, aiming to use pure mathematics to advance the field of AI. In 2016, he spent 8 months at the Japan Advanced Institute of Science and Technology as a research visitor.


About the reviewers

Surendra Pepakayala is a hands-on, seasoned technology professional with over 20 years of experience in the US and India. He has built enterprise software products at startups and multinational companies, and has built and sold a technology business after five years in operation. He currently consults for small to medium businesses, helping them leverage cloud, data science, machine learning, AI, and cutting-edge technologies to gain an advantage over their competition. In addition to being an advisory board member for a couple of startups in the technology space, he holds numerous coveted certifications, such as TOGAF, CRISC, and CGEIT.

Jen Stirrup is a data strategist and technologist, Microsoft Most Valuable Professional (MVP) and Microsoft Regional Director, tech community advocate, public speaker, blogger, published author, and keynote speaker. She is the founder of Data Relish, a boutique consultancy based in the UK that focuses on delivering successful business intelligence and artificial intelligence solutions that add real value to customers worldwide. She has featured on the BBC as a guest expert on topics related to data.


Packt is searching for authors like you

If you're interested in becoming an author for Packt, please visit authors.packtpub.com and apply today. We have worked with thousands of developers and tech professionals, just like you, to help them share their insight with the global tech community. You can make a general application, apply for a specific hot topic that we are recruiting an author for, or submit your own idea.

Table of Contents

Title Page

Copyright and Credits

Data Science Algorithms in a Week Second Edition

Packt Upsell

Why subscribe?

Packt.com

Contributors

About the author

About the reviewers

Packt is searching for authors like you

Preface

Who this book is for

What this book covers

To get the most out of this book

Download the example code files

Download the color images

Conventions used

Get in touch

Reviews

Classification Using K-Nearest Neighbors

Mary and her temperature preferences

Implementation of the k-nearest neighbors algorithm

Map of Italy example – choosing the value of k

Analysis

House ownership – data rescaling

Analysis

Text classification – using non-Euclidean distances

Analysis

Text classification – k-NN in higher dimensions

Analysis

Summary

Problems

Mary and her temperature preference problems

Map of Italy – choosing the value of k

House ownership

Analysis

Naive Bayes

Medical tests – basic application of Bayes' theorem

Analysis

Bayes' theorem and its extension

Bayes' theorem

Proof

Extended Bayes' theorem

Proof

Playing chess – independent events

Analysis

Implementation of a Naive Bayes classifier

Playing chess – dependent events

Analysis

Gender classification – Bayes for continuous random variables

Analysis

Summary

Problems

Analysis

Decision Trees

Swim preference – representing data using a decision tree

Information theory

Information entropy

Coin flipping

Definition of information entropy

Information gain

Swim preference – information gain calculation

ID3 algorithm – decision tree construction

Swim preference – decision tree construction by the ID3 algorithm

Implementation

Classifying with a decision tree

Classifying a data sample with the swimming preference decision tree

Playing chess – analysis with a decision tree

Analysis

Classification

Going shopping – dealing with data inconsistencies

Analysis

Summary

Problems

Analysis

Random Forests

Introduction to the random forest algorithm

Overview of random forest construction

Swim preference – analysis involving a random forest

Analysis

Random forest construction

Construction of random decision tree number 0

Construction of random decision tree number 1

Constructed random forest

Classification using random forest

Implementation of the random forest algorithm

Playing chess example

Analysis

Random forest construction

Classification

Going shopping – overcoming data inconsistencies with randomness and measuring the level of confidence

Analysis

Summary

Problems

Analysis

Clustering into K Clusters

Household incomes – clustering into k clusters

K-means clustering algorithm

Picking the initial k-centroids

Computing a centroid of a given cluster

Using the k-means clustering algorithm on the household income example

Gender classification – clustering to classify

Analysis

Implementation of the k-means clustering algorithm

Input data from gender classification

Program output for gender classification data

House ownership – choosing the number of clusters

Analysis

Document clustering – understanding the number of k clusters in a semantic context

Analysis

Summary

Problems

Analysis

Regression

Fahrenheit and Celsius conversion – linear regression on perfect data

Analysis from first principles

Least squares method for linear regression

Analysis using the least squares method in Python

Visualization

Weight prediction from height – linear regression on real-world data

Analysis

Gradient descent algorithm and its implementation

Gradient descent algorithm

Implementation

Visualization – comparison of the least squares method and the gradient descent algorithm

Flight time duration prediction based on distance

Analysis

Ballistic flight analysis – non-linear model

Analysis

Analysis by using the least squares method in Python

Summary

Problems

Analysis

Time Series Analysis

Business profits – analyzing trends

Analysis

Analyzing trends using the least squares method in Python

Visualization

Conclusion

Electronics shop's sales – analyzing seasonality

Analysis

Analyzing trends using the least squares method in Python

Visualization

Analyzing seasonality

Conclusion

Summary

Problems

Analysis

Python Reference

Introduction

Python Hello World example

Comments

Data types

int

float

String

Tuple

List

Set

Dictionary

Flow control

Conditionals

For loop

For loop on range

For loop on list

Break and continue

Functions

Input and output

Program arguments

Reading and writing a file

Statistics

Basic concepts

Bayesian inference

Distributions

Normal distribution

Cross-validation

K-fold cross-validation

A/B testing

Glossary of Algorithms and Methods in Data Science

Other Books You May Enjoy

Leave a review - let other readers know what you think

Preface

Data science is a discipline at the intersection of machine learning, statistics, and data mining, with the objective of gaining new knowledge from existing data by means of algorithmic and statistical analysis. In this book, you will learn the seven most important ways of analyzing data in data science. Each chapter first explains its algorithm or analysis as a simple concept, supported by a trivial example. Further examples and exercises are used to build and expand your knowledge of a particular type of analysis.

Who this book is for

This book is for aspiring data science professionals who are familiar with Python and have some background in statistics. Developers who are currently implementing one or two data science algorithms and now want to learn more to expand their skillset will find this book quite useful.

What this book covers

Chapter 1, Classification Using K-Nearest Neighbors, classifies a data item based on the k most similar items.

Chapter 2, Naive Bayes, delves into Bayes' theorem with a view to computing the probability of a data item belonging to a certain class.

Chapter 3, Decision Trees, organizes your decision criteria into the branches of a tree, and uses a decision tree to classify a data item into one of the classes at the leaf node.

Chapter 4, Random Forests, classifies a data item with an ensemble of decision trees to improve the accuracy of the algorithm by reducing the negative impact of bias.

Chapter 5, Clustering into K Clusters, divides your data into k clusters to discover the patterns and similarities between the data items, and shows how to exploit these patterns to classify new data.

Chapter 6, Regression, models phenomena in your data using a function that can predict values for unknown data in a simple way.

Chapter 7, Time-Series Analysis, unveils the trends and repeating patterns in time-dependent data to predict the future of the stock market, Bitcoin prices, and other time-dependent events.

Appendix A, Python Reference, is a reference for the basic Python language constructs, commands, and functions used throughout the book.

Appendix B, Statistics, provides a summary of the statistical methods and tools that are useful to a data scientist.

Appendix C, Glossary of Algorithms and Methods in Data Science, provides a glossary of some of the most important and powerful algorithms and methods from the fields of data science and machine learning.

To get the most out of this book

To get the most out of this book, you need, first and foremost, an active attitude toward thinking through the problems: a lot of new content is presented in the exercises at the end of each chapter, in the section entitled Problems. You also need to be able to run Python programs on the operating system of your choice. The author ran the programs on the Linux operating system, using the command line.

Download the example code files

You can download the example code files for this book from your account at www.packt.com. If you purchased this book elsewhere, you can visit www.packt.com/support and register to have the files emailed directly to you.

You can download the code files by following these steps:

1. Log in or register at www.packt.com.
2. Select the SUPPORT tab.
3. Click on Code Downloads & Errata.
4. Enter the name of the book in the Search box and follow the onscreen instructions.

Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:

WinRAR/7-Zip for Windows

Zipeg/iZip/UnRarX for Mac

7-Zip/PeaZip for Linux

The code bundle for the book is also hosted on GitHub at https://github.com/PacktPublishing/Data-Science-Algorithms-in-a-Week-Second-Edition. In case there's an update to the code, it will be updated on the existing GitHub repository.

We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!

Download the color images

We also provide a PDF file that has color images of the screenshots/diagrams used in this book. You can download it here: http://www.packtpub.com/sites/default/files/downloads/9781789806076_ColorImages.pdf.

Get in touch

Feedback from our readers is always welcome.

General feedback: If you have questions about any aspect of this book, mention the book title in the subject of your message and email us at [email protected].

Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you would report it to us. Please visit www.packt.com/submit-errata, select your book, click on the Errata Submission Form link, and enter the details.

Piracy: If you come across any illegal copies of our works in any form on the internet, we would be grateful if you would provide us with the location address or website name. Please contact us at [email protected] with a link to the material.

If you are interested in becoming an author: If there is a topic that you have expertise in, and you are interested in either writing or contributing to a book, please visit authors.packtpub.com.

Reviews

Please leave a review. Once you have read and used this book, why not leave a review on the site that you purchased it from? Potential readers can then see and use your unbiased opinion to make purchase decisions, we at Packt can understand what you think about our products, and our authors can see your feedback on their book. Thank you!

For more information about Packt, please visit packt.com.

Classification Using K-Nearest Neighbors

A nearest neighbor algorithm classifies a data instance based on its neighbors. The class of a data instance determined by the k-nearest neighbors algorithm is the class with the highest representation among its k closest neighbors.
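To make the majority vote concrete, here is a minimal Python sketch of the generic k-NN classification step. It is an illustration only, with an assumed data format (a list of point-class pairs) and a pluggable distance function; the book's own implementation appears later in this chapter.

```python
from collections import Counter

def knn_classify(query, data, k, distance):
    """Classify `query` by majority vote among its k nearest neighbors.

    `data` is a list of (point, class) pairs, and `distance` is any
    metric on points, such as the Euclidean or Manhattan distance.
    """
    # Sort the known points by their distance to the query point.
    neighbors = sorted(data, key=lambda item: distance(item[0], query))
    # Take the classes of the k closest points and vote.
    k_classes = [cls for _, cls in neighbors[:k]]
    return Counter(k_classes).most_common(1)[0][0]
```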

In this chapter, we will cover the following topics:

How to implement the basics of the k-NN algorithm, using the example of Mary and her temperature preferences

How to choose a correct k value so that the algorithm can perform correctly and with the highest degree of accuracy, using the example of a map of Italy

How to rescale values and prepare them for the k-NN algorithm, using the example of house preferences

How to choose a good metric to measure distances between data points

How to eliminate irrelevant dimensions in higher-dimensional space to ensure that the algorithm performs accurately, using the text classification example

Mary and her temperature preferences

As an example, if we know that our friend, Mary, feels cold when it is 10°C, but warm when it is 25°C, then in a room where it is 22°C, the nearest neighbor algorithm would guess that our friend would feel warm, because 22 is closer to 25 than to 10.

Suppose that we would like to know when Mary feels warm and when she feels cold, as in the previous example, but in addition, wind speed data is also available when Mary is asked whether she feels warm or cold:

Temperature in °C | Wind speed in km/h | Mary's perception
10                | 0                  | Cold
25                | 0                  | Warm
15                | 5                  | Cold
20                | 3                  | Warm
18                | 7                  | Cold
20                | 10                 | Cold
22                | 5                  | Warm
24                | 6                  | Warm

We could represent the data in a graph, as follows:

Now, suppose we would like to find out how Mary feels when the temperature is 16°C with a wind speed of 3 km/h by using the 1-NN algorithm:

For simplicity, we will use the Manhattan metric to measure the distance between the neighbors on the grid. The Manhattan distance d_Man between a neighbor N1 = (x1, y1) and a neighbor N2 = (x2, y2) is defined as d_Man = |x1 - x2| + |y1 - y2|.
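As a quick, illustrative sketch (not the book's full implementation), the following Python snippet measures the Manhattan distance from the point in question, (16, 3), to each of Mary's known data points and reports the nearest one:

```python
# 1-NN with the Manhattan metric on Mary's data (illustrative sketch).
def manhattan_distance(p, q):
    """Return |x1 - x2| + |y1 - y2| for two points p and q."""
    return abs(p[0] - q[0]) + abs(p[1] - q[1])

# (temperature in °C, wind speed in km/h) -> Mary's perception.
data = [
    ((10, 0), 'Cold'), ((25, 0), 'Warm'), ((15, 5), 'Cold'),
    ((20, 3), 'Warm'), ((18, 7), 'Cold'), ((20, 10), 'Cold'),
    ((22, 5), 'Warm'), ((24, 6), 'Warm'),
]

query = (16, 3)
point, perception = min(data, key=lambda item: manhattan_distance(item[0], query))
print(point, perception)  # (15, 5) Cold -- the nearest neighbor is 3 units away
```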

Let's label the grid with distances around the neighbors to see which neighbor with a known class is closest to the point we would like to classify:

We can see that the closest neighbor with a known class is the one with a temperature of 15°C (blue) and a wind speed of 5 km/h. Its distance from the point in question is three units, and its class is blue (cold). The closest red (warm) neighbor is four units away from the point in question. Since we are using the 1-nearest neighbor algorithm, we only look at the closest neighbor; therefore, the class of the point in question should be blue (cold).

By applying this procedure to every data point, we can complete the graph, as follows:

Note that, sometimes, a data point might be the same distance away from two known classes: for example, 20°C and 6 km/h. In such situations, we could prefer one class over the other, or ignore these boundary cases. The actual result depends on the specific implementation of an algorithm.
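To see such a boundary case concretely, a quick check with the illustrative manhattan_distance helper from the earlier sketch shows that (20, 6) is exactly three units from its nearest warm and cold neighbors:

```python
# Distances from the boundary point (20, 6) to its nearest known neighbors,
# using the manhattan_distance helper defined in the earlier sketch.
for point, perception in [((20, 3), 'Warm'), ((22, 5), 'Warm'), ((18, 7), 'Cold')]:
    print(point, perception, manhattan_distance(point, (20, 6)))
# All three distances equal 3, so the nearest warm and cold neighbors are
# tied and the 1-NN classification of (20, 6) is ambiguous.
```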