Implement scikit-learn into every step of the data science pipeline
If you are a programmer and want to explore machine learning and data-based methods to build intelligent applications and enhance your programming skills, this is the course for you. No previous experience with machine-learning algorithms is required.
Machine learning, the art of creating applications that learn from experience and data, has been around for many years. Python is quickly becoming the go-to language for analysts and data scientists due to its simplicity and flexibility; moreover, within the Python data space, scikit-learn is the unequivocal choice for machine learning.

The course combines an introduction to some of the main concepts and methods in machine learning with practical, hands-on examples of real-world problems. It starts by walking through different methods to prepare your data, be it a dataset with missing values or text columns whose categories need to be turned into indicator variables. After the data is ready, you'll learn different techniques aligned with different objectives, be it a dataset with known outcomes, such as sales by state, or more complicated problems, such as clustering similar customers. Finally, you'll learn how to polish your algorithm to ensure that it's both accurate and resilient to new datasets.

You will learn to incorporate machine learning into your applications. Ranging from handwritten digit recognition to document classification, examples are solved step by step using scikit-learn and Python. By the end of this course, you will have learned how to build applications that learn from experience, by applying the main concepts and techniques of machine learning.
Implement scikit-learn through engaging examples and fun exercises, with a gentle and friendly but comprehensive "learn-by-doing" approach. This is a practical course that analyzes compelling data about life, health, and death with the help of tutorials. It teaches you ways of interpreting data that are specific to this course but can also be applied to any other data. This course is designed to be both a guide and a reference for moving beyond the basics of scikit-learn.
scikit-learn: Machine Learning Simplified
Copyright © 2017 Packt Publishing
All rights reserved. No part of this course may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this course to ensure the accuracy of the information presented. However, the information contained in this course is sold without warranty, either express or implied. Neither the authors, nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this course.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this course by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
Published on: November 2017
Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham B3 2PB, UK.
ISBN 978-1-84719-752-8
www.packtpub.com
Authors
Raúl Garreta
Guillermo Moncecchi
Trent Hauck
Gavin Hackeling
Reviewers
Andreas Hjortgaard Danielsen
Noel Dawe
Gavin Hackeling
Anoop Thomas Mathew
Xingzhong
Fahad Arshad
Sarah Guido
Mikhail Korobov
Aman Madaan
Content Development Editor
Mayur Pawanikar
Production Coordinator
Arvindkumar Gupta
Suppose you want to predict whether tomorrow will be a sunny or rainy day. You can develop an algorithm that is based on the current weather and your meteorological knowledge using a rather complicated set of rules to return the desired prediction. Now suppose that you have a record of the day-by-day weather conditions for the last five years, and you find that every time you had two sunny days in a row, the following day also happened to be a sunny one. Your algorithm could generalize this and predict that tomorrow will be a sunny day since the sun reigned today and yesterday. This algorithm is a pretty simple example of learning from experience. This is what Machine Learning is all about: algorithms that learn from the available data.
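As a toy illustration of this idea (with invented data, not from this course), the following Python snippet scans a hypothetical weather log and estimates how often two sunny days in a row were followed by a third sunny one, turning the observed pattern into a prediction rule:

# Toy "learning from experience": with invented data, estimate how often
# two sunny days in a row were followed by a third sunny day.
history = ["sunny", "rainy", "sunny", "sunny", "sunny", "rainy",
           "sunny", "sunny", "sunny"]

pairs = after_sun = 0
for i in range(len(history) - 2):
    if history[i] == history[i + 1] == "sunny":
        pairs += 1
        after_sun += history[i + 2] == "sunny"

# The learned "rule" is just the observed frequency of the pattern.
print("P(sunny | two sunny days) ~", after_sun / pairs)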
This course is designed in the same way that many data science and analytics projects play out. First, we need to acquire data; the data is often messy, incomplete, or incorrect in some way. Therefore, we spend the first chapter talking about strategies for dealing with bad data and other problems that arise from it. For example, what happens if we have too many features? How do we handle that?
Module 1, Learning scikit-learn: Machine Learning in Python, teaches you several methods for building Machine Learning applications that solve different real-world tasks, from document classification to image recognition. We will use Python, a simple, popular, and widely used programming language, and scikit-learn, an open source Machine Learning library. In each chapter of this module, we will present a different Machine Learning setting and a couple of well-studied methods, and show step-by-step examples that use Python and scikit-learn to solve concrete tasks. We will also show you tips and tricks to improve algorithm performance, from both the accuracy and the computational cost points of view.
Module 2, scikit-learn Cookbook, begins with a first chapter that serves as your guide. The meat of this module walks you through various algorithms and how to incorporate them into your workflow. Finally, we end with the post-model workflow; that chapter is fairly agnostic to the other chapters of the module and can be applied to the various algorithms you'll learn up until the final chapter.
Module 3, Mastering Machine Learning with scikit-learn, examines several machine learning models and learning algorithms. We will discuss tasks that machine learning is commonly applied to, and learn to measure the performance of machine learning systems. We will work with a popular library for the Python programming language called scikit-learn, which has assembled excellent implementations of many machine learning models and algorithms under a simple yet versatile API.
This module is motivated by two goals:
Module 1:
To run the module's examples, you will need a working Python environment with the scikit-learn library and the NumPy and SciPy mathematical libraries installed. The source code will be available in the form of IPython notebooks. For Chapter 4, Advanced Features, we will also use the pandas Python library. Chapter 1, Machine Learning – A Gentle Introduction, shows how to install them on your operating system.
Module 2:
The following package versions will get your environment set up and allow you to follow along with the code in this module. Installing from a pinned list like this may be easier for less-experienced Python developers:
dateutil==2.1
ipython==2.2.0
ipython-notebook==2.1.0
jinja2==2.7.3
markupsafe==0.18
matplotlib==1.3.1
numpy==1.8.1
patsy==0.3.0
pandas==0.14.1
pip==1.5.6
pydot==1.0.28
pyparsing==1.5.6
pytz==2014.4
pyzmq==14.3.1
scikit-learn==0.15.0
scipy==0.14.0
setuptools==3.6
six==1.7.3
ssl_match_hostname==3.4.0.2
tornado==3.2.2
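If you save the list above to a file, say requirements.txt (the filename is our choice here; any name works), pip can install all of the pinned versions in one step. Note that these exact versions date from 2014 and may no longer install cleanly on a modern system:

pip install -r requirements.txt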
Module 3:
The examples in this module assume that you have an installation of Python 2.7. The first chapter will describe methods to install scikit-learn 0.15.2, its dependencies, and other libraries on Linux, OS X, and Windows.
If you are a programmer and want to explore machine learning and data-based methods to build intelligent applications and enhance your programming skills, this is the course for you. No previous experience with machine-learning algorithms is required.
Feedback from our readers is always welcome. Let us know what you think about this course—what you liked or disliked. Reader feedback is important for us as it helps us develop titles that you will really get the most out of.
To send us general feedback, simply e-mail <[email protected]>, and mention the course's title in the subject of your message.
If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide at www.packtpub.com/authors.
Now that you are the proud owner of a Packt course, we have a number of things to help you to get the most from your purchase.
You can download the example code files for this course from your account at http://www.packtpub.com. If you purchased this course elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.
You can download the code files by following these steps:
1. Log in or register to our website using your e-mail address and password.
2. Hover the mouse pointer on the SUPPORT tab at the top.
3. Click on Code Downloads & Errata.
4. Enter the name of the course in the Search box.
5. Select the course for which you're looking to download the code files.
6. Choose from the drop-down menu where you purchased this course from.
7. Click on Code Download.
You can also download the code files by clicking on the Code Files button on the course's webpage at the Packt Publishing website. This page can be accessed by entering the course's name in the Search box. Please note that you need to be logged in to your Packt account.
Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:
WinRAR / 7-Zip for Windows
Zipeg / iZip / UnRarX for Mac
7-Zip / PeaZip for Linux
The code bundle for the course is also hosted on GitHub at https://github.com/PacktPublishing/scikit-learn-Machine-Learning-Simplified. We also have other code bundles from our rich catalog of books, videos, and courses available at https://github.com/PacktPublishing/. Check them out!
Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our courses—maybe a mistake in the text or the code—we would be grateful if you could report this to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this course. If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting your course, clicking on the Errata Submission Form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded to our website or added to any list of existing errata under the Errata section of that title.
To view the previously submitted errata, go to https://www.packtpub.com/books/content/support and enter the name of the course in the search field. The required information will appear under the Errata section.
Piracy of copyrighted material on the Internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works in any form on the Internet, please provide us with the location address or website name immediately so that we can pursue a remedy.
Please contact us at <[email protected]> with a link to the suspected pirated material.
We appreciate your help in protecting our authors and our ability to bring you valuable content.
If you have a problem with any aspect of this course, you can contact us at <[email protected]>, and we will do our best to address the problem.
Learning scikit-learn: Machine Learning in Python
Experience the benefits of machine learning techniques by applying them to real-world problems using Python and the open source scikit-learn library
Classification is only one of the possible machine learning problems that can be addressed with scikit-learn. Broadly, we can organize them into supervised tasks, where training instances come annotated with the target value (classification for discrete-valued targets, regression for real-valued ones), and unsupervised tasks, such as clustering, where they do not.
The linear classifier we presented in the previous section may look too simple. What if we used a higher degree polynomial? What if we also took as features not only the sepal length and width, but also the petal length and the petal width? This is perfectly possible, and depending on the sample distribution, it could lead to a better fit to the training data, resulting in higher accuracy. The problem with this approach is that now we must estimate not only the three original parameters (the coefficients for x1 and x2, and the intercept), but also the parameters for the new features x3 and x4 (petal length and width), as well as those for the product combinations of the four features.
Intuitively, we would need more training data to adequately estimate these parameters. The number of parameters (and consequently, the amount of training data needed to adequately estimate them) would rapidly grow if we added more features or higher order terms. This phenomenon, present in every machine learning method, is called the curse of dimensionality: when the number of parameters of a model grows, the data needed to learn them grows exponentially.
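To get a feel for how quickly the parameter count grows, the following sketch (our own illustration, using scikit-learn's PolynomialFeatures rather than any code from this course) expands the four iris measurements into all their product combinations up to a given degree:

from sklearn.datasets import load_iris
from sklearn.preprocessing import PolynomialFeatures

X = load_iris().data  # 4 features: sepal/petal length and width

for degree in (1, 2, 3):
    expanded = PolynomialFeatures(degree=degree).fit_transform(X)
    # Each extra degree multiplies the number of parameters to estimate.
    print(degree, expanded.shape[1])  # 5, 15, 35 columns (including the bias term)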
The curse of dimensionality is closely related to the problem of overfitting mentioned earlier. When our training data is not sufficient, we risk producing a model that could be very good at predicting the target class on the training dataset but fail miserably when faced with new data; that is, our model lacks generalization power. That is why it is so important to evaluate our methods on previously unseen data.
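A minimal sketch of this evaluation discipline on the iris data, holding out a quarter of the instances (written against the current scikit-learn API; the 0.15-era releases used in this course kept train_test_split in sklearn.cross_validation rather than sklearn.model_selection):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
# Keep 25% of the data unseen during training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
# Accuracy on the held-out set estimates generalization performance.
print("train:", accuracy_score(y_train, clf.predict(X_train)))
print("test: ", accuracy_score(y_test, clf.predict(X_test)))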
The general rule is that, in order to avoid overfitting, we should prefer simple methods (that is, methods with fewer parameters), something that could be seen as an instantiation of the philosophical principle of Occam's razor, which states that among competing hypotheses, the hypothesis with the fewest assumptions should be selected.
However, we should also take into account Einstein's words:
"Everything should be made as simple as possible, but not simpler."
The curse of dimensionality may suggest that we keep our models simple, but on the other hand, if our model is too simple, we run the risk of suffering from underfitting. Underfitting problems arise when our model has such low representational power that it cannot model the data even if we have all the training data we want. We clearly have underfitting when our algorithm cannot achieve good performance measures even on the training set.
As a result, we will have to achieve a balance between overfitting and underfitting. This is one of the most important problems that we will have to address when designing our machine learning models.
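One way to see this balance empirically is to sweep a complexity parameter and compare training scores against cross-validated scores. The sketch below uses a decision tree's max_depth as the complexity knob; this is our choice for illustration, not a method from this course:

from sklearn.datasets import load_iris
from sklearn.model_selection import validation_curve
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
depths = [1, 2, 3, 5, 8, 12]

# Cross-validated training vs. validation scores as complexity grows.
train_scores, valid_scores = validation_curve(
    DecisionTreeClassifier(random_state=0), X, y,
    param_name="max_depth", param_range=depths, cv=5)

for d, tr, va in zip(depths, train_scores.mean(axis=1),
                     valid_scores.mean(axis=1)):
    # A large gap between the two scores signals overfitting;
    # low scores on both signal underfitting.
    print(d, round(tr, 3), round(va, 3))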
Other key concepts to take into account are the bias and variance of a machine learning method. Consider an extreme method that, in a binary classification setting, always predicts the positive class for any new instance. Its predictions are, trivially, always the same, or in statistical terms, it has zero variance; but it will fail to predict negative examples: it is very biased towards positive results. On the other hand, consider a method that predicts, for a new instance, the class of the nearest instance in the training set (in fact, this method exists, and it is called 1-nearest neighbor). It makes very weak generalization assumptions: it has very low bias; but if we change the training data, results can change dramatically; that is, its variance is very high. These are extreme examples of the bias-variance tradeoff. It can be shown that, no matter which method we are using, if we reduce bias, variance will increase, and vice versa.
Linear classifiers generally have low variance: no matter what subset we select for training, the results will be similar. However, if the data distribution (as in the case of the versicolor and virginica species) makes the target classes not separable by a hyperplane, these results will be consistently wrong; that is, the method is highly biased.
On the other hand, kNN (a memory-based method we will not address in this book) has very low bias but high variance: the results are generally very good at describing the training data, but tend to vary greatly when the model is trained on different training instances.
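The sketch below (our own illustration, not from this book) makes this variance concrete: it fits a 1-nearest-neighbor classifier on two different random halves of the iris data and counts how often the two fits disagree on the full dataset:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Train the same 1-NN model on two different random halves of the data.
preds = []
for seed in (0, 1):
    X_half, _, y_half, _ = train_test_split(
        X, y, train_size=0.5, random_state=seed)
    model = KNeighborsClassifier(n_neighbors=1).fit(X_half, y_half)
    preds.append(model.predict(X))

# Disagreements between the two fits reflect the method's high variance.
disagreements = (preds[0] != preds[1]).sum()
print(disagreements, "of", len(X), "predictions changed")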
There are other important concepts related to real-world applications, where our data will not always come naturally as a list of real-valued features. In these cases, we will need methods to transform non-real-valued features into real-valued ones. Besides, there are other steps related to feature standardization and normalization which, as we saw in our Iris example, are needed to avoid undesired effects from the different value ranges. These transformations on the feature space are known as data preprocessing.
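A brief sketch of both kinds of preprocessing, using current scikit-learn classes (note that the sparse_output argument assumes scikit-learn 1.2 or later; earlier releases spell it sparse):

import numpy as np
from sklearn.preprocessing import StandardScaler, OneHotEncoder

# Standardize real-valued features to zero mean and unit variance.
X = np.array([[5.1, 3.5], [7.0, 3.2], [6.3, 3.3]])
print(StandardScaler().fit_transform(X))

# Turn a categorical (non-real-valued) column into indicator variables.
species = np.array([["setosa"], ["virginica"], ["setosa"]])
print(OneHotEncoder(sparse_output=False).fit_transform(species))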
Once we have a defined feature set, we will see that not all of the features in our original dataset may be useful for solving our task, so we also need methods for feature selection, that is, methods to select the most promising features.
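A minimal feature selection sketch using scikit-learn's SelectKBest, here keeping the two iris features with the highest ANOVA F-scores (one reasonable scoring function among several that the library offers):

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)

# Keep the two features with the highest ANOVA F-score against the target.
selector = SelectKBest(score_func=f_classif, k=2).fit(X, y)
print(selector.scores_)              # one score per original feature
print(selector.get_support())        # boolean mask of the selected features
X_selected = selector.transform(X)   # reduced feature matrix (n_samples, 2)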
In this book, we will present several problems, and for each of them we will show different ways to transform the data and find the most relevant features to use for learning, a task called feature engineering, which is based on our knowledge of the problem domain and/or on data analysis methods. These methods, often not valued enough, are a fundamental step toward obtaining good results.
In this chapter, we introduced the main general concepts in machine learning and presented scikit-learn, the Python library we will use in the rest of this book. We included a very simple classification example, showing the main steps of the learning process and the most important evaluation measures we will use. In the rest of this book, we plan to show you different machine learning methods and techniques, using a different real-world example for each one. In almost every computational task, the presence of historical data can allow us to improve performance, in the sense introduced at the beginning of this chapter.
The next chapter introduces supervised learning methods: we have annotated data (that is, instances where the target class/value is known) and we want to predict the same class/value for future data from the same population. In the case of classification tasks, that is, a discrete-valued target class, several different models exist, ranging from statistical methods, such as the simple Naïve Bayes, to advanced linear classifiers, such as Support Vector Machines (SVM). Some methods, such as decision trees, will allow us to visualize how important a feature is for discriminating between the target classes, and give us a human-interpretable view of the decision process. We will also address another type of supervised learning task: regression, that is, methods that try to predict real-valued data.
