Scikit-learn: Machine Learning Simplified

Raul Garreta

Description

Implement scikit-learn into every step of the data science pipeline

About This Book

  • Use Python and scikit-learn to create intelligent applications
  • Discover how to apply algorithms in a variety of situations to tackle common and not-so-common challenges in the machine learning domain
  • A practical, example-based guide to help you gain expertise in implementing and evaluating machine learning systems using scikit-learn

Who This Book Is For

If you are a programmer and want to explore machine learning and data-based methods to build intelligent applications and enhance your programming skills, this is the course for you. No previous experience with machine-learning algorithms is required.

What You Will Learn

  • Review fundamental concepts, including supervised and unsupervised learning, common tasks, and performance metrics
  • Classify objects (from documents to human faces and flower species) based on some of their features, using a variety of methods from Support Vector Machines to Naive Bayes
  • Use Decision Trees to explain the main causes of certain phenomena such as passenger survival on the Titanic
  • Evaluate the performance of machine learning systems in common tasks
  • Master algorithms of various levels of complexity and learn how to analyze data at the same time
  • Learn just enough math to think about the connections between various algorithms
  • Customize machine learning algorithms to fit your problem, and learn how to modify them when the situation calls for it
  • Incorporate other packages from the Python ecosystem to munge and visualize your dataset
  • Improve the way you build your models using parallelization techniques

In Detail

Machine learning, the art of creating applications that learn from experience and data, has been around for many years. Python is quickly becoming the go-to language for analysts and data scientists due to its simplicity and flexibility; moreover, within the Python data space, scikit-learn is the unequivocal choice for machine learning. This course combines an introduction to the main concepts and methods of machine learning with practical, hands-on examples of real-world problems.

The course starts by walking through different methods to prepare your data—be it a dataset with missing values or text columns that require the categories to be turned into indicator variables. Once the data is ready, you'll learn different techniques aligned with different objectives—be it a dataset with known outcomes, such as sales by state, or a more complicated problem, such as clustering similar customers. Finally, you'll learn how to polish your algorithm to ensure that it's both accurate and resilient to new datasets.

You will learn to incorporate machine learning into your applications. Examples ranging from handwritten digit recognition to document classification are solved step by step using scikit-learn and Python. By the end of this course, you will have learned how to build applications that learn from experience by applying the main concepts and techniques of machine learning.

Style and Approach

Implement scikit-learn using engaging examples and fun exercises, with a gentle, friendly, yet comprehensive "learn-by-doing" approach. This is a practical course that analyzes compelling data about life, health, and death with the help of tutorials. It offers a useful way of interpreting data that is specific to this course, but that can also be applied to any other data. This course is designed to be both a guide and a reference for moving beyond the basics of scikit-learn.




Table of Contents

Credits
Preface
What this learning path covers
What you need for this learning path
Who this learning path is for
Reader feedback
Customer support
Downloading the example code
Errata
Piracy
Questions
1. Module 1
1. Machine Learning – A Gentle Introduction
Installing scikit-learn
Linux
Mac
Windows
Checking your installation
Datasets
Our first machine learning method – linear classification
Evaluating our results
Machine learning categories
Important concepts related to machine learning
Summary
2. Supervised Learning
Image recognition with Support Vector Machines
Training a Support Vector Machine
Text classification with Naïve Bayes
Preprocessing the data
Training a Naïve Bayes classifier
Evaluating the performance
Explaining Titanic hypothesis with decision trees
Preprocessing the data
Training a decision tree classifier
Interpreting the decision tree
Random Forests – randomizing decisions
Evaluating the performance
Predicting house prices with regression
First try – a linear model
Second try – Support Vector Machines for regression
Third try – Random Forests revisited
Evaluation
Summary
3. Unsupervised Learning
Principal Component Analysis
Clustering handwritten digits with k-means
Alternative clustering methods
Summary
4. Advanced Features
Feature extraction
Feature selection
Model selection
Grid search
Parallel grid search
Summary
2. Module 2
1. Premodel Workflow
Introduction
Getting sample data from external sources
Getting ready
How to do it…
How it works…
There's more…
See also
Creating sample data for toy analysis
Getting ready
How to do it...
How it works...
Scaling data to the standard normal
Getting ready
How to do it...
How it works...
There's more...
Creating idempotent scaler objects
Handling sparse imputations
Creating binary features through thresholding
Getting ready
How to do it...
How it works...
There's more...
Sparse matrices
The fit method
Working with categorical variables
Getting ready
How to do it...
How it works...
There's more...
DictVectorizer
Patsy
Binarizing label features
Getting ready
How to do it...
How it works...
There's more...
Imputing missing values through various strategies
Getting ready
How to do it...
How it works...
There's more...
Using Pipelines for multiple preprocessing steps
Getting ready
How to do it...
How it works...
Reducing dimensionality with PCA
Getting ready
How to do it...
How it works...
There's more...
Using factor analysis for decomposition
Getting ready
How to do it...
How it works...
Kernel PCA for nonlinear dimensionality reduction
Getting ready
How to do it...
How it works...
Using truncated SVD to reduce dimensionality
Getting ready
How to do it...
How it works...
There's more...
Sign flipping
Sparse matrices
Decomposition to classify with DictionaryLearning
Getting ready
How to do it...
How it works...
Putting it all together with Pipelines
Getting ready
How to do it...
How it works...
There's more...
Using Gaussian processes for regression
Getting ready
How to do it…
How it works…
There's more…
Defining the Gaussian process object directly
Getting ready
How to do it…
How it works…
Using stochastic gradient descent for regression
Getting ready
How to do it…
How it works…
2. Working with Linear Models
Introduction
Fitting a line through data
Getting ready
How to do it...
How it works...
There's more...
Evaluating the linear regression model
Getting ready
How to do it...
How it works...
There's more...
Using ridge regression to overcome linear regression's shortfalls
Getting ready
How to do it...
How it works...
Optimizing the ridge regression parameter
Getting ready
How to do it...
How it works...
There's more...
Using sparsity to regularize models
Getting ready
How to do it...
How it works...
Lasso cross-validation
Lasso for feature selection
Taking a more fundamental approach to regularization with LARS
Getting ready
How to do it...
How it works...
There's more...
Using linear methods for classification – logistic regression
Getting ready
How to do it...
There's more...
Directly applying Bayesian ridge regression
Getting ready
How to do it...
How it works...
There's more...
Using boosting to learn from errors
Getting ready
How to do it...
How it works...
3. Building Models with Distance Metrics
Introduction
Using KMeans to cluster data
Getting ready
How to do it…
How it works...
Optimizing the number of centroids
Getting ready
How to do it…
How it works…
Assessing cluster correctness
Getting ready
How to do it...
There's more...
Using MiniBatch KMeans to handle more data
Getting ready
How to do it...
How it works...
Quantizing an image with KMeans clustering
Getting ready
How to do it…
How it works…
Finding the closest objects in the feature space
Getting ready
How to do it...
How it works...
There's more...
Probabilistic clustering with Gaussian Mixture Models
Getting ready
How to do it...
How it works...
Using KMeans for outlier detection
Getting ready
How to do it...
How it works...
Using k-NN for regression
Getting ready
How to do it…
How it works...
4. Classifying Data with scikit-learn
Introduction
Doing basic classifications with Decision Trees
Getting ready
How to do it…
How it works…
Tuning a Decision Tree model
Getting ready
How to do it…
How it works…
Using many Decision Trees – random forests
Getting ready
How to do it…
How it works…
There's more…
Tuning a random forest model
Getting ready
How to do it…
How it works…
There's more…
Classifying data with support vector machines
Getting ready
How to do it…
How it works…
There's more…
Generalizing with multiclass classification
Getting ready
How to do it…
How it works…
Using LDA for classification
Getting ready
How to do it…
How it works…
Working with QDA – a nonlinear LDA
Getting ready
How to do it…
How it works…
Using Stochastic Gradient Descent for classification
Getting ready
How to do it…
Classifying documents with Naïve Bayes
Getting ready
How to do it…
How it works…
There's more…
Label propagation with semi-supervised learning
Getting ready
How to do it…
How it works…
5. Postmodel Workflow
Introduction
K-fold cross validation
Getting ready
How to do it...
How it works...
Automatic cross validation
Getting ready
How to do it...
How it works...
Cross validation with ShuffleSplit
Getting ready
How to do it...
Stratified k-fold
Getting ready
How to do it...
How it works...
Poor man's grid search
Getting ready
How to do it...
How it works...
Brute force grid search
Getting ready
How to do it...
How it works...
Using dummy estimators to compare results
Getting ready
How to do it...
How it works...
Regression model evaluation
Getting ready
How to do it...
How it works...
Feature selection
Getting ready
How to do it...
How it works...
Feature selection on L1 norms
Getting ready
How to do it...
How it works...
Persisting models with joblib
Getting ready
How to do it...
How it works...
There's more...
3. Module 3
1. The Fundamentals of Machine Learning
Learning from experience
Machine learning tasks
Training data and test data
Performance measures, bias, and variance
An introduction to scikit-learn
Installing scikit-learn
Installing scikit-learn on Windows
Installing scikit-learn on Linux
Installing scikit-learn on OS X
Verifying the installation
Installing pandas and matplotlib
Summary
2. Linear Regression
Simple linear regression
Evaluating the fitness of a model with a cost function
Solving ordinary least squares for simple linear regression
Evaluating the model
Multiple linear regression
Polynomial regression
Regularization
Applying linear regression
Exploring the data
Fitting and evaluating the model
Fitting models with gradient descent
Summary
3. Feature Extraction and Preprocessing
Extracting features from categorical variables
Extracting features from text
The bag-of-words representation
Stop-word filtering
Stemming and lemmatization
Extending bag-of-words with TF-IDF weights
Space-efficient feature vectorizing with the hashing trick
Extracting features from images
Extracting features from pixel intensities
Extracting points of interest as features
SIFT and SURF
Data standardization
Summary
4. From Linear Regression to Logistic Regression
Binary classification with logistic regression
Spam filtering
Binary classification performance metrics
Accuracy
Precision and recall
Calculating the F1 measure
ROC AUC
Tuning models with grid search
Multi-class classification
Multi-class classification performance metrics
Multi-label classification and problem transformation
Multi-label classification performance metrics
Summary
5. Nonlinear Classification and Regression with Decision Trees
Decision trees
Training decision trees
Selecting the questions
Information gain
Gini impurity
Decision trees with scikit-learn
Tree ensembles
The advantages and disadvantages of decision trees
Summary
6. Clustering with K-Means
Clustering with the K-Means algorithm
Local optima
The elbow method
Evaluating clusters
Image quantization
Clustering to learn features
Summary
7. Dimensionality Reduction with PCA
An overview of PCA
Performing Principal Component Analysis
Variance, Covariance, and Covariance Matrices
Eigenvectors and eigenvalues
Dimensionality reduction with Principal Component Analysis
Using PCA to visualize high-dimensional data
Face recognition with PCA
Summary
8. The Perceptron
Activation functions
The perceptron learning algorithm
Binary classification with the perceptron
Document classification with the perceptron
Limitations of the perceptron
Summary
9. From the Perceptron to Support Vector Machines
Kernels and the kernel trick
Maximum margin classification and support vectors
Classifying characters in scikit-learn
Classifying handwritten digits
Classifying characters in natural images
Summary
10. From the Perceptron to Artificial Neural Networks
Nonlinear decision boundaries
Feedforward and feedback artificial neural networks
Multilayer perceptrons
Minimizing the cost function
Forward propagation
Backpropagation
Approximating XOR with Multilayer perceptrons
Classifying handwritten digits
Summary
Bibliography
Index

scikit-learn: Machine Learning Simplified

Copyright © 2017 Packt Publishing

All rights reserved. No part of this course may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this course to ensure the accuracy of the information presented. However, the information contained in this course is sold without warranty, either express or implied. Neither the authors, nor Packt Publishing, nor its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this course.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this course by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

Published on: November 2017

Published by Packt Publishing Ltd.

Livery Place

35 Livery Street

Birmingham B3 2PB, UK.

ISBN 978-1-84719-752-8

www.packtpub.com

Credits

Authors

Raúl Garreta

Guillermo Moncecchi

Trent Hauck

Gavin Hackeling

Reviewers

Andreas Hjortgaard Danielsen

Noel Dawe

Gavin Hackeling

Anoop Thomas Mathew

Xingzhong

Fahad Arshad

Sarah Guido

Mikhail Korobov

Aman Madaan

Content Development Editor

Mayur Pawanikar

Production Coordinator

Arvindkumar Gupta

Preface

Suppose you want to predict whether tomorrow will be a sunny or rainy day. You can develop an algorithm that is based on the current weather and your meteorological knowledge using a rather complicated set of rules to return the desired prediction. Now suppose that you have a record of the day-by-day weather conditions for the last five years, and you find that every time you had two sunny days in a row, the following day also happened to be a sunny one. Your algorithm could generalize this and predict that tomorrow will be a sunny day since the sun reigned today and yesterday. This algorithm is a pretty simple example of learning from experience. This is what Machine Learning is all about: algorithms that learn from the available data.

This course is designed in the same way that many data science and analytics projects play out. First, we need to acquire data; the data is often messy, incomplete, or not correct in some way. Therefore, we spend the first chapter talking about strategies for dealing with bad data and other problems that arise from data. For example, what happens if we have too many features? How do we handle that?

What this learning path covers

Module 1, Learning scikit-learn: Machine Learning in Python, teaches you several methods for building Machine Learning applications that solve different real-world tasks, from document classification to image recognition. We will use Python, a simple, popular, and widely used programming language, and scikit-learn, an open source Machine Learning library. Each chapter of this module presents a different Machine Learning setting and a couple of well-studied methods, with step-by-step examples that use Python and scikit-learn to solve concrete tasks. We will also show you tips and tricks to improve algorithm performance, from both the accuracy and the computational cost points of view.

Module 2, scikit-learn Cookbook, opens with the premodel workflow; that first chapter is your guide. The meat of this module walks you through various algorithms and how to implement them in your workflow. Finally, we end with the postmodel workflow, a chapter that is fairly agnostic to the other chapters of the module and can be applied to the various algorithms you'll learn in the preceding chapters.

Module 3, Mastering Machine Learning with scikit-learn, examines several machine learning models and learning algorithms. We will discuss tasks that machine learning is commonly applied to and learn to measure the performance of machine learning systems. We will work with a popular library for the Python programming language called scikit-learn, which has assembled excellent implementations of many machine learning models and algorithms under a simple yet versatile API.

This module is motivated by two goals:

  • Its content should be accessible. The book only assumes familiarity with basic programming and math.
  • Its content should be practical. This book offers hands-on examples that readers can adapt to problems in the real world.

What you need for this learning path

Module 1:

For running the module's examples, you will need a running Python environment, including the scikit-learn library and NumPy and SciPy mathematical libraries. The source code will be available in the form of IPython notebooks. For Chapter 4, Advanced Features, we will also include the Pandas Python library. Chapter 1, Machine Learning – A Gentle Introduction, shows how to install them in your operating system.
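
Once the libraries are installed, a quick sanity check such as the following confirms that Python can import all of them (a minimal sketch; the exact version numbers printed will depend on your environment):

import numpy
import scipy
import sklearn

# If any of these imports fails, revisit the installation steps in Chapter 1.
print("NumPy:", numpy.__version__)
print("SciPy:", scipy.__version__)
print("scikit-learn:", sklearn.__version__)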

Module 2:

Here are the package versions that will get the environment set up; for example, save them to a requirements.txt file and install them with pip install -r requirements.txt. This will allow you to follow along with the code in this module, and may be easier for less-experienced Python developers:

dateutil==2.1
ipython==2.2.0
ipython-notebook==2.1.0
jinja2==2.7.3
markupsafe==0.18
matplotlib==1.3.1
numpy==1.8.1
patsy==0.3.0
pandas==0.14.1
pip==1.5.6
pydot==1.0.28
pyparsing==1.5.6
pytz==2014.4
pyzmq==14.3.1
scikit-learn==0.15.0
scipy==0.14.0
setuptools==3.6
six==1.7.3
ssl_match_hostname==3.4.0.2
tornado==3.2.2

Module 3:

The examples in this module assume that you have an installation of Python 2.7. The first chapter will describe methods to install scikit-learn 0.15.2, its dependencies, and other libraries on Linux, OS X, and Windows.

Who this learning path is for

If you are a programmer and want to explore machine learning and data-based methods to build intelligent applications and enhance your programming skills, this is the book for you. No previous experience with machine-learning algorithms is required.

Reader feedback

Feedback from our readers is always welcome. Let us know what you think about this course—what you liked or disliked. Reader feedback is important for us as it helps us develop titles that you will really get the most out of.

To send us general feedback, simply e-mail <[email protected]>, and mention the course's title in the subject of your message.

If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide at www.packtpub.com/authors.

Customer support

Now that you are the proud owner of a Packt course, we have a number of things to help you to get the most from your purchase.

Downloading the example code

You can download the example code files for this course from your account at http://www.packtpub.com. If you purchased this course elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.

You can download the code files by following these steps:

1. Log in or register to our website using your e-mail address and password.
2. Hover the mouse pointer on the SUPPORT tab at the top.
3. Click on Code Downloads & Errata.
4. Enter the name of the course in the Search box.
5. Select the course for which you're looking to download the code files.
6. Choose from the drop-down menu where you purchased this course from.
7. Click on Code Download.

You can also download the code files by clicking on the Code Files button on the course's webpage at the Packt Publishing website. This page can be accessed by entering the course's name in the Search box. Please note that you need to be logged in to your Packt account.

Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:

  • WinRAR / 7-Zip for Windows
  • Zipeg / iZip / UnRarX for Mac
  • 7-Zip / PeaZip for Linux

The code bundle for the course is also hosted on GitHub at https://github.com/PacktPublishing/scikit-learn-Machine-Learning-Simplified. We also have other code bundles from our rich catalog of books, videos, and courses available at https://github.com/PacktPublishing/. Check them out!

Errata

Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our courses—maybe a mistake in the text or the code—we would be grateful if you could report this to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this course. If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting your course, clicking on the Errata Submission Form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded to our website or added to any list of existing errata under the Errata section of that title.

To view the previously submitted errata, go to https://www.packtpub.com/books/content/support and enter the name of the course in the search field. The required information will appear under the Errata section.

Piracy

Piracy of copyrighted material on the Internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works in any form on the Internet, please provide us with the location address or website name immediately so that we can pursue a remedy.

Please contact us at <[email protected]> with a link to the suspected pirated material.

We appreciate your help in protecting our authors and our ability to bring you valuable content.

Questions

If you have a problem with any aspect of this course, you can contact us at <[email protected]>, and we will do our best to address the problem.

Part 1. Module 1

Learning scikit-learn: Machine Learning in Python

Experience the benefits of machine learning techniques by applying them to real-world problems using Python and the open source scikit-learn library

Machine learning categories

Classification is only one of the possible machine learning problems that can be addressed with scikit-learn. We can organize them in the following categories:

  • In the previous example, we had a set of instances (that is, a set of data collected from a population) represented by certain features and with a particular target attribute. Supervised learning algorithms try to build a model from this data, which lets us predict the target attribute for new instances, knowing only these instance features. When the target class belongs to a discrete set (such as a list of flower species), we are facing a classification problem.
  • Sometimes the class we want to predict, instead of belonging to a discrete set, ranges on a continuous set, such as the real number line. In this case, we are trying to solve a regression problem (the term was coined by Francis Galton, who observed that the heights of descendants of tall ancestors tend to regress down towards a normal value, the average human height). For example, we could try to predict the petal width based on the other three features. We will see that the methods used for regression are quite different from those used for classification.
  • Another type of machine learning problem is that of unsupervised learning. In this case, we do not have a target class to predict but instead want to group instances according to some similarity measure based on the available set of features. For example, suppose you have a dataset composed of e-mails and want to group them by their main topic (the task of grouping instances is called clustering). We could use, for example, the different words used in each e-mail as features. A minimal sketch of all three settings follows.
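
To make the three categories concrete, here is a minimal sketch using the Iris dataset that ships with scikit-learn. The particular estimators chosen here (SGDClassifier, LinearRegression, and KMeans) are illustrative assumptions on our part, not the only options the book covers:

from sklearn.datasets import load_iris
from sklearn.linear_model import SGDClassifier, LinearRegression
from sklearn.cluster import KMeans

iris = load_iris()
X, y = iris.data, iris.target

# Classification: the target belongs to a discrete set (the flower species).
classifier = SGDClassifier().fit(X, y)

# Regression: the target ranges on a continuous set (predict petal width
# from the other three features).
regressor = LinearRegression().fit(X[:, :3], X[:, 3])

# Clustering (unsupervised): no target at all; group instances by similarity.
clusterer = KMeans(n_clusters=3).fit(X)

print(classifier.predict(X[:1]), regressor.predict(X[:1, :3]), clusterer.labels_[:1])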

Important concepts related to machine learning

The linear classifier we presented in the previous section may look too simple. What if we use a higher degree polynomial? What if we also take as features not only the sepal length and width, but also the petal length and the petal width? This is perfectly possible, and depending on the sample distribution, it could lead to a better fit to the training data, resulting in higher accuracy. The problem with this approach is that now we must estimate not only the three original parameters (the coefficients for x1 and x2, and the intercept), but also the parameters for the new features x3 and x4 (petal length and width), as well as the product combinations of the four features.

Intuitively, we would need more training data to adequately estimate these parameters. The number of parameters (and consequently, the amount of training data needed to adequately estimate them) would grow rapidly if we added more features or higher order terms. This phenomenon, present in every machine learning method, is called the curse of dimensionality: when the number of parameters of a model grows, the data needed to learn them grows exponentially.
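
To see how quickly the parameter count grows, consider expanding the four Iris features with all product combinations up to a given degree; PolynomialFeatures is our illustrative choice here, not a step taken from the book's example:

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.zeros((1, 4))  # four features: sepal length/width, petal length/width
for degree in (1, 2, 3, 4):
    expanded = PolynomialFeatures(degree=degree).fit(X)
    # Each extra degree adds all product combinations up to that order.
    print("degree", degree, "->", expanded.n_output_features_, "features")
# Prints 5, 15, 35, and 70 features; each one is a parameter to estimate.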

This notion is closely related to the problem of overfitting mentioned earlier. Because our training data is not enough, we risk producing a model that could be very good at predicting the target class on the training dataset but fail miserably when faced with new data; that is, our model does not generalize. That is why it is so important to evaluate our methods on previously unseen data.
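
A minimal sketch of this evaluation discipline, assuming a recent scikit-learn (older releases, such as the 0.15 series, expose train_test_split under sklearn.cross_validation rather than sklearn.model_selection):

from sklearn.datasets import load_iris
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
# Hold out a quarter of the data; the model never sees it during training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

clf = SGDClassifier().fit(X_train, y_train)
print("training accuracy:", clf.score(X_train, y_train))
# Accuracy on previously unseen data is the honest measure of generalization.
print("test accuracy:", clf.score(X_test, y_test))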

The general rule is that, in order to avoid overfitting, we should prefer simple methods (that is, those with fewer parameters), something that could be seen as an instantiation of the philosophical principle of Occam's razor, which states that among competing hypotheses, the hypothesis with the fewest assumptions should be selected.

However, we should also take into account Einstein's words:

"Everything should be made as simple as possible, but not simpler."

The curse of dimensionality may suggest that we keep our models simple, but on the other hand, if our model is too simple, we run the risk of suffering from underfitting. Underfitting problems arise when our model has such low representation power that it cannot model the data even if we had all the training data we wanted. We clearly have underfitting when our algorithm cannot achieve good performance measures even on the training set.

As a result, we will have to achieve a balance between overfitting and underfitting. This is one of the most important problems that we will have to address when designing our machine learning models.

Other key concepts to take into account are the bias and variance of a machine learning method. Consider an extreme method that, in a binary classification setting, always predicts the positive class for any new instance. Its predictions are, trivially, always the same, or in statistical terms, it has null variance; but it will fail to predict negative examples: it is very biased towards positive results. On the other hand, consider a method that predicts, for a new instance, the class of the nearest instance in the training set (in fact, this method exists, and it is called 1-nearest neighbor). The generalization assumptions that this method makes are very small: it has very low bias; but if we change the training data, results could change dramatically, that is, its variance is very high. These are extreme examples of the bias-variance tradeoff. It can be shown that, no matter which method we use, if we reduce bias, variance will increase, and vice versa.

Linear classifiers generally have low variance: no matter what subset we select for training, the results will be similar. However, if the data distribution (as in the case of the versicolor and virginica species) makes the target classes not separable by a hyperplane, these results will be consistently wrong; that is, the method is highly biased.

On the other hand, kNN (a memory-based method we will not address in this book) has very low bias but high variance: the results are generally very good at describing training data but tend to vary greatly when trained on different training instances.
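
The following sketch (our own illustration, not an example from the book) makes the 1-nearest-neighbor behavior visible: it reproduces its training data perfectly but scores lower on held-out data, and that gap moves around as the split changes:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0)

knn = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)
print("training accuracy:", knn.score(X_train, y_train))  # 1.0: it memorizes
print("test accuracy:", knn.score(X_test, y_test))        # lower; varies by split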

There are other important concepts related to real-world applications, where our data will not come naturally as a list of real-valued features. In these cases, we will need methods to transform non-real-valued features into real-valued ones. Besides, there are other steps related to feature standardization and normalization, which, as we saw in our Iris example, are needed to avoid undesired effects regarding the different value ranges. These transformations on the feature space are known as data preprocessing.
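
As a small preprocessing sketch, standardization with StandardScaler (our illustrative choice for the normalization step mentioned above) rescales every feature to zero mean and unit variance so that differing value ranges do not distort the model:

from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)
scaler = StandardScaler().fit(X)  # learns each feature's mean and std
X_scaled = scaler.transform(X)    # reuse the same transform on new data later
print(X_scaled.mean(axis=0).round(2))  # ~0 for every feature
print(X_scaled.std(axis=0).round(2))   # ~1 for every feature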

After we have a defined feature set, we will see that not all of the features in our original dataset may be useful for solving our task. So we must also have methods for feature selection, that is, methods to select the most promising features.
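
A minimal feature-selection sketch follows; SelectKBest with an ANOVA score is our illustrative assumption here, and the book presents its own selection methods later:

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)
# Keep only the k features that score highest against the target.
selector = SelectKBest(score_func=f_classif, k=2).fit(X, y)
print("kept feature indices:", selector.get_support(indices=True))
X_reduced = selector.transform(X)  # only the two most promising columns remain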

In this book, we will present several problems, and for each of them we will show different ways to transform and find the most relevant features to use for learning a task. This process, called feature engineering, is based on our knowledge of the domain of the problem and/or data analysis methods. These methods, often not valued enough, are a fundamental step toward obtaining good results.

Summary

In this chapter, we introduced the main general concepts in machine learning and presented scikit-learn, the Python library we will use in the rest of this book. We included a very simple classification example that shows the main steps for learning and the most important evaluation measures we will use. In the rest of this book, we plan to show you different machine learning methods and techniques, using different real-world examples for each one. In almost every computational task, the presence of historical data could allow us to improve performance, in the sense introduced at the beginning of this chapter.

The next chapter introduces supervised learning methods: we have annotated data (that is, instances where the target class/value is known) and we want to predict the same class/value for future data from the same population. In the case of classification tasks, that is, a discrete-valued target class, several different models exist, ranging from statistical methods, such as the simple Naïve Bayes, to advanced linear classifiers, such as Support Vector Machines (SVM). Some methods, such as decision trees, will allow us to visualize how important a feature is for discriminating between different target classes and give us a human interpretation of the decision process. We will also address another type of supervised learning task: regression, that is, methods that try to predict real-valued data.