Building Machine Learning Systems with Python

Description

Get more from your data by creating practical machine learning systems with Python




Key Features



  • Develop your own Python-based machine learning system


  • Discover how Python offers multiple algorithms for modern machine learning systems


  • Explore key Python machine learning libraries to implement in your projects



Book Description



Machine learning allows systems to learn things without being explicitly programmed to do so. Python is one of the most popular languages used to develop machine learning applications, which take advantage of its extensive library support. This third edition of Building Machine Learning Systems with Python addresses recent developments in the field by covering the most-used datasets and libraries to help you build practical machine learning systems.






Using machine learning to gain deeper insights from data is a key skill required by modern application developers and analysts alike. Python, being a dynamic language, allows for fast exploration and experimentation. This book shows you exactly how to find patterns in your raw data. You will start by brushing up on your Python machine learning knowledge and getting an introduction to the main libraries. You'll quickly get to grips with serious, real-world projects on real datasets, building models and creating recommendation systems. With Building Machine Learning Systems with Python, you'll gain the tools and understanding required to build your own systems, all tailored to solve real-world data analysis problems.






By the end of this book, you will be able to build machine learning systems using techniques and methodologies such as classification, sentiment analysis, computer vision, reinforcement learning, and neural networks.




What you will learn



  • Build a classification system that can be applied to text, images, and sound


  • Employ Amazon Web Services (AWS) to run analysis on the cloud


  • Solve problems related to regression using scikit-learn and TensorFlow


  • Recommend products to users based on their past purchases


  • Understand different ways to apply deep neural networks on structured data


  • Address recent developments in the field of computer vision and reinforcement learning



Who this book is for



Building Machine Learning Systems with Python is for data scientists, machine learning developers, and Python developers who want to learn how to build increasingly complex machine learning systems. You will use Python's machine learning capabilities to develop effective solutions. Prior knowledge of Python programming is expected.




Building Machine Learning Systems with Python

Third Edition

Explore machine learning and deep learning techniques for building intelligent systems using scikit-learn and TensorFlow

Luis Pedro Coelho

Willi Richert

Matthieu Brucher

BIRMINGHAM - MUMBAI

Building Machine Learning Systems with Python

Third Edition

Copyright © 2018 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the authors, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

Commissioning Editor: Sunith Shetty
Acquisition Editor: Divya Poojari
Content Development Editor: Dattatraya More
Technical Editor: Varsha Shivhare
Copy Editor: Safis Editing
Project Coordinator: Shweta H Birwatkar
Proofreader: Safis Editing
Indexer: Aishwarya Gangawane
Graphics: Jisha Chirayil
Production Coordinator: Arvindkumar Gupta

First published: July 2013

Second edition: March 2015

Third edition: July 2018

Production reference: 2280718

Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham
B3 2PB, UK.

ISBN 978-1-78862-322-3

www.packtpub.com

 
mapt.io

Mapt is an online digital library that gives you full access to over 5,000 books and videos, as well as industry-leading tools to help you plan your personal development and advance your career. For more information, please visit our website.

Why subscribe?

Spend less time learning and more time coding with practical eBooks and Videos from over 4,000 industry professionals

Improve your learning with Skill Plans built especially for you

Get a free eBook or video every month

Mapt is fully searchable

Copy and paste, print, and bookmark content

PacktPub.com

Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at [email protected] for more details.

At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters, and receive exclusive discounts and offers on Packt books and eBooks. 

Contributors

About the authors

Luis Pedro Coelho is a computational biologist who analyzes DNA from microbial communities to characterize their behavior. He has also worked extensively in bioimage informatics, the application of machine learning techniques to the analysis of images of biological specimens. His main focus is on the processing and integration of large-scale datasets. He has a PhD from Carnegie Mellon University and has authored several scientific publications. In 2004, he began developing in Python and has contributed to several open source libraries. He is currently a faculty member at Fudan University in Shanghai.

Willi Richert has a PhD in machine learning/robotics, where he used reinforcement learning, hidden Markov models, and Bayesian networks to let heterogeneous robots learn by imitation. Now at Microsoft, he is involved in various machine learning areas, such as deep learning, active learning, and statistical machine translation. Willi started as a child with BASIC on his Commodore 128. Later, he discovered Turbo Pascal, then Java, then C++, only to finally arrive at his true love: Python.

Matthieu Brucher is a computer scientist who specializes in high-performance computing and computational modeling, and currently works for JPMorgan in their quantitative research branch. He is also the lead developer of Audio ToolKit, a library for real-time audio signal processing. He has a PhD in machine learning and signal processing from the University of Strasbourg, two Master of Science degrees (one in digital electronics and signal processing and another in automation) from the University of Paris XI and Supelec, as well as a Master of Music degree from Bath Spa University.

About the reviewers

Alberto Boschetti is a data scientist with expertise in signal processing and statistics. He holds a PhD in telecommunication engineering and currently lives and works in London. In his work projects, he faces challenges ranging from NLP, behavioral analysis, and machine learning to deep nets and distributed processing. He is very passionate about his job and always tries to stay updated about the latest developments in data science technologies, attending meet-ups, conferences, and other events.

 

 

 

 

Gerold Hintz is an applied scientist who specializes in NLP. He obtained an MSc in Computer Science from Darmstadt University of Technology in 2014, focusing on machine learning and minoring in linguistics. He worked as a researcher in the field of computational semantics, applying unsupervised methods to paraphrasing tasks. He likes to study both natural languages and programming languages, and has a particular interest in the intersection of these fields.

 

 

 

 

Packt is searching for authors like you

If you're interested in becoming an author for Packt, please visit authors.packtpub.com and apply today. We have worked with thousands of developers and tech professionals, just like you, to help them share their insight with the global tech community. You can make a general application, apply for a specific hot topic that we are recruiting an author for, or submit your own idea.

Table of Contents

Title Page

Copyright and Credits

Building Machine Learning Systems with Python

Third Edition

Packt Upsell

Why subscribe?

PacktPub.com

Contributors

About the authors

About the reviewers

Packt is searching for authors like you

Preface

Who this book is for

What this book covers

To get the most out of this book

Download the example code files

Download the color images

Conventions used

Get in touch

Reviews

Getting Started with Python Machine Learning

Machine learning and Python – a dream team

What the book will teach you – and what it will not

How to best read this book

What to do when you are stuck

Getting started

Introduction to NumPy, SciPy, Matplotlib, and TensorFlow

Installing Python

Chewing data efficiently with NumPy and intelligently with SciPy

Learning NumPy

Indexing

Handling nonexistent values

Comparing the runtime

Learning SciPy

Fundamentals of machine learning

Asking a question

Getting answers

Our first (tiny) application of machine learning

Reading in the data

Preprocessing and cleaning the data

Choosing the right model and learning algorithm

Before we build our first model

Starting with a simple straight line

Toward more complex models

Stepping back to go forward - another look at our data

Training and testing

Answering our initial question

Summary

Classifying with Real-World Examples

The Iris dataset

Visualization is a good first step

Classifying with scikit-learn

Building our first classification model

Evaluation – holding out data and cross-validation

How to measure and compare classifiers

A more complex dataset and the nearest-neighbor classifier

Learning about the seeds dataset

Features and feature engineering

Nearest neighbor classification

Looking at the decision boundaries

Which classifier to use

Summary

Regression

Predicting house prices with regression

Multidimensional regression

Cross-validation for regression

Penalized or regularized regression

L1 and L2 penalties

Using Lasso or ElasticNet in scikit-learn

Visualizing the Lasso path

P-greater-than-N scenarios

An example based on text documents

Setting hyperparameters in a principled way

Regression with TensorFlow

Summary

Classification I – Detecting Poor Answers

Sketching our roadmap

Learning to classify classy answers

Tuning the instance

Tuning the classifier

Fetching the data

Slimming the data down to chewable chunks

Preselecting and processing attributes

Defining what a good answer is

Creating our first classifier

Engineering the features

Training the classifier

Measuring the classifier's performance

Designing more features

Deciding how to improve the performance

Bias, variance and their trade-off

Fixing high bias

Fixing high variance

High or low bias?

Using logistic regression

A bit of math with a small example

Applying logistic regression to our post-classification problem

Looking behind accuracy – precision and recall

Slimming the classifier

Ship it!

Classification using TensorFlow

Summary

Dimensionality Reduction

Sketching our roadmap

Selecting features

Detecting redundant features using filters

Correlation

Mutual information

Asking the model about the features using wrappers

Other feature selection methods

Feature projection

Principal component analysis

Sketching PCA

Applying PCA

Limitations of PCA and how LDA can help

Multidimensional scaling

Autoencoders, or neural networks for dimensionality reduction

Summary

Clustering – Finding Related Posts

Measuring the relatedness of posts

How not to do it

How to do it

Preprocessing – similarity measured as a similar number of common words

Converting raw text into a bag of words

Counting words

Normalizing word count vectors

Removing less important words

Stemming

Installing and using NLTK

Extending the vectorizer with NLTK's stemmer

Stop words on steroids

Our achievements and goals

Clustering

K-means

Getting test data to evaluate our ideas

Clustering posts

Solving our initial challenge

Another look at noise

Tweaking the parameters

Summary

Recommendations

Rating predictions and recommendations

Splitting into training and testing

Normalizing the training data

A neighborhood approach to recommendations

A regression approach to recommendations

Combining multiple methods

Basket analysis

Obtaining useful predictions

Analyzing supermarket shopping baskets

Association rule mining

More advanced basket analysis

Summary

Artificial Neural Networks and Deep Learning

Using TensorFlow

TensorFlow API

Graphs

Sessions

Useful operations

Saving and restoring neural networks

Training neural networks

Convolutional neural networks

Recurrent neural networks

LSTM for predicting text

LSTM for image processing

Summary

Classification II – Sentiment Analysis

Sketching our roadmap

Fetching the Twitter data

Introducing the Naïve Bayes classifier

Getting to know the Bayes theorem

Being naïve

Using Naïve Bayes to classify

Accounting for unseen words and other oddities

Accounting for arithmetic underflows

Creating our first classifier and tuning it

Solving an easy problem first

Using all classes

Tuning the classifier's parameters

Cleaning tweets

Taking the word types into account

Determining the word types

Successfully cheating using SentiWordNet

Our first estimator

Putting everything together

Summary

Topic Modeling

Latent Dirichlet allocation

Building a topic model

Comparing documents by topic

Modeling the whole of Wikipedia

Choosing the number of topics

Summary

Classification III – Music Genre Classification

Sketching our roadmap

Fetching the music data

Converting into WAV format

Looking at music

Decomposing music into sine-wave components

Using FFT to build our first classifier

Increasing experimentation agility

Training the classifier

Using a confusion matrix to measure accuracy in multiclass problems

An alternative way to measure classifier performance using receiver operating characteristics

Improving classification performance with mel frequency cepstral coefficients

Music classification using TensorFlow

Summary

Computer Vision

Introducing image processing

Loading and displaying images

Thresholding

Gaussian blurring

Putting the center in focus

Basic image classification

Computing features from images

Writing your own features

Using features to find similar images

Classifying a harder dataset

Local feature representations

Image generation with adversarial networks

Summary

Reinforcement Learning

Types of reinforcement learning

Policy and value network

Q-network

Excelling at games

A small example

Using TensorFlow for the text game

Playing breakout

Summary

Bigger Data

Learning about big data

Using jug to break up your pipeline into tasks

An introduction to tasks in jug

Looking under the hood

Using jug for data analysis

Reusing partial results

Using Amazon Web Services

Creating your first virtual machines

Installing Python packages on Amazon Linux

Running jug on our cloud machine

Automating the generation of clusters with cfncluster

Summary

Where to Learn More About Machine Learning

Online courses

Books

Blogs

Data sources

Getting competitive

All that was left out

Summary

Other Books You May Enjoy

Leave a review - let other readers know what you think

Preface

Machine learning allows models or systems to learn without being explicitly programmed. You will see how to use the best library support available, including scikit-learn, TensorFlow, and many others, to build efficient, smart systems.

Who this book is for

Building Machine Learning Systems with Python is for data scientists, machine learning developers, and Python developers who want to learn how to build increasingly complex machine learning systems. You will use Python's machine learning capabilities to develop effective solutions. Prior knowledge of Python programming is expected.

What this book covers

Chapter 1, Getting Started with Python Machine Learning, introduces the basic idea of machine learning and TensorFlow with a very simple example. Despite its simplicity, it will challenge us with the risk of overfitting.

Chapter 2, Classifying with Real-world Examples, uses real data to explore classification by training a computer to be able to distinguish between different classes of flowers.

Chapter 3, Regression, explains how to use regression to handle data, a classic topic that is still relevant today. You will also learn about advanced regression techniques such as Lasso and ElasticNet.

Chapter 4, Classification I – Detecting Poor Answers, demonstrates how to use the bias-variance trade-off to debug machine learning models, though this chapter is mainly about using logistic regression to ascertain whether a user's answer to a question is good or bad.

Chapter 5, Dimensionality Reduction, explores what other methods exist to help us to downsize data so that it is chewable by our machine learning algorithms.

Chapter 6, Clustering – Finding Related Posts, demonstrates how powerful the bag of words approach is by applying it to find similar posts without really understanding them.

Chapter 7, Recommendations, builds recommendation systems based on customer product ratings. We will also see how to build recommendations from shopping data without the need for ratings data (which users do not always provide).

Chapter 8, Artificial Neural Networks and Deep Learning, deals with the fundamentals and examples of CNN and RNN using TensorFlow.

Chapter 9, Classification II – Sentiment Analysis, explains how Naïve Bayes works, and how to use it to classify tweets to see whether they are positive or negative.

Chapter 10, Topic Modeling, moves beyond assigning each post to a single cluster to assigning posts to several topics, as real texts can deal with multiple topics.

Chapter 11, Classification III – Music Genre Classification, sets the scene of someone having scrambled our huge music collection, our only hope of creating order being to let a machine learner classify our songs. It turns out that it is sometimes better to trust someone else's expertise than to create features ourselves. The chapter also covers the conversion of speech into text.

Chapter 12, Computer Vision, demonstrates how to apply classification in the specific context of handling images by extracting features from data. We also see how these methods can be adapted to find similar images in a collection, and the applications of CNN and GAN using TensorFlow.

Chapter 13, Reinforcement Learning, covers the fundamentals of reinforcement learning and Deep Q networks on Atari game playing.

Chapter 14, Bigger Data, explores some approaches to dealing with larger data by taking advantage of multiple cores or computing clusters. It also introduces cloud computing (using Amazon Web Services as our cloud provider).

To get the most out of this book

This book assumes you know Python and how to install a library using easy_install or pip. We do not rely on any advanced mathematics, such as calculus or matrix algebra. We are using the following versions throughout the book, but you should be fine with any more recent ones:

Python 3.5

NumPy 1.13.3

SciPy 1.0.0

scikit-learn (latest version)

All examples are available as Jupyter notebooks in our code bundle (https://github.com/PacktPublishing/Building-Machine-Learning-Systems-with-Python-Third-edition).

Download the example code files

You can download the example code files for this book from your account at www.packtpub.com. If you purchased this book elsewhere, you can visit www.packtpub.com/support and register to have the files emailed directly to you.

You can download the code files by following these steps:

1. Log in or register at www.packtpub.com.
2. Select the SUPPORT tab.
3. Click on Code Downloads & Errata.
4. Enter the name of the book in the Search box and follow the onscreen instructions.

Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:

WinRAR/7-Zip for Windows

Zipeg/iZip/UnRarX for Mac

7-Zip/PeaZip for Linux

The code bundle for the book is also hosted on GitHub at https://github.com/PacktPublishing/Building-Machine-Learning-Systems-with-Python-Third-edition. In case there's an update to the code, it will be updated on the existing GitHub repository.

We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!

Download the color images

We also provide a PDF file that has color images of the screenshots/diagrams used in this book. You can download it here: https://www.packtpub.com/sites/default/files/downloads/BuildingMachineLearningSystemswithPythonThirdedition_ColorImages.pdf.

Get in touch

Feedback from our readers is always welcome.

General feedback: Email [email protected] and mention the book title in the subject of your message. If you have questions about any aspect of this book, please email us at [email protected].

Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you would report this to us. Please visit www.packtpub.com/submit-errata, select your book, click on the Errata Submission Form link, and enter the details.

Piracy: If you come across any illegal copies of our works in any form on the Internet, we would be grateful if you would provide us with the location address or website name. Please contact us at [email protected] with a link to the material.

If you are interested in becoming an author: If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, please visit authors.packtpub.com.

Reviews

Please leave a review. Once you have read and used this book, why not leave a review on the site that you purchased it from? Potential readers can then see and use your unbiased opinion to make purchase decisions, we at Packt can understand what you think about our products, and our authors can see your feedback on their book. Thank you!

For more information about Packt, please visit packtpub.com.

Getting Started with Python Machine Learning

Machine learning teaches machines to learn to carry out tasks by themselves. It is that simple. The complexity comes with the details, and that is most likely the reason you are reading this book.

Maybe you have too much data and too little insight. Maybe you hope that, by using machine learning algorithms, you can solve this challenge, so you started digging into the algorithms. But perhaps after a while you became puzzled: which of the myriad of algorithms should you actually choose?

Alternatively, maybe you are simply more generally interested in machine learning and you have been reading blogs and articles about it for some time. Everything seemed to be magic and cool, so you started your exploration and fed some data into a decision tree or a support vector machine. However, after you successfully applied these to some other data, perhaps you wondered: was the whole setting right? Did you get optimal results? How do you know that there are no better algorithms? Or whether your data was the right kind?

Welcome to the club! All of us authors were once at those stages, looking for information that tells the stories behind the theoretical textbooks about machine learning. It turned out that much of that information was black art, not usually taught in standard textbooks. So, in a sense, we wrote this book for our younger selves: a book that not only gives a quick introduction to machine learning, but also teaches the lessons we learned during our careers in the field. We hope that it will also give you a smoother entry into one of the most exciting fields in computer science.

Machine learning and Python – a dream team

The goal of machine learning is to teach machines (software) to carry out tasks by providing them with a couple of examples (that is, examples of how to do or not do the task). Let's assume that each morning when you turn on your computer, you perform the same task of moving emails around so that only emails belonging to the same topic end up in the same folder. After some time, you might feel bored and think of automating this chore. One way would be to start analyzing your brain and write down all the rules and decisions that your brain processes while you are shuffling your emails. However, this will be quite cumbersome and always imperfect. While you will miss some rules, you will overspecify others. A better and more future-proof way of doing this would be to automate this process by choosing a set of email meta information and body/folder name pairs and letting an algorithm come up with the best rule set. The pairs would be your training data, and the resulting rule set (also called a model) could then be applied to future emails that you have not yet seen. This is machine learning in its simplest form.
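
To make this concrete, here is a minimal, hedged sketch of the idea using scikit-learn (a library we will use throughout the book); the subject lines, folder names, and expected output are invented for illustration:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical training pairs: email subject lines and their target folders
subjects = ["Meeting agenda for Monday",
            "Your invoice is attached",
            "Team meeting moved to 3pm",
            "Invoice overdue, please pay"]
folders = ["work", "billing", "work", "billing"]

# Turn the raw text into bag-of-words count vectors
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(subjects)

# Let the algorithm derive the rule set (the model) from the pairs
model = MultinomialNB()
model.fit(X, folders)

# Apply the model to an email it has never seen
print(model.predict(vectorizer.transform(["Monday meeting notes"])))
# expected: ['work']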

Of course, machine learning is not a brand new field in itself. Quite the contrary: its success in recent years can be attributed to the pragmatic way that it uses rock-solid techniques and insights from other successful fields, such as statistics. In these fields, the purpose is for us humans to get insights into the data—for example, by learning more about the underlying patterns and relationships within it. As you read more and more about successful applications of machine learning (you have checked out www.kaggle.com already, haven't you?), you will see that applied statistics is a common field among machine learning experts.

As you will see later, the process of coming up with a decent machine learning approach is never easy. Instead, you will find yourself going back and forth in your analysis, trying out different versions of your input data on diverse sets of machine learning algorithms. It is this exploratory nature that lends itself perfectly to Python. Being an interpreted, high-level programming language, it seems that Python has been designed exactly for this process of trying out different things. What is more, it even does this fast. Sure, it is slower than C or many other natively compiled programming languages. Nevertheless, with the myriad of easy-to-use libraries that are often written in C, you don't have to sacrifice speed for agility.

What the book will teach you – and what it will not

This book will give you a broad overview of what types of learning algorithms are currently most used in the diverse fields of machine learning, and what to watch out for when applying them. From our own experience, however, we know that doing the cool stuff—that is, using and tweaking machine learning algorithms, such as support vector machines, nearest neighbor searches, or ensembles thereof—will only consume a tiny fraction of the overall time that a good machine learning expert will spend doing the same thing. Looking at the following typical workflow, we can see that most of the time will be spent on rather mundane tasks:

Reading the data and cleaning it

Exploring and understanding the input data

Analyzing how to best present the data to the learning algorithm

Choosing the right model and learning algorithm

Measuring the performance correctly

When talking about exploring and understanding the input data, we will need to use a bit of statistics and basic math. However, while doing this, you will see that those topics that seemed so dry in your math class can actually be really exciting when you use them to look at interesting data.

The journey starts when you read in the data. When you have to answer questions such as, "How do I handle invalid or missing values?", you will see that this is more an art than a precise science, and a very rewarding art, as doing this part correctly will open your data to more machine learning algorithms and hence increase the likelihood of success.

With the data ready and waiting in your program's data structures, you will want to get a real feeling for what animal you are working with. Do you have enough data to answer your questions? If not, you might want to think about additional ways to get more of it. Perhaps you even have too much data. In that case, you probably want to think about how to best extract a sample of it.

Often, you will not feed the data directly into your machine learning algorithm. Instead, you will find that you can refine parts of the data before training. Oftentimes, the machine learning algorithm will reward you with increased performance. You will even find that a simple algorithm with refined data generally outperforms a very sophisticated algorithm with raw data. This part of the machine learning workflow is called feature engineering, and, most of the time, it is a very exciting and rewarding challenge. You will immediately see the results of your previous creative and intelligent efforts.

Choosing the right learning algorithm, then, is not simply a shoot-out of the three or four that are in your toolbox (there will be more; you will see). It is more a thoughtful process of weighing different performance and functional requirements. Do you need a fast result and are willing to sacrifice quality? Or would you rather spend more time to get the best possible result? Do you have a clear idea of future data, or should you be a bit more conservative on that side?

Finally, measuring performance is the part of the process that has the most potential pitfalls for the aspiring machine learner. There are easy mistakes to avoid, such as testing your approach with the same data on which you have trained. But there are more difficult ones as well, such as using imbalanced training data. Again, the data is the part that determines whether your undertaking will fail or succeed.

We see that only the fourth point deals with the fancy algorithms. Nevertheless, we hope that this book will convince you that the other four tasks are not simply chores, but can be equally exciting. Our hope is that, by the end of the book, you will have truly fallen in love with data instead of learning algorithms.

To that end, we will not overwhelm you with the theoretical aspects of the diverse machine learning algorithms, as there are already excellent books in that area (you will find pointers in the appendix). Instead, we will try to provide you with an understanding of the underlying approaches in the individual chapters—just enough for you to get an idea and be able to undertake your first steps. Hence, this book is by no means the definitive guide to machine learning—it is more of a starter kit. We hope that it ignites your curiosity enough to keep you eager in trying to learn more and more about this interesting field.

In the rest of this chapter, we will set up and get to know the basic Python libraries of NumPy and SciPy, and then train our first machine learning algorithm using scikit-learn. During this, we will introduce basic machine learning concepts that will be used throughout the book. The rest of the chapters will then go into more detail about the five steps described earlier, highlighting different aspects of machine learning in Python using diverse application scenarios.

How to best read this book

While we have tried to provide all the code required to convey the book's ideas, we don't want to bore you with repetitive code snippets. Instead, we have created self-contained Jupyter notebooks (http://jupyter.org) that you can find via Git from https://github.com/PacktPublishing/Building-Machine-Learning-Systems-with-Python-Third-edition.

If you don't have Jupyter already, simply install it with pip install jupyter and then run it with jupyter notebook. It provides a much richer experience; for example, it directly integrates charts into the notebook. Once you have cloned the Git repository of the book's code, you can simply follow along by hitting Shift + Enter. As a bonus, you will find that it has interactive widgets that let you play with the code.

What to do when you are stuck

We have tried to convey every idea necessary to reproduce the steps throughout this book. Nevertheless, there will be situations where you are stuck. The reasons might range from simple typos to odd combinations of package versions to problems in understanding.

There are many different ways to get help. Most likely, your problem will already have been raised and solved in the following excellent Q&A sites:

  • http://stats.stackexchange.com: This Q&A site, named Cross Validated, is focused on statistical problems.


  • http://stackoverflow.com: This Q&A site is much like the previous one, but with a broader focus on general programming topics. It contains, for example, more questions on some of the packages that we will use in this book, such as SciPy or Matplotlib.


  • https://freenode.net/: This hosts an IRC channel focused on machine learning topics. It is a small but very active and helpful community of machine learning experts.

As stated at the beginning, this book is intended to help you get started quickly on your machine learning journey. Therefore, we highly encourage you to build up your own list of machine learning-related blogs and check them out regularly. This is the best way to get to know what works and what doesn't.

The only blog we want to highlight right here (though there are more in the appendix) is http://blog.kaggle.com, the blog of the company Kaggle, which hosts machine learning competitions. Typically, they encourage the winners of the competitions to write down how they approached the competition, what strategies did not work, and how they arrived at the winning strategy. Even if you don't read anything else, this is a must.

Getting started

Assuming that you have Python already installed (any reasonably recent version of Python 3 should be fine), we need to install NumPy and SciPy for numerical operations, as well as Matplotlib for visualization.

Introduction to NumPy, SciPy, Matplotlib, and TensorFlow

Before we can talk about concrete machine learning algorithms, we have to talk about how best to store the data we will chew through. This is important, as the most advanced learning algorithm will not be of any help to us if it never finishes. This may be simply because the mere process of accessing the data is too slow, or maybe its representation forces the operating system to swap all day. Add to this the fact that Python is an interpreted language (a highly optimized one, though) that is slow for many numerically heavy algorithms compared to C or Fortran. So we might ask why on earth so many scientists and companies are betting their fortune on Python, even in highly computation-intensive areas.

The answer is that, in Python, it is very easy to offload number crunching tasks to the lower layer in the form of C or Fortran extensions, and that is exactly what NumPy and SciPy do (see https://scipy.org). NumPy provides the support of highly optimized multidimensional arrays, which are the basic data structure of most state-of-the-art algorithms. SciPy uses those arrays to provide a set of fast numerical recipes. Matplotlib (http://matplotlib.org) is probably the most convenient and feature-rich library to plot high-quality graphs using Python. Finally, TensorFlow is one of the leading neural network packages for Python (we will explain what this package is about in a subsequent chapter).
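
As a small illustration of this point, the following sketch compares a pure Python computation with its NumPy counterpart using the standard timeit module; the exact numbers will vary from machine to machine:

import timeit

# Sum of squares of 0..999: pure Python generator expression...
py_time = timeit.timeit('sum(x * x for x in range(1000))', number=10000)

# ...versus NumPy's vectorized dot product of an array with itself
np_time = timeit.timeit('na.dot(na)',
                        setup='import numpy as np; na = np.arange(1000)',
                        number=10000)

print('Pure Python: %.2f s, NumPy: %.2f s' % (py_time, np_time))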

Installing Python

Luckily, for all major operating systems—that is, Windows, Mac, and Linux—there are targeted installers for NumPy, SciPy, Matplotlib, and TensorFlow. If you are unsure about the installation process, you might want to install the Anaconda Python distribution (which you can access at https://www.anaconda.com/download), which is maintained and developed by Travis Oliphant, a founding contributor of SciPy. Luckily, Anaconda is already fully compatible with Python 3—the Python version we will be using throughout this book.

The main Anaconda channel comes with three flavors of TensorFlow (use the Intel channel at your own risk, as it provides an older version of TensorFlow). The main flavor, tensorflow, is compiled for all platforms and runs on the CPU. If you have a Haswell CPU or a more recent Intel one, you can use the tensorflow-mkl package. Finally, if you have an Nvidia GPU with a compute capability of 3.0 or higher, you can use tensorflow-gpu.

Chewing data efficiently with NumPy and intelligently with SciPy

Let's walk quickly through some basic NumPy examples and then take a look at what SciPy provides on top of it. On the way, we will get our feet wet with plotting using the marvelous matplotlib package.

For an in-depth explanation, you might want to take a look at some of the more interesting examples of what NumPy has to offer at https://docs.scipy.org/doc/numpy/user/quickstart.html.

You will also find the NumPy Beginner's Guide - Second Edition by Ivan Idris, from Packt Publishing, very valuable. Additional tutorial style guides can be found at http://www.scipy-lectures.org, and the official SciPy tutorial can be found at http://docs.scipy.org/doc/scipy/reference/tutorial.

In this book, we will use NumPy in version 1.13.3 and SciPy in version 1.0.0.
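
As a first taste, a few lines in the interactive shell already show the flavor of NumPy's arrays; the output shown assumes version 1.13.3, but recent versions behave the same:

>>> import numpy as np
>>> np.version.full_version
'1.13.3'
>>> a = np.array([0, 1, 2, 3, 4, 5])
>>> a.reshape((3, 2))
array([[0, 1],
       [2, 3],
       [4, 5]])
>>> a * 2
array([ 0,  2,  4,  6,  8, 10])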

Learning SciPy

On top of the efficient data structures of NumPy, SciPy offers a multitude of algorithms for working on those arrays. Whatever numerically heavy algorithm you take from current books on numerical recipes, you will most likely find support for it in SciPy in one way or another. Whether it is matrix manipulation, linear algebra, optimization, clustering, spatial operations, or even fast Fourier transformation, the toolbox is readily filled. Therefore, it is a good habit to always inspect the scipy module before you start implementing a numerical algorithm.

For convenience, the complete namespace of NumPy is also accessible via SciPy. So, from now on, we will use NumPy's machinery via the SciPy namespace. You can check this by easily comparing the function references of any base function, such as the following:

>>> import scipy, numpy
>>> scipy.version.full_version
'1.0.0'
>>> scipy.dot is numpy.dot
True

The diverse algorithms are grouped into the following toolboxes:

  • cluster: Hierarchical clustering (cluster.hierarchy) and vector quantization/K-means (cluster.vq)

  • constants: Physical and mathematical constants, and conversion methods

  • fftpack: Discrete Fourier transform algorithms

  • integrate: Integration routines

  • interpolate: Interpolation (linear, cubic, and so on)

  • io: Data input and output

  • linalg: Linear algebra routines using the optimized BLAS and LAPACK libraries

  • ndimage: n-dimensional image package

  • odr: Orthogonal distance regression

  • optimize: Optimization (finding minima and roots)

  • signal: Signal processing

  • sparse: Sparse matrices

  • spatial: Spatial data structures and algorithms

  • special: Special mathematical functions, such as Bessel or Jacobian

  • stats: Statistics toolkit

The toolboxes that are most pertinent to our goals are scipy.stats, scipy.interpolate, scipy.cluster, and scipy.signal. For the sake of brevity, we will briefly explore some features of the stats package and explain the others when they show up in the individual chapters.
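
As a tiny preview of the stats toolbox, the following interactive session shows two of its ready-made distribution functions; the numerical output is what current SciPy versions produce:

>>> from scipy import stats
>>> stats.norm.cdf(0)       # probability of drawing a value below the mean
0.5
>>> stats.norm.ppf(0.975)   # the familiar 1.96 quantile of the normal distribution
1.959963984540054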

Fundamentals of machine learning

In machine learning, what we are doing is asking a question and answering it. From the samples we have, we formulate a question; that is the learning aspect of the model. Answering the question involves applying the model to new samples.

Asking a question

If the workflow involves preprocessing the features, followed by model training, and finally model usage, then the feature-preprocessing step can be linked to the assumptions that we make when asking a question. For instance, the question can be, "Are these images of cats, knowing that cats have two ears, two eyes, a nose, a mouth, and whiskers?"

Our assumptions here are linked to how the images will be preprocessed to get the number of ears, eyes, noses, mouths, and whiskers. This data will be fed into the model during training so that we get answers.
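
As a sketch of how such assumptions turn into model inputs, the preprocessed counts might end up in a plain feature vector like the following (all numbers invented for illustration):

# One hypothetical image after preprocessing, summarized by our assumed cat features
features = {'ears': 2, 'eyes': 2, 'noses': 1, 'mouths': 1, 'whiskers': 12}

# The model never sees the raw pixels, only this numerical summary
x = [features['ears'], features['eyes'], features['noses'],
     features['mouths'], features['whiskers']]
print(x)  # [2, 2, 1, 1, 12]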

Getting answers

Once the model is trained, we use the same features to get our answer. Of course, with the question we asked earlier, if we feed in images of cats, we will get a positive answer. But if we feed in an image of a tiger, a lion, or a dog, we will also get a positive identification. So the question we asked is not, "Are these images of cats?", but really, "Are these images of cats, knowing that cats have two ears, two eyes, a nose, a mouth, and whiskers?". Our definition of a cat was wrong and led us to wrong answers.

This is where know-how and practice are important. Designing the right model to answer the question you have been asked is something that anyone can do once this important point has been understood.

Our first (tiny) application of machine learning

Let's get our hands dirty and take a look at our hypothetical web start-up, MLaaS, which sells the service of providing machine learning algorithms via HTTP. With the increasing success of our company, the demand for better infrastructure also increases so that we can serve all incoming web requests successfully. We don't want to allocate too many resources as that would be too costly. On the other hand, we will lose money if we have not reserved enough resources to serve all incoming requests. Now, the question is, when will we hit the limit of our current infrastructure, which we estimated to have a capacity of about 100,000 requests per hour? We would like to know in advance when we have to request additional servers in the cloud to serve all the incoming requests successfully without paying for unused ones.
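
The chapter's Reading in the data section walks through loading these web stats; as a preview, here is a minimal sketch of that step, assuming a tab-separated file named web_traffic.tsv with hour and request-count columns:

import numpy as np

# Each row of the (assumed) file: an hour index and the number of requests
# served in that hour; missing counts appear as NaN
data = np.genfromtxt('web_traffic.tsv', delimiter='\t')
x = data[:, 0]  # hours
y = data[:, 1]  # requests per hour

# Keep only the hours with a valid request count
valid = ~np.isnan(y)
x, y = x[valid], y[valid]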

Choosing the right model and learning algorithm

Now that we have a first impression of the data, we return to the initial question: how long will our server be able to handle the incoming web traffic? To answer this, we have to do the following:

Find the real model behind the noisy data points

Use the model to find the point in time where our infrastructure won't handle the load anymore and has to be extended

Before we build our first model

When we talk about models, you can think of them as simplified theoretical approximations of complex reality. As such, there is always some inferiority involved, also called the approximation error. This error will guide us in choosing the right model among the many choices we have. We will calculate this error as the squared distance of the model's prediction to the real data; for example, for a learned model function, f, the error is calculated as follows:

import numpy as np

def error(f, x, y):
    return np.sum((f(x) - y)**2)

The vectors x and y contain the web stats data that we extracted earlier. Here we exploit the beauty of NumPy's vectorized functions with f(x). The trained model is assumed to take a vector and return the results again as a vector of the same size, so that we can use it to calculate the difference from y.
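
To see the error function in action, here is a short sketch of the step the next sections walk through: fitting a straight line with NumPy's polyfit and measuring its error (it assumes the x and y vectors from the web stats above):

# Fit a straight line (first-order polynomial) to the x and y vectors
f1 = np.poly1d(np.polyfit(x, y, 1))

# One number summarizing how far the line is from the data; lower is better
print(error(f1, x, y))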