Discover a project-based approach to mastering machine learning concepts by applying them to everyday problems using libraries such as scikit-learn, TensorFlow, and Keras
Key Features
Book Description
Machine learning is transforming the way we understand and interact with the world around us. This book is the perfect guide for you to put your knowledge and skills into practice and use the Python ecosystem to cover key domains in machine learning. This second edition covers a range of libraries from the Python ecosystem, including TensorFlow and Keras, to help you implement real-world machine learning projects.
The book begins by giving you an overview of machine learning with Python. With the help of complex datasets and optimized techniques, you'll go on to understand how to apply advanced concepts and popular machine learning algorithms to real-world projects. Next, you'll cover projects from domains such as predictive analytics to analyze the stock market and recommendation systems for GitHub repositories. In addition to this, you'll also work on projects from the NLP domain to create a custom news feed using frameworks such as scikit-learn, TensorFlow, and Keras. Following this, you'll learn how to build an advanced chatbot, and scale things up using PySpark. In the concluding chapters, you can look forward to exciting insights into deep learning and you'll even create an application using computer vision and neural networks.
By the end of this book, you'll be able to analyze data seamlessly and make a powerful impact through your projects.
What you will learn
Who this book is for
This book is for machine learning practitioners, data scientists, and deep learning enthusiasts who want to take their machine learning skills to the next level by building real-world projects. The intermediate-level guide will help you to implement libraries from the Python ecosystem to build a variety of projects addressing various machine learning domains. Knowledge of Python programming and machine learning concepts will be helpful.
You can read this e-book in Legimi apps or in any app that supports the following format:
Page count: 294
Year of publication: 2019
Copyright © 2019 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the authors, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
Commissioning Editor: Sunith Shetty
Acquisition Editor: Varsha Shetty
Content Development Editor: Snehal Kolte
Technical Editor: Naveen Sharma
Copy Editor: Safis Editing
Project Coordinator: Manthan Patel
Proofreader: Safis Editing
Indexer: Mariammal Chettiyar
Graphics: Jisha Chirayil
Production Coordinator: Arvindkumar Gupta
First published: July 2016
Second edition: January 2019
Production reference: 1310119
Published by Packt Publishing Ltd. Livery Place 35 Livery Street Birmingham B3 2PB, UK.
ISBN 978-1-78899-417-0
www.packtpub.com
Mapt is an online digital library that gives you full access to over 5,000 books and videos, as well as industry leading tools to help you plan your personal development and advance your career. For more information, please visit our website.
Spend less time learning and more time coding with practical eBooks and Videos from over 4,000 industry professionals
Improve your learning with Skill Plans built especially for you
Get a free eBook or video every month
Mapt is fully searchable
Copy and paste, print, and bookmark content
Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.packt.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at [email protected] for more details.
At www.packt.com, you can also read a collection of free technical articles, sign up for a range of free newsletters, and receive exclusive discounts and offers on Packt books and eBooks.
Alexander Combs is an experienced data scientist, strategist, and developer with a background in financial data extraction, natural language processing and generation, and quantitative and statistical modeling. He currently lives and works in New York City.
Michael Roman is a data scientist at The Atlantic, where he designs, tests, analyzes, and productionizes machine learning models to address a range of business topics. Prior to this he was an associate instructor at a full-time data science immersive program in New York City. His interests include computer vision, propensity modeling, natural language processing, and entrepreneurship.
Saurabh Chhajed is a machine learning and big data engineer with 9 years of professional experience in the enterprise application development life cycle using the latest frameworks, tools, and design patterns. He has experience in designing and implementing some of the most widely used and scalable customer-facing recommendation systems, making extensive use of the big data ecosystem across batch, real-time, and machine learning pipelines. He has also worked for some of the largest investment banks, credit card companies, and manufacturing companies around the world, implementing a range of robust and scalable product suites.
If you're interested in becoming an author for Packt, please visit authors.packtpub.com and apply today. We have worked with thousands of developers and tech professionals, just like you, to help them share their insight with the global tech community. You can make a general application, apply for a specific hot topic that we are recruiting an author for, or submit your own idea.
Title Page
Copyright and Credits
Python Machine Learning Blueprints Second Edition
About Packt
Why subscribe?
Packt.com
Contributors
About the authors
About the reviewer
Packt is searching for authors like you
Preface
Who this book is for
What this book covers
To get the most out of this book
Download the example code files
Download the color images
Conventions used
Get in touch
Reviews
The Python Machine Learning Ecosystem
Data science/machine learning workflow
Acquisition
Inspection
Preparation
Modeling
Evaluation
Deployment
Python libraries and functions for each stage of the data science workflow
Acquisition
Inspection
The Jupyter Notebook
Pandas
Visualization
The matplotlib library
The seaborn library
Preparation
map
apply
applymap
groupby
Modeling and evaluation
Statsmodels
Scikit-learn
Deployment
Setting up your machine learning environment
Summary
Build an App to Find Underpriced Apartments
Sourcing apartment listing data
Pulling down listing data
Pulling out the individual data points
Parsing data
Inspecting and preparing the data
Sneak-peek at the data types
Visualizing our data
Visualizing the data
Modeling the data
Forecasting
Extending the model
Summary
Build an App to Find Cheap Airfares
Sourcing airfare pricing data
Retrieving fare data with advanced web scraping
Creating a link
Parsing the DOM to extract pricing data
Parsing
Identifying outlier fares with anomaly detection techniques
Sending real-time alerts using IFTTT
Putting it all together
Summary
Forecast the IPO Market Using Logistic Regression
The IPO market
What is an IPO?
Recent IPO market performance
Working on the DataFrame
Analyzing the data
Summarizing the performance of the stocks
Baseline IPO strategy
Data cleansing and feature engineering
Adding features to influence the performance of an IPO
Binary classification with logistic regression
Creating the target for our model
Dummy coding
Examining the model performance
Generating the importance of a feature from our model 
Random forest classifier method
Summary
Create a Custom Newsfeed
Creating a supervised training set with Pocket
Installing the Pocket Chrome Extension
Using the Pocket API to retrieve stories
Using the Embedly API to download story bodies
Basics of Natural Language Processing
Support Vector Machines
IFTTT integration with feeds, Google Sheets, and email
Setting up news feeds and Google Sheets through IFTTT
Setting up your daily personal newsletter
Summary
Predict whether Your Content Will Go Viral
What does research tell us about virality?
Sourcing shared counts and content
Exploring the features of shareability
Exploring image data
Clustering
Exploring the headlines
Exploring the story content
Building a predictive content scoring model
Evaluating the model
Adding new features to our model
Summary
Use Machine Learning to Forecast the Stock Market
Types of market analysis
What does research tell us about the stock market?
So, what exactly is a momentum strategy?
How to develop a trading strategy
Analysis of the data
Volatility of the returns
Daily returns
Statistics for the strategies
The mystery strategy
Building the regression model
Performance of the model
Dynamic time warping
Evaluating our trades
Summary
Classifying Images with Convolutional Neural Networks
Image-feature extraction
Convolutional neural networks
Network topology
Convolutional layers and filters
Max pooling layers
Flattening
Fully-connected layers and output
Building a convolutional neural network to classify images in the Zalando Research dataset, using Keras
Summary
Building a Chatbot
The Turing Test
The history of chatbots
The design of chatbots
Building a chatbot
Sequence-to-sequence modeling for chatbots
Summary
Build a Recommendation Engine
Collaborative filtering
So, what's collaborative filtering?
Predicting the rating for the product
Content-based filtering
Hybrid systems
Collaborative filtering
Content-based filtering
Building a recommendation engine
Summary
What's Next?
Summary of the projects
Summary
Other Books You May Enjoy
Leave a review - let other readers know what you think
Machine learning is transforming the way we understand and interact with the world around us. This book is the perfect guide for you to put your knowledge and skills into practice and use the Python ecosystem to cover the key domains in machine learning. This second edition covers a range of libraries from the Python ecosystem, including TensorFlow and Keras, to help you implement real-world machine learning projects.
The book begins by giving you an overview of machine learning with Python. With the help of complex datasets and optimized techniques, you'll learn how to apply advanced concepts and popular machine learning algorithms to real-world projects. Next, you'll cover projects in domains such as predictive analytics to analyze the stock market, and recommendation systems for GitHub repositories. In addition to this, you'll also work on projects from the NLP domain to create a custom news feed using frameworks such as scikit-learn, TensorFlow, and Keras. Following this, you'll learn how to build an advanced chatbot, and scale things up using PySpark. In the concluding chapters, you can look forward to exciting insights into deep learning and even create an application using computer vision and neural networks.
By the end of this book, you'll be able to analyze data seamlessly and make a powerful impact through your projects.
This book is for machine learning practitioners, data scientists, and deep learning enthusiasts who want to take their machine learning skills to the next level by building real-world projects. This intermediate-level guide will help you to implement libraries from the Python ecosystem to build a variety of projects addressing various machine learning domains.
Chapter 1, The Python Machine Learning Ecosystem, discusses the features of key libraries and explains how to prepare your environment to best utilize them.
Chapter 2, Build an App to Find Underpriced Apartments, explains how to create a machine learning application that will make finding the right apartment a little bit easier.
Chapter 3, Build an App to Find Cheap Airfares, covers how to build an application that continually monitors fare pricing, checking for anomalous prices that will generate an alert we can quickly act on.
Chapter 4, Forecast the IPO Market Using Logistic Regression, takes a closer look at the IPO market. We'll see how we can use machine learning to help us decide which IPOs are worth a closer look and which ones we may want to take a pass on.
Chapter 5, Create a Custom Newsfeed, explains how to build a system that understands your taste in news, and will send you a personally tailored newsletter each day.
Chapter 6, Predict whether Your Content Will Go Viral, tries to unravel some of the mysteries of virality. We'll examine some of the most commonly shared content and attempt to find the common elements that differentiate it from content people were less willing to share.
Chapter 7, Use Machine Learning to Forecast the Stock Market, discusses how to build and test a trading strategy. We'll spend more time, however, on how not to do it.
Chapter 8, Classifying Images with Convolutional Neural Networks, details the process of creating a computer vision application using deep learning.
Chapter 9, Building a Chatbot, explains how to construct a chatbot from scratch. Along the way, we'll learn more about the history of the field and its future prospects.
Chapter 10, Build a Recommendation Engine, explores the different varieties of recommendation systems. We'll see how they're implemented commercially and how they work. Finally, we'll implement our own recommendation engine for finding GitHub repositories.
Chapter 11, What's Next?, summarizes what has been covered in this book and what the next steps are from this point on. You will learn how to apply the skills you have gained to other projects, deal with the real-life challenges of building and deploying machine learning models, and explore other common technologies that data scientists frequently use.
Knowledge of Python programming and machine learning concepts will be helpful.
You can download the example code files for this book from your account at www.packt.com. If you purchased this book elsewhere, you can visit www.packt.com/support and register to have the files emailed directly to you.
You can download the code files by following these steps:
1. Log in or register at www.packt.com.
2. Select the SUPPORT tab.
3. Click on Code Downloads & Errata.
4. Enter the name of the book in the Search box and follow the onscreen instructions.
Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:
WinRAR/7-Zip for Windows
Zipeg/iZip/UnRarX for Mac
7-Zip/PeaZip for Linux
The code bundle for the book is also hosted on GitHub at https://github.com/PacktPublishing/Python-Machine-Learning-Blueprints-Second-Edition. In case there's an update to the code, it will be updated on the existing GitHub repository.
We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!
We also provide a PDF file that has color images of the screenshots/diagrams used in this book. You can download it here: http://www.packtpub.com/sites/default/files/downloads/9781788994170_ColorImages.pdf.
Feedback from our readers is always welcome.
General feedback: If you have questions about any aspect of this book, mention the book title in the subject of your message and email us at [email protected].
Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you would report this to us. Please visit www.packt.com/submit-errata, select your book, click on the Errata Submission Form link, and enter the details.
Piracy: If you come across any illegal copies of our works in any form on the Internet, we would be grateful if you would provide us with the location address or website name. Please contact us at [email protected] with a link to the material.
If you are interested in becoming an author: If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, please visit authors.packtpub.com.
Please leave a review. Once you have read and used this book, why not leave a review on the site that you purchased it from? Potential readers can then see and use your unbiased opinion to make purchase decisions, we at Packt can understand what you think about our products, and our authors can see your feedback on their book. Thank you!
For more information about Packt, please visit packt.com.
Machine learning is rapidly changing our world. It is the centerpiece of artificial intelligence, and it is difficult to go a day without reading about how it will lead us either into a techno-utopia along the lines of the Singularity, or into some sort of global Blade Runner-esque nightmare scenario. While pundits may enjoy discussing these hyperbolic futures, the more mundane reality is that machine learning is rapidly becoming a fixture of our daily lives. Through subtle but steady improvements in how we interact with computers and the world around us, it is progressively making our lives better.
If you shop at online retailers such as Amazon.com, use streaming music or movie services such as Spotify or Netflix, or have even just done a Google search, you have encountered an application that utilizes machine learning. These services collect vast amounts of data—much of it from their users—that is used to build models that improve the user experience.
It's an ideal time to dive into developing machine learning applications, and, as you will discover, Python is an ideal choice with which to develop them. Python has a deep and active developer community, many with roots in the scientific community. This heritage has provided Python with an unparalleled array of libraries for scientific computing. In this book, we will discuss and use a number of the libraries included in this Python Scientific Stack.
In the chapters that follow, we'll learn how to build a wide variety of machine learning applications step by step. Before we begin in earnest though, we'll spend the remainder of this chapter discussing the features of these key libraries and how to prepare your environment to best utilize them.
These are the topics that will be covered in this chapter:
The data science/machine learning workflow
Libraries for each stage of the workflow
Setting up your environment
Building machine learning applications, while similar in many respects to the standard engineering paradigm, differs in one crucial aspect: the need to work with data as a raw material. The success of your project will, in large part, depend on the quality of the data you acquire, as well as your handling of that data. And because working with data falls into the domain of data science, it is helpful to understand the data science workflow:
The process involves these six steps in the following order:
Acquisition
Inspection
Preparation
Modeling
Evaluation
Deployment
Frequently, there is a need to circle back to prior steps, such as when inspecting and preparing the data, or when evaluating and modeling, but the process at a high level can be as described in the preceding list.
Let's now discuss each step in detail.
Data for machine learning applications can come from any number of sources; it may be emailed to you as a CSV file, it may come from pulling down server logs, or it may require building a custom web scraper. Data is also likely to exist in any number of formats. In most cases, you will be dealing with text-based data, but, as we'll see, machine learning applications can just as easily be built around images or even video files. Regardless of the format, once you have secured the data, it is crucial that you understand what's in the data, as well as what isn't.
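In the simplest case, acquisition might be nothing more than reading a CSV file into pandas, as in this minimal sketch (the file name here is purely hypothetical):

import pandas as pd

# Hypothetical file; the data could just as well come from server logs or a scraper
df = pd.read_csv('listings.csv')

# A quick first look at what's in the data, and what isn't
print(df.shape)
print(df.dtypes)
print(df.isnull().sum())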
Once you have acquired your data, the next step is to inspect it. The primary goal at this stage is to sanity check the data, and the best way to accomplish this is to look for things that are either impossible or highly unlikely. As an example, if the data has a unique identifier, check to see that there is indeed only one; if the data is price-based, check that it is always positive; and whatever the data type, check the most extreme cases. Do they make sense? A good practice is to run some simple statistical tests on the data, and visualize it. The outcome of your models is only as good as the data you put in, so it is crucial to get this step right.
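A few of these sanity checks can be sketched with pandas, continuing with the hypothetical df from the previous step and assuming it has id and price columns:

# The unique identifier should actually be unique
print(df['id'].is_unique)

# Price-based data should always be positive
print((df['price'] <= 0).sum(), 'rows with non-positive prices')

# Run some simple statistics and look at the most extreme cases
print(df['price'].describe())
print(df.nlargest(5, 'price'))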
When you are confident you have your data in order, next you will need to prepare it by placing it in a format that is amenable to modeling. This stage encompasses a number of processes, such as filtering, aggregating, imputing, and transforming. The type of actions you need to take will be highly dependent on the type of data you're working with, as well as the libraries and algorithms you will be utilizing. For example, if you are working with natural language-based texts, the transformations required will be very different from those required for time-series data. We'll see a number of examples of these types of transformations throughout the book.
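The details vary by project, but the kinds of operations involved might look like this rough sketch (the column names are again hypothetical):

# Filtering: keep only rows that satisfy a condition
df = df[df['price'] > 0]

# Imputing: fill missing values with the column median
df['sqft'] = df['sqft'].fillna(df['sqft'].median())

# Transforming: map a categorical column to numeric codes
df['borough_code'] = df['borough'].map({'Manhattan': 0, 'Brooklyn': 1, 'Queens': 2})

# Aggregating: average price per borough
print(df.groupby('borough')['price'].mean())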
Once the data preparation is complete, the next phase is modeling. Here, you will be selecting an appropriate algorithm and using the data to train your model. There are a number of best practices to adhere to during this stage, and we will discuss them in detail, but the basic steps involve splitting your data into training, testing, and validation sets. This splitting up of the data may seem illogical—especially when more data typically yields better models—but as we'll see, doing this allows us to get better feedback on how the model will perform in the real world, and prevents us from the cardinal sin of modeling: overfitting. We will talk more about this in later chapters.
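As a minimal, self-contained sketch of that splitting step with scikit-learn, using the iris dataset as a stand-in for your own features and target:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Hold out 20% of the data so we can gauge real-world performance later
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

model = LogisticRegression(max_iter=200)
model.fit(X_train, y_train)
print(model.score(X_test, y_test))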
So, now you've got a shiny new model, but exactly how good is that model? This is the question that the evaluation phase seeks to answer. There are a number of ways to measure the performance of a model, and again it is largely dependent on the type of data you are working with and the type of model used, but on the whole, we are seeking to answer the question of how close the model's predictions are to the actual values. There is an array of confusing-sounding terms, such as root-mean-square error, Euclidean distance, or F1 score, but in the end, they are all just measures of the distance between the actual values and the predicted values.
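With scikit-learn, each of these measures is a single function call; here is a small sketch using made-up numbers:

import numpy as np
from sklearn.metrics import f1_score, mean_squared_error

# Regression-style evaluation: root-mean-square error
y_true = np.array([3.0, 2.5, 4.0])
y_pred = np.array([2.8, 2.7, 3.6])
print(np.sqrt(mean_squared_error(y_true, y_pred)))

# Classification-style evaluation: F1 score
print(f1_score([0, 1, 1, 0, 1], [0, 1, 0, 0, 1]))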
Once you are comfortable with the performance of your model, you'll want to deploy it. This can take a number of forms depending on the use case, but common scenarios include utilization as a feature within another larger application, a bespoke web application, or even just a simple cron job.
Now that you have an understanding of each step in the data science workflow, we'll take a look at a selection of useful Python libraries and functions within those libraries for each step.
Because inspecting your data is such a critical step in the development of machine learning applications, we'll now take an in-depth look at several libraries that will serve you well in this task.
There are a number of libraries that will make the data inspection process easier. The first is Jupyter Notebook with IPython (http://ipython.org/). This is a fully-fledged, interactive computing environment, and it is ideal for data exploration. Unlike most development environments, Jupyter Notebook is a web-based frontend (to the IPython kernel) that is divided into individual code blocks or cells. Cells can be run individually or all at once, depending on the need. This allows the developer to run a scenario, see the output, then step back through the code, make adjustments, and see the resulting changes—all without leaving the notebook. Here is a sample interaction in the Jupyter Notebook:
You will notice that we have done a number of things here and have interacted with not only the IPython backend, but the terminal shell as well. Here, I have imported the Python os library and made a call to find the current working directory (cell #2), which you can see is the output below my input code cell. I then changed directories using the os library in cell #3, but stopped utilizing the os library and began using Linux-based commands in cell #4. This is done by prefixing the command with !. In cell #6, you can see that I was even able to save the shell output to a Python variable (file_two). This is a great feature that makes file operations simple.
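Roughly, the cells described above look like the following sketch (the directory path and file listing are hypothetical):

# Cell 2: find the current working directory with the os library
import os
os.getcwd()

# Cell 3: change directories using the os library
os.chdir('/tmp/example_data')  # hypothetical path

# Cell 4: switch to a Linux-based command by prefixing it with !
!pwd

# Cell 6: save shell output to a Python variable
file_two = !ls | head -2
print(file_two)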
Now, let's take a look at some simple data operations using the notebook. This will also be our first introduction to another indispensable library, pandas.
Pandas is a remarkable tool for data analysis that aims to be the most powerful and flexible open source data analysis/manipulation tool available in any language. And, as you will soon see, if it doesn't already live up to this claim, it can't be too far off. Let's now take a look:
You can see from the preceding screenshot that I have imported a classic machine learning dataset, the iris dataset (also available at https://archive.ics.uci.edu/ml/datasets/Iris), using scikit-learn, a library we'll examine in detail later. I then passed the data into a pandas DataFrame, making sure to assign the column headers. One DataFrame contains flower measurement data, and the other DataFrame contains a number that represents the iris species. This is coded 0, 1, and 2 for setosa, versicolor, and virginica respectively. I then concatenated the two DataFrames.
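The loading step described above can be sketched as follows (the exact column labels shown in the book's screenshot may differ slightly):

from sklearn import datasets
import pandas as pd

iris = datasets.load_iris()

# Flower measurements, with the feature names as column headers
df = pd.DataFrame(iris.data, columns=iris.feature_names)

# Species coded 0, 1, and 2 for setosa, versicolor, and virginica
y = pd.DataFrame(iris.target, columns=['species'])

# Concatenate the two DataFrames column-wise
df = pd.concat([df, y], axis=1)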
For working with datasets that will fit on a single machine, pandas is the ultimate tool; you can think of it a bit like Excel on steroids. And, like the popular spreadsheet program, the basic units of operation are columns and rows of data that form tables. In the terminology of pandas, columns of data are series and the table is a DataFrame.
Using the same iris DataFrame we loaded previously, let's now take a look at a few common operations, including the following:
The first action was just to use the .head() command to get the first five rows. The second command was to select a single column from the DataFrame by referencing it by its column name. Another way we perform this data slicing is to use the .iloc[row,column] or .loc[row,column] notation. The former slices data by integer position for both rows and columns (positional indexing), while the latter selects by label; with the default integer index, the row labels look like positions, but the columns are referenced by name (label-based indexing).
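Sketched with the df built above (the exact column name is an assumption):

# First five rows
df.head()

# Select a single column by its name
df['sepal width (cm)']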
Let's select the first two columns and the first four rows using the .iloc notation. We'll then look at the .loc notation:
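A sketch of that selection, using the same df:

# First four rows and first two columns, by position
df.iloc[:4, :2]

# The equivalent selection with .loc, referencing the columns by name
df.loc[:3, ['sepal length (cm)', 'sepal width (cm)']]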
Using the .iloc notation and the Python list slicing syntax, we were able to select a slice of this DataFrame.
Now, let's try something more advanced. We'll use a list iterator to select just the width feature columns:
What we have done here is create a list that is a subset of all columns. df.columns returns a list of all columns, and our iteration uses a conditional statement to select only those with width in the title. Obviously, in this situation, we could have just as easily typed out the columns we wanted into a list, but this gives you a sense of the power available when dealing with much larger datasets.
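That iteration might look like this:

# Keep only the columns whose name contains 'width'
width_cols = [col for col in df.columns if 'width' in col]
df[width_cols]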
We've seen how to select slices based on their position within the DataFrame, but let's now look at another method to select data. This time, we will select a subset of the data based upon satisfying conditions that we specify:
Let's now see the unique list of species available, and select just one of those:
In the far-right column, you will notice that our DataFrame now only contains data for the Iris-virginica species (represented by the 2). In fact, the size of the DataFrame is now 50 rows, down from the original 150 rows:
You can also see that the index on the left retains the original row numbers. If we wanted to save just this data, we could save it as a new DataFrame, and reset the index as shown in the following diagram:
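Roughly, the steps just described look like this, continuing with the same df:

# The unique species codes present in the data
df['species'].unique()

# Select only Iris-virginica (coded as 2); 50 of the original 150 rows remain
virginica = df[df['species'] == 2]

# Save the selection as a new DataFrame and reset the index
virginica = virginica.reset_index(drop=True)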
We have selected data by placing a condition on one column; let's now add more conditions. We'll go back to our original DataFrame and add two conditions:
The DataFrame now only includes data from the virginica species with a petal width greater than 2.2.
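Combining the two conditions might look like this:

# Virginica flowers with a petal width greater than 2.2
df[(df['species'] == 2) & (df['petal width (cm)'] > 2.2)]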
Let's now move on to using pandas to get some quick descriptive statistics from our iris dataset:
With a call to the .describe() function, I have received a breakdown of the descriptive statistics for each of the relevant columns. (Notice that species was automatically removed as it is not relevant for this.) I could also pass in my own percentiles if I wanted more granular information:
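A sketch of both calls, using the same df:

# Standard descriptive statistics
df.describe()

# Custom percentiles for more granular information
df.describe(percentiles=[.20, .40, .80, .90, .95])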
Next, let's check whether there is any correlation between these features. That can be done by calling .corr() on our DataFrame:
The default returns the Pearson correlation coefficient for each pair of columns. This can be switched to Kendall's Tau or Spearman's rank correlation coefficient by passing in a method argument (for example, .corr(method="spearman") or .corr(method="kendall")).
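For example, with the same df:

# Pearson correlation (the default) between each pair of columns
df.corr()

# Spearman's rank correlation coefficient instead
df.corr(method='spearman')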
