Discover a project-based approach to mastering machine learning concepts by applying them to everyday problems using libraries such as scikit-learn, TensorFlow, and Keras
Key Features
Book Description
Machine learning is transforming the way we understand and interact with the world around us. This book is the perfect guide for you to put your knowledge and skills into practice and use the Python ecosystem to cover key domains in machine learning. This second edition covers a range of libraries from the Python ecosystem, including TensorFlow and Keras, to help you implement real-world machine learning projects.
The book begins by giving you an overview of machine learning with Python. With the help of complex datasets and optimized techniques, you'll go on to understand how to apply advanced concepts and popular machine learning algorithms to real-world projects. Next, you'll cover projects from domains such as predictive analytics to analyze the stock market and recommendation systems for GitHub repositories. In addition to this, you'll also work on projects from the NLP domain to create a custom news feed using frameworks such as scikit-learn, TensorFlow, and Keras. Following this, you'll learn how to build an advanced chatbot, and scale things up using PySpark. In the concluding chapters, you can look forward to exciting insights into deep learning and you'll even create an application using computer vision and neural networks.
By the end of this book, you'll be able to analyze data seamlessly and make a powerful impact through your projects.
What you will learn
Who this book is for
This book is for machine learning practitioners, data scientists, and deep learning enthusiasts who want to take their machine learning skills to the next level by building real-world projects. The intermediate-level guide will help you to implement libraries from the Python ecosystem to build a variety of projects addressing various machine learning domains. Knowledge of Python programming and machine learning concepts will be helpful.
You can read this e-book in Legimi apps or in any app that supports the following format:
Page count: 294
Year of publication: 2019
Copyright © 2019 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the authors, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
Commissioning Editor: Sunith Shetty
Acquisition Editor: Varsha Shetty
Content Development Editor: Snehal Kolte
Technical Editor: Naveen Sharma
Copy Editor: Safis Editing
Project Coordinator: Manthan Patel
Proofreader: Safis Editing
Indexer: Mariammal Chettiyar
Graphics: Jisha Chirayil
Production Coordinator: Arvindkumar Gupta
First published: July 2016
Second edition: January 2019
Production reference: 1310119
Published by Packt Publishing Ltd. Livery Place 35 Livery Street Birmingham B3 2PB, UK.
ISBN 978-1-78899-417-0
www.packtpub.com
Mapt is an online digital library that gives you full access to over 5,000 books and videos, as well as industry leading tools to help you plan your personal development and advance your career. For more information, please visit our website.
Spend less time learning and more time coding with practical eBooks and Videos from over 4,000 industry professionals
Improve your learning with Skill Plans built especially for you
Get a free eBook or video every month
Mapt is fully searchable
Copy and paste, print, and bookmark content
Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.packt.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at [email protected] for more details.
At www.packt.com, you can also read a collection of free technical articles, sign up for a range of free newsletters, and receive exclusive discounts and offers on Packt books and eBooks.
Alexander Combs is an experienced data scientist, strategist, and developer with a background in financial data extraction, natural language processing and generation, and quantitative and statistical modeling. He currently lives and works in New York City.
Michael Roman is a data scientist at The Atlantic, where he designs, tests, analyzes, and productionizes machine learning models to address a range of business topics. Prior to this he was an associate instructor at a full-time data science immersive program in New York City. His interests include computer vision, propensity modeling, natural language processing, and entrepreneurship.
Saurabh Chhajed is a machine learning and big data engineer with 9 years of professional experience in the enterprise application development life cycle using the latest frameworks, tools, and design patterns. He has experience in designing and implementing some of the most widely used and scalable customer-facing recommendation systems, making extensive use of the big data ecosystem across batch, real-time, and machine learning pipelines. He has also worked for some of the largest investment banks, credit card companies, and manufacturing companies around the world, implementing a range of robust and scalable product suites.
If you're interested in becoming an author for Packt, please visit authors.packtpub.com and apply today. We have worked with thousands of developers and tech professionals, just like you, to help them share their insight with the global tech community. You can make a general application, apply for a specific hot topic that we are recruiting an author for, or submit your own idea.
Title Page
Copyright and Credits
Python Machine Learning Blueprints Second Edition
About Packt
Why subscribe?
Packt.com
Contributors
About the authors
About the reviewer
Packt is searching for authors like you
Preface
Who this book is for
What this book covers
To get the most out of this book
Download the example code files
Download the color images
Conventions used
Get in touch
Reviews
The Python Machine Learning Ecosystem
Data science/machine learning workflow
Acquisition
Inspection
Preparation
Modeling
Evaluation
Deployment
Python libraries and functions for each stage of the data science workflow
Acquisition
Inspection
The Jupyter Notebook
Pandas
Visualization
The matplotlib library
The seaborn library
Preparation
map
apply
applymap
groupby
Modeling and evaluation
Statsmodels
Scikit-learn
Deployment
Setting up your machine learning environment
Summary
Build an App to Find Underpriced Apartments
Sourcing apartment listing data
Pulling down listing data
Pulling out the individual data points
Parsing data
Inspecting and preparing the data
Sneak-peek at the data types
Visualizing our data
Visualizing the data
Modeling the data
Forecasting
Extending the model
Summary
Build an App to Find Cheap Airfares
Sourcing airfare pricing data
Retrieving fare data with advanced web scraping
Creating a link
Parsing the DOM to extract pricing data
Parsing
Identifying outlier fares with anomaly detection techniques
Sending real-time alerts using IFTTT
Putting it all together
Summary
Forecast the IPO Market Using Logistic Regression
The IPO market
What is an IPO?
Recent IPO market performance
Working on the DataFrame
Analyzing the data
Summarizing the performance of the stocks
Baseline IPO strategy
Data cleansing and feature engineering
Adding features to influence the performance of an IPO
Binary classification with logistic regression
Creating the target for our model
Dummy coding
Examining the model performance
Generating the importance of a feature from our model 
Random forest classifier method
Summary
Create a Custom Newsfeed
Creating a supervised training set with Pocket
Installing the Pocket Chrome Extension
Using the Pocket API to retrieve stories
Using the Embedly API to download story bodies
Basics of Natural Language Processing
Support Vector Machines
IFTTT integration with feeds, Google Sheets, and email
Setting up news feeds and Google Sheets through IFTTT
Setting up your daily personal newsletter
Summary
Predict whether Your Content Will Go Viral
What does research tell us about virality?
Sourcing shared counts and content
Exploring the features of shareability
Exploring image data
Clustering
Exploring the headlines
Exploring the story content
Building a predictive content scoring model
Evaluating the model
Adding new features to our model
Summary
Use Machine Learning to Forecast the Stock Market
Types of market analysis
What does research tell us about the stock market?
So, what exactly is a momentum strategy?
How to develop a trading strategy
Analysis of the data
Volatility of the returns
Daily returns
Statistics for the strategies
The mystery strategy
Building the regression model
Performance of the model
Dynamic time warping
Evaluating our trades
Summary
Classifying Images with Convolutional Neural Networks
Image-feature extraction
Convolutional neural networks
Network topology
Convolutional layers and filters
Max pooling layers
Flattening
Fully-connected layers and output
Building a convolutional neural network to classify images in the Zalando Research dataset, using Keras
Summary
Building a Chatbot
The Turing Test
The history of chatbots
The design of chatbots
Building a chatbot
Sequence-to-sequence modeling for chatbots
Summary
Build a Recommendation Engine
Collaborative filtering
So, what's collaborative filtering?
Predicting the rating for the product
Content-based filtering
Hybrid systems
Collaborative filtering
Content-based filtering
Building a recommendation engine
Summary
What's Next?
Summary of the projects
Summary
Other Books You May Enjoy
Leave a review - let other readers know what you think
Machine learning is transforming the way we understand and interact with the world around us. This book is the perfect guide for you to put your knowledge and skills into practice and use the Python ecosystem to cover the key domains in machine learning. This second edition covers a range of libraries from the Python ecosystem, including TensorFlow and Keras, to help you implement real-world machine learning projects.
The book begins by giving you an overview of machine learning with Python. With the help of complex datasets and optimized techniques, you'll learn how to apply advanced concepts and popular machine learning algorithms to real-world projects. Next, you'll cover projects in domains such as predictive analytics to analyze the stock market, and recommendation systems for GitHub repositories. In addition to this, you'll also work on projects from the NLP domain to create a custom news feed using frameworks such as scikit-learn, TensorFlow, and Keras. Following this, you'll learn how to build an advanced chatbot, and scale things up using PySpark. In the concluding chapters, you can look forward to exciting insights into deep learning and even create an application using computer vision and neural networks.
By the end of this book, you'll be able to analyze data seamlessly and make a powerful impact through your projects.
This book is for machine learning practitioners, data scientists, and deep learning enthusiasts who want to take their machine learning skills to the next level by building real-world projects. This intermediate-level guide will help you to implement libraries from the Python ecosystem to build a variety of projects addressing various machine learning domains.
Chapter 1, The Python Machine Learning Ecosystem, discusses the features of key libraries and explains how to prepare your environment to best utilize them.
Chapter 2, Build an App to Find Underpriced Apartments, explains how to create a machine learning application that will make finding the right apartment a little bit easier.
Chapter 3, Build an App to Find Cheap Airfares, covers how to build an application that continually monitors fare pricing, checking for anomalous prices that will generate an alert we can quickly act on.
Chapter 4, Forecast the IPO Market Using Logistic Regression, takes a closer look at the IPO market. We'll see how we can use machine learning to help us decide which IPOs are worth a closer look and which ones we may want to take a pass on.
Chapter 5, Create a Custom Newsfeed, explains how to build a system that understands your taste in news, and will send you a personally tailored newsletter each day.
Chapter 6, Predict whether Your Content Will Go Viral, tries to unravel some of the mysteries of virality. We'll examine some of the most commonly shared content and attempt to find the common elements that differentiate it from content people were less willing to share.
Chapter 7, Use Machine Learning to Forecast the Stock Market, discusses how to build and test a trading strategy. We'll spend more time, however, on how not to do it.
Chapter 8, Classifying Images with Convolutional Neural Networks, details the process of creating a computer vision application using deep learning.
Chapter 9, Building a Chatbot, explains how to construct a chatbot from scratch. Along the way, we'll learn more about the history of the field and its future prospects.
Chapter 10, Build a Recommendation Engine, explores the different varieties of recommendation systems. We'll see how they're implemented commercially and how they work. Finally, we'll implement our own recommendation engine for finding GitHub repositories.
Chapter 11, What's Next?, summarizes what has been covered in this book and what the next steps are from this point on. You will learn how to apply the skills you have gained to other projects, deal with the real-life challenges of building and deploying machine learning models, and explore other common technologies that data scientists frequently use.
Knowledge of Python programming and machine learning concepts will be helpful.
You can download the example code files for this book from your account at www.packt.com. If you purchased this book elsewhere, you can visit www.packt.com/support and register to have the files emailed directly to you.
You can download the code files by following these steps:
1. Log in or register at www.packt.com.
2. Select the SUPPORT tab.
3. Click on Code Downloads & Errata.
4. Enter the name of the book in the Search box and follow the onscreen instructions.
Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:
WinRAR/7-Zip for Windows
Zipeg/iZip/UnRarX for Mac
7-Zip/PeaZip for Linux
The code bundle for the book is also hosted on GitHub at https://github.com/PacktPublishing/Python-Machine-Learning-Blueprints-Second-Edition. In case there's an update to the code, it will be updated on the existing GitHub repository.
We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!
We also provide a PDF file that has color images of the screenshots/diagrams used in this book. You can download it here: http://www.packtpub.com/sites/default/files/downloads/9781788994170_ColorImages.pdf.
Feedback from our readers is always welcome.
General feedback: If you have questions about any aspect of this book, mention the book title in the subject of your message and email us at [email protected].
Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you would report this to us. Please visit www.packt.com/submit-errata, select your book, click on the Errata Submission Form link, and enter the details.
Piracy: If you come across any illegal copies of our works in any form on the Internet, we would be grateful if you would provide us with the location address or website name. Please contact us at [email protected] with a link to the material.
If you are interested in becoming an author: If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, please visit authors.packtpub.com.
Please leave a review. Once you have read and used this book, why not leave a review on the site that you purchased it from? Potential readers can then see and use your unbiased opinion to make purchase decisions, we at Packt can understand what you think about our products, and our authors can see your feedback on their book. Thank you!
For more information about Packt, please visit packt.com.
Machine learning is rapidly changing our world. It is the centerpiece of artificial intelligence, and it is difficult to go a day without reading about how it will lead us either into a techno-utopia along the lines of the Singularity, or into some sort of global Blade Runner-esque nightmare scenario. While pundits may enjoy discussing these hyperbolic futures, the more mundane reality is that machine learning is rapidly becoming a fixture of our daily lives. Through subtle but steady improvements in how we interact with computers and the world around us, it is progressively making our lives better.
If you shop at online retailers such as Amazon.com, use streaming music or movie services such as Spotify or Netflix, or have even just done a Google search, you have encountered an application that utilizes machine learning. These services collect vast amounts of data—much of it from their users—that is used to build models that improve the user experience.
It's an ideal time to dive into developing machine learning applications, and, as you will discover, Python is an ideal choice with which to develop them. Python has a deep and active developer community, many with roots in the scientific community. This heritage has provided Python with an unparalleled array of libraries for scientific computing. In this book, we will discuss and use a number of the libraries included in this Python Scientific Stack.
In the chapters that follow, we'll learn how to build a wide variety of machine learning applications step by step. Before we begin in earnest though, we'll spend the remainder of this chapter discussing the features of these key libraries and how to prepare your environment to best utilize them.
These are the topics that will be covered in this chapter:
The data science/machine learning workflow
Libraries for each stage of the workflow
Setting up your environment
Building machine learning applications, while similar in many respects to the standard engineering paradigm, differs in one crucial aspect: the need to work with data as a raw material. The success of your project will, in large part, depend on the quality of the data you acquire, as well as your handling of that data. And because working with data falls into the domain of data science, it is helpful to understand the data science workflow:
The process involves these six steps in the following order:
Acquisition
Inspection
Preparation
Modeling
Evaluation
Deployment
Frequently, there is a need to circle back to prior steps, such as when inspecting and preparing the data, or when evaluating and modeling, but the process at a high level can be as described in the preceding list.
Let's now discuss each step in detail.
Data for machine learning applications can come from any number of sources; it may be emailed to you as a CSV file, it may come from pulling down server logs, or it may require building a custom web scraper. Data is also likely to exist in any number of formats. In most cases, you will be dealing with text-based data, but, as we'll see, machine learning applications can just as easily be built around images or even video files. Regardless of the format, once you have secured the data, it is crucial that you understand what's in the data, as well as what isn't.
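In the simplest case, acquisition might be nothing more than reading a CSV file into pandas, as in this minimal sketch (the file name here is purely hypothetical):

import pandas as pd

# Hypothetical file; the data could just as well come from server logs or a scraper
df = pd.read_csv('listings.csv')

# A quick first look at what's in the data, and what isn't
print(df.shape)
print(df.dtypes)
print(df.isnull().sum())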
Once you have acquired your data, the next step is to inspect it. The primary goal at this stage is to sanity check the data, and the best way to accomplish this is to look for things that are either impossible or highly unlikely. As an example, if the data has a unique identifier, check to see that there is indeed only one; if the data is price-based, check that it is always positive; and whatever the data type, check the most extreme cases. Do they make sense? A good practice is to run some simple statistical tests on the data, and visualize it. The outcome of your models is only as good as the data you put in, so it is crucial to get this step right.
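A few of these sanity checks can be sketched with pandas, continuing with the hypothetical df from the previous step and assuming it has id and price columns:

# The unique identifier should actually be unique
print(df['id'].is_unique)

# Price-based data should always be positive
print((df['price'] <= 0).sum(), 'rows with non-positive prices')

# Run some simple statistics and look at the most extreme cases
print(df['price'].describe())
print(df.nlargest(5, 'price'))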
When you are confident you have your data in order, next you will need to prepare it by placing it in a format that is amenable to modeling. This stage encompasses a number of processes, such as filtering, aggregating, imputing, and transforming. The type of actions you need to take will be highly dependent on the type of data you're working with, as well as the libraries and algorithms you will be utilizing. For example, if you are working with natural language-based texts, the transformations required will be very different from those required for time-series data. We'll see a number of examples of these types of transformations throughout the book.
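The details vary by project, but the kinds of operations involved might look like this rough sketch (the column names are again hypothetical):

# Filtering: keep only rows that satisfy a condition
df = df[df['price'] > 0]

# Imputing: fill missing values with the column median
df['sqft'] = df['sqft'].fillna(df['sqft'].median())

# Transforming: map a categorical column to numeric codes
df['borough_code'] = df['borough'].map({'Manhattan': 0, 'Brooklyn': 1, 'Queens': 2})

# Aggregating: average price per borough
print(df.groupby('borough')['price'].mean())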
Once the data preparation is complete, the next phase is modeling. Here, you will be selecting an appropriate algorithm and using the data to train your model. There are a number of best practices to adhere to during this stage, and we will discuss them in detail, but the basic steps involve splitting your data into training, testing, and validation sets. This splitting up of the data may seem illogical—especially when more data typically yields better models—but as we'll see, doing this allows us to get better feedback on how the model will perform in the real world, and prevents us from the cardinal sin of modeling: overfitting. We will talk more about this in later chapters.
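As a minimal, self-contained sketch of that splitting step with scikit-learn, using the iris dataset as a stand-in for your own features and target:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Hold out 20% of the data so we can gauge real-world performance later
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

model = LogisticRegression(max_iter=200)
model.fit(X_train, y_train)
print(model.score(X_test, y_test))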
So, now you've got a shiny new model, but exactly how good is that model? This is the question that the evaluation phase seeks to answer. There are a number of ways to measure the performance of a model, and again it is largely dependent on the type of data you are working with and the type of model used, but on the whole, we are seeking to answer the question of how close the model's predictions are to the actual values. There is an array of confusing-sounding terms, such as root-mean-square error, Euclidean distance, or F1 score, but in the end, they are all just measures of the distance between the actual values and the predicted values.
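With scikit-learn, each of these measures is a single function call; here is a small sketch using made-up numbers:

import numpy as np
from sklearn.metrics import f1_score, mean_squared_error

# Regression-style evaluation: root-mean-square error
y_true = np.array([3.0, 2.5, 4.0])
y_pred = np.array([2.8, 2.7, 3.6])
print(np.sqrt(mean_squared_error(y_true, y_pred)))

# Classification-style evaluation: F1 score
print(f1_score([0, 1, 1, 0, 1], [0, 1, 0, 0, 1]))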
Once you are comfortable with the performance of your model, you'll want to deploy it. This can take a number of forms depending on the use case, but common scenarios include utilization as a feature within another larger application, a bespoke web application, or even just a simple cron job.
Now that you have an understanding of each step in the data science workflow, we'll take a look at a selection of useful Python libraries and functions within those libraries for each step.
Because inspecting your data is such a critical step in the development of machine learning applications, we'll now take an in-depth look at several libraries that will serve you well in this task.
There are a number of libraries that will make the data inspection process easier. The first is Jupyter Notebook with IPython (http://ipython.org/). This is a fully-fledged, interactive computing environment, and it is ideal for data exploration. Unlike most development environments, Jupyter Notebook is a web-based frontend (to the IPython kernel) that is divided into individual code blocks or cells. Cells can be run individually or all at once, depending on the need. This allows the developer to run a scenario, see the output, then step back through the code, make adjustments, and see the resulting changes—all without leaving the notebook. Here is a sample interaction in the Jupyter Notebook:
You will notice that we have done a number of things here and have interacted with not only the IPython backend, but the terminal shell as well. Here, I have imported the Python os library and made a call to find the current working directory (cell #2), which you can see is the output below my input code cell. I then changed directories using the os library in cell #3, but stopped utilizing the os library and began using Linux-based commands in cell #4. This is done by prefixing the command with !. In cell #6, you can see that I was even able to save the shell output to a Python variable (file_two). This is a great feature that makes file operations simple.
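Roughly, the cells described above look like the following sketch (the directory path and file listing are hypothetical):

# Cell 2: find the current working directory with the os library
import os
os.getcwd()

# Cell 3: change directories using the os library
os.chdir('/tmp/example_data')  # hypothetical path

# Cell 4: switch to a Linux-based command by prefixing it with !
!pwd

# Cell 6: save shell output to a Python variable
file_two = !ls | head -2
print(file_two)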
Now, let's take a look at some simple data operations using the notebook. This will also be our first introduction to another indispensable library, pandas.
Pandas is a remarkable tool for data analysis that aims to be the most powerful and flexible open source data analysis/manipulation tool available in any language. And, as you will soon see, if it doesn't already live up to this claim, it can't be too far off. Let's now take a look:
You can see from the preceding screenshot that I have imported a classic machine learning dataset, the iris dataset (also available at https://archive.ics.uci.edu/ml/datasets/Iris), using scikit-learn, a library we'll examine in detail later. I then passed the data into a pandas DataFrame, making sure to assign the column headers. One DataFrame contains flower measurement data, and the other DataFrame contains a number that represents the iris species. This is coded 0, 1, and 2 for setosa, versicolor, and virginica respectively. I then concatenated the two DataFrames.
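The loading step described above can be sketched as follows (the exact column labels shown in the book's screenshot may differ slightly):

from sklearn import datasets
import pandas as pd

iris = datasets.load_iris()

# Flower measurements, with the feature names as column headers
df = pd.DataFrame(iris.data, columns=iris.feature_names)

# Species coded 0, 1, and 2 for setosa, versicolor, and virginica
y = pd.DataFrame(iris.target, columns=['species'])

# Concatenate the two DataFrames column-wise
df = pd.concat([df, y], axis=1)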
For working with datasets that will fit on a single machine, pandas is the ultimate tool; you can think of it a bit like Excel on steroids. And, like the popular spreadsheet program, the basic units of operation are columns and rows of data that form tables. In the terminology of pandas, columns of data are series and the table is a DataFrame.
Using the same iris DataFrame we loaded previously, let's now take a look at a few common operations, including the following:
The first action was just to use the .head() command to get the first five rows. The second command was to select a single column from the DataFrame by referencing it by its column name. Another way we perform this data slicing is to use the .iloc[row,column] or .loc[row,column] notation. The former slices data by integer position for both rows and columns (positional indexing), while the latter selects by label; with the default integer index, the row labels look like positions, but the columns are referenced by name (label-based indexing).
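Sketched with the df built above (the exact column name is an assumption):

# First five rows
df.head()

# Select a single column by its name
df['sepal width (cm)']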
Let's select the first two columns and the first four rows using the .iloc notation. We'll then look at the .loc notation:
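A sketch of that selection, using the same df:

# First four rows and first two columns, by position
df.iloc[:4, :2]

# The equivalent selection with .loc, referencing the columns by name
df.loc[:3, ['sepal length (cm)', 'sepal width (cm)']]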
Using the .iloc notation and the Python list slicing syntax, we were able to select a slice of this DataFrame.
Now, let's try something more advanced. We'll use a list iterator to select just the width feature columns:
What we have done here is create a list that is a subset of all columns. df.columns returns a list of all columns, and our iteration uses a conditional statement to select only those with width in the title. Obviously, in this situation, we could have just as easily typed out the columns we wanted into a list, but this gives you a sense of the power available when dealing with much larger datasets.
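That iteration might look like this:

# Keep only the columns whose name contains 'width'
width_cols = [col for col in df.columns if 'width' in col]
df[width_cols]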
We've seen how to select slices based on their position within the DataFrame, but let's now look at another method to select data. This time, we will select a subset of the data based upon satisfying conditions that we specify:
Let's now see the unique list of species available, and select just one of those:
In the far-right column, you will notice that our DataFrame now only contains data for the Iris-virginica species (represented by the 2). In fact, the size of the DataFrame is now 50 rows, down from the original 150 rows:
You can also see that the index on the left retains the original row numbers. If we wanted to save just this data, we could save it as a new DataFrame, and reset the index as shown in the following diagram:
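Roughly, the steps just described look like this, continuing with the same df:

# The unique species codes present in the data
df['species'].unique()

# Select only Iris-virginica (coded as 2); 50 of the original 150 rows remain
virginica = df[df['species'] == 2]

# Save the selection as a new DataFrame and reset the index
virginica = virginica.reset_index(drop=True)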
We have selected data by placing a condition on one column; let's now add more conditions. We'll go back to our original DataFrame and add two conditions:
The DataFrame now only includes data from the virginica species with a petal width greater than 2.2.
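Combining the two conditions might look like this:

# Virginica flowers with a petal width greater than 2.2
df[(df['species'] == 2) & (df['petal width (cm)'] > 2.2)]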
Let's now move on to using pandas to get some quick descriptive statistics from our iris dataset:
With a call to the .describe() function, I have received a breakdown of the descriptive statistics for each of the relevant columns. (Notice that species was automatically removed as it is not relevant for this.) I could also pass in my own percentiles if I wanted more granular information:
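A sketch of both calls, using the same df:

# Standard descriptive statistics
df.describe()

# Custom percentiles for more granular information
df.describe(percentiles=[.20, .40, .80, .90, .95])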
Next, let's check whether there is any correlation between these features. That can be done by calling .corr() on our DataFrame:
The default returns the Pearson correlation coefficient for each pair of columns. This can be switched to Kendall's Tau or Spearman's rank correlation coefficient by passing in a method argument (for example, .corr(method="spearman") or .corr(method="kendall")).
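For example, with the same df:

# Pearson correlation (the default) between each pair of columns
df.corr()

# Spearman's rank correlation coefficient instead
df.corr(method='spearman')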
