47,99 €
Machine learning has become the new black. The challenge in today’s world is the explosion of data from existing legacy data and incoming new structured and unstructured data. The complexity of discovering, understanding, performing analysis, and predicting outcomes on the data using machine learning algorithms is a challenge. This cookbook will help solve everyday challenges you face as a data scientist. The application of various data science techniques and on multiple data sets based on real-world challenges you face will help you appreciate a variety of techniques used in various situations.
The first half of the book provides recipes on fairly complex machine-learning systems, where you’ll learn to explore new areas of applications of machine learning and improve its efficiency. That includes recipes on classifications, neural networks, unsupervised and supervised learning, deep learning, reinforcement learning, and more.
The second half of the book focuses on three different machine learning case studies, all based on real-world data, and offers solutions and solves specific machine-learning issues in each one.
Das E-Book können Sie in Legimi-Apps oder einer beliebigen App lesen, die das folgende Format unterstützen:
Seitenzahl: 331
Veröffentlichungsjahr: 2017
Copyright © 2017 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
First published: April 2017
Production reference: 1070417
Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham
B3 2PB, UK.
ISBN 978-1-78528-051-1
www.packtpub.com
Author
Atul Tripathi
Copy Editor
Safis Editing
Reviewer
Ryota Kamoshida
Project Coordinator
Nidhi Joshi
Commissioning Editor
Akram Hussain
Proofreader
Safis Editing
Acquisition Editor
Tushar Gupta
Indexer
Tejal Daruwale Soni
Content Development Editor
Aishwarya Pandere
Graphics
Tania Dutta
Technical Editor
Prasad Ramesh
Production Coordinator
Shantanu Zagade
Atul Tripathi has spent more than 11 years in the fields of machine learning and quantitative finance. He has a total of 14 years of experience in software development and research. He has worked on advanced machine learning techniques, such as neural networks and Markov models. While working on these techniques, he has solved problems related to image processing, telecommunications, human speech recognition, and natural language processing. He has also developed tools for text mining using neural networks. In the field of quantitative finance, he has developed models for Value at Risk, Extreme Value Theorem, Option Pricing, and Energy Derivatives using Monte Carlo simulation techniques.
Ryota Kamoshida is the developer of the Python library MALSS (MAchine Learning Support System), (https://github.com/canard0328/malss) and now works as a senior researcher in the field of computer science at Hitachi, Ltd.
For support files and downloads related to your book, please visit www.PacktPub.com.
Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at [email protected] for more details.
At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters and receive exclusive discounts and offers on Packt books and eBooks.
https://www.packtpub.com/mapt
Get the most in-demand software skills with Mapt. Mapt gives you full access to all Packt books and video courses, as well as industry-leading tools to help you plan your personal development and advance your career.
Thanks for purchasing this Packt book. At Packt, quality is at the heart of our editorial process. To help us improve, please leave us an honest review on this book's Amazon page at https://www.amazon.com/dp/1785280511.
If you'd like to join our team of regular reviewers, you can e-mail us at [email protected]. We award our regular reviewers with free eBooks and videos in exchange for their valuable feedback. Help us be relentless in improving our products!
Data in today’s world is the new black gold which is growing exponentially. This growth can be attributed to the growth of existing data, and new data in a structured and unstructured format from multiple sources such as social media, Internet, documents and the Internet of Things. The flow of data must be collected, processed, analyzed, and finally presented in real time to ensure that the consumers of the data are able to take informed decisions in today’s fast-changing environment. Machine learning techniques are applied to the data using the context of the problem to be solved to ensure that fast arriving and complex data can be analyzed in a scientific manner using statistical techniques. Using machine learning algorithms that iteratively learn from data, hidden patterns can be discovered. The iterative aspect of machine learning is important because as models are exposed to new data, they are able to independently adapt and learn to produce reliable decisions from new data sets.
We will start by introducing the various topics of machine learning, that will be covered in the book. Based on real-world challenges, we explore each of the topics under various chapters, such as Classification, Clustering, Model Selection and Regularization, Nonlinearity, Supervised Learning, Unsupervised Learning, Reinforcement Learning, Structured Prediction, Neural Networks, Deep Learning, and finally the case studies. The algorithms have been developed using R as the programming language. This book is friendly for beginners in R, but familiarity with R programming would certainly be helpful for playing around with the code.
You will learn how to make informed decisions about the type of algorithms you need to use and how to implement these algorithms to get the best possible results. If you want to build versatile applications that can make sense of images, text, speech, or some other form of data, this book on machine learning will definitely come to your rescue!
Chapter 1, Introduction to Machine Learning, covers various concepts about machine learning. This chapter makes the reader aware of the various topics we shall be covering in the book.
Chapter 2, Classification, covers the following topics and algorithms: discriminant function analysis, multinomial logistic regression, Tobit regression, and Poisson regression.
Chapter 3, Clustering, covers the following topics and algorithms: hierarchical clustering, binary clustering, and k-means clustering.
Chapter 4, Model Selection and Regularization, covers the following topics and algorithms: shrinkage methods, dimension reduction methods, and principal component analysis.
Chapter 5, Nonlinearity, covers the following topics and algorithms: generalized additive models, smoothing splines, local regression.
Chapter 6, Supervised Learning, covers the following topics and algorithms: decision tree learning, Naive Bayes, random forest, support vector machine, and stochastic gradient descent.
Chapter 7, Unsupervised Learning, covers the following topics and algorithms: self-organizing map, and vector quantization.
Chapter 8, Reinforcement Learning, covers the following topics and algorithms: Markov chains, and Monte Carlo simulations.
Chapter 9, Structured Prediction, covers the following topic and algorithms: hidden Markov models.
Chapter 10, Neural Networks, covers the following topic and algorithms: neural networks.
Chapter 11, Deep Learning, covers the following topic and algorithms: recurrent neural networks.
Chapter 12, Case Study - Exploring World Bank Data, covers World Bank data analysis.
Chapter 13, Case Study - Pricing Reinsurance Contracts, covers pricing reinsurance contracts.
Chapter 14, Case Study - Forecast of Electricity Consumption, covers forecasting electricity consumption.
This book is focused on building machine learning-based applications in R. We have used R to build various solutions. We focused on how to utilize various R libraries and functions in the best possible way to overcome real-world challenges. We have tried to keep all the code as friendly and readable as possible. We feel that this will enable our readers to easily understand the code and readily use it in different scenarios.
This book is for students and professionals working in the fields of statistics, data analytics, machine learning, and computer science, or other professionals who want to build real-world machine learning-based applications. This book is friendly to R beginners, but being familiar with R would be useful for playing around with the code. The will also be useful for experienced R programmers who are looking to explore machine learning techniques in their existing technology stacks.
In this book, you will find headings that appear frequently (Getting ready and How to do it).
To give clear instructions on how to complete a recipe, we use these sections as follows:
This section tells you what to expect in the recipe, and describes how to set up any software or any preliminary settings required for the recipe.
This section contains the steps required to follow the recipe.
In this book, you will find a number of text styles that distinguish between different kinds of information. Here are some examples of these styles and an explanation of their meaning.
Code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles are shown as follows: "We will be saving the data to the fitbit_details frame:"
Any command-line input or output is written as follows:
install.packages("ggplot2")New terms and important words are shown in bold. Words that you see on the screen, for example, in menus or dialog boxes, appear in the text like this: "Monte Carlo v/s Market n Zero Rates"
Warnings or important notes appear in a box like this.
Tips and tricks appear like this.
Feedback from our readers is always welcome. Let us know what you think about this book-what you liked or disliked. Reader feedback is important for us as it helps us develop titles that you will really get the most out of.
To send us general feedback, simply e-mail [email protected], and mention the book's title in the subject of your message.
If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide at www.packtpub.com/authors .
Now that you are the proud owner of a Packt book, we have a number of things to help you to get the most from your purchase.
You can download the example code files for this book from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.
You can download the code files by following these steps:
You can also download the code files by clicking on the Code Files button on the book's webpage at the Packt Publishing website. This page can be accessed by entering the book's name in the Search box. Please note that you need to be logged in to your Packt account.
Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:
The code bundle for the book is also hosted on GitHub at https://github.com/PacktPublishing/Practical-Machine-Learning-Cookbook. We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!
We also provide you with a PDF file that has color images of the screenshots/diagrams used in this book. The color images will help you better understand the changes in the output. You can download this file from https://www.packtpub.com/sites/default/files/downloads/PracticalMachineLearningCookbook_ColorImages.pdf.
Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books-maybe a mistake in the text or the code-we would be grateful if you could report this to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this book. If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting your book, clicking on the Errata Submission Form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded to our website or added to any list of existing errata under the Errata section of that title.
To view the previously submitted errata, go to https://www.packtpub.com/books/content/support and enter the name of the book in the search field. The required information will appear under the Errata section.
Piracy of copyrighted material on the Internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works in any form on the Internet, please provide us with the location address or website name immediately so that we can pursue a remedy.
Please contact us at [email protected] with a link to the suspected pirated material.
We appreciate your help in protecting our authors and our ability to bring you valuable content.
If you have a problem with any aspect of this book, you can contact us at [email protected], and we will do our best to address the problem.
In this chapter, we will cover an introduction to machine learning and various topics covered under machine learning. In this chapter you will learn about the following topics:
Human beings are exposed to data from birth. The eyes, ears, nose, skin, and tongue are continuously gathering various forms of data which the brain translates to sight, sound, smell, touch, and taste. The brain then processes various forms of raw data it receives through sensory organs and translates it to speech, which is used to express opinion about the nature of raw data received.
In today's world, sensors attached to machines are applied to gather data. Data is collected from Internet through various websites and social networking sites. Electronic forms of old manuscripts that have been digitized also add to data sets. Data is also obtained from the Internet through various websites and social networking sites. Data is also gathered from other electronic forms such as old manuscripts that have been digitized. These rich forms of data gathered from multiple sources require processing so that insight can be gained and a more meaningful pattern may be understood.
Machine learning algorithms help to gather data from varied sources, transform rich data sets, and help us to take intelligent action based on the results provided. Machine learning algorithms are designed to be efficient and accurate and to provide general learning to do the following:
Some of the areas of applications of machine learning algorithms are as follows:
Linear regression models present response variables that are quantitative in nature. However, certain responses are qualitative in nature. Responses such as attitudes (strongly disagree, disagree, neutral, agree, and strongly agree) are qualitative in nature. Predicting a qualitative response for an observation can be referred to as classifying that observation, since it involves assigning the observation to a category or class. Classifiers are an invaluable tool for many tasks today, such as medical or genomics predictions, spam detection, face recognition, and finance.
Clustering is a division of data into groups of similar objects. Each object (cluster) consists of objects that are similar between themselves and dissimilar to objects of other groups. The goal of clustering is to determine the intrinsic grouping in a set of unlabeled data. Clustering can be used in varied areas of application from data mining (DNA analysis, marketing studies, insurance studies, and so on.), text mining, information retrieval, statistical computational linguists, and corpus-based computational lexicography. Some of the requirements that must be fulfilled by clustering algorithms are as follows:
The following diagram shows a representation of clustering:
Supervised learning entails learning a mapping between a set of input variables (typically a vector) and an output variable (also called the supervisory signal) and applying this mapping to predict the outputs for unseen data. Supervised methods attempt to discover the relationship between input variables and target variables. The relationship discovered is represented in a structure referred to as a model. Usually models describe and explain phenomena, which are hidden in the dataset and can be used for predicting the value of the target attribute knowing the values of the input attributes.
Supervised learning is the machine learning task of inferring a function from supervised training data (set of training examples). The training data consists of a set of training examples. In supervised learning, each example is a pair consisting of an input object and a desired output value. A supervised learning algorithm analyzes the training data and produces an inferred function.
In order to solve the supervised learning problems, the following steps must be performed:
The supervised methods can be implemented in a variety of domains such as marketing, finance, and manufacturing.
Some of the issues to consider in supervised learning are as follows:
Unsupervised learning studies how systems can learn to represent particular input patterns in a way that reflects the statistical structure of the overall collection of input patterns. Unsupervised learning is important since it is likely to be much more common in the brain than supervised learning. For example, the activities of photoreceptors in the eyes are constantly changing with the visual world. They go on to provide all the information that is available to indicate what objects there are in the world, how they are presented, what the lighting conditions are, and so on. However, essentially none of the information about the contents of scenes is available during learning. This makes unsupervised methods essential, and allows them to be used as computational models for synaptic adaptation.
In unsupervised learning, the machine receives inputs but obtains neither supervised target outputs, nor rewards from its environment. It may seem somewhat mysterious to imagine what the machine could possibly learn given that it doesn't get any feedback from its environment. However, it is possible to develop a formal framework for unsupervised learning, based on the notion that the machine's goal is to build representations of the input that can be used for decision making, predicting future inputs, efficiently communicating the inputs to another machine, and so on. In a sense, unsupervised learning can be thought of as finding patterns in the data above and beyond what would be considered noise.
Some of the goals of unsupervised learning are as follows:
Reinforcement learning is the problem of getting an agent to act in the world so as to maximize its rewards. It is about what to do and how to map situations to actions so as to maximize a numerical reward signal. The learner is not told which actions to take, as in most forms of machine learning, but instead must discover which actions yield the most reward by trying them. The two most important distinguishing features of reinforcement learning are trial and error and search and delayed reward. Some examples of reinforcement learning are as follows:
Reinforcement learning is like trial and error learning. The agent should discover a good policy from its experiences of the environment without losing too much reward along the way. Exploration is about finding more information about the environment while Exploitation exploits known information to maximize reward. For example:
Major components of reinforcement learning are as follows:
Structured prediction is an important area of application for machine learning problems in a variety of domains. Considering an input x and an output y in areas such as a labeling of time steps, a collection of attributes for an image, a parsing of a sentence, or a segmentation of an image into objects, problems are challenging because the y's are exponential in the number of output variables that comprise it. These are computationally challenging because prediction requires searching an enormous space, and also statistical considerations, since learning accurate models from limited data requires reasoning about commonalities between distinct structured outputs. Structured prediction is fundamentally a problem of representation, where the representation must capture both the discriminative interactions between x and y and also allow for efficient combinatorial optimization over y.
Structured prediction is about predicting structured outputs from input data in contrast to predicting just a single number, like in classification or regression. For example:
Neural networks represent a brain metaphor for information processing. These models are biologically inspired rather than an exact replica of how the brain actually functions. Neural networks have been shown to be very promising systems in many forecasting applications and business classification applications due to their ability to learn from the data.
The artificial neural network learns by updating the network architecture and connection weights so that the network can efficiently perform a task. It can learn either from available training patterns or automatically learn from examples or input-output relations. The learning process is designed by one of the following:
There are four basic types of learning rules:
Deep learning refers to a rather wide class of machine learning techniques and architectures, with the hallmark of using many layers of non-linear information processing that are hierarchical in nature. There are broadly three categories of deep learning architecture:
In this chapter, we will cover the following recipes:
Discriminant analysis is used to distinguish distinct sets of observations and allocate new observations to previously defined groups. For example, if a study was to be carried out in order to investigate the variables that discriminate between fruits eaten by (1) primates, (2) birds, or (3) squirrels, the researcher could collect data on numerous fruit characteristics of those species eaten by each of the animal groups. Most fruits will naturally fall into one of the three categories. Discriminant analysis could then be used to determine which variables are the best predictors of whether a fruit will be eaten by birds, primates, or squirrels. Discriminant analysis is commonly used in biological species classification, in medical classification of tumors, in facial recognition technologies, and in the credit card and insurance industries for determining risk. The main goals of discriminant analysis are discrimination and classification. The assumptions regarding discriminant analysis are multivariate normality, equality of variance-covariance within group and low multicollinearity of the variables.
Multinomial logistic regression is used to predict categorical placement in or the probability of category membership on a dependent variable, based on multiple independent variables. It is used when the dependent variable has more than two nominal or unordered categories, in which dummy coding of independent variables is quite common. The independent variables can be either dichotomous (binary) or continuous (interval or ratio in scale). Multinomial logistic regression uses maximum likelihood estimation to evaluate the probability of categorical membership. It uses maximum likelihood estimation rather than the least squares estimation used in traditional multiple regression. The general form of the distribution is assumed. The starting values of the estimated parameters are used and the likelihood that the sample came from a population with those parameters is computed. The values of the estimated parameters are adjusted iteratively until the maximum likelihood value for the estimated parameters is obtained.
Tobit regression
