Machine Learning Solutions - Jalaj Thanaki - E-Book

Jalaj Thanaki

Description

Machine learning (ML) helps you find hidden insights from your data without the need for explicit programming. This book is your key to solving any kind of ML problem you might come across in your job.
You’ll encounter a set of simple to complex problems while building ML models, and you'll not only resolve these problems, but you’ll also learn how to build projects based on each problem, with a practical approach and easy-to-follow examples.
The book includes a wide range of applications: from analytics and NLP to computer vision domains. Some of the applications you will be working on include stock price prediction, a recommendation engine, building a chatbot, a facial expression recognition system, and many more. The problem examples we cover include identifying the right algorithm for your dataset and use cases, creating and labeling datasets, getting enough clean data to carry out processing, identifying outliers, overfitting datasets, hyperparameter tuning, and more. Here, you'll also learn to make more timely and accurate predictions.
In addition, you'll deal with more advanced use cases, such as building a gaming bot, building an extractive summarization tool for medical documents, and you'll also tackle the problems faced while building an ML model. By the end of this book, you'll be able to fine-tune your models as per your needs to deliver maximum productivity.

The e-book can be read in Legimi apps or in any app that supports the following formats:

EPUB
MOBI

Page count: 549

Year of publication: 2018




Table of Contents

Machine Learning Solutions
Why subscribe?
PacktPub.com
Foreword
Contributors
About the author
About the reviewer
Packt is Searching for Authors Like You
Preface
Who this book is for
What this book covers
To get the most out of this book
Download the example code files
Conventions used
Get in touch
Reviews
1. Credit Risk Modeling
Introducing the problem statement
Understanding the dataset
Understanding attributes of the dataset
Data analysis
Data preprocessing
First change
Second change
Implementing the changes
Basic data analysis followed by data preprocessing
Listing statistical properties
Finding missing values
Replacing missing values
Correlation
Detecting outliers
Outliers detection techniques
Percentile-based outlier detection
Median Absolute Deviation (MAD)-based outlier detection
Standard Deviation (STD)-based outlier detection
Majority-vote-based outlier detection
Visualization of outliers
Handling outliers
Revolving utilization of unsecured lines
Age
Number of time 30-59 days past due not worse
Debt ratio
Monthly income
Number of open credit lines and loans
Number of times 90 days late
Number of real estate loans or lines
Number of times 60-89 days past due not worse
Number of dependents
Feature engineering for the baseline model
Finding out feature importance
Selecting machine learning algorithms
K-Nearest Neighbor (KNN)
Logistic regression
AdaBoost
GradientBoosting
RandomForest
Training the baseline model
Understanding the testing matrix
The Mean accuracy of the trained models
The ROC-AUC score
ROC
AUC
Testing the baseline model
Problems with the existing approach
Optimizing the existing approach
Understanding key concepts to optimize the approach
Cross-validation
The approach of using CV
Hyperparameter tuning
Grid search parameter tuning
Random search parameter tuning
Implementing the revised approach
Implementing a cross-validation based approach
Implementing hyperparameter tuning
Implementing and testing the revised approach
Understanding problems with the revised approach
Best approach
Implementing the best approach
Log transformation of features
Voting-based ensemble ML model
Running ML models on real test data
Summary
2. Stock Market Price Prediction
Introducing the problem statement
Collecting the dataset
Collecting DJIA index prices
Collecting news articles
Understanding the dataset
Understanding the DJIA dataset
Understanding the NYTimes news article dataset
Data preprocessing and data analysis
Preparing the DJIA training dataset
Basic data analysis for a DJIA dataset
Preparing the NYTimes news dataset
Converting publication date into the YYYY-MM-DD format
Filtering news articles by category
Implementing the filter functionality and merging the dataset
Saving the merged dataset in the pickle file format
Feature engineering
Loading the dataset
Minor preprocessing
Converting adj close price into the integer format
Removing the leftmost dot from news headlines
Feature engineering
Sentiment analysis of NYTimes news articles
Selecting the Machine Learning algorithm
Training the baseline model
Splitting the training and testing dataset
Splitting prediction labels for the training and testing datasets
Converting sentiment scores into the numpy array
Training of the ML model
Understanding the testing matrix
The default testing matrix
The visualization approach
Testing the baseline model
Generating and interpreting the output
Generating the accuracy score
Visualizing the output
Exploring problems with the existing approach
Alignment
Smoothing
Trying a different ML algorithm
Understanding the revised approach
Understanding concepts and approaches
Alignment-based approach
Smoothing-based approach
Logistic Regression-based approach
Implementing the revised approach
Implementation
Implementing alignment
Implementing smoothing
Implementing logistic regression
Testing the revised approach
Understanding the problem with the revised approach
The best approach
Summary
3. Customer Analytics
Introducing customer segmentation
Introducing the problem statement
Understanding the datasets
Description of the dataset
Downloading the dataset
Attributes of the dataset
Building the baseline approach
Implementing the baseline approach
Data preparation
Loading the dataset
Exploratory data analysis (EDA)
Removing null data entries
Removing duplicate data entries
EDA for various data attributes
Country
Customer and products
Product categories
Analyzing the product description
Defining product categories
Characterizing the content of clusters
Silhouette intra-cluster score analysis
Analysis using a word cloud
Principal component analysis (PCA)
Generating customer categories
Formatting data
Grouping products
Splitting the dataset
Grouping orders
Creating customer categories
Data encoding
Generating customer categories
PCA analysis
Analyzing the cluster using silhouette scores
Classifying customers
Defining helper functions
Splitting the data into training and testing
Implementing the Machine Learning (ML) algorithm
Understanding the testing matrix
Confusion matrix
Learning curve
Testing the result of the baseline approach
Generating the accuracy score for classifier
Generating the confusion matrix for the classifier
Generating the learning curve for the classifier
Problems with the baseline approach
Optimizing the baseline approach
Building the revised approach
Implementing the revised approach
Testing the revised approach
Problems with the revised approach
Understanding how to improve the revised approach
The best approach
Implementing the best approach
Testing the best approach
Transforming the hold-out corpus in the form of the training dataset
Converting the transformed dataset into a matrix form
Generating the predictions
Customer segmentation for various domains
Summary
4. Recommendation Systems for E-Commerce
Introducing the problem statement
Understanding the datasets
e-commerce Item Data
The Book-Crossing dataset
BX-Book-Ratings.csv
BX-Books.csv
BX-Users.csv
Building the baseline approach
Understanding the basic concepts
Understanding the content-based approach
Implementing the baseline approach
Architecture of the recommendation system
Steps for implementing the baseline approach
Loading the dataset
Generating features using TF-IDF
Building the cosine similarity matrix
Generating the prediction
Understanding the testing matrix
Testing the result of the baseline approach
Problems with the baseline approach
Optimizing the baseline approach

Building the revised approach
Implementing the revised approach
Loading dataset
EDA of the book-rating datafile
Exploring the book datafile
EDA of the user datafile
Implementing the logic of correlation for the recommendation engine
Recommendations based on the rating of the books
Recommendations based on correlations
Testing the revised approach
Problems with the revised approach
Understanding how to improve the revised approach
The best approach
Understanding the key concepts
Collaborative filtering
Memory-based CF
User-user collaborative filtering
Item-item collaborative filtering
Model-based CF
Matrix-factorization-based algorithms
Difference between memory-based CF and model-based CF
Implementing the best approach
Loading the dataset
Merging the data frames
EDA for the merged data frames
Filtering data based on geolocation
Applying the KNN algorithm
Recommendation using the KNN algorithm
Applying matrix factorization
Recommendation using matrix factorization
Summary
5. Sentiment Analysis
Introducing problem statements
Understanding the dataset
Understanding the content of the dataset
Train folder
Test folder
imdb.vocab file
imdbEr.txt file
README
Understanding the contents of the movie review files
Building the training and testing datasets for the baseline model
Feature engineering for the baseline model
Selecting the machine learning algorithm
Training the baseline model
Implementing the baseline model
Multinomial naive Bayes
C-support vector classification with kernel rbf
C-support vector classification with kernel linear
Linear support vector classification
Understanding the testing matrix
Precision
Recall
F1-Score
Support
Training accuracy
Testing the baseline model
Testing of Multinomial naive Bayes
Testing of SVM with rbf kernel
Testing SVM with the linear kernel
Testing SVM with linearSVC
Problem with the existing approach
How to optimize the existing approach
Understanding key concepts for optimizing the approach
Implementing the revised approach
Importing the dependencies
Downloading and loading the IMDb dataset
Choosing the top words and the maximum text length
Implementing word embedding
Building a convolutional neural net (CNN)
Training and obtaining the accuracy
Testing the revised approach
Understanding problems with the revised approach
The best approach
Implementing the best approach
Loading the glove model
Loading the dataset
Preprocessing
Loading precomputed ID matrix
Splitting the train and test datasets
Building a neural network
Training the neural network
Loading the trained model
Testing the trained model
Summary
6. Job Recommendation Engine
Introducing the problem statement
Understanding the datasets
Scraped dataset
Job recommendation challenge dataset
apps.tsv
users.tsv
Jobs.zip
user_history.tsv
Building the baseline approach
Implementing the baseline approach
Defining constants
Loading the dataset
Defining the helper function
Generating TF-IDF vectors and cosine similarity
Building the training dataset
Generating TF-IDF vectors for the training dataset
Building the testing dataset
Generating the similarity score
Understanding the testing matrix
Problems with the baseline approach
Optimizing the baseline approach
Building the revised approach
Loading the dataset
Splitting the training and testing datasets
Exploratory Data Analysis
Building the recommendation engine using the jobs datafile
Testing the revised approach
Problems with the revised approach
Understanding how to improve the revised approach
The best approach
Implementing the best approach
Filtering the dataset
Preparing the training dataset
Applying the concatenation operation
Generating the TF-IDF and cosine similarity score
Generating recommendations
Summary
7. Text Summarization
Understanding the basics of summarization
Extractive summarization
Abstractive summarization
Introducing the problem statement
Understanding datasets
Challenges in obtaining the dataset
Understanding the medical transcription dataset
Understanding Amazon's review dataset
Building the baseline approach
Implementing the baseline approach
Installing python dependencies
Writing the code and generating the summary
Problems with the baseline approach
Optimizing the baseline approach
Building the revised approach
Implementing the revised approach
The get_summarized function
The reorder_sentences function
The summarize function
Generating the summary
Problems with the revised approach
Understanding how to improve the revised approach
The LSA algorithm
The idea behind the best approach
The best approach
Implementing the best approach
Understanding the structure of the project
Understanding helper functions
Normalization.py
Utils.py
Generating the summary
Building the summarization application using Amazon reviews
Loading the dataset
Exploring the dataset
Preparing the dataset
Building the DL model
Training the DL model
Testing the DL model
Summary
8. Developing Chatbots
Introducing the problem statement
Retrieval-based approach
Generative-based approach
Open domain
Closed domain
Short conversation
Long conversation
Open domain and generative-based approach
Open domain and retrieval-based approach
Closed domain and retrieval-based approach
Closed domain and generative-based approach
Understanding datasets
Cornell Movie-Dialogs dataset
Content details of movie_conversations.txt
Content details of movie_lines.txt
The bAbI dataset
The (20) QA bAbI tasks
Building the basic version of a chatbot
Why does the rule-based system work?
Understanding the rule-based system
Understanding the approach
Listing down possible questions and answers
Deciding standard messages
Understanding the architecture
Implementing the rule-based chatbot
Implementing the conversation flow
Implementing RESTful APIs using flask
Testing the rule-based chatbot
Advantages of the rule-based chatbot
Problems with the existing approach
Understanding key concepts for optimizing the approach
Understanding the seq2seq model
Implementing the revised approach
Data preparation
Generating question-answer pairs
Preprocessing the dataset
Splitting the dataset into the training dataset and the testing dataset
Building a vocabulary for the training and testing datasets
Implementing the seq2seq model
Creating the model
Training the model
Testing the revised approach
Understanding the testing metrics
Perplexity
Loss
Testing the revised version of the chatbot
Problems with the revised approach
Understanding key concepts to solve existing problems
Memory networks
Dynamic memory network (DMN)
Input module
Question module
Episodic memory
The best approach
Implementing the best approach
Random testing mode
User interactive testing mode
Discussing the hybrid approach
Summary
9. Building a Real-Time Object Recognition App
Introducing the problem statement
Understanding the dataset
The COCO dataset
The PASCAL VOC dataset
PASCAL VOC classes
Transfer Learning
What is Transfer Learning?
What is a pre-trained model?
Why should we use a pre-trained model?
How can we use a pre-trained model?
Setting up the coding environment
Setting up and installing OpenCV
Features engineering for the baseline model
Selecting the machine learning algorithm
Architecture of the MobileNet SSD model
Building the baseline model
Understanding the testing metrics
Intersection over Union (IoU)
mean Average Precision
Testing the baseline model
Problem with existing approach
How to optimize the existing approach
Understanding the process for optimization
Implementing the revised approach
Testing the revised approach
Understanding problems with the revised approach
The best approach
Understanding YOLO
The working of YOLO
The architecture of YOLO
Implementing the best approach using YOLO
Implementation using Darknet
Environment setup for Darknet
Compiling the Darknet
Downloading the pre-trained weight
Running object detection for the image
Running the object detection on the video stream
Implementation using Darkflow
Installing Cython
Building the already provided setup file
Testing the environment
Loading the model and running object detection on images
Loading the model and running object detection on the video stream
Summary
10. Face Recognition and Face Emotion Recognition
Introducing the problem statement
Face recognition application
Face emotion recognition application
Setting up the coding environment
Installing dlib
Installing face_recognition
Understanding the concepts of face recognition
Understanding the face recognition dataset
CAS-PEAL Face Dataset
Labeled Faces in the Wild
Algorithms for face recognition
Histogram of Oriented Gradients (HOG)
Convolutional Neural Network (CNN) for FR
Simple CNN architecture
Understanding how CNN works for FR
Approaches for implementing face recognition
Implementing the HOG-based approach
Implementing the CNN-based approach
Implementing real-time face recognition
Understanding the dataset for face emotion recognition
Understanding the concepts of face emotion recognition
Understanding the convolutional layer
Understanding the ReLU layer
Understanding the pooling layer
Understanding the fully connected layer
Understanding the SoftMax layer
Updating the weight based on backpropagation
Building the face emotion recognition model
Preparing the data
Loading the data
Training the model
Loading the data using the dataset_loader script
Building the Convolutional Neural Network
Training for the FER application
Predicting and saving the trained model
Understanding the testing matrix
Testing the model
Problems with the existing approach
How to optimize the existing approach
Understanding the process for optimization
The best approach
Implementing the best approach
Summary
11. Building Gaming Bot
Introducing the problem statement
Setting up the coding environment
Understanding Reinforcement Learning (RL)
Markov Decision Process (MDP)
Discounted Future Reward
Basic Atari gaming bot
Understanding the key concepts
Rules for the game
Understanding the Q-Learning algorithm
Implementing the basic version of the gaming bot
Building the Space Invaders gaming bot
Understanding the key concepts
Understanding a deep Q-network (DQN)
Architecture of DQN
Steps for the DQN algorithm
Understanding Experience Replay
Implementing the Space Invaders gaming bot
Building the Pong gaming bot
Understanding the key concepts
Architecture of the gaming bot
Approach for the gaming bot
Implementing the Pong gaming bot
Initialization of the parameters
Weights stored in the form of matrices
Updating weights
How to move the agent
Understanding the process using NN
Just for fun - implementing the Flappy Bird gaming bot
Summary
A. List of Cheat Sheets
Cheat sheets
Summary
B. Strategy for Winning Hackathons
Strategy for winning hackathons
Keeping up to date
Summary
Index

Machine Learning Solutions

Copyright © 2018 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

Commissioning Editor: Sunith Shetty

Acquisition Editor: Aman Singh

Content Development Editor: Snehal Kolte

Technical Editor: Danish Shaikh

Copy Editor: Safis Editing

Project Coordinator: Manthan Patel

Proofreader: Safis Editing

Indexer: Pratik Shirodkar

Graphics: Tania Dutta

Production Coordinator: Arvindkumar Gupta

First published: April 2018

Production reference: 1250418

Published by Packt Publishing Ltd.

Livery Place

35 Livery Street

Birmingham B3 2PB, UK.

ISBN 978-1-78839-004-0

www.packtpub.com

PacktPub.com

Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at [email protected] for more details.

At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters, and receive exclusive discounts and offers on Packt books and eBooks.

With the blessings of God, and all my love for Shetul, parents, and in-laws

Foreword

I have known Jalaj Thanaki for more than a year. Jalaj comes across as a passionate techno-analytical expert who has the rigor one requires to achieve excellence. Her points of view on big data analytics, NLP, machine learning, and AI are well informed and carry her own analysis and appreciation of the landscape of problems and solutions. I'm glad to be writing this foreword in my capacity as the CEO and MD of SMECorner.

Machine learning solutions are rapidly changing the world and the way we do business, be it the retail, banking, financial services, publishing, pharmaceutical, or manufacturing industry. Data of all forms is growing exponentially: quantitative, qualitative, structured, unstructured, speech, video, and so on. It is imperative to make use of this data to leverage all functions, avoid risk and fraud, enhance customer experience, increase revenues, and streamline operations.

Organizations are moving fast to embrace data science and are investing heavily to build high-end data science teams. Having spent more than 30 years in the finance domain as a leader and managing director of various organizations such as Barclays Bank, Equifax, Hinduja Leyland, and SMECorner, I am amazed by the transition the financial industry has undergone in embracing machine learning solutions as a business function and no longer as a support function.

In this book, Jalaj takes us through an exciting and insightful journey to develop the best possible machine learning solutions for data science applications. With all the practical examples covered and with solid explanations, in my opinion, this is one of the best practical books for readers who want to become proficient in machine learning and deep learning.

Wishing Jalaj and this book a roaring success, which they deserve.

Samir Bhatia

MD/CEO and Founder of SMECorner

Mumbai, India

Contributors

About the author

Jalaj is an experienced data scientist with a demonstrated history of working in the information technology, publishing, and finance industries. She is the author of the book Python Natural Language Processing.

Her research interest lies in Natural Language Processing, Machine Learning, Deep Learning, and Big Data Analytics. Besides being a data scientist, Jalaj is also a social activist, traveler, and nature-lover.

I would like to dedicate this book to my husband, Shetul, for his constant support and encouragement. I give deep thanks and gratitude to my parents and my in-laws who help me at every stage of my life. I would like to thank my reviewers for providing valuable suggestions towards the improvement of this book.

Thank you, God, for being kind to me...!

About the reviewer

Niclas has been using computers for fun and profit since he got his first computer (a C64) at age four. After a prolonged period of combining a start-up with university studies, he graduated from Åbo Akademi University with an M.Sc. in Computer Engineering in 2015.

His hobbies include long walks, lifting heavy metal objects at the gym and spending quality time with his wife and daughter.

Niclas currently works at Walkbase, a company he founded together with some of his classmates from university. At Walkbase, he leads the engineering team, building the next generation of retail analytics.

Mayur Narkhede has a good blend of experience in data science and industry domains. He holds a B.Tech in Computer Science and an M.Tech in CSE with a specialization in Artificial Intelligence.

He is a data scientist with core experience in building automated end-to-end solutions, proficient at applying technology, AI, ML, data mining, and design thinking to better understand and predict business functions and to improve growth and profitability.

He has worked on multiple advanced solutions, such as ML and predictive model development for oil and gas, financial services, road traffic and transport, and life sciences, as well as big data platforms for asset-intensive industries.

Shetul Thanaki has a bachelor's degree in computer engineering from Sardar Vallabhbhai Patel University, Gujarat. He has 10 years of experience in the IT industry. He currently works with an investment bank, providing IT solutions for its global markets applications.

He has good knowledge of Java technologies, rule-based systems, and database systems. AI technologies in the fintech space are one of his key areas of interest.

Packt is Searching for Authors Like You

If you're interested in becoming an author for Packt, please visit authors.packtpub.com and apply today. We have worked with thousands of developers and tech professionals, just like you, to help them share their insight with the global tech community. You can make a general application, apply for a specific hot topic that we are recruiting an author for, or submit your own idea.

Preface

This book, titled Machine Learning Solutions, gives you a broad idea of the topic. As a reader, you will get the chance to learn how to develop cutting-edge data science applications using various Machine Learning (ML) techniques. This book is a practical guide that can help you build and optimize your data science applications.

We learn things by doing them. Practical implementations of various machine learning techniques, tips and tricks, optimization techniques, and so on will enhance your understanding of ML and of data science application development.

Now let me answer one of the questions about ML and data science application development that I have heard most frequently from my friends and colleagues. This question is what really inspired me to write this book, and it's important to me that all my readers understand why I am writing it. Let's get to the question!

The question is, "How can I achieve the best possible accuracy for a machine learning application?" The answer involves a number of things that you need to take care of:

1. Understand the goal of the application really well: why does your organization want to build this application?
2. List the expected outputs of the application and how they help the organization. This will clarify both the technical and the business aspects of the application.
3. What kind of dataset do you have? Is there anything more you need in order to generate the required output?
4. Explore the dataset really well and try to get insights from it.
5. Check whether the dataset has labels or not. If it is a labeled dataset, you can apply supervised algorithms; if it is not labeled, apply unsupervised algorithms. Decide whether your problem statement is a regression problem or a classification problem.
6. Build a very simple baseline approach using simple ML techniques and measure its accuracy.
7. Now you may think, "I haven't chosen the right algorithm, and that is the reason the accuracy of the baseline approach is not good." That's OK! List all the possible problems you think your baseline approach has, and be honest about them.
8. Solve the problems one by one and measure the accuracy. If the accuracy is improving, move forward in that direction; otherwise, try other solutions that eventually resolve the shortcomings of the baseline approach and improve the accuracy.
9. You can repeat this process a number of times. After every iteration, you will get a new, definite direction, which will lead you to the best possible solution as well as the best accuracy.
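To make the baseline-then-iterate loop concrete, here is a minimal, hypothetical sketch using scikit-learn; the dataset and model choices are stand-ins, not examples from the book. The idea is simply: train a simple baseline, measure its accuracy, then try a revised model and keep moving in whichever direction improves the score.

```python
# Hypothetical sketch of the baseline-then-iterate workflow; the dataset and
# model choices are stand-ins, not the book's actual examples.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score

X, y = load_breast_cancer(return_X_y=True)          # a labeled dataset
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Build a very simple baseline and measure its accuracy.
baseline = LogisticRegression(max_iter=5000).fit(X_train, y_train)
baseline_acc = accuracy_score(y_test, baseline.predict(X_test))

# Try a revised approach; move in this direction only if accuracy improves.
revised = GradientBoostingClassifier(random_state=42).fit(X_train, y_train)
revised_acc = accuracy_score(y_test, revised.predict(X_test))

print(f"baseline: {baseline_acc:.3f}, revised: {revised_acc:.3f}")
```

In practice, each iteration of step 8 is exactly this comparison, repeated with different fixes (features, algorithms, hyperparameters) until the score stops improving.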

I have covered all of these aspects in this book. The major goal here is for readers to obtain state-of-the-art results for their own data science problems using ML algorithms; to achieve that, we use only the bare necessary theory alongside many hands-on examples from different domains.

We will cover the analytics, NLP, and computer vision domains. These examples are all industry problems, and readers will learn how to get the best results. After reading this book, readers will be able to apply their new skills to any sort of industry problem to achieve the best possible accuracy for their machine learning applications.

Who this book is for

A typical reader will have basic to intermediate knowledge of undergraduate mathematics, such as probability, statistics, calculus, and linear algebra; no advanced mathematics is required, as the book is mostly self-contained. Basic to intermediate knowledge of Machine Learning (ML) algorithms is required, but no advanced ML concepts are needed. Decent knowledge of Python is required too, as an introduction to Python would be out of scope, but each procedure is explained step by step so that it is reproducible.

This book is full of practical examples. It is for readers who want to know how to apply Machine Learning (ML) algorithms to real-life data science applications efficiently. The book starts with basic ML techniques that can be used to develop a baseline approach. After that, readers learn how to apply optimization techniques to each application in order to achieve state-of-the-art results. For each application, I have laid out the basic concepts and tips and tricks, along with the code.

What this book covers

Chapter 1, Credit Risk Modeling, builds a predictive analytics model to help us predict whether a customer will default on their loan. We will be using outlier detection, feature transformation, ensemble machine learning algorithms, and so on to get the best possible solution.

Chapter 2, Stock Market Price Prediction, builds a model to predict the stock index price based on a historical dataset. We will use neural networks to get the best possible solution.

Chapter 3, Customer Analytics, explores how to build customer segmentation so that marketing campaigns can be done optimally. Using various machine learning algorithms such as K-nearest neighbor, random forest, and so on, we can build the base-line approach. In order to get the best possible solution, we will be using ensemble machine learning algorithms.

Chapter 4, Recommendation Systems for E-commerce, builds a recommendation engine for an e-commerce platform that can recommend similar books. We will be using concepts such as correlation, TF-IDF, and cosine similarity to build the application.
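As an illustrative sketch (with made-up toy data, not the book's dataset or code), content-based recommendation with TF-IDF and cosine similarity boils down to vectorizing item descriptions and ranking items by pairwise similarity:

```python
# Toy content-based recommender: TF-IDF vectors + cosine similarity.
# The book titles and descriptions below are made up for illustration.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

books = {
    "ML Basics": "machine learning algorithms and models",
    "Deep Learning": "neural networks and deep learning models",
    "Cooking 101": "recipes for soups salads and desserts",
}
titles = list(books)
tfidf = TfidfVectorizer().fit_transform(books.values())
sim = cosine_similarity(tfidf)                      # pairwise similarity matrix

# Recommend the book most similar to "ML Basics" (excluding itself).
idx = titles.index("ML Basics")
best = max((j for j in range(len(titles)) if j != idx), key=lambda j: sim[idx, j])
print(titles[best])
```

Real systems differ mainly in scale and in the richness of the item descriptions; the vectorize-then-rank core stays the same.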

Chapter 5, Sentiment Analysis, generates sentiment scores for movie reviews. In order to get the best solution, we will be using recurrent neural networks and long short-term memory (LSTM) units.

Chapter 6, Job Recommendation Engine, is where we build our own dataset, which can be used to make a job recommendation engine. We will also use an already available dataset. We will be using basic statistical techniques to get the best possible solution.

Chapter 7, Text Summarization, covers an application that generates an extractive summary of a medical transcription. We will be using Python libraries for our baseline approach. After that, we will be using various vectorization and ranking techniques to generate the summary of a medical document. We will also generate a summary of Amazon's product reviews.

Chapter 8, Developing Chatbots, develops a chatbot using the rule-based approach and deep learning-based approach. We will be using TensorFlow and Keras to build chatbots.

Chapter 9, Building a Real-Time Object Recognition App, teaches transfer learning. We will learn about convolutional neural networks and the YOLO (You Only Look Once) algorithm. We will be using pre-trained models to develop the application.

Chapter 10, Face Recognition and Face Emotion Recognition, covers an application to recognize human faces. During the second half of this chapter, we will be developing an application that can recognize facial expressions of humans. We will be using OpenCV, Keras, and TensorFlow to build this application.

Chapter 11, Building Gaming Bots, teaches reinforcement learning. Here, we will be using the gym and universe libraries to get the gaming environment. We'll first understand the Q-learning algorithm, and later on we will implement it to train our gaming bot. Here, we are building a bot for Atari games.
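The heart of this chapter is the Q-learning update rule, Q(s,a) ← Q(s,a) + α[r + γ max Q(s',·) − Q(s,a)]. As a taste of what is to come, here is a tabular sketch on a made-up five-state corridor environment (move right to reach the goal); this toy setup is purely illustrative and is not the Atari/gym environment used in the chapter.

```python
import numpy as np

# Toy 5-state corridor: the agent starts at state 0 and must reach state 4.
# Actions: 0 = move left, 1 = move right. Reward 1 on reaching the goal.
n_states, n_actions = 5, 2
alpha, gamma, epsilon = 0.5, 0.9, 0.1
Q = np.zeros((n_states, n_actions))
rng = np.random.default_rng(0)

for episode in range(200):
    state = 0
    while state != n_states - 1:
        # Epsilon-greedy action selection: explore occasionally
        if rng.random() < epsilon:
            action = int(rng.integers(n_actions))
        else:
            action = int(np.argmax(Q[state]))
        next_state = min(state + 1, n_states - 1) if action == 1 else max(state - 1, 0)
        reward = 1.0 if next_state == n_states - 1 else 0.0
        # Q-learning update: Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
        Q[state, action] += alpha * (reward + gamma * Q[next_state].max() - Q[state, action])
        state = next_state

# The learned policy should prefer moving right in every non-terminal state
print(np.argmax(Q[:-1], axis=1))
```

The same update rule drives the gaming bot in the chapter; only the environment and the representation of Q change.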

Appendix A, List of Cheat Sheets, shows cheat sheets for various Python libraries that we frequently use in data science applications.

Appendix B, Strategy for Winning Hackathons, tells you what a possible strategy for winning hackathons can be. I have also listed some useful resources that can help you stay up to date.

To get the most out of this book

Basic to intermediate knowledge of mathematics, probability, statistics, and calculus is required.

Basic to intermediate knowledge of Machine Learning (ML) algorithms is also required.

Decent knowledge of Python is required.

While reading a chapter, please run the code so that you can understand the flow of the application. All the code is available on GitHub at https://github.com/jalajthanaki/Awesome_Machine_Learning_Solutions.

Links of code are specified in the chapters. Installation instructions for each application are also available on GitHub.

You need a minimum of 8 GB of RAM to run the applications smoothly. If you can run the code on a GPU, that would be great; otherwise, you can use the pre-trained models. You can download the pre-trained models using the GitHub or Google Drive links specified in the chapters.

Download the example code files

You can download the example code files for this book from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files emailed directly to you.

You can download the code files by following these steps:

1. Log in or register at http://www.packtpub.com.
2. Select the SUPPORT tab.
3. Click on Code Downloads & Errata.
4. Enter the name of the book in the Search box and follow the on-screen instructions.

Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:

WinRAR / 7-Zip for Windows
Zipeg / iZip / UnRarX for Mac
7-Zip / PeaZip for Linux

The code bundle for the book is also hosted on GitHub at https://github.com/PacktPublishing/Machine-Learning-Solutions. In case there's an update to the code, it will be updated on the existing GitHub repository.

We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!

Conventions used

There are a number of text conventions used throughout this book.

CodeInText: Indicates code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles. For example; "Mount the downloaded WebStorm-10*.dmg disk image file as another disk in your system."

A block of code is set as follows:

from __future__ import print_function
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

When we wish to draw your attention to a particular part of a code block, the relevant lines or items are set in bold:

from __future__ import print_function
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

Any command-line input or output is written as follows:

# cp /usr/src/asterisk-addons/configs/cdr_mysql.conf.sample /etc/asterisk/cdr_mysql.conf

Bold: Indicates a new term, an important word, or words that you see on the screen, for example, in menus or dialog boxes, also appear in the text like this. For example: "Select System info from the Administration panel."

Note

Warnings or important notes appear in a box like this.

Tip

Tips and tricks appear like this.

Get in touch

Feedback from our readers is always welcome.

General feedback: Email [email protected], and mention the book's title in the subject of your message. If you have questions about any aspect of this book, please email us at [email protected].

Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book we would be grateful if you would report this to us. Please visit, http://www.packtpub.com/submit-errata, selecting your book, clicking on the Errata Submission Form link, and entering the details.

Piracy: If you come across any illegal copies of our works in any form on the Internet, we would be grateful if you would provide us with the location address or website name. Please contact us at [email protected] with a link to the material.

If you are interested in becoming an author: If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, please visit http://authors.packtpub.com.

Reviews

Please leave a review. Once you have read and used this book, why not leave a review on the site that you purchased it from? Potential readers can then see and use your unbiased opinion to make purchase decisions, we at Packt can understand what you think about our products, and our authors can see your feedback on their book. Thank you!

For more information about Packt, please visit packtpub.com.

Chapter 1. Credit Risk Modeling

All the chapters in this book are practical applications. We will develop one application per chapter. First, we will understand the application and choose the appropriate dataset for developing it. After analyzing the dataset, we will build the baseline approach for the particular application. Later on, we will develop a revised approach that resolves the shortcomings of the baseline approach. Finally, we will see how we can develop the best possible solution using an appropriate optimization strategy for the given application. During this development process, we will learn the key Machine Learning concepts we need. I recommend that you run the code given in this book, as that will help you understand the concepts really well.

In this chapter, we will look at one of the many interesting applications of predictive analysis. I have selected the finance domain to begin with, and we are going to build an algorithm that can predict loan defaults. This is one of the most widely used predictive analysis applications in the finance domain. Here, we will look at how to develop an optimal solution for predicting loan defaults. We will cover all of the elements that will help us build this application.

We will cover the following topics in this chapter:

Introducing the problem statement
Understanding the dataset
Understanding attributes of the dataset
Data analysis
Features engineering for the baseline model
Selecting an ML algorithm
Training the baseline model
Understanding the testing matrix
Testing the baseline model
Problems with the existing approach
How to optimize the existing approach
Understanding key concepts to optimize the approach
Hyperparameter tuning
Implementing the revised approach
Testing the revised approach
Understanding the problem with the revised approach
The best approach
Implementing the best approach
Summary

Introducing the problem statement

First of all, let's try to understand the application that we want to develop, or the problem that we are trying to solve. Once we understand the problem statement and its use case, it will be much easier for us to develop the application. So let's begin!

Here, we want to help financial companies, such as banks, NBFS, lenders, and so on. We will make an algorithm that can predict to whom financial institutions should give loans or credit. Now you may ask: what is the significance of this algorithm? Let me explain in detail. When a financial institution lends money to a customer, it takes on some amount of risk. So, before lending, financial institutions check whether or not the borrower will have enough money in the future to pay back their loan. Based on the customer's current income and expenditure, many financial institutions perform some kind of analysis that helps them decide whether the borrower will be a good customer for that bank or not. This kind of analysis is manual and time-consuming, so it needs some kind of automation. If we develop an algorithm, it will help financial institutions gauge their customers efficiently and effectively.

Your next question may be: what is the output of our algorithm? Our algorithm will generate a probability. This probability value will indicate the chances of the borrower defaulting. Defaulting means the borrower cannot repay their loan within a certain amount of time. Here, the probability indicates the chances of a customer not paying their loan EMI on time, resulting in default. So, a higher probability value indicates that the customer would be a bad or inappropriate borrower (customer) for the financial institution, as they may default in the next 2 years. A lower probability value indicates that the customer will be a good or appropriate borrower (customer) for the financial institution and will not default in the next 2 years.
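To make the idea of a default probability concrete before we dive into the real dataset, here is a minimal sketch using logistic regression. The features (monthly income and debt-to-income ratio), the data points, and the two test applicants are all hypothetical, invented just for this illustration; the chapter builds the actual model on a real credit dataset with many more attributes.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical training data: [monthly_income (thousands), debt_to_income_ratio]
# Label: 1 = defaulted within 2 years, 0 = did not default.
X = np.array([[2.0, 0.90], [2.5, 0.80], [3.0, 0.70], [6.0, 0.20],
              [7.0, 0.30], [8.0, 0.10], [2.2, 0.95], [9.0, 0.15]])
y = np.array([1, 1, 1, 0, 0, 0, 1, 0])

model = LogisticRegression()
model.fit(X, y)

# predict_proba returns [P(no default), P(default)] per applicant;
# the second value is the default probability our algorithm outputs.
risky = model.predict_proba([[2.1, 0.85]])[0, 1]  # low income, high debt
safe = model.predict_proba([[8.5, 0.20]])[0, 1]   # high income, low debt
print(risky, safe)
```

The low-income, high-debt applicant receives a higher default probability than the high-income, low-debt one, which is exactly the kind of ranking the financial institution can use to decide whom to lend to.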

Here, I have given you information regarding the problem statement and its output, but there is an important aspect of this algorithm: its input. So, let's discuss what our input will be!