Java Deep Learning Projects - Md. Rezaul Karim - E-Book

Java Deep Learning Projects E-Book

Md. Rezaul Karim

Description

Java is one of the most widely used programming languages. With the rise of deep learning, it has become a popular choice of tool among data scientists and machine learning experts.
Java Deep Learning Projects starts with an overview of deep learning concepts and then delves into advanced projects. You will see how to build several projects using different deep neural network architectures such as multilayer perceptrons, Deep Belief Networks, CNN, LSTM, and Factorization Machines.
You will get acquainted with popular deep and machine learning libraries for Java such as Deeplearning4j, Spark ML, and RankSys and you’ll be able to use their features to build and deploy projects on distributed computing environments.
You will then explore advanced domains such as transfer learning and deep reinforcement learning using the Java ecosystem, covering various real-world domains such as healthcare, NLP, image classification, and multimedia analytics with an easy-to-follow approach. Expert reviews and tips will follow every project to give you insights and hacks.
By the end of this book, you will have taken your deep learning expertise in Java beyond theory and will be able to build your own advanced deep learning systems.

Page count: 450

Year of publication: 2018




Java Deep Learning Projects
Implement 10 real-world deep learning applications using Deeplearning4j and open source APIs
Md. Rezaul Karim
BIRMINGHAM - MUMBAI

Java Deep Learning Projects

Copyright © 2018 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

Commissioning Editor: Sunith Shetty
Acquisition Editor: Tushar Gupta
Content Development Editor: Karan Thakkar
Technical Editor: Dinesh Pawar
Copy Editor: Vikrant Phadkay
Project Coordinator: Nidhi Joshi
Proofreader: Safis Editing
Indexer: Rekha Nair
Graphics: Tania Dutta
Production Coordinator: Arvindkumar Gupta

First published: June 2018

Production reference: 1280618

Published by Packt Publishing Ltd. Livery Place 35 Livery Street Birmingham B3 2PB, UK.

ISBN 978-1-78899-745-4

www.packtpub.com

mapt.io

Mapt is an online digital library that gives you full access to over 5,000 books and videos, as well as industry leading tools to help you plan your personal development and advance your career. For more information, please visit our website.

Why subscribe?

Spend less time learning and more time coding with practical eBooks and Videos from over 4,000 industry professionals

Improve your learning with Skill Plans built especially for you

Get a free eBook or video every month

Mapt is fully searchable

Copy and paste, print, and bookmark content

PacktPub.com

Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at [email protected] for more details.

At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters, and receive exclusive discounts and offers on Packt books and eBooks.

Contributors

About the author

Md. Rezaul Karim is a Research Scientist at Fraunhofer FIT, Germany. He is also a PhD candidate at RWTH Aachen University, Germany. Before joining FIT, he was a Researcher at Insight Centre for Data Analytics, Ireland. Before that, he was a Lead Engineer at Samsung Electronics, Korea.

He has 9 years of R&D experience in Java, Scala, Python, and R. He has hands-on experience in Spark, Zeppelin, Hadoop, Keras, scikit-learn, TensorFlow, Deeplearning4j, and H2O. He has published several research papers in top-ranked journals/conferences focusing on bioinformatics and deep learning.

About the reviewer

Joao Bosco Jares is a Software Engineer with 12 years of experience in machine learning, Semantic Web and IoT. Previously, he was a Software Engineer at IBM Watson, Insight Centre for Data Analytics, Brazilian Northeast Bank, and Bank of Amazonia, Brazil.

He has an MSc and a BSc in computer science, and a data science postgraduate degree. He is also an IBM Jazz RTC Certified Professional, Oracle Certified Master Java EE 6 Enterprise Architect, and Sun Java Certified Programmer.

Packt is searching for authors like you

If you're interested in becoming an author for Packt, please visit authors.packtpub.com and apply today. We have worked with thousands of developers and tech professionals, just like you, to help them share their insight with the global tech community. You can make a general application, apply for a specific hot topic that we are recruiting an author for, or submit your own idea.

Table of Contents

Title Page

Copyright and Credits

Java Deep Learning Projects

Packt Upsell

Why subscribe?

PacktPub.com

Contributors

About the author

About the reviewer

Packt is searching for authors like you

Preface

Who this book is for

What this book covers

To get the most out of this book

Download the example code files

Download the color images

Conventions used

Get in touch

Reviews

Getting Started with Deep Learning

A soft introduction to ML

Working principles of ML algorithms

Supervised learning

Unsupervised learning

Reinforcement learning

Putting ML tasks altogether

Delving into deep learning

How did DL take ML into next level?

Artificial Neural Networks

Biological neurons

A brief history of ANNs

How does an ANN learn?

ANNs and the backpropagation algorithm

Forward and backward passes

Weights and biases

Weight optimization

Activation functions

Neural network architectures

Deep neural networks

Multilayer Perceptron

Deep belief networks

Autoencoders

Convolutional neural networks

Recurrent neural networks

Emergent architectures

Residual neural networks

Generative adversarial networks

Capsule networks

DL frameworks and cloud platforms

Deep learning frameworks

Cloud-based platforms for DL

Deep learning from a disaster – Titanic survival prediction

Problem description

Configuring the programming environment

Feature engineering and input dataset preparation

Training MLP classifier

Evaluating the MLP classifier

Frequently asked questions (FAQs)

Summary

Answers to FAQs

Cancer Types Prediction Using Recurrent Type Networks

Deep learning in cancer genomics

Cancer genomics dataset description

Preparing programming environment

Titanic survival revisited with DL4J

Multilayer perceptron network construction

Hidden layer 1

Hidden layer 2

Output layer

Network training

Evaluating the model

Cancer type prediction using an LSTM network

Dataset preparation for training

Recurrent and LSTM networks

Dataset preparation

LSTM network construction

Network training

Evaluating the model

Frequently asked questions (FAQs)

Summary

Answers to questions

Multi-Label Image Classification Using Convolutional Neural Networks

Image classification and drawbacks of DNNs

CNN architecture

Convolutional operations

Pooling and padding operations

Fully connected layer (dense layer)

Multi-label image classification using CNNs

Problem description

Description of the dataset

Removing invalid images

Workflow of the overall project

Image preprocessing

Extracting image metadata

Image feature extraction

Preparing the ND4J dataset

Training, evaluating, and saving the trained CNN models

Network construction

Scoring the model

Submission file generation

Wrapping everything up by executing the main() method

Frequently asked questions (FAQs)

Summary

Answers to questions

Sentiment Analysis Using Word2Vec and LSTM Network

Sentiment analysis is a challenging task

Using Word2Vec for neural word embeddings

Datasets and pre-trained model description

Large Movie Review dataset for training and testing

Folder structure of the dataset

Description of the sentiment labeled dataset

Word2Vec pre-trained model

Sentiment analysis using Word2Vec and LSTM

Preparing the train and test set using the Word2Vec model

Network construction, training, and saving the model

Restoring the trained model and evaluating it on the test set

Making predictions on sample review texts

Frequently asked questions (FAQs)

Summary

Answers to questions

Transfer Learning for Image Classification

Image classification with pretrained VGG16

DL4J and transfer learning

Developing an image classifier using transfer learning

Dataset collection and description

Architecture choice and adoption

Train and test set preparation

Network training and evaluation

Restoring the trained model and inferencing

Making simple inferencing

Frequently asked questions (FAQs)

Summary

Answers to questions

Real-Time Object Detection using YOLO, JavaCV, and DL4J

Object detection from images and videos

Object classification, localization, and detection

Convolutional Sliding Window (CSW)

Object detection from videos

You Only Look Once (YOLO)

Developing a real-time object detection project

Step 1 – Loading a pre-trained YOLO model

Step 2 – Generating frames from video clips

Step 3 – Feeding generated frames into Tiny YOLO model

Step 4 – Object detection from image frames

Step 5 – Non-max suppression in case of more than one bounding box

Step 6 – wrapping up everything and running the application

Frequently asked questions (FAQs)

Summary

Answers to questions

Stock Price Prediction Using LSTM Network

State-of-the-art automated stock trading

Developing a stock price predictive model

Data collection and exploratory analysis

Preparing the training and test sets

LSTM network construction

Network training, and saving the trained model

Restoring the saved model for inferencing

Evaluating the model

Frequently asked questions (FAQs)

Summary

Answers to questions

Distributed Deep Learning – Video Classification Using Convolutional LSTM Networks

Distributed deep learning across multiple GPUs

Distributed training on GPUs with DL4J

Video classification using convolutional – LSTM

UCF101 – action recognition dataset

Preprocessing and feature engineering

Solving the encoding problem

Data processing workflow

Simple UI for checking video frames

Preparing training and test sets

Network creation and training

Performance evaluation

Distributed training on AWS deep learning AMI 9.0

Frequently asked questions (FAQs)

Summary

Answers to questions

Playing GridWorld Game Using Deep Reinforcement Learning

Notation, policy, and utility for RL

Notations in reinforcement learning

Policy

Utility

Neural Q-learning

Introduction to QLearning

Neural networks as a Q-function

Developing a GridWorld game using a deep Q-network

Generating the grid

Calculating agent and goal positions

Calculating the action mask

Providing guidance action

Calculating the reward

Flattening input for the input layer

Network construction and training

Playing the GridWorld game

Frequently asked questions (FAQs)

Summary

Answers to questions

Developing Movie Recommendation Systems Using Factorization Machines

Recommendation systems

Recommendation approaches

Collaborative filtering approaches

Content-based filtering approaches

Hybrid recommender systems

Model-based collaborative filtering

The utility matrix

The cold-start problem in collaborative-filtering approaches

Factorization machines in recommender systems

Developing a movie recommender system using FMs

Dataset description and exploratory analysis

Movie rating prediction

Converting the dataset into LibFM format

Training and test set preparation

Movie rating prediction

Which one makes more sense – ranking or rating?

Frequently asked questions (FAQs)

Summary

Answers to questions

Discussion, Current Trends, and Outlook

Discussion and outlook

Discussion on the completed projects

Titanic survival prediction using MLP and LSTM networks

Cancer type prediction using recurrent type networks

Image classification using convolutional neural networks

Sentiment analysis using Word2Vec and the LSTM network

Image classification using transfer learning

Real-time object detection using YOLO, JavaCV, and DL4J

Stock price prediction using LSTM network

Distributed deep learning – video classification using a convolutional-LSTM network

Using deep reinforcement learning for GridWorld

Movie recommender system using factorization machines

Current trends and outlook

Current trends

Outlook on emergent DL architectures

Residual neural networks

GANs

Capsule networks (CapsNet)

Semantic image segmentation

Deep learning for clustering analysis

Frequently asked questions (FAQs)

Answers to questions

Other Books You May Enjoy

Leave a review - let other readers know what you think

Preface

The continued growth in data, coupled with the need to make increasingly complex decisions against that data, is creating massive hurdles that prevent organizations from deriving insights in a timely manner using traditional analytical approaches.

Deep learning, a branch of machine learning based on learning multiple levels of abstraction, evolved to extract meaningful value and insights from such data. Neural networks, which are at the core of deep learning, are used in predictive analytics, computer vision, natural language processing, time series forecasting, and a myriad of other complex tasks.

To date, most of the available DL books are written for Python. This book, however, is conceived for developers, data scientists, machine learning practitioners, and deep learning enthusiasts who want to build powerful, robust, and accurate predictive models with the power of Deeplearning4j (a JVM-based DL framework), combined with other open source Java APIs.

Throughout the book, you will learn how to develop practical applications for AI systems using feedforward neural networks, convolutional neural networks, recurrent neural networks, autoencoders, and factorization machines. Additionally, you will learn how to run your deep learning programs on GPUs in a distributed way.

After finishing the book, you will be familiar with machine learning techniques, in particular the use of Java for deep learning, and will be ready to apply your knowledge in research or commercial projects.

In summary, this book is not meant to be read cover to cover. You can jump to a chapter that looks like something you are trying to accomplish or one that simply ignites your interest.

Happy reading!

Who this book is for

Developers, data scientists, machine learning practitioners, and deep learning enthusiasts who wish to learn how to develop real-life deep learning projects by harnessing the power of JVM-based Deeplearning4j (DL4J), Spark, RankSys, and other open source libraries will find this book extremely useful. A sound understanding of Java is needed. In addition, some basic prior experience of Spark, DL4J, and Maven-based project management will help you pick up the concepts more quickly.

What this book covers

Chapter 1, Getting Started with Deep Learning, explains some basic concepts of machine learning and artificial neural networks as the core of deep learning. It then briefly discusses existing and emerging neural network architectures. Next, it covers various features of deep learning frameworks and libraries. Then it shows how to solve Titanic survival prediction using a Spark-based Multilayer Perceptron (MLP). Finally, it discusses some frequent questions related to this project and the general DL area.

Chapter 2, Cancer Types Prediction Using Recurrent Type Networks, demonstrates how to develop a DL application for cancer type classification from a very-high-dimensional gene expression dataset. First, it performs the necessary feature engineering so that the dataset can be fed into a Long Short-Term Memory (LSTM) network. Finally, it discusses some frequent questions related to this project and to DL4J hyperparameter and network tuning.

Chapter 3, Multi-Label Image Classification Using Convolutional Neural Networks, demonstrates how to develop an end-to-end project for handling the multi-label image classification problem using a CNN on top of the DL4J framework on real Yelp image datasets. It discusses how to tune hyperparameters for better classification results.

Chapter 4, Sentiment Analysis Using Word2Vec and the LSTM Network, shows how to develop a hands-on deep learning project that classifies review texts as either positive or negative sentiments. A large-scale movie review dataset will be used to train the LSTM model, and Word2Vec will be used as the neural embedding. Finally, it shows sample predictions for other review datasets.

Chapter 5, Transfer Learning for Image Classification, demonstrates how to develop an end-to-end project to solve dog versus cat image classification using a pre-trained VGG-16 model. We wrap up everything in a Java JFrame and JPanel application to make the overall pipeline easy to follow when making sample predictions.

Chapter 6, Real-Time Object Detection Using YOLO, JavaCV, and DL4J, shows how to develop an end-to-end project that detects objects in video frames while the video clips play continuously. The pre-trained YOLO v2 model is reused through transfer learning, and the JavaCV API handles the video frames on top of DL4J.

Chapter 7, Stock Price Prediction Using the LSTM Network, demonstrates how to predict a real-life stock's open, close, low, high, or volume values using LSTM on top of the DL4J framework. A time series generated from a real-life stock dataset is used to train the LSTM model, which then predicts the price only one day ahead at each time step.

Chapter 8, Distributed Deep Learning on Cloud – Video Classification Using Convolutional LSTM Network, shows how to develop an end-to-end project that accurately classifies a large collection of video clips (for example, UCF101) using a combined CNN and LSTM network on top of DL4J. The training is carried out on an Amazon EC2 GPU compute cluster. This end-to-end project can also be treated as a primer for human activity recognition from video.

Chapter 9, Playing GridWorld Game Using Deep Reinforcement Learning, is all about designing a machine learning system driven by criticisms and rewards. It then shows how to develop a GridWorld game using DL4J, RL4J, and neural Q-learning, in which a neural network acts as the Q function.

Chapter 10, Developing Movie Recommendation Systems Using Factorization Machines, is about developing a sample project using factorization machines to predict both the rating and ranking of movies. It discusses some theoretical background of recommendation systems using matrix factorization and collaborative filtering, before diving into the project implementation using RankSys-library-based FMs.

Chapter 11, Discussion, Current Trends, and Outlook, wraps up everything by discussing the completed projects and some abstract takeaways. Then it provides some improvement suggestions. Additionally, it covers some extension guidelines for other real-life deep learning projects.

To get the most out of this book

All the examples have been implemented using Deeplearning4j with some open source libraries in Java. To be more specific, the following API/tools are required:

Java/JDK version 1.8

Spark version 2.3.0

Spark csv_2.11 version 1.3.0

ND4j backend version nd4j-cuda-9.0-platform for GPU, otherwise nd4j-native

ND4j version >=1.0.0-alpha

DL4j version >=1.0.0-alpha

Datavec version >=1.0.0-alpha

Arbiter version >=1.0.0-alpha

Logback version 1.2.3

JavaCV platform version 1.4.1

HTTP Client version 4.3.5

Jfreechart 1.0.13

Jcodec 0.2.3

Eclipse Mars or Luna (latest) or IntelliJ IDEA

Maven Eclipse plugin (2.9 or higher)

Maven compiler plugin for Eclipse (2.3.2 or higher)

Maven assembly plugin for Eclipse (2.4.1 or higher)

Regarding operating system: Linux distributions are preferable (including Debian, Ubuntu, Fedora, RHEL, and CentOS). For Ubuntu, for example, a complete installation of 14.04 (LTS) 64-bit (or later) is recommended, either natively or under VMware Player 12 or VirtualBox. You can also run Spark jobs on Windows (XP/7/8/10) or Mac OS X (10.4.7+).

Regarding hardware configuration: A machine or server with a Core i5 processor, about 100 GB of disk space, and at least 16 GB of RAM is recommended. In addition, an Nvidia GPU driver has to be installed, with CUDA and cuDNN configured, if you want to perform the training on a GPU. Enough storage for running heavy jobs is needed (depending on the dataset size you will be handling), preferably at least 50 GB of free disk storage (for standalone and for SQL warehouse).

Download the example code files

You can download the example code files for this book from your account at www.packtpub.com. If you purchased this book elsewhere, you can visit www.packtpub.com/support and register to have the files emailed directly to you.

You can download the code files by following these steps:

Log in or register at www.packtpub.com.

Select the SUPPORT tab.

Click on Code Downloads & Errata.

Enter the name of the book in the Search box and follow the onscreen instructions.

Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:

WinRAR/7-Zip for Windows

Zipeg/iZip/UnRarX for Mac

7-Zip/PeaZip for Linux

The code bundle for the book is also hosted on GitHub at https://github.com/PacktPublishing/Java-Deep-Learning-Projects. In case there's an update to the code, it will be updated on the existing GitHub repository.

We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!

Download the color images

We also provide a PDF file that has color images of the screenshots/diagrams used in this book. You can download it here: https://www.packtpub.com/sites/default/files/downloads/JavaDeepLearningProjects_ColorImages.pdf.

Conventions used

There are a number of text conventions used throughout this book.

CodeInText: Indicates code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles. Here is an example: "Then, I unzipped and copied each .csv file into a folder called label."

A block of code is set as follows:

<properties>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
    <java.version>1.8</java.version>
</properties>

When we wish to draw your attention to a particular part of a code block, the relevant lines or items are set in bold:

<properties>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
    <java.version>1.8</java.version>
</properties>

Bold: Indicates a new term, an important word, or words that you see onscreen. For example, words in menus or dialog boxes appear in the text like this. Here is an example: "We then read and process images into PhotoID | Vector map"

Warnings or important notes appear like this.
Tips and tricks appear like this.

Get in touch

Feedback from our readers is always welcome.

General feedback: Email [email protected] and mention the book title in the subject of your message. If you have questions about any aspect of this book, please email us at [email protected].

Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you would report this to us. Please visit www.packtpub.com/submit-errata, selecting your book, clicking on the Errata Submission Form link, and entering the details.

Piracy: If you come across any illegal copies of our works in any form on the Internet, we would be grateful if you would provide us with the location address or website name. Please contact us at [email protected] with a link to the material.

If you are interested in becoming an author: If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, please visit authors.packtpub.com.

Reviews

Please leave a review. Once you have read and used this book, why not leave a review on the site that you purchased it from? Potential readers can then see and use your unbiased opinion to make purchase decisions, we at Packt can understand what you think about our products, and our authors can see your feedback on their book. Thank you!

For more information about Packt, please visit packtpub.com.

Getting Started with Deep Learning

In this chapter, we will explain some basic concepts of Machine Learning (ML) and Deep Learning (DL) that will be used in all subsequent chapters. We will start with a brief introduction to ML. Then we will move on to DL, which is one of the emerging branches of ML.

We will briefly discuss some of the most well-known and widely used neural network architectures. Next, we will look at various features of deep learning frameworks and libraries. Then we will see how to prepare a programming environment, before moving on to coding with some open source, deep learning libraries such as DeepLearning4J (DL4J).

Then we will solve a very famous ML problem, Titanic survival prediction, using an Apache Spark-based Multilayer Perceptron (MLP) classifier. Finally, we'll see some frequently asked questions that will help us generalize our basic understanding of DL. Briefly, the following topics will be covered:

A soft introduction to ML

Artificial Neural Networks (ANNs)

Deep neural network architectures

Deep learning frameworks

Deep learning from disasters—Titanic survival prediction using MLP

Frequently asked questions (FAQ)

A soft introduction to ML

ML approaches are based on a set of statistical and mathematical algorithms in order to carry out tasks such as classification, regression analysis, concept learning, predictive modeling, clustering, and mining of useful patterns. Thus, with the use of ML, we aim at improving the learning experience such that it becomes automatic. Consequently, we may not need complete human interactions, or at least we can reduce the level of such interactions as much as possible.

Working principles of ML algorithms

We now refer to a famous definition of ML by Tom M. Mitchell (Machine Learning, Tom Mitchell, McGraw Hill), where he explained what learning really means from a computer science perspective:

"A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E."

Based on this definition, we can conclude that a computer program or machine can do the following:

Learn from data and histories

Improve with experience

Iteratively enhance a model that can be used to predict outcomes of questions

Since they are at the core of predictive analytics, almost every ML algorithm we use can be treated as an optimization problem. This is about finding parameters that minimize an objective function, for example, a weighted sum of two terms like a cost function and regularization. Typically, an objective function has two components:

A regularizer, which controls the complexity of the model

The loss, which measures the error of the model on the training data.

The regularization parameter, in turn, defines the trade-off between minimizing the training error and limiting the model's complexity in an effort to avoid overfitting. If both of these components are convex, then their sum is also convex; otherwise, the sum is not guaranteed to be convex. More elaborately, when using an ML algorithm, the goal is to obtain the parameters of a function that return the minimum error when making predictions. Therefore, when the objective is convex, we can use a convex optimization technique to minimize the function until it converges towards the minimum error.
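The loss-plus-regularizer idea can be illustrated with a minimal sketch in plain Java (no DL4J or Spark). It assumes a one-feature linear model, a mean squared error loss, and an L2 penalty weighted by a hypothetical parameter `lambda`; the class and method names are illustrative only. Both terms are convex in the weight, so plain gradient descent with a small enough learning rate approaches the global minimum.

```java
final class RegularizedObjective {

    // Objective = mean squared error (loss) + lambda * w^2 (L2 regularizer).
    static double objective(double[] x, double[] y, double w, double lambda) {
        double loss = 0.0;
        for (int i = 0; i < x.length; i++) {
            double err = w * x[i] - y[i];
            loss += err * err;
        }
        return loss / x.length + lambda * w * w;
    }

    // Derivative of the objective with respect to the single weight w.
    static double gradient(double[] x, double[] y, double w, double lambda) {
        double g = 0.0;
        for (int i = 0; i < x.length; i++) {
            g += 2.0 * (w * x[i] - y[i]) * x[i];
        }
        return g / x.length + 2.0 * lambda * w;
    }

    // Plain gradient descent starting from w = 0.
    static double minimize(double[] x, double[] y, double lambda,
                           double learningRate, int iterations) {
        double w = 0.0;
        for (int t = 0; t < iterations; t++) {
            w -= learningRate * gradient(x, y, w, lambda);
        }
        return w;
    }

    public static void main(String[] args) {
        double[] x = {1.0, 2.0, 3.0, 4.0};
        double[] y = {2.0, 4.0, 6.0, 8.0}; // toy data generated by y = 2x
        double wPlain = minimize(x, y, 0.0, 0.01, 5000);
        double wRegularized = minimize(x, y, 1.0, 0.01, 5000);
        System.out.println("w without regularization: " + wPlain);
        System.out.println("w with lambda = 1.0:      " + wRegularized);
        System.out.println("objective at regularized w: "
                + objective(x, y, wRegularized, 1.0));
    }
}
```

For this toy data the minimizer can be worked out by hand as w = 15 / (7.5 + lambda), so the unregularized weight approaches 2 while the regularized one is pulled towards zero. This is the trade-off between training error and model complexity described above.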

Given that a problem is convex, it is usually easier to analyze the asymptotic behavior of the algorithm, which shows how fast it converges as the model observes more and more training data. The challenge of ML is to allow training a model so that it can recognize complex patterns and make decisions not only in an automated way but also as intelligently as possible. The entire learning process requires input datasets that can be split (or are already provided) into three types, outlined as follows:

A training set is the knowledge base coming from historical or live data used to fit the parameters of the ML algorithm. During the training phase, the ML model uses the training set to find the optimal weights of the network and to optimize the objective function by minimizing the training error. Here, the back-prop rule (or another more advanced optimizer with a proper updater; we'll see this later on) is used to train the model, but all the hyperparameters need to be set before the learning process starts.

A validation set is a set of examples used to tune the hyperparameters of an ML model. It helps ensure that the model is trained well and generalizes, that is, that it avoids overfitting. Some ML practitioners refer to it as a development set or dev set as well.

A test set is used for evaluating the performance of the trained model on unseen data. This step is also referred to as model inferencing. After assessing the final model on the test set (that is, when we're fully satisfied with the model's performance), we do not tune the model any further; the trained model can be deployed in a production-ready environment.

A common practice is splitting the input data (after necessary pre-processing and feature engineering) into, for example, 60% for training, 20% for validation, and 20% for testing, but it really depends on use cases. Also, sometimes we need to perform up-sampling or down-sampling on the data based on the availability and quality of the datasets.
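A random three-way split of this kind can be sketched in a few lines of plain Java. The class name `DataSplitter`, the fixed seed, and the fractions used in `main` are assumptions made for illustration; the test set simply receives whatever remains after the training and validation cuts.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

final class DataSplitter {

    // Shuffles a copy of the data and cuts it into [train, validation, test].
    // The test part receives whatever remains after the first two cuts.
    static <T> List<List<T>> split(List<T> data, double trainFraction,
                                   double validationFraction, long seed) {
        if (trainFraction < 0 || validationFraction < 0
                || trainFraction + validationFraction > 1.0) {
            throw new IllegalArgumentException(
                    "fractions must be non-negative and sum to at most 1");
        }
        List<T> shuffled = new ArrayList<T>(data);
        Collections.shuffle(shuffled, new Random(seed)); // fixed seed: reproducible split

        int trainEnd = (int) Math.round(shuffled.size() * trainFraction);
        int validationEnd = Math.min(shuffled.size(),
                trainEnd + (int) Math.round(shuffled.size() * validationFraction));

        List<List<T>> parts = new ArrayList<List<T>>();
        parts.add(new ArrayList<T>(shuffled.subList(0, trainEnd)));
        parts.add(new ArrayList<T>(shuffled.subList(trainEnd, validationEnd)));
        parts.add(new ArrayList<T>(shuffled.subList(validationEnd, shuffled.size())));
        return parts;
    }

    public static void main(String[] args) {
        List<Integer> data = new ArrayList<Integer>();
        for (int i = 0; i < 100; i++) {
            data.add(i);
        }
        // Illustrative fractions only; the right proportions depend on the use case.
        List<List<Integer>> parts = split(data, 0.6, 0.2, 42L);
        System.out.println("train=" + parts.get(0).size()
                + " validation=" + parts.get(1).size()
                + " test=" + parts.get(2).size());
    }
}
```

Shuffling before cutting matters when the data arrives sorted (for example, by date or by class); without it, one of the three sets could end up with examples from only part of the distribution.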

Moreover, learning theory uses mathematical tools derived from probability theory and information theory. Three learning paradigms will be briefly discussed:

Supervised learning

Unsupervised learning

Reinforcement learning

The following diagram summarizes the three types of learning, along with the problems they address:

Types of learning and related problems

Supervised learning

Supervised learning is the simplest and most well-known automatic learning task. It is based on a number of pre-defined examples, in which the category to which each of the inputs should belong is already known. Figure 2 shows a typical workflow of supervised learning.

An actor (for example, an ML practitioner, data scientist, data engineer, ML engineer, and so on) performs Extraction Transformation Load (ETL) and the necessary feature engineering (including feature extraction, selection, and so on) to get the appropriate data having features and labels. Then he does the following:

Splits the data into training, development, and test sets

Uses the training set to train an ML model

Uses the validation set to validate the training against overfitting and to guide regularization

He then evaluates the model's performance on the test set (that is, unseen data)

If the performance is not satisfactory, he can perform additional tuning to get the best model based on hyperparameter optimization

Finally, he deploys the best model in a production-ready environment

Supervised learning in action

In the overall life cycle, there might be many actors involved (for example, a data engineer, data scientist, or ML engineer) to perform each step independently or collaboratively.

The supervised learning context includes classification and regression tasks; classification is used to predict which class a data point is part of (discrete value), while regression is used to predict continuous values. In other words, a classification task is used to predict the label of the class attribute, while a regression task is used to make a numeric prediction of the class attribute.

In the context of supervised learning, unbalanced data refers to classification problems where we have unequal instances for different classes. For example, if we have a classification task for only two classes, balanced data would mean 50% pre-classified examples for each of the classes.

If the input dataset is a little unbalanced (for example, 60% of the data points in one class and 40% in the other), the learning process requires the input dataset to be split randomly into three sets, with 50% for the training set, 20% for the validation set, and the remaining 30% for the testing set.

Unsupervised learning

In unsupervised learning, an input set is supplied to the system during the training phase. In contrast with supervised learning, the input objects are not labeled with their class. For classification, we assumed that we are given a training dataset of correctly labeled data. Unfortunately, we do not always have that advantage when we collect data in the real world.

For example, let's say you have a large collection of totally legal, not pirated, MP3 files in a crowded and massive folder on your hard drive. In such a case, how could we possibly group songs together if we do not have direct access to their metadata? One possible approach could be to mix various ML techniques, but clustering is often the best solution.

Now, what if you could build a clustering model that automatically groups similar songs together and organizes them into your favorite categories, such as country, rap, rock, and so on? In short, unsupervised learning algorithms are commonly used in clustering problems. The following diagram gives us an idea of a clustering technique applied to solve this kind of problem:

Clustering techniques – an example of unsupervised learning

Although the data points are not labeled, we can still do the necessary feature engineering and group a set of objects in such a way that objects in the same group (called a cluster) are brought together. This is not easy for a human to do by hand. Instead, a standard approach is to define a similarity measure between two objects and then look for clusters of objects that are more similar to each other than they are to the objects in the other clusters. Once we've clustered the data points (that is, MP3 files) and completed the validation, we know the pattern of the data (that is, what type of MP3 files fall into which group).
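As a rough illustration of the similarity-based grouping just described, the following sketch implements plain k-means, which is only one of many clustering algorithms. It assumes each song has already been reduced to a small numeric feature vector; the two features, the sample values, and the starting centroids are made up for the example. Squared Euclidean distance serves as the (dis)similarity measure.

```java
import java.util.Arrays;

final class KMeansSketch {

    /** Squared Euclidean distance; a smaller value means the two points are more similar. */
    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int d = 0; d < a.length; d++) {
            double diff = a[d] - b[d];
            sum += diff * diff;
        }
        return sum;
    }

    /**
     * Runs k-means from the given starting centroids (updated in place) and
     * returns, for each point, the index of the cluster it ends up in.
     */
    static int[] cluster(double[][] points, double[][] centroids, int maxIterations) {
        int[] assignment = new int[points.length];
        Arrays.fill(assignment, -1);
        for (int iter = 0; iter < maxIterations; iter++) {
            boolean changed = false;
            // Assignment step: attach every point to its nearest (most similar) centroid.
            for (int p = 0; p < points.length; p++) {
                int best = 0;
                for (int c = 1; c < centroids.length; c++) {
                    if (squaredDistance(points[p], centroids[c]) < squaredDistance(points[p], centroids[best])) {
                        best = c;
                    }
                }
                if (assignment[p] != best) {
                    assignment[p] = best;
                    changed = true;
                }
            }
            if (!changed) {
                break; // assignments are stable, so the clustering has converged
            }
            // Update step: move each centroid to the mean of the points assigned to it.
            for (int c = 0; c < centroids.length; c++) {
                double[] sum = new double[centroids[c].length];
                int count = 0;
                for (int p = 0; p < points.length; p++) {
                    if (assignment[p] == c) {
                        for (int d = 0; d < sum.length; d++) {
                            sum[d] += points[p][d];
                        }
                        count++;
                    }
                }
                if (count > 0) {
                    for (int d = 0; d < sum.length; d++) {
                        centroids[c][d] = sum[d] / count;
                    }
                }
            }
        }
        return assignment;
    }

    public static void main(String[] args) {
        // Hypothetical 2-D song features (for example, rescaled tempo and loudness).
        double[][] songs = {{1.0, 1.1}, {0.9, 1.0}, {1.2, 0.8}, {5.0, 5.2}, {5.1, 4.9}, {4.8, 5.0}};
        double[][] centroids = {{0.0, 0.0}, {6.0, 6.0}};
        // prints [0, 0, 0, 1, 1, 1]
        System.out.println(Arrays.toString(cluster(songs, centroids, 100)));
    }
}
```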

Reinforcement learning

Reinforcement learning is an artificial intelligence approach that focuses on the learning of the system through its interactions with the environment. In reinforcement learning, the system's parameters are adapted based on the feedback obtained from the environment, which in turn responds to the decisions made by the system. The following diagram shows a person making decisions in order to arrive at their destination.

Let's take an example of the route you take from home to work. In this case, you take the same route to work every day. However, out of the blue, one day you get curious and decide to try a different route with a view to finding the shortest path. This dilemma of trying out new routes or sticking to the best-known route is an example of exploration versus exploitation:

An agent always tries to reach the destination
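One simple way to balance exploration and exploitation is the epsilon-greedy rule, sketched below for the commuting example. This is a technique we bring in purely for illustration; the class name, the route estimates, and the learning-rate parameter are hypothetical.

```java
import java.util.Random;

final class EpsilonGreedySketch {

    /**
     * Picks a route index. With probability epsilon a random route is tried (exploration);
     * otherwise the route with the shortest estimated travel time is taken (exploitation).
     */
    static int chooseRoute(double[] estimatedMinutes, double epsilon, Random rng) {
        if (rng.nextDouble() < epsilon) {
            return rng.nextInt(estimatedMinutes.length);
        }
        int best = 0;
        for (int r = 1; r < estimatedMinutes.length; r++) {
            if (estimatedMinutes[r] < estimatedMinutes[best]) {
                best = r;
            }
        }
        return best;
    }

    /** Moves the estimate for the chosen route towards the travel time just observed. */
    static void update(double[] estimatedMinutes, int route, double observedMinutes, double learningRate) {
        estimatedMinutes[route] += learningRate * (observedMinutes - estimatedMinutes[route]);
    }

    public static void main(String[] args) {
        double[] estimates = {30.0, 22.0, 27.0}; // made-up travel-time estimates in minutes
        Random rng = new Random(7L);
        int route = chooseRoute(estimates, 0.1, rng); // explore on roughly 10% of the days
        update(estimates, route, 25.0, 0.5);          // feedback from the environment
        System.out.println("Took route " + route + ", new estimate " + estimates[route]);
    }
}
```

With epsilon set to 0, the agent always sticks to the best-known route; with epsilon set to 1, it always explores.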

We can take a look at one more example in terms of a system modeling a chess player. In order to improve its performance, the system utilizes the result of its previous moves; such a system is said to be a system learning with reinforcement.

Putting ML tasks all together

We have seen the basic working principles of ML algorithms. Then we have seen what the basic ML tasks are and how they formulate domain-specific problems. Now let's take a look at how we can summarize ML tasks and some applications in the following diagram:

ML tasks and some use cases from different application domains

However, the preceding figure lists only a few use cases and applications using different ML tasks. In practice, ML is used in numerous use cases and applications. We will try to cover a few of those throughout this book.

Delving into deep learning

Simple ML methods that were used in normal-size data analysis are not effective anymore and should be substituted by more robust ML methods. Although classical ML techniques allow researchers to identify groups or clusters of related variables, the accuracy and effectiveness of these methods diminish with large and high-dimensional datasets.

Here comes deep learning, which is one of the most important developments in artificial intelligence in the last few years. Deep learning is a branch of ML based on a set of algorithms that attempt to model high-level abstractions in data.

How did DL take ML to the next level?

In short, deep learning algorithms are mostly a set of ANNs that can make better representations of large-scale datasets, in order to build models that learn these representations very extensively. Nowadays deep learning is not limited to ANNs, and many theoretical advances and software and hardware improvements were necessary to get to this point. In this regard, Ian Goodfellow et al. (Deep Learning, MIT Press, 2016) defined deep learning as follows:

"Deep learning is a particular kind of machine learning that achieves great power and flexibility by learning to represent the world as a nested hierarchy of concepts, with each concept defined in relation to simpler concepts, and more abstract representations computed in terms of less abstract ones."

Let's take an example; suppose we want to develop a predictive analytics model, such as an animal recognizer, where our system has to resolve two problems:

To classify whether an image represents a cat or a dog

To cluster images of dogs and cats.

If we solve the first problem using a typical ML method, we must define the facial features (ears, eyes, whiskers, and so on) and write a method to identify which features (typically nonlinear) are more important when classifying a particular animal.

However, at the same time, we cannot address the second problem because classical ML algorithms for clustering images (such as k-means) cannot handle nonlinear features. Deep learning algorithms take these two problems one step further: they determine which features are the most important for classification or clustering and extract them automatically.

In contrast, when using a classical ML algorithm, we would have to provide the features manually. In summary, the deep learning workflow would be as follows:

A deep learning algorithm would first identify the edges that are most relevant when clustering cats or dogs. It would then try to find various combinations of shapes and edges hierarchically. This step is called ETL.

After several iterations, hierarchical identification of complex concepts and features is carried out. Then, based on the identified features, the DL algorithm automatically decides which of these features are most significant (statistically) to classify the animal. This step is feature extraction.

Finally, it takes out the label column and performs unsupervised training using AutoEncoders (AEs) to extract the latent features to be redistributed to k-means for clustering.

Then the clustering assignment hardening loss (CAH loss) and reconstruction loss are jointly optimized towards optimal clustering assignment. Deep Embedding Clustering (see more at https://arxiv.org/pdf/1511.06335.pdf) is an example of such an approach. We will discuss deep learning-based clustering approaches in Chapter 11, Discussion, Current Trends, and Outlook.

Up to this point, we have seen that deep learning systems are able to recognize what an image represents. A computer does not see an image as we see it because it only knows the position of each pixel and its color. Using deep learning techniques, the image is divided into various layers of analysis.

At a lower level, the software analyzes, for example, a grid of a few pixels with the task of detecting a type of color or various nuances. If it finds something, it informs the next level, which at this point checks whether or not that given color belongs to a larger form, such as a line. The process continues to the upper levels until the system recognizes what is shown in the image. The following diagram shows what we have discussed in the case of an image classification system:

A deep learning system at work on a dog versus cat classification problem

More precisely, the preceding image classifier can be built layer by layer, as follows:

Layer 1: The algorithm starts identifying the dark and light pixels from the raw images

Layer 2: The algorithm then identifies edges and shapes

Layer 3: It then learns more complex shapes and objects

Layer 4: The algorithm then learns which objects define a human face

Although this is a very simple classifier, software capable of doing these types of things is now widespread and is found in systems for recognizing faces, or in those for searching by an image on Google, for example. These pieces of software are based on deep learning algorithms.

In contrast, by using a linear ML algorithm, we cannot build such applications since these algorithms are incapable of handling nonlinear image features. Also, using ML approaches, we typically handle only a few hyperparameters. However, when neural networks are brought to the party, things become much more complex. Across the layers, there can be millions or even billions of parameters (weights) to tune, so much so that the cost function becomes non-convex.

Another reason is that activation functions used in hidden layers are nonlinear, so the cost is non-convex. We will discuss this phenomenon in more detail in later chapters but let's take a quick look at ANNs.

Artificial Neural Networks

ANNs are the foundation of deep learning. They are loosely modeled on the human nervous system, which consists of a number of neurons that communicate with each other using axons.

Biological neurons

The working principles of ANNs are inspired by how a human brain works, depicted in Figure 7. The receptors receive the stimuli either internally or from the external world; then they pass the information into the biological neurons for further processing. There are a number of dendrites, in addition to another long extension called the axon.

Towards its extremity, there are minuscule structures called synaptic terminals, used to connect one neuron to the dendrites of other neurons. Biological neurons receive short electrical impulses called signals from other neurons, and in response, they trigger their own signals:

Working principle of biological neurons

We can thus summarize that the neuron comprises a cell body (also known as the soma), one or more dendrites for receiving signals from other neurons, and an axon for carrying out the signals generated by the neurons.

A neuron is in an active state when it is sending signals to other neurons. However, when it is receiving signals from other neurons, it is in an inactive state. In an idle state, a neuron accumulates all the signals received before reaching a certain activation threshold. This behavior motivated researchers to introduce the ANN.

A brief history of ANNs

Inspired by the working principles of biological neurons, Warren McCulloch and Walter Pitts proposed the first artificial neuron model in 1943 in terms of a computational model of nervous activity. This simple model of a biological neuron, also known as an artificial neuron (AN), has one or more binary (on/off) inputs and one output only.

An AN simply activates its output when more than a certain number of its inputs are active. For example, here we see a few ANNs that perform various logical operations. In this example, we assume that a neuron is activated only when at least two of its inputs are active:

ANNs performing simple logical computations
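The threshold rule above (a neuron fires only when at least two of its inputs are active) can be sketched in a few lines of Java. The wiring used for AND, OR, and identity below is our assumption about how such networks are usually drawn, with an input connected twice where needed; the class and method names are illustrative.

```java
final class McCullochPittsSketch {

    /** An artificial neuron fires when at least `threshold` of its binary inputs are active. */
    static boolean fires(int threshold, boolean... inputs) {
        int active = 0;
        for (boolean input : inputs) {
            if (input) {
                active++;
            }
        }
        return active >= threshold;
    }

    /** Logical AND: each input has one connection, so both must be active to reach two. */
    static boolean and(boolean a, boolean b) {
        return fires(2, a, b);
    }

    /** Logical OR: each input is wired in twice, so one active input already counts as two. */
    static boolean or(boolean a, boolean b) {
        return fires(2, a, a, b, b);
    }

    /** Identity: the single input is wired in twice, so the output simply copies it. */
    static boolean identity(boolean a) {
        return fires(2, a, a);
    }

    public static void main(String[] args) {
        boolean[] values = {false, true};
        for (boolean a : values) {
            for (boolean b : values) {
                System.out.println(a + ", " + b + " -> AND=" + and(a, b) + ", OR=" + or(a, b));
            }
        }
    }
}
```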

The example may sound trivial, but even with such a simplified model, it is possible to build a network of ANs, and these networks can in turn be combined to compute complex logical expressions. This simplified model inspired John von Neumann, Marvin Minsky, Frank Rosenblatt, and many others to come up with another model called a perceptron back in 1957.

The perceptron is one of the simplest ANN architectures we've seen in the last 60 years. It is based on a slightly different AN called a Linear Threshold Unit (LTU). The only difference is that the inputs and outputs are now numbers instead of binary on/off values. Each input connection is associated with a weight. The LTU computes a weighted sum of its inputs, then applies a step function (which plays the role of an activation function) to that sum, and outputs the result:

The left-side figure represents an LTU and the right-side figure shows a perceptron
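The LTU computation can be written down directly. The sketch below assumes a Heaviside step function and an explicit threshold parameter; both are our choices for illustration, as are the sample weights and inputs.

```java
final class LtuSketch {

    /** Heaviside step function: 1 when the argument is zero or more, otherwise 0. */
    static int step(double z) {
        return z >= 0.0 ? 1 : 0;
    }

    /** Weighted sum of the inputs: w1*x1 + w2*x2 + ... + wn*xn. */
    static double weightedSum(double[] weights, double[] inputs) {
        if (weights.length != inputs.length) {
            throw new IllegalArgumentException("weights and inputs must have the same length");
        }
        double z = 0.0;
        for (int i = 0; i < weights.length; i++) {
            z += weights[i] * inputs[i];
        }
        return z;
    }

    /** LTU output: the step function applied to the weighted sum minus the threshold. */
    static int ltu(double[] weights, double[] inputs, double threshold) {
        return step(weightedSum(weights, inputs) - threshold);
    }

    public static void main(String[] args) {
        double[] weights = {0.5, -1.0, 2.0};
        double[] inputs = {1.0, 2.0, 0.5};
        System.out.println(weightedSum(weights, inputs)); // prints -0.5
        System.out.println(ltu(weights, inputs, 0.0));    // prints 0
    }
}
```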

One of the downsides of a perceptron is that its decision boundary is linear. Therefore, perceptrons are incapable of learning complex patterns. They are also incapable of solving some simple problems such as Exclusive OR (XOR). Later on, however, these limitations were overcome to some extent by stacking multiple perceptrons into what is called a Multilayer Perceptron (MLP).
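To see why stacking helps, here is a small sketch of a two-layer network of threshold units that computes XOR. The weights and biases are set by hand for illustration (a bias here acts as a negative threshold) rather than learned: one hidden unit behaves like OR, the other like NAND, and the output unit combines them with AND.

```java
final class XorMlpSketch {

    /** A threshold unit: fires (returns 1) when bias + weighted sum is zero or more. */
    static int ltu(double[] weights, double bias, int[] inputs) {
        double z = bias;
        for (int i = 0; i < weights.length; i++) {
            z += weights[i] * inputs[i];
        }
        return z >= 0.0 ? 1 : 0;
    }

    /** XOR built from three threshold units with hand-picked weights. */
    static int xor(int a, int b) {
        int[] in = {a, b};
        int orUnit = ltu(new double[] {1.0, 1.0}, -0.5, in);    // hidden unit 1: a OR b
        int nandUnit = ltu(new double[] {-1.0, -1.0}, 1.5, in); // hidden unit 2: NOT (a AND b)
        // Output unit: fires only when both hidden units fire.
        return ltu(new double[] {1.0, 1.0}, -1.5, new int[] {orUnit, nandUnit});
    }

    public static void main(String[] args) {
        for (int a = 0; a <= 1; a++) {
            for (int b = 0; b <= 1; b++) {
                System.out.println(a + " XOR " + b + " = " + xor(a, b));
            }
        }
    }
}
```

No single threshold unit can produce this truth table, because the two classes cannot be separated by one straight line.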

How does an ANN learn?

Based on the concept of biological neurons, the term and the idea of ANs arose. Similarly to biological neurons, the artificial neuron consists of the following:

One or more incoming connections that aggregate signals from neurons

One or more output connections for carrying the signal to the other neurons

An activation function, which determines the numerical value of the output signal

The learning process of a neural network is configured as an iterative process of optimization of the weights (see more in the next section). The weights are updated in each epoch. Once the training starts, the aim is to generate predictions by minimizing the loss function. The performance of the network is then evaluated on the test set.

Now we know the simple concept of an artificial neuron. However, generating artificial signals alone is not enough to learn a complex task. The network also needs a training algorithm, and the one most commonly used to train a complex ANN in a supervised setting is the backpropagation algorithm.

ANNs and the backpropagation algorithm

The backpropagation algorithm aims to minimize the error between the current and the desired output. Since the network is feedforward, the activation flow always proceeds forward from the input units to the output units.

The gradient of the cost function is backpropagated and the network weights get updated; the overall method can be applied to any number of hidden layers recursively. In such a method, the interplay between the two phases (the forward pass and the backward pass) is important. In short, the basic steps of the training procedure are as follows:

Initialize the network with some random (or more advanced Xavier-initialized) weights

For all training cases, follow the steps of forward and backward passes as outlined next

Forward and backward passes

In the forward pass, a number of operations are performed to obtain some predictions or scores. In such an operation, a graph is created, connecting all dependent operations in a top-to-bottom fashion. Then the network's error is computed, which is the difference between the predicted output and the actual output.

On the other hand, the backward pass mainly involves mathematical operations: derivatives are created for all differentiable operations in the graph (that is, by auto-differentiation methods) and then combined using the chain rule, so that the gradient of the loss function can be used to update the network weights.

In this pass, for all layers starting with the output layer back to the input layer, the algorithm compares the network layer's output with the correct output (the error function). Then it adapts the weights in the current layer to minimize the error function. This is backpropagation's optimization step. By the way, there are two types of auto-differentiation methods:

Reverse mode: Differentiation of a single output with respect to all inputs

Forward mode: Differentiation of all outputs with respect to one input
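Forward mode is the easier of the two to sketch. One standard way to implement it is with dual numbers, where every value carries its derivative with respect to one chosen input along with it. The Dual class and the example function f(x) = x*x + 3x below are our own illustrative choices. Backpropagation itself corresponds to reverse mode, which is not shown here.

```java
final class ForwardModeSketch {

    /** A dual number: a value together with its derivative with respect to one chosen input. */
    static final class Dual {
        final double value;
        final double derivative;

        Dual(double value, double derivative) {
            this.value = value;
            this.derivative = derivative;
        }

        /** Sum rule: (u + v)' = u' + v'. */
        Dual plus(Dual other) {
            return new Dual(value + other.value, derivative + other.derivative);
        }

        /** Product rule: (u * v)' = u' * v + u * v'. */
        Dual times(Dual other) {
            return new Dual(value * other.value, derivative * other.value + value * other.derivative);
        }
    }

    /** The input we differentiate with respect to has derivative 1. */
    static Dual variable(double x) {
        return new Dual(x, 1.0);
    }

    /** Constants have derivative 0. */
    static Dual constant(double c) {
        return new Dual(c, 0.0);
    }

    /** f(x) = x*x + 3x, chosen only for illustration; its derivative is 2x + 3. */
    static Dual f(Dual x) {
        return x.times(x).plus(constant(3.0).times(x));
    }

    public static void main(String[] args) {
        Dual y = f(variable(2.0));
        System.out.println(y.value + " " + y.derivative); // prints 10.0 7.0
    }
}
```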

The backpropagation algorithm processes the information in such a way that the network decreases the global error during the learning iterations; however, this does not guarantee that the global minimum is reached. The presence of hidden units and the nonlinearity of the output function mean that the behavior of the error is very complex and has many local minima.

This backpropagation step is typically performed thousands or millions of times, using many training batches, until the model parameters converge to values that minimize the cost function. The training process ends when the error on the validation set begins to increase, because this could mark the beginning of an overfitting phase.
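The forward pass, the chain rule, and the weight update can be seen together in the smallest possible setting: a single sigmoid neuron trained with batch gradient descent. This is only a sketch under our own assumptions (a squared-error loss, the logical OR function as toy training data, a learning rate of 0.5, and zero-initialized weights, which is acceptable for a single unit), not a full multilayer implementation, and it leaves out validation-based stopping.

```java
final class BackpropSketch {

    /** A single sigmoid neuron with weights and bias starting at zero. */
    static final class Neuron {
        final double[] weights;
        double bias;

        Neuron(int inputs) {
            this.weights = new double[inputs];
        }

        /** Forward pass: weighted sum plus bias, squashed by the sigmoid. */
        double predict(double[] x) {
            double z = bias;
            for (int i = 0; i < weights.length; i++) {
                z += weights[i] * x[i];
            }
            return 1.0 / (1.0 + Math.exp(-z));
        }

        /** Mean of 0.5 * (output - target)^2 over all examples. */
        double loss(double[][] xs, double[] ys) {
            double sum = 0.0;
            for (int n = 0; n < xs.length; n++) {
                double diff = predict(xs[n]) - ys[n];
                sum += 0.5 * diff * diff;
            }
            return sum / xs.length;
        }

        /** One epoch of batch gradient descent; gradients come from the chain rule. */
        void trainEpoch(double[][] xs, double[] ys, double learningRate) {
            double[] gradW = new double[weights.length];
            double gradB = 0.0;
            for (int n = 0; n < xs.length; n++) {
                double out = predict(xs[n]);        // forward pass
                double dLossDOut = out - ys[n];     // derivative of 0.5 * (out - y)^2 w.r.t. out
                double dOutDZ = out * (1.0 - out);  // derivative of the sigmoid w.r.t. its input
                double delta = dLossDOut * dOutDZ;  // chain rule: derivative of the loss w.r.t. z
                for (int i = 0; i < weights.length; i++) {
                    gradW[i] += delta * xs[n][i];
                }
                gradB += delta;
            }
            for (int i = 0; i < weights.length; i++) {
                weights[i] -= learningRate * gradW[i] / xs.length;
            }
            bias -= learningRate * gradB / xs.length;
        }
    }

    public static void main(String[] args) {
        double[][] xs = {{0, 0}, {0, 1}, {1, 0}, {1, 1}};
        double[] ys = {0, 1, 1, 1}; // logical OR as a toy target
        Neuron neuron = new Neuron(2);
        System.out.println("loss before: " + neuron.loss(xs, ys));
        for (int epoch = 0; epoch < 5000; epoch++) {
            neuron.trainEpoch(xs, ys, 0.5);
        }
        System.out.println("loss after:  " + neuron.loss(xs, ys));
    }
}
```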

Weights and biases

Besides the state of a neuron, the synaptic weight is considered, which influences the connection within the network. Each weight has a numerical value indicated by $w_{ij}$, which is the synaptic weight connecting neuron i to neuron j.

Synaptic weight: This concept evolved from biology and refers to the strength or amplitude of a connection between two nodes, corresponding in biology to the amount of influence the firing of one neuron has on another.

For each neuron (also known as a unit) i, an input vector can be defined by $x_i = (x_1, x_2, \ldots, x_n)$ and a weight vector can be defined by $w_i = (w_{i1}, w_{i2}, \ldots, w_{in})$. Now, depending on the position of a neuron, the weights and the output function determine the behavior of an individual neuron. Then during forward propagation, each unit in the hidden layer gets the following signal:

$$\mathrm{net}_i = \sum_{j=1}^{n} w_{ij} x_j$$

In addition, among the weights, there is also a special type of weight called the bias unit b. Technically, bias units aren't connected to any previous layer, so they don't have true activity. But still, the bias b value allows the neural network to shift the activation function to the left or right. Now, taking the bias unit into consideration, the modified network output can be formulated as follows:

$$\mathrm{net}_i = \sum_{j=1}^{n} w_{ij} x_j + b$$

The preceding equation signifies that each hidden unit gets the sum of inputs multiplied by the corresponding weights (the summing junction). Then the result of the summing junction is passed through the activation function, which squashes the output as depicted in the following figure:

Artificial neuron model

Now, a tricky question: how do we initialize the weights? Well, if we initialize all weights to the same value (for example, 0 or 1), each hidden neuron will get exactly the same signal. Let's try to break it down:

If all weights are initialized to 1, then each unit gets a signal equal to the sum of the inputs

If all weights are 0, which is even worse, every neuron in a hidden layer will get zero signal

For network weight initialization, Xavier initialization is nowadays used widely. It is similar to random initialization but often turns out to work much better since it can automatically determine the scale of initialization based on the number of input and output neurons.

Interested readers should refer to this publication for detailed info: Xavier Glorot and Yoshua Bengio, Understanding the difficulty of training deep feedforward neural networks: proceedings of the 13th international conference on Artificial Intelligence and Statistics (AISTATS) 2010, Chia Laguna Resort, Sardinia, Italy; Volume 9 of JMLR: W&CP.
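As a rough sketch of the idea, the uniform variant described in the Glorot and Bengio paper draws each weight from the range [-limit, limit], where limit = sqrt(6 / (fanIn + fanOut)). The class and method names below are illustrative only; deep learning libraries ship their own implementations, which may use a different variant.

```java
import java.util.Random;

final class XavierInitSketch {

    /**
     * Returns a fanIn x fanOut weight matrix whose entries are drawn uniformly
     * from [-limit, limit), with limit = sqrt(6 / (fanIn + fanOut)).
     */
    static double[][] xavierUniform(int fanIn, int fanOut, Random rng) {
        double limit = Math.sqrt(6.0 / (fanIn + fanOut));
        double[][] weights = new double[fanIn][fanOut];
        for (int i = 0; i < fanIn; i++) {
            for (int j = 0; j < fanOut; j++) {
                // nextDouble() lies in [0, 1), so this maps it onto [-limit, limit).
                weights[i][j] = (rng.nextDouble() * 2.0 - 1.0) * limit;
            }
        }
        return weights;
    }

    public static void main(String[] args) {
        double[][] w = xavierUniform(784, 100, new Random(42L));
        System.out.println("limit = " + Math.sqrt(6.0 / (784 + 100)) + ", first weight = " + w[0][0]);
    }
}
```

The more neurons a layer connects, the smaller the limit becomes, which is how the scale of initialization adapts to the number of input and output neurons.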

You may be wondering whether you can get rid of random initialization while training a regular DNN (for example, MLP or DBN). Well, recently, some researchers have been talking about random orthogonal matrix initializations that perform better than just any random initialization for training DNNs.

When it comes to initializing the biases, we can initialize them to be zero. But setting the biases to a small constant value such as 0.01 for all biases ensures that all Rectified Linear Unit (ReLU) units can propagate some gradient. However, it neither performs well nor shows consistent improvement. Therefore, sticking with zero is recommended.

Activation functions

To allow a neural network to learn complex decision boundaries, we apply a non-linear activation function to some of its layers. Commonly used functions include Tanh, ReLU, softmax, and variants of these. More technically, each neuron receives as its input signal the weighted sum of the synaptic weights and the activation values of the neurons connected to it. One of the most widely used functions for this purpose is the so-called sigmoid function. It is a special case of the logistic function, which is defined by the following formula:

$$\sigma(x) = \frac{1}{1 + e^{-x}}$$

The domain of this function includes all real numbers, and the co-domain is (0, 1). This means that any value obtained as an output from a neuron (as per the calculation of its activation state) will always be between zero and one. The sigmoid function, as represented in the following diagram, provides an interpretation of the saturation rate of a neuron, from not being active (0) to complete saturation, which occurs at a predetermined maximum value (1).

On the other hand, a hyperbolic tangent, or tanh, is another form of the activation function. Tanh squashes a real-valued number to the range (-1, 1). In particular, mathematically, the tanh activation function can be expressed as follows:

$$\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$$

The preceding equation can be represented in the following figure:

Sigmoid versus tanh activation function
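Both functions are short enough to write out directly from their definitions. The following sketch uses our own class and method names; in practice, Java's built-in Math.tanh is the safer choice, because the naive formula below overflows for inputs of very large magnitude.

```java
final class ActivationSketch {

    /** Logistic sigmoid: maps any real number into the open interval (0, 1). */
    static double sigmoid(double x) {
        return 1.0 / (1.0 + Math.exp(-x));
    }

    /** Hyperbolic tangent from its definition: maps any real number into (-1, 1). */
    static double tanh(double x) {
        double ePos = Math.exp(x);
        double eNeg = Math.exp(-x);
        return (ePos - eNeg) / (ePos + eNeg);
    }

    public static void main(String[] args) {
        for (double x = -4.0; x <= 4.0; x += 2.0) {
            System.out.println("x=" + x + "  sigmoid=" + sigmoid(x) + "  tanh=" + tanh(x));
        }
    }
}
```

The two are closely related: tanh(x) equals 2 * sigmoid(2x) - 1, so tanh is a rescaled sigmoid centered at zero.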

In general, in the last level of a feedforward neural network (FFNN), the softmax function is applied as the decision boundary. This is a common case, especially when solving a classification problem. In probability theory, the output of the softmax function is squashed into a probability distribution over K different possible outcomes. The softmax function is therefore used in various multiclass classification methods, such that the network's output is distributed across the classes (that is, as a probability distribution over the classes), with each output lying between 0 and 1 and all outputs summing to 1.
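A minimal softmax sketch follows. Subtracting the largest score before exponentiating is a common numerical-stability trick that does not change the result; the class name and the sample scores are our own.

```java
import java.util.Arrays;

final class SoftmaxSketch {

    /** Turns K raw scores into K probabilities that are positive and sum to 1. */
    static double[] softmax(double[] scores) {
        double max = Double.NEGATIVE_INFINITY;
        for (double s : scores) {
            max = Math.max(max, s);
        }
        double[] out = new double[scores.length];
        double sum = 0.0;
        for (int i = 0; i < scores.length; i++) {
            out[i] = Math.exp(scores[i] - max); // shifting by max avoids overflow
            sum += out[i];
        }
        for (int i = 0; i < out.length; i++) {
            out[i] /= sum;
        }
        return out;
    }

    public static void main(String[] args) {
        // Hypothetical raw outputs of the last layer for three classes.
        System.out.println(Arrays.toString(softmax(new double[] {1.0, 2.0, 3.0})));
    }
}
```

The class with the largest raw score always receives the largest probability, so taking the most probable class gives the network's prediction.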