Java is one of the most widely used programming languages. With the rise of deep learning, it has become a popular choice of tool among data scientists and machine learning experts.
Java Deep Learning Projects starts with an overview of deep learning concepts and then delves into advanced projects. You will see how to build several projects using different deep neural network architectures such as multilayer perceptrons, Deep Belief Networks, CNN, LSTM, and Factorization Machines.
You will get acquainted with popular deep and machine learning libraries for Java such as Deeplearning4j, Spark ML, and RankSys and you’ll be able to use their features to build and deploy projects on distributed computing environments.
You will then explore advanced domains such as transfer learning and deep reinforcement learning using the Java ecosystem, covering various real-world domains such as healthcare, NLP, image classification, and multimedia analytics with an easy-to-follow approach. Expert reviews and tips will follow every project to give you insights and hacks.
By the end of this book, you will have stepped up your expertise in deep learning with Java, taken it beyond theory, and be able to build your own advanced deep learning systems.
Number of pages: 450
Year of publication: 2018
Copyright © 2018 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
Commissioning Editor: Sunith Shetty
Acquisition Editor: Tushar Gupta
Content Development Editor: Karan Thakkar
Technical Editor: Dinesh Pawar
Copy Editor: Vikrant Phadkay
Project Coordinator: Nidhi Joshi
Proofreader: Safis Editing
Indexer: Rekha Nair
Graphics: Tania Dutta
Production Coordinator: Arvindkumar Gupta
First published: June 2018
Production reference: 1280618
Published by Packt Publishing Ltd. Livery Place 35 Livery Street Birmingham B3 2PB, UK.
ISBN 978-1-78899-745-4
www.packtpub.com
Mapt is an online digital library that gives you full access to over 5,000 books and videos, as well as industry leading tools to help you plan your personal development and advance your career. For more information, please visit our website.
Spend less time learning and more time coding with practical eBooks and Videos from over 4,000 industry professionals
Improve your learning with Skill Plans built especially for you
Get a free eBook or video every month
Mapt is fully searchable
Copy and paste, print, and bookmark content
Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at [email protected] for more details.
At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters, and receive exclusive discounts and offers on Packt books and eBooks.
Md. Rezaul Karim is a Research Scientist at Fraunhofer FIT, Germany. He is also a PhD candidate at RWTH Aachen University, Germany. Before joining FIT, he was a Researcher at Insight Centre for Data Analytics, Ireland. Before that, he was a Lead Engineer at Samsung Electronics, Korea.
He has 9 years of R&D experience in Java, Scala, Python, and R. He has hands-on experience in Spark, Zeppelin, Hadoop, Keras, scikit-learn, TensorFlow, Deeplearning4j, and H2O. He has published several research papers in top-ranked journals/conferences focusing on bioinformatics and deep learning.
Joao Bosco Jares is a Software Engineer with 12 years of experience in machine learning, Semantic Web and IoT. Previously, he was a Software Engineer at IBM Watson, Insight Centre for Data Analytics, Brazilian Northeast Bank, and Bank of Amazonia, Brazil.
He has an MSc and a BSc in computer science, and a data science postgraduate degree. He is also an IBM Jazz RTC Certified Professional, Oracle Certified Master Java EE 6 Enterprise Architect, and Sun Java Certified Programmer.
If you're interested in becoming an author for Packt, please visit authors.packtpub.com and apply today. We have worked with thousands of developers and tech professionals, just like you, to help them share their insight with the global tech community. You can make a general application, apply for a specific hot topic that we are recruiting an author for, or submit your own idea.
Title Page
Copyright and Credits
Java Deep Learning Projects
Packt Upsell
Why subscribe?
PacktPub.com
Contributors
About the author
About the reviewer
Packt is searching for authors like you
Preface
Who this book is for
What this book covers
To get the most out of this book
Download the example code files
Download the color images
Conventions used
Get in touch
Reviews
Getting Started with Deep Learning
A soft introduction to ML
Working principles of ML algorithms
Supervised learning
Unsupervised learning
Reinforcement learning
Putting ML tasks altogether
Delving into deep learning
How did DL take ML into next level?
Artificial Neural Networks
Biological neurons
A brief history of ANNs
How does an ANN learn?
ANNs and the backpropagation algorithm
Forward and backward passes
Weights and biases
Weight optimization
Activation functions
Neural network architectures
Deep neural networks
Multilayer Perceptron
Deep belief networks
Autoencoders
Convolutional neural networks
Recurrent neural networks
Emergent architectures
Residual neural networks
Generative adversarial networks
Capsule networks
DL frameworks and cloud platforms
Deep learning frameworks
Cloud-based platforms for DL
Deep learning from a disaster – Titanic survival prediction
Problem description
Configuring the programming environment
Feature engineering and input dataset preparation
Training MLP classifier
Evaluating the MLP classifier
Frequently asked questions (FAQs)
Summary
Answers to FAQs
Cancer Types Prediction Using Recurrent Type Networks
Deep learning in cancer genomics
Cancer genomics dataset description
Preparing programming environment
Titanic survival revisited with DL4J
Multilayer perceptron network construction
Hidden layer 1
Hidden layer 2
Output layer
Network training
Evaluating the model
Cancer type prediction using an LSTM network
Dataset preparation for training
Recurrent and LSTM networks
Dataset preparation
LSTM network construction
Network training
Evaluating the model
Frequently asked questions (FAQs)
Summary
Answers to questions
Multi-Label Image Classification Using Convolutional Neural Networks
Image classification and drawbacks of DNNs
CNN architecture
Convolutional operations
Pooling and padding operations
Fully connected layer (dense layer)
Multi-label image classification using CNNs
Problem description
Description of the dataset
Removing invalid images
Workflow of the overall project
Image preprocessing
Extracting image metadata
Image feature extraction
Preparing the ND4J dataset
Training, evaluating, and saving the trained CNN models
Network construction
Scoring the model
Submission file generation
Wrapping everything up by executing the main() method
Frequently asked questions (FAQs)
Summary
Answers to questions
Sentiment Analysis Using Word2Vec and LSTM Network
Sentiment analysis is a challenging task
Using Word2Vec for neural word embeddings
Datasets and pre-trained model description
Large Movie Review dataset for training and testing
Folder structure of the dataset
Description of the sentiment labeled dataset
Word2Vec pre-trained model
Sentiment analysis using Word2Vec and LSTM
Preparing the train and test set using the Word2Vec model
Network construction, training, and saving the model
Restoring the trained model and evaluating it on the test set
Making predictions on sample review texts
Frequently asked questions (FAQs)
Summary
Answers to questions
Transfer Learning for Image Classification
Image classification with pretrained VGG16
DL4J and transfer learning
Developing an image classifier using transfer learning
Dataset collection and description
Architecture choice and adoption
Train and test set preparation
Network training and evaluation
Restoring the trained model and inferencing
Making simple inferencing
Frequently asked questions (FAQs)
Summary
Answers to questions
Real-Time Object Detection using YOLO, JavaCV, and DL4J
Object detection from images and videos
Object classification, localization, and detection
Convolutional Sliding Window (CSW)
Object detection from videos
You Only Look Once (YOLO)
Developing a real-time object detection project
Step 1 – Loading a pre-trained YOLO model
Step 2 – Generating frames from video clips
Step 3 – Feeding generated frames into Tiny YOLO model
Step 4 – Object detection from image frames
Step 5 – Non-max suppression in case of more than one bounding box
Step 6 – wrapping up everything and running the application
Frequently asked questions (FAQs)
Summary
Answers to questions
Stock Price Prediction Using LSTM Network
State-of-the-art automated stock trading
Developing a stock price predictive model
Data collection and exploratory analysis
Preparing the training and test sets
LSTM network construction
Network training, and saving the trained model
Restoring the saved model for inferencing
Evaluating the model
Frequently asked questions (FAQs)
Summary
Answers to questions
Distributed Deep Learning – Video Classification Using Convolutional LSTM Networks
Distributed deep learning across multiple GPUs
Distributed training on GPUs with DL4J
Video classification using convolutional – LSTM
UCF101 – action recognition dataset
Preprocessing and feature engineering
Solving the encoding problem
Data processing workflow
Simple UI for checking video frames
Preparing training and test sets
Network creation and training
Performance evaluation
Distributed training on AWS deep learning AMI 9.0
Frequently asked questions (FAQs)
Summary
Answers to questions
Playing GridWorld Game Using Deep Reinforcement Learning
Notation, policy, and utility for RL
Notations in reinforcement learning
Policy
Utility
Neural Q-learning
Introduction to QLearning
Neural networks as a Q-function
Developing a GridWorld game using a deep Q-network
Generating the grid
Calculating agent and goal positions
Calculating the action mask
Providing guidance action
Calculating the reward
Flattening input for the input layer
Network construction and training
Playing the GridWorld game
Frequently asked questions (FAQs)
Summary
Answers to questions
Developing Movie Recommendation Systems Using Factorization Machines
Recommendation systems
Recommendation approaches
Collaborative filtering approaches
Content-based filtering approaches
Hybrid recommender systems
Model-based collaborative filtering
The utility matrix
The cold-start problem in collaborative-filtering approaches
Factorization machines in recommender systems
Developing a movie recommender system using FMs
Dataset description and exploratory analysis
Movie rating prediction
Converting the dataset into LibFM format
Training and test set preparation
Movie rating prediction
Which one makes more sense – ranking or rating?
Frequently asked questions (FAQs)
Summary
Answers to questions
Discussion, Current Trends, and Outlook
Discussion and outlook
Discussion on the completed projects
Titanic survival prediction using MLP and LSTM networks
Cancer type prediction using recurrent type networks
Image classification using convolutional neural networks
Sentiment analysis using Word2Vec and the LSTM network
Image classification using transfer learning
Real-time object detection using YOLO, JavaCV, and DL4J
Stock price prediction using LSTM network
Distributed deep learning – video classification using a convolutional-LSTM network
Using deep reinforcement learning for GridWorld
Movie recommender system using factorization machines
Current trends and outlook
Current trends
Outlook on emergent DL architectures
Residual neural networks
GANs
Capsule networks (CapsNet)
Semantic image segmentation
Deep learning for clustering analysis
Frequently asked questions (FAQs)
Answers to questions
Other Books You May Enjoy
Leave a review - let other readers know what you think
The continued growth in data, coupled with the need to make increasingly complex decisions against that data, is creating massive hurdles that prevent organizations from deriving insights in a timely manner using traditional analytical approaches.
To find meaningful values and insights, deep learning evolved, which is a branch of machine learning algorithms based on learning multiple levels of abstraction. Neural networks, being at the core of deep learning, are used in predictive analytics, computer vision, natural language processing, time series forecasting, and performing a myriad of other complex tasks.
To date, most of the available DL books are written for Python. However, this book is conceived for developers, data scientists, machine learning practitioners, and deep learning enthusiasts who want to build powerful, robust, and accurate predictive models with the power of Deeplearning4j (a JVM-based DL framework), combined with other open source Java APIs.
Throughout the book, you will learn how to develop practical applications for AI systems using feedforward neural networks, convolutional neural networks, recurrent neural networks, autoencoders, and factorization machines. Additionally, you will learn how to run your deep learning programs on GPUs in a distributed way.
After finishing the book, you will be familiar with machine learning techniques, in particular, the use of Java for deep learning, and will be ready to apply your knowledge in research or commercial projects. In summary, this book is not meant to be read cover to cover. You can jump to a chapter that looks like something you are trying to accomplish or one that simply ignites your interest.
Happy reading!
Developers, data scientists, machine learning practitioners, and deep learning enthusiasts who wish to learn how to develop real-life deep learning projects by harnessing the power of JVM-based Deeplearning4j (DL4J), Spark, RankSys, and other open source libraries will find this book extremely useful. A sound understanding of Java is needed. In addition, some basic prior experience of Spark, DL4J, and Maven-based project management will help you pick up the concepts more quickly.
Chapter 1, Getting Started with Deep Learning, explains some basic concepts of machine learning and artificial neural networks as the core of deep learning. It then briefly discusses existing and emerging neural network architectures. Next, it covers various features of deep learning frameworks and libraries. Then it shows how to solve Titanic survival prediction using a Spark-based Multilayer Perceptron (MLP). Finally, it discusses some frequent questions related to this project and the general DL area.
Chapter 2, Cancer Types Prediction Using Recurrent Type Networks, demonstrates how to develop a DL application for cancer type classification from a very-high-dimensional gene expression dataset. First, it performs the necessary feature engineering so that the dataset can be fed into a Long Short-Term Memory (LSTM) network. Finally, it discusses some frequent questions related to this project and to DL4J hyperparameter and network tuning.
Chapter 3, Multi-Label Image Classification Using Convolutional Neural Networks, demonstrates how to develop an end-to-end project for handling the multi-label image classification problem using CNN on top of the DL4J framework on real Yelp image datasets. It discusses how to tune hyperparameters for better classification results.
Chapter 4, Sentiment Analysis Using Word2Vec and the LSTM Network, shows how to develop a hands-on deep learning project that classifies review texts as either positive or negative sentiments. A large-scale movie review dataset will be used to train the LSTM model, and Word2Vec will be used as the neural embedding. Finally, it shows sample predictions for other review datasets.
Chapter 5, Transfer Learning for Image Classification, demonstrates how to develop an end-to-end project to solve dog versus cat image classification using a pre-trained VGG-16 model. We wrap up everything in a Java JFrame and JPanel application to make the overall pipeline understandable for making sample object detection.
Chapter 6, Real-Time Object Detection Using YOLO, JavaCV, and DL4J, shows how to develop an end-to-end project that will detect objects from video frames when the video clips play continuously. The pre-trained YOLO v2 model will be used for transfer learning, and the JavaCV API will be used for video frame handling on top of DL4J.
Chapter 7, Stock Price Prediction Using the LSTM Network, demonstrates how to develop a real-life stock prediction project (for the open, close, low, or high price, or the volume) using LSTM on top of the DL4J framework. Time series generated from a real-life stock dataset will be used to train the LSTM model, which will be used to predict the price only one day ahead at each time step.
Chapter 8, Distributed Deep Learning on Cloud – Video Classification Using Convolutional LSTM Network, shows how to develop an end-to-end project that accurately classifies a large collection of video clips (for example, UCF101) using a combined CNN and LSTM network on top of DL4J. The training is carried out on an Amazon EC2 GPU compute cluster. Eventually, this end-to-end project can be treated as a primer for human activity recognition from video.
Chapter 9, Playing GridWorld Game Using Deep Reinforcement Learning, is all about designing a machine learning system driven by criticisms and rewards. It then shows how to develop a GridWorld game using DL4J, RL4J, and neural Q-learning, in which a neural network acts as the Q function.
Chapter 10, Developing Movie Recommendation Systems Using Factorization Machines, is about developing a sample project using factorization machines to predict both the rating and ranking of movies. It then discusses some theoretical background of recommendation systems using matrix factorization and collaborative filtering, before diving into the project implementation using RankSys-library-based FMs.
Chapter 11, Discussion, Current Trends, and Outlook, wraps up everything by discussing the completed projects and some abstract takeaways. Then it provides some improvement suggestions. Additionally, it covers some extension guidelines for other real-life deep learning projects.
All the examples have been implemented using Deeplearning4j with some open source libraries in Java. To be more specific, the following API/tools are required:
Java/JDK version 1.8
Spark version 2.3.0
Spark csv_2.11 version 1.3.0
ND4j backend version nd4j-cuda-9.0-platform for GPU, otherwise nd4j-native
ND4j version >=1.0.0-alpha
DL4j version >=1.0.0-alpha
Datavec version >=1.0.0-alpha
Arbiter version >=1.0.0-alpha
Logback version 1.2.3
JavaCV platform version 1.4.1
HTTP Client version 4.3.5
Jfreechart 1.0.13
Jcodec 0.2.3
Eclipse Mars or Luna (latest) or Intellij IDEA
Maven Eclipse plugin (2.9 or higher)
Maven compiler plugin for Eclipse (2.3.2 or higher)
Maven assembly plugin for Eclipse (2.4.1 or higher)
Regarding operating system: Linux distributions are preferable (including Debian, Ubuntu, Fedora, RHEL, CentOS). To be more specific, for example, for Ubuntu it is recommended to have a 14.04 (LTS) 64-bit (or later) complete installation or VMWare player 12 or Virtual box. You can run Spark jobs on Windows (XP/7/8/10) or Mac OS X (10.4.7+).
Regarding hardware configuration: A machine or server having core i5 processor, about 100 GB disk space, and at least 16 GB RAM. In addition, an Nvidia GPU driver has to be installed with CUDA and CuDNN configured if you want to perform the training on GPU. Enough storage for running heavy jobs is needed (depending on the dataset size you will be handling), preferably at least 50 GB of free disk storage (for standalone and for SQL warehouse).
You can download the example code files for this book from your account at www.packtpub.com. If you purchased this book elsewhere, you can visit www.packtpub.com/support and register to have the files emailed directly to you.
You can download the code files by following these steps:
Log in or register at www.packtpub.com.
Select the SUPPORT tab.
Click on Code Downloads & Errata.
Enter the name of the book in the Search box and follow the onscreen instructions.
Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:
WinRAR/7-Zip for Windows
Zipeg/iZip/UnRarX for Mac
7-Zip/PeaZip for Linux
The code bundle for the book is also hosted on GitHub at https://github.com/PacktPublishing/Java-Deep-Learning-Projects. In case there's an update to the code, it will be updated on the existing GitHub repository.
We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!
We also provide a PDF file that has color images of the screenshots/diagrams used in this book. You can download it here: https://www.packtpub.com/sites/default/files/downloads/JavaDeepLearningProjects_ColorImages.pdf.
There are a number of text conventions used throughout this book.
CodeInText: Indicates code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles. Here is an example: "Then, I unzipped and copied each .csv file into a folder called label."
A block of code is set as follows:
<properties>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
    <java.version>1.8</java.version>
</properties>
When we wish to draw your attention to a particular part of a code block, the relevant lines or items are set in bold:
<properties>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
    <java.version>1.8</java.version>
</properties>
Bold: Indicates a new term, an important word, or words that you see onscreen. For example, words in menus or dialog boxes appear in the text like this. Here is an example: "We then read and process images into PhotoID | Vector map"
Feedback from our readers is always welcome.
General feedback: Email [email protected] and mention the book title in the subject of your message. If you have questions about any aspect of this book, please email us at [email protected].
Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you would report this to us. Please visit www.packtpub.com/submit-errata, selecting your book, clicking on the Errata Submission Form link, and entering the details.
Piracy: If you come across any illegal copies of our works in any form on the Internet, we would be grateful if you would provide us with the location address or website name. Please contact us at [email protected] with a link to the material.
If you are interested in becoming an author: If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, please visit authors.packtpub.com.
Please leave a review. Once you have read and used this book, why not leave a review on the site that you purchased it from? Potential readers can then see and use your unbiased opinion to make purchase decisions, we at Packt can understand what you think about our products, and our authors can see your feedback on their book. Thank you!
For more information about Packt, please visit packtpub.com.
In this chapter, we will explain some basic concepts of Machine Learning (ML) and Deep Learning (DL) that will be used in all subsequent chapters. We will start with a brief introduction to ML. Then we will move on to DL, which is one of the emerging branches of ML.
We will briefly discuss some of the most well-known and widely used neural network architectures. Next, we will look at various features of deep learning frameworks and libraries. Then we will see how to prepare a programming environment, before moving on to coding with some open source, deep learning libraries such as DeepLearning4J (DL4J).
Then we will solve a very famous ML problem: the Titanic survival prediction. For this, we will use an Apache Spark-based Multilayer Perceptron (MLP) classifier to solve this problem. Finally, we'll see some frequently asked questions that will help us generalize our basic understanding of DL. Briefly, the following topics will be covered:
A soft introduction to ML
Artificial Neural Networks (ANNs)
Deep neural network architectures
Deep learning frameworks
Deep learning from disasters—Titanic survival prediction using MLP
Frequently asked questions (FAQ)
ML approaches are based on a set of statistical and mathematical algorithms in order to carry out tasks such as classification, regression analysis, concept learning, predictive modeling, clustering, and mining of useful patterns. Thus, with the use of ML, we aim at improving the learning experience such that it becomes automatic. Consequently, we may not need complete human interactions, or at least we can reduce the level of such interactions as much as possible.
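We now refer to a famous definition of ML by Tom M. Mitchell (Machine Learning, Tom Mitchell, McGraw Hill), where he explained what learning really means from a computer science perspective:
"A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E."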
Based on this definition, we can conclude that a computer program or machine can do the following:
Learn from data and histories
Improve with experience
Iteratively enhance a model that can be used to predict outcomes of questions
Since they are at the core of predictive analytics, almost every ML algorithm we use can be treated as an optimization problem. This is about finding parameters that minimize an objective function, for example, a weighted sum of two terms like a cost function and regularization. Typically, an objective function has two components:
A regularizer, which controls the complexity of the model
The loss, which measures the error of the model on the training data.
The regularization parameter defines the trade-off between minimizing the training error and limiting the model's complexity, in an effort to avoid overfitting. If both of these components are convex, then their sum is also convex; otherwise, the objective may be non-convex. More precisely, when using an ML algorithm, the goal is to find the parameters of a function that give the minimum error when making predictions. For a convex objective, a convex optimization technique can minimize the function until it converges towards the minimum error.
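To make the "loss plus regularizer" idea concrete, here is a minimal sketch in plain Java. It assumes a one-parameter model y ≈ w * x, a mean squared error loss, and an L2 regularizer weighted by lambda; the class and method names are hypothetical and are not taken from the book's code bundle.

```java
final class RegularizedObjective {

    // Objective = loss + lambda * regularizer for the toy model y ≈ w * x.
    static double objective(double w, double[] x, double[] y, double lambda) {
        double loss = 0.0;
        for (int i = 0; i < x.length; i++) {
            double err = w * x[i] - y[i];
            loss += err * err;
        }
        loss /= x.length;              // mean squared error on the training data
        double regularizer = w * w;    // L2 penalty that discourages large weights
        return loss + lambda * regularizer;
    }

    // Derivative of the objective with respect to w.
    static double gradient(double w, double[] x, double[] y, double lambda) {
        double g = 0.0;
        for (int i = 0; i < x.length; i++) {
            g += 2.0 * (w * x[i] - y[i]) * x[i];
        }
        return g / x.length + 2.0 * lambda * w;
    }

    // Plain gradient descent; the objective is convex, so it has a single minimum.
    static double minimize(double[] x, double[] y, double lambda,
                           double learningRate, int steps) {
        double w = 0.0;
        for (int s = 0; s < steps; s++) {
            w -= learningRate * gradient(w, x, y, lambda);
        }
        return w;
    }

    public static void main(String[] args) {
        double[] x = {1.0, 2.0, 3.0, 4.0};
        double[] y = {2.0, 4.0, 6.0, 8.0};
        double lambda = 0.1;
        double w = minimize(x, y, lambda, 0.01, 5000);
        System.out.println("w = " + w + ", objective = " + objective(w, x, y, lambda));
    }
}
```

Because both terms are convex in w, gradient descent with a small enough learning rate approaches the single minimum. Increasing lambda pulls the fitted w further towards zero, which is the trade-off between training error and model complexity described above.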
Given that a problem is convex, it is usually easier to analyze the asymptotic behavior of the algorithm, which shows how fast it converges as the model observes more and more training data. The challenge of ML is to allow training a model so that it can recognize complex patterns and make decisions not only in an automated way but also as intelligently as possible. The entire learning process requires input datasets that can be split (or are already provided) into three types, outlined as follows:
A training set is the knowledge base coming from historical or live data used to fit the parameters of the ML algorithm. During the training phase, the ML model utilizes the training set to find optimal weights of the network and optimize the objective function by minimizing the training error. Here, the back-prop rule (or another more advanced optimizer with a proper updater; we'll see this later on) is used to train the model, but all the hyperparameters need to be set before the learning process starts.
A validation set is a set of examples used to tune the hyperparameters of an ML model. It helps ensure that the model is trained well and generalizes, avoiding overfitting. Some ML practitioners refer to it as a development set or dev set as well.
A test set is used for evaluating the performance of the trained model on unseen data. This step is also referred to as model inferencing. After assessing the final model on the test set (that is, when we're fully satisfied with the model's performance), we do not have to tune the model any further, and the trained model can be deployed in a production-ready environment.
A common practice is splitting the input data (after necessary pre-processing and feature engineering) into, for example, 60% for training, 20% for validation, and 20% for testing, but it really depends on use cases. Also, sometimes we need to perform up-sampling or down-sampling on the data based on the availability and quality of the datasets.
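A random three-way split of this kind can be sketched in plain Java as follows. This is only an illustration: the class name DatasetSplit, the fractions, and the seed are assumptions, and the projects in this book rely on the splitting utilities of Spark and DL4J instead.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

final class DatasetSplit {

    // Returns three lists: training, validation, and test (test gets the remainder).
    static <T> List<List<T>> split(List<T> data, double trainFraction,
                                   double validationFraction, long seed) {
        if (trainFraction < 0 || validationFraction < 0
                || trainFraction + validationFraction > 1.0) {
            throw new IllegalArgumentException(
                    "fractions must be non-negative and sum to at most 1");
        }
        // Shuffle a copy so the caller's list is left untouched.
        List<T> shuffled = new ArrayList<>(data);
        Collections.shuffle(shuffled, new Random(seed));

        int trainEnd = (int) Math.round(shuffled.size() * trainFraction);
        int validationEnd = Math.min(shuffled.size(),
                trainEnd + (int) Math.round(shuffled.size() * validationFraction));

        List<List<T>> parts = new ArrayList<>();
        parts.add(new ArrayList<>(shuffled.subList(0, trainEnd)));
        parts.add(new ArrayList<>(shuffled.subList(trainEnd, validationEnd)));
        parts.add(new ArrayList<>(shuffled.subList(validationEnd, shuffled.size())));
        return parts;
    }

    public static void main(String[] args) {
        List<Integer> data = new ArrayList<>();
        for (int i = 0; i < 100; i++) {
            data.add(i);
        }
        List<List<Integer>> parts = split(data, 0.6, 0.2, 42L);
        System.out.println("train=" + parts.get(0).size()
                + " validation=" + parts.get(1).size()
                + " test=" + parts.get(2).size());
    }
}
```

Fixing the seed makes the split reproducible, which helps when comparing models trained on the same partitions.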
Moreover, the learning theory uses mathematical tools that derive from probability theory and information theory. Three learning paradigms will be briefly discussed:
Supervised learning
Unsupervised learning
Reinforcement learning
The following diagram summarizes the three types of learning, along with the problems they address:
Supervised learning is the simplest and most well-known automatic learning task. It is based on a number of pre-defined examples, in which the category to which each of the inputs should belong is already known. Figure 2 shows a typical workflow of supervised learning.
An actor (for example, an ML practitioner, data scientist, data engineer, ML engineer, and so on) performs Extraction Transformation Load (ETL) and the necessary feature engineering (including feature extraction, selection, and so on) to get the appropriate data having features and labels. Then he does the following:
Splits the data into training, development, and test sets
Uses the training set to train an ML model
Uses the validation set to check the training for overfitting and to tune regularization
Evaluates the model's performance on the test set (that is, unseen data)
If the performance is not satisfactory, performs additional tuning to get the best model based on hyperparameter optimization
Finally, deploys the best model in a production-ready environment
In the overall life cycle, there might be many actors involved (for example, a data engineer, data scientist, or ML engineer) to perform each step independently or collaboratively.
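The train, validate, and test steps of this workflow can be sketched with a deliberately tiny model. The example below assumes a one-parameter ridge regression (y ≈ w * x with an L2 penalty) whose regularization strength lambda is the only hyperparameter; all names and data values are hypothetical.

```java
final class ValidationWorkflow {

    // Trains the toy model: closed-form minimizer of mean squared error + lambda * w^2.
    static double fit(double[] x, double[] y, double lambda) {
        double sxy = 0.0;
        double sxx = 0.0;
        for (int i = 0; i < x.length; i++) {
            sxy += x[i] * y[i];
            sxx += x[i] * x[i];
        }
        return sxy / (sxx + lambda * x.length);
    }

    static double meanSquaredError(double w, double[] x, double[] y) {
        double sum = 0.0;
        for (int i = 0; i < x.length; i++) {
            double err = w * x[i] - y[i];
            sum += err * err;
        }
        return sum / x.length;
    }

    // Hyperparameter tuning: keep the lambda with the lowest validation error.
    static double selectLambda(double[] trainX, double[] trainY,
                               double[] valX, double[] valY, double[] candidates) {
        double best = candidates[0];
        double bestError = Double.POSITIVE_INFINITY;
        for (double lambda : candidates) {
            double w = fit(trainX, trainY, lambda);
            double error = meanSquaredError(w, valX, valY);
            if (error < bestError) {
                bestError = error;
                best = lambda;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        double[] trainX = {1.0, 2.0, 3.0};
        double[] trainY = {2.0, 4.0, 6.0};
        double[] valX = {4.0, 5.0};
        double[] valY = {8.0, 10.0};
        double[] testX = {6.0};
        double[] testY = {12.0};

        double lambda = selectLambda(trainX, trainY, valX, valY,
                new double[] {0.0, 1.0, 10.0});
        double w = fit(trainX, trainY, lambda);
        // The test set is touched only once, after tuning is finished.
        System.out.println("lambda=" + lambda
                + " test MSE=" + meanSquaredError(w, testX, testY));
    }
}
```

The point of the design is the separation of roles: the training data fits w, the validation data chooses lambda, and the test data is used only for the final performance estimate.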
The supervised learning context includes classification and regression tasks; classification is used to predict which class a data point is part of (discrete value), while regression is used to predict continuous values. In other words, a classification task is used to predict the label of the class attribute, while a regression task is used to make a numeric prediction of the class attribute.
In the context of supervised learning, unbalanced data refers to classification problems where we have unequal instances for different classes. For example, if we have a classification task for only two classes, balanced data would mean 50% pre-classified examples for each of the classes.
If the input dataset is only a little unbalanced (for example, 60% of the data points for one class and 40% for the other class), it can still be split randomly into three sets, for example with 50% for the training set, 20% for the validation set, and the remaining 30% for the testing set.
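Before deciding how to split, it helps to measure how balanced the labels actually are. The following sketch counts the share of each class in a list of labels; the class name ClassBalance and the label values are assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class ClassBalance {

    // Maps each class label to its fraction of the dataset.
    static Map<String, Double> classFractions(List<String> labels) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String label : labels) {
            counts.merge(label, 1, Integer::sum);
        }
        Map<String, Double> fractions = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            fractions.put(entry.getKey(), entry.getValue() / (double) labels.size());
        }
        return fractions;
    }

    public static void main(String[] args) {
        List<String> labels = Arrays.asList(
                "positive", "positive", "positive", "negative", "negative");
        // A 60/40 distribution like this one is only slightly unbalanced.
        System.out.println(classFractions(labels));
    }
}
```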
In unsupervised learning, an input set is supplied to the system during the training phase. In contrast with supervised learning, the input objects are not labeled with their class. For classification, we assumed that we are given a training dataset of correctly labeled data. Unfortunately, we do not always have that advantage when we collect data in the real world.
For example, let's say you have a large collection of totally legal, not pirated, MP3 files in a crowded and massive folder on your hard drive. In such a case, how could we possibly group songs together if we do not have direct access to their metadata? One possible approach could be to mix various ML techniques, but clustering is often the best solution.
Now, what if you can build a clustering predictive model that helps automatically group together similar songs and organize them into your favorite categories, such as country, rap, rock, and so on? In short, unsupervised learning algorithms are commonly used in clustering problems. The following diagram gives us an idea of a clustering technique applied to solve this kind of problem:
Although the data points are not labeled, we can still do the necessary feature engineering and grouping of a set of objects in such a way that objects in the same group (called a cluster) are brought together. This is not easy for a human. Rather, a standard approach is to define a similarity measure between two objects and then look for any cluster of objects that are more similar to each other than they are to the objects in the other clusters. Once we've done the clustering of the data points (that is, MP3 files) and the validation is completed, we know the pattern of the data (that is, what type of MP3 files fall in which group).
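To make the idea of a similarity measure concrete, the following sketch uses Euclidean distance between feature vectors and assigns a point to the closest cluster center. The two-dimensional song features (tempo and energy), the centroid values, and all names are hypothetical and chosen only for illustration; a real clustering algorithm such as k-means would also have to learn the centroids.

```java
/** Assigns a feature vector to the nearest cluster center using Euclidean distance. */
class NearestCentroidSketch {

    /** Smaller distance means the two objects are more similar. */
    static double euclideanDistance(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("vectors must have the same length");
        }
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            sum += diff * diff;
        }
        return Math.sqrt(sum);
    }

    /** Returns the index of the centroid closest to the given point. */
    static int nearestCentroid(double[] point, double[][] centroids) {
        int best = 0;
        double bestDistance = Double.POSITIVE_INFINITY;
        for (int c = 0; c < centroids.length; c++) {
            double distance = euclideanDistance(point, centroids[c]);
            if (distance < bestDistance) {
                bestDistance = distance;
                best = c;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Hypothetical features per song: {tempo in units of 100 BPM, energy in [0, 1]}.
        double[][] centroids = {{0.8, 0.3}, {1.4, 0.9}};
        double[] song = {1.3, 0.8};
        System.out.println("Assigned cluster: " + nearestCentroid(song, centroids));
    }
}
```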
Reinforcement learning is an artificial intelligence approach that focuses on the learning of the system through its interactions with the environment. In reinforcement learning, the system's parameters are adapted based on the feedback obtained from the environment, which in turn provides feedback on the decisions made by the system. The following diagram shows a person making decisions in order to arrive at their destination.
Let's take an example of the route you take from home to work. In this case, you take the same route to work every day. However, out of the blue, one day you get curious and decide to try a different route with a view to finding the shortest path. This dilemma of trying out new routes or sticking to the best-known route is an example of exploration versus exploitation:
We can take a look at one more example in terms of a system modeling a chess player. In order to improve its performance, the system utilizes the result of its previous moves; such a system is said to be a system learning with reinforcement.
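The exploration versus exploitation trade-off from the route example is often handled with an epsilon-greedy rule, which the text does not name; it is introduced here only as one common way to illustrate the idea. In this sketch, `epsilon` is the assumed probability of trying a random route, and the travel-time estimates are made-up numbers.

```java
import java.util.Random;

/** Epsilon-greedy choice between exploring a random route and exploiting the best-known one. */
class EpsilonGreedySketch {

    /**
     * Returns the index of the route to take today.
     * estimatedMinutes holds the current travel-time estimate per route (hypothetical data).
     */
    static int chooseRoute(double[] estimatedMinutes, double epsilon, Random random) {
        if (random.nextDouble() < epsilon) {
            // Exploration: try any route, even one that currently looks worse.
            return random.nextInt(estimatedMinutes.length);
        }
        // Exploitation: stick to the route with the lowest estimated travel time.
        int best = 0;
        for (int i = 1; i < estimatedMinutes.length; i++) {
            if (estimatedMinutes[i] < estimatedMinutes[best]) {
                best = i;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        double[] estimatedMinutes = {30.0, 25.0, 40.0};
        Random random = new Random(7L);
        for (int day = 1; day <= 5; day++) {
            int route = chooseRoute(estimatedMinutes, 0.2, random);
            System.out.println("Day " + day + ": route " + route);
        }
    }
}
```

With `epsilon` set to 0 the commuter never leaves the best-known route; with 1 every day is a random experiment. A full reinforcement learning system would also update the estimates from the observed travel times, which this sketch leaves out.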
We have seen the basic working principles of ML algorithms. Then we have seen what the basic ML tasks are and how they formulate domain-specific problems. Now let's take a look at how we can summarize ML tasks and some applications in the following diagram:
However, the preceding figure lists only a few use cases and applications using different ML tasks. In practice, ML is used in numerous use cases and applications. We will try to cover a few of those throughout this book.
Simple ML methods that were used in normal-size data analysis are not effective anymore and should be substituted by more robust ML methods. Although classical ML techniques allow researchers to identify groups or clusters of related variables, the accuracy and effectiveness of these methods diminish with large and high-dimensional datasets.
Here comes deep learning, which is one of the most important developments in artificial intelligence in the last few years. Deep learning is a branch of ML based on a set of algorithms that attempt to model high-level abstractions in data.
In short, deep learning algorithms are mostly a set of ANNs that can build better representations of large-scale datasets, so that models can learn these representations extensively. Deep learning is no longer limited to ANNs, and many theoretical advances as well as software and hardware improvements were necessary to get to this point. In this regard, Ian Goodfellow et al. (Deep Learning, MIT Press, 2016) defined deep learning as follows:
Let's take an example; suppose we want to develop a predictive analytics model, such as an animal recognizer, where our system has to resolve two problems:
To classify whether an image represents a cat or a dog
To cluster images of dogs and cats.
If we solve the first problem using a typical ML method, we must define the facial features (ears, eyes, whiskers, and so on) and write a method to identify which features (typically nonlinear) are more important when classifying a particular animal.
However, at the same time, we cannot address the second problem because classical ML algorithms for clustering images (such as k-means) cannot handle nonlinear features. Deep learning algorithms take these two problems one step further: they determine which features are most important for classification or clustering and extract them automatically.
In contrast, when using a classical ML algorithm, we would have to provide the features manually. In summary, the deep learning workflow would be as follows:
A deep learning algorithm would first identify the edges that are most relevant when clustering cats or dogs. It would then try to find various combinations of shapes and edges hierarchically. This step is called ETL.
After several iterations, hierarchical identification of complex concepts and features is carried out. Then, based on the identified features, the DL algorithm automatically decides which of these features are most significant (statistically) to classify the animal. This step is feature extraction.
Finally, it takes out the label column and performs unsupervised training using AutoEncoders (AEs) to extract the latent features to be redistributed to k-means for clustering.
Then the clustering assignment hardening loss (CAH loss) and reconstruction loss are jointly optimized towards optimal clustering assignment. Deep Embedding Clustering (see more at https://arxiv.org/pdf/1511.06335.pdf) is an example of such an approach. We will discuss deep learning-based clustering approaches in Chapter 11, Discussion, Current Trends, and Outlook.
Up to this point, we have seen that deep learning systems are able to recognize what an image represents. A computer does not see an image as we see it because it only knows the position of each pixel and its color. Using deep learning techniques, the image is divided into various layers of analysis.
At a lower level, the software analyzes, for example, a grid of a few pixels with the task of detecting a type of color or various nuances. If it finds something, it informs the next level, which then checks whether or not that color belongs to a larger form, such as a line. The process continues to the upper levels until the system determines what is shown in the image. The following diagram shows what we have discussed in the case of an image classification system:
More precisely, the preceding image classifier can be built layer by layer, as follows:
Layer 1: The algorithm starts identifying the dark and light pixels from the raw images
Layer 2: The algorithm then identifies edges and shapes
Layer 3: It then learns more complex shapes and objects
Layer 4: The algorithm then learns which objects define a human face
Although this is a very simple classifier, software capable of doing these types of things is now widespread and is found in systems for recognizing faces, or in those for searching by an image on Google, for example. These pieces of software are based on deep learning algorithms.
In contrast, by using a linear ML algorithm, we cannot build such applications, since these algorithms are incapable of handling nonlinear image features. Also, with classical ML approaches, we typically handle only a few hyperparameters. However, when neural networks are brought to the party, things become much more complex. In each layer, there can be millions or even billions of parameters (weights) to learn, and the cost function becomes non-convex.
Another reason is that activation functions used in hidden layers are nonlinear, so the cost is non-convex. We will discuss this phenomenon in more detail in later chapters but let's take a quick look at ANNs.
Deep learning builds on the concept of ANNs. ANNs are modeled on the human nervous system, which consists of a number of neurons that communicate with each other through axons.
The working principles of ANNs are inspired by how a human brain works, depicted in Figure 7. The receptors receive the stimuli either internally or from the external world; then they pass the information into the biological neurons for further processing. There are a number of dendrites, in addition to another long extension called the axon.
Towards its extremity, there are minuscule structures called synaptic terminals, used to connect one neuron to the dendrites of other neurons. Biological neurons receive short electrical impulses called signals from other neurons, and in response, they trigger their own signals:
We can thus summarize that the neuron comprises a cell body (also known as the soma), one or more dendrites for receiving signals from other neurons, and an axon for carrying out the signals generated by the neurons.
A neuron is in an active state when it is sending signals to other neurons. However, when it is receiving signals from other neurons, it is in an inactive state. In an idle state, a neuron accumulates all the signals received before reaching a certain activation threshold. This whole thing motivated researchers to introduce an ANN.
Inspired by the working principles of biological neurons, Warren McCulloch and Walter Pitts proposed the first artificial neuron model in 1943 in terms of a computational model of nervous activity. This simple model of a biological neuron, also known as an artificial neuron (AN), has one or more binary (on/off) inputs and one output only.
An AN simply activates its output when more than a certain number of its inputs are active. For example, here we see a few ANNs that perform various logical operations. In this example, we assume that a neuron is activated only when at least two of its inputs are active:
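The rule stated above (a neuron fires only when at least two of its inputs are active) can be sketched in a few lines of Java. The wiring used here for OR and for the identity function, where the same signal is connected to the neuron twice, is an assumption made for this illustration, and the class and method names are hypothetical.

```java
/** A McCulloch-Pitts style neuron with binary inputs and a firing threshold. */
class McCullochPittsSketch {

    /** Fires when at least `threshold` of the binary input connections are active. */
    static boolean fires(int threshold, boolean... inputs) {
        int active = 0;
        for (boolean input : inputs) {
            if (input) {
                active++;
            }
        }
        return active >= threshold;
    }

    /** Identity: the single signal is wired in twice, so it alone reaches the threshold. */
    static boolean identity(boolean a) {
        return fires(2, a, a);
    }

    /** Logical AND: one connection per input, so both must be active. */
    static boolean and(boolean a, boolean b) {
        return fires(2, a, b);
    }

    /** Logical OR: each input is wired in twice, so one active input is enough. */
    static boolean or(boolean a, boolean b) {
        return fires(2, a, a, b, b);
    }

    public static void main(String[] args) {
        boolean[] values = {false, true};
        for (boolean a : values) {
            for (boolean b : values) {
                System.out.println(a + ", " + b + " -> AND=" + and(a, b) + " OR=" + or(a, b));
            }
        }
    }
}
```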
The example may sound trivial, but even with such a simplified model it is possible to build a network of ANs, and these networks can be combined to compute complex logical expressions. This simplified model inspired John von Neumann, Marvin Minsky, Frank Rosenblatt, and many others; Rosenblatt introduced another model, called the perceptron, in 1957.
The perceptron is one of the simplest ANN architectures we've seen in the last 60 years. It is based on a slightly different AN called a Linear Threshold Unit (LTU). The only difference is that the inputs and outputs are now numbers instead of binary on/off values. Each input connection is associated with a weight. The LTU computes a weighted sum of its inputs, then applies a step function (which resembles the action of an activation function) to that sum, and outputs the result:
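As a rough sketch of the LTU just described, the following Java code computes the weighted sum of numeric inputs and applies a step function. The choice of a Heaviside step that outputs 1 for non-negative sums, and the trick of modeling the threshold as an extra constant input of 1 with a negative weight, are assumptions made for this example.

```java
/** A Linear Threshold Unit: weighted sum of numeric inputs followed by a step function. */
class LinearThresholdUnitSketch {

    /** Heaviside step: 1 when the weighted sum is non-negative, otherwise 0. */
    static int step(double z) {
        return z >= 0.0 ? 1 : 0;
    }

    /** Computes the sum of inputs[i] * weights[i]. */
    static double weightedSum(double[] weights, double[] inputs) {
        if (weights.length != inputs.length) {
            throw new IllegalArgumentException("one weight per input is required");
        }
        double sum = 0.0;
        for (int i = 0; i < weights.length; i++) {
            sum += weights[i] * inputs[i];
        }
        return sum;
    }

    static int output(double[] weights, double[] inputs) {
        return step(weightedSum(weights, inputs));
    }

    public static void main(String[] args) {
        // The last input is a constant 1; its weight of -1.5 acts as the threshold.
        double[] weights = {1.0, 1.0, -1.5};
        System.out.println(output(weights, new double[] {1.0, 1.0, 1.0})); // both inputs on
        System.out.println(output(weights, new double[] {1.0, 0.0, 1.0})); // one input on
    }
}
```

With these particular weights the unit behaves like a logical AND, because the weighted sum only becomes non-negative when both inputs are 1.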
One of the downsides of a perceptron is that its decision boundary is linear. Therefore, perceptrons are incapable of learning complex patterns. They cannot even solve some simple problems, such as Exclusive OR (XOR). Later on, however, these limitations were largely overcome by stacking multiple layers of perceptrons into a multilayer perceptron (MLP).
Based on the concept of biological neurons, the term and the idea of ANs arose. Similarly to biological neurons, the artificial neuron consists of the following:
One or more incoming connections that aggregate signals from neurons
One or more output connections for carrying the signal to the other neurons
An activation function, which determines the numerical value of the output signal
The learning process of a neural network is configured as an iterative process of optimization of the weights (see more in the next section). The weights are updated in each epoch. Once the training starts, the aim is to generate predictions by minimizing the loss function. The performance of the network is then evaluated on the test set.
Now we know the simple concept of an artificial neuron. However, generating some artificial signals is not enough to learn a complex task. For that, a learning algorithm is needed; backpropagation is the supervised learning algorithm most commonly used to train a complex ANN.
The backpropagation algorithm aims to minimize the error between the current and the desired output. Since the network is feedforward, the activation flow always proceeds forward from the input units to the output units.
The gradient of the cost function is backpropagated and the network weights get updated; the overall method can be applied to any number of hidden layers recursively. In this method, the interplay between the two phases (the forward pass and the backward pass) is important. In short, the basic steps of the training procedure are as follows:
Initialize the network with some random (or more advanced Xavier) weights
For all training cases, follow the steps of forward and backward passes as outlined next
In the forward pass, a number of operations are performed to obtain some predictions or scores. In such an operation, a graph is created, connecting all dependent operations in a top-to-bottom fashion. Then the network's error is computed, which is the difference between the predicted output and the actual output.
The backward pass, on the other hand, mainly involves mathematical operations: derivatives are computed for all differentiable operations in the graph (that is, auto-differentiation), and they are then combined using the chain rule to obtain the gradient of the loss function that is used to update the network weights.
In this pass, for all layers starting with the output layer and moving back to the input layer, the layer's output is compared with the desired output through the error function. The weights in the current layer are then adapted to minimize the error function. This is backpropagation's optimization step. Incidentally, there are two types of auto-differentiation methods:
Reverse mode: Derivative of a single output with respect to all inputs
Forward mode: Derivative of all outputs with respect to one input
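Forward mode can be illustrated with so-called dual numbers, where every value carries its derivative with respect to one chosen input. The sketch below is an assumption-laden toy, not a description of how any particular library implements auto-differentiation: it supports only addition and multiplication, and the function f(x) = x * x + 3x is chosen purely as an example.

```java
/** Forward-mode automatic differentiation with dual numbers (addition and multiplication only). */
class ForwardModeSketch {

    /** A value together with its derivative with respect to the chosen input. */
    static final class Dual {
        final double value;
        final double derivative;

        Dual(double value, double derivative) {
            this.value = value;
            this.derivative = derivative;
        }

        /** The input we differentiate with respect to: d(x)/dx = 1. */
        static Dual variable(double x) {
            return new Dual(x, 1.0);
        }

        /** A constant: its derivative is 0. */
        static Dual constant(double c) {
            return new Dual(c, 0.0);
        }

        /** Sum rule. */
        Dual plus(Dual other) {
            return new Dual(value + other.value, derivative + other.derivative);
        }

        /** Product rule. */
        Dual times(Dual other) {
            return new Dual(value * other.value,
                    derivative * other.value + value * other.derivative);
        }
    }

    /** Example function f(x) = x * x + 3x, evaluated on dual numbers. */
    static Dual f(Dual x) {
        return x.times(x).plus(Dual.constant(3.0).times(x));
    }

    public static void main(String[] args) {
        Dual result = f(Dual.variable(2.0));
        System.out.println("f(2) = " + result.value + ", f'(2) = " + result.derivative);
    }
}
```

One evaluation yields the derivative with respect to a single input, which is why forward mode needs one pass per input. Neural networks have many weights and a single scalar loss, so reverse mode, which backpropagation relies on, is the better fit there.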
The backpropagation algorithm processes the information in such a way that the network decreases the global error during the learning iterations; however, this does not guarantee that the global minimum is reached. The presence of hidden units and the nonlinearity of the output function mean that the behavior of the error is very complex and has many local minima.
This backpropagation step is typically performed thousands or millions of times, using many training batches, until the model parameters converge to values that minimize the cost function. The training process ends when the error on the validation set begins to increase, because this could mark the beginning of overfitting.
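The training loop described above (forward pass, backward pass via the chain rule, weight update, repeated many times) can be sketched for the smallest possible case: a single sigmoid neuron learning the logical OR function. The squared-error loss, the learning rate, the number of epochs, and the initial weights are all assumptions chosen for this example, and the validation-based stopping criterion is left out to keep the sketch short.

```java
/** Trains one sigmoid neuron with batch gradient descent on the logical OR function. */
class SingleNeuronTrainingSketch {

    static double sigmoid(double z) {
        return 1.0 / (1.0 + Math.exp(-z));
    }

    /** Forward pass; params holds the weights followed by the bias as the last entry. */
    static double predict(double[] params, double[] x) {
        double z = params[x.length];
        for (int j = 0; j < x.length; j++) {
            z += params[j] * x[j];
        }
        return sigmoid(z);
    }

    /** Mean squared error of the neuron over the whole dataset. */
    static double meanSquaredError(double[] params, double[][] inputs, double[] targets) {
        double sum = 0.0;
        for (int s = 0; s < inputs.length; s++) {
            double diff = predict(params, inputs[s]) - targets[s];
            sum += diff * diff;
        }
        return sum / inputs.length;
    }

    /** Returns a trained copy of initialParams; the argument itself is not modified. */
    static double[] train(double[] initialParams, double[][] inputs, double[] targets,
                          double learningRate, int epochs) {
        double[] params = initialParams.clone();
        int n = params.length - 1; // number of weights; index n is the bias
        for (int epoch = 0; epoch < epochs; epoch++) {
            double[] gradient = new double[params.length];
            for (int s = 0; s < inputs.length; s++) {
                double out = predict(params, inputs[s]); // forward pass
                // Backward pass: chain rule for the error 0.5 * (out - target)^2.
                double delta = (out - targets[s]) * out * (1.0 - out);
                for (int j = 0; j < n; j++) {
                    gradient[j] += delta * inputs[s][j];
                }
                gradient[n] += delta;
            }
            // Weight update: step against the averaged gradient.
            for (int k = 0; k < params.length; k++) {
                params[k] -= learningRate * gradient[k] / inputs.length;
            }
        }
        return params;
    }

    public static void main(String[] args) {
        double[][] inputs = {{0, 0}, {0, 1}, {1, 0}, {1, 1}};
        double[] targets = {0, 1, 1, 1};
        double[] initial = {0.1, -0.1, 0.0};
        double[] trained = train(initial, inputs, targets, 0.5, 5000);
        System.out.println("error before: " + meanSquaredError(initial, inputs, targets));
        System.out.println("error after:  " + meanSquaredError(trained, inputs, targets));
    }
}
```

A network with hidden layers applies the same chain rule layer by layer, from the output back to the input, which is what frameworks automate.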
Besides the state of a neuron, synaptic weight is considered, which influences the connection within the network. Each weight has a numerical value indicated by Wij, which is the synaptic weight connecting neuron i to neuron j.
For each neuron (also known as a unit) i, an input vector can be defined by xi = (x1, x2, ..., xn) and a weight vector can be defined by wi = (wi1, wi2, ..., win). Now, depending on the position of a neuron, the weights and the output function determine the behavior of an individual neuron. Then during forward propagation, each unit in the hidden layer gets the following signal:
Nevertheless, among the weights, there is also a special type of weight called bias unit b. Technically, bias units aren't connected to any previous layer, so they don't have true activity. But still, the bias b value allows the neural network to shift the activation function to the left or right. Now, taking the bias unit into consideration, the modified network output can be formulated as follows:
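As a sketch in the notation introduced above, the signal received by hidden unit i without and with the bias can be written as shown below. The symbols net_i (the weighted input), o_i (the output), and f (the activation function) are names assumed for this illustration.

```latex
% Weighted sum received by hidden unit i (no bias)
net_i = \sum_{j=1}^{n} w_{ij}\, x_j

% The same sum shifted by the bias b
net_i = \sum_{j=1}^{n} w_{ij}\, x_j + b

% Output of the unit after applying the activation function f
o_i = f(net_i)
```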
The preceding equation signifies that each hidden unit gets the sum of the inputs multiplied by the corresponding weights; this is the summing junction. The result of the summing junction is then passed through the activation function, which squashes the output as depicted in the following figure:
Now, a tricky question: how do we initialize the weights? Well, if we initialize all weights to the same value (for example, 0 or 1), each hidden neuron will get exactly the same signal. Let's try to break it down:
If all weights are initialized to 1, then each unit gets a signal equal to the sum of the inputs
If all weights are 0, which is even worse, every neuron in a hidden layer will get zero signal
For network weight initialization, Xavier initialization is nowadays used widely. It is similar to random initialization but often turns out to work much better since it can automatically determine the scale of initialization based on the number of input and output neurons.
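The way Xavier initialization derives its scale from the number of input and output neurons can be sketched as follows. This example assumes the uniform variant, which draws weights from the interval [-limit, limit) with limit = sqrt(6 / (fanIn + fanOut)); library implementations may use a different variant (for example, a Gaussian one), and the class and method names here are illustrative only.

```java
import java.util.Random;

/** Xavier (Glorot) uniform weight initialization for one fully connected layer. */
class XavierInitSketch {

    /** The scale depends only on the number of input and output neurons. */
    static double limit(int fanIn, int fanOut) {
        return Math.sqrt(6.0 / (fanIn + fanOut));
    }

    /** Returns a fanOut x fanIn weight matrix with entries drawn uniformly from [-limit, limit). */
    static double[][] initialize(int fanIn, int fanOut, long seed) {
        double limit = limit(fanIn, fanOut);
        Random random = new Random(seed); // fixed seed for reproducibility
        double[][] weights = new double[fanOut][fanIn];
        for (int i = 0; i < fanOut; i++) {
            for (int j = 0; j < fanIn; j++) {
                weights[i][j] = (random.nextDouble() * 2.0 - 1.0) * limit;
            }
        }
        return weights;
    }

    public static void main(String[] args) {
        double[][] weights = initialize(784, 256, 12345L);
        System.out.println("limit = " + limit(784, 256));
        System.out.println("first weight = " + weights[0][0]);
    }
}
```

Because every weight gets a different random value, hidden neurons no longer receive identical signals, while the fan-in and fan-out dependent scale keeps the signals from growing or shrinking too quickly from layer to layer.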
You may be wondering whether you can get rid of random initialization while training a regular DNN (for example, MLP or DBN). Well, recently, some researchers have been talking about random orthogonal matrix initializations that perform better than just any random initialization for training DNNs.
When it comes to initializing the biases, we can initialize them to zero. Setting all biases to a small constant value such as 0.01 instead ensures that all Rectified Linear Unit (ReLU) units can propagate some gradient at the start of training. In practice, however, this neither performs reliably better nor shows consistent improvement. Therefore, sticking with zero is recommended.
To allow a neural network to learn complex decision boundaries, we apply a non-linear activation function to some of its layers. Commonly used functions include Tanh, ReLU, softmax, and variants of these. More technically, each neuron receives as input signal the weighted sum of the synaptic weights and the activation values of the neurons connected. One of the most widely used functions for this purpose is the so-called sigmoid function. It is a special case of the logistic function, which is defined by the following formula:
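Written out, and assuming x denotes the weighted input signal of the neuron, the sigmoid is the logistic function with unit steepness and its midpoint at zero:

```latex
\sigma(x) = \frac{1}{1 + e^{-x}}
```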
The domain of this function includes all real numbers, and the co-domain is (0, 1). This means that any value obtained as an output from a neuron (as per the calculation of its activation state) will always be between zero and one. The sigmoid function, as represented in the following diagram, provides an interpretation of the saturation rate of a neuron, from not being active (output close to 0) to complete saturation, which occurs at a predetermined maximum value (output close to 1).
On the other hand, a hyperbolic tangent, or tanh, is another form of the activation function. Tanh squashes a real-valued number to the range (-1, 1). In particular, mathematically, the tanh activation function can be expressed as follows:
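Assuming x again denotes the neuron's weighted input, the standard definition of the hyperbolic tangent is:

```latex
\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}
```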
The preceding equation can be represented in the following figure:
In general, in the last layer of a feedforward neural network (FFNN), the softmax function is applied to produce the final decision. This is a common case, especially when solving a classification problem. The softmax function squashes its inputs into a probability distribution over K different possible outcomes. It is used in various multiclass classification methods, so that the network's output is distributed across the classes (that is, a probability distribution over the classes), with each value between 0 and 1 and all values summing to 1.
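The activation functions discussed in this section can be sketched in plain Java as follows. The class name is illustrative, and subtracting the maximum inside softmax is a common numerical-stability precaution added here as an assumption; it does not change the result mathematically.

```java
/** Plain-Java versions of the sigmoid, tanh, ReLU, and softmax activation functions. */
class ActivationFunctionsSketch {

    /** Squashes any real number into (0, 1). */
    static double sigmoid(double x) {
        return 1.0 / (1.0 + Math.exp(-x));
    }

    /** Squashes any real number into (-1, 1). */
    static double tanh(double x) {
        return Math.tanh(x);
    }

    /** Rectified Linear Unit: passes positive values, clips negative values to 0. */
    static double relu(double x) {
        return Math.max(0.0, x);
    }

    /** Turns K raw scores into K probabilities that sum to 1. */
    static double[] softmax(double[] scores) {
        double max = Double.NEGATIVE_INFINITY;
        for (double score : scores) {
            max = Math.max(max, score);
        }
        double[] result = new double[scores.length];
        double sum = 0.0;
        for (int i = 0; i < scores.length; i++) {
            // Shifting by the maximum avoids overflow in Math.exp for large scores.
            result[i] = Math.exp(scores[i] - max);
            sum += result[i];
        }
        for (int i = 0; i < result.length; i++) {
            result[i] /= sum;
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println("sigmoid(0) = " + sigmoid(0.0));
        System.out.println("tanh(0)    = " + tanh(0.0));
        System.out.println("relu(-2)   = " + relu(-2.0));
        double[] probabilities = softmax(new double[] {1.0, 2.0, 3.0});
        for (double p : probabilities) {
            System.out.println("softmax -> " + p);
        }
    }
}
```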