Deep learning has shown its power in several application areas of Artificial Intelligence, especially in Computer Vision. Computer Vision is the science of understanding and manipulating images, and finds enormous applications in the areas of robotics, automation, and so on. This book will also show you, with practical examples, how to develop Computer Vision applications by leveraging the power of deep learning.
In this book, you will learn different techniques related to object classification, object detection, image segmentation, captioning, image generation, face analysis, and more. You will also explore their applications using popular Python libraries such as TensorFlow and Keras. This book will help you master state-of-the-art, deep learning algorithms and their implementation.
Page count: 258
Year of publication: 2018
Copyright © 2018 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
Commissioning Editor: Amey Varangaonkar
Acquisition Editor: Aman Singh
Content Development Editor: Varun Sony
Technical Editor: Dharmendra Yadav
Copy Editors: Safis Editing
Project Coordinator: Manthan Patel
Proofreader: Safis Editing
Indexer: Pratik Shirodkar
Graphics: Tania Dutta
Production Coordinator: Shantanu Zagade
First published: January 2018
Production reference: 1220118
Published by Packt Publishing Ltd. Livery Place 35 Livery Street Birmingham B3 2PB, UK.
ISBN 978-1-78829-562-8
www.packtpub.com
Mapt is an online digital library that gives you full access to over 5,000 books and videos, as well as industry-leading tools to help you plan your personal development and advance your career. For more information, please visit our website.
Spend less time learning and more time coding with practical eBooks and Videos from over 4,000 industry professionals
Improve your learning with Skill Plans built especially for you
Get a free eBook or video every month
Mapt is fully searchable
Copy and paste, print, and bookmark content
Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at [email protected] for more details.
At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters, and receive exclusive discounts and offers on Packt books and eBooks.
Deep learning is revolutionizing AI, and over the next several decades, it will change the world radically. AI powered by deep learning will be on a par in scale with the industrial revolution. This, understandably, has created both excitement and fear about the future. But the reality is that, just like the industrial revolution and machinery, deep learning will improve industrial capacity and raise standards of living dramatically for humankind. Rather than replacing jobs, it will create many more jobs of a higher standard. This is why this book is so important and timely. Readers of this book will be introduced to deep learning for computer vision, its power, and its many applications. This book will give readers a grounding in the fundamentals of an emerging industry that will grow exponentially over the next decade.
Rajalingappaa Shanmugamani is a great researcher whom I have worked with previously on several projects in computer vision. He was the lead engineer in designing and delivering a complex computer vision and deep learning system for fashion search that was deployed in the real world with great success. Among his strengths is his ability to take up state-of-the-art research on complex problems and apply it to real-world situations. He can also break down complex ideas and explain them in simple terms, as is demonstrated in this book. Raja is a very ambitious person with a great work ethic, and in this book, he has given a great overview of the current state of computer vision using deep learning, a task not many in today's industry can do. This book is a great achievement by Raja, and I'm sure readers will enjoy and benefit from it for many years to come.
Dr. Stephen Moore
Chief Technology Officer, EmotionReader, Singapore
Rajalingappaa Shanmugamani is currently working as a Deep Learning Lead at SAP, Singapore. Previously, he worked at and consulted for various startups, developing computer vision products. He holds a Master's degree from the Indian Institute of Technology, Madras, where his thesis was based on applications of computer vision in the manufacturing industry. He has published articles in peer-reviewed journals and conferences and applied for a few patents in the area of machine learning. In his spare time, he teaches programming and machine learning to school students and engineers.
Nishanth Koganti received his B.Tech in Electrical Engineering from the Indian Institute of Technology Jodhpur, India, in 2012, and his M.E. and PhD in Information Science from the Nara Institute of Science and Technology, Japan, in 2014 and 2017, respectively. He is currently a postdoctoral researcher at the University of Tokyo, Japan. His research interests are in assistive robotics, motor-skills learning, and machine learning. His graduate research was on the development of a clothing-assistance robot that helps elderly people to wear clothes.
If you're interested in becoming an author for Packt, please visit authors.packtpub.com and apply today. We have worked with thousands of developers and tech professionals, just like you, to help them share their insight with the global tech community. You can make a general application, apply for a specific hot topic that we are recruiting an author for, or submit your own idea.
Preface
Who this book is for
What this book covers
To get the most out of this book
Download the example code files
Conventions used
Get in touch
Reviews
Getting Started
Understanding deep learning
Perceptron
Activation functions
Sigmoid
The hyperbolic tangent function
The Rectified Linear Unit (ReLU)
Artificial neural network (ANN)
One-hot encoding
Softmax
Cross-entropy
Dropout
Batch normalization
L1 and L2 regularization
Training neural networks
Backpropagation
Gradient descent
Stochastic gradient descent
Playing with TensorFlow playground
Convolutional neural network
Kernel
Max pooling
Recurrent neural networks (RNN)
Long short-term memory (LSTM)
Deep learning for computer vision
Classification
Detection or localization and segmentation
Similarity learning
Image captioning
Generative models
Video analysis
Development environment setup
Hardware and Operating Systems - OS
General Purpose - Graphics Processing Unit (GP-GPU)
Compute Unified Device Architecture - CUDA
CUDA Deep Neural Network - CUDNN
Installing software packages
Python
Open Computer Vision - OpenCV
The TensorFlow library
Installing TensorFlow
TensorFlow example to print Hello, TensorFlow
TensorFlow example for adding two numbers
TensorBoard
The TensorFlow Serving tool
The Keras library
Summary
Image Classification
Training the MNIST model in TensorFlow
The MNIST datasets
Loading the MNIST data
Building a perceptron
Defining placeholders for input data and targets
Defining the variables for a fully connected layer
Training the model with data
Building a multilayer convolutional network
Utilizing TensorBoard in deep learning
Training the MNIST model in Keras
Preparing the dataset
Building the model
Other popular image testing datasets
The CIFAR dataset
The Fashion-MNIST dataset
The ImageNet dataset and competition
The bigger deep learning models
The AlexNet model
The VGG-16 model
The Google Inception-V3 model
The Microsoft ResNet-50 model
The SqueezeNet model
Spatial transformer networks
The DenseNet model
Training a model for cats versus dogs
Preparing the data
Benchmarking with simple CNN
Augmenting the dataset
Augmentation techniques
Transfer learning or fine-tuning of a model
Training on bottleneck features
Fine-tuning several layers in deep learning
Developing real-world applications
Choosing the right model
Tackling the underfitting and overfitting scenarios
Gender and age detection from face
Fine-tuning apparel models
Brand safety
Summary
Image Retrieval
Understanding visual features
Visualizing activation of deep learning models
Embedding visualization
Guided backpropagation
The DeepDream
Adversarial examples
Model inference
Exporting a model
Serving the trained model
Content-based image retrieval
Building the retrieval pipeline
Extracting bottleneck features for an image
Computing similarity between query image and target database
Efficient retrieval
Matching faster using approximate nearest neighbour
Advantages of ANNOY
Autoencoders of raw images
Denoising using autoencoders
Summary
Object Detection
Detecting objects in an image
Exploring the datasets
ImageNet dataset
PASCAL VOC challenge
COCO object detection challenge
Evaluating datasets using metrics
Intersection over Union
The mean average precision
Localizing algorithms
Localizing objects using sliding windows
The scale-space concept
Training a fully connected layer as a convolution layer
Convolution implementation of sliding window
Thinking about localization as a regression problem
Applying regression to other problems
Combining regression with the sliding window
Detecting objects
Regions of the convolutional neural network (R-CNN)
Fast R-CNN
Faster R-CNN
Single shot multi-box detector
Object detection API
Installation and setup
Pre-trained models
Re-training object detection models
Data preparation for the Pet dataset
Object detection training pipeline
Training the model
Monitoring loss and accuracy using TensorBoard
Training a pedestrian detection for a self-driving car
The YOLO object detection algorithm
Summary
Semantic Segmentation
Predicting pixels
Diagnosing medical images
Understanding the earth from satellite imagery
Enabling robots to see
Datasets
Algorithms for semantic segmentation
The Fully Convolutional Network
The SegNet architecture
Upsampling the layers by pooling
Sampling the layers by convolution
Skipping connections for better training
Dilated convolutions
DeepLab
RefineNet
PSPnet
Large kernel matters
DeepLab v3
Ultra-nerve segmentation
Segmenting satellite images
Modeling FCN for segmentation
Segmenting instances
Summary
Similarity Learning
Algorithms for similarity learning
Siamese networks
Contrastive loss
FaceNet
Triplet loss
The DeepNet model
DeepRank
Visual recommendation systems
Human face analysis
Face detection
Face landmarks and attributes
The Multi-Task Facial Landmark (MTFL) dataset
The Kaggle keypoint dataset
The Multi-Attribute Facial Landmark (MAFL) dataset
Learning the facial key points
Face recognition
The labeled faces in the wild (LFW) dataset
The YouTube faces dataset
The CelebFaces Attributes dataset (CelebA)
CASIA web face database
The VGGFace2 dataset
Computing the similarity between faces
Finding the optimum threshold
Face clustering
Summary
Image Captioning
Understanding the problem and datasets
Understanding natural language processing for image captioning
Expressing words in vector form
Converting words to vectors
Training an embedding
Approaches for image captioning and related problems
Using a condition random field for linking image and text
Using RNN on CNN features to generate captions
Creating captions using image ranking
Retrieving captions from images and images from captions
Dense captioning
Using RNN for captioning
Using multimodal metric space
Using attention network for captioning
Knowing when to look
Implementing attention-based image captioning
Summary
Generative Models
Applications of generative models
Artistic style transfer
Predicting the next frame in a video
Super-resolution of images
Interactive image generation
Image to image translation
Text to image generation
Inpainting
Blending
Transforming attributes
Creating training data
Creating new animation characters
3D models from photos
Neural artistic style transfer
Content loss
Style loss using the Gram matrix
Style transfer
Generative Adversarial Networks
Vanilla GAN
Conditional GAN
Adversarial loss
Image translation
InfoGAN
Drawbacks of GAN
Visual dialogue model
Algorithm for VDM
Generator
Discriminator
Summary
Video Classification
Understanding and classifying videos
Exploring video classification datasets
UCF101
YouTube-8M
Other datasets
Splitting videos into frames
Approaches for classifying videos
Fusing parallel CNN for video classification
Classifying videos over long periods
Streaming two CNNs for action recognition
Using 3D convolution for temporal learning
Using trajectory for classification
Multi-modal fusion
Attending regions for classification
Extending image-based approaches to videos
Regressing the human pose
Tracking facial landmarks
Segmenting videos
Captioning videos
Generating videos
Summary
Deployment
Performance of models
Quantizing the models
MobileNets
Deployment in the cloud
AWS
Google Cloud Platform
Deployment of models in devices
Jetson TX2
Android
iPhone
Summary
Other Books You May Enjoy
Leave a review - let other readers know what you think
Deep Learning for Computer Vision is a book intended for readers who want to learn deep-learning-based computer vision techniques for various applications. This book will give the reader tools and techniques to develop computer-vision-based products. There are plenty of practical examples covered in the book to follow the theory.
This book is for readers who want to know how to apply deep learning to computer vision problems such as classification, detection, retrieval, segmentation, generation, captioning, and video classification. It is also for readers who want to understand how to achieve good accuracy under various constraints, such as limited data, imbalanced classes, and noise, and who want to know how to deploy trained models on various platforms (AWS, Google Cloud, Raspberry Pi, and mobile phones). After completing this book, the reader should be able to develop code for problems such as person detection, face recognition, product search, medical image segmentation, image generation, image captioning, and video classification.
Chapter 1, Getting Started, introduces the basics of deep learning and makes the readers familiar with the vocabulary. The readers will install the software packages necessary to follow the rest of the chapters.
Chapter 2, Image Classification, talks about the image classification problem, which is labeling an image as a whole. The readers will learn about image classification techniques and train a deep learning model for pet classification. They will also learn methods to improve accuracy and dive deep into various advanced architectures.
Chapter 3, Image Retrieval, covers deep features and image retrieval. The reader will learn about various methods of obtaining model visualization, visual features, inference using TensorFlow, and serving and using visual features for product retrieval.
Chapter 4, Object Detection, talks about detecting objects in images. The reader will learn about various techniques of object detection and apply them for pedestrian detection. The TensorFlow API for object detection will be utilized in this chapter.
Chapter 5, Semantic Segmentation, covers segmenting images pixel-wise. The readers will learn about segmentation techniques and train a model for segmentation of medical images.
Chapter 6, Similarity Learning, talks about similarity learning. The readers will learn about similarity matching and how to train models for face recognition. A model for learning facial landmarks is also illustrated.
Chapter 7, Image Captioning, is about generating or selecting captions for images. The readers will learn natural language processing techniques and how to generate captions for images using those techniques.
Chapter 8, Generative Models, talks about generating synthetic images for various purposes. The readers will learn what generative models are and use them for image generation applications, such as style transfer, training data, and so on.
Chapter 9, Video Classification, covers computer vision techniques for video data. The readers will understand the key differences between solving video versus image problems and implement video classification techniques.
Chapter 10, Deployment, talks about the deployment steps for deep learning models. The reader will learn how to deploy trained models and optimize them for speed on various platforms.
The examples covered in this book can be run with Windows, Ubuntu, or Mac. All the installation instructions are covered. Basic knowledge of Python and machine learning is required. It's preferable that the reader has GPU hardware but it's not necessary.
You can download the example code files for this book from your account at www.packtpub.com. If you purchased this book elsewhere, you can visit www.packtpub.com/support and register to have the files emailed directly to you.
You can download the code files by following these steps:
1. Log in or register at www.packtpub.com.
2. Select the SUPPORT tab.
3. Click on Code Downloads & Errata.
4. Enter the name of the book in the Search box and follow the onscreen instructions.
Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:
WinRAR/7-Zip for Windows
Zipeg/iZip/UnRarX for Mac
7-Zip/PeaZip for Linux
The code bundle for the book is also hosted on GitHub at https://github.com/PacktPublishing/Deep-Learning-for-Computer-Vision. We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!
Feedback from our readers is always welcome.
General feedback: Email [email protected] and mention the book title in the subject of your message. If you have questions about any aspect of this book, please email us at [email protected].
Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you would report it to us. Please visit www.packtpub.com/submit-errata, select your book, click on the Errata Submission Form link, and enter the details.
Piracy: If you come across any illegal copies of our works in any form on the Internet, we would be grateful if you would provide us with the location address or website name. Please contact us at [email protected] with a link to the material.
If you are interested in becoming an author: If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, please visit authors.packtpub.com.
Please leave a review. Once you have read and used this book, why not leave a review on the site that you purchased it from? Potential readers can then see and use your unbiased opinion to make purchase decisions, we at Packt can understand what you think about our products, and our authors can see your feedback on their book. Thank you!
For more information about Packt, please visit packtpub.com.
Computer vision is the science of understanding or manipulating images and videos. It has many applications, including autonomous driving, industrial inspection, and augmented reality. The use of deep learning for computer vision falls into multiple categories: classification, detection, segmentation, and generation, for both images and videos. In this book, you will learn how to train deep learning models for computer vision applications and deploy them on multiple platforms. Throughout this book, we will use TensorFlow, a popular Python library for deep learning, for the examples. In this chapter, we will cover the following topics:
The basics and vocabulary of deep learning
How deep learning meets computer vision
Setting up the development environment that will be used for the examples covered in this book
Getting a feel for TensorFlow, along with its powerful tools, such as TensorBoard and TensorFlow Serving
Computer vision as a field has a long history. With the emergence of deep learning, computer vision has proven to be useful for various applications. Deep learning is a collection of techniques based on artificial neural networks (ANNs), which are a branch of machine learning. ANNs are modelled on the human brain: nodes are linked to each other and pass information between them. In the following sections, we will discuss how deep learning works in detail by understanding the commonly used basic terms.
An artificial neuron, or perceptron, takes several inputs and performs a weighted summation to produce an output. The weights of the perceptron are determined during the training process and are based on the training data. The following is a diagram of the perceptron:
The inputs are weighted and summed, as shown in the preceding image. The sum is then passed through a unit step function, in this case for a binary classification problem. A perceptron can only learn simple functions by learning the weights from examples. The process of learning the weights is called training. Training a perceptron can be done through gradient-based methods, which are explained in a later section. The output of the perceptron can be passed through an activation function, or transfer function, which will be explained in the next section.
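As a minimal sketch of this idea (the AND-gate weights and bias below are hand-picked for illustration, not taken from the book), a perceptron can be written in a few lines of Python:

```python
import numpy as np

def perceptron(inputs, weights, bias):
    """Weighted summation of the inputs, passed through a unit step function."""
    summation = np.dot(inputs, weights) + bias
    return 1 if summation > 0 else 0

# Example: a perceptron acting as a logical AND gate.
and_weights = np.array([1.0, 1.0])
and_bias = -1.5
print(perceptron(np.array([1, 1]), and_weights, and_bias))  # fires: 1
print(perceptron(np.array([1, 0]), and_weights, and_bias))  # does not fire: 0
```

In training, the weights and bias would be learned from examples rather than chosen by hand, as described in the gradient descent section later.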
Activation functions make neural nets nonlinear. An activation function decides whether a perceptron should fire or not. During training, activation functions play an important role in adjusting the gradients. An activation function such as sigmoid, shown in the next section, attenuates values with higher magnitudes. This nonlinear behaviour of the activation function enables deep nets to learn complex functions. Most activation functions are continuous and differentiable, except the rectified linear unit at 0. A continuous function has small changes in output for every small change in input. A differentiable function has a derivative existing at every point in its domain.
In order to train a neural network, the function has to be differentiable. Following are a few activation functions.
Sigmoid can be considered a smoothed step function and is hence differentiable. Sigmoid is useful for converting any value to probabilities and can be used for binary classification. Sigmoid maps its input to a value in the range of 0 to 1, as shown in the following graph:
For inputs of large magnitude, the change in the Y value with respect to X is small, and hence the gradients vanish. After some learning, the change may be small. Another activation function called tanh, explained in the next section, is a scaled version of sigmoid and lessens the vanishing-gradient problem.
The hyperbolic tangent function, or tanh, is a scaled version of sigmoid. Like sigmoid, it is smooth and differentiable. tanh maps its input to a value in the range of -1 to 1, as shown in the following graph:
The gradients of tanh are more stable than those of sigmoid, and hence it has fewer vanishing-gradient problems. Both sigmoid and tanh fire all the time, making the ANN computationally heavy. The Rectified Linear Unit (ReLU) activation function, explained in the next section, avoids this pitfall by not firing for some inputs.
ReLU lets big positive numbers pass through while clamping negative inputs to zero. This makes some neurons stale so that they don't fire, which increases sparsity, and hence is good. ReLU maps input x to max(0, x); that is, negative inputs are mapped to 0, and positive inputs are output without any change, as shown in the following graph:
Because ReLU doesn't fire all the time, it can be trained faster. Since the function is simple, it is computationally the least expensive. The choice of activation function is very dependent on the application. Nevertheless, ReLU works well for a large range of problems. In the next section, you will learn how to stack several perceptrons together so that they can learn more complex functions than a single perceptron can.
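The three activation functions discussed above can be sketched in NumPy as follows (a minimal illustration, not the book's code):

```python
import numpy as np

def sigmoid(x):
    # Maps any real value into the range (0, 1).
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    # Scaled version of sigmoid, mapping into (-1, 1).
    return np.tanh(x)

def relu(x):
    # max(0, x): negative inputs become 0, positive inputs pass unchanged.
    return np.maximum(0, x)

x = np.array([-2.0, 0.0, 2.0])
print(sigmoid(x))  # values between 0 and 1, with sigmoid(0) = 0.5
print(tanh(x))     # values between -1 and 1
print(relu(x))     # negative inputs clamped to 0
```

Note that tanh(x) equals 2 * sigmoid(2x) - 1, which is the precise sense in which it is a scaled version of sigmoid.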
An ANN is a collection of perceptrons and activation functions. The perceptrons are connected to form hidden layers or units. The hidden units form a nonlinear basis that maps the input layers to output layers in a lower-dimensional space; such a structure is what we call an artificial neural network. An ANN is a map from input to output. The map is computed by weighted addition of the inputs with biases. The weight and bias values, along with the architecture, are called the model.
The training process determines the values of these weights and biases. The model values are initialized with random values at the beginning of training. The error is computed using a loss function by comparing the prediction with the ground truth. Based on the loss computed, the weights are tuned at every step. Training is stopped when the error cannot be reduced further. The training process learns features along the way; these features are a better representation than the raw images. The following is a diagram of an artificial neural network, or multi-layer perceptron:
Several inputs x are passed through a hidden layer of perceptrons and summed to the output. The universal approximation theorem suggests that such a neural network can approximate any function. The hidden layer can also be called a dense layer. Every layer can have one of the activation functions described in the previous section. The number of hidden layers and perceptrons can be chosen based on the problem. There are a few more things that make this multilayer perceptron work for multi-class classification problems. A multi-class classification problem tries to discriminate between more than two categories. We will explore those terms in the following sections.
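The forward pass of such a multi-layer perceptron can be sketched in NumPy as follows (the layer sizes and random weights here are illustrative choices, not from the book):

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(0, x)

def mlp_forward(x, w1, b1, w2, b2):
    """Forward pass of a network with one hidden (dense) layer."""
    hidden = relu(x @ w1 + b1)   # weighted sum plus bias, then activation
    return hidden @ w2 + b2      # output layer: raw class scores

# Illustrative sizes: 4 inputs, 8 hidden units, 3 output classes.
w1, b1 = rng.normal(size=(4, 8)), np.zeros(8)
w2, b2 = rng.normal(size=(8, 3)), np.zeros(3)
scores = mlp_forward(rng.normal(size=(2, 4)), w1, b1, w2, b2)
print(scores.shape)  # (2, 3): one score vector per input example
```

In later chapters, the same structure is built with TensorFlow layers rather than raw matrix operations.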
One-hot encoding is a way to represent the target variables or classes of a classification problem. The target variables can be converted from string labels to one-hot encoded vectors. A one-hot vector is filled with 1 at the index of the target class and 0 everywhere else. For example, if the target classes are cat and dog, they can be represented by [1, 0] and [0, 1], respectively. For 1,000 classes, a one-hot vector will be of size 1,000, with all zeros except a single 1. One-hot encoding makes no assumptions about the similarity of the target variables. With the combination of one-hot encoding and softmax, explained in the following section, multi-class classification becomes possible in an ANN.
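The cat-and-dog example above can be sketched as follows (a minimal NumPy illustration):

```python
import numpy as np

def one_hot(labels, num_classes):
    """Convert integer class labels to one-hot encoded vectors."""
    encoded = np.zeros((len(labels), num_classes))
    encoded[np.arange(len(labels)), labels] = 1
    return encoded

# cat = 0, dog = 1, as in the text.
print(one_hot([0, 1, 0], num_classes=2))
# [[1. 0.]
#  [0. 1.]
#  [1. 0.]]
```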
Softmax is a way of forcing the outputs of a neural network to sum to 1, so that the output values of the softmax function can be considered part of a probability distribution. This is useful in multi-class classification problems. Softmax is a kind of activation function with the speciality that its outputs sum to 1. It converts the outputs to probabilities by exponentiating each output and dividing by the sum of all the exponentiated outputs. The Euclidean distance between the softmax probabilities and the one-hot encoding could be computed for optimization, but the cross-entropy explained in the next section is a better cost function to optimize.
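A minimal NumPy sketch of softmax (the logit values below are made up for illustration):

```python
import numpy as np

def softmax(logits):
    # Subtracting the max is for numerical stability; the result is
    # unchanged because softmax is invariant to shifting all inputs
    # by the same constant.
    exps = np.exp(logits - np.max(logits))
    return exps / np.sum(exps)

probs = softmax(np.array([2.0, 1.0, 0.1]))
print(probs)        # probabilities, largest for the largest logit
print(probs.sum())  # sums to 1
```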
Cross-entropy compares the distance between the outputs of softmax and one-hot encoding. Cross-entropy is a loss function for which the error has to be minimized. Neural networks estimate the probability of the given data belonging to every class. The probability of the correct target label has to be maximized. Cross-entropy is the summation of negative logarithmic probabilities. The logarithm is used for numerical stability. Maximizing a function is equivalent to minimizing the negative of the same function. In the next section, we will see the following regularization methods for avoiding the overfitting of an ANN:
Dropout
Batch normalization
L1 and L2 regularization
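The cross-entropy loss described above can be sketched as follows (the probability vectors are made-up examples):

```python
import numpy as np

def cross_entropy(one_hot_target, predicted_probs):
    """Summation of negative log probabilities, masked by the one-hot target."""
    # A tiny epsilon guards the logarithm against exactly-zero probabilities.
    eps = 1e-12
    return -np.sum(one_hot_target * np.log(predicted_probs + eps))

target = np.array([0.0, 1.0, 0.0])        # the true class is index 1
confident = np.array([0.05, 0.90, 0.05])  # high probability on the target
unsure = np.array([0.40, 0.30, 0.30])     # low probability on the target
print(cross_entropy(target, confident))   # small loss
print(cross_entropy(target, unsure))      # larger loss
```

The loss shrinks as the probability assigned to the correct label grows, which is exactly the maximization described in the text, expressed as a minimization.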
Dropout is an effective way of regularizing neural networks to avoid overfitting. During training, the dropout layer cripples the neural network by stochastically removing hidden units, as shown in the following image:
Note how the neurons are randomly selected during training. Dropout is also an efficient way of combining several neural networks: for each training case, a few hidden units are randomly selected, so that a different architecture is effectively trained for each case. This is an extreme case of bagging and model averaging. The dropout layer should not be used during inference, as it is not necessary there.
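The stochastic removal of hidden units can be sketched as follows (a minimal "inverted dropout" illustration, where the keep probability is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(activations, keep_prob):
    """Stochastically zero out hidden units during training.

    Scaling the survivors by 1/keep_prob keeps the expected activation
    unchanged, which is why the layer can be skipped at inference time.
    """
    mask = rng.random(activations.shape) < keep_prob
    return activations * mask / keep_prob

h = np.ones((4, 10))               # pretend hidden-layer activations
dropped = dropout(h, keep_prob=0.5)
print(dropped)  # roughly half the units zeroed, survivors scaled up
```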
Batch normalization, or batch-norm, increases the stability and performance of neural network training. It normalizes the output of a layer to zero mean and a standard deviation of 1. This reduces overfitting and makes the network train faster. It is very useful for training complex neural networks.
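The normalization step can be sketched as follows (the learnable scale and shift parameters of a real batch-norm layer are omitted for brevity):

```python
import numpy as np

def batch_norm(batch, epsilon=1e-5):
    """Normalize a batch of layer outputs to zero mean and unit std per feature.

    epsilon guards against division by zero for features with no variance.
    """
    mean = batch.mean(axis=0)
    var = batch.var(axis=0)
    return (batch - mean) / np.sqrt(var + epsilon)

batch = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
normalized = batch_norm(batch)
print(normalized.mean(axis=0))  # close to [0. 0.]
print(normalized.std(axis=0))   # close to [1. 1.]
```

Note how both features end up on the same scale even though the raw second feature is a hundred times larger than the first.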
L1 regularization penalizes the absolute value of the weights and tends to make the weights zero. L2 regularization penalizes the squared value of the weights and tends to make the weights smaller during training. Both regularizers assume that models with smaller weights are better.
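The two penalties, which are added to the loss during training, can be sketched as follows (the weight values and the regularization strength lam are made-up examples):

```python
import numpy as np

def l1_penalty(weights, lam):
    # Penalizes absolute values; its gradient pushes weights towards exactly 0.
    return lam * np.sum(np.abs(weights))

def l2_penalty(weights, lam):
    # Penalizes squared values; its gradient shrinks weights proportionally.
    return lam * np.sum(weights ** 2)

w = np.array([0.5, -1.0, 2.0])
print(l1_penalty(w, lam=0.01))  # lam * (0.5 + 1.0 + 2.0)
print(l2_penalty(w, lam=0.01))  # lam * (0.25 + 1.0 + 4.0)
```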
Training an ANN is tricky, as it contains several parameters to optimize. The procedure for updating the weights is called backpropagation. The procedure for minimizing the error is called optimization. We will cover both in detail in the next sections.
The backpropagation algorithm is commonly used for training artificial neural networks. The weights are updated backwards through the network, based on the error calculated, as shown in the following image:
After calculating the error, gradient descent can be used to update the weights, as explained in the next section.
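The weight update itself can be sketched as follows (a toy example minimizing the loss w squared, whose gradient is 2w; the learning rate is an arbitrary choice):

```python
import numpy as np

def gradient_descent_step(weights, gradients, learning_rate):
    """One weight update: move against the gradient of the loss."""
    return weights - learning_rate * gradients

# Minimizing loss = w^2 from an arbitrary starting point.
w = np.array([4.0])
for _ in range(50):
    w = gradient_descent_step(w, gradients=2 * w, learning_rate=0.1)
print(w)  # close to the minimum at 0
```

In a real network, backpropagation supplies the gradients for every weight, and the same update rule is applied to all of them.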
