Enhancing Deep Learning Performance Using Displaced Rectifier Linear Unit

David Macêdo


Description

Recently, deep learning has had a significant impact on computer vision, speech recognition, and natural language understanding. In spite of these remarkable advances, recent deep learning performance gains have been modest and usually rely on increasing the depth of the models, which often requires more computational resources such as processing time and memory usage. To tackle this problem, we turned our attention to the interworking between activation functions and batch normalization, which is currently virtually mandatory. In this work, we propose the activation function Displaced Rectifier Linear Unit (DReLU) by conjecturing that extending the identity function of ReLU into the third quadrant enhances compatibility with batch normalization. Moreover, we used statistical tests to compare the impact of using distinct activation functions (ReLU, LReLU, PReLU, ELU, and DReLU) on the learning speed and test accuracy of VGG and Residual Network state-of-the-art models. These convolutional neural networks were trained on CIFAR-10 and CIFAR-100, the most commonly used deep learning computer vision datasets. The results showed that DReLU sped up learning in all models and datasets. Moreover, statistically significant performance assessments (p<0.05) showed that DReLU enhanced the test accuracy obtained by ReLU in all scenarios. Furthermore, DReLU showed better test accuracy than any other tested activation function in all experiments, with one exception.




To my family.

Acknowledgements

This work would not have been possible without the support of many. I would like to thank and dedicate this dissertation to the following people:

To my advisor Teresa Ludermir. Teresa is an exceptional researcher and professor. Her guidance and support were fundamental in motivating me throughout this research.

To my co-advisor Cleber Zanchettin for his contributions to the work we have done.

To my family, especially my parents, José and Mary, my wife Janaina, and my children, Jéssica and Daniel, for giving me the love that I have needed throughout my whole life.

Things should be made as simple as possible, but no simpler.

—ALBERT EINSTEIN

List of Acronyms

ReLU

Rectifier Linear Unit

LReLU

Leaky Rectifier Linear Unit

PReLU

Parametric Rectifier Linear Unit

ELU

Exponential Linear Unit

DReLU

Displaced Rectifier Linear Unit

CNN

Convolutional Neural Network

VGG

Visual Geometry Group

ResNet

Residual Network

RBM

Restricted Boltzmann Machine

AE

Auto-Encoder

DBN

Deep Belief Network

SAE

Stacked Auto-Encoder

MLP

Multilayer Perceptron

FNN

Feedforward Neural Network

SAH

Single Algorithm Hypothesis

RNN

Recurrent Neural Network

GMM

Gaussian Mixture Model

DNN

Deep Neural Network

Contents


1. Introduction

1.1 CONTEXT

1.2 PROBLEM

1.3 GOAL

1.4 OUTLINE

2. Background

2.1 DEEP LEARNING

2.2 ACTIVATION FUNCTIONS

2.2.1 Rectifier Linear Unit

2.2.2 Leaky Rectifier Linear Unit

2.2.3 Parametric Rectifier Linear Unit

2.2.4 Exponential Linear Unit

2.3 CONVOLUTIONAL NETWORKS

2.4 ARCHITECTURES

2.4.1 Visual Geometry Group

2.4.2 Residual Networks

2.5 REGULARIZATION

2.5.1 Dropout

2.5.2 Batch Normalization

3. Displaced Rectifier Linear Unit

4. Experiments

4.1 DATASETS, PREPROCESSING AND DATA AUGMENTATION

4.2 ACTIVATION FUNCTIONS PARAMETRIZATION

4.3 MODELS AND INITIALIZATION

4.4 TRAINING AND REGULARIZATION

4.5 PERFORMANCE ASSESSMENT

5. Results

5.1 BIAS SHIFT EFFECT

5.2 CIFAR-10 DATASET

5.2.1 VGG-19 Model

5.2.2 ResNet-56 Model

5.2.3 ResNet-110 Model

5.3 CIFAR-100 DATASET

5.3.1 VGG-19 Model

5.3.2 ResNet-56 Model

5.3.3 ResNet-110 Model

5.4 DISCUSSION

6. Conclusion

6.1 CONTRIBUTIONS

6.2 FUTURE WORK

References


1. Introduction

A journey of a thousand miles begins with a single step.

—LAO TZU

In this introductory chapter, we explain the context of this work, which is deep learning research. After that, we establish the problem of interest. Then we set the goals of this study and the contributions we achieved. Finally, we present an outline of the subjects of the following chapters.

1.1 CONTEXT

Research on artificial neural networks has passed through three historical waves (Fig. 1.1) (GOODFELLOW; BENGIO; COURVILLE, 2016). The first one, known as cybernetics, started in the late 1950s with the work of Rosenblatt and the definition of the Perceptron, which was shown to be useful for linearly separable problems (ROSENBLATT, 1958). This initial excitement diminished in the 1970s after the work of Minsky and Papert (MINSKY; PAPERT, 1969), which demonstrated some limitations of this concept.

The second wave of artificial neural networks research, known as connectionism, began in the 1980s after the dissemination of the so-called backpropagation algorithm (RUMELHART; HINTON; WILLIAMS, 1986), which allowed training neural networks with a few hidden layers. Nevertheless, the Vanishing Gradient Problem supported the idea that training neural networks with more than a few layers was a hard challenge (HOCHREITER, 1991).

Therefore, this second wave gave way to a strong interest in new statistical machine learning methods discovered or improved in the 1990s. Artificial neural networks research passed through another dismal period and fell out of favor again. Indeed, it was a time when machine learning researchers largely forsook neural networks, and backpropagation was ignored by the computer vision and natural language processing communities.

Figure 1.1: The three historical waves of artificial neural networks research (GOODFELLOW; BENGIO; COURVILLE,2016).

The third and present wave of artificial neural networks research has been called deep learning, and it started in the late 2000s with seminal works from Geoffrey Hinton, Yoshua Bengio, and Yann LeCun, which showed that it is possible to train artificial neural networks with many hidden layers. The recent advances in deep learning research have produced more accurate image, speech, and language recognition systems and have generated new state-of-the-art machine learning applications in a broad range of areas such as mathematics, physics, healthcare, genomics, finance, business, agriculture, etc.

Activation functions are the components of neural network architectures responsible for adding nonlinearity to the models. In fact, considering Figure 1.2, the transformation performed by a generic shallow or deep neural network layer can be written as below:

y = f(Wx + b)    (1.1)

where x is the layer input, W is the weight matrix, b is the bias vector, and f is the activation function.

As can be seen in Eq. 1.1, the activation function is the only component of a neural network, or of a deep architecture, that incorporates nonlinearity. Indeed, if the activation function f is removed from the mentioned equation, a particular layer would only be able to perform affine transformations, which are composed of a linear transformation plus a bias. Since a composition of affine transformations is also an affine transformation, even very deep neural networks would only be able to perform affine transformations.
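This collapse of stacked affine layers into a single affine map can be verified numerically. The following is a minimal NumPy sketch; the layer sizes and random weights are arbitrary illustrative choices, not taken from this work:

import numpy as np

rng = np.random.default_rng(0)

# Two layers without any activation function: y = W2 (W1 x + b1) + b2
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)
x = rng.normal(size=3)

y_stacked = W2 @ (W1 @ x + b1) + b2

# The same result comes from a single affine transformation y = W x + b,
# with W = W2 W1 and b = W2 b1 + b2: depth adds no expressive power here.
W, b = W2 @ W1, W2 @ b1 + b2
y_single = W @ x + b

print(np.allclose(y_stacked, y_single))  # True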

Figure 1.2: Activation functions introduce nonlinearity into shallow or deep neural networks (HAYKIN, 2008).

However, it is widespread knowledge that most real-world problems are nonlinear, and therefore, in the absence of nonlinear activation functions, neural networks would be useful in extremely few practical situations. In this sense, it is remarkably important that activation functions enable neural networks to perform complex nonlinear mappings, mainly when several layers are hierarchically composed.

The demonstration of the fundamental role that activation functions play in neural network performance does not rely only on theoretical arguments like the one mentioned above, but is in fact also endorsed by practical considerations.

Indeed, there is no doubt that one of the most significant contributions (or perhaps the major contribution) to making supervised deep learning a major success in practical tasks was the discovery that the Rectifier Linear Unit (ReLU) (Fig. 1.3) outperforms traditionally used activation functions (KRIZHEVSKY; SUTSKEVER; HINTON, 2012).

The mentioned article showed that ReLU presents much better performance than the traditionally used Sigmoid and Hyperbolic Tangent activation functions for training artificial neural networks, mainly when a deep architecture is being used (Fig. 1.4). The fundamental innovation of ReLU was showing that using the identity function for positive values is a clever mechanism to ensure the backpropagation of the gradient to very deep layers, which was believed to be impossible by the research community for decades (HOCHREITER, 1991).
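As an illustration of this point, ReLU and its derivative can be written in a few lines of NumPy; the constant unit slope for positive inputs is what keeps the backpropagated gradient from shrinking. This is a minimal sketch, not code from this work:

import numpy as np

def relu(x):
    # Identity for positive inputs, zero otherwise
    return np.maximum(x, 0.0)

def relu_grad(x):
    # Derivative: exactly 1 for positive inputs, 0 for negative inputs
    return (x > 0).astype(float)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(x))       # [0.  0.  0.  0.5 2. ]
print(relu_grad(x))  # [0. 0. 0. 1. 1.]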

Therefore, considering the theoretical and practical aspects mentioned above, the fundamental importance of activation functions for the design and performance of (deep) neural networks cannot be overestimated.

Figure 1.3: Plot of ReLU activation function.

Figure 1.4: The ReLU activation function (solid line) presents much higher training speed and test accuracy performance than traditionally used activation functions like the Sigmoid and Hyperbolic Tangent (dashed line) (KRIZHEVSKY; SUTSKEVER; HINTON, 2012).

Similar to activation functions, batch normalization (IOFFE; SZEGEDY, 2015) currently plays a fundamental role in training deep architectures. This technique normalizes the inputs of each layer, which is equivalent to normalizing the outputs of the previous layer of the deep model. Consequently, batch normalization allows the use of higher learning rates and makes the choice of weight initialization technique almost irrelevant.
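A minimal sketch of the training-time transformation that batch normalization applies to each feature of a mini-batch is shown below; the running statistics used at test time and other framework details are omitted for simplicity:

import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    # x: mini-batch of layer inputs with shape (batch_size, num_features)
    mean = x.mean(axis=0)                    # per-feature mini-batch mean
    var = x.var(axis=0)                      # per-feature mini-batch variance
    x_hat = (x - mean) / np.sqrt(var + eps)  # zero mean, unit variance
    return gamma * x_hat + beta              # learnable scale and shift

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=(128, 8))
y = batch_norm(x, gamma=np.ones(8), beta=np.zeros(8))
print(y.mean(axis=0).round(3))  # approximately 0 for every feature
print(y.std(axis=0).round(3))   # approximately 1 for every feature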

Batch normalization also works as an effective regularizer, virtually eliminating the need for another regularization technique called dropout (SRIVASTAVA et al., 2014). Therefore, batch normalization significantly contributes to improving deep learning performance and currently represents a mandatory method in this research field.

1.2 PROBLEM

The Sigmoid activation function was one of the most used during the past decades. The primary motivation for its use is that it has a natural inspiration and that the almost linear region near the y-axis can be utilized for training. The saturations are needed to provide nonlinearity to the activation function (Fig. 1.5).

However, the same saturations that are responsible for providing nonlinearity to the Sigmoid, in practical terms, kill the gradients and are therefore extremely harmful to backpropagation and consequently to the learning process. In fact, the extremely small gradients of the saturation areas do not allow the weights to be updated in such cases. For this reason, for years, careful initialization was used to avoid the saturation regions (LECUN et al., 2012). Nevertheless, during training, activations inevitably begin to fall into the saturation areas, which slows training.
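The saturation effect is easy to quantify: the derivative of the Sigmoid is s(x)(1 - s(x)), which peaks at 0.25 and becomes vanishingly small for inputs of even moderate magnitude. A minimal numerical illustration:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)  # maximum value 0.25, reached at x = 0

for x in [0.0, 2.0, 5.0, 10.0]:
    print(x, sigmoid_grad(x))
# 0.0  -> 0.25
# 2.0  -> ~0.105
# 5.0  -> ~0.0066
# 10.0 -> ~0.000045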

The Hyperbolic Tangent was also commonly used during the last decades in shallow neural networks. This activation function has essentially the same shape as the Sigmoid, but it presents zero-mean outputs, which improves learning by working as a type of normalization procedure (LECUN et al., 2012) (Fig. 1.6).
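The zero-mean property can be checked directly: for inputs centered around zero, the Hyperbolic Tangent produces activations centered around zero, whereas the Sigmoid pushes them toward 0.5. A small illustrative check (not from this work):

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100_000)        # zero-centered inputs

sigmoid_out = 1.0 / (1.0 + np.exp(-x))
tanh_out = np.tanh(x)

print(round(sigmoid_out.mean(), 3))  # ~0.5: activations are pushed off-center
print(round(tanh_out.mean(), 3))     # ~0.0: activations remain zero-mean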

Nevertheless, the Hyperbolic Tangent still has saturation regions, and therefore all the comments above about the drawbacks of this aspect of the Sigmoid also apply. Thus, despite representing an advance in comparison with the Sigmoid, the Hyperbolic Tangent still presents severe limitations.

Currently, we know that neither the Sigmoid nor the Hyperbolic Tangent is able to train deep neural networks, because of the absence of the identity function for positive inputs (KRIZHEVSKY; SUTSKEVER; HINTON, 2012). The saturations presented by these activation functions produce near-zero slopes, and therefore the backpropagation procedure is unable to send gradient information to the deeper layers. This phenomenon was studied in the 1990s and became known as the Vanishing Gradient Problem (HOCHREITER, 1991).

Figure 1.5: Plot of Sigmoid activation function.

Figure 1.6: Plot of Hyperbolic Tangent activation function.

The discovery of ReLU allowed achieving higher accuracy in less time by avoiding this phenomenon. ReLU avoids the mentioned problem by imposing the identity for positive values, which allows efficient and fast training of deeper architectures. In fact, by replacing the saturation in the first quadrant (present in both the Sigmoid and the Hyperbolic Tangent) with the identity function, the gradient can be backpropagated to deep layers without going to zero. Consequently, by allowing the training of deeper neural networks, the discovery of ReLU (NAIR; HINTON, 2010; GLOROT; BORDES; BENGIO, 2011; KRIZHEVSKY; SUTSKEVER; HINTON, 2012) was one of the main factors that contributed to the advent of deep learning.
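A back-of-the-envelope calculation illustrates the difference: during backpropagation, the gradient is multiplied by one activation derivative per layer it traverses, so the at-most-0.25 slope of the Sigmoid shrinks it geometrically with depth, while the unit slope of ReLU on active units leaves it unchanged. The sketch below ignores the weight matrices and assumes every ReLU unit is active, so it is only a simplified upper-bound illustration:

# Attenuation caused by the activation derivatives alone across a deep stack.
max_sigmoid_slope = 0.25  # largest possible derivative of the Sigmoid
relu_slope = 1.0          # derivative of ReLU for any positive (active) input

for depth in [5, 10, 20, 50]:
    print(depth, max_sigmoid_slope ** depth, relu_slope ** depth)
# 5  -> ~9.8e-04 vs 1.0
# 10 -> ~9.5e-07 vs 1.0
# 20 -> ~9.1e-13 vs 1.0
# 50 -> ~7.9e-31 vs 1.0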