Semi-Supervised and Unsupervised Machine Learning

Amparo Albalate

Description

This book provides a detailed and up-to-date overview of classification and data mining methods. The first part focuses on supervised classification algorithms and their applications, including recent research on the combination of classifiers. The second part deals with unsupervised data mining and knowledge discovery, with special attention to text mining. Discovering the underlying structure in a data set has been a key research topic associated with unsupervised techniques, with multiple applications and challenges, from web-content mining to the inference of cancer subtypes in genomic microarray data. Among these, the book focuses on a new application to dialog systems, which can thereby be made adaptable and portable to different domains. Clustering evaluation metrics and new approaches, such as ensembles of clustering algorithms, are also described.




Table of Contents

Part 1. State of the Art

Chapter 1. Introduction

1.1. Organization of the book

1.2. Utterance corpus

1.3. Datasets from the UCI repository

1.4. Microarray dataset

1.5. Simulated datasets

Chapter 2. State of the Art in Clustering and Semi-Supervised Techniques

2.1. Introduction

2.2. Unsupervised machine learning (clustering)

2.3. A brief history of cluster analysis

2.4. Cluster algorithms

2.5. Applications of cluster analysis

2.6. Evaluation methods

2.7. Internal cluster evaluation

2.8. External cluster validation

2.9. Semi-supervised learning

2.10. Summary

Part 2. Approaches to Semi-Supervised Classification

Chapter 3. Semi-Supervised Classification Using Prior Word Clustering

3.1. Introduction

3.2. Dataset

3.3. Utterance classification scheme

3.4. Semi-supervised approach based on term clustering

3.5. Disambiguation

3.6. Summary

Chapter 4. Semi-Supervised Classification Using Pattern Clustering

4.1. Introduction

4.2. New semi-supervised algorithm using the cluster and label strategy

4.3. Optimum cluster labeling

4.4. Supervised classification block

4.5. Datasets

4.6. An analysis of the bounds for the cluster and label approaches

4.7. Extension through cluster pruning

4.8. Simulations and results

4.9. Summary

Part 3. Contributions to Unsupervised Classification — Algorithms to Detect the Optimal Number of Clusters

Chapter 5. Detection of the Number of Clusters through Non-Parametric Clustering Algorithms

5.1. Introduction

5.2. New hierarchical pole-based clustering algorithm

5.3. Evaluation

5.4. Datasets

5.5. Summary

Chapter 6. Detecting the Number of Clusters through Cluster Validation

6.1. Introduction

6.2. Cluster validation methods

6.3. Combination approach based on quantiles

6.4. Datasets

6.5. Results

6.6. Application to speech utterances

6.7. Summary

Bibliography

Index

First published 2011 in Great Britain and the United States by ISTE Ltd and John Wiley & Sons, Inc.

Apart from any fair dealing for the purposes of research or private study, or criticism or review, as permitted under the Copyright, Designs and Patents Act 1988, this publication may only be reproduced, stored or transmitted, in any form or by any means, with the prior permission in writing of the publishers, or in the case of reprographic reproduction in accordance with the terms and licenses issued by the CLA. Enquiries concerning reproduction outside these terms should be sent to the publishers at the undermentioned address:

ISTE Ltd, 27-37 St George’s Road, London SW19 4EU, UK

John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, USA

www.iste.co.uk

www.wiley.com

© ISTE Ltd 2011

The rights of Amparo Albalate and Wolfgang Minker to be identified as the authors of this work have been asserted by them in accordance with the Copyright, Designs and Patents Act 1988.

Library of Congress Cataloging-in-Publication Data

Albalate, Amparo.

Semi-supervised and unsupervised machine learning / Amparo Albalate, Wolfgang Minker.

p. cm.

Includes bibliographical references and index.

ISBN 978-1-84821-203-9 (hardback)

1. Data mining. 2. Discourse analysis--Statistical methods. 3. Speech processing systems. 4. Computational intelligence. I. Minker, Wolfgang. II. Title.

QA76.9.D343A3393 2011

006.3--dc22

2010040730

British Library Cataloguing-in-Publication Data

A CIP record for this book is available from the British Library

ISBN 978-1-84821-203-9

PART 1

State of the Art

Chapter 1

Introduction

The main objective of this book is to develop machine learning (ML) tools that help minimize the (costly) human supervision required for the analysis of large volumes of data. To address this objective, the research work presented in this book focuses on two major fields of ML: unsupervised and semi-supervised learning. Both ML areas have been widely used in a large number of applications, such as the clustering and semi-automatic annotation of large datasets of documents and the dimensionality reduction of microarray matrices for the analysis and interpretation of genomic data. In these examples, owing to the complexity and/or size of the data to be processed, a fully supervised analysis without the help of semi-supervised or unsupervised ML tools would be prohibitive.

Thus, the first aim of this book is the development of new algorithms in the field of semi-supervised ML. Semi-supervised learning provides an alternative to fully supervised classification. In supervised classification, a so-called training phase is performed using only labeled data. Typically, the labels for the training observations are manually compiled by human annotators. A supervised algorithm is then capable of inferring prediction rules or models from the available training data and consequently delivering the most probable label for a new observation, not necessarily observed in the training data. However, a major limitation of supervised algorithms is the need for large labeled corpora in order to achieve accurate predictions. As is generally accepted in the ML literature, the performance of supervised classifiers can drop drastically if only small training sets are available.

In [CAS 95] it was shown that some advantage can nevertheless be gained if a large amount of unlabeled data is available. In particular, this is possible to the degree to which the class labels fulfill certain assumptions that allow the class structure to be identified from both labeled and unlabeled data. The framework of classification algorithms designed to use both labeled and unlabeled data to build their prediction models is known as semi-supervised classification.

Nowadays, the semi-supervised learning field is rapidly evolving, as evidenced by the large number of semi-supervised approaches available in the machine learning literature, including generative models, co-training, self-training, and graph-based models. Frequently, the learning strategy followed by many semi-supervised algorithms can be summarized as follows: (1) select a supervised algorithm with a certain learning rule for labeled data and (2) modify the learning rule by including unlabeled data so that a common objective is attained. A drawback of such a strategy concerns the algorithms’ stability and robustness with respect to labeling errors. Given the human effort involved in the manual labeling task, training sets are not exempt from potential labeling errors. These may occur depending on the degree of expertise of the human annotators. Even for expert labelers, the confidence in annotating patterns with a certain degree of ambiguity may drop drastically. Hence, a subjective bias in annotating this kind of pattern is unavoidable. Depending on the nature of the classification task and corpora, subjective biases may become a commonly faced problem, as happens in the recognition of emotional states.

In light of these considerations, two different approaches to semi-supervised classification are described in this book, both of which rely on unsupervised clustering as a prior step to classification. By clearly separating the clustering and classification objectives, the proposed algorithms may gain some robustness under labeling errors compared with other existing semi-supervised algorithms. The first algorithm has been developed for utterance corpora. It exploits the semantic feature variability by means of prior feature clustering, which is combined with a fully unsupervised algorithm for pattern disambiguation. The second approach performs the clustering in the pattern space to extract the underlying class structure and uses the labeled sets to automatically annotate the clusters.

The second aim of this book is to identify the underlying classes in a dataset in a fully unsupervised way, i.e. in the absence of labels. The field of unsupervised learning has witnessed an accelerated growth since the mid-1940s (see Chapter 2 for detailed information), resulting in a large pool of clustering algorithms in the ML literature. However, the first question that arises with the use of a clustering algorithm is the optimum number of clusters to be selected. Most clustering algorithms are parametric approaches, which either explicitly require the number of clusters k as an input parameter or implicitly require other types of parameters that also need appropriate estimation. A number of cluster validation techniques have also been proposed in an attempt to estimate the optimum k, but most of them are reported to provide ad hoc individual performances that depend on the clustering algorithm and distance functions selected. A number of authors in the unsupervised literature have claimed that finding the optimum k still remains an open research topic.

In this respect, this book also provides some attempts to solve the problem of estimating the number of clusters. The first developed algorithm is a non-parametric clustering tool, which automatically identifies the number of clusters during the clustering process itself. The second approach is a cluster validation strategy that attempts to improve upon the individual performances of existing cluster validation techniques through an adequate combination of them.

1.1. Organization of the book

The book is divided into three main parts:

– The first part (Chapters 1 and 2) states the main objectives and motivation of the contributed tools, and describes the state of the art of unsupervised and semi-supervised methods.

– The second part (Chapters 3 and 4) describes the main contributions of the book in the field of semi-supervised learning. Semi-supervised tools take advantage of both labeled and unlabeled data. In Chapter 3, a semi-supervised scheme for the categorization of utterances using a single labeled example per category is described. Unlabeled data are then used to identify clusters of synonymous words in the vocabulary. In this way, the initial small vocabulary available to the classifier is expanded to the clusters of words. Furthermore, the main contribution of the book, which provides important improvements in categorization accuracy, is a new unsupervised scheme for the reallocation of ambiguous utterances based on the identification of the most informative words.

Chapter 4 describes a semi-supervised classification approach also based on prior clustering, but this time applied to patterns (e.g. utterances) instead of words. A minimum number of labels is then used to automatically annotate the patterns inside the obtained clusters. An optimum cluster labeling is achieved by defining a cost matrix in terms of the overlapping labels inside the clusters and applying the Hungarian algorithm to derive the optimum assignment of labels to clusters. In this way, the initial small training set is automatically extended to the whole clustered data. This semi-supervised approach has been evaluated and compared with a fully supervised approach in which the initial labeled seeds are directly used to train a supervised classifier. A brief sketch of this cluster labeling step is given after this overview.

– The third part of the book (Chapters 5 and 6) describes the contributions in unsupervised learning, particularly focused on the automatic detection of the number of clusters. Chapter 5 focuses on an existing algorithm, pole-based overlapping clustering (PoBOC), which is, to the authors’ knowledge, the only existing fully non-parametric algorithm able to detect the number of clusters. Moreover, a hierarchical alternative, namely hierarchical pole-based clustering (HPoBC), is proposed to overcome a major limitation of PoBOC related to the extraction of clusters based on a globally defined concept of neighborhood. The new alternative applies PoBOC recursively to find clusters based on local neighborhoods. Both approaches have been compared with other traditional algorithms introduced in Chapter 2.

In Chapter 6, the detection of the optimum number of clusters (k) is addressed through cluster validation strategies. In particular, a combination approach is described which attempts to find the optimum k from the combination of validation curves obtained through traditional methods.
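As a concrete illustration of the cluster-and-label step outlined for Chapter 4, the following is a minimal sketch, not the book's implementation: it builds an overlap matrix between clusters and labeled seeds and uses SciPy's linear_sum_assignment as a ready-made Hungarian solver. The cluster assignments and seed labels are purely illustrative.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Hypothetical example: 3 clusters, 3 classes and a handful of labeled seeds.
# cluster_ids[i] is the cluster of pattern i; seed_labels[i] is its class
# label, or -1 if the pattern is unlabeled.
cluster_ids = np.array([0, 0, 1, 1, 2, 2, 2, 0, 1, 2])
seed_labels = np.array([0, -1, 1, -1, 2, -1, 2, 0, -1, -1])
n_clusters, n_classes = 3, 3

# Overlap matrix: how many seeds of each class fall into each cluster.
overlap = np.zeros((n_clusters, n_classes), dtype=int)
for c, y in zip(cluster_ids, seed_labels):
    if y >= 0:
        overlap[c, y] += 1

# Cost matrix defined as the negative overlap; the Hungarian algorithm then
# finds the assignment of class labels to clusters maximizing total overlap.
row_ind, col_ind = linear_sum_assignment(-overlap)
cluster_to_label = dict(zip(row_ind, col_ind))

# Annotate every pattern with the label assigned to its cluster, extending
# the small labeled seed set to the whole clustered data.
propagated = np.array([cluster_to_label[c] for c in cluster_ids])
print(cluster_to_label, propagated)
```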

1.2. Utterance corpus

The first datasets are two corpora of utterance transcripts collected from spoken language dialogue systems (SLDSs). SLDSs emerged in the mid-1990s as a new, important form of human-machine communication. The typical architecture of an SLDS is shown in Figure 1.1.

Figure 1.1. Overview of an SLDS

First, the input acoustic vectors generated from the speech signal are processed by an Automatic Speech Recogniser (ASR), resulting in a raw text transcription1 of the input utterance. Subsequently, the transcribed text is interpreted in a semantic analysis block, which extracts the utterance meaning in the form of an appropriate semantic structure. This semantic representation is processed by the dialog manager, which also communicates directly with an external application, namely a database interface. The dialog manager keeps control of the overall interaction progress towards task completion. During this process, the user may be queried for confirmations, disambiguations, necessary additional information, etc. Finally, the interaction result is presented to the user in the form of speech (text-to-speech synthesis or prerecorded prompts).

The utterance corpora used in this book have been obtained from a particular type of SLDS known as automated troubleshooting agents. These are SLDSs specially designed to handle customer care issues over the telephone, in a similar way to human agents. One essential component of this type of SLDS is the semantic analysis block. Typically, the semantic analysis in troubleshooting agents is carried out at the utterance level. The users are presented with an open prompt, such as “please briefly describe the reason for your call.” Then, the semantic analysis block maps the unconstrained, natural language user response onto one of the possible problems from a predefined list. This is done by means of statistical classifiers. This particular kind of semantic analysis is generally known as Statistical Spoken Language Understanding (SSLU).
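The classification step itself can be sketched as follows, assuming a simple bag-of-words representation and an off-the-shelf discriminative classifier; the utterances and symptom categories below are made-up examples in the spirit of Table 1.1, not taken from the SpeechCycle corpora.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Illustrative caller utterances and their symptom categories (hypothetical).
utterances = [
    "my internet connection keeps dropping",
    "I have a question about my bill",
    "the cable box is not turning on",
    "I want to speak with an agent",
]
symptoms = ["Internet", "Billing", "Cable", "Operator"]

# Bag-of-words (tf-idf) features followed by a statistical classifier.
sslu = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
sslu.fit(utterances, symptoms)

# Map a new, unconstrained response to one symptom from the predefined list.
print(sslu.predict(["my bill seems too high this month"]))
```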

The machine learning tools developed in this book are focused on semi-supervised utterance classification for SSLU and the unsupervised discovery of potential symptoms. The two utterance corpora presented in the following refer to different application domains of commercial troubleshooting agents and have been made available by SpeechCycle Labs [ACO 07].

The first corpus is related to the “Internet” domain. Some examples of transcribed utterances and their corresponding (manual) annotations into “symptom categories” are shown in Table 1.1.

It should be noted that an utterance is defined as “a complete unit of speech in spoken language.” Although transcribed utterances may have a similar form to written sentences, utterances share the characteristics of natural spoken language, in contrast to sentences in written language. Thus, it is not rare to observe ill-formed or grammatically incomplete utterance transcriptions (in particular when they reflect the users’ responses to the system’s prompts).

Table 1.1. Some examples of utterances and their corresponding manual labels (annotations) in the utterance corpora

Utterance transcript                                      | Annotation
Remotes not working                                       | Cable
Internet was supposed to be scheduled at my home today    | Appointment
Billing Information                                       | Billing
I am having Internet problems                             | Internet
I need to get the hi-def cable box installed              | Cable
I need to talk to a representative                        | Operator

1.3. Datasets from the UCI repository

The UCI machine learning repository is one of the most popular collections of real-world and simulated datasets for machine learning experiments. In this book, the following datasets, with different characteristics (numbers of features and classes), have been selected. The projections of the datasets onto the three principal components can be observed in Figures 1.2 and 1.3.
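Such projections can be reproduced along the following lines (a sketch assuming the copies of the wine and Iris sets distributed with scikit-learn; the preprocessing actually used for Figures 1.2 and 1.3 may differ):

```python
from sklearn.datasets import load_iris, load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Project UCI datasets onto their first three principal components.
for loader in (load_wine, load_iris):
    X = loader().data
    X = StandardScaler().fit_transform(X)       # standardise the attributes
    X3 = PCA(n_components=3).fit_transform(X)   # three principal components
    print(loader.__name__, X3.shape)            # coordinates used for plotting
```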

1.3.1. Wine dataset (wine)

The wine set consists of 178 instances with 13 attributes, representing three different types of wines.

Figure 1.2. Projection of the datasets on the three principal components: (a) breast dataset, (b) diabetes dataset, (c) wine dataset, and (d) Iris dataset

1.3.2. Wisconsin breast cancer dataset (breast)

This dataset contains 569 instances in 10 dimensions, corresponding to 10 different features extracted from digitized images of breast masses. The two existing classes refer to the possible breast cancer diagnoses (malignant and benign).

1.3.3. Handwritten digits dataset (Pendig)

The third real dataset is for pen-based recognition of handwritten digits. In this work, the test partition has been used, composed of 3,498 samples with 16 attributes. Ten classes can be distinguished for the digits 0–9.

Figure 1.3. Pendig dataset (projection on the three principal components)

Figure 1.4. Mixture of Gaussians datasets: (a) five Gaussians, and (b) seven Gaussians

1.3.4. Pima Indians diabetes (diabetes)

This dataset comprises 768 instances with eight numeric attributes. Two classes denote the possible diagnoses (the patient does or does not show signs of diabetes).

Figure 1.5. Spatial datasets: (a) dataset with 100 points in 5 spatial clusters (100p5c), (b) mixture of 6 Gaussians, 1,500 points (6Gauss), (c) mixture of three Gaussians (3Gauss), (d) 560 points, 8 clusters (560p8c), and (e) 1,000 points, 9 clusters (1000p9c)

1.3.5. Iris dataset (Iris)

The Iris set is one of the most popular datasets from the UCI repository. It comprises 150 instances of three different classes of iris flowers (Setosa, Versicolor, and Virginica). One of these classes is linearly separable from the other two, whereas the remaining two are not linearly separable from each other.

1.4. Microarray dataset

The NCI60 dataset [ROS 00] from Stanford University has also been used. This dataset is publicly available at [NCI 06]. It consists of gene expression data for 60 cell lines derived from different organs and tissues. The data form a 1,375 × 60 matrix, where each row represents a gene and each column a cell line related to a human tumor. Nine known tumor types and one unknown type can be distinguished. The cancer types include: leukemia, colon, breast, prostate, lung, ovarian, renal, central nervous system, and melanoma.

1.5. Simulated datasets

Finally, simulated datasets in two dimensions have also been used to illustrate the behavior of the proposed algorithms and their outcomes.

1.5.1. Mixtures of Gaussians

The first datasets are two mixtures of five and seven Gaussians in two dimensions, where a certain amount of overlapping patterns can be observed.
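A mixture of this kind can be synthesized, for instance, as follows; the means, spread, and sample counts below are arbitrary placeholders rather than the parameters of the original datasets.

```python
import numpy as np

rng = np.random.default_rng(0)

def gaussian_mixture(means, n_per_cluster=100, scale=0.5):
    """Draw n_per_cluster 2D points around each mean (isotropic Gaussians)."""
    points, labels = [], []
    for k, mu in enumerate(means):
        points.append(rng.normal(loc=mu, scale=scale, size=(n_per_cluster, 2)))
        labels.extend([k] * n_per_cluster)
    return np.vstack(points), np.array(labels)

# Five-component mixture with some overlap between neighbouring clusters
# (illustrative parameters only).
means5 = [(0, 0), (2, 2), (2, -2), (-2, 2), (-2, -2)]
X, y = gaussian_mixture(means5)
print(X.shape, np.bincount(y))
```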

1.5.2. Spatial datasets with non-homogeneous inter-cluster distance

These datasets comprise a number of patterns in two dimensions which build a hierarchy of groups, or groups non-homogeneously distributed in the data space. Essentially, these datasets have been conceived for the demonstration of the HPoBC algorithm in Chapter 5.

To synthesize the data, a Java applet has been implemented. The application is available online.

1. Most probable sequence of words detected.

Chapter 2

State of the Art in Clustering and Semi-Supervised Techniques

2.1. Introduction

This chapter provides a survey of the most popular techniques for unsupervised machine learning (clustering) and their applications and introduces the main existing approaches in the field of semi-supervised classification.

2.2. Unsupervised machine learning (clustering)

Exploring large amounts of data to extract meaningful information about its group structure by considering similarities and differences between data entities has been a relevant subject under investigation since the early 1930s.

Each one of the data groups to be identified by the exploratory analysis is referred to as “cluster,” and the framework of related techniques is known under the name “cluster analysis.”

Formally, cluster analysis has been defined as:

“the organisation of a collection of patterns (usually represented as a vector of measurements, or a point in a multi-dimensional space) into clusters based on similarity.” [JAI 99]

2.3. A brief history of cluster analysis

The first studies about cluster analysis were conducted in the field of analytical psychology. There is some controversy in the literature about the attribution of the initial clustering schemes. However, according to D. E. Bailey [TRY 70], it seems to be Robert C. Tryon who originally conceived cluster analysis and its first application to psychological data. Tryon envisaged and formulated algorithms to group a set of intellectual abilities in humans, and provided a collection of scorings measured by the Holzinger ability tests [TRY 68]. Tryon referred to this particular practice of cluster analysis as variable analysis or V-analysis. The main purpose was to identify composites of abilities that could serve as more “general” and relevant descriptors than the whole set of scores for a more accurate analysis of human differences. This clustering method, called “key-cluster factor analysis,” was proposed as an alternative to the factor analysis generally accepted at the time.

According to Bailey, this early form of cluster analysis was devised by Tryon in 1930. However, it was only three decades later, in 1965, that the method was implemented as part of a software package (BCTRY), following the introduction of the first modern computer at the University of California.

In those days, the field of classical taxonomy was also starting to be the object of important critiques about its conventional principles and practices. For Sokal and Michener [SOK 63], the main concern was the subjectivity of the scientists involved in the elaboration of a taxonomy, and thus the individual biases associated with the systematic classification of organisms. This problem appeared to be aggravated by the advances in related sciences such as serology, chromatology, and spectroscopy, which led to the discovery of a vast amount of new variables (also called characters in biology) to describe the organisms. The resulting number of available characters turned out to make it infeasible for a taxonomist to estimate the affinities between taxonomic units (organisms, species, or higher rank taxa). In consequence, the scientists felt impelled to select a manageable group of characters from the complete set of available characters, which added a further subjective bias to the classical taxonomy procedures.

Having posed the aforementioned “illnesses” of classical taxonomy, Sneath and Sokal envisaged a series of numerical methods as the key to solve the weaknesses of classical taxonomy:

“It is the hope of numerical taxonomy to aim at judgements of affinity based on multiple and unweighted characters without the time and controversy which seems necessary at present for the maturation of taxonomic judgements.” [SOK 63]

The innovations introduced in the field of “numerical taxonomy” included the mathematical formulation of similarity metrics that enabled an objective comparison of specimens on the basis of their characters, as well as the application of hierarchical clustering to provide accurate representations of the new similarity matrices in the form of hierarchical trees or “dendrograms”.

According to [BLA 55], cluster analysis has witnessed an “exponential growth” since the initial contributions of Tryon and Sokal. Numerous clustering approaches arose in those years for different applications, especially motivated by the introduction of different mathematical equations to calculate the distances/similarities between data objects [SOK 63].

One example was the application of clustering in the field of psychology, for grouping individuals according to their profile of scores, measured by personality or aptitudes tests, among others. Tryon, who initially devised a clustering scheme for grouping variables, also proposed the use of clustering algorithms for extracting different groups of persons with similar aptitudes (or composites of them) in the Holzinger study of abilities. He referred to this type of cluster analysis as “object analysis” or O-analysis, in contrast to its application to variable analysis. Later, he performed other studies, applying both V- and O-analyses to other domains. For example, one application was to cluster neighborhoods according to census data, in which the extracted clusters were referred to as social areas (the social area study [TRY 55]). Another application was to cluster a number of persons, among them psychiatric patients with psychological disorders as well as other normal subjects, into different groups of personalities (the Minnesota Multiphasic Personality Inventory (MMPI) Study [GRA 05]). To compute affinities between individuals, Tryon applied the Euclidean distance function.

Other contributions to cluster analysis were the works by Cox [COX 61, COX 62] and Fisher [FIS 58], who described hierarchical grouping schemes based on a single variable, and Ward’s method, an extension to deal with the multivariate case. Ward proposed an objective metric based on the sum of squares error (SSE) to quantify the loss of information incurred by considering a group (represented by its mean score) rather than its individual member scores.

Although psychology and biology (taxonomy) were the major fields of application of the initial clustering schemes, the exponential evolution of cluster analysis is evident in the broad range of disciplines where clustering has been an essential component (vector quantization [BUZ 80, EQU 89, GER 91], image segmentation [CLA 95, POR 96, ARI 06], marketing [SHE 96], etc.). According to [JAI 04], the number of clustering techniques developed from the early studies to date is “overwhelming.”

2.4. Cluster algorithms

This section provides an overview of some popular clustering approaches. In particular, five different types of clustering algorithm are distinguished: hierarchical, partitional (especially competitive methods), model-based, density-based, and graph-based algorithms.

2.4.1. Hierarchical algorithms

In hierarchical clustering, a tree or hierarchy of clusters is incrementally built [ALD 64, HAS 09, WAR 63]. Assuming that the data objects are the leaves of the hierarchical tree, the hierarchy is built using one of two basic approaches: agglomerative (bottom-up) or divisive (top-down). These methods are described in the following sections.

2.4.1.1. Agglomerative clustering

Agglomerative methods build the hierarchy tree in a bottom-up manner, starting at the bottom hierarchy level (each object composes a singleton cluster) and successively merging clusters until a unique cluster is found at the top level.

The agglomerative algorithm is described in the following steps:

1. Start with each object in its individual cluster. Initially, the distances between the clusters correspond to the dissimilarities between data objects.

Table 2.1. Cluster distances applied by the different hierarchical agglomerative algorithms

Criterion          | Distance between clusters Ci and Cj
Single linkage     | minimum pairwise distance: min d(x, y), x ∈ Ci, y ∈ Cj
Complete linkage   | maximum pairwise distance: max d(x, y), x ∈ Ci, y ∈ Cj
Average linkage    | average of d(x, y) over all pairs x ∈ Ci, y ∈ Cj
Centroid linkage   | distance between the cluster centroids, d(x̄i, x̄j)

2. Update the distances between the clusters, according to one of the criteria in Table 2.1.

3. Merge the pair of clusters with the minimum distance, calculated in Step 2.

4. Stop if all the data elements are enclosed in a unique global cluster (the top hierarchy level is reached). Otherwise, record the pair of clusters previously merged in Step 3 in the tree structure and repeat from Step 2.

Table 2.1 describes four different types of agglomerative approaches, depending on the criterion used to calculate the distance between clusters before each cluster merging. The notation d(x, y) denotes the dissimilarity between two patterns, and x̄i indicates the centroid, or center, of cluster Ci.

The hierarchy tree is typically visualized in the so-called dendrogram plot, which depicts the pair of clusters that are merged at each iteration and the corresponding dissimilarity level. Figure 2.1 shows the dendrogram plots obtained on a dataset with 20 different mammal species. From the hierarchy tree, a cluster partition can be obtained by specifying a distance level or a desired number of clusters at which to cut the dendrogram.
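The agglomerative procedure and the dendrogram cut can be sketched with SciPy as follows; Euclidean dissimilarities and random stand-in data are assumed, since the animal dataset of Figure 2.1 is not reproduced here.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(1)
X = rng.normal(size=(20, 2))   # stand-in for the 20 data objects

# Build the hierarchy bottom-up; 'single', 'complete', 'average' and
# 'centroid' correspond to the merging criteria of Table 2.1.
Z = linkage(X, method="average", metric="euclidean")

# Cut the dendrogram either at a given dissimilarity level ...
part_by_distance = fcluster(Z, t=1.5, criterion="distance")
# ... or by requesting a desired number of clusters.
part_by_k = fcluster(Z, t=3, criterion="maxclust")
print(part_by_distance, part_by_k)
```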

2.4.1.1.1. Comparison of agglomerative criteria

In single linkage, the clusters with the closest members are merged. Such a merging criterion is local, as the global cluster structure is ignored. As can be observed in the dendrogram of Figure 2.1(a), single linkage usually results in large clusters and may lead to erroneous splits of one cluster if two or more clusters are close to each other, especially if the clusters show different pattern densities (in this case, the pairwise distances in the low-density cluster may be larger than the distance between the two clusters). On the other hand, the complete link criterion merges two clusters according to their most distant objects. This criterion overcomes the problem of large clusters obtained by single linkage. However, the complete link criterion is sensitive to outliers. If a cluster is merged with an outlier at a certain iteration, its inter-cluster distances will become considerably larger at the next iteration, owing to the presence of the outlier pattern. This may prevent the cluster from being merged with other members of the same natural cluster in further iterations. A good compromise seems to be the average linkage criterion, which merges two clusters according to the average pairwise distance between their respective members. Finally, in the centroid linkage criterion the distance between two clusters is calculated as the distance between their centroids. One important characteristic of the centroid linkage method is that the cluster distances do not increase monotonically over the iterations, in contrast to the single, complete, and average criteria. As an example, consider three globular, equally sized clusters, spatially arranged so that their centroids occupy the vertices of an approximately equilateral triangle. The distance between the first two clusters, say Ca and Cb, is only marginally smaller than the lengths of the other two edges of the triangle, so the clusters Ca and Cb are merged at iteration t. The centroid of the new cluster Cab is then placed at the midpoint of the edge joining the centroids of Ca and Cb, while the centroid of the third cluster Cc remains at the third vertex. Thus, the distance between Cab and Cc, merged at iteration t + 1, corresponds approximately to the height of the triangle, which is smaller than the distance between the clusters Ca and Cb merged in the previous iteration. This non-monotonic distance effect is clearly visible in the corresponding dendrogram plot (Figure 2.2), where mergings at a certain iteration can occur at a lower dissimilarity level than at a previous iteration. Such an effect is also known as dendrogram inversion.

Figure 2.1. Dendrogram plots obtained by the single, complete, and average link agglomerative algorithms on a dataset of 20 different animals: (a) single linkage, (b) complete linkage, and (c) average linkage

Figure 2.2. Dendrogram plot obtained by the centroid agglomerative algorithm on a dataset of 20 species of animals
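The dendrogram inversion described above can be verified numerically; the following sketch places three singleton clusters at the vertices of a near-equilateral triangle and applies centroid linkage (illustrative coordinates only).

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Three singleton "clusters" at the vertices of a near-equilateral triangle;
# the first two vertices are marginally closer, so they merge first.
X = np.array([[0.0, 0.0],
              [0.99, 0.0],
              [0.5, 0.866]])

Z = linkage(X, method="centroid")
# Column 2 holds the merge distances. With centroid linkage, the second
# merge (midpoint of the first edge to the third vertex, roughly the
# triangle height) occurs at a SMALLER distance than the first merge.
print(Z[:, 2])   # e.g. [0.99, ~0.87] -> dendrogram inversion
```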

2.4.1.2. Divisive algorithms

1. Start with all objects in a unique, global cluster Cg.

2. Select the next cluster C′ to be partitioned (at the first iteration, the global cluster Cg; classical divisive schemes typically select the cluster with the largest diameter).

3. For each object in C′, calculate its average distance to all other objects in the cluster.

4. Select the pattern with the highest average distance in C′. This pattern originates a new cluster C″ and is thus removed from C′.

5. For each of the remaining objects xi ∈ C′, calculate the average distance to the rest of the objects in C′ and to the objects in C″. Let these average distances be denoted by davg(xi, C′) and davg(xi, C″). If davg(xi, C′) > davg(xi, C″), the pattern has more affinity to the newly formed cluster. Select the pattern x* whose affinity to C″ is highest in comparison to C′:

x* = argmax over xi ∈ C′ of [davg(xi, C′) - davg(xi, C″)]   [2.1]

Remove x* from C′ and attach it to C″.

6. Repeat Step 5 until no object in C′ can be found such that davg(xi, C′) > davg(xi, C″). At this point, the partition of the selected cluster is finished. Record the resulting clusters C′ and C″ at the corresponding dendrogram level.

7. Stop if each data pattern is in its own cluster (the bottom hierarchy level is reached). Otherwise, repeat from Step 2 to evaluate the subsequent cluster to be partitioned.
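A compact sketch of the splitting procedure in Steps 3-6 is given below, assuming Euclidean distances. It splits a single given cluster into C′ and C″; selecting which cluster to split next (Step 2) and building the full hierarchy are omitted.

```python
import numpy as np
from scipy.spatial.distance import cdist

def split_cluster(X):
    """Split one cluster into C' and C'' following Steps 3-6 (sketch)."""
    D = cdist(X, X)                              # pairwise Euclidean distances
    # Steps 3-4: the object with the highest average distance seeds C''.
    avg = D.sum(axis=1) / (len(X) - 1)
    c_prime = list(range(len(X)))
    c_second = [c_prime.pop(int(np.argmax(avg)))]
    # Steps 5-6: repeatedly move the object with the largest positive
    # difference davg(x, C') - davg(x, C'') from C' to C''.
    while len(c_prime) > 1:
        d_prime = np.array([D[i, [j for j in c_prime if j != i]].mean()
                            for i in c_prime])
        d_second = np.array([D[i, c_second].mean() for i in c_prime])
        gain = d_prime - d_second
        best = int(np.argmax(gain))
        if gain[best] <= 0:                      # no object prefers C'' any more
            break
        c_second.append(c_prime.pop(best))
    return c_prime, c_second

# Toy example: two well-separated groups are recovered by a single split.
X = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11]], dtype=float)
print(split_cluster(X))   # e.g. ([0, 1, 2], [4, 3])
```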

2.4.2. Model-based clustering

In model-based clustering, it is assumed that the data are drawn from a probabilistic model, λ, with parameters θ. Thus, the estimation of the model parameters θ is a crucial step for identifying the underlying clusters. A typical example of model-based clustering is the expectation maximization (EM) algorithm.
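As a practical illustration of the idea, a Gaussian mixture fitted by EM can be obtained with scikit-learn; the following is a sketch on toy data, not the formulation developed in the next section.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(2)
# Two-dimensional sample drawn from two Gaussian components (toy data).
X = np.vstack([rng.normal((0, 0), 0.5, size=(100, 2)),
               rng.normal((3, 3), 0.5, size=(100, 2))])

# EM estimates the model parameters (weights, means, covariances); each
# point is then assigned to its most probable component, i.e. a cluster.
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
print(gmm.means_, gmm.predict(X)[:5])
```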

2.4.2.1. The expectation maximization (EM) algorithm