Thematic Cartography, Volume 2: Cartography and the Impact of the Quantitative Revolution

Colette Cauvin

Description

This series in three volumes considers maps as constructions resulting from a number of successive transformations and stages integrated in a logical reasoning and an order of choices. Volume 2 focuses on the impact of the quantitative revolution, partially related to the advent of the computer age, on thematic cartography.




Table of Contents

General Introduction

PART I. Transformations of Attributes [Z] and Use of Quantitative Methods: Generalization and Modeling

Part I. Introduction

Chapter 1. From the Description to the Generalization of an Attribute Variable

1.1. Preliminary data analysis: a crucial step

1.2. Discretization: a constraint with several choices

1.3. Two essential requirements: choosing and assessing the methods

1.4. Conclusion

Chapter 2. Generalization of Thematic Attributes

2.1. Graphical transformations of reduction and generalization

2.2. From mathematical structuring to standardized cartographic results

2.3. From mathematical classifications to the interpretation of the results

2.4. Conclusion

Chapter 3. Modeling Thematic Attributes: Generalizable Cartographic Choices

3.1. Thematic models based on the concept of regression

3.2. Models incorporating space via calculations.

3.3. Models incorporating space by construction and by calculations

3.4. Conclusion

Part I. Conclusion

PART II. New Cartographic Transformations and 3D Representations

Part II. Introduction

Chapter 4. Cartographic Transformations of Position

4.1. Cartographic transformations of position: aims and characteristics

4.2. Thematic CTPs of weight

4.3. Thematic CTPs of links and directions

4.4. Differential CTPs or CTPs of comparison

4.5. Conclusion

Chapter 5. Taking a Third Dimension into Account, Transformation of Display

5.1. From perception of relief to the diversity of “3D” products

5.2. Basic principles of representations with a third dimension

5.3. DTMs as examples of possibilities of DSMs

5.4. A new way: true 3D

5.5. Conclusion

Part II. Conclusion

General Conclusion

Bibliography

Software Used

Appendices

Appendix 1. Table of standardized normal distribution

Appendix 2. Critical values of Bravais-Pearson's correlation coefficient R

Appendix 3. Critical values of Student’s t

Appendix 4a. Critical values of Fisher-Snedecor’s F, significance level 0.05

Appendix 4b. Critical values of Fisher-Snedecor’s F, significance level 0.01

List of Authors

Index

To Waldo Tobler

who developed the concept of transformation and opened up for us so many new paths in cartography

To Jean-Claude Müller

who regularly made us realize the benefits of the latest technologies for cartography

To Henri Reymond

who helped us put these new paths to work, offering guidance and scientific support to our reasoning

The authors would like to thank the Laboratoire Image et Ville (UMR 7011, CNRS) and all the people who, in one way or another, have helped in the production of this book. We would like to mention in particular Jean-Philippe Antoni and Hélène Haniotou, who have made a tremendous contribution to the creation of the figures in all three volumes, as well as Jimena Martínez, who created the website (http://www.geogra.uah.es/carto-thematique-hermes/).

First published 2010 in Great Britain and the United States by ISTE Ltd and John Wiley & Sons, Inc. Adapted and updated from two volumes Cartographie thématique 3 et 4 published 2008 in France by Hermes Science/Lavoisier © LAVOISIER 2008

Apart from any fair dealing for the purposes of research or private study, or criticism or review, as permitted under the Copyright, Designs and Patents Act 1988, this publication may only be reproduced, stored or transmitted, in any form or by any means, with the prior permission in writing of the publishers, or in the case of reprographic reproduction in accordance with the terms and licenses issued by the CLA. Enquiries concerning reproduction outside these terms should be sent to the publishers at the undermentioned address:

ISTE Ltd
27-37 St George's Road
London SW19 4EU
UK
www.iste.co.uk

John Wiley & Sons, Inc.
111 River Street
Hoboken, NJ 07030
USA
www.wiley.com

© ISTE Ltd 2010

The rights of Colette Cauvin, Francisco Escobar and Aziz Serradj to be identified as the authors of this work have been asserted by them in accordance with the Copyright, Designs and Patents Act 1988.

Library of Congress Cataloging-in-Publication Data

Cauvin, C.
Thematic cartography and transformations / Colette Cauvin, Francisco Escobar, Aziz Serradj.
p. cm.
Includes bibliographical references and index.
ISBN 978-1-84821-109-4 -- ISBN 978-1-84821-110-0 -- ISBN 978-1-84821-111-7 -- ISBN 978-1-84821-112-4
1. Cartography. 2. Visualization. I. Escobar, Francisco. II. Serradj, Aziz. III. Title.
GA108.7.C38 2010
526--dc22

2009048722

British Library Cataloguing-in-Publication Data
A CIP record for this book is available from the British Library
ISBN 978-1-84821-109-4 (Set of 3 volumes)
ISBN 978-1-84821-111-7 (Volume 2)

General Introduction

In the first volume of this book on thematic cartography we established the essential elements of making a map and advocated the concept of transformation, introduced by W. Tobler [TOB 61]. A map is thought of as a result of a transformation process. The cartographic reasoning associated with this idea can be adapted for the production of any type of map, regardless of its aim or the study phase for which a given map is intended. In order to be able to make the necessary decisions, a cartographer needs to keep in mind all the different stages of the mapmaking process.

Nevertheless, only those transformations that are indispensable in making a map were described and explained previously. We now turn to the more inventive transformations, to which this second volume is devoted. The volume consists of two large parts and aims to show the contributions both of different ways of processing the attributes and of those representations which are difficult to create without the help of a computer.

The first part concerns the stage T2b, which was not discussed in the first volume of this book. Thus, in this part we address the processing of the attributes [Z] and the role of quantitative methods in cartography. Indeed, whilst for a long time cartographers have been superimposing and juxtaposing variables (often making the maps illegible), the inclusion of statistical tools in order to process data before representing them produced a fundamental revolution in the discipline. From now on, maps visualize the results of data processing, which summarize the available information or stress a particular feature of the studied phenomenon. Depending on the phase in the study for which the map is required, and also depending on whether the map is needed for a preliminary exploration or for a verification of a proposed assumption, the data processing can be very basic or very complex. It ranges from simple structuring of a single variable to combining k variables or creating a model, which may or may not incorporate the geographic space explicitly.

The second part puts forward the transformations connecting the coordinates [XYZ]. The principles of these transformations have been known for a long time, but their application was difficult, if not impossible, without a computer. This part examines the techniques which were long known but later renewed thanks to the computing revolution. These include cartographic transformations of position on the one hand, and 3D representations on the other. The former are more often encountered under the name of anamorphosis. These are original models, often revealing the underlying structure, which is not visible directly. The latter (3D representations) are characterized by the presence of a variable which is expressed vertically. These representations comprise several distinct categories with different meanings: 2.5D, 3D and virtual reality.

It is certain that this second volume will leave unanswered some questions about the future and the new opportunities in cartography. Therefore, the goal of the last volume will be to describe the contributions of new technologies. Although cartographers should always be open to these, it is important to judge them critically in order not to end up with aberrant maps and not to make ill-advised decisions.

PART I

Transformations of Attributes [Z] and Use of Quantitative Methods: Generalization and Modeling

Part I

Introduction

As was mentioned in the Introduction, this part concerns the transformation of attributes (T2b), a stage which is too often overlooked and rarely exploited to the full. It is indispensable, however, for achieving a legible and useful map. Once the attributes have been collected, they are transformed into information which can be more or less elaborate depending on the assumptions and the recipient’s requirements (see Volume 1, Chapters 2 and 5). Whatever the technique, it is only rarely that this information can be turned directly into a map. The aim of the map is to reveal the underlying structures relating to the attributes and their locations, to understand the hidden relationships between the data and to discover whether a global or local organization exists with spatio-thematic subsets. To achieve these aims it is therefore necessary to proceed with a more or less pronounced abstraction, revealing “the essential features” by transforming the variables either graphically or mathematically.

Various transformations are possible and they can be grouped into three large families corresponding to each of the phases of mapmaking and thus to a particular type of request by the user (Figure I.1). The first family is description. It contains all the procedures which aim at describing the data with the aid of statistics and graphical diagrams. This family unequivocally belongs to the domain of Exploratory Spatial Data Analysis (ESDA), pioneered by J. W. Tukey [TUK 77] as Exploratory Data Analysis (EDA) and later extended to ESDA by L. Anselin [ANS 88]. The techniques of this family help the map author to learn about the phenomenon and the characteristics of its variables, in order to avoid inadequate representations and processing later on. These techniques constitute an indispensable step in the mapmaking process, and in recent years they have been completely modernized by the introduction of ESDA.

Figure I.1. Attribute transformations: the choice criteria

The second family is thematic generalization. It is used to simplify the data and to reveal the spatio-thematic subsets using techniques which allow the mapping of groups of individuals instead of each and every one. This reduces the number of variables compared to the initial one. In fact, some very specific cases aside, representing all the values and all modalities present in the variables always seems difficult, even though it is now technically possible. It creates a risk of producing a completely illegible and incomprehensible document. Even recently, this point was the subject of many discussions [CRO 95], but some authors, such as W. Tobler [TOB 73c], proposed more exact, unclassed maps, which are legible despite containing all the information [PET 79b]. It is certain that modern-day techniques facilitate the representation of all the values. Interactivity makes it possible to introduce the values into a map progressively, in an increasing or decreasing order. At the exploratory level, such a technique turns out to be very interesting, as testified by the works of O. Klein [KLE 07] on flow representations. Nevertheless, when it comes to presenting the results to a less expert public, classes are indispensable and this point will be considered in detail in Chapter 1.

Regardless of the decision concerning classes, the generalization stage belongs to a time when the assumptions are implicit and we are looking to isolate the spatial characteristics of the thematic phenomenon by reducing the information. Hence, the processing amounts to a thematic generalization whose principal goal is to reduce the number of objects (or groups of objects) to represent, and then to create classes. Processing can be graphical or mathematical. This choice depends primarily on the number of variables and on their measurement level. In each case we are trying to create a graphical sign or a “composite” variable obtained after processing which “sums up” the initial information. In light of the fact that all the processing methods lead to the construction of a single “synthetic” attribute, the techniques for creating classes – called discretization – will be considered in Chapter 1. They enable the cartographer to complete the description of the variables within ESDA. The second chapter is devoted to working with multiple data and places greater emphasis on quantitative methods rather than on the graphical solutions abundantly described in a number of works.

The third family is modeling. It can be employed only if the map to be made is at an advanced stage in the study of a phenomenon, when the assumptions are explicit and can be subjected to verification. The results of various modeling techniques or those concerning their steps are the subject of cartographic representation. Some models take only the attributes into account. The resulting values are subsequently transferred onto the cartographic support. Others involve both the attributes and the spatial components, producing a map directly. The aims of the map, its recipients and the phase of study also play a fundamental role in the choice of a model. This will be explored in Chapter 3.

Figure I.2. Attribute transformations: the choice criteria

There are several criteria for choosing a particular attribute transformation technique from the wide range of possibilities (Figure I.2). The first concerns the map author and the stage in the study for which the map is required. It implies that the processing is performed within an exploratory, inductive or hypothetico-deductive approach. The second has to do with the requirements for the map, hence with the recipients. Depending on their knowledge and their goals, the recipients may need a map for research, exploration, reflection or presentation. The latter may call for further simplification and thus successive transformations of the attributes. Finally, the third criterion relates to the variables characterizing the represented phenomenon. Their number, level of measurement and formalization will play different roles depending on the phase of the study. It is quite obvious that the general framework of map production is also important. When making choices, we should never forget the place which the map occupies in the scheme developed in Chapter 3 of Volume 1.

Attribute transformations are obviously based on statistics and spatial modeling, using the associated software which has developed alongside the progress of the computer-science revolution. It is in relation to this revolution that the integration of quantitative methods has taken place in cartography, as well as in other sciences (geology, botany, geography, etc.). It has brought about a revitalization of the attribute processing phase within the cartographic process.

Chapter 1

From the Description to the Generalization of an Attribute Variable Z

A thematic variable [Z] can be represented on a spatial support [XY] in various ways and with various visual results, depending on the processing to which the variable is exposed, especially if it is a quantitative variable. The cartographer may decide to generalize the map excessively and, for example, only distinguish the values of Z above and below the mean. Conversely, the cartographer may remain closer to the initial data and preserve a very large part of their values. The resulting maps will necessarily be different and the commentaries will be very dissimilar to each other. How best to make a decision? Why prefer one solution to another?

It is questions like these that the present chapter attempts to answer, suggesting an approach and criteria on which cartographers may rely in making their choices. As we wrote in Chapter 5 (Volume 1), thematic data can have several forms and can be obtained from various sources or generated in diverse ways [ROB 98]. The basic information, for example, may come in numerically coded form if it comes from official sources (for instance, INSEE), from questionnaire surveys, or from a field survey. Therefore, in order to make justifiable and consistent choices it is crucial to:

– Know the characteristics of the variables which are about to be represented. At this stage, exploratory analysis retraces the statistical approaches and constitutes a significant help.

– Choose how to process the variable [Z], that is, choose the discretization mode.

– Validate the choices with the aid of tests, plots or indices.

These three facets of transforming the variable [Z] are meant to generalize the data and in this way obtain a representation which highlights the important features of the studied phenomenon according to the issue considered.

1.1. Preliminary data analysis: a crucial step

An analysis preceding the representation is essential since it helps the cartographer in making decisions at different levels. It reveals the characteristics of the variables with the help of statistics, for as A. von Humboldt wrote in 1811 [FUN 37], “Statistical projections which speak directly to the senses without fatiguing the mind, possess the advantage of fixing the attention on a great number of important facts.” Thus, plots present a certain interest, renewed and strengthened with the development of exploratory data analysis (EDA).

1.1.1. From classical description to exploratory data analysis (EDA)

The statistical description of data supplies the cartographer with a certain number of indices whose meaning it is important to know. It enables comparisons to be made between the variables, and their range and dispersion to be known. But it does not supply a global and immediate vision of their distribution. Exploratory data analysis (EDA) is one of the possible ways of deciphering what the data say using mathematics and, most importantly, simple graphical diagrams.

1.1.1.1. History and relevance

The idea of data exploration as a useful and irreplaceable statistical technique is attributed to J. W. Tukey [TUK 77]. Data exploration was subsequently elaborated and improved with the developments in mathematical visualization and dynamical graphics to the point of considerably changing the analysis of quantitative [MAC 92] and even qualitative data. The analysis of localized data is a conceptualization in which the central element of the broad investigation process is the connection between the data and the technique, involving the supporting theory and interpretation of the results [ROB 98]. In all cases, as we will see below, the links connecting the theory, data construction, analysis and interpretation are neither simple nor static.

Researchers’ and cartographers’ analysis of gathered data has always been associated with the application of analytical techniques such as plots, diagrams, and the like. The use of EDA revitalized this approach and presented a double benefit. On the one hand, it facilitates learning of the data analysis and statistics; on the other hand it raises important questions about the formal representation of data and thus leads to the emergence of new working hypotheses.

1.1.1.2. Approach and definition

Research work starts with a theory and ends with the presentation of results. Although at a glance its steps are clearly distinct or sequential, in practice it is rarely so. In its broadest sense, the analysis may include:

– initial evaluation of the data;

– creation or construction of the data;

– application of specific analytical techniques to probe the data.

Figure 1.1. Linear statistical reasoning and EDA iterative process

While the classic statistical reasoning is based on linear thinking (meaning that the tasks are performed in a sequential manner), EDA adopts an iterative process at several analytical levels. This is shown in Figures 1.1a and 1.1b, based on the proposals of M. Theus [THE 05]. This author stresses that EDA does not begin with a predetermined set of tools and that graphical exploration may be of interest for some data but not for all.

Naturally, geographers already used simple graphical and statistical methods before the quantitative revolution. But it was only recently that they began to use these methods widely. The methods are grouped under the name exploratory data analysis [BAR 79], or EDA. Instead of starting from the usual assumptions, this approach to data analysis confines itself to finding a model which explains the behavior of the data. It is a more direct approach which helps to reveal the underlying structure of the data and their model. EDA is not a simple collection of techniques but a philosophy on the subject of dissecting a data series by asking the following questions: what are we looking for? What do we see? How can we interpret it? EDA achieves its goal using a broad range of techniques called “statistical plots” which can be interpreted to open new research directions and not just to illustrate the numerical statistical results as in the traditional approach.

In the original version of EDA the geographic component was not directly integrated into the exploratory analysis. With the advent of geographic information systems, EDA evolved towards exploratory spatial data analysis (ESDA). As M. Theus mentions [THE 05], ESDA is a natural extension of EDA in the way it treats thinking and methods concerning geographic problems. ESDA experienced rapid development after EDA was implemented on computers at the end of the 1980s. Among others, the works of M. S. Monmonier [MON 88, MON 89] show the usefulness of combining scatterplots with interactivity. In fact, selecting a set of points by scatterplot brushing has the advantage of automatically updating different plots and thus highlighting the correlations that exist between pairs of variables. Thus, in one example given by M. S. Monmonier [MON 89], it becomes clear that some common a priori assumptions regarding the cause-and-effect relationships between thematic variables have to be revised. It seemed natural and logical to assume that in each of the fifty United States the proportion of households with a cable TV connection depends on the household income and on the proportion of urban population within the state. The use of scatterplot brushing shows that this is not so. It turns out that the lowest cable-connection rate is in the most urbanized states with a high per capita income, such as Maryland and New York. The author explains that the delay in the penetration of cable television is certainly related to local restrictions and the huge capital requirements. In this category we also find the Midwest states where dispersed farms are not easily served by cable systems. The map obtained from this analysis suggests that the cartographer or researcher needs to look into additional factors. This example clearly shows that interactive statistical plots bring questions to the surface rather than giving definite answers.

1.1.1.3. EDA: predominance of graphical techniques

The majority of EDA techniques are graphical, accompanied by some quantitative (statistical) methods. The reason for the importance of plots is related to the fact that the main task of EDA and ESDA is the impartial exploration of plots which give the analyst an incomparable power of discovery. Not only do the data reveal their structural secrets, but also the inherent data elements – often previously unsuspected – become visible. Combined with our natural capacity for form recognition, plots offer us an unparalleled power for extracting information.

Graphical techniques used in EDA and ESDA are often rather elementary. Variables can be represented by simple, cumulative or two-dimensional histograms. The usual statistical parameters are added to this: means and medians, standard deviations, coefficients of variation, etc. Only the “boxplot” (also known as a “box-and-whisker plot”) is really specific to EDA. When these various representations are united in the same document they allow the optimal use of our shape recognition to extract information concerning the data characteristics.

1.1.2. Exploratory data analysis and graphical representations

The purpose of EDA in analyzing the data to be represented is to synthesize, structure and summarize the information contained in the data. In this way it highlights the properties of the thematic attributes and suggests hypotheses.

A long time ago, Horace (65 – 8 BC) said that what we hear excites us less than what we see. Humankind knows just how true this is. The graphical representation of ideas is one of the most ancient and universal features of human activity. The oldest known language consists of ideographic drawings left to us by cave dwellers. The writings of the Egyptian, Babylonian and Maya civilizations are to a large extent pictorial symbols and hieroglyphs. The American Indians used pictorial methods to communicate their thoughts and ideas. It is true that the pictographic techniques of cavemen and the hieroglyphic writing of the Egyptians disappeared because of their inferiority, or at least ceased to be the sole means of communicating ideas. Nevertheless, drawings persisted and graphical representations can be found in any age, in one form or another.

In the last 30 years, the use of visualization materials to present ideas has increased greatly. Nowhere is this trend better illustrated than in statistics, where experts have developed a very widespread use of plots. They are so important that we could say that the graphical method is quickly becoming the universal language [FUN 37].

Despite the robustness and variety of statistical tests and computational techniques, the power of graphical representation as an analysis tool for supporting a hypothesis or explaining a phenomenon cannot be denied. A deep analysis based on a graphical representation is often considered only as preliminary. But it enables us to reveal important information and trends with a higher accuracy than a simple numerical-data table.

As we mentioned earlier, once the numerical data are collected the first step is to look for their characteristics. For a cartographer this preliminary investigation may mean not only examining the spatial distribution of the data but also representing the data graphically. Graphical processing had been dismissed as a “device for showing the obvious to the ignorant” until the 1970s, when J. W. Tukey revived cartographers’ interest in graphical representations by developing new forms of data presentation, aided by computer technology which facilitated this kind of representation [ROB 98].

Thus, it becomes clear that visualization is a permanent component of EDA. It is of help in the process of detecting the properties of data, which remains the fundamental goal of EDA. Thanks to certain software, a large number of data graphics can be produced. They are used as simple finished products intended in theory to communicate the identified characteristics of processed data to novices. They are used quite frequently, and some of these graphical representations will be described in the sections below in order to illustrate their importance in data visualization and to show those data characteristics which it is possible to extract.

We will see how useful different graphical representations are for data which are originally in the form of statistical tables. Obviously, reading a table of numbers does not give a clear and quick visualization. Nor does it let us understand the relative positions of the presented values or see the data set as a whole. Cartographers are therefore led to use particular representation types where they can translate the “statistical magnitudes” into “geometrical figures” which are much more evocative to the reader [TAV 83].

1.1.2.1. Non-mathematical graphical representations: linear diagrams

When the number of observations is not very large, they are easy to represent by a point or a mark on a normed linear axis. This representation gives a practical means of examining the dispersion of two or more sets of data on either side of the same axis, and also of comparing them. Figure 1.2 shows this using the data from Luxembourg.

1.1.2.2. Common graphical representations

Conventional graphical representations help to visualize the form of a statistical distribution of a quantitative variable. They are applied to the quantitative data prepared in the form of a table with modalities grouped into classes.

Figure 1.2. A line plot

1.1.2.2.1. Bar plots

The study of the spread of values in a statistical series often starts with a plot of the distribution of the data, as shown in Figure 1.3. The plot normally takes on one of the following forms: a bar plot, a histogram or a plot (smooth or not) of the frequencies or the cumulative frequency. If the data consist entirely of integer – and therefore discrete – values, the bar plot illustrates the frequency of a given value. Thus, the frequency bar plot is a graphical representation of the distribution of the values of a variable. The horizontal axis corresponds to different discrete values (or modalities of the variable) and the vertical axis shows the total number of occurrences (or the frequency) of each modality. For each value Z_i of the variable there is a vertical segment (a bar) whose length is proportional to the number of occurrences n_i, or to the frequency f_i, of the value Z_i.
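As a minimal illustration of this construction (the discrete series below is invented, not from the book), the following Python sketch counts the occurrences n_i and frequencies f_i of each value Z_i:

from collections import Counter

# Invented discrete data (e.g. number of TV sets per household):
z = [0, 1, 1, 2, 2, 2, 3, 1, 0, 2, 4, 2, 1]

n = len(z)
counts = Counter(z)                     # occurrences n_i of each value Z_i

for value in sorted(counts):
    n_i = counts[value]
    f_i = n_i / n                       # relative frequency f_i
    print(f"Z = {value}: n_i = {n_i}, f_i = {f_i:.2f}, bar = {'#' * n_i}")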

Figure 1.3. A bar plot

1.1.2.2.2. Histograms

Histograms are employed to represent continuous data such as altitude, age, agricultural productivity, precipitation, etc. These data can take any value within a given interval on a continuous scale. In order to construct a histogram we need to break the data values down into classes. The difference between the upper and lower limits of a class is called the class interval, or class size. It can be varied to show the specifics of the distribution. The number of classes should be chosen with the desired level of detail kept in mind. The higher the number of classes, the more details of the distribution will be visible. Conversely, if the number of classes is small the distribution will suffer a strong generalization and its fine details will not be observable.

A histogram is a set of rectangles arranged side-by-side. The base of each rectangle corresponds to the class interval. Typically classes in the same distribution have equal intervals. The case of different class sizes will be considered further on. The rectangle constructed for each class has an area proportional to the total number of values in this class. The total area of the histogram is therefore equal to the sum of all the rectangles. This area corresponds to the total number of values in the studied data set. For example, Figure 1.4a shows the distribution of unemployment rates by municipality in Luxembourg. By dividing it into ten classes we obtain the histogram shown in Figure 1.4b.

Figure 1.4. Dividing the data into classes of equal size and the corresponding histogram

Proportionality is the essential feature of histograms. It makes their construction a little more complex if the class intervals vary. In fact, it often happens that the distribution of a studied phenomenon makes no sense unless it is divided into “useful” classes, in other words classes which are meaningful to the user. There are many examples of this: road slope, size of agricultural fields, household income, city size or rent in Luxembourg. We will use the last example for an illustration. Given that the areas of the rectangles are proportional to the frequencies, the histogram in Figure 1.4 respects this proportionality. In fact, if we consult the table in Figure 1.5a we will see that the frequencies in classes 1 and 4 are almost identical and that the corresponding areas in the “correct” histogram in Figure 1.5c are also very similar, while those in the “incorrect” histogram in Figure 1.5b are not. Nevertheless, it is worth mentioning that the frequency axis in the “correct” histogram requires special attention now that the correction has been performed. This axis should be removed and replaced by the actual class frequencies written underneath each rectangle, as shown in Figure 1.5d.
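The proportionality rule described above can be checked numerically. In the sketch below (with invented class limits and counts, not the Luxembourg rent data), the corrected bar height of each rectangle is the frequency density n_i / width, so that area = height × width stays proportional to the class frequency:

# Unequal class intervals: bar height must be the frequency density
# n_i / width_i, so that area = height * width remains proportional to n_i.
limits = [0, 250, 500, 1000, 2000]      # invented class limits (unequal widths)
counts = [18, 25, 40, 17]               # invented frequencies per class

for lo, hi, n_i in zip(limits, limits[1:], counts):
    width = hi - lo
    height = n_i / width                # corrected height for the "correct" histogram
    print(f"[{lo:>4}, {hi:>4}): n_i = {n_i:2d}, height = {height:.3f}, area = {height * width:.0f}")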

1.1.2.2.3. Cumulative frequency curves

A cumulative frequency curve represents the distribution function F(z) of a continuous variable [Z]. This function is defined for each value of [Z]. It is constant within each interval separating two consecutive possible values. Thus, it allows us to determine the frequency density in a given interval [CAL 73]. The convenience of cumulative frequency curves is in knowing “how many individuals have a character value below or above a certain threshold”. If we plot the relative cumulative frequencies, they will range between 0 and 1. Hence, they have the meaning of probabilities.

Figure 1.5. Construction of a histogram

Figure 1.6. A cumulative frequency curve

In fact, if a relative cumulative frequency equals 0.5, an individual drawn from the data set is equally likely to have a value below or above the value which corresponds to 0.5. As shown in Figure 1.6, the value for which the cumulative frequency equals 50% is 2.8. It corresponds to the center of the sixth class. Therefore, there is a 50% probability of finding a municipality in Luxembourg where the unemployment rate is less than or equal to 2.8%.
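As a rough sketch of this reading (with invented rates, not the Luxembourg series), the relative cumulative frequencies can be computed from the sorted values and the 50% point located by linear interpolation:

import numpy as np

# Invented unemployment rates standing in for the mapped variable:
rates = np.array([1.2, 1.8, 2.1, 2.4, 2.6, 2.8, 3.0, 3.3, 3.9, 4.5])

z = np.sort(rates)
F = np.arange(1, len(z) + 1) / len(z)   # relative cumulative frequency, 0..1

# Value below which 50% of observations fall:
z_50 = np.interp(0.5, F, z)
print(f"F(z) = 0.5 at z = {z_50:.2f}")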

1.1.2.3. “Boxplot”: a specific representation of EDA

The idea of representing the central tendency and the dispersion of a statistical data set by “box and whiskers” was stated by J. W. Tukey in 1977. The statistical parameters for graphical representation are easy to obtain from the data set once the values are ranked (in increasing or decreasing order). We can extract the extreme values (minimum and maximum), the median and the first and third quartiles. These parameters can then be represented in a simplified “box and whiskers” form, as shown in Figure 1.7b1. The median together with the first and third quartiles form the “box”, while the extreme values (minimum and maximum) form the “whiskers”. A more elaborate form of this representation requires the calculation of the hinge values, which become the “whiskers” (Figure 1.7b2). The hinge values are found by subtracting one and a half times the interquartile range (Q3-Q1) from the first quartile (for the lower hinge value) or by adding the same quantity to the third quartile (for the upper hinge value). The observations located outside the hinge values are considered as exceptional values. We can note that for large data sets, deciles or even centiles are used instead of quartiles, depending on the number of observations.

Representations of this type allow us to “see at a glance” the features of the distribution (symmetry, asymmetry or dispersion, among others) thanks to the positioning of the characteristic statistical values. The summary of five numbers in Figure 1.7a shows this. What is more, these representations help to compare different distributions in time and/or space (Figure 1.7c).

G. M. Robinson notes [ROB 98] that this form of representation allows a better comparison of the characteristics of various statistical data sets. D. Sibley [SIB 90], cited in [ROB 98], improves the comparison even further by modifying the technique of the “boxplot” representation. The modification consists of centering the data with respect to their median value. This operation shifts the values and their distribution around the median, so that the median value is equal to zero. The operation of centering the values also eliminates size effects. If the centered values are divided by their interquartile range we obtain standardized values which are very convenient for comparing several data sets with each other (Figure 1.7d).
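A minimal Python sketch of these computations – the five-number summary, the hinge values, and the Sibley-style median-centered, IQR-standardized values for comparing data sets – using invented data:

import numpy as np

def five_number_summary(values):
    """Minimum, Q1, median, Q3, maximum: the 'box' and raw 'whiskers'."""
    v = np.asarray(values, dtype=float)
    q1, med, q3 = np.percentile(v, [25, 50, 75])
    return v.min(), q1, med, q3, v.max()

def hinges(values):
    """Hinge values Q1 - 1.5*IQR and Q3 + 1.5*IQR; points outside are exceptional."""
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

def standardize(values):
    """Sibley-style comparison: center on the median, divide by the IQR."""
    v = np.asarray(values, dtype=float)
    _, q1, med, q3, _ = five_number_summary(v)
    return (v - med) / (q3 - q1)

# Invented data for illustration:
data = [12, 15, 16, 18, 19, 21, 22, 25, 60]
print(five_number_summary(data))
print(hinges(data))          # 60 falls outside the upper hinge
print(standardize(data))     # median maps to 0, unit IQR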

The various graphical representations described above explore the data characteristics, such as the shape of their distribution, which is essential for the purposes of cartography. However, we should not forget that these plots are suited to variables with a quantitative level of measurement – a particular point which leads to serious constraints in their representation.

Figure 1.7. Construction of the “boxplot”. Examples with the Luxembourg population data

1.1.3. Quantitative level of measurement and graphical representation

In the case of a quantitative variable with known characteristics the following question becomes fundamental: since the variable in question is quantitative with a large number of values, should we keep all its values or is it more sensible to group them into classes, in other words, perform a discretization? There is still no definitive answer to this question, and the cartographers who have looked into this problem hold widely divergent opinions [CAU 87a]. Two main criteria are used: one depending on the user and the other having to do with the variable’s continuity property:

– at user level, the dominant opinion suggests the necessity of discretization in order to facilitate reading of the map. Indeed, too many thematic values and hence too many marks divert the reader's attention. This point will be pursued further in section 1.2.2;

– the continuity property of a variable is of crucial importance in preliminary processing, as we emphasized in Chapter 5 (Volume 1). It requires knowledge of whether all the possible values have a meaning whatever the successive divisions used. Consider the example of a continuous variable: a study of the average height of people in a certain geographical area. The physical interpretation of the height 1.72 m is not different from 1.723 m or 1.7231 m. On the contrary, if we deal with a discrete variable corresponding to the count of individuals or objects, fractional numbers are meaningless: 101.45 doctors or 1.5 TV sets have no interpretation [CAU 87a]. In the last two cases discretization is still possible but the class limits need to be chosen differently, according to their meaning.

Figure 1.8. Types of distribution of one variable

Once the decision is made to proceed with discretization, the question is how best to choose the classes? Indeed, as G. F. Jenks and M. R. C. Coulson mention [JEN 63b], if the intervals are well chosen the readers will have a clearer understanding of the spatial relationships, something they would not have from the initial data. In the opposite case, however, the result will be an inexact or distorted distribution of the data.

To get a grip on the problem of forming classes it is convenient to start by studying the variable’s distribution while plotting it: is it symmetric, asymmetric, uniform or multi-modal (Figure 1.8)?

Based on this knowledge of the mapping variable it becomes possible to make discretization-related decisions. Discretization is almost always applied in cartography. It offers a very large variety of techniques which need to be known in detail.

1.2. Discretization: a constraint with several choices

Discretization is an operation which consists of sorting data into classes and thus converting a given thematic variable into mapping classes. The number of classes and the variability of their intervals have a strong influence on the resulting map and can adversely affect its precision, utility, legibility and attractiveness [MAC 55]. Hence, the division of data into classes is an important and decisive step in making a map, since the data will not be reported on the map in their original form and the reader will only see the values of the class limits chosen by the cartographer. The choice of the class limits, and thus the degree of generalization, has a strong influence on the message transmitted to the reader. Thus, a good discretization requires that a number of preconditions be satisfied:

– data pre-processing;

– choice of the optimal number of classes;

– selection of the dividing method according to the distribution type of the variable.

Only then can we – using simple or complex tools – evaluate different dividing methods and decide which one is “theoretically” the most suitable.

1.2.1. From data to the basic rules

On the one hand, performing a correct discretization requires certain preliminary actions to be performed which are intended as data preparation for a better understanding of the phenomenon Z. On the other hand, discretization requires knowledge of the rules in order to avoid aberrations during the construction of the classes.

1.2.1.1. Data preparation

This step essentially consists of converting the data into a standard form and making a number of graphical representations such as were described in the part on exploratory data analysis. This is done with the following operations:

– order the values (passing from a statistical data set to a distribution);

– determine the number of observations;

– determine the extreme values (minimum and maximum);

– calculate the range of the variable;

– calculate the conventional statistical indices:

– the central tendency parameters: mean, median and mode,

– the dispersion parameters: variance and standard deviation,

– additional parameters: coefficient of variation, indices of asymmetry (skewness) and “peakiness” (kurtosis);

– complete the preparation by constructing a series of plots, with or without a mathematical meaning: the distribution diagram and the histogram of simple or cumulative frequencies. The last two allow the distribution function of the data to be determined [JEN 63b];

– draw the clinographic curve which involves the areas covered by the thematic variable in question, should such areas exist [MAC 55].
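A compact Python sketch of these preparatory calculations (the variable z is invented; scipy is assumed to be available for skewness and kurtosis, and the mode, meaningful mainly for grouped data, is omitted):

import numpy as np
from scipy import stats

# Invented attribute values standing in for the variable [Z]:
z = np.array([3.1, 2.4, 5.6, 1.8, 2.9, 4.2, 3.7, 2.2, 6.1, 3.3])

z_sorted = np.sort(z)                      # order the values
n = z.size                                 # number of observations
z_min, z_max = z_sorted[0], z_sorted[-1]   # extreme values
e = z_max - z_min                          # range of the variable

mean, median = z.mean(), np.median(z)      # central tendency
var, std = z.var(ddof=1), z.std(ddof=1)    # dispersion
cv = std / mean                            # coefficient of variation
skew = stats.skew(z)                       # asymmetry
kurt = stats.kurtosis(z)                   # "peakiness"

print(f"N={n}, min={z_min}, max={z_max}, range={e:.1f}")
print(f"mean={mean:.2f}, median={median:.2f}, std={std:.2f}, CV={cv:.2f}")
print(f"skewness={skew:.2f}, kurtosis={kurt:.2f}")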

1.2.1.2. Fundamental elements and basic rules

When creating class intervals the cartographer should observe four fundamental rules:

– the classes must encompass the full range of the data set [JEN 63b];

– a value must be a member of one and only one class (exclusiveness);

– there must not be any vacant classes (exhaustiveness) [MON 75];

– the precision of the values of the class limits should follow on from the precision of the original variable [GRI 77, MON 82a].

1.2.2. Choice of the number of classes

The number of classes determines the amount of detail which can be read off a map and hence the degree of generalization performed [ROB 78b]. Its choice depends on three elements which are more suggestive than decisive constraints. They have to do with the logic of the reasoning, the technical possibilities and the capacities of visual perception [CAU 86].

1.2.2.1. Logical constraint

It is universally accepted that the higher the number of classes, the “better” the generalization, in the sense that it becomes closer to the original data. Thus, if the number of classes is small, we will end up with a map which is “poor in information” because a coarse partition creates a loss of detail. This is unacceptable if we hold the view that the information which a map communicates may be generalized but not distorted. Conversely, too fine a partition produces a reconstruction of the initial distribution, with the risk of having few elements in some classes or even vacant classes [BRO 53, CAU 87a, GRI 77, HUN 61].

C. E. P. Brooks and N. Carruthers [BRO 53] suggest that the number of classes should be less than or equal to five times the decimal logarithm of the number of observations N:

k ≤ 5 log10(N)

where k is the number of classes, and N is the number of observations.

The authors of this method consider it as a reasonable ratio between the precision of the map and the mass of the available data [HAG 73b].

D. V. Huntsberger [HUN 61] also proposes a formula for calculating the optimal number of classes. This formula is fairly similar to the preceding one and gives very close results. The number of classes should equal 3.3 times the decimal logarithm of the number of observations N, plus 1:

k = 1 + 3.3 log10(N)

where k is the number of classes, and N is the number of observations.

D. V. Huntsberger’s formula always gives a smaller answer than that obtained from the formula of C. E. P. Brooks and N. Carruthers. Even so, this number appears to be too high for a legible map and should be viewed as a maximum number, a recommended but non-definitive upper limit [CAU 87a].

Applying these two methods to the population densities of 118 municipalities (N = 118) in Luxembourg in 2003, we obtain the following results:

– for C. E. P. Brooks and N. Carruthers:

k ≤ 5 log10(118), from which k ≤ 10.35;

– for D. V. Huntsberger:

k = 1 + 3.3 log10(118), from which k ≈ 7.84.

Since the number of classes must be an integer, k needs to be rounded off to the nearest whole value. In our case, the number of classes according to the first method is less than or equal to ten, and according to the second method it is equal to eight.
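Both rules are one-line computations; the following sketch reproduces the Luxembourg figures (N = 118) given in the text:

import math

def brooks_carruthers(n):
    """Upper bound on the number of classes: k <= 5 * log10(N)."""
    return 5 * math.log10(n)

def huntsberger(n):
    """Optimal number of classes: k = 1 + 3.3 * log10(N)."""
    return 1 + 3.3 * math.log10(n)

n = 118  # municipalities of Luxembourg, as in the text
print(f"Brooks-Carruthers: k <= {brooks_carruthers(n):.2f} -> at most 10 classes")
print(f"Huntsberger:       k  = {huntsberger(n):.2f} -> 8 classes after rounding")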

1.2.2.2. Technical constraint

Prior to the creation of laser printers at the end of the 1970s it was technically impossible to create a progression with more than a certain number of distinct shades of gray. Above that limit the differences between the shades could no longer be discerned, whether for a progression of dots, lines or shaded areas. Presently, however, thanks to computers and high-definition laser printers it is possible to establish proportionality between levels of gray and the values of a variable. The technical constraint is therefore disappearing.

W. Tobler [TOB 73c] suggests a correspondence between the density of lines (the ratio of the amounts of black and white) and the thematic values by means of a non-linear mathematical relationship which takes into account the human eye's response to what it views (see Chapter 7, Volume 1). Tests on this have been performed by A. Kimmerling [KIM 75], J. C. Muller and J. Honsacker [MUL 78c], J. C. Muller [MUL 79] and M. Peterson [PET 79b]. They produced very interesting results.

It is obvious that if we work manually and want the differences between the degrees of gray to remain perceptible, this limits us to six or seven degrees as a maximum. A computer and a good laser printer make it possible to obtain, in theory, a far larger number of distinct levels of gray.

1.2.2.3. Visual constraint

Thus, it would seem that the technical constraint no longer presents a real problem since the introduction of computers and high-definition laser printers. Nevertheless, we can ask ourselves: “What good is having the technology to create a multitude of different shades of color if the reader cannot distinguish them?” According to the works of some psychologists (whose tests focusing on this particular question were not especially abundant) and of cartographers such as J. Olson [OLS 75a] and A. Gilchrist [GUI 77], we come to the conclusion that it is difficult to distinguish more than six or seven shades, and that the optimal number is five, as indicated by A. H. Robinson et al. [ROB 78b]. There are nuances to this statement, however. The eye can clearly distinguish between five levels of color, but the visual associations can vary. Thus, if a relatively light area is surrounded by darker areas, the lighter area will very easily be attributed to the darker class. On the contrary, if this area neighbors even lighter areas, it will be assimilated with the lighter classes. Therefore, the same graphical value can be perceived differently depending on the existing and observable proximities.

To summarize, it seems reasonable to state that the choice of the number of classes closely depends on the aim of the map. If the aim is simply to show the progression of a phenomenon, such as altitude, the number of classes may be relatively large, the aim being not to discern each class but to present a global, continuous vision of the phenomenon.

On the contrary, if the aim of the map is to demonstrate the specific classes which have a meaning to the author who wishes to communicate a message, then it is crucial that the value classes and the graphical signs be in unambiguous correspondence. In this case the reader should have the opportunity to differentiate between all levels. Six or seven levels would then seem to be the maximum which must not be exceeded. Finally, if the map can be a working map for a detailed analysis of the considered phenomenon, intended for use by the author themselves, then the number of classes can go up to ten or even higher [CAU 87a].

In conclusion, we can say that no strict and definite rules exist and we only have recommendations. No-one can really determine in an absolute manner what number of classes to use, all the more so as the choice of the class limits has an influence.

1.2.3. Class limits and ranges

The abundance of discretization methods can be sorted in a number of different ways, such as those suggested by J. P. Grimmeau [GRI 77] and I. S. Evans [EVA 77]. Based on their work we can identify and present six major families of methods:

– intuitive discretizations;

– exogenous discretizations;

– mathematical discretizations;

– statistical and probabilistic discretizations;

– graphical discretizations;

– experimental discretizations.

Some of these methods will be applied to the population densities of 118 municipalities in Luxembourg in 2003.

1.2.3.1. Intuitive discretizations

These divisions into classes should be distinguished from partitions based on observable and noted discontinuities. Discretizations of this family are based on intuition, with the author making a priori decisions about the division generally relying on his or her experience. Intuitive divisions are based on familiarity with the studied phenomenon, but they generally vary from one author to another, which makes comparisons impossible. These methods should be completely avoided except in some very particular cases.

1.2.3.2. Exogenous discretizations

For these discretization methods the class limits are defined by reference to something external to the data, and not just from the thematic variables which are to be mapped. For instance, we can choose the limits based on an external reference such as the mean value for the studied country. These data divisions are used when it is desirable to locate a given region or zone with respect to a larger area. They are practical for making comparisons, but they do not necessarily reveal the specific distribution of the variable in question. One of the risks of this method is that it may show no difference among the spatial units of the represented region.

1.2.3.3. Mathematical discretizations

Discretization methods based on mathematical principles can be applied with or without transforming the variable which is being mapped. First of all we will focus on methods without transformations.

There are three basic methods which are differentiated by the class ranges, varying according to mathematical principles. In the first method the intervals are equal. In the other two the intervals increase from the minimum to the maximum of the data set, so that smaller values are more favored when we switch from a discretization method with equal intervals to a method with an arithmetic or a geometric progression of the intervals.

For practical reasons, and in order to avoid redefining the parameters which appear in almost all discretization methods, we fix the notation here before reviewing the various ways of dividing data into classes: N denotes the number of observations, k the number of classes, Zmin and Zmax the minimum and maximum values of the data set, and E = Zmax − Zmin the range of the variable.

1.2.3.3.1. Discretization with equal class intervals

In this method all the classes have the same range. It is one of the most common methods, almost “the method” for constructing graphical representations such as diagrams and histograms of simple and cumulative frequencies.

Figure 1.9 gives the procedure for dividing the data and the results of its application to the population densities of the municipalities in Luxembourg in 2003. Figure 1.12a presents the graphical and cartographic results of this method.

Figure 1.9. Grouping into classes of equal intervals: procedure and application

Advantages and disadvantages

This method is simple and easy to carry out. It is suitable if each class is well populated and the distribution is not asymmetric. If the distribution is uniform the number of data values in all classes will be the same. But if the distribution has pronounced discontinuities then some classes may end up being empty. It may even happen that all the values save one are found in the first class and a single remaining value is in the last (or vice versa if the asymmetry is skewed to the right). The case of extreme concentration of values is obviously exaggerated, but it shows the risk of following the customary procedure without examining the behavior of the distribution first. A cartographic representation with such a division will not describe the phenomenon and will only present a caricature of it. Moreover, with this method comparisons are not possible because the Z variable range is specific to each data set.
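A minimal sketch of the computation (the bounds below are invented): each class interval is a = E / k = (Zmax − Zmin) / k, and the limits follow by accumulation:

def equal_interval_limits(z_min, z_max, k):
    """Limits of k classes of equal interval a = (z_max - z_min) / k."""
    a = (z_max - z_min) / k
    return [z_min + i * a for i in range(k + 1)]

# Invented bounds for illustration:
print(equal_interval_limits(0.0, 100.0, 5))
# -> [0.0, 20.0, 40.0, 60.0, 80.0, 100.0]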

1.2.3.3.2. Discretization with an arithmetic progression

In this case the class intervals are different: they increase from the minimum to the maximum value of the data set in an arithmetic progression. The sum of the intervals of all the classes equals the range of the variable (E), and the sum of the class numbers from 1 to k tells us into how many elementary steps this range has to be divided. From this we can deduce the ratio of the progression (r).

Figure 1.10 shows the procedure for dividing the data and the results of its application to the population densities of the municipalities in Luxembourg in 2003. Figure 1.12b presents the graphical and cartographic results of this method.

Figure 1.10. Dividing the data with an arithmetic progression: procedure and application

Advantages and disadvantages

This method helps the smaller values to stand out somewhat. It provides more detailed classes to the lower part of the distribution at the expense of the larger values. If, however, the distribution is very irregular, then empty or nearly empty classes may appear and bring about a loss of information.
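A sketch consistent with this description, assuming intervals r, 2r, …, kr whose sum is the range E, hence r = 2E / (k(k + 1)):

def arithmetic_limits(z_min, z_max, k):
    """Limits of k classes whose intervals grow as r, 2r, ..., k*r."""
    e = z_max - z_min
    r = 2 * e / (k * (k + 1))       # ratio: r * (1 + 2 + ... + k) = E
    limits, width = [z_min], 0.0
    for i in range(1, k + 1):
        width += i * r              # cumulative width of the first i classes
        limits.append(z_min + width)
    return limits

# Invented bounds: intervals 10, 20, 30, 40 over a range of 100.
print(arithmetic_limits(0.0, 100.0, 4))
# -> [0.0, 10.0, 30.0, 60.0, 100.0]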

1.2.3.3.3. Discretization with a geometric progression

This type of discretization is equivalent to a mathematical transformation of the variable into its logarithm. It therefore only accentuates the existing trend in the values, since the class interval now increases according to a geometric progression, in other words according to a multiplicative and not an additive principle. To calculate the base of the progression we start by noticing that the maximum of the data set is equal to its minimum multiplied by the base of the progression raised to the power k, the number of classes.

Figure 1.11 shows the procedure for dividing the data and the results of its application to the population densities of the municipalities in Luxembourg in 2003. Figure 1.12c presents the graphical and cartographic results of this method.

Advantages and disadvantages

This method gives the sharpest class detail among the smaller values. If the distribution is very asymmetric and skewed to the left, then such a discretization acts to straighten the distribution out and, in a sense, renormalize it. Actually this procedure is simply a particular case of the discretization methods with a logarithmic transformation of values. It is therefore obvious that the calculations are only possible if the minimum value is not zero. As I. S. Evans points out [EVA 77], this method is of great interest with S-shaped distributions.

Figure 1.11. Dividing the data with a geometric progression: procedure and application
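A sketch of the corresponding computation: from Zmax = Zmin · b^k the base is b = (Zmax / Zmin)^(1/k), which requires a non-zero minimum (the bounds below are invented):

def geometric_limits(z_min, z_max, k):
    """Limits of k classes in geometric progression: z_max = z_min * b**k."""
    assert z_min > 0, "a zero minimum rules out the geometric method"
    b = (z_max / z_min) ** (1.0 / k)   # base of the progression
    return [z_min * b ** i for i in range(k + 1)]

# Invented bounds: base b = 3, limits 1, 3, 9, 27, 81.
print(geometric_limits(1.0, 81.0, 4))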

These three methods of mathematical division can be adapted to various distributions. If the distribution is normal then classes with equal intervals show its features rather well. If the distribution is asymmetric and skewed to the left then, depending on the degree of asymmetry, the other two methods (arithmetic and geometric progression) straighten out the values and effectively act as a division with equal intervals applied to a normalized variable.

Finally, the divisions with increasing intervals allow us to obtain more details and more information about the small values.

Figure 1.12. Graphical and cartographic results of divisions with constant intervals (1.12a), an arithmetic progression (1.12b) and a geometric progression (1.12c)

1.2.3.4. Statistical and probabilistic discretizations

These discretization methods make use of the conventional statistical parameters or the probabilities, or both at the same time. Some of the methods can only work with normally distributed variables or variables which have been transformed to have a nearly normal distribution.

1.2.3.4.1. Discretization by quantiles

This method does not require the distribution to be normal, and it is fairly well suited to a uniform distribution. It works by placing equal or almost-equal numbers of values in each class. The class intervals are calculated by counting off the number of individuals n in each class of the ordered distribution. It is also possible to apply statistical formulas to find the median, the quartiles, the quintiles, the deciles or the centiles, dividing the data into two, four, five, ten or a hundred sets of equal frequency, respectively. This means that we are no longer interested in the values themselves, but only in their ranking in the ordered data set, in increasing or decreasing order.

As mentioned above, this method accepts well-defined numbers of classes which have a statistical meaning related to dispersion. As an example we will apply this method to Luxembourg using five classes (which correspond to the quintiles) with 20% of the total number of values (N) in each class.

Figure 1.13 shows the procedure for dividing the data and the results of its application to the population densities of the municipalities in Luxembourg in 2003. Figure 1.18a presents the graphical and cartographic results of this method.

Figure 1.13. Division of data into quintiles: procedure and application

Advantages and disadvantages

Calculating the “limits” may give rise to problems of a diverse nature. If the statistical data set contains a lot of tied values it may not be possible to have the same number of them in each class. Sometimes the variation may be significant. If the distribution contains discontinuities, on the one hand it makes it difficult to choose the limit values, and on the other hand, it can lead to the grouping together of spatial units with very dissimilar values. As J. P. Grimmeau points out [GRI 77], this method ignores the specific features of the distribution. It attributes the same importance to every class regardless of the distribution type: normal, multi-modal, uniform, etc.

Conversely, this method has certain advantages. It allows us to get rid of the weight of the extreme values. In fact, it is completely independent of the values and only concerns itself with their ordering. This takes us from a ratio to an ordinal scale, in other words we consider the observations based on their rank and not on their values.

Moreover, from a cartographic point of view, if the areas of the spatial units are not too different then the map is “balanced” in the sense that there appears to be an equal amount of each shade of gray. The notion of equal amount of surface area for each class is related to the concept of entropy. The entropy is maximum in a partition where there are equal frequencies in each class. Consequently, it offers the readers a maximum amount of information for each class.
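One way to sketch this in Python (np.quantile stands in for the hand counting described above; the data are invented, with k = 5 giving quintile classes of roughly 20% of N each):

import numpy as np

def quantile_limits(values, k):
    """Class limits placing roughly N/k observations in each of k classes."""
    return np.quantile(np.asarray(values, dtype=float),
                       np.linspace(0, 1, k + 1))

# Invented skewed data, N = 118 as in the Luxembourg example:
data = np.random.default_rng(0).lognormal(mean=3, sigma=1, size=118)
limits = quantile_limits(data, 5)
print(np.round(limits, 1))
counts, _ = np.histogram(data, bins=limits)
print(counts)   # roughly N/5 = 23-24 values per class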

1.2.3.4.2. “Mean-standard deviation” discretizations (or standardized discretizations)

In principle, these discretization techniques are meant for a normal distribution and the classes resulting from it all have the same range corresponding to the standard deviation of the data set. If the variable being dealt with does not obey the normal distribution law, it is useful to transform it before applying this type of discretization in order to avoid any risk of error.

Two options are available for calculating the class limits. The first one entails an odd number of classes with the central class straddling the mean value. The second gives an even number of classes lying symmetrically around the mean value. The division straddling the mean is shown in Figure 1.14a. It allows us to have a central class which comprises all the values lying within 0.5 of the standard deviation on both sides of the mean. This is very useful if we want to highlight the values close to the mean. The division centered on the mean is shown in Figure 1.14b. Here the classes are situated on both sides of the mean, and the mean value itself becomes a class limit. It separates the classes below the mean from those above the mean. The choice between these two methods depends on the aim of the map. In other words, the cartographer should ask the following question: “What kind of information do I want to communicate? Do I need a division with a central class which clearly shows the data values close to the mean, or a division which indicates which values are below and which are above the mean?”

Figure 1.14. Types of data division using the mean and the standard deviation straddling the mean value (1.14a) or centered symmetrically around the mean value (1.14b)

Figure 1.15. Division using the mean and the standard deviation (with a central class straddling the mean): procedure and application

Figure 1.15 shows the procedure for dividing the data and the results of its application to the population densities of the municipalities in Luxembourg in 2003. Figure 1.18b presents the graphical and cartographic results of this method.
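A sketch of both variants under the stated assumptions: interior limits at m ± 0.5s, m ± 1.5s, … when straddling the mean, at m, m ± s, … when centered on it, with the outer classes running to the data extremes (the data are invented):

import numpy as np

def mean_std_limits(values, k, straddle=True):
    """Interior class limits built from the mean m and standard deviation s.

    straddle=True : odd k, central class [m - 0.5s, m + 0.5s]
    straddle=False: even k, the mean itself is a class limit
    """
    m, s = np.mean(values), np.std(values, ddof=1)
    half = k // 2
    if straddle:
        offsets = [(i + 0.5) * s for i in range(half)]
    else:
        offsets = [i * s for i in range(half)]
    return sorted({m - o for o in offsets} | {m + o for o in offsets})

# Invented, roughly normal data:
data = np.random.default_rng(1).normal(loc=50, scale=10, size=118)
print(np.round(mean_std_limits(data, k=5, straddle=True), 1))
print(np.round(mean_std_limits(data, k=4, straddle=False), 1))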

Advantages and disadvantages

1.2.3.4.3. Discretization using nested-means

This method is attributed to W. Scripter [SCR 70]. It is derived from the concept of the arithmetic mean. The idea is that the arithmetic mean can be considered as a “value” separating two sets and that it represents a balance point of the distribution, much like the center of gravity in a triangle. This notion is subsequently applied to the complete data set and then to its subsets, creating partitions within each group. The number of classes is predetermined: it is necessarily a power of 2 (2¹, 2², 2³, 2⁴, …).
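A minimal recursive sketch of this nested-means principle (the data are invented; a depth of d splits gives 2^d classes):

import numpy as np

def nested_mean_limits(values, depth):
    """Interior class limits from recursive splits at the mean: 2**depth classes."""
    def split(subset, d):
        if d == 0 or subset.size == 0:
            return []
        m = subset.mean()                    # balance point of this subset
        return (split(subset[subset < m], d - 1)
                + [m]
                + split(subset[subset >= m], d - 1))
    return split(np.asarray(values, dtype=float), depth)

# Invented skewed data; depth 2 -> 3 interior limits -> 4 classes (2**2).
data = np.random.default_rng(2).lognormal(mean=3, sigma=1, size=118)
print(np.round(nested_mean_limits(data, depth=2), 1))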