A new, full-color, completely updated edition of the key practical guide to chemometrics
This new edition of the key practical guide to chemometrics emphasizes the principles and applications behind the main ideas in the field, using numerical and graphical examples that can be applied to a wide variety of problems in chemistry, biology, chemical engineering, and allied disciplines. Presented in full color, it features expanded coverage of principal component analysis, classification, multivariate evolutionary signals and statistical distributions, new case studies in metabolomics, and extensive updates throughout. Aimed at the large number of users of chemometrics, it includes extensive worked problems and chapters explaining how to analyze datasets, in addition to updated descriptions of how to apply Excel and Matlab to chemometrics.
Chemometrics: Data Driven Extraction for Science, Second Edition offers chapters covering experimental design, signal processing, pattern recognition, calibration, and evolutionary data. The pattern recognition chapter from the first edition has been divided into two separate chapters: Principal Component Analysis and Cluster Analysis, and Classification. New descriptions of Alternating Least Squares (ALS) and Iterative Target Transformation Factor Analysis (ITTFA) are included, along with updated descriptions of wavelets and Bayesian methods.
Chemometrics: Data Driven Extraction for Science, Second Edition is recommended for post-graduate students of chemometrics as well as applied scientists (e.g. chemists, biochemists, engineers, statisticians) working in all areas of data analysis.
Page count: 1064
Publication year: 2018
Cover
Title Page
Copyright
Preface to Second Edition
Preface to First Edition
Acknowledgements
About the Companion Website
Chapter 1: Introduction
1.1 Historical Parentage
1.2 Developments since the 1970s
1.3 Software and Calculations
1.4 Further Reading
References
Chapter 2: Experimental Design
2.1 Introduction
2.2 Basic Principles
2.3 Factorial Designs
2.4 Central Composite or Response Surface Designs
2.5 Mixture Designs
2.6 Simplex Optimisation
Problems
Chapter 3: Signal Processing
3.1 Introduction
3.2 Basics
3.3 Linear Filters
3.4 Correlograms and Time Series Analysis
3.5 Fourier Transform Techniques
3.6 Additional Methods
Problems
Chapter 4: Principal Component Analysis and Unsupervised Pattern Recognition
4.1 Introduction
4.2 The Concept and Need for Principal Components Analysis
4.3 Principal Components Analysis: The Method
4.4 Factor Analysis
4.5 Graphical Representation of Scores and Loadings
4.6 Pre-processing
4.7 Comparing Multivariate Patterns
4.8 Unsupervised Pattern Recognition: Cluster Analysis
4.9 Multi-way Pattern Recognition
Problems
Chapter 5: Classification and Supervised Pattern Recognition
5.1 Introduction
5.2 Two-Class Classifiers
5.3 One-Class Classifiers
5.4 Multi-Class Classifiers
5.5 Optimisation and Validation
5.6 Significant Variables
Problems
Chapter 6: Calibration
6.1 Introduction
6.2 Univariate Calibration
6.3 Multiple Linear Regression
6.4 Principal Components Regression
6.5 Partial Least Squares Regression
6.6 Model Validation and Optimisation
Problems
Chapter 7: Evolutionary Multivariate Signals
7.1 Introduction
7.2 Exploratory Data Analysis and Pre-processing
7.3 Determining Composition
7.4 Resolution
Problems
Appendix
A.1 Vectors and Matrices
A.2 Algorithms
A.3 Basic Statistical Concepts
A.4 Excel for Chemometrics
A.5 Matlab for Chemometrics
Answers to the Multiple Choice Questions
Index
End User License Agreement
Chapter 2: Experimental Design
Figure 2.1 Yield of a reaction as a function of pH and catalyst concentration.
Figure 2.2 Cross-section through surface in Figure 2.1 at 2 mM catalyst concentration.
Figure 2.3 Cross-section through surface in Figure 2.1 at pH 3.4.
Figure 2.4 Choice of nine molecules based on two properties.
Figure 2.5 Graph of spectroscopic peak height against concentration at five concentrations.
Figure 2.6 Experiment with high instrumental errors.
Figure 2.7 Experiment with low instrumental errors.
Figure 2.8 Degree-of-freedom tree.
Figure 2.9 Graph of peak height against concentration for ANOVA example, data set A.
Figure 2.10 Graph of peak height against concentration for ANOVA example, data set B.
Figure 2.11 Design matrix.
Figure 2.12 Relationship between response, design matrix and coefficients.
Figure 2.13 Graph of estimated response versus pH at the central temperature and concentration of the design in Table 4.6.
Figure 2.14 Seven lines, equally spaced in area, dividing the normal distribution into eight regions, including six central regions and two extreme regions whose summed area equals those of the central regions.
Figure 2.15 Normal probability plot for data in Table 2.13 with significant factors marked.
Figure 2.16 Method of calculating equation for leverage term for the coefficient of , sum the shaded areas.
Figure 2.17 Graph of leverage for designs in Table 2.15, from top to bottom, designs A, B and C.
Figure 2.18 Two-factor design consisting of five experiments.
Figure 2.19 Two experimental arrangements together with the corresponding leverage for a linear model.
Figure 2.20 Graph of levels of one term against another in the design in Table 2.18.
Figure 2.21 Three- and four-level full factorial designs.
Figure 2.22 Representation of a three-factor, two-level design.
Figure 2.23 Fractional factorial design.
Figure 2.24 Poorly designed calibration experiment.
Figure 2.25 Well-designed calibration experiment.
Figure 2.26 Cyclic permuter.
Figure 2.27 Graph of factor levels for design in Table 2.29: top factors 1 versus 2, bottom factors 1 versus 7.
Figure 2.28 Elements of a central composite design: each axis represents a factor.
Figure 2.29 Degrees of freedom for central composite design.
Figure 2.30 Three-component mixture space.
Figure 2.31 Simplex in one, two and three dimensions.
Figure 2.32 Three-component simplex centroid design.
Figure 2.33 Three-component simplex lattice design.
Figure 2.34 Four situations encountered in constrained mixture designs. (a) Lower bounds defined, (b) upper bounds defined, (c) upper and lower bounds defined, fourth factor as filler and (d) upper and lower bounds defined.
Figure 2.35 Mixture design with process variables.
Figure 2.36 Initial experiments (a, b and c) on the edge of a simplex: two factors and the new conditions if experiment results in the worst response.
Figure 2.37 Progress of a fixed sized simplex.
Figure 2.38 Modified simplex; the original simplex is indicated in bold, with the responses ordered from 1 (worst) to 3 (best). The test conditions are indicated.
Chapter 3: Signal Processing
Figure 3.1 Main parameters that characterise a symmetric peak.
Figure 3.2 Main parameters that characterise an asymmetric peak.
Figure 3.3 Gaussian and Lorentzian peak shapes of equal half heights.
Figure 3.4 Asymmetric peak shapes often described by a Gaussian/Lorentzian model. (a) Tailing: left is Gaussian and right is Lorentzian. (b) Fronting: left is Lorentzian and right is Gaussian.
Figure 3.5 Three peaks forming a cluster.
Figure 3.6 Influence on the appearance of a peak as digital resolution is reduced corresponding to Table 3.1.
Figure 3.7 Examples of noise. From top to bottom: underlying signal, homoscedastic and heteroscedastic.
Figure 3.8 Selection of points to be used in a three-point moving average filter.
Figure 3.9 Filtering of data. (a) Raw data, (b) moving average filters, (c) quadratic/cubic Savitzky–Golay filters.
Figure 3.10 Comparison of moving average and running median smoothing.
Figure 3.11 A Gaussian together with its first and second derivative.
Figure 3.12 Two closely overlapping peaks together with their first and second derivatives.
Figure 3.13 From top to bottom, a three-point moving average, a Hanning window and a five-point Savitzky–Golay quadratic second-derivative window convolution functions.
Figure 3.14 A time series.
Figure 3.15 Auto-correlogram of the data in Figure 3.14.
Figure 3.16 Two time series (a,b) and their corresponding cross-correlogram (c).
Figure 3.17 Fourier transformation from a time domain to a frequency domain.
Figure 3.18 Typical time series consisting of several components.
Figure 3.19 Transformation of a real time series to real and imaginary pairs.
Figure 3.20 Fourier transform of a spike.
Figure 3.21 Absorption and dispersion line shapes.
Figure 3.22 Illustration of phase errors (time series (a–d) and real transform (e–h)).
Figure 3.23 A sparsely sampled time series sampled at the Nyquist frequency. Blue: underlying time series; red: observed time series if sparsely sampled.
Figure 3.24 Fourier transformation of a rapidly decaying time series.
Figure 3.25 Result of multiplying the time series in Figure 3.24 by a positive exponential, the transform of the original time series being represented by a dotted blue line.
Figure 3.26 Result of multiplying a noisy time series by a positive exponential and transforming the new signal.
Figure 3.27 Multiplying the data in Figure 3.25 by a double exponential.
Figure 3.28 Use of a double exponential filter.
Figure 3.29 Fourier self-deconvolution of a peak cluster.
Figure 3.30 Progress of the Kalman filter, showing the fitted and raw data.
Figure 3.31 Change in the three coefficients predicted by the Kalman filter with time.
Figure 3.32 Raw data and wavelet filtered data in Table 3.11.
Figure 3.33 Haar wavelets of levels 1 and 2 corresponding to data in Table 3.11.
Figure 3.34 Frequency distribution for the toss of a die.
Figure 3.35 Another, but less likely, frequency distribution for toss of a die.
Chapter 4: Principal Component Analysis and Unsupervised Pattern Recognition
Figure 4.1 Factor analysis in psychology.
Figure 4.2 Matrix representation of results from a metabolomics experiment.
Figure 4.3 Case study 1: chromatographic peak profiles, involving summing intensities of the data from Table 4.1 over all wavelengths.
Figure 4.4 Case study 2: superimposed NIR spectra corresponding to the data in Table 4.2.
Figure 4.5 Principal components analysis.
Figure 4.6 PCA as a form of variable reduction.
Figure 4.7 Graph of log of PRESS (top) and RSS (bottom) for data set in Table 4.8.
Figure 4.8 Relationship between PCA and factor analysis in coupled chromatography.
Figure 4.9 Plot of scores of PC2 versus PC1 for case study 2.
Figure 4.10 3D plot for the scores of case study 2.
Figure 4.11 1D plot of the scores of PCs 1 and 2 for case study 1.
Figure 4.12 Scores of PC2 (vertical axis) versus PC1 (horizontal axis) for case study 1.
Figure 4.13 Scores of the first two PCs of case study 1 versus sample number.
Figure 4.14 Scores of principal component 2 (vertical axis) versus principal component 1 (horizontal axis) for the standardised data of case study 3.
Figure 4.15 Loadings plot of PC2 (vertical axis) against PC1 (horizontal axis) for case study 1, with wavelengths indicated in nanometre.
Figure 4.16 Pure spectra of compounds in case study 1.
Figure 4.17 Loadings of PC2 versus PC1 for case study 2.
Figure 4.18 Loadings of the first two PCs against wavelength for case study 2.
Figure 4.19 Loadings of principal component 2 versus principal component 1 for the standardised data of case study 3.
Figure 4.20 Scores of the first two PCs of the data in Table 4.11, (a) raw data (b) log scaled data.
Figure 4.21 Scores of the first two PCs of the data in Table 4.12 (a) raw data (b) row scaled data to constant total.
Figure 4.22 PC scores plot of PC2 versus PC1 for raw data of Table 4.12.
Figure 4.23 PC scores plot of PC2 versus PC1 for data after centring of Table 4.12.
Figure 4.24 Scores plot of PC2 versus PC1 for case study 1 after centring.
Figure 4.25 Plot of the scores of the first two PCs of the standardised data in Table 4.15.
Figure 4.26 Biplot of scores of the first two PCs of case study 3.
Figure 4.27 (a) Euclidean and (b) Manhattan distances.
Figure 4.28 Dendrogram for cluster analysis example.
Figure 4.29 Two-way and three-way data.
Figure 4.30 Possible method of arranging environmental sampling data.
Figure 4.31 Tucker3 decomposition.
Figure 4.32 Parallel factor analysis (PARAFAC).
Figure 4.33 Unfolding.
Chapter 5: Classification and Supervised Pattern Recognition
Figure 5.1 Data set in Table 5.1.
Figure 5.2 Two-class classifiers. (a) Linearly separable classes. (b) Linear inseparable classes.
Figure 5.3 A class distance plot. Top illustrates two classes, with their centroids marked by crosses. A sample is indicated, with its distances to the centroids of the blue and red classes. Bottom projects onto a class distance plot, with the specific sample noted.
Figure 5.4 Boundaries between groups A and B in Table 5.1, (a) EDC, (b) LDA and (c) QDA together with equidistant contours from the centroids for each criterion.
Figure 5.5 kNN boundaries for data set in Table 5.1; (a) k = 3 and (b) k = 5.
Figure 5.6 Appearance of kNN boundaries if the distance of a sample to itself is excluded for k = 3 and data in Table 5.1; sample 4 and its three neighbours marked.
Figure 5.7 One-class classifiers: (a) separable class; (b) classes with ambiguous and outlying samples.
Figure 5.8 Typical Gaussian density estimator for a data set characterised by two variables, with contour lines at different levels of certainty indicated.
Figure 5.9 QDA one-class boundaries using 90% confidence (p = 0.1) for data in Table 5.1.
Figure 5.10 Principles of disjoint PC models.
Figure 5.11 Class A disjoint model for PC1 for data set in Table 5.1, centred according to class A.
Figure 5.12 Multi-class classifiers.
Figure 5.13 PLS1 multi-class models.
Figure 5.14 Division of data into training and test sets.
Figure 5.15 Two different seating plans.
Figure 5.16 Dividing the data in Figure 5.15 into training and test sets.
Figure 5.17 Monte Carlo methods: bars represent frequency of results of several permutations for the %CC, whereas the red line represents the unpermuted data.
Figure 5.18 Typical bootstrap sampling.
Figure 5.19 Division of data into test set and bootstrap test set and a typical iterative approach.
Figure 5.20 Distribution of Heads if an unbiased coin is tossed 10 times.
Figure 5.21 PLS-DA scores and loadings of component 1 for the standardised data in Table 5.13.
Figure 5.22 Values of t for the 10 variables in Table 5.13.
Chapter 6: Calibration
Figure 6.1 Different notations for calibration and experimental design as used in this book.
Figure 6.2 Absorbance at 335 nm for the PAH case study plotted against concentration of pyrene.
Figure 6.3 Spectra of pure standards, digitised at 5 nm intervals, pyrene indicated in bold.
Figure 6.4 Difference between errors in (a) classical and (b) inverse calibration.
Figure 6.5 Best-fit straight lines for classical and inverse calibration, data for pyrene at 335 nm, no intercept, forcing the model through the origin.
Figure 6.6 Best-fit straight line using inverse calibration and an intercept term.
Figure 6.7 Predicted (vertical) versus known (horizontal) concentrations using methods of Section 6.2.3.
Figure 6.8 Absorbances of Pyr, Fluor, Benz and Ace between 330 and 345 nm.
Figure 6.9 Predicted versus known concentration of pyrene, using a four-component model and the wavelengths 330, 335, 340 and 345 nm (uncentred).
Figure 6.10 Spectra of the 10 PAHs estimated by MLR, with pyrene indicated in bold.
Figure 6.11 Root mean square errors of estimation of pyrene using uncentred PCR between 1 and 15 PCs.
Figure 6.12 Principles of PLS1.
Figure 6.13 Root mean square errors in x and c blocks, PLS1 centred and pyrene using between 1 and 15 PCs.
Figure 6.14 Residual errors in x and c blocks, PLS1 centred and acenaphthene.
Figure 6.15 Principles of PLS2.
Figure 6.16 Unfolding a data matrix.
Figure 6.17 Representation of tri-linear PLS1.
Figure 6.18 Matricisation in three-way calibration (x block only illustrated).
Figure 6.19 RMSEC auto-predictive errors for acenaphthylene using PLS1.
Figure 6.20 RMSECV for acenaphthylene using PLS1.
Figure 6.21 RMSEP using data in Table 6.1 as a training set and data in Table 6.20 as a test set, PLS1 (centred) and acenaphthylene.
Figure 6.22 RMSEP using data in Table 6.20 as a training set and data in Table 6.1 as a test set, PLS1 (centred) and acenaphthylene.
Chapter 7: Evolutionary Multivariate Signals
Figure 7.1 Sequential multivariate data matrix.
Figure 7.2 Three possible sequential patterns that would be treated identically using standard multivariate techniques.
Figure 7.3 Dividing data into regions before baseline correction.
Figure 7.4 Profile of data from data set A.
Figure 7.5 Scores and loadings plots of raw data from data set A for PC2 versus PC1.
Figure 7.6 Profile of data set B.
Figure 7.7 Scores and loadings plots of data set B from Table 7.2 for PC2 versus PC1.
Figure 7.8 Three-dimensional projections of scores (a, top) and loadings (b, bottom) for data set A.
Figure 7.9 Three-dimensional projections of scores (a, top) and loadings (b, bottom) for data set B.
Figure 7.10 Scores plots of data set A with each PC normalised.
Figure 7.11 Scores plots of data set A, each row summed to a constant total. (a) Entire data set, (b) expansion of region data points 5–19 and (c) performing the scaling and then PCA exclusively over points 5–19.
Figure 7.12 Scores plot of data set B with rows summed to a constant total between data points 5 and 20 and three main directions indicated. (a) Two PCs and (b) three PCs.
Figure 7.13 Scores and loadings after data set B has been standardised.
Figure 7.14 Intensity profile and unscaled scores and loadings from data set C in Table 7.3.
Figure 7.15 Scores and loadings after the data set C in Table 7.3 has been standardised.
Figure 7.16 Scores and loadings of the ranked data in Table 7.4.
Figure 7.17 Optimum size for variable reduction.
Figure 7.18 Different types of problems in chromatography.
Figure 7.19 Ratios of peak intensities for the case studies (a)–(d) assuming ideal peak shapes and peaks detectable over an indefinite region.
Figure 7.20 Regions of chromatogram (a) in Figure 7.18. Region a is where the ratio of the two components is between 50:1 and 1:50 and region b where the overall intensity is more than 1% of the maximum intensity.
Figure 7.21 Ratio of intensity of measurements D to F for data set A. (a) Raw information, (b) logarithmic scale between points 5 and 18 and (c) the minimum of the ratio of intensity D:F and F:D between points 5 and 18.
Figure 7.22 Intensities for wavelengths C and G using data of data set A summing the measurements at each successive point to constant total of 1.
Figure 7.23 Graph of correlation between successive points in the data of data set A.
Figure 7.24 Correlation between point 15 and the data of data set A.
Figure 7.25 Graph corresponding to that of Figure 7.23 for data set B in Table 7.2.
Figure 7.26 Forward and backward EFA plots of the first three eigenvalues from data set A.
Figure 7.27 Three-point FSW graph for data set A.
Figure 7.28 Derivative purity plot for data set A with purest points indicated.
Figure 7.29 Composition of regions in chromatogram deriving from data set A.
Figure 7.30 Profiles of variables C and F in Table 7.3.
Figure 7.31 Reconstructed profiles for data set A using MLR.
Figure 7.32 Profiles obtained as described in Section 7.4.1.3.
Figure 7.33 Profiles of three peaks obtained as in Section 7.4.2.
Appendix
Figure A.1 Changing to numeric cell addresses.
Figure A.2 The range A2:C3.
Figure A.3 The operation =AVERAGE(A1:B5,C8,B9:D11).
Figure A.4 Dragging a cell so that the reference is invariant.
Figure A.5 Naming a range.
Figure A.6 Matrix multiplication in Excel.
Figure A.7 Matrix transpose in Excel.
Figure A.8 Matrix inverse in Excel.
Figure A.9 Pseudo-inverse of a matrix.
Figure A.10 Correlation between two ranges.
Figure A.11 Finding the slope and intercept when fitting a linear model to two ranges.
Figure A.12 Use of IF in Excel.
Figure A.13 Finding the Analysis Toolpak.
Figure A.14 Data Analysis Add-in dialog box.
Figure A.15 Linear regression using the Excel Data Analysis Add-in.
Figure A.16 Generating random numbers in Excel.
Figure A.17 Adding an extra series in Excel.
Figure A.18 Finalised chart from Excel.
Figure A.19 Labelling a graph in Excel.
Figure A.20 Setup screen for the Excel chemometrics add-in.
Figure A.21 Selecting the Multivariate Analysis Add-in.
Figure A.22 Multivariate analysis dialog box.
Figure A.23 PCA dialog box.
Figure A.24 PCR dialog box.
Figure A.25 PLS dialog box.
Figure A.26 MLR dialog box.
Figure A.27 Default Matlab window.
Figure A.28 File and array listing in Matlab.
Figure A.29 Running an m file script in Matlab.
Figure A.30 Running an m file function in Matlab.
Figure A.31 Obtaining vectors from matrices.
Figure A.32 Simple matrix operations in Matlab.
Figure A.33 Calculating a pseudo-inverse in Matlab.
Figure A.34 Mean function in Matlab.
Figure A.35 Calculating standard deviations in Matlab: the second calculation is preferred for most chemometric calculations where the aim is to scale a matrix.
Figure A.36 Mean centring a matrix in Matlab.
Figure A.37 Importing from Excel to Matlab.
Figure A.38 A simple loop used for mean centring.
Figure A.39 Blank Figure window.
Figure A.40 Use of hold on.
Figure A.41 Use of multiple plot facility.
Figure A.42 Use of specifiers to change the properties of a graph in Matlab.
Figure A.43 Use of axis square statement to view correct angles between vectors.
Figure A.44 Matlab Property Editor.
Figure A.45 Use of text command in Matlab.
Figure A.46 A 3D scores plot.
Figure A.47 Using the rotation icon to obtain a better view.
Figure A.48 Changing the appearance of the 3D plot.
Figure A.49 Loadings plot with identical orientation to the scores plot, labelled and copied into Word.
Chapter 2: Experimental Design
Table 2.1 Three experimental designs
Table 2.2 Numerical information for data sets A and B
Table 2.3 Calculation of errors for data set A, model including intercept
Table 2.4 Error analysis for data sets A and B
Table 2.5 ANOVA table: two-parameter model, data set B
Table 2.6 Typical experimental design
Table 2.7 Design matrix for the experiment in Table 2.6 using the model discussed in Section 2.2.3.1
Table 2.8 The vectors b and ŷ for data in Table 2.6
Table 2.9 Coding of data
Table 2.10 Coded design matrix together with estimated values of coded coefficients
Table 2.11 Calculation of t-statistic
Table 2.12 F-ratio for experiment with low experimental error
Table 2.13 Normal probability calculation
Table 2.14 Leverage values for a two-factor design and a model of the form
Table 2.15 Leverage for three possible single-variable designs using a two-parameter linear model
Table 2.16 Coding of a simple two factor, two level design and corresponding responses
Table 2.17 Design matrix
Table 2.18 Four-factor, two-level full factorial design
Table 2.19 Correlated factors
Table 2.20 Full factorial designs corresponding to Figure 2.21
Table 2.21 Full factorial design for three factors together with the design matrix
Table 2.22 Fractional factorial design
Table 2.23 Confounding factor 5 with the product of factors 1–4
Table 2.24 Confounding interaction terms in design in Table 2.23
Table 2.25 Quarter factorial design
Table 2.26 A Plackett–Burman design for 11 factors, generator outlined by a box
Table 2.27 Generators for Plackett–Burman design, first row is at − level
Table 2.28 Equivalence of Plackett–Burman and fractional factorial designs for seven factors, the arrows showing how the rows are related
Table 2.29 Development of a multi-level partial factorial design
Table 2.30 Parameters for construction of a multi-level calibration design
Table 2.31 Construction of a central composite design
Table 2.32 Three possible two-factor central composite designs
Table 2.33 Position of the axial points for rotatability and orthogonality for central composite designs with varying number of replicates (one less than the number of central points)
Table 2.34 Three-component simplex centroid mixture design
Table 2.36 A {5,2} simplex centroid design
Table 2.37 Two-component simplex lattice design
Table 2.38 Number of experiments required for various simplex lattice designs, with different numbers of components and interactions
Table 2.39 Constrained mixture design with three lower bounds
Table 2.41 Example of simultaneous constraints in mixture designs
Table 2.42 Constrained mixture design where both upper and lower limits are known in advance
Chapter 3: Signal Processing
Table 3.1 Reducing digital resolution
Table 3.2 Stationary and moving average noise
Table 3.3 Savitzky–Golay coefficients c_{i+j} for smoothing
Table 3.4 Results of various filters on a data set
Table 3.5 A sequential process: illustration of moving average and median smoothing
Table 3.6 Savitzky–Golay coefficients for derivatives
Table 3.7 Data in Figure 3.14 together with the data lagged by five points in time
Table 3.8 Two time series, for which the cross-correlogram is presented in Figure 3.16
Table 3.9 Equivalence between parameters in the time domain and frequency domain
Table 3.10 Kalman filter calculation
Table 3.11 Numerical example for wavelet transform: left raw data, centre transformed data after level 1 wavelet and right after level 2 wavelet, without scaling
Table 3.12 Maximum entropy calculation for unbiased die, logarithms to the base 10
Table 3.13 Maximum entropy calculation for biased die
Chapter 4: Principal Component Analysis and Unsupervised Pattern Recognition
Table 4.1 Case study 1: a chromatogram recorded at 30 points in time and 28 wavelengths
Table 4.2 Case study 2: NIR spectra of 72 oils in AU recorded at 32 wavelengths, consisting of four groups A: corn oil, B: olive oil, C: safflower oil, D: corn margarine, after baseline correction and suitable pre-processing
Table 4.3 Case study 3: properties of some elements
Table 4.4 Scores and loadings for case study 1
Table 4.5 Eigenvalues for case study 1 (raw data)
Table 4.6 Eigenvalues for case study 3 (standardised data)
Table 4.7 Size of eigenvalues for case study 1 after column centring
Table 4.8 Cross-validation example
Table 4.9 Calculation of cross-validated error for sample 1
Table 4.10 Calculation of RSS and PRESS
Table 4.11 Example for logarithmic scaling; the first five samples belong to one group and the last five to a separate group
Table 4.12 Example for row scaling
Table 4.13 How the data in Table 4.12 were simulated as discussed in the text
Table 4.14 Example for Mean Centring
Table 4.15 Standardising the data of Table 4.11
Table 4.16 Example for cluster analysis
Table 4.17 Correlation matrix.
Table 4.18 Euclidean distance matrix.
Table 4.19 Manhattan distance matrix.
Table 4.20 Nearest neighbour cluster analysis, using correlation coefficients for similarity measures, and data in Table 4.16
Chapter 5: Classification and Supervised Pattern Recognition
Table 5.1 Case study in Section 5.1.2: the data involve 20 samples in two classes (first 10 = class A, second 10 = class B) recorded using two variables
Table 5.2 Class distances for the data in Table 5.1 using EDC, LDA and QDA together with the predicted class memberships
Table 5.3 PLS-DA components of data in Table 5.1
Table 5.4 PLS-DA predictions of c for one-component and two-component models for the centred data in Table 5.4
Table 5.5 kNN for data in Table 5.1; the five nearest neighbours are listed and the assignments using k = 3 and k = 5
Table 5.6 QDA Mahalanobis distance to classes A and B for data in Table 5.1 together with the classification at a confidence limit of 90% (cut-off 2.146); shaded cells are outside the limits
Table 5.7 Class A model using one PC (centred) for SIMCA and data in Table 5.1
Table 5.8 Q and D one-PC class A models for data in Table 5.1
Table 5.9 Division into training and test set
Table 5.10 EDC model of data in Table 5.1 divided into training and test sets
Table 5.11 A simple contingency table
Table 5.12 A 2 × 2 contingency table
Table 5.13 Data set mentioned in Section 5.6
Chapter 6: Calibration
Table 6.1 Case study consisting of 25 spectra recorded at 27 wavelengths in nanometre, absorbances in AU
Table 6.2 Concentrations of the 10 PAHs in the data in Table 6.1
Table 6.3 Concentration of pyrene, absorbance at 335 nm and predictions of absorbance, using single-parameter classical calibration using method of Section 2.2.1
Table 6.4 Concentration of pyrene, absorbance at 335 nm and predictions of absorbance, using single-parameter inverse calibration using method of Section 2.2.2
Table 6.5 Matrices for four components
Table 6.6 Matrix B for Section 6.3.2
Table 6.7 Estimated concentration for four components as described in Section 6.3.2
Table 6.8 Estimated concentrations for the case study using uncentred MLR and all wavelengths
Table 6.9 Estimates for three PAHs using the full data set and MLR but including only three compounds in the model
Table 6.10 Scores of the first 10 PCs for PAH case study
Table 6.11 Vector r for pyrene
Table 6.12 Concentration estimates of the PAHs using PCR and 10 components (uncentred)
Table 6.13 Calculation of concentration estimates for pyrene using two PLS components
Table 6.14 Magnitudes of first 15 PLS1 components (centred data) for pyrene
Table 6.15 Concentration estimates of the PAHs using PLS1 and 10 components (centred)
Table 6.16 Concentration estimates of the PAHs using PLS2 and 10 components (centred)
Table 6.17 Three-way calibration data set
Table 6.18 Four methods of mean centring the data in Table 6.17, illustrated by the variable x_{i,1,1} as discussed in Section 6.5.3.1
Table 6.19 Calculation of three tri-linear PLS1 components for the data in Table 6.17 and residuals for sample 1
Table 6.20 Independent test set.
Chapter 7: Evolutionary Multivariate Signals
Table 7.1 Data set A
Table 7.2 Data set B
Table 7.3 Data set C
Table 7.4 Method for ranking variables using data set C
Table 7.5 Correlation coefficients for data set A between successive points (left-hand column) and between point 15 (right-hand column)
Table 7.6 Results of forward and backward EFA for the data set A
Table 7.7 Fixed sized window factor analysis applied to data set A using a three-point window
Table 7.8 Derivative calculation for determining purity of regions in data set A
Table 7.9 Estimated spectra obtained from the composition 1 regions in data set A
Table 7.10 Estimation of profiles using PCA for the data in Table 7.9
Table 7.11 Key steps in the calculation of rotation matrix for data set A using scores in composition 1 regions
Table 7.12 Determining spectrum and elution profiles of an embedded peak
Appendix
Table A.1 Cumulative standardised normal distribution
Table A.2 Critical values of χ2
Table A.3 Critical values of the two-tailed t-distribution
Table A.4 One-tailed critical values of the F-distribution at 1% level
Table A.5 One-tailed critical values of the F-distribution at 5% level
Richard G. Brereton
University of Bristol (Emeritus), UK
Second Edition
This edition first published 2018
© 2018 John Wiley & Sons Ltd
John Wiley & Sons Ltd (1e, 2009)
All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, except as permitted by law. Advice on how to obtain permission to reuse material from this title is available at http://www.wiley.com/go/permissions.
The right of Richard G. Brereton to be identified as the author of this work has been asserted in accordance with law.
Registered Office(s)
John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, USA
John Wiley & Sons Ltd, The Atrium, Southern Gate, Chichester, West Sussex, PO19 8SQ, UK
Editorial Office
The Atrium, Southern Gate, Chichester, West Sussex, PO19 8SQ, UK
For details of our global editorial offices, customer services, and more information about Wiley products visit us at www.wiley.com.
Wiley also publishes its books in a variety of electronic formats and by print-on-demand. Some content that appears in standard print versions of this book may not be available in other formats.
Limit of Liability/Disclaimer of Warranty
In view of ongoing research, equipment modifications, changes in governmental regulations, and the constant flow of information relating to the use of experimental reagents, equipment, and devices, the reader is urged to review and evaluate the information provided in the package insert or instructions for each chemical, piece of equipment, reagent, or device for, among other things, any changes in the instructions or indication of usage and for added warnings and precautions. While the publisher and authors have used their best efforts in preparing this work, they make no representations or warranties with respect to the accuracy or completeness of the contents of this work and specifically disclaim all warranties, including without limitation any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives, written sales materials or promotional statements for this work. The fact that an organization, website, or product is referred to in this work as a citation and/or potential source of further information does not mean that the publisher and authors endorse the information or services the organization, website, or product may provide or recommendations it may make. This work is sold with the understanding that the publisher is not engaged in rendering professional services. The advice and strategies contained herein may not be suitable for your situation. You should consult with a specialist where appropriate. Further, readers should be aware that websites listed in this work may have changed or disappeared between when this work was written and when it is read. Neither the publisher nor authors shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.
Library of Congress Cataloging-in-Publication Data:
Names: Brereton, Richard G., author.
Title: Chemometrics : data driven extraction for science / Richard G. Brereton.
Description: Second edition. | Hoboken, NJ : John Wiley & Sons, 2018. | Originally published in 2003 as: Chemometrics : data analysis for the laboratory and chemical plant.
Identifiers: LCCN 2017054468 (print) | LCCN 2017059486 (ebook) | ISBN 9781118904688 (epub) | ISBN 9781118904671 (pdf) | ISBN 9781118904664 (pbk.)
Subjects: LCSH: Chemometrics-Data processing. | Chemical processes-Statistical methods-Data processing.
Classification: LCC QD75.4.C45 (ebook) | LCC QD75.4.C45 B74 2018 (print) | DDC 543.01/5195-dc23
LC record available at https://lccn.loc.gov/2017054468
Cover design by Wiley
Cover images: (Background) © LiliKo/Gettyimages; (Diagram)/ Courtesy of Richard G. Brereton
The first edition of this book, with its special emphasis on numerical illustration of a wide range of chemometric methods, has been well received. Of particular importance were the problems at the end of each chapter, which readers could work through in their own favourite environment, such as Excel or Matlab, or indeed R, Python, Fortran or any number of other languages or computational packages. I have performed the calculations in both Matlab and Excel, but readers should not feel restricted if they prefer an alternative.
The reader of this book is likely to be an applied scientist or statistician who wishes to understand the basis and motivation of many of the main methods used in chemometrics.
Since the first edition, chemometrics has become much more widespread, including outside mainstream chemistry. In the early 2000s, the major applications were quantitative laboratory analytical science and chemical engineering, including process control. Over the past few years, application areas have broadened as large analytical laboratory-generated data sets have become more widely available, for example in metabolomics, heritage science and food science. This is reflected in the second edition's greater emphasis on pattern recognition, including practical case studies from metabolomics in the form of worked problem sets.
Despite this, many of the original building blocks of the subject remain unchanged. A factorial design and a principal component are still the same, so parts of the text involve only small changes from the first edition. Nevertheless, feedback from my students and co-workers, and from comments via the Internet, has provided valuable guidance as to what changes were desirable for a second edition. Important structural changes, such as multiple choice questions throughout the book and colour printing, update the original edition as a modern-day textbook.
Some major updates are as follows.
•
Short multiple choice questions at the end of every section of the main text.
•
Colour printing involving redrawing many figures.
•
New chapter on supervised pattern recognition (classification) involving enhanced discussions of SIMCA, PLS-DA, LDA, QDA, EDC, kNN as well as validation.
•
New case studies on NIR for distinguishing edible oils, and properties of elements, to illustrate unsupervised pattern recognition methods.
•
New case studies in metabolomics, including Arabidopsis genotyping by MS, Raman of cancerous lymph nodes and NMR for diagnosing diabetes, as new problem sets.
•
Additional description of MCR and ITTFA.
•
New and expanded discussions of wavelets and of Bayesian methods in signal analysis.
•
Updated descriptions of Matlab R2016a and Excel 2016, both under Windows 10, in the context of the needs of the chemometrician.
•
Enhanced discussion of the main statistical distributions.
•
Enhanced discussions on validation and optimisation, including description of the bootstrap and of performance indicators.
To supplement this book, all data sets, both from the main text and from the problems at the end of each chapter, are downloadable. In addition, there is a downloadable Excel add-in to perform most of the common multivariate methods, together with a macro for labelling graphs. Matlab routines corresponding to many of the main methods are also available, as are the answers to the problems at the end of each chapter. All of these can be found on the Wiley website associated with this book.
It is hoped that this text will be useful for students wishing to obtain a fundamental understanding of many chemometric methods. It will also be useful for any practising chemometrician who needs to work through methods they may have only recently encountered, using numerical examples: as a researcher, when I encounter an unfamiliar approach, I usually like to reproduce numerical results from published case studies to check how it works before I am confident enough to use the method. For people encountering chemometrics for the first time, for example in metabolomics and heritage science, this book presents many of the most widespread methods and so will serve as a good reference. As a refresher, the multiple choice questions test basic understanding, and the worked case studies can be collected together for use in courses.
Finally, I thank the publishers who have encouraged the development of this rather complex project, especially Jenny Cossham, through many stages and also colleagues who have provided data as listed in the acknowledgements.
Bristol, May 2017
Richard G. Brereton
This book is the product of several years of my own activities. First and foremost, the task of educating graduate students in my research group, who come from a large variety of backgrounds, over the past 10 years has been a significant formative experience, and it has allowed me to develop a large series of problems, which we set every three weeks, presenting the answers in seminars. From my experience, this is the best way to learn chemometrics! In addition, I have had the privilege to organise international quality courses, mainly for industrialists, with tutors drawn from many of the best organisations and institutes around the world, and I have learnt from them. Different approaches are normally needed for industrialists, who may be encountering chemometrics for the first time in mid-career and have only a few days to attend a condensed course, and for university students, who have several months or even years to practise and improve. It is hoped, however, that this book represents a symbiosis of both needs.
In addition, it has been a great inspiration for me to write a regular fortnightly column for Chemweb (available to all registered users on www.chemweb.com), and some of the material in this book is based on articles first available in this format. Chemweb brings a large reader base to chemometrics, and feedback via e-mail, and even travel around the world, has helped me formulate my ideas. There is a very wide interest in this subject, but it is somewhat fragmented. For example, there is a strong group of near-infrared spectroscopists, primarily in the USA, who have led the application of advanced ideas in process monitoring and who see chemometrics as a quite technical, industrially oriented subject. Other groups of mainstream chemists see chemometrics as applicable to almost all branches of research, ranging from kinetics to titrations to synthesis optimisation. Satisfying all these diverse people is not an easy task.
This book relies mainly on numerical examples: many in the body of the text come from my own research interests, primarily in analytical chromatography and spectroscopy; to expand the examples to cover every application area would produce a huge book of twice the size, so I ask the indulgence of readers whose area of application differs. Certain chapters, such as those on calibration, could be approached from widely different viewpoints, but the methodological principles are the most important: if you understand how the ideas can be applied in one area, you will be able to translate them to your own favourite application. In the problems at the end of each chapter, I cover a wider range of applications to illustrate the broad basis of these methods. The emphasis of this book is on understanding ideas, which can then be applied to a wide variety of problems in chemistry, chemical engineering and allied disciplines.
It is difficult to select what material to include in this book without making it too long. Every expert I have shown this book to has made suggestions for new material. Some I have taken into account, and I am most grateful for every proposal; others I have mentioned briefly or not at all, mainly for reasons of length, and also to ensure that this book sees the light of day rather than expanding without end. There are many outstanding specialist books for the enthusiast. It is my experience, though, that if you understand the main principles (which are quite few in number) and constantly apply them to a variety of problems, you will soon pick up the more advanced techniques; it is the building blocks that are most important.
In a book of this nature, it is very difficult to decide how much detail is required for the various algorithms: some readers will have no real interest in the algorithms, whereas others will feel the text is incomplete without comprehensive descriptions. The main algorithms for common chemometric methods are presented in Appendix A.2; step-by-step descriptions of methods, rather than algorithms, are presented in the text. A few approaches that will interest some readers, such as cross-validation in PLS, are described in the problems at the end of the appropriate chapters, which supplement the text. Readers will approach this book with different levels of knowledge and expectation, so it is possible to gain a great deal without an in-depth appreciation of computational algorithms; for interested readers, the information is nevertheless available. People rarely read texts in a linear fashion; they often dip in and out of parts according to their background and aspirations, and chemometrics is a subject that people approach with very different prior knowledge and skills, so it is possible to gain from this book without covering every topic in full. Many readers will simply use add-ins or Matlab commands and will be able to produce all the results in this text.
Chemometrics uses a very large variety of software. In this book, we recommend two main environments, Excel and Matlab; the examples have been tried in both, and you should be able to get the same answers in each case. Users of this book will vary from people who simply want to plug data into existing packages to those who are curious and want to reproduce the methods in their own favourite language, such as Matlab, VBA or even C. In some cases, instructors may use the information available with this book to tailor examples for problem classes. Extra software supplements are available via the publishers' website www.SpectroscopyNOW.com, together with all the data sets in this book.
The problems at the end of each chapter form an important part of the text, the examples being a mixture of simulations (which have an important role in chemometrics) and real case studies from a wide variety of sources. For each problem, the relevant sections of the text that provide further information are referenced. A few problems, however, build on the existing material and take the reader further: a good chemometrician should be able to use the basic building blocks to understand and use new methods. The problems are of various types; thus, not every reader will want to solve all of them. In addition, instructors can use the data sets to construct workshops or course material that goes further than the book.
I am very grateful for the tremendous support I have had from many people when asking for information and help with data sets and permission where required. I thank Chemweb for agreement to present material modified from articles originally published in their e-zine, The Alchemist, and the RSC for permission to base the text of Chapter 5 on material originally published in the Analyst (125, 2125–2154 (2000)). A full list of acknowledgements for the data sets used in this text is presented after this foreword.
I thank Tom Thurston and Les Erskine for a superb job on the Excel add-in, and Hailin Shen for outstanding help in Matlab. Numerous people have tested the answers to the problems. Special mention should be given to Christian Airiau, Kostas Zissis, Tom Thurston, Conrad Bessant and Cevdet Demir for providing a comprehensive set of answers on disc for a large number of exercises so that I could check my own. In addition, several people have read chapters and made detailed comments, particularly checking numerical examples; in particular, I thank Hailin Shen for suggestions about improving Chapter 6 and Mohammed Wasim for careful checking of errors. In some ways, the best critics are the students and postdocs working with me, because they are the people who have to read and understand a book of this nature, and it gives me great confidence that my co-workers in Bristol have found this approach useful and have been able to learn from the examples.
Finally, I thank the publishers for taking the germ of an idea and making valuable suggestions as to how it could be expanded and improved to produce what I hope is a successful textbook, and for having faith and patience over a protracted period.
Bristol, February 2002
Richard G. Brereton
The following have provided me with sources of data for this text. All other case studies are simulations.
Data set
Source
Problem 2.2
A. Nordin, L. Eriksson, M. Öhman,
Fuel
, 74, 128–135 (1995)
Problem 2.6
G. Drava, University of Genova
Problem 2.7
I.B. Rubin, T.J. Mitchell, G. Goldstein,
Anal Chem
, 43, 717–721 (1971)
Problem 2.10
G. Drava, University of Genova
Problem 2.11
Y. Yifeng, S. Dianpeng, H. Xuebing, W. Shulan,
Bull Chem Soc Japan
, 68, 1115–1118 (1995)
Problem 2.12
D.V. McCalley, University of West of England, Bristol
Problem 2.15
D. Vojnovic, B. Campisi, A. Mattei, L. Favreto,
Chemometrics Intell Lab Systems
, 27, 205–219 (1995)
Problem 2.16
L.E. Garcia-Ayuso, M.D. Luque de Castro,
Anal Chim Acta
, 382, 309–316 (1999)
Problem 3.8
K.D. Zissis, University of Bristol
Problem 3.9
C. Airiau, University of Bristol
Table 4.1
S. Dunkerley, University of Bristol
Table 4.2
S. Goswami and K. Olafsson, Camo ASA
Table 4.3
A. Javey, Chemometrics On-line
Problem 4.3
D. Duewer, National Institute of Standards Technology, US
Problem 4.5
S. Dunkerley, University of Bristol
Problem 5.3
S. Wold, University of Umeå (based on R. Cole and K. Phelps,
J Sci Food Agric
, 30, 669–676 (1979))
Problem 5.4
P. Bruno, M. Caselli, M.L. Curri, A. Genga, R. Striccoli, A. Traini,
Anal Chim Acta
, 410, 193–202 (2000)
Problem 5.5
R. Vendrame, R.S. Braga, Y. Takahata, D.S. Galvão,
J Chem Inf Comp Sci
, 39, 1094–1104 (1999)
Problem 5.7
R. Goodacre, University of Manchester (based on M. Kusano, A. Fukushima, M. Arita, P. Jonsson, T. Moritz, M. Kobayashi, et al.,
BMC System Biology
, 1, 53 (2007) – Metabolights accession MTBLS40)
Problem 5.8
R. Goodacre, University of Manchester (based on R.M. Salek, M.L. Maguire, E. Bentley, D.V. Rubtsov, T. Hough, M. Cheeseman, et al.,
Physiol Genomics
, 29, 99–108 (2007) – Metabolights accession MTBLS1)
Problem 5.9
G.R. Lloyd (based on G.R. Lloyd, L.E. Orr, J. Christie-Brown et al.,
Analyst
, 138, 3900–3908 (2013))
Table 6.1
S.D. Wilkes, University of Bristol
Table 6.20
S.D. Wilkes, University of Bristol
Problem 6.1
M.C. Pietrogrande, F. Dondi, P.A. Borea, C. Bighi,
Chemometrics Intell Lab Systems
, 5, 257–262 (1989)
Problem 6.3
H. Martens, M. Martens,
Multivariate Analysis of Quality
, Wiley, Chichester, 2001, p. 14
Problem 6.6
P.M. Vacas, University of Bristol
Problem 6.9
K.D. Zissis, University of Bristol
Problem 7.1
S. Dunkerley, University of Bristol
Problem 7.3
S. Dunkerley, University of Bristol
Problem 7.5
R. Tauler, University of Barcelona (results published in R. Gargallo, R. Tauler, A. Izquierdo-Ridorsa,
Quimica Analitica
, 18, 117–120)
Problem 7.6
S.P. Gurden, University of Bristol
Do not forget to visit the companion website for this book:
http://booksupport.wiley.com
The accompanying website for this text, http://booksupport.wiley.com, provides valuable material designed to enhance your learning, including:
•
Answers to problems at the end of each chapter
•
Software
•
Associated data sets
•
Figures in PPT
There are many opinions about the origin of chemometrics. Until quite recently, the birth of chemometrics was considered to have happened in the 1970s. Its name first appeared in 1972 in an article by Svante Wold [1]: in fact, the topic of that article was not one we would recognise as being core to chemometrics, being relevant to neither multivariate analysis nor experimental design. For over a decade, the word chemometrics had a very low profile, and it developed a recognisable presence only in the 1980s, as described below.
However, if an explorer describes a new species in a forest, the species was there long before the explorer. Thus, the naming of the discipline merely recognises that it had reached some level of visibility and maturity. As people re-evaluate the origins of chemometrics, its birth can be traced back many years.
Chemometrics burst into the world due to three fundamental factors: applied statistics (multivariate and experimental design), statistics in analytical and physical chemistry, and scientific computing.
The ideas of multivariate statistics have been around for a long time. R.A. Fisher and colleagues working at Rothamsted, UK, formalised many of our modern ideas while applying them primarily to agriculture. In the UK before the First World War, many of the upper classes owned extensive land and relied for their income on tenant farmers and agricultural labourers. After the First World War, the cost of labour rose, with many workers moving to the cities, and there was stronger competition from globally imported food. Historic agricultural practices came to be seen as inefficient, and it was hard for landowners (or the companies that took over large estates) to remain economic and competitive; hence there was a huge emphasis on agricultural research, including statistics, to improve them. R.A. Fisher and co-workers published some of the first major books and papers that we would regard as defining modern statistical thinking [2, 3], introducing ideas ranging from the null hypothesis to discriminant analysis to ANOVA. Some of Fisher's work followed from the pioneering work of Karl Pearson at University College London, who had earlier founded the world's first statistics department and had first formulated ideas such as p values and correlation coefficients.
During the 1920s and 1930s, a number of important pioneers of multivariate statistics published their work, many strongly influenced by, or having worked with, Fisher. They include Harold Hotelling, credited by many with defining principal components analysis (PCA) [4], although Pearson had independently described this method some 30 years earlier, under a different guise. As so often in science, ideas are reported several times over, and it is the person who names and popularises a method who often gets the credit: in the early twentieth century, libraries were often localised, there were very few international journals (Hotelling working mainly in the US) and certainly no Internet, so parallel work was often reported.
The principles of statistical experimental design were also formulated at around this period. There had been earlier reports of what we would regard as modern approaches to formal designs, for example James Lind's work on scurvy in the eighteenth century and Charles Peirce's discussion of randomised trials in the nineteenth century, but Fisher's classic work of the 1930s put all the concepts together in a rigorous statistical format [5].
Much non-Bayesian applied statistical thinking has, for nearly a century, been based on principles established in the 1920s and 1930s. Early applications included agriculture, psychology, finance and genetics. After the Second World War, the chemical industry took an interest: in the 1920s, an important need had been to improve agricultural practice, but by the 1950s a major need was to improve manufacturing processes, especially in chemical engineering, and hence many more statisticians were employed within industry. O.L. Davies edited an important book on experimental design with contributions from colleagues at ICI [6]. Foremost was G.E.P. Box, son-in-law of Fisher, whose book with colleagues is one of the most important post-war classics of experimental design and multi-linear regression [7].
These statistical building blocks were already mature by the time people started calling themselves chemometricians and have changed only a little during the intervening period.
