Chemometrics

Richard G. Brereton

Description

A new, full-color, completely updated edition of the key practical guide to chemometrics

This new edition of the practical guide to chemometrics emphasizes the principles and applications behind the main ideas in the field, using numerical and graphical examples that can be applied to a wide variety of problems in chemistry, biology, chemical engineering, and allied disciplines. Presented in full color, it features expanded sections on principal component analysis, classification, multivariate evolutionary signals, and statistical distributions, new case studies in metabolomics, and extensive updates throughout. Aimed at the large number of users of chemometrics, it includes extensive worked problems and chapters explaining how to analyze datasets, in addition to updated descriptions of how to apply Excel and Matlab to chemometrics.

Chemometrics: Data Driven Extraction for Science, Second Edition offers chapters covering experimental design, signal processing, pattern recognition, calibration, and evolutionary data. The pattern recognition chapter from the first edition is divided into two separate chapters: Principal Component Analysis/Cluster Analysis, and Classification. It also includes new descriptions of Alternating Least Squares (ALS) and Iterative Target Transformation Factor Analysis (ITTFA), as well as updated descriptions of wavelets and Bayesian methods.

  • Includes updated chapters on the classic chemometric methods (e.g. experimental design, signal processing, etc.)
  • Introduces metabolomics-type examples alongside those from analytical chemistry
  • Features problems at the end of each chapter to illustrate the broad applicability of the methods in different fields
  • Supplemented with data sets and solutions to the problems on a dedicated website, www.booksupport.wiley.com

Chemometrics: Data Driven Extraction for Science, Second Edition is recommended for post-graduate students of chemometrics as well as applied scientists (e.g. chemists, biochemists, engineers, statisticians) working in all areas of data analysis.


Page count: 1064

Publication year: 2018




Table of Contents

Cover

Title Page

Copyright

Preface to Second Edition

Preface to First Edition

Acknowledgements

About the Companion Website

Chapter 1: Introduction

1.1 Historical Parentage

1.2 Developments since the 1970s

1.3 Software and Calculations

1.4 Further Reading

References

Chapter 2: Experimental Design

2.1 Introduction

2.2 Basic Principles

2.3 Factorial Designs

2.4 Central Composite or Response Surface Designs

2.5 Mixture Designs

2.6 Simplex Optimisation

Problems

Chapter 3: Signal Processing

3.1 Introduction

3.2 Basics

3.3 Linear Filters

3.4 Correlograms and Time Series Analysis

3.5 Fourier Transform Techniques

3.6 Additional Methods

Problems

Chapter 4: Principal Component Analysis and Unsupervised Pattern Recognition

4.1 Introduction

4.2 The Concept and Need for Principal Components Analysis

4.3 Principal Components Analysis: The Method

4.4 Factor Analysis

4.5 Graphical Representation of Scores and Loadings

4.6 Pre-processing

4.7 Comparing Multivariate Patterns

4.8 Unsupervised Pattern Recognition: Cluster Analysis

4.9 Multi-way Pattern Recognition

Problems

Chapter 5: Classification and Supervised Pattern Recognition

5.1 Introduction

5.2 Two-Class Classifiers

5.3 One-Class Classifiers

5.4 Multi-Class Classifiers

5.5 Optimisation and Validation

5.6 Significant Variables

Problems

Chapter 6: Calibration

6.1 Introduction

6.2 Univariate Calibration

6.3 Multiple Linear Regression

6.4 Principal Components Regression

6.5 Partial Least Squares Regression

6.6 Model Validation and Optimisation

Problems

Chapter 7: Evolutionary Multivariate Signals

7.1 Introduction

7.2 Exploratory Data Analysis and Pre-processing

7.3 Determining Composition

7.4 Resolution

Problems

Appendix

A.1 Vectors and Matrices

A.2 Algorithms

A.3 Basic Statistical Concepts

A.4 Excel for Chemometrics

A.5 Matlab for Chemometrics

Answers to the Multiple Choice Questions

Index

End User License Agreement


List of Illustrations

Chapter 2: Experimental Design

Figure 2.1 Yield of a reaction as a function of pH and catalyst concentration.

Figure 2.2 Cross-section through surface in Figure 2.1 at 2 mM catalyst concentration.

Figure 2.3 Cross-section through surface in Figure 2.1 at pH 3.4.

Figure 2.4 Choice of nine molecules based on two properties.

Figure 2.5 Graph of spectroscopic peak height against concentration at five concentrations.

Figure 2.6 Experiment with high instrumental errors.

Figure 2.7 Experiment with low instrumental errors.

Figure 2.8 Degree-of-freedom tree.

Figure 2.9 Graph of peak height against concentration for ANOVA example, data set A.

Figure 2.10 Graph of peak height against concentration for ANOVA example, data set B.

Figure 2.11 Design matrix.

Figure 2.12 Relationship between response, design matrix and coefficients.

Figure 2.13 Graph of estimated response versus pH at the central temperature and concentration of the design in Table 4.6.

Figure 2.14 Seven lines, equally spaced in area, dividing the normal distribution into eight regions, including six central regions and two extreme regions whose summed area equals those of the central regions.

Figure 2.15 Normal probability plot for data in Table 2.13 with significant factors marked.

Figure 2.16 Method of calculating the equation for the leverage term for a coefficient: sum the shaded areas.

Figure 2.17 Graph of leverage for designs in Table 2.15, from top to bottom, designs A, B and C.

Figure 2.18 Two-factor design consisting of five experiments.

Figure 2.19 Two experimental arrangements together with the corresponding leverage for a linear model.

Figure 2.20 Graph of levels of one term against another in the design in Table 2.18.

Figure 2.21 Three- and four-level full factorial designs.

Figure 2.22 Representation of a three-factor, two-level design.

Figure 2.23 Fractional factorial design.

Figure 2.24 Poorly designed calibration experiment.

Figure 2.25 Well-designed calibration experiment.

Figure 2.26 Cyclic permuter.

Figure 2.27 Graph of factor levels for design in Table 2.29: top factors 1 versus 2, bottom factors 1 versus 7.

Figure 2.28 Elements of a central composite design: each axis represents a factor.

Figure 2.29 Degrees of freedom for central composite design.

Figure 2.30 Three-component mixture space.

Figure 2.31 Simplex in one, two and three dimensions.

Figure 2.32 Three-component simplex centroid design.

Figure 2.33 Three-component simplex lattice design.

Figure 2.34 Four situations encountered in constrained mixture designs. (a) Lower bounds defined, (b) upper bounds defined, (c) upper and lower bounds defined, fourth factor as filler and (d) upper and lower bounds defined.

Figure 2.35 Mixture design with process variables.

Figure 2.36 Initial experiments (a, b and c) on the edge of a simplex: two factors and the new conditions if an experiment results in the worst response.

Figure 2.37 Progress of a fixed sized simplex.

Figure 2.38 Modified simplex; the original simplex is indicated in bold, with the responses ordered from 1 (worst) to 3 (best). The test conditions are indicated.

Chapter 3: Signal Processing

Figure 3.1 Main parameters that characterise a symmetric peak.

Figure 3.2 Main parameters that characterise an asymmetric peak.

Figure 3.3 Gaussian and Lorentzian peak shapes of equal half heights.

Figure 3.4 Asymmetric peak shapes often described by a Gaussian/Lorentzian model. (a) Tailing: left is Gaussian and right is Lorentzian (b) Fronting: left is Lorentzian and right is Gaussian.

Figure 3.5 Three peaks forming a cluster.

Figure 3.6 Influence on the appearance of a peak as digital resolution is reduced corresponding to Table 3.1.

Figure 3.7 Examples of noise. From top to bottom: underlying signal, homoscedastic and heteroscedastic.

Figure 3.8 Selection of points to be used in a three-point moving average filter.

Figure 3.9 Filtering of data. (a) Raw data, (b) moving average filters, (c) quadratic/cubic Savitzky–Golay filters.

Figure 3.10 Comparison of moving average and running median smoothing.

Figure 3.11 A Gaussian together with its first and second derivative.

Figure 3.12 Two closely overlapping peaks together with their first and second derivatives.

Figure 3.13 From top to bottom: convolution functions for a three-point moving average, a Hanning window and a five-point Savitzky–Golay quadratic second-derivative window.

Figure 3.14 A time series.

Figure 3.15 Auto-correlogram of the data in Figure 3.14.

Figure 3.16 Two time series (a,b) and their corresponding cross-correlogram (c).

Figure 3.17 Fourier transformation from a time domain to a frequency domain.

Figure 3.18 Typical time series consisting of several components.

Figure 3.19 Transformation of a real time series to real and imaginary pairs.

Figure 3.20 Fourier transform of a spike.

Figure 3.21 Absorption and dispersion line shapes.

Figure 3.22 Illustration of phase errors (time series (a–d) and real transform (e–h)).

Figure 3.23 A sparsely sampled time series sampled at the Nyquist frequency. Blue: underlying time series; red: observed time series if sparsely sampled.

Figure 3.24 Fourier transformation of a rapidly decaying time series.

Figure 3.25 Result of multiplying the time series in Figure 3.24 by a positive exponential, the transform of the original time series being represented by a dotted blue line.

Figure 3.26 Result of multiplying a noisy time series by a positive exponential and transforming the new signal.

Figure 3.27 Multiplying the data in Figure 3.25 by a double exponential.

Figure 3.28 Use of a double exponential filter.

Figure 3.29 Fourier self-deconvolution of a peak cluster.

Figure 3.30 Progress of the Kalman filter, showing the fitted and raw data.

Figure 3.31 Change in the three coefficients predicted by the Kalman filter with time.

Figure 3.32 Raw data and wavelet filtered data in Table 3.11.

Figure 3.33 Haar wavelets of levels 1 and 2 corresponding to data in Table 3.11.

Figure 3.34 Frequency distribution for the toss of a die.

Figure 3.35 Another, but less likely, frequency distribution for toss of a die.

Chapter 4: Principal Component Analysis and Unsupervised Pattern Recognition

Figure 4.1 Factor analysis in psychology.

Figure 4.2 Matrix representation of results from a metabolomics experiment.

Figure 4.3 Case study 1: chromatographic peak profiles, involving summing intensities of the data from Table 4.1 over all wavelengths.

Figure 4.4 Case study 2: superimposed NIR spectra corresponding to the data in Table 4.2.

Figure 4.5 Principal components analysis.

Figure 4.6 PCA as a form of variable reduction.

Figure 4.7 Graph of log of PRESS (top) and RSS (bottom) for data set in Table 4.8.

Figure 4.8 Relationship between PCA and factor analysis in coupled chromatography.

Figure 4.9 Plot of scores of PC2 versus PC1 for case study 2.

Figure 4.10 3D plot for the scores of case study 2.

Figure 4.11 1D plot of the scores of PCs 1 and 2 for case study 1.

Figure 4.12 Scores of PC2 (vertical axis) versus PC1 (horizontal axis) for case study 1.

Figure 4.13 Scores of the first two PCs of case study 1 versus sample number.

Figure 4.14 Scores of principal component 2 (vertical axis) versus principal component 1 (horizontal axis) for the standardised data of case study 3.

Figure 4.15 Loadings plot of PC2 (vertical axis) against PC1 (horizontal axis) for case study 1, with wavelengths indicated in nanometre.

Figure 4.16 Pure spectra of compounds in case study 1.

Figure 4.17 Loadings of PC2 versus PC1 for case study 2.

Figure 4.18 Loadings of the first two PCs against wavelength for case study 2.

Figure 4.19 Loadings of principal component 2 versus principal component 1 for the standardised data of case study 3.

Figure 4.20 Scores of the first two PCs of the data in Table 4.11, (a) raw data (b) log scaled data.

Figure 4.21 Scores of the first two PCs of the data in Table 4.12 (a) raw data (b) row scaled data to constant total.

Figure 4.22 PC scores plot of PC2 versus PC1 for raw data of Table 4.12.

Figure 4.23 PC scores plot of PC2 versus PC1 for data after centring of Table 4.12.

Figure 4.24 Scores plot of PC2 versus PC1 for case study 1 after centring.

Figure 4.25 Plot of the scores of the first two PCs of the standardised data in Table 4.15.

Figure 4.26 Biplot of scores of the first two PCs of case study 3.

Figure 4.27 (a) Euclidean and (b) Manhattan distances.

Figure 4.28 Dendrogram for cluster analysis example.

Figure 4.29 Two-way and three-way data.

Figure 4.30 Possible method of arranging environmental sampling data.

Figure 4.31 Tucker3 decomposition.

Figure 4.32 Parallel factor analysis (PARAFAC).

Figure 4.33 Unfolding.

Chapter 5: Classification and Supervised Pattern Recognition

Figure 5.1 Data set in Table 5.1.

Figure 5.2 Two-class classifiers. (a) Linearly separable classes. (b) Linear inseparable classes.

Figure 5.3 A class distance plot. Top illustrates two classes, with their centroids marked by crosses. A sample is indicated, with its distances to the centroids of the blue and red classes. Bottom projects onto a class distance plot, with the specific sample noted.

Figure 5.4 Boundaries between groups A and B in Table 5.1, (a) EDC, (b) LDA and (c) QDA together with equidistant contours from the centroids for each criterion.

Figure 5.5 kNN boundaries for data set in Table 5.1; (a) k = 3 and (b) k = 5.

Figure 5.6 Appearance of kNN boundaries if the distance of a sample to itself is excluded for k = 3 and data in Table 5.1; sample 4 and its three neighbours marked.

Figure 5.7 One-class classifiers: (a) separable class; (b) classes with ambiguous and outlying samples.

Figure 5.8 Typical Gaussian density estimator for a data set characterised by two variables, with contour lines at different levels of certainty indicated.

Figure 5.9 QDA one-class boundaries using 90% confidence (p = 0.1) for data in Table 5.1.

Figure 5.10 Principles of disjoint PC models.

Figure 5.11 Class A disjoint model for PC1 for data set in Table 5.1, centred according to class A.

Figure 5.12 Multi-class classifiers.

Figure 5.13 PLS1 multi-class models.

Figure 5.14 Division of data into training and test sets.

Figure 5.15 Two different seating plans.

Figure 5.16 Dividing the data in Figure 5.15 into training and test sets.

Figure 5.17 Monte Carlo methods: bars represent frequency of results of several permutations for the %CC, whereas the red line represents the unpermuted data.

Figure 5.18 Typical bootstrap sampling.

Figure 5.19 Division of data into test set and bootstrap test set and a typical iterative approach.

Figure 5.20 Distribution of Heads if an unbiased coin is tossed 10 times.

Figure 5.21 PLS-DA scores and loadings of component 1 for the standardised data in Table 5.13.

Figure 5.22 Values of t for the 10 variables in Table 5.13.

Chapter 6: Calibration

Figure 6.1 Different notations for calibration and experimental design as used in this book.

Figure 6.2 Absorbance at 335 nm for the PAH case study plotted against concentration of pyrene.

Figure 6.3 Spectra of pure standards, digitised at 5 nm intervals, pyrene indicated in bold.

Figure 6.4 Difference between errors in (a) classical and (b) inverse calibration.

Figure 6.5 Best-fit straight lines for classical and inverse calibration, data for pyrene at 335 nm, no intercept, forcing the model through the origin.

Figure 6.6 Best-fit straight line using inverse calibration and an intercept term.

Figure 6.7 Predicted (vertical) versus known (horizontal) concentrations using methods of Section 6.2.3.

Figure 6.8 Absorbances of Pyr, Fluor, Benz and Ace between 330 and 345 nm.

Figure 6.9 Predicted versus known concentration of pyrene, using a four-component model and the wavelengths 330, 335, 340 and 345 nm (uncentred).

Figure 6.10 Spectra of the 10 PAHs estimated by MLR, with pyrene indicated in bold.

Figure 6.11 Root mean square errors of estimation of pyrene using uncentred PCR between 1 and 15 PCs.

Figure 6.12 Principles of PLS1.

Figure 6.13 Root mean square errors in x and c blocks, PLS1 centred and pyrene, using between 1 and 15 PCs.

Figure 6.14 Residual errors in x and c blocks, PLS1 centred and acenaphthene.

Figure 6.15 Principles of PLS2.

Figure 6.16 Unfolding a data matrix.

Figure 6.17 Representation of tri-linear PLS1.

Figure 6.18 Matricisation in three-way calibration (x block only illustrated).

Figure 6.19 RMSEC auto-predictive errors for acenaphthylene using PLS1.

Figure 6.20 RMSECV for acenaphthylene using PLS1.

Figure 6.21 RMSEP using data in Table 6.1 as a training set and data in Table 6.20 as a test set, PLS1 (centred) and acenaphthylene.

Figure 6.22 RMSEP using data in Table 6.20 as a training set and data in Table 6.1 as a test set, PLS1 (centred) and acenaphthylene.

Chapter 7: Evolutionary Multivariate Signals

Figure 7.1 Sequential multivariate data matrix.

Figure 7.2 Three possible sequential patterns that would be treated identically using standard multivariate techniques.

Figure 7.3 Dividing data into regions before baseline correction.

Figure 7.4 Profile of data from data set A.

Figure 7.5 Scores and loadings plots of raw data from data set A for PC2 versus PC1.

Figure 7.6 Profile of data set B.

Figure 7.7 Scores and loadings plots of data set B from Table 7.2 for PC2 versus PC1.

Figure 7.8 Three-dimensional projections of (a) scores (top) and (b) loadings (bottom) for data set A.

Figure 7.9 Three-dimensional projections of (a) scores (top) and (b) loadings (bottom) for data set B.

Figure 7.10 Scores plots of data set A with each PC normalised.

Figure 7.11 Scores plots of data set A, each row summed to a constant total. (a) Entire data set, (b) expansion of region data points 5–19 and (c) performing the scaling and then PCA exclusively over points 5–19.

Figure 7.12 Scores plot of data set B with rows summed to a constant total between data points 5 and 20 and three main directions indicated. (a) Two PCs and (b) three PCs.

Figure 7.13 Scores and loadings after data set B has been standardised.

Figure 7.14 Intensity profile and unscaled scores and loadings from data set C in Table 7.3.

Figure 7.15 Scores and loadings after the data set C in Table 7.3 has been standardised.

Figure 7.16 Scores and loadings of the ranked data in Table 7.4.

Figure 7.17 Optimum size for variable reduction.

Figure 7.18 Different types of problems in chromatography.

Figure 7.19 Ratios of peak intensities for the case studies (a)–(d) assuming ideal peak shapes and peaks detectable over an indefinite region.

Figure 7.20 Regions of chromatogram (a) in Figure 7.18. Region a is where the ratio of the two components is between 50:1 and 1:50, and region b is where the overall intensity is more than 1% of the maximum intensity.

Figure 7.21 Ratio of intensity of measurements D to F for data set A. (a) Raw information, (b) logarithmic scale between points 5 and 18 and (c) the minimum of the ratio of intensity D:F and F:D between points 5 and 18.

Figure 7.22 Intensities for wavelengths C and G using data of data set A summing the measurements at each successive point to constant total of 1.

Figure 7.23 Graph of correlation between successive points in the data of data set A.

Figure 7.24 Correlation between point 15 and the data of data set A.

Figure 7.25 Graph corresponding to that of Figure 7.23 for data set B in Table 7.2.

Figure 7.26 Forward and backward EFA plots of the first three eigenvalues from data set A.

Figure 7.27 Three-point FSW graph for data set A.

Figure 7.28 Derivative purity plot for data set A with purest points indicated.

Figure 7.29 Composition of regions in chromatogram deriving from data set A.

Figure 7.30 Profiles of variables C and F in Table 7.3.

Figure 7.31 Reconstructed profiles for data set A using MLR.

Figure 7.32 Profiles obtained as described in Section 7.4.1.3.

Figure 7.33 Profiles of three peaks obtained as in Section 7.4.2.

Appendix

Figure A.1 Changing to numeric cell addresses.

Figure A.2 The range A2:C3.

Figure A.3 The operation =AVERAGE(A1:B5,C8,B9:D11).

Figure A.4 Dragging a cell so that the reference is invariant.

Figure A.5 Naming a range.

Figure A.6 Matrix multiplication in Excel.

Figure A.7 Matrix transpose in Excel.

Figure A.8 Matrix inverse in Excel.

Figure A.9 Pseudo-inverse of a matrix.

Figure A.10 Correlation between two ranges.

Figure A.11 Finding the slope and intercept when fitting a linear model to two ranges.

Figure A.12 Use of IF in Excel.

Figure A.13 Finding the Analysis Toolpak.

Figure A.14 Data Analysis Add-in dialog box.

Figure A.15 Linear regression using the Excel Data Analysis Add-in.

Figure A.16 Generating random numbers in Excel.

Figure A.17 Adding an extra series in Excel.

Figure A.18 Finalised chart from Excel.

Figure A.19 Labelling a graph in Excel.

Figure A.20 Setup screen for the Excel chemometrics add-in.

Figure A.21 Selecting the Multivariate Analysis Add-in

Figure A.22 Multivariate analysis dialog box.

Figure A.23 PCA dialog box.

Figure A.24 PCR dialog box.

Figure A.25 PLS dialog box.

Figure A.26 MLR dialog box.

Figure A.27 Default Matlab window.

Figure A.28 File and array listing in Matlab.

Figure A.29 Running an m file script in Matlab.

Figure A.30 Running an m file function in Matlab.

Figure A.31 Obtaining vectors from matrices.

Figure A.32 Simple matrix operations in Matlab.

Figure A.33 Calculating a pseudo-inverse in Matlab.

Figure A.34 Mean function in Matlab.

Figure A.35 Calculating standard deviations in Matlab: the second calculation is preferred for most chemometric calculations where the aim is to scale a matrix.

Figure A.36 Mean centring a matrix in Matlab.

Figure A.37 Importing from Excel to Matlab.

Figure A.38 A simple loop used for mean centring.

Figure A.39 Blank Figure window.

Figure A.40 Use of hold on.

Figure A.41 Use of multiple plot facility.

Figure A.42 Use of specifiers to change the properties of a graph in Matlab.

Figure A.43 Use of axis square statement to view correct angles between vectors.

Figure A.44 Matlab Property Editor.

Figure A.45 Use of text command in Matlab.

Figure A.46 A 3D scores plot.

Figure A.47 Using the rotation icon to obtain a better view.

Figure A.48 Changing the appearance of the 3D plot.

Figure A.49 Loadings plot with identical orientation to the scores plot, labelled and copied into Word.

List of Tables

Chapter 2: Experimental Design

Table 2.1 Three experimental designs

Table 2.2 Numerical information for data sets A and B

Table 2.3 Calculation of errors for data set A, model including intercept

Table 2.4 Error analysis for data sets A and B

Table 2.5 ANOVA table: two-parameter model, data set B

Table 2.6 Typical experimental design

Table 2.7 Design matrix for the experiment in Table 2.6 using the model discussed in Section 2.2.3.1

Table 2.8 The vectors b and ŷ for data in Table 2.6

Table 2.9 Coding of data

Table 2.10 Coded design matrix together with estimated values of coded coefficients

Table 2.11 Calculation of t-statistic

Table 2.12 F-ratio for experiment with low experimental error

Table 2.13 Normal probability calculation

Table 2.14 Leverage values for a two-factor design and a model of the form

Table 2.15 Leverage for three possible single-variable designs using a two-parameter linear model

Table 2.16 Coding of a simple two factor, two level design and corresponding responses

Table 2.17 Design matrix

Table 2.18 Four-factor, two-level full factorial design

Table 2.19 Correlated factors

Table 2.20 Full factorial designs corresponding to Figure 2.21

Table 2.21 Full factorial design for three factors together with the design matrix

Table 2.22 Fractional factorial design

Table 2.23 Confounding factor 5 with the product of factors 1–4

Table 2.24 Confounding interaction terms in design in Table 2.23

Table 2.25 Quarter factorial design

Table 2.26 A Plackett–Burman design for 11 factors, generator outlined by a box

Table 2.27 Generators for Plackett–Burman design, first row is at − level

Table 2.28 Equivalence of Plackett–Burman and fractional factorial designs for seven factors, the arrows showing how the rows are related

Table 2.29 Development of a multi-level partial factorial design

Table 2.30 Parameters for construction of a multi-level calibration design

Table 2.31 Construction of a central composite design

Table 2.32 Three possible two-factor central composite designs

Table 2.33 Position of the axial points for rotatability and orthogonality for central composite designs with varying number of replicates (one less than the number of central points)

Table 2.34 Three-component simplex centroid mixture design

Table 2.36 A {5,2} simplex centroid design

Table 2.37 Two-component simplex lattice design

Table 2.38 Number of experiments required for various simplex lattice designs, with different numbers of components and interactions

Table 2.39 Constrained mixture design with three lower bounds

Table 2.41 Example of simultaneous constraints in mixture designs

Table 2.42 Constrained mixture design where both upper and lower limits are known in advance

Chapter 3: Signal Processing

Table 3.1 Reducing digital resolution

Table 3.2 Stationary and moving average noise

Table 3.3 Savitzky–Golay coefficients c_{i+j} for smoothing

Table 3.4 Results of various filters on a data set

Table 3.5 A sequential process: illustration of moving average and median smoothing

Table 3.6 Savitzky–Golay coefficients for derivatives

Table 3.7 Data in Figure 3.14 together with the data lagged by five points in time

Table 3.8 Two time series, for which the cross-correlogram is presented in Figure 3.16

Table 3.9 Equivalence between parameters in the time domain and frequency domain

Table 3.10 Kalman filter calculation

Table 3.11 Numerical example for wavelet transform: left raw data, centre transformed data after level 1 wavelet and right after level 2 wavelet, without scaling

Table 3.12 Maximum entropy calculation for unbiased die, logarithms to the base 10

Table 3.13 Maximum entropy calculation for biased die

Chapter 4: Principal Component Analysis and Unsupervised Pattern Recognition

Table 4.1 Case study 1: a chromatogram recorded at 30 points in time and 28 wavelengths

Table 4.2 Case study 2: NIR spectra of 72 oils in AU recorded at 32 wavelengths, consisting of four groups A: corn oil, B: olive oil, C: safflower oil, D: corn margarine, after baseline correction and suitable pre-processing

Table 4.3 Case study 3: properties of some elements

Table 4.4 Scores and loadings for case study 1

Table 4.5 Eigenvalues for case study 1 (raw data)

Table 4.6 Eigenvalues for case study 3 (standardised data)

Table 4.7 Size of eigenvalues for case study 1 after column centring

Table 4.8 Cross-validation example

Table 4.9 Calculation of cross-validated error for sample 1

Table 4.10 Calculation of RSS and PRESS

Table 4.11 Example for logarithmic scaling; the first five samples belong to one group and the last five to a separate group

Table 4.12 Example for row scaling

Table 4.13 How the data in Table 4.12 were simulated as discussed in the text

Table 4.14 Example for Mean Centring

Table 4.15 Standardising the data of Table 4.11

Table 4.16 Example for cluster analysis

Table 4.17 Correlation matrix.

Table 4.18 Euclidean distance matrix.

Table 4.19 Manhattan distance matrix.

Table 4.20 Nearest neighbour cluster analysis, using correlation coefficients for similarity measures, and data in Table 4.16

Chapter 5: Classification and Supervised Pattern Recognition

Table 5.1 Case study in Section 5.1.2: the data involve 20 samples in two classes (first 10 = class A, second 10 = class B) recorded using two variables

Table 5.2 Class distances for the data in Table 5.1 using EDC, LDA and QDA together with the predicted class memberships

Table 5.3 PLS-DA components of data in Table 5.1

Table 5.4 PLS-DA predictions of c for one-component and two-component models for the centred data in Table 5.4

Table 5.5 kNN for data in Table 5.1; the five nearest neighbours are listed and the assignments using k = 3 and k = 5

Table 5.6 QDA Mahalanobis distance to classes A and B for data in Table 5.1 together with the classification at a confidence limit of 90% (cut-off 2.146); shaded cells are outside the limits

Table 5.7 Class A model using one PC (centred) for SIMCA and data in Table 5.1

Table 5.8 Q and D one-PC class A models for data in Table 5.1

Table 5.9 Division into training and test set

Table 5.10 EDC model of data in Table 5.1 divided into training and test sets

Table 5.11 A simple contingency table

Table 5.12 A 2 × 2 contingency table

Table 5.13 Data set mentioned in Section 5.6

Chapter 6: Calibration

Table 6.1 Case study consisting of 25 spectra recorded at 27 wavelengths in nanometre, absorbances in AU

Table 6.2 Concentrations of the 10 PAHs in the data in Table 6.1

Table 6.3 Concentration of pyrene, absorbance at 335 nm and predictions of absorbance, using single-parameter classical calibration using method of Section 2.2.1

Table 6.4 Concentration of pyrene, absorbance at 335 nm and predictions of absorbance, using single-parameter inverse calibration using method of Section 2.2.2

Table 6.5 Matrices for four components

Table 6.6 Matrix B for Section 6.3.2

Table 6.7 Estimated concentration for four components as described in Section 6.3.2

Table 6.8 Estimated concentrations for the case study using uncentred MLR and all wavelengths

Table 6.9 Estimates for three PAHs using the full data set and MLR but including only three compounds in the model

Table 6.10 Scores of the first 10 PCs for PAH case study

Table 6.11 Vector r for pyrene

Table 6.12 Concentration estimates of the PAHs using PCR and 10 components (uncentred)

Table 6.13 Calculation of concentration estimates for pyrene using two PLS components

Table 6.14 Magnitudes of first 15 PLS1 components (centred data) for pyrene

Table 6.15 Concentration estimates of the PAHs using PLS1 and 10 components (centred)

Table 6.16 Concentration estimates of the PAHs using PLS2 and 10 components (centred)

Table 6.17 Three-way calibration data set

Table 6.18 Four methods of mean centring the data in Table 6.17, illustrated by the variable x_{i,1,1} as discussed in Section 6.5.3.1

Table 6.19 Calculation of three tri-linear PLS1 components for the data in Table 6.17 and residuals for sample 1

Table 6.20 Independent test set

Chapter 7: Evolutionary Multivariate Signals

Table 7.1 Data set A

Table 7.2 Data set B

Table 7.3 Data set C

Table 7.4 Method for ranking variables using data set C

Table 7.5 Correlation coefficients for data set A between successive points (left-hand column) and with point 15 (right-hand column)

Table 7.6 Results of forward and backward EFA for the data set A

Table 7.7 Fixed-size window factor analysis applied to data set A using a three-point window

Table 7.8 Derivative calculation for determining purity of regions in data set A

Table 7.9 Estimated spectra obtained from the composition 1 regions in data set A

Table 7.10 Estimation of profiles using PCA for the data in Table 7.9

Table 7.11 Key steps in the calculation of rotation matrix for data set A using scores in composition 1 regions

Table 7.12 Determining spectrum and elution profiles of an embedded peak

Appendix

Table A.1 Cumulative standardised normal distribution

Table A.2 Critical values of χ²

Table A.3 Critical values of the two-tailed t-distribution

Table A.4 One-tailed critical values of the F-distribution at the 1% level

Table A.5 One-tailed critical values of the F-distribution at the 5% level

Chemometrics

Data Driven Extraction for Science

 

Richard G. Brereton

University of Bristol (Emeritus) UK

 

Second Edition

 

 

 

 

This edition first published 2018

© 2018 John Wiley & Sons Ltd

John Wiley & Sons Ltd (1e, 2009)

All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, except as permitted by law. Advice on how to obtain permission to reuse material from this title is available at http://www.wiley.com/go/permissions.

The right of Richard G. Brereton to be identified as the author of this work has been asserted in accordance with law.

Registered Office(s)

John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, USA

John Wiley & Sons Ltd, The Atrium, Southern Gate, Chichester, West Sussex, PO19 8SQ, UK

Editorial Office

The Atrium, Southern Gate, Chichester, West Sussex, PO19 8SQ, UK

For details of our global editorial offices, customer services, and more information about Wiley products visit us at www.wiley.com.

Wiley also publishes its books in a variety of electronic formats and by print-on-demand. Some content that appears in standard print versions of this book may not be available in other formats.

Limit of Liability/Disclaimer of Warranty

In view of ongoing research, equipment modifications, changes in governmental regulations, and the constant flow of information relating to the use of experimental reagents, equipment, and devices, the reader is urged to review and evaluate the information provided in the package insert or instructions for each chemical, piece of equipment, reagent, or device for, among other things, any changes in the instructions or indication of usage and for added warnings and precautions. While the publisher and authors have used their best efforts in preparing this work, they make no representations or warranties with respect to the accuracy or completeness of the contents of this work and specifically disclaim all warranties, including without limitation any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives, written sales materials or promotional statements for this work. The fact that an organization, website, or product is referred to in this work as a citation and/or potential source of further information does not mean that the publisher and authors endorse the information or services the organization, website, or product may provide or recommendations it may make. This work is sold with the understanding that the publisher is not engaged in rendering professional services. The advice and strategies contained herein may not be suitable for your situation. You should consult with a specialist where appropriate. Further, readers should be aware that websites listed in this work may have changed or disappeared between when this work was written and when it is read. Neither the publisher nor authors shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.

Library of Congress Cataloging-in-Publication Data:

Names: Brereton, Richard G., author.

Title: Chemometrics : data driven extraction for science / Richard G. Brereton.

Description: Second edition. | Hoboken, NJ : John Wiley & Sons, 2018. | Originally published in 2003 as: Chemometrics : data analysis for the laboratory and chemical plant. |

Identifiers: LCCN 2017054468 (print) | LCCN 2017059486 (ebook) | ISBN 9781118904688 (epub) | ISBN 9781118904671 (pdf) | ISBN 9781118904664 (pbk.)

Subjects: LCSH: Chemometrics-Data processing. | Chemical processes-Statistical methods-Data processing.

Classification: LCC QD75.4.C45 (ebook) | LCC QD75.4.C45 B74 2018 (print) | DDC 543.01/5195-dc23

LC record available at https://lccn.loc.gov/2017054468

Cover design by Wiley

Cover images: (Background) © LiliKo/Gettyimages; (Diagram) Courtesy of Richard G. Brereton

Preface to Second Edition

The first edition of this book, with its special emphasis on the numerical illustration of a wide range of chemometric methods, has been well received. Of particular importance were the problems at the end of each chapter, which readers could work through in their own favourite environment, such as Excel or Matlab, but equally R, Python, Fortran or any number of other languages or computational packages. I have performed the calculations in both Matlab and Excel, but readers should not feel restricted if they prefer an alternative.

The reader of this book is likely to be an applied scientist or statistician who wishes to understand the basis and motivation of many of the main methods used in chemometrics.

Since the first edition, chemometrics has become much more widespread, including outside mainstream chemistry. In the early 2000s, the major applications were quantitative laboratory analytical science and chemical engineering, including process control. Over the past few years, application areas have broadened as large analytical laboratory-generated data sets have become more widely available, for example in metabolomics, heritage science and food science. This is reflected in a greater emphasis on pattern recognition in the second edition, including some practical case studies from metabolomics in the form of worked problem sets.

Despite this, many of the original building blocks of the subject remain unchanged. A factorial design and a principal component are still the same, so parts of the text involve only small changes from the first edition. Nevertheless, feedback from students and co-workers of mine, and comments via the Internet, has provided valuable guidance as to what changes are desirable for a second edition. Important structural changes, such as multiple choice questions throughout the book and colour printing, update the original edition as a modern-day textbook.

Some major updates are as follows.

Short multiple choice questions at the end of every section of the main text.

Colour printing, involving the redrawing of many figures.

New chapter on supervised pattern recognition (classification) involving enhanced discussions of SIMCA, PLS-DA, LDA, QDA, EDC, kNN as well as validation.

New case studies on NIR for distinguishing edible oils, and properties of elements, to illustrate unsupervised pattern recognition methods.

New case studies in metabolomics, including Arabidopsis genotyping by MS, Raman of cancerous lymph nodes and NMR for diagnosing diabetes, as new problem sets.

Additional description of MCR and ITTFA.

New and expanded discussions of wavelets and of Bayesian methods in signal analysis.

Updated description of Matlab R2016a under Windows 10, and Excel 2016 under Windows 10, in the context of the needs of the chemometrician.

Enhanced discussion of the main statistical distributions.

Enhanced discussions on validation and optimisation, including description of the bootstrap and of performance indicators.

To supplement the text, all the data sets in this book, both from the main text and from the problems at the end of each chapter, are downloadable. In addition, there is a downloadable Excel add-in to perform most of the common multivariate methods and a macro for labelling graphs. Matlab routines corresponding to many of the main methods are also available, as are the answers to the problems at the end of each chapter. All of these can be found on the Wiley website associated with this book.

It is hoped that this text will be useful for students wishing to obtain a fundamental understanding of many chemometric methods. It will also be useful for any practising chemometrician who needs to work through methods they may have only recently encountered, using numerical examples: as a researcher, when I encounter an unfamiliar approach, I usually like to reproduce numerical results from published case studies to check how it works before I am confident to use the method. For people encountering chemometrics for the first time, for example in metabolomics and heritage science, this book presents many of the most widespread methods and so will serve as a good reference. As a refresher, the multiple choice questions test basic understanding, and the worked case studies can be collected together and are helpful for courses.

Finally, I thank the publishers who have encouraged the development of this rather complex project, especially Jenny Cossham, through many stages and also colleagues who have provided data as listed in the acknowledgements.

Bristol, May 2017

Richard G. Brereton

Preface to First Edition

This book is the product of several years of my activities. First and foremost, the task of educating graduate students in my research group, drawn from a large variety of backgrounds over the past 10 years, has been a significant formative experience, and this has allowed me to develop a large series of problems, which we set every 3 weeks and for which answers are presented in seminars. From my experience, this is the best way to learn chemometrics! In addition, I have had the privilege to organise international quality courses, mainly for industrialists, with representatives of many of the best organisations and institutes around the world participating as tutors, and I have learnt from them. Different approaches are normally needed when teaching industrialists, who may be encountering chemometrics for the first time in mid-career and have only a few days to attend a condensed course, and university students, who have several months or even years to practise and improve. However, it is hoped that this book represents a symbiosis of both needs.

In addition, it has been a great inspiration for me to write a regular fortnightly column for Chemweb (available to all registered users on www.chemweb.com), and some of the material in this book is based on articles first available in this format. Chemweb brings a large reader base to chemometrics, and feedback via e-mail, and even travel around the world, has helped me formulate my ideas. There is a very wide interest in this subject, but it is somewhat fragmented. For example, there is a strong group of near-infrared spectroscopists, primarily in the USA, who have led the application of advanced ideas in process monitoring and who see chemometrics as a quite technical, industrially oriented subject. Other groups of mainstream chemists see chemometrics as applicable to almost all branches of research, ranging from kinetics to titrations to synthesis optimisation. Satisfying all these diverse people is not an easy task.

This book relies mainly on numerical examples: many in the body of the text come from my favourite research interests, which are primarily in analytical chromatography and spectroscopy; to expand the text to cover other areas in similar depth would produce a huge book of twice the size, so I ask the indulgence of readers whose area of application differs. Certain chapters, such as those on calibration, could be approached from widely different viewpoints, but the methodological principles are the most important, and if you understand how the ideas can be applied in one area, you will be able to translate them to your own favourite application. In the problems at the end of each chapter, I cover a wider range of applications to illustrate the broad basis of these methods. The emphasis of this book is on understanding ideas, which can then be applied to a wide variety of problems in chemistry, chemical engineering and allied disciplines.

It is difficult to select what material to include in this book without making it too long. Every expert I have shown this book to has made suggestions for new material. Some I have taken into account, and I am most grateful for every proposal; others I have mentioned briefly or not at all, mainly for reasons of length and also to ensure that this book sees the light of day rather than expanding without end. There are many outstanding specialist books for the enthusiast. It is my experience, however, that if you understand the main principles (which are quite few in number) and constantly apply them to a variety of problems, you will soon pick up the more advanced techniques, so it is the building blocks that are most important.

In a book of this nature, it is very difficult to decide on what detail is required for the various algorithms: some readers will have no real interest in the algorithms, whereas others will feel the text is incomplete without comprehensive descriptions. The main algorithms for common chemometric methods are presented in Appendix A.2. Step-by-step descriptions of methods, rather than algorithms, are presented in the text. A few approaches that will interest some readers, such as cross-validation in PLS, are described in the problems at the end of appropriate chapters, which supplement the text. It is expected that readers will approach this book with different levels of knowledge and expectations, so it is possible to gain a great deal without having an in-depth appreciation of computational algorithms, but for interested readers, the information is nevertheless available. People rarely read texts in a linear fashion; they often dip in and out of parts according to their background and aspirations, and chemometrics is a subject that people approach with very different previous knowledge and skills, so it is possible to gain from this book without covering every topic in full. Many readers will simply use add-ins or Matlab commands and be able to produce all the results in this text.

Chemometrics uses a very large variety of software. In this book, we recommend two main environments, Excel and Matlab; the examples have been tried in both, and you should be able to get the same answers in each case. Users of this book will vary from people who simply want to plug the data into existing packages to those who are curious and want to reproduce the methods in their own favourite language, such as Matlab, VBA or even C. In some cases, instructors may use the information available with this book to tailor examples for problem classes. Extra software supplements are available via the publishers' website www.SpectroscopyNOW.com, together with all the data sets in this book.

The problems at the end of each chapter form an important part of the text, the examples being a mixture of simulations (which have an important role in chemometrics) and real case studies from a wide variety of sources. For each problem, the relevant sections of the text that provide further information are referenced. However, a few problems build on the existing material and take the reader further: a good chemometrician should be able to use the basic building blocks to understand and use new methods. The problems are of various types; thus, not every reader will want to solve all of them. In addition, instructors can use the data sets to construct workshops or course material that goes further than the book.

I am very grateful for the tremendous support I have had from many people when asking for information and help with data sets and permission where required. I thank Chemweb for agreement to present material modified from articles originally published in their e-zine, The Alchemist, and the RSC for permission to base the text of Chapter 5 on material originally published in the Analyst (125, 2125–2154 (2000)). A full list of acknowledgements for the data sets used in this text is presented after this foreword.

I thank Tom Thurston and Les Erskine for a superb job on the Excel add-in, and Hailin Shen for outstanding help in Matlab. Numerous people have tested the answers to the problems. Special mention should be given to Christian Airiau, Kostas Zissis, Tom Thurston, Conrad Bessant and Cevdet Demir for access to a comprehensive set of answers on disc for a large number of exercises, so that I could check mine. In addition, several people have read chapters and made detailed comments, particularly checking numerical examples; in particular, I thank Hailin Shen for suggestions about improving Chapter 6 and Mohammed Wasim for careful checking of errors. In some ways, the best critics are the students and postdocs working with me, because they are the people who have to read and understand a book of this nature, and it gives me great confidence that my co-workers in Bristol have found this approach useful and have been able to learn from the examples.

Finally, I thank the publishers for taking a germ of an idea and making valuable suggestions as to how this could be expanded and improved to produce what I hope is a successful textbook, and for having faith and patience over a protracted period.

Bristol, February 2002

Richard G. Brereton

Acknowledgements

The following have provided me with sources of data for this text. All other case studies are simulations.

Data set

Source

Problem 2.2

A. Nordin, L. Eriksson, M. Öhman, Fuel, 74, 128–135 (1995)

Problem 2.6

G. Drava, University of Genova

Problem 2.7

I.B. Rubin, T.J. Mitchell, G. Goldstein, Anal Chem, 43, 717–721 (1971)

Problem 2.10

G. Drava, University of Genova

Problem 2.11

Y. Yifeng, S. Dianpeng, H. Xuebing, W. Shulan, Bull Chem Soc Japan, 68, 1115–1118 (1995)

Problem 2.12

D.V. McCalley, University of West of England, Bristol

Problem 2.15

D. Vojnovic, B. Campisi, A. Mattei, L. Favreto, Chemometrics Intell Lab Systems, 27, 205–219 (1995)

Problem 2.16

L.E. Garcia-Ayuso, M.D. Luque de Castro, Anal Chim Acta, 382, 309–316 (1999)

Problem 3.8

K.D. Zissis, University of Bristol

Problem 3.9

C. Airiau, University of Bristol

Table 4.1

S. Dunkerley, University of Bristol

Table 4.2

S. Goswami and K. Olafsson, Camo ASA

Table 4.3

A. Javey, Chemometrics On-line

Problem 4.3

D. Duewer, National Institute of Standards Technology, US

Problem 4.5

S. Dunkerley, University of Bristol

Problem 5.3

S. Wold, University of Umeå (based on R. Cole and K. Phelps, J Sci Food Agric, 30, 669–676 (1979))

Problem 5.4

P. Bruno, M. Caselli, M.L. Curri, A. Genga, R. Striccoli, A. Traini, Anal Chim Acta, 410, 193–202 (2000)

Problem 5.5

R. Vendrame, R.S. Braga, Y. Takahata, D.S. Galvão, J Chem Inf Comp Sci, 39, 1094–1104 (1999)

Problem 5.7

R. Goodacre, University of Manchester (based on M. Kusano, A. Fukushima, M. Arita, P. Jonsson, T. Moritz, M. Kobayashi, et al., BMC Systems Biology, 1, 53 (2007) – Metabolights accession MTBLS40)

Problem 5.8

R. Goodacre, University of Manchester (based on R.M. Salek, M.L. Maguire, E. Bentley, D.V. Rubtsov, T. Hough, M. Cheeseman, et al., Physiol Genomics, 29, 99–108 (2007) – Metabolights accession MTBLS1)

Problem 5.9

G.R. Lloyd (based on G.R. Lloyd, L.E. Orr, J. Christie-Brown et al., Analyst, 138, 3900–3908 (2013))

Table 6.1

S.D. Wilkes, University of Bristol

Table 6.20

S.D. Wilkes, University of Bristol

Problem 6.1

M.C. Pietrogrande, F. Dondi, P.A. Borea, C. Bighi, Chemometrics Intell Lab Systems, 5, 257–262 (1989)

Problem 6.3

H. Martens, M. Martens, Multivariate Analysis of Quality, Wiley, Chichester, 2001, p. 14

Problem 6.6

P.M. Vacas, University of Bristol

Problem 6.9

K.D. Zissis, University of Bristol

Problem 7.1

S. Dunkerley, University of Bristol

Problem 7.3

S. Dunkerley, University of Bristol

Problem 7.5

R. Tauler, University of Barcelona (results published in R. Gargallo, R. Tauler, A. Izquierdo-Ridorsa, Quimica Analitica, 18, 117–120)

Problem 7.6

S.P. Gurden, University of Bristol

About the Companion Website

Do not forget to visit the companion website for this book:

http://booksupport.wiley.com

The accompanying website for this text, http://booksupport.wiley.com, provides valuable material designed to enhance your learning, including:

Answers to problems at the end of each chapter

Software

Associated data sets

Figures in PPT

Chapter 1Introduction

1.1 Historical Parentage

There are many opinions about the origin of chemometrics. Until quite recently, the birth of chemometrics was considered to have happened in the 1970s. The name first appeared in 1972 in an article by Svante Wold [1]: in fact, the topic of this article was not one that we would recognise as being core to chemometrics, being relevant to neither multivariate analysis nor experimental design. For over a decade, the word chemometrics remained of very low profile, and it developed a recognisable presence only in the 1980s, as described below.

However, if an explorer describes a new species in a forest, the species was there long before the explorer. Thus, the naming of the discipline just recognises that it had reached some level of visibility and maturity. As people re-evaluate the origins of chemometrics, the birth can be traced many years back.

Chemometrics burst into the world due to three fundamental factors: applied statistics (multivariate methods and experimental design), statistics in analytical and physical chemistry, and scientific computing.

1.1.1 Applied Statistics

The ideas of multivariate statistics have been around a long time. R.A. Fisher and colleagues working at Rothamsted, UK, formalised many of our modern ideas while applying them primarily to agriculture. In the UK, before the First World War, many of the upper classes owned extensive land and relied for their income on tenant farmers and agricultural labourers. After the First World War, the cost of labour rose, with many workers moving to the cities, and there was stronger competition from global food imports. Historic agricultural practices thus came to be seen as inefficient, and it was hard for landowners (or the companies that took over large estates) to remain economic and competitive; hence there was a huge emphasis on agricultural research, including statistics, to improve these practices. R.A. Fisher and co-workers published some of the first major books and papers that we would regard as defining modern statistical thinking [2, 3], introducing ideas ranging from the null hypothesis to discriminant analysis to ANOVA. Some of Fisher's work followed from the pioneering work of Karl Pearson at University College London, who had earlier founded the world's first statistics department and had first formulated ideas such as p-values and correlation coefficients.

During the 1920s and 1930s, a number of important pioneers of multivariate statistics published their work, many strongly influenced by or having worked with Fisher, including Harold Hotelling, credited by many with defining principal components analysis (PCA) [4], although Pearson had independently described this method some 30 years earlier under a different guise. As so often in science, ideas are reported several times over, and it is the person who names and popularises a method that often gets the credit: in the early twentieth century, libraries were often localised, there were very few international journals (Hotelling working mainly in the US) and certainly no internet; therefore, parallel work was often reported.

The principles of statistical experimental design were also formulated at around this period. There had been early reports on what we would regard as modern approaches to formal designs before that, for example James Lind's work on scurvy in the eighteenth century and Charles Peirce's discussion of randomised trials in the nineteenth century, but Fisher's classic work of the 1930s put all the concepts together in a rigorous statistical format [5].

Much non-Bayesian applied statistical thinking has, for nearly a century, been based on principles established in the 1920s and 1930s. Early applications included agriculture, psychology, finance and genetics. After the Second World War, the chemical industry took an interest. In the 1920s, an important need had been to improve agricultural practice, but by the 1950s a major need was to improve manufacturing processes, especially in chemical engineering; hence many more statisticians were employed within industry. O.L. Davies edited an important book on experimental design with contributions from colleagues at ICI [6]. Foremost was G.E.P. Box, son-in-law of Fisher, whose book with colleagues is one of the most important post-war classics in experimental design and multi-linear regression [7].

These statistical building blocks were already mature by the time people started calling themselves chemometricians and have changed only a little during the intervening period.

1.1.2 Statistics in Analytical and Physical Chemistry