Overview of methods for analyzing high-dimensional experimental data, including theory, methodologies, and applications
Analysis of Variance for High-Dimensional Data summarizes the methods for analyzing high-dimensional data obtained by applying an experimental design in the life, food, and chemical sciences, with an emphasis on those developed in recent years.
Written by international experts who lead development in the field, Analysis of Variance for High-Dimensional Data includes information on:
Analysis of Variance for High-Dimensional Data is an essential reference for practitioners involved in data analysis in the natural sciences, including professionals working in chemometrics, bioinformatics, data science, statistics, and machine learning. The book is also valuable for developers of new methods in high-dimensional data analysis.
Page count: 582
Year of publication: 2025
Cover
Table of Contents
Title Page
Copyright
About the Authors
Foreword
Preface
1 Introduction
1.1 Types of Data
1.2 Statistical Design of Experiments
1.3 High‐Dimensional Data
1.4 Examples
1.5 Complexities
1.6 Direct Versus Indirect Methods
1.7 Some History
1.A Appendix
Notes
2 Basic Theory and Concepts
2.1 Mathematical Background
2.2 Statistical Background
2.3 Association Measures
2.A Appendix
Notes
3 Linear Models
3.1 Introduction
3.2 Simple ANOVA Models
3.3 Regression Formulation, Estimability, and Contrasts
3.4 Coding Schemes
3.5 Advanced Models
3.6 Hasse Diagrams
3.7 Validation
3.8 Miscellaneous Models
3.A Appendix
Notes
4 ASCA and Related Methods
4.1 ASCA
4.2 APCA
4.3 ASCA+
4.4 Principal Response Curves
4.5 SMART
4.6 ASCA, PRC, and SMART Compared
4.7 MSCA
4.A Appendix
Notes
5 Alternative Methods
5.1 General Introduction
5.2 PLSR‐Based Methods
5.3 LMM‐Based Methods
5.4 Miscellaneous Methods
6 Distance‐based Methods
6.1 Introduction
6.2 Methods
6.3 ANOSIM
Note
7 Reviews and Reflections
7.1 Reviews
7.2 Reflections
8 Software
8.1 HD‐ANOVA Software
8.2 R Package HDANOVA
8.3 Installing and Starting the Package
8.4 Data Handling
8.5 Analysis of Variance (ANOVA)
8.6 Basic ASCA Family
8.7 Alternative Methods
8.8 Software Packages
References
Index
End User License Agreement
Chapter 1
Table 1.1 Effect sizes and significance in the microbiome example.
Table 1.2 Overview of types of measurement scales and their permissible tran...
Table 1.A.1 Overview of types of measurements in different fields.
Table 1.A.2 Overview of notation as used in this book.
Table 1.A.3 Abbreviations of the different methods.
Chapter 2
Table 2.1 ASV values of three samples.
Table 2.2 Distances and dissimilarities between the samples.
Table 2.3 ANOVA table for a crossed two‐way design.
Chapter 3
Table 3.1 Overview of one‐way ANOVA model for the balanced case.
Table 3.2 Overview of one‐way ANOVA model for isoleucine example.
Table 3.3 Overview of a two‐way ANOVA model for the balanced case.
Table 3.4 Overview of a two‐way ANOVA model with interaction for the balance...
Table 3.5 Overview of the two‐way ANOVA model with interaction for the alani...
Table 3.6 Overview of the ANOVA model for the two‐factor nested example.
Table 3.7 The ANOVA table for a one‐way variance component model for the bal...
Table 3.8 Reference‐coded design matrix for categorical Factor Time.
Table 3.9 Reference‐coded design matrix for categorical Factor Treatment.
Table 3.10 Design matrix for Treatment:Time interaction.
Table 3.11 Design matrix for random effect matrix for the Factor Individual....
Table 3.12 Design matrix for quantitative time factor and its interactions f...
Table 3.13 Design matrix for quantitative time factor and its interactions f...
Table 3.14 Results of EMS calculation for a two‐factor LMM.
Table 3.A.1 Summary of the centering properties of the different coding syst...
Table 3.A.2 Simulation setup for data used in three‐factor fixed effect ANOV...
Table 3.A.3 Cell counts for unbalanced data, i.e., the number of samples for...
Table 3.A.4 ‐values for fixed effect ANOVAs using combinations of types of ...
Table 3.A.5 ‐values for fixed effect ANOVAs using combinations of types of ...
Chapter 4
Table 4.1 ASCA SS and explained variance per effect for the candy data.
Chapter 5
Table 5.1 Results of permutation testing of the residuals for estimating t...
Table 5.2 RSR values and results of permutation testing for estimating the e...
Table 5.3 Overview of two‐way ANOVA models on the first five PCs extracted f...
Table 5.4 Results of the F test on the block saliences on CC1 for testing th...
Chapter 6
Table 6.1 Frequency table for variables in samples and .
Table 6.2 Counts, proportions, and CLR‐transformed values of example data.
Table 6.3 Simple example with three samples and five features.
Table 6.4 Design and samples for the simple PERMANOVA example.
Table 6.5 Euclidean distances between the samples of the simple PERMANOVA ex...
Table 6.6 The matrix in the simple PERMANOVA example.
Table 6.7 The matrix in the PERMANOVA example.
Table 6.8 The matrix in the PERMANOVA example.
Table 6.9 The matrix in the PERMANOVA example.
Table 6.10 Results of the PERMANOVA analysis of the Loue data, in which the ...
Chapter 7
Table 7.1 Overview of metabolomics applications using HD‐ANOVA in the plant ...
Table 7.2 Overview of metabolomics applications using HD‐ANOVA in microbiolo...
Table 7.3 Overview of metabolomics applications using multivariate ANOVA in ...
Table 7.4 Overview of metabolomics applications using HD‐ANOVA in human scie...
Table 7.5 Overview of applications using HD‐ANOVA (except PERMANOVA) in the ...
Table 7.6 Overview of applications using HD‐ANOVA in genomics applications....
Table 7.7 Overview of applications using HD‐ANOVA in proteomics.
Table 7.8 Overview of applications using HD‐ANOVA in food science.
Table 7.9 Overview of applications using HD‐ANOVA in sensory science.
Table 7.10 Overview of applications using HD‐ANOVA in chemistry.
Table 7.11 Overview of the reviews.
Table 7.12 Properties of methods.
Chapter 8
Table 8.1 R packages on CRAN having one or more HD‐ANOVA methods.
Table 8.2 MATLAB toolboxes and functions having one or more HD‐ANOVA methods...
Table 8.3 Python packages having one or more HD‐ANOVA methods.
Chapter 1
Figure 1.1 Two different types of data sets: (a) long, tall, or skinny, (b) ...
Figure 1.2 Two different types of data fusion: (a) along the sample mode or ...
Figure 1.3 Data blocks arranged in a pathway topology (a) or in a three‐way ...
Figure 1.4 Prior knowledge available for a data set.
Figure 1.5 Design information about the samples: (a) encoding for skinny dat...
Figure 1.6 The curse of dimensionality. For an explanation, see the text....
Figure 1.7 The treatment design of the hepatocyte differentiation study. Det...
Figure 1.8 The ASCA results of the hepatocyte study: (a) cell type scores pl...
Figure 1.9 The ASCA results of the (Risk:Time) interaction effect of the app...
Figure 1.10 Three different ways to deal with microbiome data: (a) calculate...
Figure 1.11 Schematic representation of low‐level data fusion of microbiome ...
Figure 1.12 The APCA results of the E. coli study: (a) score plot of the mat...
Figure 1.13 The ASCA results of the pasta study: (a) the permutation results...
Figure 1.14 The ASCA results of the sensory study: (a) visualizes the scores...
Figure 1.15 The ASCA results of the HILIC study: (a) the scores of a PCA of ...
Figure 1.16 The ASCA results of the phytoremediation study: (a) scores of PC...
Figure 1.17 A graphical depiction of the history of the development of HD‐AN...
Chapter 2
Figure 2.1 Row‐space (a) and column‐space (b) representation of the raw data...
Figure 2.2 Row‐space (a) and column‐space (b) representation of the centered...
Figure 2.3 Two variables and are correlated as shown by the 95% contour ...
Figure 2.4 The vector is projected onto the vector giving . For more ex...
Figure 2.5 The vector is projected onto the vector along the vector gi...
Figure 2.6 Extended figure for oblique projections.
Figure 2.7 The vector is projected onto the subspace spanned by the vector...
Figure 2.8 Projection on a set of dependent vectors.
Figure 2.9 The two green lines are the loadings, and the blue dots numbered ...
Figure 2.10 The value of the test statistic (squared difference between the ...
Figure 2.11 A problematic correlation depending completely on only one obser...
Figure 2.12 Data of two measurements obtained for a treatment factor at two ...
Figure 2.13 Histograms of the distribution of the Pearson correlation betwee...
Figure 2.14 Histograms of the distribution of the Pearson correlation betwee...
Chapter 3
Figure 3.1 Pictorial representation of a one‐way balanced design with three ...
Figure 3.2 Concentration of isoleucine after 160 minutes exposure to differe...
Figure 3.3 Pictorial representation of a balanced crossed two‐way design. Th...
Figure 3.4 An example of interaction. The height of plants in a greenhouse i...
Figure 3.5 Concentration of alanine as a function of the exposure to differe...
Figure 3.6 Pictorial representation of a balanced nested two‐way design. The...
Figure 3.7 Scheme of the two‐factor nested ANOVA example, where Factor A is ...
Figure 3.8 Pictorial representation of a one‐way unbalanced design. The draw...
Figure 3.9 Pictorial representation of a two‐way crossed unbalanced design. ...
Figure 3.10 Pictorial representation of a two‐way nested unbalanced design. ...
Figure 3.11 A variable is repeatedly measured for four individuals at six ti...
Figure 3.12 Ab profiles of different subjects. PRE is pre‐dose, PI(D30) is a...
Figure 3.13 Fit of the categorical model of Equation 3.120 for the Alum grou...
Figure 3.14 Hasse diagrams for the crossed two‐factor design: (a) diagram fo...
Figure 3.15 Hasse diagram for the calculation of the expected mean squares o...
Figure 3.16 Randomized complete block design of the Alfalfa experiment with ...
Figure 3.A.1 An unbalanced one‐way ANOVA with four levels. The bubble size r...
Figure 3.A.2 Cross‐products of the different factor codings (i.e., the mat...
Chapter 4
Figure 4.1 Pictorial presentation of a two‐way ASCA model without interactio...
Figure 4.2 ASCA analysis of metabolomics data. The scores and loadings of PC...
Figure 4.3 Simulated variables for the ASCA model showing different time and...
Figure 4.4 Results of different types of scaling of the ASCA between‐effects...
Figure 4.5 Results of different types of scaling of the ASCA within‐effects....
Figure 4.6 The profiles of the concentrations of the blood metabolites (v1–v...
Figure 4.7 The between‐effect results for the first component. Left panels: ...
Figure 4.8 The between‐effect results for the second component. Left panels:...
Figure 4.9 The within‐effect results for the first component. Left panels: s...
Figure 4.10 Arrangement of the data of the rat example in a matrix.
Figure 4.11 Plot of the data of the rat example. Each cross represents an in...
Figure 4.12 First step in ASCA: overall centering and centering per group. T...
Figure 4.13 Estimating the main factor effects by taking averages across fac...
Figure 4.14 Estimating the interaction effect by subtracting the main effect...
Figure 4.15 APCA analysis of metabolomics data.
Figure 4.16 Performance plots for the Factors A (, left) and B (, right). ...
Figure 4.17 Score plot with back‐projected samples, model ellipsoids, and da...
Figure 4.18 PRC analysis of metabolomics data.
Figure 4.19 Small example of two metabolites. Treatment groups are indicated...
Figure 4.20 Small example of two metabolites. Representation of the ASCA mod...
Figure 4.21 Small example of two metabolites. Representation of the PRC mode...
Figure 4.22 Comparing ASCA and PRC for time point 1. The vectors and are...
Figure 4.23 The scores of the different methods on the first component; (a) ...
Figure 4.24 The scores of the different methods on the third component; (a) ...
Figure 4.25 The loadings of the different methods on the third component; (a...
Figure 4.26 Score plot of the between‐monkey variation of the normality stud...
Figure 4.27 Score plot of the within‐monkey variation of monkey 6; (a) the w...
Chapter 5
Figure 5.1 Residual‐augmented effect matrix for Temperature, which is used a...
Figure 5.2 Comparison of the sum‐of‐squares of the residuals of the PLS mo...
Figure 5.3 Scores of the samples onto the two target projection vectors of t...
Figure 5.4 Variable loadings for the first (a) and the second (b) target pro...
Figure 5.5 (a) Orthogonal samples scores onto the two target projection comp...
Figure 5.6 Block weights on the 33 predictive components for the AMOPLS mode...
Figure 5.7 (a) Samples scores onto the first predictive components of the AM...
Figure 5.8 RM‐ASCA+ decomposition of Treatment + Time:Treatment interaction ...
Figure 5.9 PCA Score and loading plots Time + Time:Treatment effect varia...
Figure 5.10 (a) PCA Score of the Time:Treatment random individual variation....
Figure 5.11 LiMM‐PCA decomposition of Treatment and Time:Treatment interacti...
Figure 5.12 Scores of the 140 experiments along the first two PCs extracted ...
Figure 5.13 Loading of the 67 metabolites along the first two PCs extracted ...
Figure 5.14 Scores of the 4 Light levels along PC2 and PC4 extracted from th...
Figure 5.15 Scores of along PC2 plotted longitudinally against Time.
Figure 5.16 Scores on the first PC of a PCA on the toxicology data showing t...
Figure 5.17 ASCA scores on the first (a) and second (b) PARAFAC component of...
Figure 5.18 PARAFASCA scores on the first (a) and second (b) factor estimate...
Figure 5.19 Simulated example of PE‐ASCA. The simulation of the data (left) ...
Figure 5.20 PE‐ASCA result of the experimental NMR‐metabolomics data, (botto...
Figure 5.21 Scores along the two Canonical Variates for the effect of Temper...
Figure 5.22 Variable loadings along the two Canonical Variates for the effec...
Figure 5.23 Effect matrices resulting from the ANOVA decomposition of the pa...
Figure 5.24 Saliences of the different residual‐augmented data blocks result...
Figure 5.25 Sample scores along CC3 and CC19 which, based on the values of t...
Figure 5.26 Loadings of the block corresponding to the residual‐augmented ef...
Chapter 6
Figure 6.1 The horseshoe effect; (a) layout of the data (black is nonzero); ...
Figure 6.2 The compositionality issue. In sample 2, species C is doubled in ...
Figure 6.3 (a) Three samples in space of variables , , and ; (b) samples ...
Figure 6.4 PCoA plot of the Loue river example.
Figure 6.5 Equivalence between the sum of squared distances between points a...
Figure 6.6 (a) Original samples; (b) ‐coordinates of PCoA analysis.
Figure 6.7 (a) PCA scores of indicating group centroids of the three group...
Chapter 8
Figure 8.1 Model assessment plot for assessing homogeneity and normality.
Figure 8.2 Boxplots over repetitions within daughters.
Figure 8.3 Histogram of SSQ values for assessor effect. The estimated effect...
Figure 8.4 Loading plot and score plot for Candy effect.
Figure 8.5 Loading plot and score plot for Assessor effect.
Figure 8.6 Score plot for Assessor effect without backprojection and as spid...
Figure 8.7 Data ellipsoids and confidence ellipsoids for the Candy effect.
Figure 8.8 Data ellipsoids and confidence ellipsoids for the Candy effect in...
Figure 8.9 Scoreplot as a function of time for combined effects in ASCA.
Figure 8.10 Loading plot and score plot of Candy effect in APCA model.
Figure 8.11 Loading plot and score plot for global PCA model in PC‐ANOVA.
Figure 8.12 Score plot for between‐individuals effect in MSCA.
Figure 8.13 Score plot for within‐individuals effect in MSCA.
Figure 8.14 Score plot for within‐individuals effect – one factor level at a...
Figure 8.15 Score plot for fixed effect in LiMM‐PCA.
Figure 8.16 Score plot for fixed effect in LiMM‐PCA when using least squares...
Figure 8.17 Plot of Treatment + Time:Treatment for PRC.
Age K. Smilde, Swammerdam Institute for Life Sciences, University of Amsterdam, The Netherlands
and
Department of Plant and Environmental Sciences, University of Copenhagen, Denmark
Federico Marini, Department of Chemistry, University of Rome “La Sapienza”, Italy
Johan A. Westerhuis, Swammerdam Institute for Life Sciences, University of Amsterdam, The Netherlands
Kristian H. Liland, Faculty of Science and Technology, Norwegian University of Life Sciences, Norway
This edition first published 2025
© 2025 John Wiley & Sons Ltd
All rights reserved, including rights for text and data mining and training of artificial intelligence technologies or similar technologies. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, except as permitted by law. Advice on how to obtain permission to reuse material from this title is available at http://www.wiley.com/go/permissions.
The right of Age K. Smilde, Federico Marini, Johan A. Westerhuis, and Kristian H. Liland to be identified as the authors of this work / the editorial material in this work has been asserted in accordance with law.
Registered Offices: John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, USA; John Wiley & Sons Ltd, New Era House, 8 Oldlands Way, Bognor Regis, West Sussex, PO22 9NQ, UK
For details of our global editorial offices, customer services, and more information about Wiley products visit us at www.wiley.com.
The manufacturer's authorized representative according to the EU General Product Safety Regulation is Wiley‐VCH GmbH, Boschstr. 12, 69469 Weinheim, Germany, e‐mail: [email protected].
Wiley also publishes its books in a variety of electronic formats and by print‐on‐demand. Some content that appears in standard print versions of this book may not be available in other formats.
Trademarks: Wiley and the Wiley logo are trademarks or registered trademarks of John Wiley & Sons, Inc. and/or its affiliates in the United States and other countries and may not be used without written permission. All other trademarks are the property of their respective owners. John Wiley & Sons, Inc. is not associated with any product or vendor mentioned in this book.
Limit of Liability/Disclaimer of Warranty
In view of ongoing research, equipment modifications, changes in governmental regulations, and the constant flow of information relating to the use of experimental reagents, equipment, and devices, the reader is urged to review and evaluate the information provided in the package insert or instructions for each chemical, piece of equipment, reagent, or device for, among other things, any changes in the instructions or indication of usage and for added warnings and precautions. While the publisher and authors have used their best efforts in preparing this work, they make no representations or warranties with respect to the accuracy or completeness of the contents of this work and specifically disclaim all warranties, including without limitation any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives, written sales materials or promotional statements for this work. The fact that an organization, website, or product is referred to in this work as a citation and/or potential source of further information does not mean that the publisher and authors endorse the information or services the organization, website, or product may provide or recommendations it may make. This work is sold with the understanding that the publisher is not engaged in rendering professional services. The advice and strategies contained herein may not be suitable for your situation. You should consult with a specialist where appropriate. Further, readers should be aware that websites listed in this work may have changed or disappeared between when this work was written and when it is read. Neither the publisher nor authors shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.
Library of Congress Cataloging‐in‐Publication Data applied for:
Hardback ISBN: 9781394211210
Cover Design: Wiley
Cover Image: Courtesy of Age K. Smilde, Federico Marini, Johan A. Westerhuis, and Kristian H. Liland
Age K. Smilde is an Emeritus Professor of Biosystems Data Analysis at the Swammerdam Institute for Life Sciences, University of Amsterdam. He also holds a part‐time position at the Department of Plant and Environmental Sciences, University of Copenhagen. His research interests include modeling complex biological systems using multiway, multiset, and latent‐path models.
Federico Marini is a Professor of Analytical Chemistry at the Department of Chemistry, University of Rome “La Sapienza”. His main research interest includes model development, particularly for classification, variable selection, non‐linear modeling, and data fusion, with applications in spectroscopy and imaging, food science, and various omics.
Johan A. Westerhuis is an Assistant Professor of Biosystems Data Analysis at the Swammerdam Institute for Life Sciences, University of Amsterdam. His main research is in the development and validation of chemometric and machine learning tools and their application to high‐dimensional metabolomics and microbiome data.
Kristian H. Liland is a Professor of Statistics at the Faculty of Science and Technology, Norwegian University of Life Sciences. His main research interests include model development and applications using multivariate statistics and machine learning within spectroscopy, bioinformatics, various omics, health data, and industrial applications.
The reason for studying a biological system is the desire to acquire an understanding of that system. The classical scientific royal road proceeds via experimental designs, accompanied by statistical analyses of variance. By realizing that manipulations can affect the system in different ways, the necessity of studying experimental effects on multiple aspects jointly became eminently clear. With the recent major developments in measurement instruments, it is within common reach to collect high‐quality data on very many variables, which are supposed to be indicative of different aspects of a system.
The key challenge then becomes to unravel the nature of the experimental effects from the observed high‐dimensional data. This is exactly where analysis of variance for high‐dimensional data comes in. Its basic idea is simple – combine analysis of variance with dimension reduction – but the decisions required to obtain meaningful results are intricate. This book equips the reader with the necessary knowledge to make them. It brings a unified treatment of the methods, carefully built upon the mathematical and conceptual foundations introduced. It discusses the properties of the methods and their implications for empirical use, supported by clarifying figures. Further, it treats the similarities and differences between the various methods and their implications. It discusses all essential aspects required for insightful modeling, including preprocessing of the data, model diagnostics, statistical testing, validation, and visualization.
The empirical use of the methods is illustrated with examples of high‐dimensional experimental data from various fields in the life, food, and chemical sciences. These inspiring examples provide the reader with a good sense of the line of reasoning guiding the specific choices to be made and show what can be learned from high‐dimensional experimental data. Further inspiration can be gained from the references provided in the review of currently available applications. The kind invitation to the reader to apply and profit from the methods is made complete by the overview of the available software implementations of the HD‐ANOVA methods, and specifically the gentle introduction to the R package HDANOVA.
Though the book is clearly rooted in and focused on the life, food, and chemical sciences, it is also of great value to empirical scientists in other fields, such as the behavioral and social sciences and economics. After all, the principles of and methods for studying a system based on multivariate data resulting from an experimental design are generally applicable and provide a rapid path to knowledge about the system. The book provides a state‐of‐the‐art overview and thereby clearly indicates the open issues that deserve attention. I express the hope that the book will also serve as a starting point for method developers. Specifically, the framework of restricted multivariate multiple regression offers ample opportunities for further useful developments. This will provide method developers with a deeper understanding of the methods themselves, as well as enrich our method toolbox – with the ultimate aim of helping us understand our world and the ways we can fruitfully contribute to it.
Marieke Timmerman
Groningen, The Netherlands
September 2024
In many cases in the life, food, and chemical sciences, measurements of a system are performed under various experimental conditions generated according to an experimental design, which provides the samples with a grouping structure. Traditionally, such data are analyzed by analysis of variance (ANOVA) and generalizations thereof, such as linear mixed models. These methods are, however, no longer on par with the increasing complexity of the measurements: in many cases, a large number of variables is measured, and there is even a tendency to measure not just one set of related variables but multiple sets pertaining to the same problem, such as gene expression and metabolomics measurements on the same samples.
Analyzing such complex data requires methods that go beyond ANOVA. In statistics, a generalization of ANOVA to multivariate data is Multivariate ANOVA. This method, however, breaks down when the number of variables exceeds the number of samples, which is often the case with modern measuring devices. About 20 years ago, a method called Analysis of Variance Simultaneous Component Analysis (ASCA) was proposed, which attempts to remedy this problem and can be used to analyze high‐dimensional data with a grouping structure. This method has undergone many changes and generalizations over the last two decades, which urgently calls for a book giving an overview of these methods.
This book offers such an overview and summarizes the most important developments in the field of high‐dimensional ANOVA. We try to accommodate readers at different levels. We explain the theory and show when to use which method, underpinned by many examples. This should give practitioners an idea of which method to use when, how to use it, and how to interpret the results. In elaborations, we give more detail for readers who want to go deeper, and the algorithms serve method developers, encouraging them to examine critically whether they can improve on these methods and the corresponding algorithms.
A minimal reading of the book comprises Chapters 1, 4, and 5 for those who are already familiar with matrix algebra and linear models. For those who need to refresh the latter, we propose reading Chapters 2 and 3 as well. We also provide software (R scripts) implementing most of the presented methods.
We are very grateful to all our friendly colleagues, Fentaw Abegaz, Jasper Engel, Fred van Eeuwijk, Peter Filzmoser, Anna Heintz‐Buschart, Margriet Hendriks, Ulf Indahl, Tormod Næs, Jean‐Michel Roger, Edoardo Saccenti, Joakim Skogholt, and Raffaele Vitale, who took the time to proofread the chapters of our book. We are also grateful to Marieke Timmerman who wrote the foreword. Needless to say, all remaining errors are our responsibility!
Age K. Smilde
Utrecht, The Netherlands
September 2024
Federico Marini
Rome, Italy
September 2024
Johan A. Westerhuis
Dronten, The Netherlands
September 2024
Kristian H. Liland
Ås, Norway
September 2024
The amount of data in the life, food, and chemical sciences is exploding. This is mainly due to the rapid improvement and automation of measuring instruments and the development of new measuring devices. It has also raised the expectations of scientists to receive answers to increasingly complex questions. To fill the gap between the large amounts of data generated on the one hand and the complex questions on the other, advanced data analysis methods are needed.
Without claiming to be exhaustive, we will give a short overview of the kinds of data structures presently being generated in the life sciences. We will generally use the term life sciences, although applications from the food and chemical sciences will also be provided. The first important distinction between data sets is shown in Figure 1.1. Tall or skinny data sets – Figure 1.1a – are characterized by having many samples or cases and relatively few variables. Data from cohort studies are examples of this; such data are also sometimes called multivariate data. In contrast, there are short or fat data sets, as in Figure 1.1b, in which the number of variables (far) exceeds the number of samples. These data sets are called high‐dimensional. In the life sciences, the latter type of data is by far the most common, since collecting samples is usually costly or time‐consuming, while modern sequencing techniques and mass‐spectrometry instruments easily generate thousands of features. In this book, we will discuss methods for the analysis of both multivariate and high‐dimensional data. We will always use the convention of having the samples in the rows of matrices and the variables or features in their columns.
Figure 1.1 Two different types of data sets: (a) long, tall, or skinny, (b) short or fat.
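The tall‐versus‐fat distinction can be made concrete with a small sketch. The following is a minimal, illustrative NumPy example (the sizes and array names are ours, not from the book), following the convention of samples in rows and variables or features in columns:

```python
import numpy as np

rng = np.random.default_rng(0)

# Tall/skinny (multivariate) data: many samples, few variables.
tall = rng.normal(size=(1000, 8))   # 1000 samples x 8 variables

# Short/fat (high-dimensional) data: far more features than samples.
fat = rng.normal(size=(20, 5000))   # 20 samples x 5000 features

# Samples are in the rows, so the shape immediately tells the story.
n_samples, n_features = fat.shape
print(n_samples < n_features)       # True: high-dimensional
```

With such a fat matrix, classical multivariate methods that invert a covariance matrix break down, which is the motivation for the dimension-reduction-based methods treated later in the book.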
In many cases in the life sciences, there are multiple data blocks being measured simultaneously on the same or on comparable systems. This can result in different data set configurations, as shown in Figure 1.2. If different types of variables are measured for the same samples, then the data can be fused in the common sampling mode (Figure 1.2a). If the same variables are measured on different (but comparable) sets of samples, then the data can be fused in the common variable mode (Figure 1.2b). The methods to perform such data fusions are manifold, and some of these are explained in Smilde et al. (2022).
Figure 1.2 Two different types of data fusion: (a) along the sample mode or (b) along the variable mode.
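The two fusion configurations can be sketched as simple matrix concatenations. This is an illustrative NumPy example under assumed block sizes (not code from the book): fusing along the sample mode places blocks side by side over shared rows; fusing along the variable mode stacks blocks over shared columns.

```python
import numpy as np

rng = np.random.default_rng(1)

# Case (a): two variable blocks measured on the SAME 10 samples
# (e.g. a metabolome block and a transcriptome block).
block_a = rng.normal(size=(10, 50))
block_b = rng.normal(size=(10, 200))
fused_samples = np.hstack([block_a, block_b])   # shape (10, 250)

# Case (b): the SAME 30 variables measured on two comparable sample sets.
set_1 = rng.normal(size=(12, 30))
set_2 = rng.normal(size=(18, 30))
fused_variables = np.vstack([set_1, set_2])     # shape (30, 30)
```

Note that fusing along the sample mode makes the fused matrix even fatter, so the high-dimensionality issue becomes more pronounced rather than less.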
Yet other data structures can be envisaged, such as shown in Figure 1.3. The separate data blocks may occur in a certain topology as can be inferred from substantive arguments (Figure 1.3a). This calls for approaches using pathway analysis such as path‐models (Kaplan, 2008). In other cases, the data can be arranged in a three‐way array (Figure 1.3b) and should then be analyzed with three‐way methods (Smilde et al., 2005; Kroonenberg, 2008). An example is the measurement of metabolites at different time points on the same individuals during a challenge test (Vis et al., 2015).
Figure 1.3 Data blocks arranged in a pathway topology (a) or in a three‐way array (b).
In science, data are always embedded in an environment of prior knowledge. This is visualized in Figure 1.4 where an example from life science is chosen. Suppose that for a group of people, samples are taken and their microbiome, metabolome, and transcriptome are measured. For the microbiome, there may be information available regarding the phylogeny of the bacteria and viruses involved; this information can be available at the genus or species level. For the metabolome, there may be information present about metabolic pathways and networks or knowledge on metabolites that are disease‐related. For the transcriptome, there may be information available on gene regulation, pathways, and also disease‐related gene‐expression patterns. On a general level, databases may be available, related cohort data, and of course biochemical knowledge.
Figure 1.4 Prior knowledge available for a data set.
Often there is also prior knowledge regarding a structure across the samples when analyzing multivariate or high‐dimensional data. This can be knowledge regarding a certain grouping structure (e.g., family relations), meta‐data (disease status, lifestyle variables, diet information), or a clustering of the samples resulting from a preliminary cluster analysis. Another type of knowledge is whether groups of samples are related to certain treatments. This can be in the form of a design that is used to generate a sampling structure. Very many types of designs are in use in the life sciences, and all of these introduce a structure in the sampling mode of the data. The main goal of this book is to introduce, discuss, and compare data analysis methods that are able to estimate high‐dimensional treatment effects from such data, where various treatment factors have been used to induce differences between groups of subjects. These methods focus on partitioning the treatment‐induced variation according to the different factors used in the design. Explorative methods such as Principal Component Analysis (PCA) (Pearson, 1901), t‐distributed Stochastic Neighbor Embedding (tSNE) (Van der Maaten and Hinton, 2008), or Uniform Manifold Approximation and Projection (UMAP) (McInnes et al., 2018), in contrast, analyze the total variation as a whole and are not able to attribute the variation to specific causes.
We are going to discuss methods that can take into account a possible design or grouping structure underlying the sampling mode. Such information on design or grouping is usually encoded in the design matrix with zeros and ones (see Figure 1.5). Different types of codings exist, and these will be discussed in much detail in Chapter 3. This holds for encoding qualitative variables; for quantitative variables this works differently, as discussed in detail in Draper and Smith (1998). There may be a whole variety of designs underlying the sampling mode. In terms of classical statistical design, these can be crossed designs based on fixed effects, cross‐over designs, and nested designs to name a few. All these types of designs are used in the life sciences. We will also discuss random effect and mixed effect models, since these are also used in these fields.
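As a minimal sketch of such a zero/one encoding (the treatment labels below are hypothetical), a qualitative factor can be expanded into one indicator column per level:

```python
import pandas as pd

# Hypothetical grouping of six samples into three treatment levels
samples = pd.DataFrame({"treatment": ["A", "A", "B", "B", "C", "C"]})

# Indicator ("dummy") coding: one 0/1 column per treatment level
X = pd.get_dummies(samples["treatment"], dtype=int)
print(X)
```

Each row contains exactly one 1, marking the level that sample received; Chapter 3 discusses other coding schemes (e.g., sum coding) derived from this basic indicator matrix.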
Figure 1.5 Design information about the samples: (a) encoding for skinny data and (b) encoding for fat data.
In this book we will introduce univariate Analysis of Variance (ANOVA) models as background information, but the focus is on high‐dimensional ANOVA approaches. Many approaches in data science use univariate analyses with a subsequent multiple testing correction. Since we feel that these methods cannot truly be called multivariate, we will not discuss those. In the rest of the book, we will refer to the methods that use a multivariate approach to analyzing high‐dimensional data with the generic term high‐dimensional ANOVA or HD‐ANOVA for short.
In this section, we will discuss the main concepts in statistical design that are necessary to understand the various approaches for analyzing the data such designs provide. We will use the definitions reported in Casella (2008). The main goal of a statistical design is to systematically investigate the effect size of different levels of one or more treatment factors on the experimental units and to compare this effect size to the natural variation between experimental units. The experimental unit is the unit that is randomly assigned to a treatment. This can be, e.g., a subject, an animal, a plant, or a cage. The experimental unit is not always the same as the sampling or observational unit. Multiple observations can be obtained from one experimental unit, but they depend on that experimental unit and, therefore, special care needs to be taken in the statistical analysis. As an example, multiple neuronal cells (observational units) can be measured in the same mouse (experimental unit) to evaluate the effect of the different treatment levels applied to the mice.
The variation between the experimental units is expected to be homogeneous. This means that if none of the treatments has an effect on the response, all response values come from the same distribution. However, the use of different batches, different plots, measurements on different days, or unknown confounding variables can lead to systematic differences between the experimental units. To reduce bias and maximize the precision of treatment comparisons, three key design principles are used.
Replication
: The separate levels of the treatment factor are applied to more than one experimental unit to average out differences between experimental units. Taking multiple measurements (observational units) from each experimental unit can lead to pseudo‐replication and give the impression that the variation between experimental units is smaller than it actually is.
Randomization
: The treatment levels are randomly applied to the experimental units to ensure a fair assessment of the different levels of the treatment factor without bias. Randomization guards against unknown confounders such as lifestyle or differing amounts of light in a greenhouse.
Blocking
: If there are known sources of bias, such as the use of different microarrays, blocks of experimental units can be defined in which the variation between experimental units within a block is homogeneous. This is also referred to as local control. As an example, take an experiment in which three types of fertilizer are compared. Each fertilizer is added to 10 randomly selected pots of soil, and each pot holds 3 plants. The 30 pots are randomly assigned to 2 greenhouse benches, in such a way that an equal number of pots with each fertilizer is placed on each bench. In this example, the 2 benches are the blocks and the pots are the experimental units, while the plants are observational units.
The total variation between the pots is due to the different types of fertilizer, due to the specific greenhouse bench, and due to the natural variation between the pots. The part of the design that takes care of how the different levels of the treatment factor are distributed over the experimental units is called the treatment design. The part of the design that takes care of the blocking, randomization, and replication structure is called the experimental design. In most omics experiments, samples of each experimental unit are obtained and stored. The actual measurement of the samples takes place later and can suffer from the same biases as explained above. To reduce bias and variation between experimental units, blocking and randomization are used to deal with measurements taken in different batches over different days. The definition of when and how the measurements are performed (sometimes called the measurement design) is part of the experimental design. The combination of the treatment design and experimental design is called the statistical design.
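The fertilizer example above can be sketched as a randomization that is restricted to blocks: each bench receives five pots of each fertilizer, and only the placement of pots within a bench is randomized. The bench labels and random seed below are of course hypothetical.

```python
import random

random.seed(1)
fertilizers = ["F1", "F2", "F3"]

# Each bench (block) receives 5 pots of each fertilizer (15 pots per bench);
# the order of pots is randomized *within* each bench, not across benches.
layout = {}
for bench in ["bench1", "bench2"]:
    pots = [f for f in fertilizers for _ in range(5)]
    random.shuffle(pots)          # randomization restricted to the block
    layout[bench] = pots

for bench, pots in layout.items():
    print(bench, pots)
```

Because the counts per fertilizer are fixed within each bench, bench‐to‐bench differences cannot be confounded with fertilizer effects, which is exactly the restriction on randomization described in the text.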
The treatment design defines how the different levels of each treatment factor are assigned to the experimental units. If there are two or more treatment factors, these can be crossed or nested. For crossed treatment factors, the exact same levels of one factor are repeated for each level of the other factor. If plants are subjected to different temperatures and to different light conditions, the effects of these crossed treatment factors can be observed independently, as well as their interaction. Nested treatment factors describe hierarchical relationships between treatments: treatments are nested if different levels of Factor B appear within the levels of Factor A. The different levels of A can be tested separately, but comparing the levels of B can only be done within a specific level of Factor A. Nested factors therefore have no interaction. If a treatment (Factor A) is applied to a plant and 4 leaves (Factor B) are measured, then Factor B is nested in A, as the variation has two sources: the treatment level and the plant‐specific leaf level. A general leaf effect cannot be observed, as it depends on which plant (Factor A) was used.
Often the experimental unit is also considered as a treatment factor. This factor is usually nested inside another treatment factor. For example, the pot experimental unit is nested inside the fertilizer treatment factor as for each type of fertilizer different pots are used. Sometimes the experimental unit factor can be crossed, for example in a cross‐over design where each subject (experimental unit) takes all different diets (levels of the diet treatment factor) in different order, to correct for individual variation.
In the fertilizer example there was randomization of pots inside the blocks, but not over the blocks. This means that within each block, it is made sure that each type of fertilizer has the same number of pots. The blocking structure thus puts a restriction on the randomization structure and, therefore, it changes the way of estimating the experimental variability. In models that describe the effects of different treatments on a response, usually, a distinction is made between factors that describe how the intended variation, e.g., the different types of fertilizer, is distributed over the experimental units, and factors that describe the blocking and randomization structure that is needed to obtain homogeneous variation for the experimental units. The distinction between these factors is further discussed in Chapter 3.
For each design, the different factors of the experimental part and the treatment part will have a crossed or nested structure. This needs to be clarified and considered when analyzing the data. Many predefined designs exist. The simplest design is the Completely Randomized Design (CRD) (Casella, 2008). In this design, no blocking structure is present, and the random allocation of treatments to the experimental units is not restricted in any way. An easy extension to situations where blocks are needed to correct for heterogeneity in the experimental units is the Randomized Complete Block design (RCB) (Casella, 2008), where all treatments are applied in equal amounts within each block. In this book, we mainly deal with quantifying the effects of the different treatments in the treatment design. When the part that deals with the blocking and replication structure is also taken into account, this will be mentioned explicitly.
The unbiased estimation of treatment effects depends on how the experiments are performed with respect to blocking, randomization, and replication, as well as on whether treatments are crossed or nested and whether their effects are considered fixed or random. Each combination of these choices gives rise to a different model, and the effects can be calculated with ANOVA or linear models. In ANOVA, F‐tests are used to assess the significance of treatment effects; the F‐statistic is the ratio of the variation explained by the treatments (in the numerator) to the residual variation (in the denominator). The treatment design gives information on the degrees of freedom (related to the number of levels used for each factor), while the experimental design gives information on the expected mean squares of the residual variation and its degrees of freedom. More details will be provided in Chapter 3.
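The F‐ratio just described can be illustrated for a single factor with simulated data; the group means, sample sizes, and seed below are hypothetical, and the hand‐computed statistic is cross‐checked against `scipy.stats.f_oneway`.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Simulated univariate response for K = 3 treatment levels, n = 10 each;
# the third level is shifted to create a real treatment effect
groups = [rng.normal(0.0, 1.0, 10),
          rng.normal(0.0, 1.0, 10),
          rng.normal(1.5, 1.0, 10)]

N, K = 30, 3
grand = np.mean(np.concatenate(groups))
ss_treat = sum(len(g) * (g.mean() - grand) ** 2 for g in groups)  # between groups
ss_resid = sum(((g - g.mean()) ** 2).sum() for g in groups)       # within groups

# F = (treatment mean square) / (residual mean square)
F = (ss_treat / (K - 1)) / (ss_resid / (N - K))
F_scipy, p = stats.f_oneway(*groups)
print(F, F_scipy, p)
```

The degrees of freedom K − 1 and N − K come directly from the treatment design (number of levels) and the replication structure, as stated above.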
The title of this book already refers to the biggest hurdle to overcome: high‐dimensional data. There are very many ways in which measurements can become high‐dimensional in the life sciences. In omics research, it is not unusual to have thousands of variables (or features) measured per sample. This happens in RNAseq (orders of 10,000–30,000), in methylation (orders of 50,000–100,000), and in metabolomics, proteomics, and microbiomics (orders of 1000–10,000). Vibrational spectroscopy also generates many variables. One of the first problems of high‐dimensional data is the curse of dimensionality, see Elaboration 1.1.
The curse of dimensionality refers to the fact that given a certain sample size, increasing the number of measurements per sample will decrease the density of the sample points in the row‐space (in which the samples are points) dramatically. This is illustrated in Figure 1.6. Suppose that we have 20 samples and consider the number of samples in a unit bin. In Figure 1.6a, we see that the unit bin contains half of the samples on average. By adding a dimension (Figure 1.6b), now a quarter of the samples are in a unit bin. Going up to three dimensions decreases the number of samples in a unit bin to about one‐eighth (Figure 1.6c). Hence, in the limit, most unit bins contain one or even zero samples.
Figure 1.6 The curse of dimensionality. For an explanation, see the text.
Source: Reproduced from Parsons et al. (2004)/with permission of ACM (The Association for Computing Machinery).
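The halving argument above can be checked numerically. Drawing points uniformly on [0, 2]^d, the expected fraction falling inside a unit bin [0, 1]^d is (1/2)^d; we use many points rather than the 20 samples of the figure to get a stable estimate.

```python
import numpy as np

rng = np.random.default_rng(42)

# Draw many points uniformly on [0, 2]^d and count the fraction that falls
# inside a unit bin [0, 1]^d; in expectation this is (1/2)^d
n = 100_000
for d in (1, 2, 3):
    points = rng.uniform(0.0, 2.0, size=(n, d))
    frac = np.mean(np.all(points < 1.0, axis=1))
    print(d, frac)   # fractions near 0.5, 0.25, 0.125
```

With every added dimension the occupancy of a fixed‐size bin halves, so in high dimensions most bins contain one sample or none, as stated in the text.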
The curse of dimensionality, as presented above, represents the worst‐case scenario since it does not account for correlation. Suppose for example that we perform near‐infrared (NIR) measurements on a chemical system where around 200 variables are measured per sample. Assume that the chemical system contains three absorbing constituents and that Beer's law holds (Christian and O'Reilly, 1988). Then the data set has rank three, apart from some noise which is low in NIR spectroscopy. Stated otherwise, there is a high amount of correlation between the variables in such data.
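A small simulation (with hypothetical dimensions, concentrations, and noise level) shows how Beer's law induces low rank: mixture spectra are concentrations times pure‐component spectra, so the data matrix has rank equal to the number of absorbing constituents, apart from noise.

```python
import numpy as np

rng = np.random.default_rng(7)

# Beer's law sketch: mixture spectra = concentrations @ pure spectra.
# 25 samples, 3 absorbing constituents, 200 wavelengths (all hypothetical)
C = rng.uniform(0.1, 1.0, size=(25, 3))            # concentrations
S = rng.uniform(0.0, 1.0, size=(3, 200))           # pure-component spectra
X = C @ S + rng.normal(0.0, 1e-4, size=(25, 200))  # low noise, as in NIR

sv = np.linalg.svd(X, compute_uv=False)
# Three dominant singular values; the rest are at the noise level
print(sv[:5])
```

The large gap after the third singular value is the high correlation between variables mentioned above: 200 measured wavelengths, but effectively only three dimensions of variation.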
Obviously, the characteristics of a data set in terms of the curse of dimensionality depend on the system and the type of measurements performed. Usually, omics data (RNAseq, metabolomics, proteomics and microbiomics) are not of low rank, have relatively high noise levels, and are measured on complex systems. Chemical systems and spectroscopic measurements are less complex in terms of rank and measurement error.
Another aspect to consider is the relevance of the measured variables. In omics data, it is not unusual that only a small number of the variables are really important for the research question. This calls for variable selection, which is not an easy task for high‐dimensional data. Another route is to use regularization approaches that induce sparseness in the solution, thereby reducing the number of variables to (hopefully) the most relevant ones (Tibshirani, 1996). The selection of relevant variables is further complicated when there is grouping in the samples and a different set of variables is relevant for each group (Friedman and Meulman, 2004). Hence, variable selection and regularization approaches are not trivial for certain kinds of high‐dimensional data.
One of the mainstream methods for analyzing multivariate responses obtained through a statistical design is multivariate analysis of variance (MANOVA) (Mardia et al., 1979). Although it may work for skinny data (see Figure 1.5a) this method breaks down for high‐dimensional data, see Elaboration 1.2.
Consider an experiment where one factor is varied at K levels and measurements are performed on nₖ samples at each level k = 1, …, K. Many data analysis methods calculate a pooled within‐sum‐of‐squares matrix W, and this matrix needs to be non‐singular since it is used in subsequent calculations in the form of W⁻¹. This pooled within‐sum‐of‐squares matrix is obtained as a weighted average of the within‐sum‐of‐squares matrices per group, which are indicated by Wₖ. Each matrix Wₖ has maximum rank nₖ − 1.¹ An upper bound on the rank of W is then Σₖ (nₖ − 1) = N − K, where N is the total number of samples (Schott, 2016). For high‐dimensional data, this rank is still (much) lower than the number of variables, so that W is rank‐deficient. Solutions for this problematic issue are found in the use of low‐dimensional approximations of the original data or shrinkage approaches for the covariance matrix (Schäfer and Strimmer, 2005).
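This rank bound is easy to verify numerically; the group sizes and number of variables below are arbitrary, chosen so that the number of variables exceeds the total degrees of freedom.

```python
import numpy as np

rng = np.random.default_rng(3)

# K = 3 groups, n = 5 samples each (N = 15), p = 50 variables: p > N - K
K, n, p = 3, 5, 50
W = np.zeros((p, p))
for _ in range(K):
    Xk = rng.normal(size=(n, p))
    Xc = Xk - Xk.mean(axis=0)       # center within the group
    W += Xc.T @ Xc                  # within-sum-of-squares, rank <= n - 1

rank = np.linalg.matrix_rank(W)
print(rank, p)                      # rank is at most N - K = 12, far below p = 50
```

Since the rank of W is bounded by N − K = 12 while W is a 50 × 50 matrix, it cannot be inverted, which is exactly why classical MANOVA breaks down for high‐dimensional data.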
In this section, we will give some examples of high‐dimensional data from different fields that are collected according to an underlying treatment design. This serves to show the growing abundance of such data structures. A more detailed review of papers using an HD‐ANOVA method is given in Chapter 7.
Metabolomics is a field of growing interest in the life sciences. It concerns the measurements of the level of small biochemical compounds in biological systems. The term metabolomics may refer to the measurements themselves, mostly performed by advanced instrumental analysis (see Elaboration 1.3), but also to the subsequent data analysis and biological interpretation. In our case, we are concerned with the processing of the data generated in a metabolomics experiment. Typically, these data need to be preprocessed to arrive at so‐called clean data. This preprocessing is meant to remove batch effects, noise, and other artifacts that affect the data and depend on the setup of the experiment and the instrument used. There is a host of literature on data preprocessing of metabolomics data, and we will not discuss this in this book. Hence, we will assume that we have clean data albeit that these data may still have problems (see Section 1.5).
Depending on the type of application, metabolomics samples (e.g., blood plasma or plant material) typically need some sample work‐up. After that, they can be analyzed by an instrumental technique. The three techniques most in use are NMR, LC‐MS, and GC‐MS. Nuclear Magnetic Resonance (NMR) is the fastest and easiest to perform, but it has a low sensitivity and limited selectivity. With a limit of detection of about 1–5 μM, the concentrations of the metabolites need to be reasonably high, and thus the number of separate metabolites that can be distinguished is relatively low (in the order of 100–200) (Wishart, 2008). Furthermore, the amount of sample needed for the measurement is rather large (on the order of hundreds of μL). Liquid chromatography (LC) and gas chromatography (GC) coupled with mass spectrometry (LC‐MS and GC‐MS) are much more sensitive and selective. Usually, these methods are either geared toward measuring a predefined class of compounds (targeted analysis) or used to cover a large range of metabolites (untargeted), most of which are then unidentified. In the latter case, preprocessing the raw data to arrive at clean data is really cumbersome. The LC‐MS and GC‐MS analyses also need accurate sample work‐up and are more elaborate to perform than NMR.
Metabolomics is used in a wide variety of applications in the life sciences; for a review see Chapter 7. It is important to distinguish the different organismal levels that can be subjected to a metabolomics experiment. Such experiments can be performed at the cellular level, such as in biotechnology and human cell‐lines. In such cases, biological interpretation of the results may be phrased in terms of the metabolic network of the cellular systems. In most cases, in human and animal studies, the metabolome of body fluids is measured, such as in blood, urine, saliva, fecal fluid, or cerebrospinal fluid (CSF). Clearly, in such cases, biological interpretation is more difficult since there are no direct relationships of such biological compartments with cellular metabolic networks. An intermediate level at which measurements can be performed is in tissue. This can be muscle tissue from humans and animals or specific tissues from plants, e.g., the leaves or roots; this may also include a spatial dimension to the data. An example of metabolomics using the analysis of variance simultaneous component analysis (ASCA) method (see Chapter 4) is given in Example 1.1.
In the field of cell differentiation, specifically, the differentiation of types of stem cells into hepatocytes, metabolomics measurements can be used to quantify the different stages of the differentiation (Moreno‐Torres et al., 2022). Figure 1.7 shows the treatment design of the study. Briefly, during a time‐course, stem cells are stimulated in different ways and turn into hepatocytes. Along the way, they are also subjected to different renewed Medium compositions. Metabolites are obtained at different Days and measured with LC‐MS.
Figure 1.7 The treatment design of the hepatocyte differentiation study. Details of the cell differentiation strategy and assessment of the influence of transcription factors on the cell's metabolome. Cells are treated with different Medium compositions (LDM, LDM‐AA, and LDM‐AAGly) and analyzed at different Days during the differentiation process. For detailed explanation, see the text.
Source: Reproduced from Moreno‐Torres et al. (2022)/with permission of American Chemical Society.
An ASCA model (see Chapter 4) was used to explore the metabolite differences due to the different Cell types, the different Time points, and the different renewed Medium compositions. The results are shown in Figure 1.8.
The figure shows spider score plots that connect the experiments of the same treatment level. In spider plots, the center of the spider shows the overall effect size and the arms represent individual measurements (see Section 8.6.1.3). The position of a measurement represents a summary of the levels of all metabolites. Dots in the same color represent replicates for the level of the treatment indicated. In Figure 1.8a, the three cell types are indicated by color, and a clear difference can be observed between them. In Figure 1.8b, we can see that the metabolite levels are rather different for the different Medium compositions. The larger the separation between the levels, the more different they are. Besides the main factors shown in Figure 1.8, significant interactions between Cell type and Time as well as between Medium composition and Time were reported.
Figure 1.8 The ASCA results of the hepatocyte study: (a) cell type scores plot with different types of cells, WT, HC3X, and HC6X; (b) culture media scores plot with different Media, Undiff, LDM, LDM‐AA, and LDM‐AAGly. These are spider plots in which each center of the spider shows the overall effect size and the arms represent individual measurements.
Source: Reproduced from Moreno‐Torres et al. (2022)/with permission of American Chemical Society.
Genomics is a huge field that encompasses many aspects of the genome, such as the makeup of the DNA, epigenetics (e.g., methylation patterns of the DNA), and gene expression. HD‐ANOVA analyses are mainly applied to gene‐expression measurements, which are performed with micro‐arrays and modern sequencing methods, especially RNAseq. These types of measurements can easily produce thousands of variables for a single sample. One of the complications with modern RNAseq measurements is that the data need to be preprocessed prior to analysis because of differences in library sizes, sometimes referred to as sequencing depth, i.e., the number of sequenced bases per sample. A normalization step is necessary to make the samples comparable (Evans et al., 2018) (see Section 1.5.1). There are many options and no clear consensus on how to do this. Normalization is one remedy for a type of incomparability that will be explained in more detail in Section 1.5.5; such problems are also encountered in other omics measurements, for example in metabolomics (Uh et al., 2020) and microbiome data analysis (Lin and Peddada, 2020). An example of the use of ASCA for RNAseq data is given in Example 1.2. For more applications of HD‐ANOVA methods, see Chapter 7.
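As one simple sketch of such a normalization — counts per million (CPM), only one of the many options mentioned above — library sizes can be scaled out as follows; the count table is invented for illustration.

```python
import numpy as np

# Hypothetical count table: 4 samples (rows) x 6 genes (columns) with very
# different library sizes (total counts per sample)
counts = np.array([
    [100, 200, 300, 400, 0, 1000],
    [ 10,  20,  30,  40, 0,  100],
    [ 50, 100, 150, 200, 0,  500],
    [  5,  10,  15,  20, 0,   50],
], dtype=float)

lib_size = counts.sum(axis=1, keepdims=True)   # sequencing depth per sample
cpm = counts / lib_size * 1e6                  # counts per million

# After scaling, samples with proportional count profiles become identical
print(cpm[0], cpm[1])
```

In this toy table the four samples have the same relative expression profile at very different depths, so all four CPM rows coincide; real normalization choices (TMM, median-of-ratios, etc.) are more involved, as the references above discuss.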
Soft‐scald is an injury often observed in “Honeycrisp” apples (Leisso et al., 2016). Apples of two different categories of Risk for developing soft‐scald were stored for different periods of Time. This results in two factors that can be studied: Risk category and storage Time. RNAseq measurements were performed on the fruit, and an ASCA model was used to study the effect of the factors involved. The factors did not only affect soft‐scald independently; there also appeared to be a strong interaction between them, as shown in Figure 1.9, indicating an early response and a delayed response.
Figure 1.9 The ASCA results of the (Risk:Time) interaction effect of the apple study.
Source: Leisso et al. (2016)/Springer Nature/CC BY 4.0.
Apart from the RNAseq measurements, metabolomics measurements were performed which were also subjected to an ASCA analysis. For more details regarding this study, we refer to the original publication (Leisso et al., 2016).
A microbiome is a community of microbes located in a certain niche. Such communities may include, among others, bacteria, fungi, and viruses. Microbiome research has exploded in the last decade. The topic is also very broad, since microbiomes are literally everywhere. Most papers deal with the human gut microbiome, but there are also papers about the oral microbiome and the skin microbiome. Microbiomes are also being studied in other areas of biology, e.g., in plant roots, soil, and water. There is a growing awareness of the importance of these microbiomes, e.g., in their relation to health in general.
As in the case of metabolomics, a strong driver in microbiome research has been the development of measuring devices. Whereas in the early days HITChips were used (Rajilić‐Stojanović et al., 2009), nowadays advanced sequencing techniques are available to measure the microbiome, such as 16S rRNA sequencing (amplicon sequencing) and metagenome sequencing (Janda and Abbott, 2007). The measured microbiome variables can be expressed in Operational Taxonomic Units (OTUs) or in Amplicon Sequence Variants (ASVs), which are counts for a given observed DNA sequence. Microbiome data are intrinsically different from metabolomics data, because the former represent compositions whereas the latter represent concentrations, see Elaboration 1.4. In some microbiome applications, there is a natural grouping in the subjects, and in some cases, e.g., in the plant sciences, there is a statistical design underlying the samples. In these cases, the use of HD‐ANOVA methods is warranted. In the majority of cases, one of three different routes is followed for microbiome data analysis. These are shown in Figure 1.10.
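The compositional nature of such counts can be sketched with the centered log‐ratio (clr) transform, one common device for compositional data (the counts below are invented, and this is only one of the routes alluded to above):

```python
import numpy as np

# Toy microbiome counts for one sample; only relative information is meaningful
counts = np.array([120.0, 30.0, 50.0])

rel = counts / counts.sum()               # relative abundances (a composition)

# Centered log-ratio (clr) transform: moves a composition into ordinary
# Euclidean space; requires strictly positive parts (zeros need handling)
clr = np.log(rel) - np.log(rel).mean()
print(rel, clr)
```

Note that multiplying all counts by a constant (e.g., a different sequencing depth) leaves both the composition and its clr coordinates unchanged, which is precisely why absolute count values carry no concentration information.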
