Multiblock Data Fusion in Statistics and Machine Learning

Explore the advantages and shortcomings of various forms of multiblock analysis, and the relationships between them, with this expert guide.

Arising out of fusion problems that exist in a variety of fields in the natural and life sciences, the methods available to fuse multiple data sets have expanded dramatically in recent years. Older methods, rooted in psychometrics and chemometrics, also exist. Multiblock Data Fusion in Statistics and Machine Learning: Applications in the Natural and Life Sciences is a detailed overview of all relevant multiblock data analysis methods for fusing multiple data sets. It focuses on methods based on components and latent variables, including both well-known and lesser-known methods with potential applications in different types of problems. Many of the included methods are illustrated by practical examples and are accompanied by a freely available R package.

The distinguished authors have created an accessible and useful guide to help readers fuse data, develop new data fusion models, discover how the involved algorithms and models work, and understand the advantages and shortcomings of various approaches. This book includes:

* A thorough introduction to the different options available for the fusion of multiple data sets, including methods originating in psychometrics and chemometrics
* Practical discussions of well-known and lesser-known methods with applications in a wide variety of data problems
* Included, functional R code for the application of many of the discussed methods

Perfect for graduate students studying data analysis in the context of the natural and life sciences, including bioinformatics, sensometrics, and chemometrics, Multiblock Data Fusion in Statistics and Machine Learning: Applications in the Natural and Life Sciences is also an indispensable resource for developers and users of the results of multiblock methods.
Age K. Smilde
Swammerdam Institute for Life Sciences, University of Amsterdam, Amsterdam, NL, and Simula Metropolitan Center for Digital Engineering, Oslo, NO
Tormod Næs
Nofima, Ås, NO
Kristian Hovde Liland
Norwegian University of Life Sciences, Ås, NO
This edition first published 2022
© 2022 John Wiley & Sons Ltd
All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, except as permitted by law. Advice on how to obtain permission to reuse material from this title is available at http://www.wiley.com/go/permissions.
The right of Age K. Smilde, Tormod Næs and Kristian Hovde Liland to be identified as the authors of this work has been asserted in accordance with law.
Registered Offices
John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, USA
John Wiley & Sons Ltd, The Atrium, Southern Gate, Chichester, West Sussex, PO19 8SQ, UK
Editorial Office
The Atrium, Southern Gate, Chichester, West Sussex, PO19 8SQ, UK
For details of our global editorial offices, customer services, and more information about Wiley products visit us at www.wiley.com.
Wiley also publishes its books in a variety of electronic formats and by print-on-demand. Some content that appears in standard print versions of this book may not be available in other formats.
Limit of Liability/Disclaimer of Warranty
In view of ongoing research, equipment modifications, changes in governmental regulations, and the constant flow of information relating to the use of experimental reagents, equipment, and devices, the reader is urged to review and evaluate the information provided in the package insert or instructions for each chemical, piece of equipment, reagent, or device for, among other things, any changes in the instructions or indication of usage and for added warnings and precautions. While the publisher and authors have used their best efforts in preparing this work, they make no representations or warranties with respect to the accuracy or completeness of the contents of this work and specifically disclaim all warranties, including without limitation any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives, written sales materials or promotional statements for this work. The fact that an organization, website, or product is referred to in this work as a citation and/or potential source of further information does not mean that the publisher and authors endorse the information or services the organization, website, or product may provide or recommendations it may make. This work is sold with the understanding that the publisher is not engaged in rendering professional services. The advice and strategies contained herein may not be suitable for your situation. You should consult with a specialist where appropriate. Further, readers should be aware that websites listed in this work may have changed or disappeared between when this work was written and when it is read. Neither the publisher nor authors shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.
A catalogue record for this book is available from the Library of Congress
Hardback ISBN: 9781119600961; ePDF ISBN: 9781119600985; ePub ISBN: 9781119600992; oBook ISBN: 9781119600978
Cover image: © Professor Age K. Smilde
Cover design by Wiley
Set in 10/12pt WarnockPro-Regular by Integra Software Services Pvt. Ltd, Pondicherry, India
Cover
Title page
Copyright
Foreword
Preface
List of Figures
List of Tables
Part I Introductory Concepts and Theory
1 Introduction
1.1 Scope of the Book
1.2 Potential Audience
1.3 Types of Data and Analyses
1.3.1 Supervised and Unsupervised Analyses
1.3.2 High-, Mid- and Low-level Fusion
1.3.3 Dimension Reduction
1.3.4 Indirect Versus Direct Data
1.3.5 Heterogeneous Fusion
1.4 Examples
1.4.1 Metabolomics
1.4.2 Genomics
1.4.3 Systems Biology
1.4.4 Chemistry
1.4.5 Sensory Science
1.5 Goals of Analyses
1.6 Some History
1.7 Fundamental Choices
1.8 Common and Distinct Components
1.9 Overview and Links
1.10 Notation and Terminology
1.11 Abbreviations
2 Basic Theory and Concepts
2.i General Introduction
2.1 Component Models
2.1.1 General Idea of Component Models
2.1.2 Principal Component Analysis
2.1.3 Sparse PCA
2.1.4 Principal Component Regression
2.1.5 Partial Least Squares
2.1.6 Sparse PLS
2.1.7 Principal Covariates Regression
2.1.8 Redundancy Analysis
2.1.9 Comparing PLS, PCovR and RDA
2.1.10 Generalised Canonical Correlation Analysis
2.1.11 Simultaneous Component Analysis
2.2 Properties of Data
2.2.1 Data Theory
2.2.2 Scale-types
2.3 Estimation Methods
2.3.1 Least-squares Estimation
2.3.2 Maximum-likelihood Estimation
2.3.3 Eigenvalue Decomposition-based Methods
2.3.4 Covariance or Correlation-based Estimation Methods
2.3.5 Sequential Versus Simultaneous Methods
2.3.6 Homogeneous Versus Heterogeneous Fusion
2.4 Within- and Between-block Variation
2.4.1 Definition and Example
2.4.2 MAXBET Solution
2.4.3 MAXNEAR Solution
2.4.4 PLS2 Solution
2.4.5 CCA Solution
2.4.6 Comparing the Solutions
2.4.7 PLS, RDA and CCA Revisited
2.5 Framework for Common and Distinct Components
2.6 Preprocessing
2.7 Validation
2.7.1 Outliers
2.7.1.1 Residuals
2.7.1.2 Leverage
2.7.2 Model Fit
2.7.3 Bias-variance Trade-off
2.7.4 Test Set Validation
2.7.5 Cross-validation
2.7.6 Permutation Testing
2.7.7 Jackknife and Bootstrap
2.7.8 Hyper-parameters and Penalties
2.8 Appendix
3 Structure of Multiblock Data
3.i General Introduction
3.1 Taxonomy
3.2 Skeleton of a Multiblock Data Set
3.2.1 Shared Sample Mode
3.2.2 Shared Variable Mode
3.2.3 Shared Variable or Sample Mode
3.2.4 Shared Variable and Sample Mode
3.3 Topology of a Multiblock Data Set
3.3.1 Unsupervised Analysis
3.3.2 Supervised Analysis
3.4 Linking Structures
3.4.1 Linking Structure for Unsupervised Analysis
3.4.2 Linking Structures for Supervised Analysis
3.5 Summary
4 Matrix Correlations
4.i General Introduction
4.1 Definition
4.2 Most Used Matrix Correlations
4.2.1 Inner Product Correlation
4.2.2 GCD coefficient
4.2.3 RV-coefficient
4.2.4 SMI-coefficient
4.3 Generic Framework of Matrix Correlations
4.4 Generalised Matrix Correlations
4.4.1 Generalised RV-coefficient
4.4.2 Generalised Association Coefficient
4.5 Partial Matrix Correlations
4.6 Conclusions and Recommendations
4.7 Open Issues
Part II Selected Methods for Unsupervised and Supervised Topologies
5 Unsupervised Methods
5.i General Introduction
5.ii Relations to the General Framework
5.1 Shared Variable Mode
5.1.1 Only Common Variation
5.1.1.1 Simultaneous Component Analysis
5.1.1.2 Clustering and SCA
5.1.1.3 Multigroup Data Analysis
5.1.2 Common, Local, and Distinct Variation
5.1.2.1 Distinct and Common Components
5.1.2.2 Multivariate Curve Resolution
5.2 Shared Sample Mode
5.2.1 Only Common Variation
5.2.1.1 SUM-PCA
5.2.1.2 Multiple Factor Analysis and STATIS
5.2.1.3 Generalised Canonical Analysis
5.2.1.4 Regularised Generalised Canonical Correlation Analysis
5.2.1.5 Exponential Family SCA
5.2.1.6 Optimal-scaling
5.2.2 Common, Local, and Distinct Variation
5.2.2.1 Joint and Individual Variation Explained
5.2.2.2 Distinct and Common Components
5.2.2.3 PCA-GCA
5.2.2.4 Advanced Coupled Matrix and Tensor Factorisation
5.2.2.5 Penalised-ESCA
5.2.2.6 Multivariate Curve Resolution
5.3 Generic Framework
5.3.1 Framework for Simultaneous Unsupervised Methods
5.3.1.1 Description of the Framework
5.3.1.2 Framework Applied to Simultaneous Unsupervised Data Analysis Methods
5.3.1.3 Framework of Common/Distinct Applied to Simultaneous Unsupervised Multiblock Data Analysis Methods
5.4 Conclusions and Recommendations
5.5 Open Issues
6 ASCA and Extensions
6.i General Introduction
6.ii Relations to the General Framework
6.1 ANOVA-Simultaneous Component Analysis
6.1.1 The ASCA Method
6.1.2 Validation of ASCA
6.1.2.1 Permutation Testing
6.1.2.2 Back-projection
6.1.2.3 Confidence Ellipsoids
6.1.3 The ASCA+ and LiMM-PCA Methods
6.2 Multilevel-SCA
6.3 Penalised-ASCA
6.4 Conclusions and Recommendations
6.5 Open Issues
7 Supervised Methods
7.i General Introduction
7.ii Relations to the General Framework
7.1 Multiblock Regression: General Perspectives
7.1.1 Model and Assumptions
7.1.2 Different Challenges and Aims
7.2 Multiblock PLS Regression
7.2.1 Standard Multiblock PLS Regression
7.2.2 MB-PLS Used for Classification
7.2.3 Sparse Multiblock PLS Regression (sMB-PLS)
7.3 The Family of SO-PLS Regression Methods (Sequential and Orthogonalised PLS Regression)
7.3.1 The SO-PLS Method
7.3.2 Order of Blocks
7.3.3 Interpretation Tools
7.3.4 Restricted PLS Components and their Application in SO-PLS
7.3.5 Validation and Component Selection
7.3.6 Relations to ANOVA
7.3.7 Extensions of SO-PLS to Handle Interactions Between Blocks
7.3.8 Further Applications of SO-PLS
7.3.9 Relations Between SO-PLS and ASCA
7.4 Parallel and Orthogonalised PLS (PO-PLS) Regression
7.5 Response Oriented Sequential Alternation
7.5.1 The ROSA Method
7.5.2 Validation
7.5.3 Interpretation
7.6 Conclusions and Recommendations
7.7 Open Issues
Part III Methods for Complex Multiblock Structures
8 Complex Block Structures: With Focus on L-Shape Relations
8.i General Introduction
8.ii Relations to the General Framework
8.1 Analysis of L-shape Data: General Perspectives
8.2 Sequential Procedures for L-shape Data Based on PLS/PCR and ANOVA
8.2.1 Interpretation of X1, Quantitative X2-data, Horizontal Axis First
8.2.2 Interpretation of X1, Categorical X2-data, Horizontal Axis First
8.2.3 Analysis of Segments/Clusters of X1 Data
8.3 The L-PLS Method for Joint Estimation of Blocks in L-shape Data
8.3.1 The Original L-PLS Method, Endo-L-PLS
8.3.2 Exo- Versus Endo-L-PLS
8.4 Modifications of the Original L-PLS Idea
8.4.1 Weighting Information from X3 and X1 in L-PLS Using a Parameter α
8.4.2 Three-blocks Bifocal PLS
8.5 Alternative L-shape Data Analysis Methods
8.5.1 Principal Component Analysis with External Information
8.5.2 A Simple PCA Based Procedure for Using Unlabelled Data in Calibration
8.5.3 Multivariate Curve Resolution for Incomplete Data
8.5.4 An Alternative Approach in Consumer Science Based on Correlations Between X3 and X1
8.6 Domino PLS and More Complex Data Structures
8.7 Conclusions and Recommendations
8.8 Open Issues
Part IV Alternative Methods for Unsupervised and Supervised Topologies
9 Alternative Unsupervised Methods
9.i General Introduction
9.ii Relations to the General Framework
9.1 Shared Variable Mode
9.2 Shared Sample Mode
9.2.1 Only Common Variation
9.2.1.1 DIABLO
9.2.1.2 Generalised Coupled Tensor Factorisation
9.2.1.3 Representation Matrices
9.2.1.4 Extended PCA
9.2.2 Common, Local, and Distinct Variation
9.2.2.1 Generalised SVD
9.2.2.2 Structural Learning and Integrative Decomposition
9.2.2.3 Bayesian Inter-battery Factor Analysis
9.2.2.4 Group Factor Analysis
9.2.2.5 OnPLS
9.2.2.6 Generalised Association Study
9.2.2.7 Multi-Omics Factor Analysis
9.3 Two Shared Modes and Only Common Variation
9.3.1 Generalised Procrustes Analysis
9.3.2 Three-way Methods
9.4 Conclusions and Recommendations
9.4.1 Open Issues
10 Alternative Supervised Methods
10.i General Introduction
10.ii Relations to the General Framework
10.1 Model and Focus
10.2 Extension of PCovR
10.2.1 Sparse Multiblock Principal Covariates Regression, Sparse PCovR
10.2.2 Multiway Multiblock Covariates Regression
10.3 Multiblock Redundancy Analysis
10.3.1 Standard Multiblock Redundancy Analysis
10.3.2 Sparse Multiblock Redundancy Analysis
10.4 Miscellaneous Multiblock Regression Methods
10.4.1 Multiblock Variance Partitioning
10.4.2 Network Induced Supervised Learning
10.4.3 Common Dimensions for Multiblock Regression
10.5 Modifications and Extensions of the SO-PLS Method
10.5.1 Extensions of SO-PLS to Three-Way Data
10.5.2 Variable Selection for SO-PLS
10.5.3 More Complicated Error Structure for SO-PLS
10.5.4 SO-PLS Used for Path Modelling
10.6 Methods for Data Sets Split Along the Sample Mode, Multigroup Methods
10.6.1 Multigroup PLS Regression
10.6.2 Clustering of Observations in Multiblock Regression
10.6.3 Domain-Invariant PLS, DI-PLS
10.7 Conclusions and Recommendations
10.8 Open Issues
Part V Software
11 Algorithms and Software
11.1 Multiblock Software
11.2 R package multiblock
11.3 Installing and Starting the Package
11.4 Data Handling
11.4.1 Read From File
11.4.2 Data Pre-processing
11.4.3 Re-coding Categorical Data
11.4.4 Data Structures for Multiblock Analysis
11.4.4.1 Create List of Blocks
11.4.4.2 Create data.frame of Blocks
11.5 Basic Methods
11.5.1 Prepare Data
11.5.2 Modelling
11.5.3 Common Output Elements Across Methods
11.5.4 Scores and Loadings
11.6 Unsupervised Methods
11.6.1 Formatting Data for Unsupervised Data Analysis
11.6.2 Method Interfaces
11.6.3 Shared Sample Mode Analyses
11.6.4 Shared Variable Mode
11.6.5 Common Output Elements Across Methods
11.6.6 Scores and Loadings
11.6.7 Plot From Imported Package
11.7 ANOVA Simultaneous Component Analysis
11.7.1 Formula Interface
11.7.2 Simulated Data
11.7.3 ASCA Modelling
11.7.4 ASCA Scores
11.7.5 ASCA Loadings
11.8 Supervised Methods
11.8.1 Formatting Data for Supervised Analyses
11.8.2 Multiblock Partial Least Squares
11.8.2.1 MB-PLS Modelling
11.8.2.2 MB-PLS Summaries and Plotting
11.8.3 Sparse Multiblock Partial Least Squares
11.8.3.1 Sparse MB-PLS Modelling
11.8.3.2 Sparse MB-PLS Plotting
11.8.4 Sequential and Orthogonalised Partial Least Squares
11.8.4.1 SO-PLS Modelling
11.8.4.2 Måge Plot
11.8.4.3 SO-PLS Loadings
11.8.4.4 SO-PLS Scores
11.8.4.5 SO-PLS Prediction
11.8.4.6 SO-PLS Validation
11.8.4.7 Principal Components of Predictions
11.8.4.8 CVANOVA
11.8.5 Parallel and Orthogonalised Partial Least Squares
11.8.5.1 PO-PLS Modelling
11.8.5.2 PO-PLS Scores and Loadings
11.8.6 Response Oriented Sequential Alternation
11.8.6.1 ROSA Modelling
11.8.6.2 ROSA Loadings
11.8.6.3 ROSA Scores
11.8.6.4 ROSA Prediction
11.8.6.5 ROSA Validation
11.8.6.6 ROSA Image Plots
11.8.7 Multiblock Redundancy Analysis
11.8.7.1 MB-RDA Modelling
11.8.7.2 MB-RDA Loadings and Scores
11.9 Complex Data Structures
11.9.1 L-PLS
11.9.1.1 Simulated L-shaped Data
11.9.1.2 Exo-L-PLS
11.9.1.3 Endo-L-PLS
11.9.1.4 L-PLS Cross-validation
11.9.2 SO-PLS-PM
11.9.2.1 Single SO-PLS-PM Model
11.9.2.2 Multiple Paths in an SO-PLS-PM Model
11.10 Software Packages
11.10.1 R Packages
11.10.2 MATLAB Toolboxes
11.10.3 Python
11.10.4 Commercial Software
References
Index
End User License Agreement
Chapter 1
Figure 1.1 High-level...
Figure 1.2 Idea of dimension...
Figure 1.3 Design of the plant...
Figure 1.4 Scores on the first...
Figure 1.5 Idea of copy number...
Figure 1.6 Plot of the Raman...
Figure 1.7 L-shape data of...
Figure 1.8 Phylogeny of some...
Figure 1.9 The idea of common...
Chapter 2
Figure 2.1 Idea of dimension reduction...
Figure 2.2 Geometry of PCA...
Figure 2.3 Score (a) and loading...
Figure 2.4 PLS validated explained...
Figure 2.5 Score and loading plots...
Figure 2.6 Raw and normalised...
Figure 2.7 Numerical representations...
Figure 2.8 Classical (a) and...
Figure 2.9 Classical (a) and...
Figure 2.10 SCA for two data...
Figure 2.11 The block scores...
Figure 2.12 Two column-spaces...
Figure 2.13 Common and distinct...
Figure 2.14 Common components...
Figure 2.15 Visualisation of a response...
Figure 2.16 Fitted values versus...
Figure 2.17 Simple linear regression...
Figure 2.18 Two-variable multiple...
Figure 2.19 Two component PCA...
Figure 2.20 Illustration of true...
Figure 2.21 Visualisation of bias...
Figure 2.22 Learning curves showing...
Figure 2.23 Visualisation of the...
Figure 2.24 Cumulative explained...
Figure 2.25 Null distribution...
Chapter 3
Figure 3.1 Skeleton of a three-block data...
Figure 3.2 Skeleton of a four-block data...
Figure 3.3 Skeleton of a three-block data...
Figure 3.4 Skeleton of a three-block...
Figure 3.5 Skeleton of a four-block...
Figure 3.6 Topology of a three-block...
Figure 3.7 Topology of a three-block...
Figure 3.8 Different arrangements...
Figure 3.9 Unsupervised combination...
Figure 3.10 Supervised three-set problem...
Figure 3.11 Supervised L-shape problem...
Figure 3.12 Path model structure...
Figure 3.13 Idea of linking two...
Figure 3.14 Different linking...
Figure 3.15 Idea of linking...
Figure 3.16 Different linking...
Figure 3.17 Treating common and distinct...
Chapter 4
Figure 4.1 Explanation of the scale...
Figure 4.2 Topology of interactions...
Figure 4.3 The RV and partial RV...
Figure 4.4 Decision tree for selecting...
Chapter 5
Figure 5.1 Unsupervised analysis...
Figure 5.2 Illustration explaining...
Figure 5.3 The idea of common...
Figure 5.4 Proportion of explained...
Figure 5.5 Row-spaces visualised...
Figure 5.6 Difference between weights...
Figure 5.7 The logistic function...
Figure 5.8 CNA data visualised...
Figure 5.9 Score plot of the CNA...
Figure 5.10 Plots for selecting...
Figure 5.11 Biplots from PCA-GCA...
Figure 5.12 Amount of explained...
Figure 5.13 Amount of explained...
Figure 5.14 Scores (upper part)...
Figure 5.15 ACMTF as applied...
Figure 5.16 True design used...
Figure 5.17 True design used...
Figure 5.18 Example of the properties...
Figure 5.19 Quantification of modes...
Figure 5.20 Linking the blocks...
Figure 5.21 Decision tree...
Figure 5.22 Decision tree for...
Chapter 6
Figure 6.1 ASCA decomposition...
Figure 6.2 A part of the ASCA...
Figure 6.3 The ASCA scores...
Figure 6.4 The ASCA scores on the factor...
Figure 6.5 The ASCA scores on the interaction...
Figure 6.6 PCA on toxicology data...
Figure 6.7 ASCA on toxicology data...
Figure 6.8 PARAFASCA on toxicology...
Figure 6.9 Permutation example...
Figure 6.10 Permutation test...
Figure 6.11 ASCA candy scores...
Figure 6.12 ASCA assessor scores...
Figure 6.13 ASCA assessor and candy...
Figure 6.14 PE-ASCA of the NMR metabolomics...
Figure 6.15 Tree for selecting...
Chapter 7
Figure 7.1 Conceptual illustration...
Figure 7.2 Illustration of link...
Figure 7.3 Cross-validated explained...
Figure 7.4 Super-weights (w) for the...
Figure 7.5 Block-weights (wm) for first...
Figure 7.6 Block-scores (tm, for left...
Figure 7.7 Classification by regression...
Figure 7.8 AUROC values of different...
Figure 7.9 Super-scores...
Figure 7.10 Linking structure...
Figure 7.11 The SO-PLS iterates...
Figure 7.12 The CVANOVA...
Figure 7.13 Måge plot...
Figure 7.14 PCP plots for wine...
Figure 7.15 Måge plot showing...
Figure 7.16 Block-wise scores...
Figure 7.17 Block-wise (projected)...
Figure 7.18 Block-wise loadings...
Figure 7.19 Måge plot...
Figure 7.20 CV-ANOVA results...
Figure 7.21 Loadings from Principal...
Figure 7.22 RMSEP for fish data...
Figure 7.23 Regression coefficients...
Figure 7.24 SO-PLS results using...
Figure 7.25 Illustration of the idea...
Figure 7.26 PO-PLS calibrated/fitted...
Figure 7.27 PO-PLS calibrated...
Figure 7.28 PO-PLS common scores...
Figure 7.29 PO-PLS common...
Figure 7.30 PO-PLS distinct loadings...
Figure 7.31 ROSA component selection...
Figure 7.32 Cross-validated explained...
Figure 7.33 ROSA weights...
Figure 7.34 Summary of cross-validated...
Figure 7.35 The decision paths...
Chapter 8
Figure 8.1 Figure (a)–(c) represent...
Figure 8.2 Conceptual illustration...
Figure 8.3 Topologies for four...
Figure 8.4 Scheme for information...
Figure 8.5 Preference mapping...
Figure 8.6 Results from consumer...
Figure 8.7 Results from consumer liking...
Figure 8.8 Relations between segments...
Figure 8.9 Topology for the extension...
Figure 8.10 L-block scheme with...
Figure 8.11 Endo-L-PLS results...
Figure 8.12 Classification...
Figure 8.13 (a) Data structure for labelled...
Figure 8.14 Tree for selecting methods...
Chapter 9
Figure 9.1 General setup for fusing heterogeneous...
Figure 9.2 Score plots of IDIOMIX...
Figure 9.3 True design used in mixture...
Figure 9.4 Cross-validation results...
Figure 9.5 Explained variances...
Figure 9.6 From multiblock data...
Figure 9.7 Decision tree for selecting...
Chapter 10
Figure 10.1 Results from multiblock...
Figure 10.2 Pie chart of the sources...
Figure 10.3 Flow chart...
Figure 10.4 An illustration of...
Figure 10.5 Path diagram for a wine...
Figure 10.6 Wine data. PCP plots...
Figure 10.7 An illustration of the multigroup...
Figure 10.8 Decision tree for selecting...
Chapter 11
Figure 11.1 Output from use of scoreplot...
Figure 11.2 Output from use of loadingplot...
Figure 11.3 Output from use of scoreplot...
Figure 11.4 Output from use of loadingplot...
Figure 11.5 Output from use of plot...
Figure 11.6 Output from use of scoreplot()...
Figure 11.7 Output from use of scoreplot()...
Figure 11.8 Output from use of loadingplot()...
Figure 11.9 Output from use of scoreplot()...
Figure 11.10 Output from use of loadingplot()...
Figure 11.11 Output from use of scoreplot()...
Figure 11.12 Output from use of maage().
Figure 11.13 Output from use of maageSeq().
Figure 11.14 Output from use of loadingplot()...
Figure 11.15 Output from use of scoreplot()...
Figure 11.16 Output from use of scoreplot()...
Figure 11.17 Output from use of plot()...
Figure 11.18 Output from use of scoreplot()...
Figure 11.19 Output from use of loadingplot()...
Figure 11.20 Output from use of loadingplot()...
Figure 11.21 Output from use of scoreplot()...
Figure 11.22 Output from use of image()...
Figure 11.23 Output from use of image()...
Figure 11.24 Output from use of scoreplot()...
Figure 11.25 Output from use of plot()...
Chapter 1
Table 1.1 Overview of methods...
Table 1.2 Abbreviations of the...
Chapter 2
Table 2.1 Formal treatment of types...
Table 2.2 Different methods for fusing...
Table 2.3 The matrices of which the weights...
Chapter 4
Table 4.1 Overview of the data sets...
Chapter 5
Table 5.1 Overview of methods...
Table 5.2 Different types of SCA...
Table 5.3 Proportions of explained...
Table 5.4 Properties of methods...
Chapter 6
Table 6.1 Overview of methods...
Chapter 7
Table 7.1 Overview of methods...
Chapter 8
Table 8.1 Overview of methods...
Table 8.2 Tabulation of consumer characteristics...
Table 8.3 Consumer liking of cheese...
Chapter 9
Table 9.1 Overview of methods...
Chapter 10
Table 10.1 Overview of methods...
Table 10.2 Results of the single-block...
Table 10.3 Results of the multiway...
Table 10.4 SO-PLS-PM results for wine data...
Chapter 11
Table 11.1 R packages on CRAN having...
Table 11.2 MATLAB toolboxes and...
Table 11.3 Python packages having...
Table 11.4 Commercial software having...
It is a real honour to write a few introductory words about Multiblock Data Fusion in Statistics and Machine Learning. The book is maybe not timely! The subject has been around in chemometrics since the late 1980s, usually under the term multiblock analysis.
Let me take that back immediately: the book is definitely timely. Even though this subject has been discussed for decades, it has taken off dramatically lately, and not only in chemometrics but in a variety of fields. There are many diverse and interesting developments, and it is in fact quite difficult to filter, or even just follow, the literature from so many sources. Each field has its own internal jargon and background. This may be the biggest obstacle right now: it is evident that there are many interesting developments, but grasping them is next to impossible. This book fixes that. And not only that, this book provides a comprehensive overview across fields, and it adds perspective and new research where needed. I would argue that this is the place to go if you want to understand data fusion comprehensively.
That is, if you want to understand how to apply data fusion, or you want to develop new data fusion models, or learn how the algorithms and models work, or maybe you want to understand what the shortcomings of the different approaches are. If you have questions like these, or you simply want to know what is happening in this area of data science, then reading this book will be a nice and fulfilling experience.
To write a comprehensive book about such an enormous field requires special people. And indeed, there are three very competent people behind this book. They have all worked within the area for many years and have each provided important research on both the theoretical and the application sides of things. They also represent both the experience of the old-timers and the visions of the coming generations. I can say that without insulting anyone (I hope), as I am in the same age group as the more experienced part of the authors.
I have the deepest respect and the highest admiration for the three authors. I have learned so many things from their individual contributions over the years. Reading this joint work is not a disappointment. Please do enjoy!
Rasmus Bro
Køge, Denmark, July 28, 2021
Combining information from two or possibly several blocks of data is gaining increased attention and importance in several areas of science and industry. Typical examples can be found in chemistry, spectroscopy, metabolomics, genomics, systems biology, and sensory science. Many methods and procedures have been proposed and used in practice. The area goes under different names: data integration, data fusion, multiblock analyses, multiset analyses, and others.
This book is an attempt to provide an up-to-date treatment of the most used and important methods within an important branch of the area, namely methods based on so-called components or latent variables. These methods have already received enormous attention in, for instance, chemometrics, bioinformatics, machine learning, and sensometrics, and have proved to be important both for prediction and for interpretation.
The book is primarily a description of methodologies, but most of the methods are illustrated by examples from the above-mentioned areas. The book is written such that both users of the methods and method developers will hopefully find sections of interest. At the end of the book there is a description of a software package developed particularly for the book; the package is freely available in R and covers many of the methods discussed.
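As a small foretaste of Chapter 11, the sketch below shows how a first multiblock analysis might start in R. It is a minimal sketch only: the package is on CRAN under the name multiblock, but the example data set (potato) and the exact function calls (sca(), scoreplot(), loadingplot()) are assumptions based on the package documentation at the time of writing and may evolve.

install.packages("multiblock")   # one-time installation from CRAN
library(multiblock)

# 'potato' (assumed example data shipped with the package): a list of
# blocks measured on the same samples, i.e., blocks sharing the sample mode
data(potato)
blocks <- potato[c("Chemical", "Compression")]

# Simultaneous Component Analysis (Chapter 5) across the two blocks
sc <- sca(blocks, ncomp = 2)

scoreplot(sc)     # common scores for the samples
loadingplot(sc)   # loadings, shown per block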
To distinguish the different types of methods from each other, the book is divided into five parts. Part I is an introduction and description of preliminary concepts. Part II is the core of the book, containing the main unsupervised and supervised methods. Part III deals with more complex structures, and Part IV presents alternative unsupervised and supervised methods. The book ends with Part V, discussing the available software.
Our recommendations for reading the book are as follows. A minimum read would involve Chapters 1, 2, 3, 5, and 7. Chapters 4, 6, and 8 are more specialised, and Chapters 9 and 10 contain methods we think are more advanced or less obvious to use.

We feel privileged to have so many friendly colleagues who were willing to spend their time helping us improve the book by reading individual chapters. We would like to express our thanks to: Rasmus Bro, Margriet Hendriks, Ulf Indahl, Henk Kiers, Ingrid Måge, Federico Marini, Åsmund Rinnan, Rosaria Romano, Lars Erik Solberg, Marieke Timmerman, Oliver Tomic, Johan Westerhuis, and Barry Wise. Of course, the correctness of the final text is fully our responsibility!
Age Smilde, Utrecht, The Netherlands
Tormod Næs, Ås, Norway
Kristian Hovde Liland, Ås, Norway
March 2022
