Description

Multiblock Data Fusion in Statistics and Machine Learning

Explore the advantages and shortcomings of various forms of multiblock analysis, and the relationships between them, with this expert guide.

Arising out of fusion problems that exist in a variety of fields in the natural and life sciences, the methods available to fuse multiple data sets have expanded dramatically in recent years. Older methods, rooted in psychometrics and chemometrics, also exist. Multiblock Data Fusion in Statistics and Machine Learning: Applications in the Natural and Life Sciences is a detailed overview of all relevant multiblock data analysis methods for fusing multiple data sets. It focuses on methods based on components and latent variables, including both well-known and lesser-known methods with potential applications in different types of problems. Many of the included methods are illustrated by practical examples and are accompanied by a freely available R-package. The distinguished authors have created an accessible and useful guide to help readers fuse data, develop new data fusion models, discover how the involved algorithms and models work, and understand the advantages and shortcomings of various approaches.

This book includes:

* A thorough introduction to the different options available for the fusion of multiple data sets, including methods originating in psychometrics and chemometrics
* Practical discussions of well-known and lesser-known methods with applications in a wide variety of data problems
* Included, functional R-code for the application of many of the discussed methods

Perfect for graduate students studying data analysis in the context of the natural and life sciences, including bioinformatics, sensometrics, and chemometrics, Multiblock Data Fusion in Statistics and Machine Learning: Applications in the Natural and Life Sciences is also an indispensable resource for developers and users of the results of multiblock methods.


Page count: 791

Publication year: 2022




Multiblock Data Fusion in Statistics and Machine Learning

Applications in the Natural and Life Sciences

Age K. Smilde

Swammerdam Institute for Life Sciences, University of Amsterdam, Amsterdam, NL and Simula Metropolitan Center for Digital Engineering, Oslo, NO

Tormod Næs

Nofima, Ås, NO

Kristian Hovde Liland

Norwegian University of Life Sciences, Ås, NO

This edition first published 2022

© 2022 John Wiley & Sons Ltd

All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, except as permitted by law. Advice on how to obtain permission to reuse material from this title is available at http://www.wiley.com/go/permissions.

The right of Age K. Smilde, Tormod Næs and Kristian Hovde Liland to be identified as the authors of this work has been asserted in accordance with law.

Registered Offices

John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, USA

John Wiley & Sons Ltd, The Atrium, Southern Gate, Chichester, West Sussex, PO19 8SQ, UK

Editorial Office

The Atrium, Southern Gate, Chichester, West Sussex, PO19 8SQ, UK

For details of our global editorial offices, customer services, and more information about Wiley products visit us at www.wiley.com.

Wiley also publishes its books in a variety of electronic formats and by print-on-demand. Some content that appears in standard print versions of this book may not be available in other formats.

Limit of Liability/Disclaimer of Warranty

In view of ongoing research, equipment modifications, changes in governmental regulations, and the constant flow of information relating to the use of experimental reagents, equipment, and devices, the reader is urged to review and evaluate the information provided in the package insert or instructions for each chemical, piece of equipment, reagent, or device for, among other things, any changes in the instructions or indication of usage and for added warnings and precautions. While the publisher and authors have used their best efforts in preparing this work, they make no representations or warranties with respect to the accuracy or completeness of the contents of this work and specifically disclaim all warranties, including without limitation any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives, written sales materials or promotional statements for this work. The fact that an organization, website, or product is referred to in this work as a citation and/or potential source of further information does not mean that the publisher and authors endorse the information or services the organization, website, or product may provide or recommendations it may make. This work is sold with the understanding that the publisher is not engaged in rendering professional services. The advice and strategies contained herein may not be suitable for your situation. You should consult with a specialist where appropriate. Further, readers should be aware that websites listed in this work may have changed or disappeared between when this work was written and when it is read. Neither the publisher nor authors shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.

A catalogue record for this book is available from the Library of Congress

Hardback ISBN: 9781119600961; ePDF ISBN: 9781119600985; ePub ISBN: 9781119600992; oBook ISBN: 9781119600978

Cover image: © Professor Age K. Smilde

Cover design by Wiley

Set in 10/12pt WarnockPro-Regular by Integra Software Services Pvt. Ltd, Pondicherry, India

Contents

Cover

Title page

Copyright

Foreword

Preface

List of Figures

List of Tables

Part I Introductory Concepts and Theory

1 Introduction

1.1 Scope of the Book

1.2 Potential Audience

1.3 Types of Data and Analyses

1.3.1 Supervised and Unsupervised Analyses

1.3.2 High-, Mid- and Low-level Fusion

1.3.3 Dimension Reduction

1.3.4 Indirect Versus Direct Data

1.3.5 Heterogeneous Fusion

1.4 Examples

1.4.1 Metabolomics

1.4.2 Genomics

1.4.3 Systems Biology

1.4.4 Chemistry

1.4.5 Sensory Science

1.5 Goals of Analyses

1.6 Some History

1.7 Fundamental Choices

1.8 Common and Distinct Components

1.9 Overview and Links

1.10 Notation and Terminology

1.11 Abbreviations

2 Basic Theory and Concepts

2.i General Introduction

2.1 Component Models

2.1.1 General Idea of Component Models

2.1.2 Principal Component Analysis

2.1.3 Sparse PCA

2.1.4 Principal Component Regression

2.1.5 Partial Least Squares

2.1.6 Sparse PLS

2.1.7 Principal Covariates Regression

2.1.8 Redundancy Analysis

2.1.9 Comparing PLS, PCovR and RDA

2.1.10 Generalised Canonical Correlation Analysis

2.1.11 Simultaneous Component Analysis

2.2 Properties of Data

2.2.1 Data Theory

2.2.2 Scale-types

2.3 Estimation Methods

2.3.1 Least-squares Estimation

2.3.2 Maximum-likelihood Estimation

2.3.3 Eigenvalue Decomposition-based Methods

2.3.4 Covariance or Correlation-based Estimation Methods

2.3.5 Sequential Versus Simultaneous Methods

2.3.6 Homogeneous Versus Heterogeneous Fusion

2.4 Within- and Between-block Variation

2.4.1 Definition and Example

2.4.2 MAXBET Solution

2.4.3 MAXNEAR Solution

2.4.4 PLS2 Solution

2.4.5 CCA Solution

2.4.6 Comparing the Solutions

2.4.7 PLS, RDA and CCA Revisited

2.5 Framework for Common and Distinct Components

2.6 Preprocessing

2.7 Validation

2.7.1 Outliers

2.7.1.1 Residuals

2.7.1.2 Leverage

2.7.2 Model Fit

2.7.3 Bias-variance Trade-off

2.7.4 Test Set Validation

2.7.5 Cross-validation

2.7.6 Permutation Testing

2.7.7 Jackknife and Bootstrap

2.7.8 Hyper-parameters and Penalties

2.8 Appendix

3 Structure of Multiblock Data

3.i General Introduction

3.1 Taxonomy

3.2 Skeleton of a Multiblock Data Set

3.2.1 Shared Sample Mode

3.2.2 Shared Variable Mode

3.2.3 Shared Variable or Sample Mode

3.2.4 Shared Variable and Sample Mode

3.3 Topology of a Multiblock Data Set

3.3.1 Unsupervised Analysis

3.3.2 Supervised Analysis

3.4 Linking Structures

3.4.1 Linking Structure for Unsupervised Analysis

3.4.2 Linking Structures for Supervised Analysis

3.5 Summary

4 Matrix Correlations

4.i General Introduction

4.1 Definition

4.2 Most Used Matrix Correlations

4.2.1 Inner Product Correlation

4.2.2 GCD coefficient

4.2.3 RV-coefficient

4.2.4 SMI-coefficient

4.3 Generic Framework of Matrix Correlations

4.4 Generalised Matrix Correlations

4.4.1 Generalised RV-coefficient

4.4.2 Generalised Association Coefficient

4.5 Partial Matrix Correlations

4.6 Conclusions and Recommendations

4.7 Open Issues

Part II Selected Methods for Unsupervised and Supervised Topologies

5 Unsupervised Methods

5.i General Introduction

5.ii Relations to the General Framework

5.1 Shared Variable Mode

5.1.1 Only Common Variation

5.1.1.1 Simultaneous Component Analysis

5.1.1.2 Clustering and SCA

5.1.1.3 Multigroup Data Analysis

5.1.2 Common, Local, and Distinct Variation

5.1.2.1 Distinct and Common Components

5.1.2.2 Multivariate Curve Resolution

5.2 Shared Sample Mode

5.2.1 Only Common Variation

5.2.1.1 SUM-PCA

5.2.1.2 Multiple Factor Analysis and STATIS

5.2.1.3 Generalised Canonical Analysis

5.2.1.4 Regularised Generalised Canonical Correlation Analysis

5.2.1.5 Exponential Family SCA

5.2.1.6 Optimal-scaling

5.2.2 Common, Local, and Distinct Variation

5.2.2.1 Joint and Individual Variation Explained

5.2.2.2 Distinct and Common Components

5.2.2.3 PCA-GCA

5.2.2.4 Advanced Coupled Matrix and Tensor Factorisation

5.2.2.5 Penalised-ESCA

5.2.2.6 Multivariate Curve Resolution

5.3 Generic Framework

5.3.1 Framework for Simultaneous Unsupervised Methods

5.3.1.1 Description of the Framework

5.3.1.2 Framework Applied to Simultaneous Unsupervised Data Analysis Methods

5.3.1.3 Framework of Common/Distinct Applied to Simultaneous Unsupervised Multiblock Data Analysis Methods

5.4 Conclusions and Recommendations

5.5 Open Issues

6 ASCA and Extensions

6.i General Introduction

6.ii Relations to the General Framework

6.1 ANOVA-Simultaneous Component Analysis

6.1.1 The ASCA Method

6.1.2 Validation of ASCA

6.1.2.1 Permutation Testing

6.1.2.2 Back-projection

6.1.2.3 Confidence Ellipsoids

6.1.3 The ASCA+ and LiMM-PCA Methods

6.2 Multilevel-SCA

6.3 Penalised-ASCA

6.4 Conclusions and Recommendations

6.5 Open Issues

7 Supervised Methods

7.i General Introduction

7.ii Relations to the General Framework

7.1 Multiblock Regression: General Perspectives

7.1.1 Model and Assumptions

7.1.2 Different Challenges and Aims

7.2 Multiblock PLS Regression

7.2.1 Standard Multiblock PLS Regression

7.2.2 MB-PLS Used for Classification

7.2.3 Sparse Multiblock PLS Regression (sMB-PLS)

7.3 The Family of SO-PLS Regression Methods (Sequential and Orthogonalised PLS Regression)

7.3.1 The SO-PLS Method

7.3.2 Order of Blocks

7.3.3 Interpretation Tools

7.3.4 Restricted PLS Components and their Application in SO-PLS

7.3.5 Validation and Component Selection

7.3.6 Relations to ANOVA

7.3.7 Extensions of SO-PLS to Handle Interactions Between Blocks

7.3.8 Further Applications of SO-PLS

7.3.9 Relations Between SO-PLS and ASCA

7.4 Parallel and Orthogonalised PLS (PO-PLS) Regression

7.5 Response Oriented Sequential Alternation

7.5.1 The ROSA Method

7.5.2 Validation

7.5.3 Interpretation

7.6 Conclusions and Recommendations

7.7 Open Issues

Part III Methods for Complex Multiblock Structures

8 Complex Block Structures; with Focus on L-Shape Relations

8.i General Introduction

8.ii Relations to the General Framework

8.1 Analysis of L-shape Data: General Perspectives

8.2 Sequential Procedures for L-shape Data Based on PLS/PCR and ANOVA

8.2.1 Interpretation of X1, Quantitative X2-data, Horizontal Axis First

8.2.2 Interpretation of X1, Categorical X2-data, Horizontal Axis First

8.2.3 Analysis of Segments/Clusters of X1 Data

8.3 The L-PLS Method for Joint Estimation of Blocks in L-shape Data

8.3.1 The Original L-PLS Method, Endo-L-PLS

8.3.2 Exo- Versus Endo-L-PLS

8.4 Modifications of the Original L-PLS Idea

8.4.1 Weighting Information from X3 and X1 in L-PLS Using a Parameter α

8.4.2 Three-blocks Bifocal PLS

8.5 Alternative L-shape Data Analysis Methods

8.5.1 Principal Component Analysis with External Information

8.5.2 A Simple PCA Based Procedure for Using Unlabelled Data in Calibration

8.5.3 Multivariate Curve Resolution for Incomplete Data

8.5.4 An Alternative Approach in Consumer Science Based on Correlations Between X3 and X1

8.6 Domino PLS and More Complex Data Structures

8.7 Conclusions and Recommendations

8.8 Open Issues

Part IV Alternative Methods for Unsupervised and Supervised Topologies

9 Alternative Unsupervised Methods

9.i General Introduction

9.ii Relationship to the General Framework

9.1 Shared Variable Mode

9.2 Shared Sample Mode

9.2.1 Only Common Variation

9.2.1.1 DIABLO

9.2.1.2 Generalised Coupled Tensor Factorisation

9.2.1.3 Representation Matrices

9.2.1.4 Extended PCA

9.2.2 Common, Local, and Distinct Variation

9.2.2.1 Generalised SVD

9.2.2.2 Structural Learning and Integrative Decomposition

9.2.2.3 Bayesian Inter-battery Factor Analysis

9.2.2.4 Group Factor Analysis

9.2.2.5 OnPLS

9.2.2.6 Generalised Association Study

9.2.2.7 Multi-Omics Factor Analysis

9.3 Two Shared Modes and Only Common Variation

9.3.1 Generalised Procrustes Analysis

9.3.2 Three-way Methods

9.4 Conclusions and Recommendations

9.4.1 Open Issues

10 Alternative Supervised Methods

10.i General Introduction

10.ii Relations to the General Framework

10.1 Model and Focus

10.2 Extension of PCovR

10.2.1 Sparse Multiblock Principal Covariates Regression, Sparse PCovR

10.2.2 Multiway Multiblock Covariates Regression

10.3 Multiblock Redundancy Analysis

10.3.1 Standard Multiblock Redundancy Analysis

10.3.2 Sparse Multiblock Redundancy Analysis

10.4 Miscellaneous Multiblock Regression Methods

10.4.1 Multiblock Variance Partitioning

10.4.2 Network Induced Supervised Learning

10.4.3 Common Dimensions for Multiblock Regression

10.5 Modifications and Extensions of the SO-PLS Method

10.5.1 Extensions of SO-PLS to Three-Way Data

10.5.2 Variable Selection for SO-PLS

10.5.3 More Complicated Error Structure for SO-PLS

10.5.4 SO-PLS Used for Path Modelling

10.6 Methods for Data Sets Split Along the Sample Mode, Multigroup Methods

10.6.1 Multigroup PLS Regression

10.6.2 Clustering of Observations in Multiblock Regression

10.6.3 Domain-Invariant PLS, DI-PLS

10.7 Conclusions and Recommendations

10.8 Open Issues

Part V Software

11 Algorithms and Software

11.1 Multiblock Software

11.2 R package multiblock

11.3 Installing and Starting the Package

11.4 Data Handling

11.4.1 Read From File

11.4.2 Data Pre-processing

11.4.3 Re-coding Categorical Data

11.4.4 Data Structures for Multiblock Analysis

11.4.4.1 Create List of Blocks

11.4.4.2 Create data.frame of Blocks

11.5 Basic Methods

11.5.1 Prepare Data

11.5.2 Modelling

11.5.3 Common Output Elements Across Methods

11.5.4 Scores and Loadings

11.6 Unsupervised Methods

11.6.1 Formatting Data for Unsupervised Data Analysis

11.6.2 Method Interfaces

11.6.3 Shared Sample Mode Analyses

11.6.4 Shared Variable Mode

11.6.5 Common Output Elements Across Methods

11.6.6 Scores and Loadings

11.6.7 Plot From Imported Package

11.7 ANOVA Simultaneous Component Analysis

11.7.1 Formula Interface

11.7.2 Simulated Data

11.7.3 ASCA Modelling

11.7.4 ASCA Scores

11.7.5 ASCA Loadings

11.8 Supervised Methods

11.8.1 Formatting Data for Supervised Analyses

11.8.2 Multiblock Partial Least Squares

11.8.2.1 MB-PLS Modelling

11.8.2.2 MB-PLS Summaries and Plotting

11.8.3 Sparse Multiblock Partial Least Squares

11.8.3.1 Sparse MB-PLS Modelling

11.8.3.2 Sparse MB-PLS Plotting

11.8.4 Sequential and Orthogonalised Partial Least Squares

11.8.4.1 SO-PLS Modelling

11.8.4.2 Måge Plot

11.8.4.3 SO-PLS Loadings

11.8.4.4 SO-PLS Scores

11.8.4.5 SO-PLS Prediction

11.8.4.6 SO-PLS Validation

11.8.4.7 Principal Components of Predictions

11.8.4.8 CVANOVA

11.8.5 Parallel and Orthogonalised Partial Least Squares

11.8.5.1 PO-PLS Modelling

11.8.5.2 PO-PLS Scores and Loadings

11.8.6 Response Oriented Sequential Alternation

11.8.6.1 ROSA Modelling

11.8.6.2 ROSA Loadings

11.8.6.3 ROSA Scores

11.8.6.4 ROSA Prediction

11.8.6.5 ROSA Validation

11.8.6.6 ROSA Image Plots

11.8.7 Multiblock Redundancy Analysis

11.8.7.1 MB-RDA Modelling

11.8.7.2 MB-RDA Loadings and Scores

11.9 Complex Data Structures

11.9.1 L-PLS

11.9.1.1 Simulated L-shaped Data

11.9.1.2 Exo-L-PLS

11.9.1.3 Endo-L-PLS

11.9.1.4 L-PLS Cross-validation

11.9.2 SO-PLS-PM

11.9.2.1 Single SO-PLS-PM Model

11.9.2.2 Multiple Paths in an SO-PLS-PM Model

11.10 Software Packages

11.10.1 R Packages

11.10.2 MATLAB Toolboxes

11.10.3 Python

11.10.4 Commercial Software

References

Index

End User License Agreement

List of Figures

Chapter 1

Figure 1.1 High-level...

Figure 1.2 Idea of dimension...

Figure 1.3 Design of the plant...

Figure 1.4 Scores on the first...

Figure 1.5 Idea of copy number...

Figure 1.6 Plot of the Raman...

Figure 1.7 L-shape data of...

Figure 1.8 Phylogeny of some...

Figure 1.9 The idea of common...

Chapter 2

Figure 2.1 Idea of dimension reduction...

Figure 2.2 Geometry of PCA...

Figure 2.3 Score (a) and loading...

Figure 2.4 PLS validated explained...

Figure 2.5 Score and loading plots...

Figure 2.6 Raw and normalised...

Figure 2.7 Numerical representations...

Figure 2.8 Classical (a) and...

Figure 2.9 Classical (a) and...

Figure 2.10 SCA for two data...

Figure 2.11 The block scores...

Figure 2.12 Two column-spaces...

Figure 2.13 Common and distinct...

Figure 2.14 Common components...

Figure 2.15 Visualisation of a response...

Figure 2.16 Fitted values versus...

Figure 2.17 Simple linear regression...

Figure 2.18 Two-variable multiple...

Figure 2.19 Two component PCA...

Figure 2.20 Illustration of true...

Figure 2.21 Visualisation of bias...

Figure 2.22 Learning curves showing...

Figure 2.23 Visualisation of the...

Figure 2.24 Cumulative explained...

Figure 2.25 Null distribution...

Chapter 3

Figure 3.1 Skeleton of a three-block data...

Figure 3.2 Skeleton of a four-block data...

Figure 3.3 Skeleton of a three-block data...

Figure 3.4 Skeleton of a three-block...

Figure 3.5 Skeleton of a four-block...

Figure 3.6 Topology of a three-block...

Figure 3.7 Topology of a three-block...

Figure 3.8 Different arrangements...

Figure 3.9 Unsupervised combination...

Figure 3.10 Supervised three-set problem...

Figure 3.11 Supervised L-shape problem...

Figure 3.12 Path model structure...

Figure 3.13 Idea of linking two...

Figure 3.14 Different linking...

Figure 3.15 Idea of linking...

Figure 3.16 Different linking...

Figure 3.17 Treating common and distinct...

Chapter 4

Figure 4.1 Explanation of the scale...

Figure 4.2 Topology of interactions...

Figure 4.3 The RV and partial RV...

Figure 4.4 Decision tree for selecting...

Chapter 5

Figure 5.1 Unsupervised analysis...

Figure 5.2 Illustration explaining...

Figure 5.3 The idea of common...

Figure 5.4 Proportion of explained...

Figure 5.5 Row-spaces visualised...

Figure 5.6 Difference between weights...

Figure 5.7 The logistic function...

Figure 5.8 CNA data visualised...

Figure 5.9 Score plot of the CNA...

Figure 5.10 Plots for selecting...

Figure 5.11 Biplots from PCA-GCA...

Figure 5.12 Amount of explained...

Figure 5.13 Amount of explained...

Figure 5.14 Scores (upper part)...

Figure 5.15 ACMTF as applied...

Figure 5.16 True design used...

Figure 5.17 True design used...

Figure 5.18 Example of the properties...

Figure 5.19 Quantification of modes...

Figure 5.20 Linking the blocks...

Figure 5.21 Decision tree...

Figure 5.22 Decision tree for...

Chapter 6

Figure 6.1 ASCA decomposition...

Figure 6.2 A part of the ASCA...

Figure 6.3 The ASCA scores...

Figure 6.4 The ASCA scores on the factor...

Figure 6.5 The ASCA scores on the interaction...

Figure 6.6 PCA on toxicology data...

Figure 6.7 ASCA on toxicology data...

Figure 6.8 PARAFASCA on toxicology...

Figure 6.9 Permutation example...

Figure 6.10 Permutation test...

Figure 6.11 ASCA candy scores...

Figure 6.12 ASCA assessor scores...

Figure 6.13 ASCA assessor and candy...

Figure 6.14 PE-ASCA of the NMR metabolomics...

Figure 6.15 Tree for selecting...

Chapter 7

Figure 7.1 Conceptual illustration...

Figure 7.2 Illustration of link...

Figure 7.3 Cross-validated explained...

Figure 7.4 Super-weights (w) for the...

Figure 7.5 Block-weights (wm) for first...

Figure 7.6 Block-scores (tm, for left...

Figure 7.7 Classification by regression...

Figure 7.8 AUROC values of different...

Figure 7.9 Super-scores...

Figure 7.10 Linking structure...

Figure 7.11 The SO-PLS iterates...

Figure 7.12 The CVANOVA...

Figure 7.13 Måge plot...

Figure 7.14 PCP plots for wine...

Figure 7.15 Måge plot showing...

Figure 7.16 Block-wise scores...

Figure 7.17 Block-wise (projected)...

Figure 7.18 Block-wise loadings...

Figure 7.19 Måge plot...

Figure 7.20 CV-ANOVA results...

Figure 7.21 Loadings from Principal...

Figure 7.22 RMSEP for fish data...

Figure 7.23 Regression coefficients...

Figure 7.24 SO-PLS results using...

Figure 7.25 Illustration of the idea...

Figure 7.26 PO-PLS calibrated/fitted...

Figure 7.27 PO-PLS calibrated...

Figure 7.28 PO-PLS common scores...

Figure 7.29 PO-PLS common...

Figure 7.30 PO-PLS distinct loadings...

Figure 7.31 ROSA component selection...

Figure 7.32 Cross-validated explained...

Figure 7.33 ROSA weights...

Figure 7.34 Summary of cross-validated...

Figure 7.35 The decision paths...

Chapter 8

Figure 8.1 Figure (a)–(c) represent...

Figure 8.2 Conceptual illustration...

Figure 8.3 Topologies for four...

Figure 8.4 Scheme for information...

Figure 8.5 Preference mapping...

Figure 8.6 Results from consumer...

Figure 8.7 Results from consumer liking...

Figure 8.8 Relations between segments...

Figure 8.9 Topology for the extension...

Figure 8.10 L-block scheme with...

Figure 8.11 Endo-L-PLS results...

Figure 8.12 Classification...

Figure 8.13 (a) Data structure for labelled...

Figure 8.14 Tree for selecting methods...

Chapter 9

Figure 9.1 General setup for fusing heterogeneous...

Figure 9.2 Score plots of IDIOMIX...

Figure 9.3 True design used in mixture...

Figure 9.4 Cross-validation results...

Figure 9.5 Explained variances...

Figure 9.6 From multiblock data...

Figure 9.7 Decision tree for selecting...

Chapter 10

Figure 10.1 Results from multiblock...

Figure 10.2 Pie chart of the sources...

Figure 10.3 Flow chart...

Figure 10.4 An illustration of...

Figure 10.5 Path diagram for a wine...

Figure 10.6 Wine data. PCP plots...

Figure 10.7 An illustration of the multigroup...

Figure 10.8 Decision tree for selecting...

Chapter 11

Figure 11.1 Output from use of scoreplot...

Figure 11.2 Output from use of loadingplot...

Figure 11.3 Output from use of scoreplot...

Figure 11.4 Output from use of loadingplot...

Figure 11.5 Output from use of plot...

Figure 11.6 Output from use of scoreplot()...

Figure 11.7 Output from use of scoreplot()...

Figure 11.8 Output from use of loadingplot()...

Figure 11.9 Output from use of scoreplot()...

Figure 11.10 Output from use of loadingplot()...

Figure 11.11 Output from use of scoreplot()...

Figure 11.12 Output from use of maage().

Figure 11.13 Output from use of maageSeq().

Figure 11.14 Output from use of loadingplot()...

Figure 11.15 Output from use of scoreplot()...

Figure 11.16 Output from use of scoreplot()...

Figure 11.17 Output from use of plot()...

Figure 11.18 Output from use of scoreplot()...

Figure 11.19 Output from use of loadingplot()...

Figure 11.20 Output from use of loadingplot()...

Figure 11.21 Output from use of scoreplot()...

Figure 11.22 Output from use of image()...

Figure 11.23 Output from use of image()...

Figure 11.24 Output from use of scoreplot()...

Figure 11.25 Output from use of plot()...

List of Tables

Chapter 1

Table 1.1 Overview of methods...

Table 1.2 Abbreviations of the...

Chapter 2

Table 2.1 Formal treatment of types...

Table 2.2 Different methods for fusing...

Table 2.3 The matrices of which the weights...

Chapter 4

Table 4.1 Overview of the data sets...

Chapter 5

Table 5.1 Overview of methods...

Table 5.2 Different types of SCA...

Table 5.3 Proportions of explained...

Table 5.4 Properties of methods...

Chapter 6

Table 6.1 Overview of methods...

Chapter 7

Table 7.1 Overview of methods...

Chapter 8

Table 8.1 Overview of methods...

Table 8.2 Tabulation of consumer characteristics...

Table 8.3 Consumer liking of cheese...

Chapter 9

Table 9.1 Overview of methods...

Chapter 10

Table 10.1 Overview of methods...

Table 10.2 Results of the single-block...

Table 10.3 Results of the multiway...

Table 10.4 SO-PLS-PM results for wine data...

Chapter 11

Table 11.1 R packages on CRAN having...

Table 11.2 MATLAB toolboxes and...

Table 11.3 Python packages having...

Table 11.4 Commercial software having...



Foreword

It is a real honour to write a few introductory words about Multiblock Data Fusion in Statistics and Machine Learning. The book is maybe not timely! The subject has been around in chemometrics since the late 1980s, usually under the term multiblock analysis.

Let me take that back immediately: the book is definitely timely. Even though this subject has been discussed for decades, it has taken off dramatically lately. And not only in chemometrics, but in a variety of fields. There are many diverse and interesting developments, and in fact it is quite difficult to really understand what is going on and to filter, or even just follow, the literature from so many sources. Each field has its own internal jargon and background. This may be the biggest obstacle right now: it is evident that there are many interesting developments, but grasping them is next to impossible. This book fixes that. And not only that, this book provides a comprehensive overview across fields, and it also adds perspective and new research where needed. I would argue that this is the place to go if you want to understand data fusion comprehensively.

That is, if you want to understand how to apply data fusion; or you want to develop new data fusion models; or learn how the algorithms and models work; or maybe you want to understand what the shortcomings of different approaches are. If you have questions like these or you simply want to know what is happening in this area of data science, then reading this book will be a nice and fulfilling experience.

To write a comprehensive book about such an enormous field requires special people. And indeed, there are three very competent persons behind this book. They have all worked within the area for many years and have each provided important research on both the theoretical and the application sides of things. And they represent both the experience of the old-timers and the visions of the coming generations. I can say that without insulting anyone (I hope), as I am in the same age group as the more distinguished part of the authors.

I have the deepest respect and the highest admiration for the three authors. I have learned so many things from their individual contributions over the years. Reading this joint work is not a disappointment. Please do enjoy!

Rasmus Bro

Køge, Denmark, July 28, 2021

Preface

Combining information from two or possibly several blocks of data is gaining increased attention and importance in several areas of science and industry. Typical examples can be found in chemistry, spectroscopy, metabolomics, genomics, systems biology, and sensory science. Many methods and procedures have been proposed and used in practice. The area goes under different names: data integration, data fusion, multiblock analyses, multiset analyses, and others.

This book is an attempt to provide an up-to-date treatment of the most used and important methods within a major branch of the area, namely methods based on so-called components or latent variables. These methods have already received enormous attention in, for instance, chemometrics, bioinformatics, machine learning, and sensometrics, and have proved to be important both for prediction and for interpretation.

The book is primarily a description of methodologies, but most of the methods are illustrated by examples from the above-mentioned areas. The book is written such that both users of the methods and method developers will hopefully find sections of interest. At the end of the book there is a description of a software package developed particularly for the book. This package is freely available in R and covers many of the methods discussed.
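For readers who want to try the methods right away, a minimal sketch of getting started in R follows. It assumes only that the package carries the name multiblock used in Chapter 11 and is obtained from CRAN; everything else is standard R usage.

# Install the book's R package (needed only once).
install.packages("multiblock")

# Load the package and list the functions it provides.
library(multiblock)
ls("package:multiblock")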

To distinguish the different types of methods from each other, the book is divided into five parts. Part I is an introduction and a description of preliminary concepts. Part II is the core of the book, containing the main unsupervised and supervised methods. Part III deals with more complex structures, and Part IV presents alternative unsupervised and supervised methods. Finally, Part V discusses the available software.

Our recommendations for reading the book are as follows. A minimum read would involve Chapters 1, 2, 3, 5, and 7. Chapters 4, 6, and 8 are more specialised, and Chapters 9 and 10 contain methods we think are more advanced or less obvious to use.

We feel privileged to have so many friendly colleagues who were willing to spend their time helping us improve the book by reading separate chapters. We would like to express our thanks to: Rasmus Bro, Margriet Hendriks, Ulf Indahl, Henk Kiers, Ingrid Måge, Federico Marini, Åsmund Rinnan, Rosaria Romano, Lars Erik Solberg, Marieke Timmerman, Oliver Tomic, Johan Westerhuis, and Barry Wise. Of course, the correctness of the final text is fully our responsibility!

Age Smilde, Utrecht, The Netherlands

Tormod Næs, Ås, Norway

Kristian Hovde Liland, Ås, Norway

March 2022