Biological Data Integration – E-Book

Description

The study of biological data is constantly undergoing profound changes. Firstly, the volume of data available has increased considerably due to new high throughput techniques used for experiments. Secondly, the remarkable progress in both computational and statistical analysis methods and infrastructures has made it possible to process these voluminous data. The resulting challenge concerns our ability to integrate these data, i.e. to use their complementary nature effectively in the hope of advancing our knowledge. Therefore, a major challenge in studying biology today is integrating data for the most exhaustive analysis possible. Biological Data Integration deals in a pedagogical way with research work in biological data science, examining both computational approaches to data integration and statistical approaches to the integration of omics data.



SCIENCES

Computer Science, Field Directors – Valérie Berthé and Jean-Charles Pomerol

Bioinformatics, Subject Head – Anne Siegel and Hélène Touzet

Biological Data Integration

Computer and Statistical Approaches

Coordinated by

Christine Froidevaux

Marie-Laure Martin-Magniette

Guillem Rigaill

First published 2023 in Great Britain and the United States by ISTE Ltd and John Wiley & Sons, Inc.

Apart from any fair dealing for the purposes of research or private study, or criticism or review, as permitted under the Copyright, Designs and Patents Act 1988, this publication may only be reproduced, stored or transmitted, in any form or by any means, with the prior permission in writing of the publishers, or in the case of reprographic reproduction in accordance with the terms and licenses issued by the CLA. Enquiries concerning reproduction outside these terms should be sent to the publishers at the under mentioned address:

ISTE Ltd
27-37 St George’s Road
London SW19 4EU
UK

www.iste.co.uk

John Wiley & Sons, Inc.
111 River Street
Hoboken, NJ 07030
USA

www.wiley.com

© ISTE Ltd 2023
The rights of Christine Froidevaux, Marie-Laure Martin-Magniette and Guillem Rigaill to be identified as the authors of this work have been asserted by them in accordance with the Copyright, Designs and Patents Act 1988.

Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s), contributor(s) or editor(s) and do not necessarily reflect the views of ISTE Group.

Library of Congress Control Number: 2023932774

British Library Cataloguing-in-Publication Data
A CIP record for this book is available from the British Library
ISBN 978-1-78945-030-9

ERC codes:
LS2 Genetics, ’Omics’, Bioinformatics and Systems Biology
LS2_12 Bioinformatics
LS2_13 Computational biology
LS2_14 Biostatistics
LS2_15 Systems biology

Preface

Christine FROIDEVAUX1, Marie-Laure MARTIN-MAGNIETTE2,3 and Guillem RIGAILL2,4

1Université Paris-Saclay, CNRS, LISN, Orsay, France

2IPS2, Université Paris-Saclay, CNRS, INRAE, Université d’Évry, Université Paris Cité, Gif-sur-Yvette, France

3MIA Paris-Saclay, Université Paris-Saclay, AgroParisTech, INRAE, France

4LaMME, Université Paris-Saclay, CNRS, Université d’Évry, Évry-Courcouronnes, France

P.1. Introduction

The study of biological data has undergone fundamental changes in recent years. Firstly, the volume of these data has dramatically increased due to new high-throughput experimental techniques. Secondly, remarkable advances in both computational and statistical analysis methods and in infrastructures have made it possible to process these large datasets. These data should then be integrated, that is, their complementarity exploited, with the prospect of advancing biological knowledge. Using data integration to allow the most exhaustive analysis possible thus represents a major challenge in biology.

This book intends to address research studies in biological data science with a pedagogical approach, focusing first on computational approaches to biological data integration and then on statistical approaches to omics data integration.

P.2. Computer-based approaches to biological data integration

P.2.1. Challenges of biological knowledge integration

Biological knowledge has given rise to new fields of application: beyond integrative and systems biology, it is valuable for health and the environment. In particular, the linking of omics data with knowledge of pathologies and clinical data has led to the emergence of precision medicine, which holds tremendous promise for individual health. However, to achieve it, we need to be able to analyze all the knowledge available in an integrated way.

Life sciences data integration faces several difficulties: in addition to being massive (Big Data), the data are heterogeneous (very varied formats), dispersed (spread across many databases), of various granularities (from genomic data to pathology information) and of very variable quality (databases do not all provide the same guarantees of verification, or curation).

Unlike other application areas where the integration process is based on the identification of concepts structured in ontologies and on which data matching is performed, biological data integration proceeds by reconciling data using algorithmic, learning and statistical approaches. This integration increasingly attempts to put the human being at the center of the process.

P.2.2. Computer-based solutions

A new paradigm has emerged: the procedure no longer consists of two distinct phases, the first gathering data distributed across different databases and integrating them, the second analyzing the integrated data. The two phases are intertwined: integration is used for analysis, which in turn is the basis for better integration.

A number of data warehouses have been developed to gather, in an integrated (that is, structured, coherent and complementary) manner, fragmented data related to the same field of biology. The construction of these warehouses is accompanied by data querying methods that make their analysis possible. These data can be annotated with conceptual terms derived from ontologies, which make it possible to keep track of the deeper knowledge associated with them. Ontologies not only allow knowledge to be enriched with annotations, but also support reasoning about this knowledge. They are at the heart of the Semantic Web, which aims at a fine-grained representation of data to facilitate its automatic integration and interpretation (Chen et al. 2012).
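
To give a concrete flavor of annotation and of the light reasoning an ontology enables, here is a minimal Python sketch using the rdflib library; the namespace, class names and annotation property are hypothetical stand-ins for real resources such as the Gene Ontology.

```python
from rdflib import Graph, Namespace, RDFS

EX = Namespace("http://example.org/bio#")    # hypothetical namespace and terms
g = Graph()

# A tiny ontology fragment and one annotation.
g.add((EX.DNARepair, RDFS.subClassOf, EX.DNAMetabolicProcess))
g.add((EX.GeneA, EX.annotatedWith, EX.DNARepair))

# A SPARQL 1.1 property path follows the subclass hierarchy, so GeneA is
# retrieved when asking for genes annotated with DNAMetabolicProcess or any
# of its subclasses: a very small example of reasoning over annotations.
query = """
PREFIX ex:   <http://example.org/bio#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?gene WHERE {
  ?gene ex:annotatedWith ?term .
  ?term rdfs:subClassOf* ex:DNAMetabolicProcess .
}
"""
for row in g.query(query):
    print(row.gene)    # http://example.org/bio#GeneA
```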

Finally, the analyses performed on the data use a multitude of very different tools. A data processing procedure that chains several tools one after another, called a workflow, is becoming a fundamental part of data analysis and is at the heart of the paradigm shift mentioned in the introduction. Designing and executing these bioinformatics processing chains are important issues.
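
As a very small sketch of what such a chain looks like (tool names, file names and options here are illustrative assumptions; real pipelines typically rely on dedicated workflow engines such as Snakemake or Nextflow):

```python
import subprocess
from pathlib import Path

# A minimal two-step chain: each step is skipped if its output already exists,
# a naive form of incremental re-execution.
def step(cmd: list[str], output: Path) -> None:
    if output.exists():
        print(f"skipping: {output} already exists")
        return
    subprocess.run(cmd, check=True)          # fail fast if the tool errors out

step(["fastqc", "sample.fastq", "-o", "qc"], Path("qc/sample_fastqc.html"))
step(["bwa", "mem", "-o", "sample.sam", "ref.fa", "sample.fastq"], Path("sample.sam"))
```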

P.2.3. Presentation of the first part

Chapter 1 introduces data warehouses for the life sciences, focusing on clinical data. Chapter 2 introduces Semantic Web concepts and techniques for omics data integration. Finally, Chapter 3 presents bioinformatics problems and solutions for designing and executing scientific workflows.

These chapters underline the close relationship between good integration and the FAIR (Findable, Accessible, Interoperable, Reusable) data principles and insist on the importance of data provenance (Zheng et al. 2015). They point out the ethical challenges raised by the protection of stored personal data, especially in the health field, in connection with the security of computer systems.

Throughout these chapters, the reader will be able to see how, in terms of data integration, advances in computational research benefit the life sciences, and how a wider adoption of computational methods could benefit them even further. Conversely, the life sciences offer a tremendous field of investigation for the development of innovative computational methods.

P.3. Statistical approaches to omics data integration

P.3.1. Statistical challenges of integration

Omics data integration is a very broad topic: it is very difficult to accurately define its contours. Our vision of omics data integration is quite close to the one presented by Ritchie et al. (2015):

[…] (multi)-omics information integration in a meaningful manner to provide a more complete analysis of a biological point of interest.

This definition emphasizes the objectives of integration. The analysis must make sense, of course, but more importantly it must shed a new light on a biological question of interest: in other words, it must do “better” than a non-integrative analysis.

On the biological level, a systemic vision of the functioning of the cell perfectly motivates the development of methodologies for integrating omics information. How could we actually understand the regulation of the cell without studying or understanding the numerous molecular interactions that take place therein: DNA-DNA, DNA-RNA, RNA-protein, etc.? Nonetheless, omics data integration is not an easy task. It is not a miraculous solution, and demonstrating that an integrative analysis provides a more complete biological picture than a non-integrative analysis is not always straightforward. We mention here very briefly some of the statistical difficulties associated with data integration (Ritchie et al. 2015).

P.3.1.1. Heterogeneous and complex data

One of the first difficulties one comes across is certainly data diversity. For example:

data with very different formats have to be integrated: graphs, matrices, signals, etc.;

data corresponding to a wide range of molecular scales have to be integrated, for example, transcriptomic and proteomic data;

unbalanced datasets where some samples are not present in all the datasets have to be integrated.

P.3.1.2. Data quality

As Ritchie et al. (2015) rightly remind us, before integrating data, it is necessary to analyze each dataset separately and validate its quality. To obtain high-quality results from an integrative analysis, high-quality data are necessary.

P.3.1.3. High-dimensional data

In genomics, we are often faced with the problem of high dimensionality (Giraud 2014): the number of variables p (genes, proteins, transcripts) is often much larger than the number of observations n (individuals, samples). Integration tends to make the problem worse. For simplicity, let us assume that the same n samples are observed in each of the D datasets to be integrated and that p_d variables are measured in dataset d. If we already have n ≪ p_d in each dataset, then a fortiori n ≪ p_1 + … + p_D once the datasets are combined.

One way of mitigating this problem consists of reducing the dimension of each dataset. Many techniques exist for this purpose, for example, data mining techniques or the use of knowledge bases.
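
A minimal sketch of this idea, assuming two omics tables with made-up dimensions and random values standing in for real measurements; each table is summarized by a few principal components before any joint analysis:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Hypothetical setting: n = 30 samples, two omics tables with p_1 = 2000 and
# p_2 = 500 variables.
n = 30
transcriptome = rng.normal(size=(n, 2000))
proteome = rng.normal(size=(n, 500))

def pca_reduce(X: np.ndarray, k: int) -> np.ndarray:
    """Summarize a (samples x variables) table by its first k principal components."""
    return PCA(n_components=k).fit_transform(X)

# The joint analysis then works on an n x 20 matrix instead of n x 2500.
Z = np.hstack([pca_reduce(transcriptome, 10), pca_reduce(proteome, 10)])
print(Z.shape)   # (30, 20)
```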

P.3.2. Omic or multiomic knowledge integration and acquisition

The focus is often on the need for multi-omics data integration. This need is undeniable. However, at the statistical level, we should not forget the need for mono-omic integration. A large number of classical analysis tools model biological entities independently (or almost independently). For example, RNA-seq data are most often studied through differential analysis, in which genes are analyzed almost independently (Robinson et al. 2010; Love et al. 2014). A form of integration does occur in the estimation of the overdispersion parameter, or in pathway analyses. This integration already raises very important statistical difficulties. However, more should be done in modeling dependencies within a single type of omics data (see Chapters 4 and 5, for example).
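
To give a flavor of what integration through the dispersion parameter means, here is a toy Python sketch: simulated negative binomial counts, per-gene method-of-moments estimates and an arbitrary shrinkage toward the common mean. It only illustrates the idea of sharing information across genes; it is not the actual moderation procedure of edgeR or DESeq2.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_genes = 6, 200

# Simulate RNA-seq-like counts with gene-specific means and dispersions
# (negative binomial with variance mu + disp * mu^2).
mu = rng.uniform(50, 500, size=n_genes)
disp = rng.gamma(shape=2.0, scale=0.05, size=n_genes)
counts = rng.negative_binomial(n=1.0 / disp, p=1.0 / (1.0 + disp * mu),
                               size=(n_samples, n_genes))

# Per-gene method-of-moments dispersion: disp_hat = (var - mean) / mean^2.
m = counts.mean(axis=0)
v = counts.var(axis=0, ddof=1)
raw = np.clip((v - m) / np.maximum(m, 1.0) ** 2, 1e-4, None)

# "Integration" across genes: shrink each noisy per-gene estimate toward the
# common mean dispersion (arbitrary weight, purely for illustration).
w = 0.3
moderated = w * raw.mean() + (1.0 - w) * raw
print(raw[:3], moderated[:3])
```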

Clearly, integrating data should make it possible to take advantage of the very large number of datasets already available and perform powerful meta-analyses. A more data-driven, computational and simulation-based science is often predicted. Nevertheless, we think it is important to take the integration into consideration during the knowledge acquisition process (see Figure 1 of Camacho et al. (2018)). In this framework, an important question is to know how “easily integrable” data can be generated. In statistics, this is often referred to as “experimental design”. The answer will obviously depend on what is to be biologically predicted or understood and which validation techniques are available.

P.3.3. Presentation of the second part

In summary, omics data integration seems to be a key objective for a more integrative and systemic biology. At the statistical level, there are still some methodological obstacles remaining, in particular high-dimensionality, managing missing data, prediction within uncertain contexts and validation.

Moreover, we should not forget the objective: to answer a biological question. Defining this question is not always simple. Does a biological process of interest have to be predicted or understood? Will the analysis be supervised, unsupervised or semi-supervised? What are the implicit or explicit assumptions of the analysis performed, and are they consistent with the biological question?

The statistical chapters in this book hopefully illustrate how integration allows progress to be made on a particular biological point of interest, the diversity of methodological approaches and some of the difficulties encountered. Chapters 4 and 5 address mono-omic data integration for predicting a phenotype. Chapters 6 and 7 present exploratory techniques for multi-omics analysis.

We asked the authors to present their work in a pedagogical manner and to provide the codes of their analyses and simulations. A reading committee composed of statisticians, mathematicians, bioinformaticians and biologists was able to appreciate and validate their efforts. We hope that this will make these chapters accessible to as many people as possible. We would like to thank the authors of the chapters and the reviewers for their work.

P.4. References

Camacho, D.M., Collins, K.M., Powers, R.K., Costello, J.C., Collins, J.J. (2018). Next-generation machine learning for biological networks. Cell, 173(7), 1581–1592.

Chen, H., Yu, T., Chen, J.Y. (2012). Semantic Web meets integrative biology: A survey. Briefings in Bioinformatics, 14(1), 109–125.

Giraud, C. (2014). Introduction to High-Dimensional Statistics. Chapman & Hall/CRC Press, London.

Love, M.I., Huber, W., Anders, S. (2014). Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biology, 15(12), 550.

Ritchie, M.D., Holzinger, E.R., Li, R., Pendergrass, S.A., Kim, D. (2015). Methods of integrating data to uncover genotype–phenotype interactions. Nature Reviews Genetics, 16(2), 85–97.

Robinson, M.D., McCarthy, D.J., Smyth, G.K. (2010). edgeR: A Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics, 26(1), 139–140.

Zheng, C.L., Ratnakar, V., Gil, Y., McWeeney, S.K. (2015). Use of semantic workflows to enhance transparency and reproducibility in clinical omics. Genome Medicine, 7, 73.

April 2023

PART 1. Knowledge Integration

1. Clinical Data Warehouses

Maxime WACK1,2 and Bastien RANCE1,2

1Hôpital Européen Georges Pompidou, AP-HP, Paris, France

2Centre de Recherche des Cordeliers, INSERM, Université Paris Cité, France

1.1. Introduction to clinical information systems and biomedical warehousing: data warehouses for what purposes?

Patient care in hospitals, private practices and other healthcare facilities produces a large amount of information: information in the form of text, such as clinical reports and prescriptions; information in the form of images, such as X-rays, scans and MRIs, as well as anatomical-pathological examinations; and structured information in the form of key-value pairs, such as the results of biological laboratory examinations or codes derived from standardized terminologies (e.g. the International Classification of Diseases). Already useful for decision-making during care, these data can be given a new lease of life in clinical data warehouses, which allow them to be reused for research projects and, in particular, for improving care. Unlike data collected in highly standardized clinical studies on selected and controlled patient profiles, health data warehouses integrate so-called “real life” data. The recorded data reflect practices and their evolution, the use of treatments outside planned frameworks and all the other incongruities that arise when tools are adapted to the requirements of daily patient care. Reuse is difficult to achieve due to the complexity of the data models associated with hospital information systems (HISs). Their design, oriented around information collection, patient monitoring and hospital management, is not adapted to the challenge of flexible and facilitated reuse.

1.1.1. Warehouse history

Clinical data warehouses were developed to allow the integration of heterogeneous data in a single location and according to a unified data model. Large clinical data warehouses appeared in the United States during the 1990s. In 1994, the Columbia University Medical Center in New York created a warehouse with the main objective of supporting clinical trials (Chelico et al. 2016). In Boston, Partners HealthCare Inc. (grouping together the Massachusetts General Hospital and the Brigham and Women’s Hospital) published in 2003 the description of a graphical interface named the Research Patient Data Registry (RPDR) (Murphy et al. 2010), making it possible for non-computer specialists to use their data repository. This warehouse would evolve in 2004 into the i2b2 (Informatics for Integrating Biology and the Bedside) solution (Murphy et al. 2010). The i2b2 warehouse has been widely used for many years by numerous institutions in the United States and around the world. In France, the Georges Pompidou European Hospital has been using the i2b2 model to develop its warehouse since 2008 (Zapletal et al. 2010). The AP-HP (Assistance publique-Hôpitaux de Paris) also deployed a common data warehouse for its 39 hospitals in 2016, initially using the i2b2 model. Further clinical data warehouse experiments and models have been developed in parallel and subsequently. For example, Vanderbilt University built a warehouse linked to a DNA bank (BioVU) (Danciu et al. 2014). More recently, the OMOP CDM (Observational Medical Outcomes Partnership Common Data Model) of the OHDSI (Observational Health Data Sciences and Informatics) consortium, initially designed for pharmacovigilance use cases, appears to be emerging as the international reference.

1.1.2. Using data warehouses today

Beyond simple reuse for clinical research, data warehouses have proven their usefulness for many purposes (decision support, cohort building, predictive model validation, pharmacovigilance). Today, they are at the heart of translational medicine, which aims to translate research results into practice in order to improve patient care. The data warehouse closes the loop of a virtuous circle by enabling data generated in the context of medical care to be reused for the benefit of research.

1.2. Challenge: widely scattered data

From the patient’s arrival to his or her diagnosis and possible surgery, including bioassays, drug prescriptions, nursing procedures and observations, as well as all the elements involved in keeping a medical record, a patient’s stay generates a large amount of very varied information, which is now increasingly often collected in the form of computerized data.

The main challenge of integrating clinical data lies in the massive heterogeneity of the data produced within the biomedical context. The diversity of data sources (each software tool and each measurement produces data of potential interest) and the increasing use of specialized systems contribute to the growing complexity of HISs. Hospitals keep patient files and run software to manage and monitor hospital activity – the PMSI (the French program for the medicalization of information systems), a LIMS (Laboratory Information Management System) and a PACS (Picture Archiving and Communication System) for managing imaging. Furthermore, there are complementary specialized information systems for advanced life support, genetic data or radiotherapy. The interconnection between these systems is not always obvious: it is not uncommon that, with the exception of the common management of patient identities and stays, there is very little communication between software programs, making multi-system interrogation very difficult. Besides the complexity of the systems, some warehouses may even integrate multiple sources (research databases, multiple hospital sites, etc.), making the problem even more acute (see Figure 1.1).

In addition to the diversity of information systems, there is the problem of how biomedical data are structured. In the clinical context, for example, the most structured data are often the results of biological laboratory analyses (e.g. hemoglobin measured at 12.8 g/dL), ICD-10 diagnostic codes (International Classification of Diseases, 10th Revision, for instance, code I10: essential (primary) hypertension), or medical procedure codes coded in France using the CCAM (Common Classification of Medical Procedures, for example, HHFA011: laparoscopic appendectomy). In contrast, data gathered by humans (doctors, nurses, physiotherapists, psychologists, researchers, etc.) are often collected as a combination of unstructured free text and partial structuring (the question may be labeled using interface terminologies (Rosenbloom et al. 2006), with a defined domain of possible answers). This is, for example, the case in the questionnaires of electronic patient records.

Consequently, each new kind of interaction between the patient and the care system (a new biological technique, a new imaging technology, etc.) induces a new category of data, often with its own typology. Each new medical measurement technique is likely to require a new typology to be defined, and these typologies may themselves evolve. Clinical data warehouses must therefore be able to manage these multiple typologies, as well as to accept arbitrary types.

Figure 1.1. Dedicated databases for a folder containing all the information

1.3. Data warehouses and clinical data

1.3.1. Warehouse structures

Many software solutions have appeared over the years, often developed in an ad hoc manner to target the information system of a particular hospital. Later, other solutions promoted an agnostic approach, allowing them to be adapted to and used with the information systems of other centers. Several different approaches have emerged, providing various answers to the problem of integrating heterogeneous data in a centralized manner.

1.3.1.1. Kimball model (star schema warehouse)

Kimball’s model (Kimball 1996) is oriented around observations. A central table records the facts (a biological result, a medical prescription, etc.) in a single model. A series of additional tables provide the context of the information. In the context of health data, this is often the patient, the stay, etc. The fact-based approach allows for a fairly broad definition of a patient’s care, reducing it to a succession of atomic interactions with the various actors in the care process without any a priori restrictions on the number, order or type of these interactions, and without imposing any restrictions on the form that a patient’s stay should take. The model is simple, very easy to implement and understandable to warehouse end users.

The main example of this approach is the aforementioned i2b2 (Informatics for Integrating Biology and the Bedside) platform, which dates back to the early 2000s. The stated objective of i2b2 was to provide an exploitation-oriented clinical data warehouse system for research, especially translational research, which facilitated linking laboratory data (biological analyses and new generation data: DNA tests, sequencing, etc.) to clinical data collected at the patient’s bedside. It was therefore necessary to propose a system allowing the dynamic representation of types.

The i2b2 model follows the canonical structure of the so-called “star schema”, with a central “facts” table surrounded by peripheral tables providing information unique to each of the keys in the central table. For example, a peripheral table can contain information unique to each individual identified by a key used in the central table. A fact is the product of an atomic interaction between the patient and the hospital: a diagnosis, a bioassay result, a drug prescription, etc. For each observed fact, the whole context can be found: the patient, his or her hospitalization (stay), as well as the medical unit in which the observation was made. In the absence of a specific table for each type of data, the nature of the observation is recorded in the form of a concept (see the attribute concept_cd in Figure 1.2), a term from a more or less structured nomenclature. In i2b2, these tree-shaped nomenclatures are wrongly called “ontologies” since they do not allow for the representation of complex semantic relations. However, they do permit the representation of taxonomies, with hyponymy (A is-a B, for example, a lung cancer is a cancer), synonymy (A is B) and hyperonymy (A contains B) relationships. It is thus possible to express most existing medical terminologies in the i2b2 formalism and to propose an intuitive classification mechanism that enables the exploration of existing concepts and the creation of personalized terminologies. Each concept thus organized can be used to represent its own occurrence (a diagnosis, a procedure, etc.), or be associated with a specific value (numerical for a dosage result or a drug dose, textual for a microbiology result or a medical questionnaire item, or even arbitrary binary information which can contain any type of document).

For example, the measurement of Mr. X’s natremia at 140 mmol/L on November 3, 2017 during his visit to the emergency room could be coded by adding a new row to the observation_fact table containing the unique number of this particular hospitalization, the unique identifier of Mr. X used to link his data in the HIS, the identifier of the concept of natremia expressed in mmol/L, the identifier of the emergency department, the date and precise time of the sampling and the numerical value of the measurement (see Table 1.1).

Figure 1.2. i2b2 star schema

Table 1.1. Extract from the observation_fact table containing three observations for two patients

encounter_num  patient_num  concept_cd  provider_id     start_date  nval_num  unit_cd
123456         112233       BIO:NAT     STRUCT:URG      2017-11-03  140       mmol/l
135691         112233       BIO:NAT     STRUCT:CARDIO   2018-10-22  124       mmol/l
136899         654332       CIM:I10     STRUCT:PNEUMO   2018-05-31  {NULL}    {NULL}

The simplicity of Kimball’s model and its ability to incorporate all sorts of data have led to the widespread adoption of i2b2. On the other hand, the Kimball model suffers from a number of drawbacks:

The semantics of the attributes of the fact table may change according to the type of data. For example, the record date (start_date) may correspond to a point in time (for instance, in the case of an instantaneous dosage) or to the start of an event (the start of a drug administration, for instance).

In its original form, using only the attribute concept_cd to express the nature of the recorded data, the i2b2 model was not able to express the full potential of the medical data. In 2010, the designers added two additional attributes (modifier_cd and instance_num), allowing more complete information through an annotation and an occurrence number. For example, a drug such as aspirin (designated by the aspirin concept) can be modified to express the prescribed dose; the model introduces a dose modifier in this case. The instance is used to assign a unique identifier to the concept, thus allowing the aspirin prescription, its dose, its mode of administration, etc., to be grouped together. The use of modifiers is complex since their meaning may vary depending on the concept used.

In addition to i2b2, many warehouse models are based on the star model, or on forms derived therefrom.

1.3.1.2. The Inmon model

The first commonly accepted data warehouse model is the Inmon model (Inmon 1992). The author recommends the use of the relational model in third normal form (3NF), an organization in which every non-key attribute depends only on its table’s key, thus avoiding inconsistencies and redundancy. Proposed in 1992, this model maintains a high level of information structuring while allowing the warehouse to be queried and used efficiently. Inmon proposes a top-down approach, the warehouse being built by analyzing the existing data and following a set of principles. Tables are created based on the topics covered by the data (patients, diseases, drugs, etc.). Building such warehouses incurs large costs, since it requires a thorough understanding of the integrated data.

The Common Data Model of the OHDSI (Observational Health Data Sciences and Informatics) consortium, known as the OMOP CDM, is an example of the application of the Inmon model in the health domain. Each type of entity corresponds to a table (see Figure 1.3), with specific attributes (e.g. the table describing drug treatments has a “reason for discontinuation” attribute). The tables are linked by foreign keys. As in the i2b2 model, OHDSI uses hierarchical nomenclatures to describe the recorded concepts (such as natremia). The OHDSI model is currently being widely adopted by both public and private sector users.
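
A simplified sketch of this one-table-per-entity organization, with foreign keys linking the tables (a hypothetical fragment, not the full OMOP CDM; identifier and concept values are placeholders):

```python
import sqlite3

# One normalized table per entity type, each carrying attributes specific
# to that entity, linked by foreign keys.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE person (
    person_id         INTEGER PRIMARY KEY,
    year_of_birth     INTEGER,
    gender_concept_id INTEGER
);
CREATE TABLE drug_exposure (
    drug_exposure_id INTEGER PRIMARY KEY,
    person_id        INTEGER REFERENCES person(person_id),
    drug_concept_id  INTEGER,
    stop_reason      TEXT      -- attribute specific to drug treatments
);
""")
con.execute("INSERT INTO person VALUES (1, 1962, 8507)")
con.execute("INSERT INTO drug_exposure VALUES (10, 1, 1112807, 'adverse event')")

# Entities are reassembled at query time through joins on the foreign keys.
for row in con.execute("""
    SELECT p.person_id, p.year_of_birth, d.drug_concept_id, d.stop_reason
    FROM drug_exposure d JOIN person p ON p.person_id = d.person_id"""):
    print(row)
```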

1.3.1.3. Other data warehousing models

1.3.1.3.1. From the fact to the document

The data warehouses described above are largely based on the idea that health data are at least partially structured, organized using terminologies and reused as such. Several studies have shown that a large part of medical information, most of it in fact (Raghavan et al. 2014), is found in clinical narrative texts (medical reports, discharge letters and prescriptions) written in free text form. These documents contain not only medical observations, but also some of the clinicians’ reasoning and assumptions and, generally speaking, a more detailed level of clinical observation.

To meet the need for querying and exploiting these data, “text”-oriented data warehouses have been developed (Hanauer 2006; Cuggia et al. 2011; Garcelon et al. 2018), such as Dr. Warehouse.

Dr. Warehouse is a warehouse organized around the document (examination or hospitalization report, medical observation, etc.), which replaces the structured fact of the star model. The warehouse is optimized to accelerate querying, but also to manage attributes specific to natural language, such as negation or the subject of a statement (“the patient does not exhibit such a symptom”, “the patient’s grandfather died of this type of illness”). Text-oriented warehouses are coupled with natural language processing methods so that the information they contain can be queried in a normalized and structured form.
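
As a rough illustration of the kind of language-specific processing involved (a NegEx-inspired toy, not Dr. Warehouse’s actual method), a concept can be flagged as negated when a negation cue appears shortly before it:

```python
import re

# Flag a concept as negated if a negation cue occurs within a small window
# of text immediately preceding it.
NEGATION_CUES = r"\b(no|not|denies|without|does not exhibit)\b"

def is_negated(text: str, concept: str, window: int = 40) -> bool:
    for m in re.finditer(re.escape(concept), text, flags=re.IGNORECASE):
        before = text[max(0, m.start() - window):m.start()]
        if re.search(NEGATION_CUES, before, flags=re.IGNORECASE):
            return True
    return False

print(is_negated("The patient does not exhibit any fever.", "fever"))   # True
print(is_negated("The patient reports fever since Monday.", "fever"))   # False
```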

Figure 1.3. OHDSI CDM data model (image distributed under the Creative Commons Zero v1.0 Universal license)

1.3.1.3.2. Data warehouses organized around complex representations of data

Computer research on data warehouses continues, and models more complex than the Kimball or Inmon models have been developed. These models make it possible, for example, to manage data historization or to represent strong semantic links (such as temporality). These new warehouse models are often based on knowledge modeling (for example, ontological modeling) and on the structure of the data sources (Khnaisser et al. 2015).

1.3.2. Warehouse construction and supply

Warehouses based on structured data representation models most often make use of relational database management systems based on the Structured Query Language (SQL). These systems allow the implementation of the described data structures and of the constraints necessary for their operation. Implementations based on so-called NoSQL systems (CouchDB, MongoDB, Redis, etc.), which do not use classical relational databases but key-value stores that may hold documents, are beginning to emerge for clinical data warehouses, a few years after their mainstream adoption in other computing applications.

Once the database management system has resolved the technical dimensions, and the warehouse architecture the data representation and query structure, the main difficulty in building a warehouse lies in integrating the data produced by the institution into the chosen warehouse solution. This is the Extract, Transform and Load (ETL) process (Murphy et al. 2006) (see Figure 1.4). The ETL process can include a de-identification step during the transformation stage, for example, removing the patient’s personal information from the text of reports. For each piece of data to be integrated, we must know how it can be extracted from the source information system in an automatic, programmed, regular and systematic way, and ideally be able to detect changes in the structure of the imported data, or errors during its extraction. The data must then undergo one or more processing steps to transform it from the original format to the one accepted by the destination solution (e.g. associating the different keys with an observation in the star model, harmonizing the units in which the same biological result is expressed across several laboratories, pseudonymizing a hospitalization report, etc.). Finally, the data are actually loaded into the warehouse. Several loading strategies are possible, from periodically regenerating the whole warehouse to supplying data incrementally in real time.
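
A toy sketch of such an ETL chain, under simplifying assumptions: the source is a hypothetical CSV export of laboratory results, the transformation harmonizes units and pseudonymizes the patient identifier, and the load step feeds an observation_fact table like the one sketched in section 1.3.1.1.

```python
import csv
import hashlib

def extract(path):
    """Extract: read the source export row by row (hypothetical column names)."""
    with open(path, newline="") as f:
        yield from csv.DictReader(f)

def transform(row):
    """Transform: harmonize units and pseudonymize the patient identifier."""
    value, unit = float(row["value"]), row["unit"]
    if unit == "g/dL":                     # e.g. align two laboratories' units
        value, unit = value * 10.0, "g/L"
    # Toy pseudonymization; a real system would use a keyed, managed scheme.
    pseudo = hashlib.sha256(row["patient_id"].encode()).hexdigest()[:16]
    return (row["encounter_id"], pseudo, row["concept_cd"],
            row["care_unit"], row["sampling_date"], value, unit)

def load(con, rows):
    """Load: incremental insert into the warehouse fact table."""
    con.executemany("INSERT INTO observation_fact VALUES (?, ?, ?, ?, ?, ?, ?)", rows)
    con.commit()

# Usage, assuming a connection `con` to a warehouse containing the table above:
# load(con, (transform(r) for r in extract("lab_results_export.csv")))
```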

1.3.3. Uses

The computerization, over the last 20 years, of the various components of the HIS described in section 1.2, as well as of the medical record (electronic health record, EHR), has enabled the collection of a large amount of digitized health care data, each type in its original system and in its own format. The EHR is deployed within hospitals and is specific to the institution. The shared medical record is another computer-based initiative, covering the entire French territory, which aims to enable the sharing of care data between hospitals and community medicine. At the time of writing, the majority of institutions have an EHR, but only a minority of the population and of practitioners have adopted the shared medical record.

Figure 1.4. Data extraction, transformation and loading

In clinical research, evaluating the feasibility of a study, building cohorts, selecting patients for retrospective studies and collecting data, even when the data originate from routine health care, are costly tasks in terms of time, money and personnel. Collecting these data in a routine computerized system is therefore of great interest for their reuse in clinical research, but also for producing health indicators or for managing the institution.

HIS data are generally accessible in the context of clinical management, through various business software programs, in a so-called “vertical” manner: each program gives access to all of a given patient’s information, or a single EHR program does so when the data from business applications are integrated. Use for research or strategic-orientation purposes, on the other hand, requires a “horizontal”, population-based view of the data, giving access to a given piece of information for all patients or for a subset of them. This view is provided by clinical data warehouses, whose interest for research drove the initial developments.

1.3.3.1. Utilization in research

The centralized availability of an institution’s clinical data, with the possibility of selecting patients according to multiple criteria (clinical, biological, pharmacological, genetic, etc.) and transversally across the units that treated them, was first considered for its interest in research.

The most “trivial” use of such a tool is the creation of cohorts (Stephen et al. 2003). Access to all care data makes it possible to search for all patients matching a list of inclusion/non-inclusion criteria for a given study. It is thus possible to quickly compile a list of patients known to the institution and eligible for inclusion in a new study, where paper-based filing or data fragmentation in the HIS used to imply long and complex work. Previously, it was necessary to perform an initial rough selection of eligible patient files against a limited number of available criteria (e.g. the diagnostic or procedure codes produced within the framework of the French program for the medicalization of information systems (PMSI), or data of interest collected over time by practitioners within a department), and then to consult these files manually to verify each of the selection criteria. This task is now simplified by the centralization of data in a warehouse. The researcher still has to deal with other difficulties arising from coding inaccuracies, from particularities of code use linked to various epidemiological and economic factors, or simply from the technical challenge of writing such a query, and a manual validation of the inclusions remains necessary in all cases; the effort is nevertheless largely accelerated and the volume of files to be checked greatly reduced.

In addition to identifying patients against a list of criteria, the presence of the data itself makes it possible to conduct studies entirely “on the warehouse”. These are typically retrospective case–control studies, which can be conducted directly with the data extracted following the selection of patients. For example, it is possible to select a first list of patients presenting a particular symptom (the cases) and a second list of patients not presenting this sign but comparable on the basis of other data (age, sex, notable history, etc.) (the controls), and then to search their medical history for all drug administrations in order to identify molecules potentially responsible for this adverse effect.
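
A sketch of such a selection against the hypothetical observation_fact table used earlier; all concept codes are placeholders and the matching of controls on age, sex, etc. is deliberately omitted:

```python
import sqlite3

con = sqlite3.connect("warehouse.db")   # hypothetical warehouse built as above

# Cases: patients presenting the symptom of interest (placeholder concept code).
cases = {r[0] for r in con.execute(
    "SELECT DISTINCT patient_num FROM observation_fact WHERE concept_cd = ?",
    ("CIM:R11",))}

# Candidate controls: all other patients present in the warehouse.
controls = {r[0] for r in con.execute(
    "SELECT DISTINCT patient_num FROM observation_fact")} - cases

# Drug administrations found in the cases' medical history.
marks = ",".join("?" * len(cases))
drug_history = con.execute(
    "SELECT patient_num, concept_cd, start_date FROM observation_fact "
    f"WHERE concept_cd LIKE 'DRUG:%' AND patient_num IN ({marks})",
    tuple(cases)).fetchall()
```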