As platforms for sharing, re-using and storing data, research data repositories are integral to open science policy. This book provides a comprehensive approach to these data repositories: their functionalities, uses, issues and prospects. Taking France as an example, the current landscape of data repositories is considered, including a discussion of the idea of a national repository and a comparative study of several national systems. The international re3data directory is outlined, and six case studies of model repositories, both public and private, are detailed (CDS, Data INRAE, SEANOE, Nakala, Figshare and Mendeley Data).
Research Data Sharing and Valorization also includes appendices containing a number of websites and reference texts from the French Ministry of Higher Education, Research and Innovation, and the CNRS. To the authors’ knowledge, it is the first book to be entirely devoted to these new platforms and is aimed at researchers, teachers, students and professionals working with scientific and technical data and information.
Page count: 409
Year of publication: 2022
Cover
Title Page
Copyright
Foreword
1 The Research Data Repository Facility
1.1. Introduction
1.2. The term repository in the context of open access
1.3. How to define a research data repository
1.4. Variable geometry devices
1.5. The question of trust
1.6. Certification
1.7. The FAIR principles
1.8. Lifecycle and facility
1.9. References
2 The Landscape of Research Data Repositories in France
2.1. Introduction
2.2. Context
2.3. Number
2.4. Types of repositories
2.5. Institutions and partners
2.6. Domains
2.7. FAIR principles
2.8. Certification
2.9. Perspectives
2.10. References
3 The International Community: The Strasbourg Astronomical Data Centre (CDS)
3.1. Introduction
3.2. The Strasbourg Astronomical Data Centre
3.3. The mission and organization of the CDS
3.4. The evolution and services of the CDS
3.5. FAIR principles in astronomy: the astronomical virtual observatory
3.6. The use of CDS services
3.7. Overview
3.8. Perspectives
3.9. Acknowledgments
3.10. References
4 Data INRAE – The Networked Repository
5 SEANOE – A Thematic Repository
6 Nakala – A Data Publishing Service
7 The National Repository Option
7.1. Introduction
7.2. The concept
7.3. The request
7.4. Features and services
7.5. Architecture
7.6. Alternatives
7.7. Perspectives
7.8. Addendum
7.9. References
8 Comparative Study of National Research Services
8.1. Introduction
8.2. Framework, objectives and scope of the study
8.3. Recent national schemes
8.4. Missions and objectives
8.5. History of the devices
8.6. Governance arrangements
8.7. Business models
8.8. Service offer
8.9. Co-constructed services
8.10. Key success factors
8.11. References
8.12. Webography
9 Mendeley Data
10 Figshare – A Place Where Open Academic Research Outputs Live
11 Community-Driven Open Reference for Research Data Repositories (COREF) – A Project for Further Development of re3data
12 Issues and Prospects for Research Data Repositories
12.1. Introduction
12.2. The central role of repositories and diversity in the field
12.3. The issues
12.4. The dynamics of technology
12.5. The “new generation” data repository
12.6. Toward a “new normal”?
12.7. References
Appendices
Appendix A: Websites
Appendix B: Reference Documents
List of Authors
Index
Wiley End User License Agreement
Chapter 1
Table 1.1. Sustainable file formats recommended by the 4TU.ResearchData reposito...
Table 1.2. The TRUST Principles (Lin et al. 2020)
Table 1.3. The 16 CoreTrustSeal themes (CTS 2019)
Table 1.4. The FAIR principles
Chapter 2
Table 2.1. Certified repositories (June 2020)
SCIENCES
Scientific Knowledge Management, Field Director – Renaud Fabre
Transformation Dynamics of Tools and Practices, Subject Head – Joachim Schöpfel
Coordinated by
Joachim Schöpfel
Violaine Rebouillat
First published 2022 in Great Britain and the United States by ISTE Ltd and John Wiley & Sons, Inc.
Apart from any fair dealing for the purposes of research or private study, or criticism or review, as permitted under the Copyright, Designs and Patents Act 1988, this publication may only be reproduced, stored or transmitted, in any form or by any means, with the prior permission in writing of the publishers, or in the case of reprographic reproduction in accordance with the terms and licenses issued by the CLA. Enquiries concerning reproduction outside these terms should be sent to the publishers at the undermentioned address:
ISTE Ltd
27-37 St George’s Road
London SW19 4EU
UK
www.iste.co.uk
John Wiley & Sons, Inc.
111 River Street
Hoboken, NJ 07030
USA
www.wiley.com
© ISTE Ltd 2022
The rights of Joachim Schöpfel and Violaine Rebouillat to be identified as the authors of this work have been asserted by them in accordance with the Copyright, Designs and Patents Act 1988.
Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s), contributor(s) or editor(s) and do not necessarily reflect the views of ISTE Group.
Library of Congress Control Number: 2021953220
British Library Cataloguing-in-Publication Data
A CIP record for this book is available from the British Library
ISBN 978-1-78945-073-6
ERC code:
PE6 Computer Science and Informatics
PE6_10 Web and information systems, database systems, information retrieval and digital libraries, data fusion
Renaud FABRE
Professor Emeritus, Université Paris 8 Vincennes-Saint-Denis, France
Today, the foundations of true scientific data policies have become visible to researchers and users alike. It is therefore an opportune moment to propose an English and French version of a reference book, for research work as well as for higher education, that takes stock of the principles, projects and achievements in progress, applying once again the editorial policy of the SCIENCES Encyclopedia in this collection dedicated to “Knowledge Management”. This book is intended to occupy a central place in the basic and continuing education models of higher education, where a “data strategy” is currently being developed by all stakeholders1.
In this context, the main interest of the book is twofold:
– it presents a timely opportunity to discover the reference projects and understand how they are set up;
– it outlines, across a wide range of disciplines and situations, the characteristic facets of the goal of data policy, and of all the work of data science, or “data scholarship” as it has come to be called (Borgman 2020).
The book provides both of these insights in a relevant and complete manner, by and for those who build these data policies and who are both the primary actors and users: disciplinary or thematic structuring projects, current and future national sharing formulas and mechanisms are reviewed, as are the problems of data sharing, both for scientific uses and development. Recommendations for the future are formulated, and we can see that this book takes us to the heart of the new challenges of scientific work, where knowledge sharing and the enrichment of its methods are being renewed simultaneously.
The book’s emphasis on research data archiving should be well understood from the outset. The repository function is essential in every respect to the process of constructing data sharing: there can be no visibility, comparison, positioning, curation, sampling, stock management or replication without repositories. The repository is thus the current canonical form, global or local, of any constructed data policy process. Because of its obvious functional importance, the data repository is also the very place where researchers learn collectively to share: indeed, what would the perspective be if each experience of sharing, each practice, remained confined within the walls of a single laboratory and a single use? The repository is clearly the vehicle for a new scientific practice, hence the utmost importance given to the practices of registering and managing repositories, as demonstrated by a recent publication on this subject (Downs 2021). We will return to this later in this introduction.
A few remarks on the context of the publication of this work and a look at its scope will help to illustrate this point.
The context in which this book was published is that of a “French-style” data policy that is rich in a public science dynamic, itself driven by the spirit of sharing and opening up scientific data, addressed to all of the beneficiaries and users of science.
In this sense, many steps have already been taken or are underway, with the construction of analyses and mechanisms concerning the possibility of sharing, the need to do so, and the need to share and exchange standards. This reflects major legislative changes, such as the law of July 17, 1978, which deems the output of public authorities, and of bodies under their control, to be “public data”, along with data that may be communicated on request, with certain limited exceptions. The principle of accessibility, the primary characteristic of Open Data, makes communication obligatory and bases open science on access to a common good that is, from the outset, a right of access to the data.
In this matter, France remains marked by the simplicity and universality of its data access regime, based, it may be recalled, on Article 15 of the Declaration of the Rights of Man and of the Citizen (the right to communicate public data) and on the 2003 European Parliament Directive on the reuse of public sector information2. The French national experience has recently been enriched by a comprehensive legislative framework structuring Open Data: the Valter Act of December 28, 2015 on free access to and reuse of public sector data and, subsequently, Articles 30 and 38 of the Act for a Digital Republic of October 2016, to name but a few of the many recent European and national texts on this topic.
A stimulating conception of science in the digital age has thus been set in motion, sometimes starting from afar, with pioneers and precursors sometimes 20 years ahead of their time, and all driven by the same conviction: data are singular objects. Data, in fact, are both singular and multiple in that they contain an observation, recorded in a context and may simultaneously belong to larger or simply parallel sets, where they take on another meaning and participate in another observable reality. The data are also singular in that they can partly fall within the scope of Open Data practices and partly, in a contractual sense, fall within the scope of systems of valorization by the company or by the non-commercial associative fabric. This plasticity of data and their use shapes the ways in which they are made available.
Of course, all of these scientific sharing operations have their own meaning, which is that of the protocol to which this data sharing is linked, but this sharing is also an opportunity to expose, verify and reproduce operations, which corresponds to the growing need for reproducibility and replicability (Fineberg 2020).
We are beginning to reflect on the idea of a national networked service or services for data (Catherine 2020), and this book makes a rich and positive contribution to it.
Storing is the easiest operation in the world in physical terms, but scientifically it is one of the most complex. This is true for any deposit of information of any kind, but it is even more difficult to devise for data: it is a demanding and vigilant attitude that is required here, the rules of which are more and more clearly shared and accepted (Jeffery et al. 2021).
What differentiates a data repository from complete chaos (the “Spanish inn” of the well-known French expression – apologies to our Spanish neighbors!) is the presence or absence of services and of in-depth support for the researcher’s needs. Here too, many reflections are progressing with a level of requirement that guarantees clear progress toward the maturity of projects (Suhr et al. 2020), all of which must reckon with the uneven progress of open access (Hahnel et al. 2020) and of training in digital tools (Klebel et al. 2020).
In conclusion, the book’s panoramic scope allows it to examine the progression toward maturity of a field in which all of science is now solicited and questioned from multiple angles. Of course, as Schöpfel et al. remarked in a very valuable previous publication, what answer can be given to the question: “how can we think of these ‘data documents’ which are part of the science in the making and which therefore quite naturally ‘blur the boundaries’” (Schöpfel et al. 2020)?
However, let us be reassured and find excellent reasons to hope in the current dynamics: this blurring is only the temporary effect of a recomposition of the scientific information landscape, in which, across the board, data repositories occupy a strategic founding place, at the forefront of the organized sources of a new way of doing science. In our recent joint article devoted to the scientific information platforms in the making at the European and international levels, we emphasize the various forms that the requirement for data traceability and the structuring of their flow takes (Fabre et al. 2021). It is true that it is time, as observed by the European Commission3, to deepen the approaches to open science that bring all of the actors and users of scientific information to the same table, in an approach that reconciles all of the uses and thus brings to life all of the aspects of open science in a deployment of data science that remains a critical phase (Davenport and Malone 2021).
This phase is all the more critical as the development of “multi-user” scientific projects is now extremely vigorous, as is that of collective learning projects (He et al. 2020), which are being used more and more widely in a growing number of disciplines. This development is also in line with that of sharing and storage tools such as knowledge graphs, which are developing extremely rapidly in the scientific field, and are oriented above all toward the shared use of heterogeneous data, thus combining various sources of documents and data.
January 2022
Borgman, C.L. (2020). Qu’est-ce que le travail scientifique des données ? Big Data, Little Data, No Data. OpenEdition Press, Marseille.
Catherine, H. (2020). Etude comparative des services nationaux de données de recherche : facteurs de réussite. MESRI comité pour la science ouverte [Online]. Available at: https://www.ouvrirlascience.fr/etude-comparative-des-services-nationaux-de-donnees-derecherche-facteurs-de-reussite/.
Davenport, T. and Malone, K. (2021). Deployment as a critical business data science discipline. Harvard Data Science Review [Online]. Available at: https://doi.org/10.1162/99608f92.90814c32.
Downs, R.R. (2021). Improving opportunities for new value of open data: Assessing and certifying research data repositories. Data Science Journal, 20(1), 1.
Fabre, R., Egret, D., Schöpfel, J., Azeroual, O. (2021). Evaluating scientific impact of research infrastructures: The role of current research information systems. Quantitative Science Studies, 1–25 [Online]. Available at: https://doi.org/10.1162/qss_a_00111.
Fineberg, H., Stodden, V., Meng, X.-L. (2020). Highlights of the US National Academies Report on “Reproducibility and Replicability in Science”. Harvard Data Science Review, 2(4) [Online]. Available at: https://doi.org/10.1162/99608f92.cb310198.
Hahnel, M., McIntosh, L.D., Hyndman, A., Baynes, G., Crosas, M., Nosek, B., Shearer, K., van Selm, M., Goodey, G. (2020). The State of Open Data 2020. Digital Science Report [Online]. Available at: https://doi.org/10.6084/m9.figshare.13227875.v2.
He, C., Li, S., So, J., Zeng, X., Zhang, M., Wang, H., Wang, X., Vepakomma, P., Singh, A., Qiu, H. et al. (2020). FedML: A Research Library and Benchmark for Federated Machine Learning. Preprint. ArXiv [Online]. Available at: https://arxiv.org/abs/2007.13518.
Jeffery, K., Wittenburg, P., Lannom, L., Strawn, G., Biniossek, C., Betz, D., Blanchi, C. (2021). Not ready for convergence in data infrastructures. Data Intelligence, 3(1), 116–135.
Klebel, T., Reichmann, S., Polka, J., McDowell, G., Penfold, N., Hindle, S., Ross-Hellauer, T. (2020). Peer review and preprint policies are unclear at most major journals. PLoS ONE, 15(10), e0239518.
Schöpfel, J., Farace, D., Prost, H., Zane, A., Hjørland, B. (2020). Data documents. Encyclopedia of Knowledge Organization, 48(4), 307–328 [Online]. Available at: https://www.isko.org/cyclo/data_documents.
Suhr, B., Dungl, J., Stocker, A. (2020). Search, reuse and sharing of research data in materials science and engineering – A qualitative interview study. PLoS ONE, 15(9), e0239216.
1 See, for example, the adult learning course at Sorbonne University. Available at: https://www.data-strategie.sorbonne-universite.fr/.
2 Directive 2003/98/EC. Available at: https://eur-lex.europa.eu/legal-content/EN/TXT/PDF/?uri=CELEX:32003L0098&from=en.
3 Available at: https://projectescape.eu/news/launch-initial-escape-esfri-science-analysis-platformdiscovery-data-staging.
Violaine REBOUILLAT1 and Joachim SCHÖPFEL2
1Claude Bernard University Lyon 1, Villeurbanne, France
2University of Lille, Villeneuve d’Ascq, France
What is a research data repository? Where does this term come from? How can we describe this facility? This chapter attempts to answer all of these questions, even though the landscape of data repositories is very varied and heterogeneous. The issue of trust is at the heart of the development of research data repositories – trust in both the content and the quality of the facility. The issue of repository certification is closely related to the issue of trust.
The term repository comes from the field of computer science. “In computing, a repository is a centralized and organized store of data. It can be one or more databases where files are located for distribution over the network or a place directly accessible to users”1. There are repositories for source code, software, data... the list goes on.
In the early 2000s, the term repository found a specific use in the context of open access. It can be found in the acronyms ROAR (Registry of Open Access Repositories) and OpenDOAR (Directory of Open Access Repositories), both directories of open access repositories that make scientific content (journal articles, conference papers, research data, etc.) available.
In the context of open access, repositories are defined as follows by JISC2: “A repository is a set of services that a research organization offers to the members of its community for the management and dissemination of digital materials created by its community members” (Hubbard 2016).
According to this definition, the primary function of repositories is therefore the management and communication of digital content. These are based on a number of basic services including the storage of content, the creation of associated metadata, the indexing, dissemination and preservation of this content under predefined conditions of access and use, and the ongoing maintenance of these services. The open access policy also directs the dissemination of content toward “open” dissemination, with a minimum number of barriers to access (whether financial or technical).
The term research data repository originated from the open science movement, which gradually extended to research data (with the Berlin Declaration3 in 2003 and the OECD Declaration4 in 2004). It has been used to designate infrastructures specifically dedicated to the storage and sharing of scientific data.
In open science policies, data have therefore been progressively dissociated from publications: we have begun to speak of specific “data repositories”, whereas initially we were talking about repositories common to both data and publications (which have notably become known as open archives). In 2007, the European Research Council (ERC) recommended the storage of publications and data in an undifferentiated manner in “open access repositories”5, while 10 years later, in 2017, the European Commission distinguished between research data repositories (for data) and repositories for scientific publications (for publications)6.
During this period, dedicated repositories were developed at the request of research funders. Among the best known are:
– 4TU.ResearchData7, a multidisciplinary repository created in 2008 on the initiative of three Dutch universities;
– Dryad8, in the field of biology, launched in 2008 and maintained by a non-profit organization made up of publishers, scientific societies, research institutions, libraries and research funding agencies;
– Figshare9, launched in 2011, a generic repository hosting all types of research products (preprints, posters, research data, etc.) and funded by the company Digital Science;
– Zenodo10, launched in 2013, also a generic repository, hosted by CERN and developed within the framework of the European project OpenAIREplus.
The re3data11 directory currently (as of February 2021) lists over 2,600 data repositories.
While data repositories are an integral part of open science policies, they are not a new product. Long before this openness movement, several research communities had set up more or less extensive data structuring and communication infrastructures, for example, in astronomy (Borgman et al. 2016), crystallography (Bruno et al. 2017) and genomics (International Human Genome Sequencing Consortium 2001). A common feature of these disciplines is the use of large equipment for data generation. The cost of this equipment has led research communities to collaborate around data collection and thus to set up a system of data standardization and dissemination (Rebouillat 2019). The ability of these large facilities to generate large volumes of data has also been a driving force in the decision to pool data through international research networks responsible for their analysis (André 2015).
The diversity of these systems, both prior to and in accordance with contemporary open research data policies, explains why a very large number of repositories are listed as such in re3data.
In a CNRS working paper, data repositories have been related to the category of online platforms (DIST-CNRS 2017). Defined in the Act for a Digital Republic (Article 49)12, this concept refers to
an online public communication service based on:
1) the classification or referencing, by means of computer algorithms, of content, goods or services offered or put online by third parties;
2) or the bringing together of several parties with a view to the sale of a good, the provision of a service or the exchange or sharing of content, goods or services.
The platform concept thus encompasses very heterogeneous services, ranging from GAFAM to scientific journal platforms such as JSTOR, Érudit and SciELO and documentary platforms such as those of the CNRS.
The CNRS therefore proposes to define a sub-category of “science platforms”, which would include research data repositories. The term “science platform” would designate informational devices “with multiple functionalities [including a] repository of scientific data and work, access to scientific documentation and publications, and value-added information processing services [...]” (DIST-CNRS 2017).
Bringing together a “complicated mix of software, hardware, operations, and networks” (Kenney and Zysman 2016), including, in particular, discovery, repository and access rights management tools, the concept of a platform is nevertheless still very broad.
In this chapter, we propose a reflection on the concept of the research data repository. How are data repositories characterized? Are there any criteria for defining them?
We will deploy the concept of a facility, which will allow us to study repositories from both a technical and social perspective. We will first present the scope and functionalities of repositories, before focusing on the user dimension, which we will link to the notion of trust and related criteria. We will conclude by proposing some ideas for a definition.
Data repositories contribute to the mechanism of research data publishing. Despite all the connotations that the notion of “publication” may have, especially in the field of scientific communication, where it is often associated with peer-reviewing, etc., publishing here means “making public”.
Austin et al. (2017) propose the following definition:
Research data publishing is the release of research data, associated metadata, accompanying documentation, and software code (in cases where the raw data have been processed or manipulated) for re-use and analysis in such a manner that they can be discovered on the Web and referred to in a unique and persistent way. Data publishing occurs via dedicated data repositories and/or (data) journals which ensure that the published research objects are well documented, curated, archived for the long term, interoperable, citable, quality assured and discoverable – all aspects of data publishing that are important for future reuse of data by third party end-users.
This definition outlines four functions that appear to be essential to data publishing devices: describing, archiving, identifying and making available (Figure 1.1).
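As a rough illustration, these four functions (describing, archiving, identifying, making available) can be sketched as a minimal deposit workflow. All class, field and URL names below are invented for illustration; no real repository software or API is implied, and the DOI prefix is a placeholder:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class DatasetRecord:
    """A minimal, invented model of a dataset record in a repository."""
    title: str
    creators: list
    metadata: dict = field(default_factory=dict)  # describing
    archived: bool = False                        # archiving
    identifier: Optional[str] = None              # identifying
    landing_page: Optional[str] = None            # making available

def publish(record: DatasetRecord) -> DatasetRecord:
    """Walk a record through the four key functions of data publishing."""
    # 1. Describe: attach the metadata needed for discovery and reuse.
    record.metadata.setdefault("license", "CC-BY-4.0")
    # 2. Archive: mark the data as stored for long-term preservation.
    record.archived = True
    # 3. Identify: mint a persistent identifier (a fake DOI here).
    record.identifier = "10.9999/demo.%05d" % (abs(hash(record.title)) % 100000)
    # 4. Make available: expose a resolvable landing page for the record.
    record.landing_page = "https://repository.example.org/datasets/" + record.identifier
    return record

record = publish(DatasetRecord(title="Ocean temperatures 2020", creators=["Doe, J."]))
```

In a real repository, each step is of course a substantial service in its own right: metadata follows a schema, preservation involves format checks and replication, and identifiers are minted through a registration agency such as DataCite.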
We find these functions in the definition given by Cocaud and Aventurier (2017): a data repository is an “online service allowing the collection, description, conservation, search and dissemination of datasets”. Cocaud and Aventurier cite as examples the repositories Pangaea13, Dryad, Zenodo, Archeology Data Service14, the Strasbourg Astronomical Data Center15 or the Ifremer marine data portal16.
Figure 1.1. Four key functions of data repositories
While this definition may seem exact and straightforward, it nevertheless covers a complex reality. There are many definitions of the repository concept, and they do not all focus on the same functionalities. As an example, we can compare the definitions proposed by the Network of the National Library of Medicine (NNLM, NIH, USA) and by the re3data directory; they differ significantly. The re3data directory describes research data repositories as “a subtype of a sustainable information infrastructure which provides long-term storage and access to research data” (Rücknagel et al. 2015). This definition presents repositories as infrastructures that enable data discovery and use, while the NNLM proposes a definition that places more emphasis on repository functionalities: “A data repository can be defined as a place that holds data, makes data available for use, and organizes data in a logical manner. A data repository may also be defined as an appropriate, subject-specific location where researchers can submit their data”17.
While it is difficult to describe what a data repository is, it is possible to say what it is not. In particular, the term repository is distinct from archive. Even if data repositories borrow methods from archival science in terms of data preservation and documentation, the main issue is the reuse of this data in the short, medium or long term. Repository data are therefore preserved more for their scientific value than for their heritage value, unlike archival data.
This chapter proposes that we consider data repositories as digital devices, that is, as tools (or services of tools) for mediating scientific information between producers and users (Prost and Schöpfel 2019). The originality of this approach is that it allows us to study repositories not only in their technical but also social dimensions (Larroche 2019).
Considering data repositories as digital devices means, first of all, getting the measure of the variables that comprise them. What makes a repository unique depends on:
– hosted content (i.e. the nature of the data made available);
– the scope (there are institutional, national, disciplinary and publisher-specific repositories);
– the proposed features.
We will come back to these three parameters in more detail in the following subsections.
One of the peculiarities that repositories have to deal with is the almost infinite typology of entities that can be considered as research data (Pampel et al. 2013).
The notion of research data is a complex one to define, as it encompasses very different realities. Borgman (2015) considers that “an observation, an object, a document or any other entity becomes research data once it is used as evidence of a phenomenon, i.e., collected, analyzed and interpreted”.
Several works have highlighted the contextual nature of research data. According to Leonelli (2015), there is no such thing as data per se. What a scientist considers “data” is always relative to a specific research question. Data are not defined according to their intrinsic properties but according to their function within particular research processes. The question “what is data?” can only be answered in reference to concrete research situations.
This is why data cannot be considered as a fixed entity (it can be, but only if observed at a given moment in a particular context). Data are malleable objects that adapt to research trends, hence the notion of the “lifecycle” of data (Higgins 2012). Schöpfel et al. (2017) have isolated four parameters of variation of scientific data: its factual nature, its recording, the community that generates and/or uses it and its purpose.
These four dimensions therefore have an impact on the way a repository is designed. We find repositories specialized in the dissemination of data generated by large research infrastructures, such as the ILL Data Portal18, which hosts data from the spectrometers of the Institut Laue-Langevin. On the other hand, other repositories such as Figshare (see Chapter 10) have chosen to host so-called orphan data, for which there is no dedicated publication system. There are also specificities among repositories between simple data and complex data, big data and long tail data. The malleable nature of data nevertheless makes any attempt at typology imperfect (Rebouillat 2019) and, in fact, multiplies the number of possible specializations of repositories.
In addition to this data complexity, there is the question of the scope of the repository. The design of a data repository can be done at different levels. A research institution may decide to develop a repository to house the data produced by its researchers (INRAE, for example, with its Data INRAE19 repository, see Chapter 4). A publisher may propose a repository as part of its policy of making the data underlying the journal articles it publishes available (Mendeley Data20 from Elsevier, for example, see Chapter 9). A research consortium may choose to create a repository for the data of a particular discipline or type (this is the case of Pangaea).
The re3data directory has chosen to classify the different forms of existing repositories according to the following typology (Rücknagel et al. 2015):
– Government repositories: these are data collections maintained and managed by government institutions, whose repository arrangements generally exclude external contributions.
– Institutional repositories: these are repositories linked to a particular institution, usually covering several research disciplines. These institutions may have obligations regarding the storage and dissemination of data.
– Disciplinary repositories: they gather research results related to a particular discipline. Often, they cover a general discipline, with contributors from different institutions. They are likely to be funded by one or more entities within the thematic community.
– Multidisciplinary repositories: these are repositories that cover several research disciplines and meet multidisciplinary needs.
– Project-based repositories: they focus on data resulting from specific research projects.
– Other types of repositories (e.g. a specific repository for a research funder).
Figure 1.2. Typology of data repositories listed in re3data (source: re3data.org, survey of May 4, 2021, licensed under a Creative Commons Attribution 4.0 International License)
At present, the majority of repositories listed in re3data are disciplinary repositories (Figure 1.2)21. As of May 4, 2021, there were 2,089 out of a total of 2,677 (78%), although each repository can fall into several categories. This is one of the complexities of the data repository landscape. The scope of a repository can be measured at different levels: that of the institution(s) that provide(s) its governance; that of the institution(s) that fund(s) it; or that of the discipline(s) it covers. One example is the Gene22 repository, which specializes in genomic data. This repository is maintained by the National Center for Biotechnology Information (NCBI) and is funded by the US federal government.
Data repositories are also characterized by a set of basic functionalities, which revolve around the four verbs mentioned above: describing, archiving, identifying and making available. Here, we borrow the analytical framework of Assante et al. (2016), who study repositories along eight axes: formatting; documenting; licensing; publication costs; validation; availability; discoverability and access; and citation.
Formatting data is, in the words of Assante et al. (2016), organizing a dataset according to a certain format with the aim of ensuring its reusability. This involves two types of formatting:
– the formatting of the content, which guides the understanding and interpretation of the data (see section 1.4.5);
– the formatting of the container, which determines how the data can be read by software.
Formatting the data container amounts to choosing a file format. A repository may accept all file formats or only a selected few. For example, the 4TU.ResearchData repository’s Preferred formats guide provides a list of file formats considered optimal for the long-term preservation of data (Table 1.1)23.
Some repositories, especially disciplinary ones, host file formats specific to the discipline they cover. Marcial and Hemminger (2010) cited the following examples: FITS for astronomical data; GO/FASTA/Contig annotations for bioinformatics data; statistical formats (SAS, SPSS, R, Stata) for social science data; and GIS formats for earth science data and some biological science data. Other repositories, notably generic ones, instead give precedence to non-proprietary formats (such as HTML tables or comma-delimited CSV files).
Table 1.1. Sustainable file formats recommended by the 4TU.ResearchData repository
– Text file: plain text, XML, HTML, PDF (PDF/A-1), JSON, PDB (Protein Data Bank), XYZ (all formats must be encoded in UTF-8)
– Spreadsheet: CSV (comma-separated values), tab-delimited values
– Image: JPEG, TIFF, PNG, SVG
– Geospatial file: GML (Geographical Mark-up Language), KML (Keyhole Mark-up Language), ESRI Shapefile, GeoTIFF
– Digital file: NetCDF, CSV, JSON
– Video file: no sustainable format established
– Audio file: WAVE (Waveform Audio File Format)
– Database: delimited flat file with DDL
– Archives: ZIP, TAR, GZIP, 7Z
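A repository’s ingest workflow can screen deposits against a preferred-formats list of this kind. The sketch below illustrates the idea with a simplified, invented subset of the formats above; the mapping and function names are hypothetical, not part of any repository’s actual software:

```python
from pathlib import Path

# Simplified, illustrative subset of preferred formats (cf. Table 1.1);
# the real 4TU.ResearchData guide is more detailed than this mapping.
PREFERRED_FORMATS = {
    ".txt": "Text file", ".xml": "Text file", ".html": "Text file",
    ".csv": "Spreadsheet", ".tsv": "Spreadsheet",
    ".jpg": "Image", ".jpeg": "Image", ".tiff": "Image", ".png": "Image",
    ".zip": "Archives", ".tar": "Archives", ".gz": "Archives",
}

def check_format(filename: str):
    """Return (accepted, category) for a deposited file name."""
    ext = Path(filename).suffix.lower()
    category = PREFERRED_FORMATS.get(ext)
    return (category is not None, category)

print(check_format("survey_results.csv"))  # (True, 'Spreadsheet')
print(check_format("raw_capture.bin"))     # (False, None)
```

A real ingest pipeline would typically also inspect file contents (not just extensions) and suggest a sustainable conversion target when a deposit is rejected.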
One of the added values of data repositories is the documentation of data, that is, the addition of information to facilitate its discovery, understanding and reuse, both by humans and by machines. This information is also called metadata. It describes both the data and the context in which it was produced, that is, when, where, how and by whom it was collected.
For repositories, the issue lies in the metadata model to be used. To be as close as possible to the data, a customized model may be a wise choice. Conversely, for interoperability with other information systems, the use of a metadata standard such as Dublin Core or DataCite may be more legitimate.
In a study of 32 Canadian and international repositories, Austin et al. (2015) found that 69% of the platforms surveyed (22 out of 32) used an internally designed metadata schema to describe hosted data, while 38% (12 out of 32) used a standard metadata schema or at least mapped their own schema to a standard one.
Many standards coexist. The Research Data Alliance has attempted to bring them together in a directory, the Metadata Directory24, to make them more visible and easier to use (Ball et al. 2014).
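To make the trade-off concrete, the sketch below builds a minimal DataCite-style record (restricted to the schema’s mandatory properties) and maps it onto Dublin Core terms. The property names follow the public DataCite and Dublin Core vocabularies, but the record values and the crosswalk itself are illustrative assumptions, not an official mapping:

```python
# A DataCite-style record restricted to the schema's mandatory properties.
record = {
    "identifier": "10.1234/example-doi",   # hypothetical DOI
    "creators": ["Doe, Jane"],
    "title": "Example survey dataset",
    "publisher": "Example Repository",
    "publicationYear": 2021,
    "resourceTypeGeneral": "Dataset",
}

# Naive crosswalk to Dublin Core terms (illustrative, not an official mapping).
DATACITE_TO_DC = {
    "identifier": "dc:identifier",
    "creators": "dc:creator",
    "title": "dc:title",
    "publisher": "dc:publisher",
    "publicationYear": "dc:date",
    "resourceTypeGeneral": "dc:type",
}

dublin_core = {DATACITE_TO_DC[key]: value for key, value in record.items()}
print(dublin_core["dc:title"])  # Example survey dataset
```

Such crosswalks are what allow a repository to keep a rich internal schema while still exposing standard metadata to external harvesters.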
The choice of metadata model depends in part on how the data will be used by its secondary users. The way in which the data is described sets the boundaries for how it can be used. However, it is questionable to what extent it is really possible to anticipate all reuses of a dataset.
Two methods are being tested to try to improve a potential reuser’s understanding of the data:
– the publication of data papers, that is, “authored, peer reviewed and citable articles in academic or scholarly journals, whose main content is a description of published research datasets, along with contextual information about the production and the acquisition of the datasets, with the purpose of facilitating the findability, availability and reuse of research data” (Schöpfel et al. 2019);
– the addition of links to other online resources in the metadata, such as the data paper of the dataset, the associated publication(s) and the bibliography of its author(s), etc.
From a legal perspective, repositories have a responsibility to inform users of the terms of use that apply to the data so as to enable appropriate and informed reuse.
In the context of open data, the trend is to assign free licenses to datasets, such as the Creative Commons25 and Open Data Commons licenses26. However, studies by Kindling et al. (2017) and Austin et al. (2015) show that a majority of repositories also draft their own licenses, tailored to their specificities and not conforming to any standard.
Some repositories impose a single license, like Figshare, which uses the CC0 license. Other repositories allow depositors to choose between a set of several licenses (this is the case of Gene, Ortolang27, HEPData28, etc.).
The re3data directory provides an online overview of the licenses used by the repositories it lists in the form of an updated histogram (Figure 1.3)29.
Figure 1.3. Licenses of use offered by the repositories listed in re3data (source: re3data.org, survey of May 3, 2021, licensed under a Creative Commons Attribution 4.0 International License)
Maintaining a repository has a cost. This includes the preparation of datasets, their storage and their online availability. These investments are sometimes passed on to depositors who are asked to make a financial contribution to publish a dataset. Nevertheless, this business model turns out to be relatively rare, as shown by the study of Kindling et al. (2017), based on the re3data directory: out of a total of 1,381 data repositories, only 0.7% require the payment of submission fees.
This finding probably stems from the imperative to open up scientific results. To increase the amount of published data, open science policies need to rely on easily accessible infrastructures with low or no publication costs, so as not to discourage researchers from publishing their data (Roche et al. 2013). Another reason for the absence of publication costs in many data repositories is that they are attached to a public research organization, especially in Europe; their business model is often based on public grants.
Data validation is the process used to assess the relevance of published data. It is a process that is still far from being fully characterized. To date, there are no common criteria on how to conduct this review, unlike in scientific publications, for example, with peer reviewing.
The underlying question is that of data quality: what determines the quality of data? Should it be evaluated on a scientific and/or technical level? Since data deposited in repositories can potentially be reused for various purposes, it is still difficult to define scientific conformity criteria. The validation of data by repositories is therefore currently limited to checking the consistency of metadata and file content. This is what the Dryad repository does, for example. Its data curation staff performs a series of checks, which are technical (they check that the files can be opened, that they are not corrupted and do not contain viruses, etc.) and administrative (in particular, they ensure that the metadata is technically correct). However, they do not check the data from a scientific point of view.
Publishing data means that data repositories must ensure that the data are preserved in the short, medium and long term. The objective is to ensure that the datasets remain available to users. To achieve this, two mechanisms are combined: one for the immediate availability of the datasets and one that guarantees their availability over time. From a technical point of view, secure archiving involves storing multiple copies of the data files, either on the repository’s own servers (this is the case for Zenodo, whose data are stored at CERN) or with third-party service providers (e.g. Figshare uses the University of California’s Chronopolis service30). Making data available over the long term also sometimes requires repositories to migrate file formats as technology evolves.
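The text does not detail how stored copies are verified, but a common preservation technique behind the “multiple copies” strategy is fixity checking: a cryptographic checksum is recorded at ingest and recomputed on each replica during periodic audits. A minimal sketch, with invented file content:

```python
import hashlib

def sha256_digest(data: bytes) -> str:
    """Return the hex SHA-256 digest used as a fixity value."""
    return hashlib.sha256(data).hexdigest()

# Checksum recorded when the dataset is ingested (content invented here):
original = b"2019-01-01,42.0\n2019-01-02,43.5\n"
recorded = sha256_digest(original)

# A later preservation audit recomputes the digest on each stored copy:
replica = b"2019-01-01,42.0\n2019-01-02,43.5\n"
print(sha256_digest(replica) == recorded)  # True -> the copy is intact
```

If any replica’s digest diverges from the recorded value, the damaged copy can be replaced from one that still verifies.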
In a repository, data discovery and access is the function that allows users to become aware of the existence of a dataset and to access it. This service includes user-oriented functions such as navigation and the ability to search by keywords or filters. In addition, repositories can provide programmatic access to their content through data and metadata exchange interfaces (APIs). These APIs are either based on proprietary developments or on standard protocols such as File Transfer Protocol (FTP) or OAI-PMH31.
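OAI-PMH, for instance, exposes metadata through plain HTTP requests built from a verb and its parameters. The sketch below composes such a request; the verb and metadataPrefix values come from the OAI-PMH 2.0 protocol, while the endpoint URL is a placeholder:

```python
from urllib.parse import urlencode

BASE_URL = "https://repository.example.org/oai"  # placeholder endpoint

def oai_request(verb: str, **params) -> str:
    """Compose an OAI-PMH request URL from a verb and its parameters."""
    return BASE_URL + "?" + urlencode({"verb": verb, **params})

# Harvest all records exposed in unqualified Dublin Core:
url = oai_request("ListRecords", metadataPrefix="oai_dc")
print(url)  # https://repository.example.org/oai?verb=ListRecords&metadataPrefix=oai_dc
```

A harvester would fetch this URL, parse the XML response, and follow the protocol’s resumption tokens to page through large result sets.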
Finally, one of the key issues in data sharing is citation, which allows for both tracking the reuse of datasets and crediting their producer with some form of recognition. This aspect has been developed in particular by Force11 (Data Citation Synthesis Group 2014) and the Research Data Alliance (Rauber et al. 2015).
For repositories, the citation functionality consists of adding a reference to a dataset in order to allow the attribution, discovery and interconnection of these data, as well as access to them. This translates into the assignment of an identifier, whether a repository-specific code or a persistent identifier such as a URN, DOI or Handle.
In most repositories, the citation itself takes the form of a bibliographic reference (including the name of the author(s), the year of publication, the title of the dataset, its identifier, etc.) to be copied and pasted or exported in generic formats such as RIS, BibTeX, DataCite or Dublin Core.
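As a sketch of such an export, the function below renders a minimal BibTeX @misc entry from a metadata record; the record values and DOI are hypothetical, and real repositories generate richer entries:

```python
# Hypothetical dataset metadata, as a repository might store it.
record = {
    "authors": "Doe, Jane",
    "year": 2021,
    "title": "Example survey dataset",
    "publisher": "Example Repository",
    "doi": "10.1234/example-doi",  # hypothetical DOI
}

def to_bibtex(rec: dict) -> str:
    """Render a minimal BibTeX @misc entry for a dataset."""
    key = rec["doi"].replace("/", "_")  # derive a citation key from the DOI
    lines = [
        "@misc{%s," % key,
        "  author    = {%s}," % rec["authors"],
        "  year      = {%s}," % rec["year"],
        "  title     = {%s}," % rec["title"],
        "  publisher = {%s}," % rec["publisher"],
        "  doi       = {%s}" % rec["doi"],
        "}",
    ]
    return "\n".join(lines)

print(to_bibtex(record))
```

The same record could just as easily be serialized as RIS or DataCite XML; the point is that a single identifier-bearing metadata record drives every export format.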
In addition to the eight primary functionalities described above, other value-added services such as data analysis and visualization tools can be added. Repositories are therefore very rich in terms of functionalities, but also complex, as they must deal with the diversity of disciplinary practices and institutional mechanisms.
The actual use of a new device or technology is affected by several factors, including perceived usefulness, ease of use, quality of services and results, and the brand image of the device32. For an information system to be accepted, it must be credible in the eyes of users, with reliable performance in terms of functionality and service delivery. Data quality plays a particular role (Azeroual et al. 2020).
In relation to data repositories, it is important to distinguish between two types of use, even though users will often come from the same communities or organizations: the use of repositories to store, preserve and share research data (deposit), and the use of repositories to verify published results, merge datasets, perform new analyses, etc. (reuse). In the first case, the focus will be on the facility: the reliability and security of the system, the promise of service (e.g. long-term preservation), the ease of deposit, the extent of use by other researchers, etc. In the second case, credibility is also and above all affected by the content and the resources stored, and it will be a question of quality variables (the quality of the data and the quality and richness of the metadata), the right to reuse (licenses) and interoperability with other information systems.
This link between users’ trust in curation devices and trust in the digital content of these devices has been modeled for digital archives in general (Donaldson 2019); it has been formalized as an ISO standard33. Empirical studies, such as those by Yakel et al. (2013) or Yoon (2014), have led to a better understanding of some key factors of trust or distrust in data repositories. Among these factors, three seem particularly important:
– transparency of the system;
– the guarantee (promise) of long-term preservation (sustainability);
– the reputation of the institution that manages and/or hosts the device.
In addition to these factors, there are other criteria, such as the perception and experience of the functionalities and services, and the quality of the data, together with the measures implemented to control, guarantee and improve this quality: data sources and selection upstream, “cleansing” downstream.
