Provides everything readers need to know for applying the power of informatics to materials science
There is tremendous interest in materials informatics and the application of data mining to materials science. This book is a one-stop guide to the latest advances in these emerging fields. Bridging the gap between materials science and informatics, it introduces readers to up-to-date data mining and machine learning methods. It also provides an overview of state-of-the-art software and tools. Case studies illustrate the power of materials informatics in guiding the experimental discovery of new materials.
Materials Informatics: Methods, Tools and Applications is presented in two parts: Methodological Aspects of Materials Informatics and Practical Aspects and Applications. The first part focuses on developments in software, databases, and high-throughput computational activities. Chapter topics include open quantum materials databases; the ICSD database; open crystallography databases; and more. The second addresses the latest developments in data mining and machine learning for materials science. Its chapters cover genetic algorithms and crystal structure prediction; MQSPR modeling in materials informatics; and prediction of materials properties, among others.
- Bridges the gap between materials science and informatics
- Covers all the known methodologies and applications of materials informatics
- Presents case studies that illustrate the power of materials informatics in guiding the experimental quest for new materials
- Examines the state-of-the-art software and tools being used today
Materials Informatics: Methods, Tools and Applications is a must-have resource for materials scientists, chemists, and engineers interested in the methods of materials informatics.
Number of pages: 538
Year of publication: 2019
1 Crystallography Open Database: History, Development, and Perspectives
1.1 Introduction
1.2 Open Databases for Science
1.3 Building COD
1.4 Use of COD
1.5 Applications
1.6 Perspectives
Acknowledgments
References
2 The Inorganic Crystal Structure Database (ICSD): A Tool for Materials Sciences
2.1 Introduction
2.2 Content of ICSD
2.3 Interfaces
2.4 Applications of ICSD
2.5 Outlook
References
3 Pauling File: Toward a Holistic View
3.1 Introduction
3.2 PAULING FILE: Crystal Structures
3.3 PAULING FILE: Phase Diagrams
3.4 PAULING FILE: Physical Properties
3.5 Data Quality
3.6 Distinct Phases
3.7 Toward a Megadatabase
3.8 Applications
3.9 Lessons to Learn from Experience
3.10 Conclusion
References
4 From Topological Descriptors to Expert Systems: A Route to Predictable Materials
4.1 Introduction
4.2 Topological Tools for Developing Knowledge Databases
4.3 Applications of Topological Tools in Crystal Chemistry and Materials Science
4.4 Conclusions
References
5 A High‐Throughput Computational Study Driven by the AiiDA Materials Informatics Framework and the PAULING FILE as Reference Database
5.1 Introduction
5.2 Nature Defines Cornerstones Providing a Marvelously Rich but Still Very Rigid Systematic Framework of Restraint Conditions
5.3 The First, Second, and Third Paradigms
5.4 The Realization of the Fourth and Fifth Paradigms Requires Three Preconditions
5.5 The Core Idea of the Fifth Paradigm
5.6 Restraint Conditions Revealed by “Inorganic Solids Overview–Governing Factor Spaces (Maps)” Discovered by Data‐Mining Techniques
5.7 Quantum Simulation Strategy
5.8 Workflows Engine in AiiDA to Carry Out High‐Throughput Calculation for the Creation of the Materials Cloud, Binaries Edition
5.9 Conclusions
Acknowledgment
References
6 Modeling Materials Quantum Properties with Machine Learning
6.1 Introduction
6.2 Kernel Ridge Regression
6.3 Model Assessment
6.4 Representations
6.5 Recent Developments
References
7 Automated Computation of Materials Properties
7.1 Introduction
7.2 Automated Computational Materials Design Frameworks
7.3 Integrated Calculation of Materials Properties
7.4 Online Data Repositories
7.5 Materials Applications
7.6 Conclusion
Acknowledgments
References
8 Cognitive Chemistry: The Marriage of Machine Learning and Chemistry to Accelerate Materials Discovery
8.1 Introduction
8.2 Describing Molecules for Machine Learning Algorithms
8.3 Building Fast and Accurate Models with Machine Learning
8.4 Searching Through Chemical Libraries
8.5 Conclusion
References
9 Machine Learning Interatomic Potentials for Global Optimization and Molecular Dynamics Simulation
9.1 Introduction
9.2 Machine Learning Potential for Global Optimization
9.3 Interatomic Potential for Molecular Dynamics
9.4 Statistical Approach for Constructing ML Potentials
Acknowledgements
References
Index
End User License Agreement
Chapter 1
Table 1.1 Material property and structure databases available online.
Table 1.2 Error classes routinely addressed by the COD maintainers.
Chapter 3
Table 3.1 Example of data stored for a PAULING FILE crystal structure e...
Table 3.2 The 50 most frequently occurring atomic environment types (AE...
Table 3.3 Electronic and electrical and ferroelectric properties consid...
Table 3.4 Distinct phases in the Al–Ta system.
Table 3.5 Numbers of distinct chemical systems, phases, and entries in P...
Table 3.6 Number of entries in PCD‐2016/2017 according to the level of ...
Table 3.7 Number of distinct phases in PCD‐2016/2017 comparing the numbe...
Chapter 4
Table 4.1 Structural topological descriptors.
Table 4.2 Abundant topological types in standard representation and the...
Table 4.3 All possible correlations between the ligand coordination mod...
Table 4.4 Relationships CML–UNSR–CF for 4196 two‐ or three‐periodic cop...
Chapter 9
Table 9.1 Parameters for 3‐body terms used in all the examples in the current se...
Table 9.2 Results of our scheme for aluminum: r_low and RMSE_low denote the Pears...
Table 9.3 Results of our scheme for carbon: r_low and RMSE_low denote to the Pear...
Table 9.4 Experimental and calculated entropies of crystalline Al at different t...
Chapter 1
Figure 1.1 COD record number growth.
Figure 1.2 TCOD record number growth.
Figure 1.3 COD search Web interface form.
Figure 1.4 COD search result page, obtained as of 05 November 2016 from th...
Figure 1.5 Example of the COD programmatic search interface.
Figure 1.6 Retrieving a specific COD structure using the stable COD URI id...
Figure 1.7 Querying the COD MySQL database.
Figure 1.8 Finding column definitions of the COD “data” view.
Figure 1.9 Filtering out structures from the COD MySQL queries.
Figure 1.10 A COD CIF data retrieval after a MySQL query using COD URIs. T...
Figure 1.11 Preparing coordinates for an SQL query using a locally install...
Figure 1.12 Obtaining (checking‐out) a working copy of the COD data using ...
Figure 1.13 Cloning COD data directory with GIT and GIT SVN.
Figure 1.14 Using the “rsync” program to download and update the COD file ...
Figure 1.15 Checking out the COD MySQL dumps from the Subversion repositor...
Figure 1.16 Extracting the COD MySQL dumps from a ZIP archive of quarterly...
Figure 1.17 Search COD CIFs using “grep.” Options of this command are supp...
Figure 1.18 Use of “find,” “xargs” and “cifvalues” from the “cod‐tools” pa...
Chapter 2
Figure 2.1 Overview of ICSD content by composition of the compounds.
Figure 2.2 Visualization using JSMol in ICSD.
Figure 2.3 The synoptic view helps to reveal structural similarity.
Figure 2.4 Growth of the number of records in ICSD since 1980.
Figure 2.5 Basic search screen of ICSD Desktop/Web interface.
Figure 2.6 Simulated powder patterns of Al₄H₂(SO₄)₇(H₂O)₂₄ (a) and Cr₄H₂(S...
Chapter 3
Figure 3.1 Data sets for RbO and CsS, as published and after standardizati...
Figure 3.2 The structure of WAl₅, reported in space group P6₃ (173), can b...
Figure 3.3 Next‐neighbor histogram (NNH) (top left) and the corresp...
Figure 3.4 Examples of cell parameter plots from Pearson's Crystal Data [2...
Figure 3.5 Examples of phase diagrams as redrawn for the PAULING FILE: (a)...
Figure 3.6 Part of the data sheet of a physical properties entry for Li₂O ...
Figure 3.7 Distribution of the database entries in the PAULING FILE (June ...
Figure 3.8 Distribution of the database entries in the PAULING FILE (June ...
Figure 3.9 Distribution of database entries in the PAULING FILE (June 2016...
Figure 3.10 Number of items in the physical properties part of the PAULING...
Figure 3.11 Example of the Constitution Browser in the PAULING FILE – Bina...
Figure 3.12 A generalized atomic environment type (AET) matrix PN_A vs. PN_B
Figure 3.13 Number of phases vs. the number of atoms per unit cell, consid...
Figure 3.14 Number of phases according to the number of different AETs (ri...
Figure 3.15 Number of phases according to the crystal system and space gro...
Figure 3.16 Total number of point sets observed for the 18 most frequently...
Figure 3.17 Number of representatives (phases) of the 100 most common prot...
Figure 3.18 Outline of the VEMD project. (a) General overview, (b) databas...
Chapter 4
Figure 4.1 The crystal structure of one of the Samson's “monsters,” NaCd₂....
Figure 4.2 Steps of the underlying net construction and classification in ...
Figure 4.3 Steps of the underlying net construction and classification in ...
Figure 4.4 Distribution of 722 012 metal coordination centers.
Figure 4.5 Distribution of coordination numbers of 73 530 copper atoms in ...
Figure 4.6 Distribution of chemical composition of the coordination polyhe...
Figure 4.7 Distribution of distances between copper atoms and nonmetal ato...
Figure 4.8 Distribution of the most abundant ligands in the copper coordin...
Figure 4.9 Distribution of ligands over most widespread coordination modes...
Figure 4.10 Distributions of copper‐containing coordination groups over th...
Figure 4.11 Distribution of 3D coordination networks over degree of interp...
Figure 4.12 Declarative pattern.
Figure 4.13 Advanced declarative pattern.
Figure 4.14 Contextual pattern: (a) conceptual diagram and (b) ER model.
Figure 4.15 Typed contextual pattern.
Figure 4.16 Advanced contextual pattern: conceptual diagram.
Figure 4.17 Advanced contextual pattern: (a) the objects part and (b) exam...
Figure 4.18 Advanced contextual pattern: (a) the attributes part and (b) e...
Figure 4.19 Advanced contextual pattern: (a) the object attributes part an...
Figure 4.20 Advanced contextual pattern: (a) the relationships part and (b...
Figure 4.21 Advanced contextual pattern: (a) the relationship attributes p...
Figure 4.22 Advanced contextual pattern: (a) the object attribute values p...
Chapter 5
Figure 5.1 Generalized atomic environment type (AET) (coordination polyhed...
Figure 5.2 Outline of the fourth paradigm showing in addition the three ma...
Figure 5.3 Outline of the fifth paradigm showing in addition the three maj...
Figure 5.4 Separation of 2330 binary systems into compound formers (blue) ...
Figure 5.5 Atomic environment type (AET) stability map showing the periodi...
Figure 5.6 Occurrences of daltonide inorganic solids vs. available concent...
Figure 5.7 An example of a portion of a provenance graph generated by AiiD...
Figure 5.8 A simple workflow based on workfunctions showing the definition...
Figure 5.9 A toy example of a workchain showing the workflow from Figure 5...
Figure 5.10 A schematic of the structure relaxation workflow used througho...
Chapter 6
Figure 6.1 Flowchart depicting the training and prediction of properties
Figure 6.2 Schematic representation of the differences between a tradition...
Figure 6.3 Schematic learning curves for training and test set, depicting ...
Figure 6.4 Schematic representation of the learning curve of a good and ba...
Figure 6.5 Two homometric molecules (a) and (b), possessing same internal ...
Chapter 7
Figure 7.1 Computational materials data generation workflow. (a) Crystallograp...
Figure 7.2 Convex hull phase diagrams for multicomponent alloys systems. (a) S...
Figure 7.3 Standardized paths in reciprocal space for calculation of the elect...
Figure 7.4 Challenges in autonomous symmetry analysis. (a) An illustration of ...
Figure 7.5 (a) AEL applies a set of independent normal and shear strains to th...
Figure 7.6 (a) Front page of the AFLOW online data repository, highlighting th...
Figure 7.7 (a) The AFLOW database is organized as a multilayered system. (b) E...
Chapter 8
Figure 8.1 An example of the process of MACCS encoding the molecule diazepam. ...
Figure 8.2 A description of the generation of circular (extended connectivity)...
Figure 8.3 A graphical representation of the flattening of 3D Cartesian space ...
Figure 8.4 A demonstration of the concept of the locality of neurons in a conv...
Figure 8.5 (a) Graphical depiction of the construction of a circular fingerpri...
Figure 8.6 An example of how sequential information is stored in a recurrent n...
Figure 8.7 An example of the generation of an ensemble of directed acyclic gra...
Figure 8.8 The deep fingerprints based upon the recurrent neural network metho...
Figure 8.9 The accuracy (area under the receiver operating characteristic curv...
Figure 8.10 The advantages of pursuing a Δ‐ML method (red) over the equivalent...
Figure 8.11 (a) While the properties calculated directly through quantum mecha...
Figure 8.12 A graphical representation of the Thompson sampling algorithm. 1 →...
Figure 8.13 Information landscapes for the photovoltaic data set (a, a maximiz...
Chapter 9
Figure 9.1 Energy landscape. (a) 1D scheme showing the full landscape (sol...
Figure 9.2 Energy vs. structural quasi‐entropy correlation. (a) MgO (32 at...
Figure 9.3 Reconstructed and real potentials in AxBy Lennard‐Jones sys...
Figure 9.4 Two quasi‐1D structures ((b) and (c)) that have the same 2‐body...
Figure 9.5 Reconstructed pair potential of Al–Al interaction (β = 1)....
Figure 9.6 Al–Al interaction potentials at different densities.
Figure 9.7 Deletion of features in Al.
Figure 9.8 RMSE(E) curves for our models of carbon with different β (a) an...
Figure 9.9 Comparison of performances of our best mixed model (a) with kno...
Figure 9.10 Deletion of features in C.
Figure 9.11 Reconstructed He–He (a) and Xe–Xe (b) pair potentials.
Figure 9.12 Deletion of features in He (a) and Xe (b).
Figure 9.13 Optimal choice of number of parameters pairs and training set ...
Figure 9.14 Comparison of different potentials for Al at 300 K (a) and 200...
Figure 9.15 Comparison of different potentials for α‐U at 1000 K (a) and l...
Figure 9.16 Phonon density of states at 300 K (a) and 775 K (b): 1 – exper...
Figure 9.17 Entropy as a function of temperature: 1 – thermodynamic data f...
Figure 9.18 Radial distribution function and melting temperature of Al. (a...
Figure 9.19 Determination of the optimal (r_cut, p) set. (a) The dependenc...
Figure 9.20 Learning curves for Ti₄H₇ on (a) energies and (b) forces.
Figure 9.21 Comparison of ab initio (x) and model (y) projections of force...
Edited by
Olexandr Isayev, Alexander Tropsha, and Stefano Curtarolo
Editors
Prof. Olexandr Isayev
University of North Carolina at Chapel Hill
UNC Eshelman School of Pharmacy
Campus Box 7568
Chapel Hill, NC
United States
Prof. Alexander Tropsha
University of North Carolina at Chapel Hill
UNC Eshelman School of Pharmacy
Campus Box 7568
Chapel Hill, NC
United States
Prof. Stefano Curtarolo
Duke University
Mechanical Engineering & Mat. Science
144 Hudson Hall
Durham, NC
United States
Cover Image: © Floriana/Getty Images
All books published by Wiley–VCH are carefully produced. Nevertheless, authors, editors, and publisher do not warrant the information contained in these books, including this book, to be free of errors. Readers are advised to keep in mind that statements, data, illustrations, procedural details or other items may inadvertently be inaccurate.
Library of Congress Card No.: applied for
British Library Cataloguing–in–Publication Data
A catalogue record for this book is available from the British Library.
Bibliographic information published by the Deutsche Nationalbibliothek
The Deutsche Nationalbibliothek lists this publication in the Deutsche Nationalbibliografie; detailed bibliographic data are available on the Internet at <http://dnb.d–nb.de>.
© 2019 Wiley–VCH Verlag GmbH & Co. KGaA, Boschstr. 12, 69469 Weinheim, Germany
All rights reserved (including those of translation into other languages). No part of this book may be reproduced in any form - by photoprinting, microfilm, or any other means - nor transmitted or translated into a machine language without written permission from the publishers. Registered names, trademarks, etc. used in this book, even when not specifically marked as such, are not to be considered unprotected by law.
Print ISBN: 978–3–527–34121–4
ePDF ISBN: 978–3–527–80225–8
ePub ISBN: 978–3–527–80227–2
oBook ISBN: 978–3–527–80226–5
Saulius Gražulis¹, Andrius Merkys¹, Antanas Vaitkus¹, Daniel Chateigner², Luca Lutterotti³, Peter Moeck⁴, Miguel Quiros⁵, Robert T. Downs⁶, Werner Kaminsky⁷, and Armel Le Bail⁸
¹ Vilnius University, Institute of Biotechnology, Department of Protein‐DNA Interactions, Saulėtekio al. 7, 10257 Vilnius, Lithuania
² Normandie Université, Université de Caen Normandie, CRISMAT‐CNRS, ENSICAEN, IUT‐Caen, boulevard du Maréchal Juin, 6, 14050, Caen Cedex, France
³ University of Trento, Department of Industrial Engineering, Via Sommarive 9, 38123, Trento, Italy
⁴ Portland State University, Department of Physics, 1719 SW 10th Avenue, Portland, OR, 97201, USA
⁵ Universidad de Granada, Departamento de Química Inorgánica, Facultad de Ciencias, Avenida de Fuentenueva, 18071, Granada, Spain
⁶ University of Arizona, Department of Geosciences, 1040 E 4th Street, Tucson, AZ, 85721, USA
⁷ University of Washington at Seattle, Department of Chemistry, 4000 15th Avenue NE 36 Bagley Hall, Seattle, WA, 98195‐1700, USA
⁸ Université du Maine, Institut des Molécules et des Matériaux du Mans, Département des Oxydes et Fluorures, CNRS UMR 6283, 72085 Le Mans, France
Science is crucially based on observational data. An ancient example of data‐driven discovery is the observation of equinox precession by Hipparchus around 130 BCE [1]: Hipparchus compared the longitudes of Spica, Regulus, and other bright stars with the measurements of his predecessors, Timocharis and Aristillus, who lived about 100 years earlier, and concluded from the differences that the equinox points drift with time. Needless to say, this discovery could only be made because the old observations of Timocharis's school were meticulously recorded, accurate enough, and preserved for future generations. Today, the amount of data that scientists collect each year has grown by roughly 10 orders of magnitude, with fields such as astronomy or particle physics currently accumulating from several terabytes (TB) [2] to as much as 15 petabytes (PB) of data per year [3,4].
In crystallography, the need for long‐term data preservation was recognized very early. Currently, the International Union of Crystallography (IUCr) and the crystallographic community take great care with respect to data archiving and data reuse. The IUCr has rigorously described the mathematical definitions necessary for crystal structure and experiment description in the International Tables for Crystallography [5] and created the crystallographic information file/framework (CIF) standard for crystallographic data exchange [6,7], which is constantly maintained to address new challenges in data management [8]. Crystal diffraction data have been accumulated systematically since as early as 1941 [9] and are archived in various crystallographic databases (Table 1.1), the largest ones being the Crystallography Open Database (COD) [21], the Cambridge Structural Database (CSD) [24], the Inorganic Crystal Structure Database (ICSD) [25], the Pauling File [28], the Protein Data Bank (PDB) [30], the Powder Diffraction File (PDF) from the International Centre for Diffraction Data (ICDD) [9], and CRYSTMET [26]. Several other databases that focus on specific aspects of crystallographic data exist; the structures they contain are usually included in one or several of the above‐mentioned databases. References to these specialized databases will be given in the following text.
Table 1.1 Material property and structure databases available online.
No. | Database | Approx. no. of records | License | Current URL | Est. | References
1 | MPOD | 300 | Public domain | http://mpod.cimav.edu.mx | 2010 | [10]
2 | RRUFF | 47 000 | Open access | http://rruff.info/ | 2015 | [11]
3 | AMCSD | 20 000 | Open access | http://rruff.geo.arizona.edu/AMS/amcsd.php | 2003 | [12,13]
4 | IZA Zeolite database | 176 (a) | Open access | http://www.iza‐structure.org/databases/ | 1996 | [14]
5 | Bilbao server | - | - | http://www.cryst.ehu.es | 1997 | [15]
6 | B‐IncStrDB (Bilbao Incommensurate Structures Database) | 140 | Open access | http://webbdcrista1.ehu.es/incstrdb/ | 2010 | [16]
7 | MAGNDATA (Bilbao Magnetic Structure Database) | 428 | Open access | http://webbdcrista1.ehu.es/magndata/ | 2015 | [17]
8 | NDB | 8 600 | Open access | http://ndbserver.rutgers.edu/ | 1992 | [18,19]
9 | COD | 400 000 | Public domain | http://www.crystallography.net/cod | 2003 | [20,21]
10 | PCOD | 1 000 000 | Public domain | http://www.crystallography.net/pcod | 2003 | [22]
11 | TCOD | 2 000 | Public domain | http://www.crystallography.net/tcod | 2013 | [23]
12 | CSD | 800 000 | Subscription based | http://www.ccdc.cam.ac.uk/solutions/csd‐system/components/csd/ | 1965 | [24]
13 | ICSD | 200 000 | Subscription based | https://icsd.fiz‐karlsruhe.de/ | 1987 | [25]
14 |  | 380 000 | Subscription based | http://www.icdd.com/products/pdf4.htm | 1941 | [9]
15 | CRYSTMET | 170 000 | Subscription based | http://www.TothCanada.com (b), https://cds.dl.ac.uk/cgi‐bin/news/disp?crystmet | 1996 | [26]
16 | Linus Pauling file | 290 000 | Subscription based; free of charge queries accepted (c) | http://paulingfile.com, http://crystdb.nims.go.jp/index_en.html | 1995 | [27,28]
17 | PDB | 124 000 | Open access | http://www.rcsb.org/pdb | 1971 | [29,30]
18 | BMCD | 43 000 | Open access | http://xpdb.nist.gov:8060/BMCD4 | 1995 | [31]
(a) The number of unique zeolite framework types that had been approved and assigned a 3‐letter code by the Structure Commission of the IZA.
(b) The page at http://www.TothCanada.com advertised in [26] seems no longer operational, but access for subscribers is advertised at https://cds.dl.ac.uk/cgi‐bin/news/disp?crystmet.
(c) Free of charge queries are offered at http://crystdb.nims.go.jp/index_en.html, but “no reproduction, republication or distribution to third parties of any content is permitted without written permission of NIMS.”
Before 2003, of the above‐mentioned crystal structure archives, only the PDB offered full open access to the crystallographic data it contained; all other databases followed a subscription‐based model, offering little or no data on the Web for the general public or nonsubscribers, requiring purchase of a license for systematic data searches, and occasionally restricting publication of derived data [32,33]. The advent of the Web, ubiquitous computing, and the advantages of open linked data prompted a group of crystallographers to initiate the COD, offering crystal structures for chemical crystallography on similar grounds as the PDB provides them for macromolecular crystallography. Currently, the COD and the PDB remain the two largest databases offering open access to crystallographic data, and together they cover the largest domain of crystal structures available in an open way. While other databases contain larger collections of crystal structures and claim a higher level of data curation than the COD [34], they still require acquisition of licenses for systematic data searches.
In this chapter, we will review the COD contents, data collection, and data curation policies. We will then describe the various ways in which COD data can be accessed and used. Finally, we will give examples of COD applications in the fields of crystallography, chemistry, material identification, and teaching.
Over the years, various researchers have found that open access to articles consistently increases citations of these publications [35–39]. Similar trends are observed for data in the field of bioinformatics [40], and one would expect crystallography to follow similar trends. Thus there is a purely pragmatic reason for researchers to deposit data openly so that they are findable, reusable, and citable. For the user of data, the absence of paywalls and use restrictions provides the convenience of one‐click access to data. Finally, there are ethical considerations: most published research was funded by public money, and the members of society whose taxes were used to produce scientific results can reasonably expect these results to be available to them without extra payment and without restrictions. Understandably, then, many funding agencies require that researchers whom they have supported publish their results under open‐access licenses for both publications and data.
To answer the above‐mentioned concerns, many open databases have been established by researchers. In the following, we describe topic‐specific databases, in addition to more general databases outlined previously.
A list of scientific databases in the field of biosciences can be found in Nucleic Acids Research [41], and crystallographic databases are listed by the IUCr (http://www.iucr.org/resources/data).
The COD incorporates a continuously increasing number of determined crystal structures, reaching >367 000 entries at the time of writing this chapter (Figure 1.1). The equivalent of the COD for structures obtained from first‐principles calculations and/or optimization is the Theoretical Crystallography Open Database (TCOD), started in 2013 and consequently holding a more modest number of entries, around 2000 (Figure 1.2). However, such entries require long calculation times, and one can expect larger increases in the years to come.
Figure 1.1 COD record number growth.
Figure 1.2 TCOD record number growth.
The COD was founded in February 2003 as a grassroots initiative. Its establishment was proposed in a letter that Michael Berndt posted to the Structure Determination by Powder Diffractometry (SDPD) mailing list:
What if crystallographers work together to establish a public domain database with all relevant crystallographic data? This would not only overcome the current situation with ‘fragmented’ databases, it would also prevent for becoming dependent from monopolists. What would be needed?
1. A small team of engaged scientists with some experience in database and software design to coordinate the project.
2. The authors (i.e. the scientific community = you) who provide the project with database entries (note, that if you haven't sold your experimental results exclusively, you are free to distribute the data to such a database, even if they have already been part of a publication ‐ and a lot of good data have never been published).
3. Free software a) for maintaining the database, b) for data evaluation and calculation of derived data (e.g. calculated powder pattern from crystal structures for search‐match purposes), c) for browsing and retrieval. We are not in the same situation as decades before when the well‐known databases (ICSD, CSD, PDF) started. Today we have the Internet, fast computers, and a big pool of free available software. The question is: Do we have enough scientists who are willing to cooperate?
Several laboratories contributed a great deal to the COD at its very beginning. Bob Downs offered his collection of mineralogical data, including the whole American Mineralogist Crystal Structure Database (AMCSD) [13] data set (all the crystal structures previously published in the American Mineralogist that were made freely accessible from the websites of the Mineralogical Society of America). The necessary MySQL/PHP scripts were written by Hareesh Rajan. In the meantime, Daniel Chateigner joined, and less than three weeks after the letter from Michael Berndt, the COD project was announced on various Internet media (Newsgroups, various mailing lists, and What's New pages) by the following letter:
Dear Crystallographers, a project of Crystallography Open Database (COD), accommodating crystal structure atomic coordinates prior to their publication, is under development. It is intended to give faster access to the latest structure determinations, openly. Its development and success depends completely on your contributions, either by data download or/and by giving help in software improvements. Visit the COD project Web pages (www.crystallography.net) for more details and a crystallography database(s) quiz. Thanks for your future help, the COD is yours, it is the right time to do something for an open database controlled by crystallographers, now or never!
The advisory board (wishing to enlarge): Michael Berndt, Daniel Chateigner, Robert T. Downs, Lachlan M.D. Cranswick, Armel Le Bail, Luca Lutterotti, Hareesh Rajan
This letter produced a lot of positive and negative comments. Some researchers who responded positively joined the COD team, and the number of entries in the COD increased, exceeding 5000 by the end of March 2003 (3725 CIFs from the AMCSD, 450 CIFs from the Laboratoire des Oxydes et Fluorures, Université du Maine (LdOF), and 850 CIFs from the CRISMAT). The CIF2COD computer program (FORTRAN) was built on the basis of CIF2SX with permission from Louis Farrugia. CIF2COD reads several CIFs (from n.cif to n+m.cif), performs several quality tests, and produces a .txt file containing m+1 lines, one per CIF, with the fields of the single table (data) of the MySQL database (cod): a, b, c, alpha, beta, gamma, volume, number of elements, space group, chemical formula, reference, and additional text. The first minimal COD search page was coded in the PHP language. Donations continued in April 2003 (1200 CIFs from IPMC), and the IUCr was contacted for permission to systematically download the CIFs freely available at the IUCr website. The decision had to wait for the next IUCr Executive Committee meeting in August 2003. After four months, the number of entries in the COD reached 12 000, essentially through donations from individuals, laboratories, and the AMCSD.
Then came the sad news. Michael Berndt died on 30 June 2003, after a long, serious illness, at the age of 39. Lachlan Cranswick went missing on 18 January 2010 at the age of 41, and his body was later found in the Ottawa River, near Deep River. Despite the losses, the COD team continued to implement Michael Berndt's plan and to work on the database. Five years after its founding, the COD passed a major milestone in 2008 by archiving the 50 000th entry. To attain completion, the COD should add much more than 40 000 new entries per year and also digitize older data that were published in print form. The required growth rate of the COD was attained in 2011 (Figure 1.1), when automated procedures for crystallographic data collection were implemented. Nevertheless, a lot of work remains to be done, and the COD welcomes contributions from all crystallographers in order to accelerate its completion. During the past 10 years, the COD Advisory Board has seen some variations, departures, and new admissions; the list of coauthors of this chapter reflects the current situation, presenting the main actors of the COD development until now.
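The conversion step described above can be illustrated with a short Python sketch. It is not the FORTRAN CIF2COD program: the function names, the tab-separated layout, and the reduced field list are assumptions made for illustration, and the quality tests are omitted because the text does not specify them. Only the data names (such as _cell_length_a) come from the CIF core dictionary.

```python
import re

# CIF core data names for some of the fields listed above (a subset, for brevity).
NUMERIC_TAGS = [
    "_cell_length_a", "_cell_length_b", "_cell_length_c",
    "_cell_angle_alpha", "_cell_angle_beta", "_cell_angle_gamma",
    "_cell_volume",
]
TEXT_TAGS = ["_symmetry_space_group_name_H-M", "_chemical_formula_sum"]


def read_tag(cif_text, tag):
    """Return the single-line value of `tag`, or '' if the tag is absent."""
    pattern = r"^" + re.escape(tag) + r"[ \t]+(.+?)[ \t]*$"
    match = re.search(pattern, cif_text, re.MULTILINE)
    if match is None:
        return ""
    value = match.group(1)
    if len(value) >= 2 and value[0] == value[-1] and value[0] in "'\"":
        value = value[1:-1]  # strip CIF quoting
    return value


def strip_su(value):
    """Drop a trailing standard uncertainty: '5.4307(2)' -> '5.4307'."""
    return re.sub(r"\(\d+\)$", "", value)


def cif_to_row(entry_id, cif_text):
    """Build one tab-separated line: id, cell parameters, volume, space group, formula."""
    numbers = [strip_su(read_tag(cif_text, tag)) for tag in NUMERIC_TAGS]
    texts = [read_tag(cif_text, tag) for tag in TEXT_TAGS]
    return "\t".join([str(entry_id)] + numbers + texts)


def convert_range(cifs, n, m):
    """Turn entries n .. n+m of `cifs` (a dict: number -> CIF text) into m+1 lines."""
    return [cif_to_row(i, cifs[i]) for i in range(n, n + m + 1)]
```

Calling convert_range with n and m yields m+1 lines, one per input file, which could then be bulk-loaded into a database table. Multi-line text fields and loops are deliberately out of scope for this sketch.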
The COD collects all published crystal structures with small‐ to medium‐sized unit cells. To facilitate this process, the CIF framework is employed. Currently, COD uses the CIF 1.1 [7] version of the framework. The framework files (CIFs) are used to input data into the COD, as an intermediate versioned archive for storage, and for providing data to the users.
The main founding principle of the COD is open access – all data are readily available on the Internet. COD data records are identified by stable Uniform Resource Identifiers (URIs) and accessible via the REpresentational State Transfer (REST) interface. The COD main page on the Web (http://www.crystallography.net/) states, “All data on this site have been placed in the public domain by the contributors,” which we assume binding for COD Advisory Board, data maintainers, and contributors. All deposited data, unless embargoed by depositors for a fixed amount of time as a “prepublication deposition,” are immediately available after the deposition on the Internet and accessible via the automatically generated stable identifiers. Such arrangement enables immediate and permanent linking of COD structures into the World Wide Web fabric.
Each data item that is committed to the COD repository is first of all checked for the syntactic correctness of the incoming CIF. Since not all submitted files can be guaranteed to conform to the formal CIF definition [42], an error‐correcting CIF parser [43] is employed. This ensures that all COD CIFs can be automatically parsed and supports unassisted COD data processing.
COD aims at collecting all experimentally determined small‐molecule crystal structures into an open‐access resource. The “small‐molecule” category encompasses all inorganic, metal–organic, and organic compounds with the exception of macromolecules (organic polymers). The latter are being collected into dedicated well‐known open‐access databases such as the PDB [44] and the Nucleic Acid Database (NDB) [18,19].
As an experimental database, the COD collects structures determined by any experimental method. However, there are sister databases, the PCOD and the TCOD, which aim to collect predicted and theoretically determined structures, respectively (see Section 1.3.4 for a more comprehensive description).
COD structures may be refined using just X‐ray data and first physical principles (using full‐matrix least‐squares methods), but they may also be refined using restraints (especially when determined using powder diffraction methods) or, more recently, hybrid methods (from experimental powder data using Rietveld and Le Bail methods combined with first principles using density functional theory (DFT)).
The COD acquires most of its structures (over 90%) from peer‐reviewed scientific publications. The rest is deposited by authors either as personal communications or as prepublication depositions. Data published in papers are subjected to checks for conformance with CIF syntax, CIF dictionary definitions, and the completeness of bibliographic and other provenance information. Personal communications and prepublication depositions are in addition checked for conformance to the IUCr data criteria.1 The COD permits both manual deposition by crystallographers using a Web interface (http://www.crystallography.net/cod/deposit) and an automated deposition using various Web‐inspecting engines. Automated Web searches are conducted on journals that publish openly accessible crystallographic supplementary data. Data are also automatically extracted from open‐access publications. Data from other crawlers, such as CrystalEye [45], and other open databases (e.g. AMCSD [13]) are incorporated into the COD on a regular basis, either using automated or semi‐manual procedures. Such a strategy permits broad coverage of published structures with little resources required; it leverages the power of Internet automation while at the same time permitting humans to intervene at critical points when necessary.
It must be noted with regret that some journals still do not provide the supporting data for their papers openly. Data are either located behind paywalls or available only in subscription‐based databases with explicit restrictions on their reuse. Unfortunately, this makes the technically simple task of collecting all currently published crystal structures into open databases virtually impossible, not for technical but for purely organizational reasons. The barriers are not even related to intellectual property, since published data and facts of nature are not copyrightable. We thus urge everyone who sees virtue in open scientific data exchange and has benefited from open‐access databases to approach every publisher and ask them to provide underlying publication data for deposition to open‐access databases or to deposit her or his crystallographic data directly to the COD.
Scientific databases are an indispensable resource in modern‐day research, and as such they must adhere to the criteria of all properly designed experiments – reproducibility and traceability. Obtained results are of little value if repetition of the same procedures under the same conditions yields a different outcome. The same holds true if the experiments are purely computational in origin such as simulations [46] or compilation of statistical data. In addition to that, any conclusions drawn from claims of untraceable origin become unverifiable and run the risk of polluting every subsequent experiment they are used in. As a result, the employment of the Write Once Read Many (WORM) principle, which ensures that once data is written it is never changed irreversibly, becomes a necessity for scientific databases.
Collecting and preserving scientific data is an important endeavor, but maintaining it is a task of no less importance. Reasons behind the need to modify the data are numerous – from a simple human error to new insights about the data or even the introduction of a novel way of describing certain phenomenon. The means for updating scientific articles via the issuing of addenda and errata are well established; however, the same mechanism is usually not applied to the supplementary material. A more common approach is to silently replace the outdated version with a new one leaving the returning reader with a very unexpected sense of jamais vu. The situation is only worsened by the fact that supplementary material is rarely well‐reviewed before publishing, resulting in an even greater need for a proper data maintenance strategy.
Data discrepancies addressed by the COD maintainers can be grouped into three main classes: syntax errors, semantic errors, and errors relating to the crystal structure. Each of these classes requires different detection and correction strategies and affects the data usability in varying degrees (see Table 1.2).
Table 1.2 Error classes routinely addressed by the COD maintainers.
| Error class | Ease of detection | Ease of correction | Effect on data usability |
| --- | --- | --- | --- |
| Syntax | Detected automatically by the parser | Mostly automatic | Unreadable file |
| Semantic | Detected mostly automatically by specialized software, requires occasional manual analysis | Automatic and manual | Incorrect supporting information |
| Crystal structure | Detected by specialized software and manual analysis | Mostly manual | Incorrect crystal structure |
The initial step of data management in the COD is the detection and correction of syntactic errors. This kind of discrepancy is especially important since it renders the files unreadable and limits the possibility of any further data maintenance. Crystallographic structures in the COD are stored as CIFs, a format that has been adopted by the crystallographic community. However, even with the widespread use of the CIF format, none of the parsers available at the time were capable enough to satisfy the specific needs arising from the curation of large data sets. As a result, maintainers of the COD have developed an open‐source error‐fixing CIF parser, which is able to correct some of the most prominent syntax errors [43]. Initial file parsing upon deposition as well as routine database‐wide checks guarantee that at any given moment all files in the COD can be read correctly according to the CIF format rules.
Syntactical correctness ensures that the files are readable, but does not guarantee the validity of the data stored inside the files – this is the task for semantic validation. Due to a great variety of semantic errors and the fact that they usually only affect a portion of the data in the file, the COD has adopted a very flexible policy regarding discrepancies of this kind. During the initial deposition, semantic errors are recognized, automatically corrected, and reported to the depositor. If an automatic correction is not possible, these errors are recorded in an internal database for further analysis. Once a significant number of similar errors accumulates, heuristics‐based programs are developed to automatically fix the errors in question. Since it is unreasonable to expect perfect detection of all possible semantic error cases in advance, the file validation strategy also addresses the handling of new kinds of semantic discrepancies that were previously missed during the initial deposition. In this case, heuristics‐based programs are developed for the detection of these new errors and the whole database is revalidated based on the new criterion. In the end, both the new error‐correction programs and the new error‐detection programs are integrated into the deposition step. The described workflow ensures that the overall semantic validity of the COD data set will only increase.
The set of computer programs developed by the COD maintainers for the detection and correction of syntactic errors is collectively called cod‐tools. These tools are capable of recognizing most of the problems listed in the IUCr validation criteria (http://journals.iucr.org/services/cif/checking/autolist.html), such as misspellings of data item names or their enumerated values, as well as some other common issues identified by scanning the COD. Examples of such discrepancies include data items designated to specify temperature containing values in units other than Kelvin or data items used to describe the density of a crystal containing values in kg/m³ instead of g/cm³. Errors like these might not seem significant when handling individual files, but they do complicate the workflow and skew the results of database‐wide analyses. Fortunately, some of the errors can be automatically corrected by using heuristics (for example, unit designators after the temperature values); others, however, require manual curation.
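The kind of unit heuristic described above can be illustrated with a minimal Python sketch. It is not the cod‐tools implementation: the function names, the returned notes, and the density thresholds are assumptions chosen for illustration only.

```python
import re

# Hypothetical helpers; names and thresholds are illustrative assumptions,
# not taken from cod-tools.
_TEMPERATURE = re.compile(
    r"^\s*([-+]?\d+(?:\.\d+)?)\s*(K|°\s*C|C)?\s*$", re.IGNORECASE
)


def normalize_temperature(raw):
    """Return (value_in_kelvin, note); value is None if manual curation is needed."""
    match = _TEMPERATURE.match(raw)
    if match is None:
        return None, "needs manual curation"
    value = float(match.group(1))
    unit = match.group(2)
    if unit is None:
        return value, None  # plain number, assumed to be in Kelvin already
    if unit.upper() == "K":
        return value, "unit designator removed"
    return round(value + 273.15, 2), "converted from degrees Celsius"


def normalize_density(value):
    """Return (value_in_g_per_cm3, note) using an assumed plausibility range."""
    if value <= 30.0:  # assumed upper bound for a plausible g/cm3 value
        return value, None
    if value >= 300.0:  # assumed lower bound for a value given in kg/m3
        return value / 1000.0, "converted from kg/m3"
    return None, "needs manual curation"
```

Values that fit neither pattern are returned as `None` so that they can be passed on to a human curator rather than guessed.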
One type of manually curated error is the incorrect number of implicit hydrogen atoms. This number, provided using the “_atom_site_attached_hydrogens” data item, specifies the number of hydrogen atoms attached to the atom site excluding the hydrogen atoms for which coordinates are given explicitly. Such discrepancies are easily spotted even by a novice chemist, but they are much harder to detect automatically. Incorrectly marked hydrogen atoms result in erroneous calculated atom charges, mismatch between the declared and the calculated formulas, and skewed distributions of geometric parameters.
Errors in the coordinates, cell constants, and symmetry are especially difficult to locate and correct. Nevertheless, the structures in the COD are routinely scanned for “bumps” (suspiciously small interatomic distances) and voids. Examination of “bumps” usually reveals modeling errors, unmarked disordered sites, or redundant atoms; several non‐P1 structures, which had all symmetric atoms listed, have been spotted and corrected while scanning the COD. Voids, on the other hand, are a sign of missing atoms or their groups, wrong cell constants, or incorrectly low symmetry. Currently, new means of detecting other geometric anomalies in deposited structures based on statistical distributions of geometric parameters are being developed. Such checks will make the identification of unfinished refinement, missing atoms, and typographical errors in coordinates and cell constants possible.
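A “bump” scan can be sketched in a few lines of Python. The sketch below is a simplified illustration, not the COD procedure: it converts fractional coordinates to Cartesian ones for a general unit cell and compares only the atoms as listed, ignoring symmetry equivalents and periodic images. The function names and the 0.7 Å threshold are assumptions.

```python
import math
from itertools import combinations


def cell_vectors(a, b, c, alpha, beta, gamma):
    """Return Cartesian vectors of the three cell axes; angles are in degrees."""
    cos_a, cos_b, cos_g = (math.cos(math.radians(x)) for x in (alpha, beta, gamma))
    sin_g = math.sin(math.radians(gamma))
    cx = c * cos_b
    cy = c * (cos_a - cos_b * cos_g) / sin_g
    cz = math.sqrt(c * c - cx * cx - cy * cy)
    return (a, 0.0, 0.0), (b * cos_g, b * sin_g, 0.0), (cx, cy, cz)


def to_cartesian(frac, vectors):
    """Convert one fractional coordinate triple to Cartesian coordinates."""
    return tuple(sum(f * v[i] for f, v in zip(frac, vectors)) for i in range(3))


def find_bumps(cell, atoms, min_distance=0.7):
    """List (label1, label2, distance) for atom pairs closer than min_distance.

    `atoms` is a list of (label, (x, y, z)) with fractional coordinates.
    The threshold (in angstroms) is an illustrative assumption.
    """
    vectors = cell_vectors(*cell)
    positions = [(label, to_cartesian(frac, vectors)) for label, frac in atoms]
    bumps = []
    for (label1, p1), (label2, p2) in combinations(positions, 2):
        distance = math.dist(p1, p2)
        if distance < min_distance:
            bumps.append((label1, label2, distance))
    return bumps
```

A production check would additionally expand the asymmetric unit by the space group operators and consider neighboring cells, and would exempt sites that are explicitly marked as disordered.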
Not all structures, however, can be successfully corrected. To inform the user and enable the recognition of such entries in automated analyses, a warning or an error flag is added to the CIF manually. Currently there are around 20 such entries in the COD.
Retracted structures are another type of COD entry unfit for normal use. The retraction rate, as reported by RetractionWatch, is around 500–600 retractions/year (http://retractionwatch.com/help-us-heres-some-of-what-were-working-on/) and the field of crystallography is not immune to incorrect conclusions and scientific fraud. Since, at least to the knowledge of the COD maintainers, there is no open database listing all retracted publications, the process of retraction in the COD is completely manual. Each entry coming from a retracted publication is blanked and excluded from the search so as not to bias automated analyses. However, since the history of all structures is preserved in the COD, retracted structures can be accessed if necessary.
Alongside retractions, there are a few more types of entries that are not desired in the COD but often are identified as such only after the deposition. One of them is duplicates: in order not to overcrowd the COD with repeated entries and thus bias statistical results, deposited structures are compared with the rest of the structures in the database in an attempt to locate duplicates. Currently, two structures are assumed to be duplicates if they originate from the same publication, have the same lattice cell constants and contents, are measured at the same temperature and pressure, and are not enantiomers of one another or deliberately suboptimal versions of some properly refined structure (the suboptimal structures are sometimes published to support the space group or refinement parameter choices). We must note, however, that not all duplicates are marked in the COD at the moment. Therefore, new methods to locate duplicates are being devised and employed in the COD, almost always requiring supervision of a data curator. As entries are not removed from the COD, duplicates are marked with a special flag, indicating the original entry.
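The duplicate criteria listed above translate naturally into a predicate. The following Python sketch is only an illustration under stated assumptions: the `Entry` record, its field names, and the numeric tolerance on cell constants are hypothetical and do not reflect the actual COD data model.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Entry:
    """Hypothetical, simplified view of a COD entry."""
    cod_id: int
    publication: str                      # e.g. a DOI
    cell: Tuple[float, float, float, float, float, float]
    formula: str
    temperature: Optional[float]          # Kelvin
    pressure: Optional[float]             # kPa
    enantiomer_of: Optional[int] = None   # COD ID of the mirror-image entry
    suboptimal: bool = False              # deliberately suboptimal refinement


def are_duplicates(x, y, cell_tolerance=0.001):
    """Apply the duplicate criteria; the tolerance is an illustrative assumption."""
    if x.publication != y.publication or x.formula != y.formula:
        return False
    if any(abs(p - q) > cell_tolerance for p, q in zip(x.cell, y.cell)):
        return False
    if x.temperature != y.temperature or x.pressure != y.pressure:
        return False
    if x.suboptimal or y.suboptimal:
        return False
    if x.enantiomer_of == y.cod_id or y.enantiomer_of == x.cod_id:
        return False
    return True
```

A positive result from such a predicate would be a candidate for review by a data curator rather than a final verdict.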
In 2013, depositions of theoretical calculation results to the COD were spotted. As a result, the policy of accepting only experimentally determined structures was reiterated, and a sister database, the TCOD, was opened to house all kinds of theoretically defined structures. Since then more than 400 theoretical structures have been identified and marked as such in the COD. The difficulty of identifying theoretical structures from the data given in CIFs hinders automatic detection of such depositions. However, properties like high numeric precision of cell constants and coordinates, missing standard uncertainties, and missing experimental details may be used to guide this otherwise manual task. As with any other structure not fitting the scope or criteria of the COD, theoretical structures are marked as such instead of being removed.
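The hints mentioned above can be collected by a small script that flags candidates for manual review. The sketch below assumes that CIF data items have already been parsed into a dictionary of tag and string value pairs; the function name, the choice of inspected tags, and the precision threshold are assumptions, and the output is a list of hints, not a classification.

```python
import re

CELL_TAGS = ("_cell_length_a", "_cell_length_b", "_cell_length_c")
EXPERIMENT_TAGS = ("_diffrn_ambient_temperature", "_diffrn_radiation_wavelength")
_SU = re.compile(r"\(\d+\)$")  # trailing standard uncertainty, e.g. "(2)"


def _decimals(value):
    """Count digits after the decimal point, ignoring a trailing uncertainty."""
    return len(_SU.sub("", value).partition(".")[2])


def theoretical_hints(items, max_decimals=5):
    """Return reasons why an entry might be theoretical (threshold is assumed)."""
    hints = []
    cell_values = [items[tag] for tag in CELL_TAGS if tag in items]
    if cell_values and not any(_SU.search(v) for v in cell_values):
        hints.append("cell constants lack standard uncertainties")
    if any(_decimals(v) > max_decimals for v in cell_values):
        hints.append("unusually high numeric precision")
    # "?" and "." are the CIF placeholders for unknown and inapplicable values.
    if not any(items.get(tag, "?") not in ("?", ".") for tag in EXPERIMENT_TAGS):
        hints.append("no experimental details")
    return hints
```

An empty list does not prove that a structure is experimental; the hints only help a curator prioritize entries for inspection.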
Scientific data, when used, must be properly cited and available for verification of the conclusions drawn from them. The availability must be ensured both during the research, for the benefit of the scientist conducting it, and at later stages, for peer review and for replication of the conclusions reached. Curated databases, however, change over time, and databases like the COD that follow an immediate release policy can change at any time and at a high rate, comparable with the rate at which data are queried for computations. To make sure that computations done with the COD are repeatable, and inferences drawn from them are reproducible, it is crucial that any previous state of the database can be restored. We implement this requirement by using version control on the COD data.
Currently, a Subversion server [47,48] is used to register versions of the COD data in CIFs. Subversion is a powerful, off‐the‐shelf open‐source software system that tracks changes in a tree of files, assigns each state of the file tree an unchangeable sequential revision number, and allows restoring any previous revision from the repository. Although originally designed as a tool for software development, Subversion offers precisely those functions that are needed for a scientific database of medium size, such as the COD. The textual nature of the CIF format makes CIFs particularly well suited for tracking with revision control systems.
Since the introduction of Subversion, all COD data curation history is available, and any state of the database can be restored. As an additional advantage, Subversion also records movement of files in the file tree and rename operations, thus providing full data provenance of each COD CIF in the version control system since its insertion into the repository. When the COD ID of a structure and a revision number of the structure are known, a unique string of bits (a digital object) describing that structure at the given revision can be retrieved.
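Retrieving one structure at one revision can be sketched with the standard `svn cat -r` command, assembled here in Python. The repository URL passed in by the caller and the directory layout used by `cod_cif_path` are assumptions for illustration and should be checked against the current COD documentation.

```python
def cod_cif_path(cod_id):
    """Map a COD ID to a path in the file tree (assumed layout: cif/D/DD/DD/ID.cif)."""
    digits = str(cod_id)
    return f"cif/{digits[0]}/{digits[1:3]}/{digits[3:5]}/{digits}.cif"


def svn_cat_command(cod_id, revision, repo_url):
    """Build the argument list that prints one CIF as it was at `revision`."""
    return ["svn", "cat", "-r", str(revision), f"{repo_url}/{cod_cif_path(cod_id)}"]
```

The returned list can be handed to `subprocess.run(..., capture_output=True, check=True)`; hashing the captured bytes (for example with `hashlib.sha256`) gives a compact fingerprint of the retrieved digital object that can be quoted alongside the COD ID and revision number.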
The COD MySQL data tables are automatically produced from each current COD revision. These tables themselves are not currently versioned, i.e. currently MySQL tables contain only data from the most recent revision of the COD (although a nightly dump of the COD MySQL database is inserted into the COD Subversion repository). Such implementation was deemed satisfactory, since the primary COD data are CIFs, and MySQL tables for any revision can in principle be reconstructed from the CIFs of that particular revision.
As the database grows, however, and more queries are executed on the MySQL database rather than on the CIF tree, the need arises to quickly perform historic SQL queries without reconstructing MySQL tables for each revision. This capability is explicitly recommended in the Research Data Alliance (RDA) Recommendations for Data Citation [49,50]. Therefore, the COD will implement a possibility to query every revision of the COD database online (historic states of MySQL tables will be restored from COD CIFs and marked with corresponding time stamps and revision numbers) and to cite COD queries in a durable and reproducible way, making it possible to rerun each historic query, both on the original data and on newer database revisions.
Since COD record contents can change during data curation, the question arises of what rules the COD curation policy follows and what a researcher can rely upon. The current COD data policy is as follows. A COD entry record is essentially a claim made by a data depositor that the specified authors have published certain findings about the structure described in the COD entry. To this extent, the COD data curation team makes reasonable efforts to make each COD entry represent the publication authors' intent. To that end, data in COD entries can be enhanced during the data curation; additional data from the original publication may be added. Data values in CIFs may be corrected if a correct value is clearly specified by the authors in the original publication, and it is clear that the authors meant that value to be published (usually, such corrections also make good physical sense, making it obvious that the curated structure describes the physical reality better). In cases where the intent of the author is not so clear, or where essential data items such as coordinates of atoms or atomic symbols have to be changed, authors are first contacted to approve the changes. In all cases it must be clear that the original finding of the authors meant exactly the curated value and is not a new interpretation of the experiment.
Data curation never involves a new structure solution from the same data, re‐refinement, guessing values from common chemical knowledge or similar investigative steps. Such processes are possible, but in that case, a new COD ID must be assigned to the new structure solution and will be treated by the COD as a new publication.
The data curation process has data uniformity and accuracy of claims as its main aim. All COD structures must use the same conventions to describe analogous situations. In most cases, the IUCr CIF standard provides adequate means for uniform description, and we curate the data records to adhere to these standards. For example, atomic coordinates must be provided either as fractions of cell vectors along the crystal axes or as Cartesian coordinates in an orthogonal frame (in which case orthogonalization matrices relating the used Cartesian frame and the crystal axes must be given). Another instance is the melting point of a crystalline material, which must be given in Kelvin. If an original publication contains these data items recorded in different ways (different coordinate systems, different units), COD data curators convert them to the common mandated format, leaving original values in specific COD data items for reference. Sometimes, however, there is no standard way to express certain circumstances; for example, authors are sometimes unsure of the chemical nature of the atom occupying a certain site in a crystal unit cell, and they mark such sites using different codes (such as “I1,” “M2,” and so on). COD introduces a uniform notation, “X” for a completely unknown atom at a site and “M” for an unknown metal. In that case the original authors' designators might be changed; the curated version (atom site “X”), however, expresses the authors' message “unknown atom” better than the original “I” designator, since the latter can be confused with iodine in the COD context.
The COD follows a continuous release policy – each commit to the COD database is immediately available on the Web and in the public Subversion repository. Each such commit introduces a new COD revision. The COD content is mostly updated on a daily basis, and several revisions can be generated each day. It is therefore important that COD users keep track of which revision they are using for their calculations and data searches. Since such tracking might introduce an extra burden, we provide, by popular request, quarterly releases of COD data snapshots. Four times a year the latest COD revision is exported, both CIFs and MySQL table dumps, and packed in several popular archive formats. The revision and time stamp of the most recent release is available at http://www.crystallography.net/cod/archives/LAST_RELEASE.txt. Each current release is available for download in the COD archive area:
Current release:
http://www.crystallography.net/cod/archives/cod-cifs-mysql.tgz
http://www.crystallography.net/cod/archives/cod-cifs-mysql.txz
http://www.crystallography.net/cod/archives/cod-cifs-mysql.zip
(The contents of all three files are identical, so only one is needed to obtain a release.)
Historic releases: can be found in each year's “data” directory, following URIs of the type http://www.crystallography.net/cod/archives/<year>/data/; for example, all four releases of 2015 are in http://www.crystallography.net/cod/archives/2015/data/.
While the use of COD releases is conceptually simple and does not require the use of version control software and revision tracking, it must be noted that the releases get outdated quickly. Also, downloading a new release downloads all previous data anew, wasting bandwidth and time. Thus, frequent COD users should consider incremental means of updating their COD collection, such as Subversion (“svn”) or Rsync.
The growing need for COD‐like databases for other than experimental structures has sparked the creation of two sister databases: the Predicted Crystallography Open Database (PCOD) for predicted structures and the Theoretical Crystallography Open Database (TCOD) for theoretically constructed structures. The PCOD (http://www.crystallography.net/pcod/) was launched in December 2003 with the goal of collecting computationally predicted structures. It was expected that the number of such entries could easily exceed the number of experimentally determined ones. In January 2004, the PCOD offered 200 entries. In February 2007, the number of entries was boosted to more than 60 000 by the deposition of crystal structure predictions using the Geometrically Restrained INorganic Structure Prediction (GRINSP) software [22]. As the COD passed a major milestone by archiving the 50 000th entry in 2008, the PCOD climbed over the 100 000 structure limit in the same year. A year later the PCOD reached one million entries, most of them being generated by Zeolite Framework Solution (ZEFSA II) [51]. As a fork of the COD, the PCOD has inherited most of its features, such as stable unique data identifiers, data versioning, and Web and MySQL interfaces for searching. An automatic deposition service remains to be implemented in the PCOD.
The Theoretical Crystallography Open Database (http://www.crystallography.net/tcod/) was launched in May 2013, thus addressing the need for an open repository of theoretically computed crystal structures. As methods of computational chemistry enjoy unprecedented growth and computer power increases, a large number of atomistic simulations can be carried out, producing theoretical material structures and calculating their properties using DFT, post‐HF, QM/MM, and other methods. By the end of that year, the TCOD offered around 200 entries. To ensure high quality of deposited data, development of ontologies in a format of CIF dictionaries was initiated. In addition to that, a COD‐like pipeline to check each deposited structure against a set of community‐specified criteria for convergence, computation quality, and reproducibility was developed and installed in the TCOD. As of the time of writing, the TCOD contains more than 2000 entries.
Open‐access Web resources pave the way for unprecedented applications that interconnect and reuse data hosted by many different organizations without the need for coordination between them. Key elements for such cooperation are the interfaces for data access. A commonly used architectural style for both human‐ and machine‐usable Web interfaces is REST [52]; RESTful interfaces use common HTTP requests to stable URLs for data retrieval.
Each entry in the COD consists of a CIF data block, listing the atomic positions of the crystal of interest, and an optional data block for diffraction data (Fobs, powder diffractograms). If an experiment results in more than one CIF data block (N data blocks), they are split across N COD entries.
To provide permanent descriptors, unique identifiers – integers in the range 1 000 000 to 9 999 999 – are assigned to each deposited entry upon deposition into the COD. The COD identifiers are promised to be permanent – both retracted and duplicate entries, which are detected after their deposition, are marked as such rather than removed.
COD identifiers are straightforwardly transformed into stable URIs by prefixing them with http://www.crystallography.net/cod/ and postfixing with file type (.html for general review of an entry, .cif for CIF with atomic positions, and .hkl for the diffraction data file). For example, files of entry 2002916 can be accessed via http://www.crystallography.net/cod/2002916.html, http://www.crystallography.net/cod/2002916.cif, and http://www.crystallography.net/cod/2002916.hkl.
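The mapping from identifier to URI is simple enough to express as a short Python helper. This is a minimal sketch; the function name `cod_uri` and the validation behavior are assumptions, while the base URL, the identifier range, and the file types come from the text above.

```python
COD_BASE = "http://www.crystallography.net/cod/"
FILE_TYPES = ("html", "cif", "hkl")


def cod_uri(cod_id, file_type="cif"):
    """Build the stable URI of a COD entry for the given file type."""
    if not 1_000_000 <= cod_id <= 9_999_999:
        raise ValueError(f"not a valid COD identifier: {cod_id}")
    if file_type not in FILE_TYPES:
        raise ValueError(f"unsupported file type: {file_type}")
    return f"{COD_BASE}{cod_id}.{file_type}"
```

For example, `cod_uri(2002916, "html")` yields the address of the entry overview page, and the resulting string can be passed directly to any HTTP client.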
Data can be searched on the Web using simple Web forms that use the COD MySQL database as a fast search index (Figure 1.3):
Figure 1.3 COD search Web interface form.
The COD server returns found results as a paginated HTML table (Figure 1.4). From this page, results can be downloaded in bulk as an archive. COD currently supports ZIP archives for downloaded data. The result table can be downloaded as a comma‐separated value (CSV) file, and the list of selected structures can be obtained as a text file, either one COD number or one COD URI per line.
Figure 1.4 COD search result page, obtained as of 05 November 2016 from the query shown in Figure 1.3.
The same search interface can also be accessed programmatically using the COD RESTful API. The base URL for carrying out searches is http://www.crystallography.net/cod/result, while search terms have to be defined as HTTP GET or POST parameters. An example of such a query using the “curl” command line tool is given in Figure 1.5.
Figure 1.5 Example of the COD programmatic search interface.
The supported search terms are listed below:
text
: textual search; for example, text=caffeine
id
: search by COD identifier; for example, id=3000000
el1, el2, … , el8
: search for elements in composition; for example, el1=Ba&el2=O4
nel1, nel2, … , nel8
: exclude entries with given elements; for example, nel1=Os
vmin, vmax
: minimum and maximum volume of the cell, in Å³; for example, vmin=10&vmax=20
minZ, maxZ
: minimum and maximum Z value
minZprime, maxZprime
: minimum and maximum value of Z′
spacegroup
: search by spacegroup
journal, year, volume, issue, doi
: search by terms in bibliography
By default, the result of the structure request is returned in the CIF format; however, additional output formats can be requested.
A combination of search parameters results in logical conjunction (AND operation). The output format can also be controlled using the HTTP GET or POST parameter “format,” with one of the following values: “html,” “csv,” “zip,” and “json.” In addition, the “lst” value can be used to get the list of COD identifiers, “urls” to get the list of COD URLs, and “count” to get the number of entries matching the search query. The default format currently used for the “result” query is “html,” returning a paginated HTML table. Since a search request with no search terms selects all COD entries, this URI can also be used for browsing the COD database by COD ID. Other browsing pages (currently by journal or by publication date; the full list is available at http://www.crystallography.net/cod/browse.html) are also implemented using the “result” requests.
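A search request of this kind can be assembled with the Python standard library. The sketch below only builds the GET URL and does not contact the server; the function name `search_url` is an assumption, while the base URL, the search terms, and the format values are those described above.

```python
from urllib.parse import urlencode

SEARCH_URL = "http://www.crystallography.net/cod/result"
FORMATS = {"html", "csv", "zip", "json", "lst", "urls", "count"}


def search_url(fmt="html", **terms):
    """Build a COD search URL; all given terms must match (logical conjunction)."""
    if fmt not in FORMATS:
        raise ValueError(f"unsupported format: {fmt}")
    query = dict(terms)
    query["format"] = fmt
    return f"{SEARCH_URL}?{urlencode(query)}"
```

For instance, `search_url("count", text="caffeine")` produces a URL whose response should be the number of matching entries, and `search_url("lst", el1="Ba", el2="O4", nel1="Os")` requests a list of identifiers; the URL can then be fetched with `urllib.request.urlopen` or any other HTTP client.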
As presented in Section 1.4.1.1, each entry in the COD is identified by a unique seven‐digit number. COD presents the following URLs for access to the entry‐related data:
Coordinates
: http://www.crystallography.net/cod/XXXXXXX.cif
Diffraction data
: http://www.crystallography.net/cod/XXXXXXX.hkl
Metadata in RDF
: http://www.crystallography.net/cod/XXXXXXX.rdf
Here, the XXXXXXX placeholder should be replaced by a single COD identifier. An example of a query made using these identifiers from the Unix‐style command line is shown in Figure 1.6.
Figure 1.6 Retrieving a specific COD structure using the stable COD URI identifier.
CIFs can also be deposited to the database through the RESTful interface. Currently, registration of a depositor account at the COD is required beforehand. The URL of the RESTful deposition interface is http://www.crystallography.net/cod/cgi-bin/cif-deposit.pl. All parameters along with a CIF must be provided via HTTP POST:
username
: depositor's username
password
: depositor's password
user_email
: depositor's e‐mail address
cif
: contents of to‐be‐deposited CIF
hkl
: contents of to‐be‐deposited diffraction data file (optional)
deposition_type
: type of deposition, either “published,” “prepublication,” or “personal”
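A deposition request with the parameters listed above might be prepared as in the following Python sketch. It only constructs the request object and sends nothing. The function name is hypothetical, and the URL‐encoded form body is an assumption: the server may instead expect a multipart form upload, which should be verified against the COD deposition documentation.

```python
from urllib.parse import urlencode
from urllib.request import Request

DEPOSIT_URL = "http://www.crystallography.net/cod/cgi-bin/cif-deposit.pl"
DEPOSITION_TYPES = ("published", "prepublication", "personal")


def build_deposition_request(username, password, user_email, cif,
                             deposition_type, hkl=None):
    """Prepare (but do not send) an HTTP POST request for a COD deposition."""
    if deposition_type not in DEPOSITION_TYPES:
        raise ValueError(f"unknown deposition type: {deposition_type}")
    fields = {
        "username": username,
        "password": password,
        "user_email": user_email,
        "cif": cif,
        "deposition_type": deposition_type,
    }
    if hkl is not None:  # diffraction data are optional
        fields["hkl"] = hkl
    # Assumption: a URL-encoded body; a multipart upload may be required instead.
    body = urlencode(fields).encode("utf-8")
    return Request(DEPOSIT_URL, data=body, method="POST")
```

The prepared request could then be submitted with `urllib.request.urlopen(request)`; since the password travels in the request body, an encrypted connection is preferable if the server offers one.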
