Bioinformatics - E-Book

Description

Praise for the third edition of Bioinformatics

"This book is a gem to read and use in practice."
Briefings in Bioinformatics

"This volume has a distinctive, special value as it offers an unrivalled level of details and unique expert insights from the leading computational biologists, including the very creators of popular bioinformatics tools."
ChemBioChem

"A valuable survey of this fascinating field. . . I found it to be the most useful book on bioinformatics that I have seen and recommend it very highly."
American Society for Microbiology News

"This should be on the bookshelf of every molecular biologist."
The Quarterly Review of Biology

The field of bioinformatics is advancing at a remarkable rate. With the development of new analytical techniques that make use of the latest advances in machine learning and data science, today’s biologists are gaining fantastic new insights into the natural world’s most complex systems. These rapidly progressing innovations can, however, be difficult to keep pace with.

The expanded fourth edition of the best-selling Bioinformatics aims to remedy this by providing students and professionals alike with a comprehensive survey of the current field. Revised to reflect recent advances in computational biology, it offers practical instruction on the gathering, analysis, and interpretation of data, as well as explanations of the most powerful algorithms presently used for biological discovery. Bioinformatics, Fourth Edition offers the most readable, up-to-date, and thorough introduction to the field for biologists at all levels, covering both key concepts that have stood the test of time and the new and important developments driving this fast-moving discipline forwards.

This new edition features:  

  • New chapters on metabolomics, population genetics, metagenomics and microbial community analysis, and translational bioinformatics
  • A thorough treatment of statistical methods as applied to biological data
  • Special topic boxes and appendices highlighting experimental strategies and advanced concepts
  • Annotated reference lists, comprehensive lists of relevant web resources, and an extensive glossary of commonly used terms in bioinformatics, genomics, and proteomics

Bioinformatics is an indispensable companion for researchers, instructors, and students of all levels in molecular biology and computational biology, as well as investigators involved in genomics, clinical research, proteomics, and related fields.

Page count: 1504

Year of publication: 2020

Table of Contents

Cover

Foreword

Preface

Contributors

About the Companion Website

1 Biological Sequence Databases

Introduction

Nucleotide Sequence Databases

Nucleotide Sequence Flatfiles: A Dissection

Protein Sequence Databases

Summary

Acknowledgments

Internet Resources

Further Reading

References

2 Information Retrieval from Biological Databases

Introduction

Integrated Information Retrieval: The Entrez System

Medical Databases

Organismal Sequence Databases Beyond NCBI

Summary

Further Reading

References

3 Assessing Pairwise Sequence Similarity: BLAST and FASTA

Introduction

Global Versus Local Sequence Alignments

Scoring Matrices

BLAST

BLAST 2 Sequences

MegaBLAST

PSI-BLAST

BLAT

FASTA

Summary

Further Reading

References

4 Genome Browsers

Introduction

The UCSC Genome Browser

UCSC Table Browser

ENSEMBL Genome Browser

Ensembl Biomart

JBrowse

Summary

Further Reading

References

5 Genome Annotation

Introduction

Gene Prediction Methods

Ab Initio Gene Prediction in Prokaryotic Genomes

Ab Initio Gene Prediction in Eukaryotic Genomes

How Well Do Gene Predictors Work?

Assessing Prokaryotic Gene Predictors

Assessing Eukaryotic Gene Predictors

Evidence Generation for Genome Annotation

Gene Annotation and Evidence Generation using Comparative Gene Prediction

Genome Annotation Pipelines

Summary

Acknowledgments

Internet Resources

Further Reading

References

6 Predictive Methods Using RNA Sequences

Introduction

Overview of RNA Secondary Structure Prediction Using Thermodynamics

Dynamic Programming

Accuracy of RNA Secondary Structure Prediction

Predicting the Secondary Structure Common to Multiple RNA Sequences

Practical Introduction to Single-Sequence Methods

Practical Introduction to Multiple Sequence Methods

Other Computational Methods to Study RNA Structure

Comparison of Methods

Predicting RNA Tertiary Structure

Summary

Further Reading

References

7 Predictive Methods Using Protein Sequences

Introduction

One-Dimensional Prediction of Protein Structure

Predicting Protein Function

Summary

Further Reading

References

8 Multiple Sequence Alignments

Introduction

Measuring Multiple Alignment Quality

Making an Alignment: Practical Issues

Commonly Used Alignment Packages

Viewing a Multiple Alignment

Summary

References

9 Molecular Evolution and Phylogenetic Analysis

Introduction

Early Classification Schemes

Sequences As Molecular Clocks

Background Terminology and the Basics

How to Construct a Tree

Marker-Based Evolution Studies

Phylogenetic Analysis and Data Integration

Future Challenges

References

10 Expression Analysis

Introduction

Step 0: Choose an Expression Analysis Technology

Step 1: Design the Experiment

Step 2: Collect and Manage the Data – and Metadata

Step 3: Data Pre-Processing

Step 4: Quality Control

Step 5: Normalization and Batch Effects

Step 6: Exploratory Data Analysis

Step 7: Differential Expression Analysis

Step 8: Exploring Mechanisms Through Functional Enrichment Analysis

Step 9: Developing a Classifier

Single-Cell Sequencing

Summary

Further Reading

References

11 Proteomics and Protein Identification by Mass Spectrometry

Introduction

Mass Spectrometry

Tandem Mass Spectrometry for Peptide Identification

Sample Preparation

Bioinformatics Analysis for MS-based Proteomics

Proteomics Strategies

Peptide Mass Fingerprinting

PMF on the Web

Proteomics and Tandem MS

PSM Software

PSM on the Web

Reporting Standards

Proteomics Data Repositories

Protein/Proteomics Databases

Selected Applications of Proteomics

Summary

Acknowledgments

Internet Resources

Further Reading

References

12 Protein Structure Prediction and Analysis

Introduction to Protein Structures

How Protein Structures are Determined

How Protein Structures are Described

Protein Structure Databases

Visualizing Proteins

Protein Structure Prediction

Protein Structure Evaluation

Protein Structure Comparison

Summary

Further Reading

References

13 Biological Networks and Pathways

Introduction

Pathway and Molecular Interaction Mapping: Experiments and Predictions

Pathway and Molecular Interaction Databases: An Overview

Pathway Databases

Molecular Interaction Databases

Functional Interaction Databases

Strategies for Navigating Pathway and Interaction Databases

Standard Data Formats for Pathways and Molecular Interactions

Pathway Visualization and Analysis

Network Visualization and Analysis

Summary

Acknowledgments

Internet Resources

Further Reading

References

14 Metabolomics

Introduction

Data Formats

Databases

Bioinformatics for Metabolite Identification

Multivariate Statistics

Bioinformatics for Metabolite Interpretation

Summary

Further Reading

References

15 Population Genetics

Introduction

Evolutionary Processes and Genetic Variation

Allele Frequencies and Population Variation

Display Methods

Demographic History Inference

Admixture and Ancestry Estimation

Detection of Natural Selection

Other Applications

Summary

References

16 Metagenomics and Microbial Community Analysis

Introduction

Why Study the Microbiome?

The Origins of Microbiome Analysis

Metagenomic Workflow

General Considerations in Marker-Gene and Metagenomic Data Analysis

Marker Genes

Metagenomic Data Analysis

Other Techniques to Characterize the Microbiome

Summary

Further Reading

References

17 Translational Bioinformatics

Introduction

Databases Describing the Genetics of Human Health

Prediction and Characterization of Impactful Genetic Variants from Sequence

Computing with Patient Phenotype Using Data in Electronic Health Records

Informatics and Precision Medicine

Ethical, Legal, and Social Implications of Translational Medicine

Summary

References

18 Statistical Methods for Biologists

Introduction

Descriptive Representations of Data

Statistical Inference and Statistical Hypothesis Testing

Summary

Acknowledgments

Internet Resources

Further Reading

References

Appendices

1.1 Example of a Flatfile Header in ENA Format

1.2 Example of a Flatfile Header in DDBJ/GenBank Format

1.3 Example of a Feature Table in ENA Format

1.4 Example of a Feature Table in GenBank/DDBJ Format

6.1 Dynamic Programming

Reference

Glossary

Index

End User License Agreement

List of Tables

Chapter 1

Table 1.1 Indicating locations within the feature table.

Chapter 2

Table 2.1 Entrez Boolean search statements.

Chapter 3

Table 3.1 Selecting an appropriate scoring matrix.

Table 3.2 BLAST algorithms.

Table 3.3 Main FASTA algorithms.

Chapter 7

Table 7.1 Disorder prediction performance.

Table 7.2 Performance of selected gene ontology term prediction methods in CAFA2...

Chapter 8

Table 8.1 Aligner performance on BAliBASE3 benchmark.

Chapter 9

Table 9.1 Some common software packages implementing different phylogenetic anal...

Chapter 11

Table 11.1 List of common sources of protein sequences (used in FASTA format).

Table 11.2 Standard search parameters used with sequence database search engines...

Chapter 12

Table 12.1 Relationship between backbone root mean square deviation (RMSD, i...

Chapter 14

Table 14.1 A list of freely available molecular editors and visualization tools.

Table 14.2 A list of open access chemical, spectral, pathway, and metabolomic da...

Chapter 15

Table 15.1 Examples of genes that have undergone natural selection in human popu...

Chapter 17

Table 17.1 Examples of commonly used biomedical ontologies and terminologies in ...

Chapter 18

Table 18.1 Common parametric statistical tests and their non-parametric equivale...

List of Illustrations

Chapter 1

Figure 1.1 The landing page for ENA record U54469.1, providing a graphical vie...

Figure 1.2 Results of a search for the human heterogeneous nuclear ribosomal p...

Figure 1.3 The Subcellular location and Pathology & Biotech sections of ...

Figure 1.4 The Feature viewer rendering of the record for the human heterogene...

Figure 1.5 Expanding the PTM, Structural features, and Variants sections withi...

Chapter 2

Figure 2.1 The exponential growth of GenBank in terms of number of nucleotides...

Figure 2.2 Results of a text-based Entrez query against PubMed using Boolean o...

Figure 2.3 An example of a PubMed record in Abstract format, as returned throu...

Figure 2.4 Neighbors to an entry found in PubMed. The original entry from Figu...

Figure 2.5 The Entrez Gene page for the DCC (deleted in colorectal carcinoma) ...

Figure 2.6 A section of the Database of Single Nucleotide Polymorphisms (dbSNP...

Figure 2.7 Entries in the RefSeq protein database corresponding to the origina...

Figure 2.8 The RefSeq entry for the netrin receptor, the protein product of th...

Figure 2.9 The same RefSeq entry for the netrin receptor shown in Figure 2.8, ...

Figure 2.10 Protein structures associated with the RefSeq entry for the human ...

Figure 2.11 The structure summary page for pdb:4URT, the crystal structure of ...

Figure 2.12 A list of structures deemed similar to pdb:4URT using VAST+. The t...

Figure 2.13 Online Mendelian Inheritance in Man (OMIM) entries related to the

Figure 2.14 The Online Mendelian Inheritance in Man (OMIM) entry for the DCC g...

Figure 2.15 An example of a list of allelic variants that can be found through...

Figure 2.16 The ClinicalTrials.gov page showing all actively recruiting clinic...

Figure 2.17 A clickable map showing where actively recruiting clinical trials ...

Figure 2.18 The Mouse Genome Informatics (MGI) entry for the Dcc gene in mouse...

Figure 2.19 The Zebrafish Information Network (ZFIN) gene page for the dcc gen...

Figure 2.20 An example of gene expression data available through the Zebrafish...

Chapter 3

Figure 3.1 The BLOSUM62 scoring matrix (Henikoff and Henikoff 1992). BLOSUM62 ...

Figure 3.2 A nucleotide scoring table. The scoring for the four nucleotide bas...

Figure 3.3 The initiation of a BLAST search. The search begins with query word...

Figure 3.4 BLAST search extension. Length of extension represents the number o...

Figure 3.5 The National Center for Biotechnology Information (NCBI) BLAST land...

Figure 3.6 The upper portion of the BLASTP query page. The first section in th...

Figure 3.7 The lower portion of the BLASTP query page, showing algorithm param...

Figure 3.8 Graphical display of BLASTP results. The query sequence is represen...

Figure 3.9 The BLASTP “hit list.” For each sequence found, the user is present...

Figure 3.10 Detailed information on a representative BLASTP hit. The header pr...

Figure 3.11 Performing a BLAST 2 Sequences alignment. Clicking the check box a...

Figure 3.12 Typical output from a BLAST 2 Sequences alignment, based on the qu...

Figure 3.13 Constructing a position-specific scoring matrix (PSSM). In the upp...

Figure 3.14 Performing a PSI-BLAST search. See text for details.

Figure 3.15 Selecting algorithm parameters for a PSI-BLAST search. See text fo...

Figure 3.16 Results of the first round of a PSI-BLAST search. For each sequenc...

Figure 3.17 Results of the second round of a PSI-BLAST search. New sequences i...

Figure 3.18 Submitting a BLAT query. A rat clone from the Cancer Genome Anatom...

Figure 3.19 Results of a BLAT query. Based on the query submitted in Figure 3....

Figure 3.20 The FASTA search strategy. (a) Once FASTA determines words of leng...

Figure 3.21 Search summary from a protein–protein FASTA search, using the sequ...

Figure 3.22 Hit list for the protein–protein FASTA search described in Figure ...

Chapter 4

Figure 4.1 The home page of the UCSC Genome Browser, showing a query for the g...

Figure 4.2 The default view of the UCSC Genome Browser, showing the genomic co...

Figure 4.3 The genomic context of the human HIF1A gene, after clicking on zoom ...

Figure 4.4 The RefSeq Track Settings page. The track settings pages are used t...

Figure 4.5 The genomic context of the human HIF1A gene, after displaying RefSe...

Figure 4.6 The Get Genomic Sequence page that provides an interface for users ...

Figure 4.7 The genomic context of the human HIF1A gene, after changing the dis...

Figure 4.8 Configuring the track settings for the Common SNPs(150) track. Set ...

Figure 4.9 The genomic context of the human HIF1A gene, after changing the col...

Figure 4.10 The GTEx Gene track, which depicts median gene expression levels i...

Figure 4.11 BLAT search at the UCSC Genome Browser. (a) This page shows the re...

Figure 4.12 Configuring the UCSC Table Browser. The link to the Table Browser ...

Figure 4.13 The home page of the Ensembl Genome Browser, showing a query for t...

Figure 4.14 The Gene tab for the human PAH gene. This landing page provides li...

Figure 4.15 Computationally predicted orthologs of the human PAH gene, from th...

Figure 4.16 The Location tab for the human PAH gene. The Location tab is divid...

Figure 4.17 Zooming in on the bottom section of the Location tab from Figure 4...

Figure 4.18 The Ensembl Variant tab. (a) To get more details about SNP rs76296...

Figure 4.19 The Ensembl Regulatory Build track. (a) Go to Configure this page ...

Figure 4.20 The Synteny view at Ensembl. (a) An overview of the syntenic block...

Figure 4.21 Ensembl BLAST output, showing an alignment between the human ADAM1...

Figure 4.22 Using BioMart to retrieve the mouse orthologs of the human RefSeqs...

Figure 4.23 JBrowse display of a predicted Mnemiopsis gene (ML05372a) from the...

Chapter 5

Figure 5.1 A simplified depiction of a prokaryotic gene or open reading frame ...

Figure 5.2 A simplified depiction of a eukaryotic gene illustrating the multi-...

Figure 5.3 A schematic illustration of the upstream regions of a eukaryotic ge...

Figure 5.4 A schematic illustration of the splice site regions around exons an...

Figure 5.5 Sample output from a GENSCAN analysis of the uroporphyrinogen decar...

Figure 5.6 Schematic representation of measures of gene prediction accuracy at...

Figure 5.7 The typical L-shaped structure of a tRNA molecule. This depicts the...

Figure 5.8 A screenshot montage of the PHASTER web server showing the website ...

Figure 5.9 A screenshot of a BASys bacterial genome annotation output for the ...

Chapter 6

Figure 6.1 The three levels of organization of RNA structure. (a) The primary ...

Figure 6.2 The RNA secondary structure of the 3′ untranslated region of the Dr...

Figure 6.3 An illustration of the equilibria of RNA structures in solution. (a...

Figure 6.4 Prediction of conformational free energy for a conformation of RNA ...

Figure 6.5 A simple RNA pseudoknot. This figure illustrates two representation...

Figure 6.6 The input form for the version 3.1 Mfold server. (a) The top and (b...

Figure 6.7 The output page for the Mfold server. Please refer to the text for ...

Figure 6.8 Sample output from the Mfold web server, version 3.1. (a) The secon...

Figure 6.9 RNAstructure web server input form. (a) The top and (b) the bottom ...

Figure 6.10 Sample output from the RNAstructure web server showing the predict...

Figure 6.11 Input form for the RNAstructure web server for multiple-sequence p...

Figure 6.12 Sample output from the RNAstructure web server for multiple-sequen...

Chapter 7

Figure 7.1 Dashboard of the PredictProtein web server. PredictProtein (Yachdav...

Figure 7.2 Protein secondary structure. Experimentally determined three-dimens...

Figure 7.3 Accessible surface area (ASA). The ASA describes the surface that i...

Figure 7.4 Protein secondary structure. Prediction of secondary structure, sol...

Figure 7.5 Types of transmembrane proteins. Experimentally determined three-di...

Figure 7.6 Transmembrane helix prediction by TMSEG. TMSEG (Bernhofer et al. 20...

Figure 7.7 Annotations of human tumor suppressor P53 (P53_HUMAN). (a) InterPro...

Figure 7.8 Prediction of subcellular localization. Visual output from LocTree3...

Figure 7.9 From predicting single amino acid sequence variant (SAV) effects to...

Chapter 8

Figure 8.1 An example multiple sequence alignment of seven globin protein sequ...

Figure 8.2 An outline of the simple progressive multiple alignment process. Th...

Figure 8.3 Aligner accuracy versus total single-threaded run time using the BA...

Figure 8.4 Total single-threaded execution time (y-axis) for different aligner...

Figure 8.5 Ratio of total run time relative to single-threaded execution (y-ax...

Figure 8.6 Protein and RNA multiple sequence alignments as visualized using Ja...

Figure 8.7 Linked coding sequence (CDS), protein, and three-dimensional struct...

Chapter 9

Figure 9.1 Different ways to visualize a tree. In this example, the same tree ...

Figure 9.2 Alignments illustrating sequence similarity versus sequence identit...

Figure 9.3 The differences between orthologs, paralogs, and xenologs. The ance...

Figure 9.4 The difference between phylogenetic signal and phylogenetic noise. ...

Figure 9.5 Character-based versus distance-based phylogenetic methods. Charact...

Figure 9.6 Rooting a tree with an outgroup. Escherichia coli bacteria are comm...

Figure 9.7 Workflow for a protein-based phylogenetic analysis using the PHYLIP...

Figure 9.8 Phylogenetic relationships can be visualized using different types ...

Figure 9.9 Excerpt of a Salmonella minimum spanning tree. Types of Salmonella ...

Chapter 10

Figure 10.1 Example of an MA plot before (a) and after (b) normalization. A, o...

Figure 10.2 Histogram of the base mismatch (MM) rate across multiple RNA-seq s...

Figure 10.3 Overview of quantile normalization. We start with the box on the t...

Figure 10.4 Batch effects principal components analysis (PCA) example. Boxplot...

Figure 10.5 A simple illustration of the process of hierarchical clustering. (...

Figure 10.6 Heatmap showing clustering of gene expression data of the 100 most...

Figure 10.7 First two components of principal component analysis (PCA) on the ...

Figure 10.8 Principal component analysis (PCA) is a dimensionality reduction m...

Figure 10.9 Illustration of how one can select k when performing consensus clu...

Figure 10.10 Receiver operating characteristic (ROC) curve for a model desig...

Chapter 11

Figure 11.1 Gene(s) to proteoforms. This figure illustrates the complexity of ...

Figure 11.2 Quadrupole mass analyzer. Schematic of a quadrupole mass analyzer,...

Figure 11.3 Time of flight (TOF) mass analyzer. Schematic of a TOF mass analyz...

Figure 11.4 (a) Tandem mass spectrometry (MS). Schematic of a triple quadrupol...

Figure 11.5 Fragmentation tandem mass spectrometry (MS/MS, or MS²) spectrum. A...

Figure 11.6 Polypeptide backbone cleavage produces different product ion speci...

Figure 11.7 Post-translational modifications (PTMs) take place at different am...

Figure 11.8 Data pre-processing workflow of a mass spectrum. Different steps i...

Figure 11.9 Shotgun proteomics workflow. Schematic showing different steps inv...

Figure 11.10 A schematic diagram comparing the label-free approach with the di...

Figure 11.11 Peptide mass fingerprinting (PMF) workflow. Schematic showing dif...

Figure 11.12 Mascot peptide mass fingerprinting (PMF). PMF submission screen a...

Figure 11.13 Peptide sequencing via tandem mass spectrometry (MS/MS) spectra i...

Figure 11.14 Peptide sequence tag searching. Schematic illustrating how a sequ...

Figure 11.15 Peptide spectrum match (PSM). Annotated MS² spectrum showing matc...

Figure 11.16 Mascot search engine. Mascot MS² database search submission windo...

Figure 11.17 Proteomics. A broad classification of proteomics and the biologic...

Chapter 12

Figure 12.1 A flow diagram illustrating the steps used to experimentally prepa...

Figure 12.2 An example of a nuclear magnetic resonance (NMR) “blurrogram” of a...

Figure 12.3 The different levels of protein structures illustrating: (a) prima...

Figure 12.4 Examples of different types of protein folds including (a) the fou...

Figure 12.5 An illustration of standard amino acid residue and peptide bond ge...

Figure 12.6 An example of a Protein Data Bank formatted file showing the first...

Figure 12.7 A Ramachandran plot for the thioredoxin protein (Protein Data Bank...

Figure 12.8 A screenshot of the Research Collaboratory for Structural Bioinfor...

Figure 12.9 A screenshot of an image of Escherichia coli thioredoxin as genera...

Figure 12.10 An illustration of the four major approaches to rendering protein...

Figure 12.11 An example of the high-quality images that can be created using a...

Figure 12.12 An illustration of a homology model (b) of Escherichia coli thior...

Figure 12.13 A schematic illustration of how threading is performed. (a) A que...

Figure 12.14 An example of the high-quality postscript output data from PROCHE...

Figure 12.15 An example of the CATH database description of Escherichia coli t...

Chapter 13

Figure 13.1 The Reactome database pathway view. The central view shows pathway...

Figure 13.2 The EcoCyc database cellular overview of Escherichia coli metaboli...

Figure 13.3 An example of metabolic pathway reconstruction from Kyoto Encyclop...

Figure 13.4 A BioGRID database record. A screenshot of the result page for a B...

Figure 13.5 An IntAct database search for the human MDM2 gene. A summary of al...

Figure 13.6 An example of the main STRING query result page. A network of rela...

Figure 13.7 A query result from GeneMANIA. Each node in the network represents...

Figure 13.8 The AKT pathway as represented by a traditional method (top left, ...

Figure 13.9 The main components of the Proteomics Standards Initiative–Molecul...

Figure 13.10 The valine biosynthesis pathway dynamically drawn by the Pathway ...

Figure 13.11 Output from the PathVisio software showing a portion of a human c...

Figure 13.12 The set of symbol types available in the Systems Biology Graphica...

Figure 13.13 The Drosophila melanogaster cell cycle drawn using Systems Biolog...

Figure 13.14 The results of pathway enrichment analysis using the g:Profiler t...

Figure 13.15 A Gene Set Enrichment Analysis (GSEA) enrichment figure. The bott...

Figure 13.16 An enrichment map showing two enriched themes. Each node represen...

Figure 13.17 An introduction to terminology and visual notation used in the co...

Figure 13.18 Zooming in on a network in Cytoscape shows part of a large connec...

Figure 13.19 An overview of a pathway analysis workflow, summarizing multiple ...

Chapter 14

Figure 14.1 A diagram illustrating the typical workflow for a metabolomic expe...

Figure 14.2 An example of a Molecular Design Limited (MDL) chemical fingerprin...

Figure 14.3 An example of a MOL file for a two-dimensional representation of L...

Figure 14.4 An example of an nmrML data file for L-alanine. The actual file is...

Figure 14.5 The JSpectraViewer image for L-alanine. JSpectraViewer is a Java a...

Figure 14.6 A selection of two screenshots from the PubChem web pages for the ...

Figure 14.7 Two screenshots of the gas chromatography–mass spectrometry (GC-MS...

Figure 14.8 Two screenshots from the Human Metabolome Database (HMDB) entry fo...

Figure 14.9 A simplified illustration of how spectral deconvolution works for ...

Figure 14.10 Two screenshots of the Bayesil web server. (a) A nuclear magnetic...

Figure 14.11 An illustration of how spectral deconvolution works for gas chrom...

Figure 14.12 An illustration of how principal component analysis can be though...

Figure 14.13 A three-dimensional principal component analysis (PCA) “scores” p...

Figure 14.14 The MetaboAnalyst Module Overview page. This page allows users to...

Figure 14.15 The MetaboAnalyst Data Upload page. This page allows users to upl...

Figure 14.16 The MetaboAnalyst Data Normalization page. The optimal normalizat...

Figure 14.17 The MetaboAnalyst Data Normalization and Scaling results, generat...

Figure 14.18 A two-dimensional principal component analysis (PCA) “scores” plo...

Figure 14.19 The principal component analysis (PCA) “loadings” plot, showing t...

Figure 14.20 The partial least squares discriminant analysis (PLS-DA) plot sho...

Figure 14.21 An example of an R²/Q² plot generated by MetaboAnalyst using the ...

Figure 14.22 A variable importance in projection plot showing which metabolite...

Figure 14.23 A pathway impact plot showing the importance of different pathway...

Chapter 15

Figure 15.1 Principal components analysis (PCA) of nine world populations and ...

Figure 15.2 The coalescent process. Although the ancestral population contains...

Figure 15.3 Multiple sequentially Markovian coalescent (MSMC) estimate of popu...

Figure 15.4 Admixture analysis of nine populations and three test samples. Ind...

Figure 15.5 A Manhattan plot of Composite of Multiple Signals (CMS) scores (Y-...

Chapter 16

Figure 16.1 General workflow for DNA-based microbiome analysis.

Figure 16.2 FastQC summary of DNA sequence read quality for an Illumina sequen...

Figure 16.3 Primary structure and variable regions of the 16S ribosomal RNA ge...

Figure 16.4 k-mer decomposition of a nucleotide sequence with k = 2. Two seque...

Figure 16.5 Rarefaction curves for microbial communities sampled from six diff...

Figure 16.6 Unweighted phylogenetic alpha- and beta-diversity measures. Left: ...

Figure 16.7 Principal coordinate analysis (a) vs. non-metric multidimensional ...

Figure 16.8 Visualizing the differences between two groups of gut microbiome s...

Chapter 17

Figure 17.1 ClinVar entry for a benign variant in the cystic fibrosis gene (CF...

Figure 17.2 Receiver operating characteristic (ROCs) curves of five submissi...

Chapter 18

Figure 18.1 Relationships between observation, data, information, and knowledg...

Figure 18.2 Types of variables and their hierarchical relationships.

Figure 18.3 Organization of an example dataset. (a) Part of a two-dimensional ...

Figure 18.4 Commonly used descriptive statistics for sample variables. Light b...

Figure 18.5 Covariance versus correlation. The red sample has higher sample va...

Figure 18.6 Example histogram demonstrating the frequency of black cherry tree...

Figure 18.7 Example boxplot and related variant graphs. (a) Schematic diagram ...

Figure 18.8 Anscombe's quartet. Scatterplots with regression lines for four fa...

Figure 18.9 Scatterplot of the first two principal components (PCs) from princ...

Figure 18.10 Example of how to make a graph descriptive.

Figure 18.11 The standard normal distribution.

Figure 18.12 Other well-described discrete and continuous distributions common...

Figure 18.13 Bond length and coordination angle histograms for coordinated met...

Figure 18.14 Overview of the process of statistical inference. FUV stands for ...

Figure 18.15 Truth table with descriptions of type I and II errors.

Figure 18.16 Diagram illustrating the relationships between a probability dens...

Figure 18.17 Using a Student's t-test to test a null hypothesis.

Figure 18.18 Relationship between population and sample mean distributions.

Figure 18.19 An approximate power analysis diagram for a Student's t-test.

Appendices

Figure 6.A.1 Pseudo-computer code for the fill order of V(i,j) and W(i,j). Thi...

Figure 6.A.2 The filled V(i,j) array for sequence GCGGGUACCGAUCGUCGC.

Figure 6.A.3 The filled W(i,j) array for sequence GCGGGUACCGAUCGUCGC.

Figure 6.A.4 Illustrations of maximum hydrogen bond conformations as found by ...

Figure 6.A.5 Flowchart for structure traceback. Traceback starts by placing 1,...

Figure 6.A.6 The secondary structure of rGCGGGUACCGAUCGUCGC with 17 hydrogen b...

Bioinformatics

Edited by

Andreas D. Baxevanis, Gary D. Bader, and David S. Wishart

Fourth Edition

This fourth edition first published 2020

© 2020 John Wiley & Sons, Inc.

Edition History

Wiley-Blackwell (1e, 2000), Wiley-Blackwell (2e, 2001), Wiley-Blackwell (3e, 2005)

All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, except as permitted by law. Advice on how to obtain permission to reuse material from this title is available at http://www.wiley.com/go/permissions.

The right of Andreas D. Baxevanis, Gary D. Bader, and David S. Wishart to be identified as the authors of the editorial material in this work has been asserted in accordance with law.

Registered Office

John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, USA

Editorial Office

John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, USA

For details of our global editorial offices, customer services, and more information about Wiley products visit us at www.wiley.com.

Wiley also publishes its books in a variety of electronic formats and by print-on-demand. Some content that appears in standard print versions of this book may not be available in other formats.

Limit of Liability/Disclaimer of Warranty

While the publisher and authors have used their best efforts in preparing this work, they make no representations or warranties with respect to the accuracy or completeness of the contents of this work and specifically disclaim all warranties, including without limitation any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives, written sales materials or promotional statements for this work. The fact that an organization, website, or product is referred to in this work as a citation and/or potential source of further information does not mean that the publisher and authors endorse the information or services the organization, website, or product may provide or recommendations it may make. This work is sold with the understanding that the publisher is not engaged in rendering professional services. The advice and strategies contained herein may not be suitable for your situation. You should consult with a specialist where appropriate. Further, readers should be aware that websites listed in this work may have changed or disappeared between when this work was written and when it is read. Neither the publisher nor authors shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.

Library of Congress Cataloging-in-Publication Data

Names: Baxevanis, Andreas D., editor. | Bader, Gary D., editor. | Wishart, David S., editor.

Title: Bioinformatics / edited by Andreas D. Baxevanis, Gary D. Bader, David S. Wishart.

Other titles: Bioinformatics (Baxevanis)

Description: Fourth edition. | Hoboken, NJ : Wiley, 2020. | Includes bibliographical references and index.

Identifiers: LCCN 2019030489 (print) | ISBN 9781119335580 (cloth) | ISBN 9781119335962 (adobe pdf) | ISBN 9781119335955 (epub)

Subjects: MESH: Computational Biology--methods | Sequence Analysis--methods | Base Sequence | Databases, Nucleic Acid | Databases, Protein

Classification: LCC QH324.2 (print) | LCC QH324.2 (ebook) | NLM QU 550.5.S4 | DDC 570.285–dc23

LC record available at https://lccn.loc.gov/2019030489

LC ebook record available at https://lccn.loc.gov/2019030490

Cover Design: Wiley

Cover Images: © David Wishart, background © Suebsiri/Getty Images

Foreword

As I review the material presented in the fourth edition of Bioinformatics I am moved in two ways, related to both the past and the future.

Looking to the past, I am moved by the amazing evolution that has occurred in our field since the first edition of this book appeared in 1998. Twenty-one years is a long, long time in any scientific field, but especially so in the agile field of bioinformatics. To use the well-trodden metaphor of the “biology moonshot,” the launchpad at the beginning of the twenty-first century was the determination of the human genome. Discovery is not the right word for what transpired – we knew it was there and what was needed. Synergy is perhaps a better word; synergy of technological development, experiment, computation, and policy. A truly collaborative effort to continuously share, in a reusable way, the collective efforts of many scientists. Bioinformatics was born from this synergy and has continued to grow and flourish based on these principles.

That growth is reflected in both the scope and depth of what is covered in these pages. These attributes are a reflection of the increased complexity of the biological systems that we study (moving from “simple” model organisms to the human condition) and the scales at which those studies take place. As a community we have professed multiscale modeling without much to show for it, but it would seem to be finally here. We now have the ability to connect the dots from molecular interactions, through the pathways to which those molecules belong, to the cells they affect, to the interactions between those cells, and through to the effects they have on individuals within a population. Tools and methodologies that were novel in earlier editions of this book are now routine or obsolete, and newer, faster, and more accurate procedures are now with us. This will continue, and as such this book provides a valuable snapshot of the scope and depth of the field as it exists today.

Looking to the future, this book provides a foundation for what is to come. For me this is a field more aptly referred to (and perhaps a new subtitle for the next edition) as Biomedical Data Science. Sitting as I do now, as Dean of a School of Data Science which collaborates openly across all disciplines, I see rapid change akin to what happened to birth bioinformatics 20 or more years ago. It will not take 20 years for other disciplines to catch up; I predict it will take 2! The accomplishments outlined in this book can help define what other disciplines will accomplish with their own data in the years to come. Statistical methods, cloud computing, data analytics, notably deep learning, the management of large data, visualization, ethics policy, and the law surrounding data are generic. Bioinformatics has so much to offer, yet it will also be influenced by other fields in a way that has not happened before. Forty-five years in academia tells me that there is nothing to compare across campuses to what is happening today. This is both an opportunity and a threat. The editors and authors of this edition should be complimented for setting the stage for what is to come.

Philip E. Bourne, University of Virginia

Preface

In putting together this textbook, we hope that students from a range of fields – including biology, computer science, engineering, physics, mathematics, and statistics – benefit by having a convenient starting point for learning most of the core concepts and many useful practical skills in the field of bioinformatics, also known as computational biology.

Students interested in bioinformatics often ask how they should acquire training in such an interdisciplinary field as this one. In an ideal world, students would become experts in all the fields mentioned above, but this is actually not necessary and realistically too much to ask. All that is required is to combine their scientific interests with a foundation in biology and any single quantitative field of their choosing. While the most common combination is to mix biology with computer science, incredible discoveries have been made through finding creative intersections with any number of quantitative fields. Indeed, many of these quantitative fields typically overlap a great deal, especially given their foundational use of mathematics and computer programming. These natural relationships between fields provide the foundation for integrating diverse expertise and insights, especially in the context of performing bioinformatic analyses.

While bioinformatics is often considered an independent subfield of biology, it is likely that the next generation of biologists will not consider bioinformatics as being separate and will instead consider gaining bioinformatics and data science skills as naturally as they learn how to use a pipette. They will learn how to program a computer, likely starting in elementary school. Other data science knowledge areas, such as math, statistics, machine learning, data processing, and data visualization will also be part of any core curriculum. Indeed, the children of one of the editors recently learned how to construct bar plots and other data charts in kindergarten! The same editor is teaching programming in R (an important data science programming language) to all incoming biology graduate students at his university starting this year.

As bioinformatics and data science become more naturally integrated in biology, it is worth noting that these fields actively espouse a culture of open science. This culture is motivated by thinking about why we do science in the first place. We may be curious or like problem solving. We could also be motivated by the benefits to humanity that scientific advances bring, such as tangible health and economic benefits. Whatever the motivating factor, it is clear that the most efficient way to solve hard problems is to work together as a team, in a complementary fashion and without duplication of effort. The only way to make sure this works effectively is to efficiently share knowledge and coordinate work across disciplines and research groups. Presenting scientific results in a reproducible way, such as freely sharing the code and data underlying the results, is also critical. Fortunately, there are an increasing number of resources that can help facilitate these goals, including the bioRxiv preprint server, where papers can be shared before the very long process of peer review is completed; GitHub, for sharing computer code; and data science notebook technology that helps combine code, figures, and text in a way that makes it easier to share reproducible and reusable results.

We hope this textbook helps catalyze this transition of biology to a quantitative, data science-intensive field. As biological research advances become ever more built on interdisciplinary, open, and team science, progress will dramatically speed up, laying the groundwork for fantastic new discoveries in the future.

We also deeply thank all of the chapter authors for contributing their knowledge and time to help the many future readers of this book learn how to apply the myriad bioinformatic techniques covered within these pages to their own research questions.

Andreas D. Baxevanis

Gary D. Bader

David S. Wishart

Contributors

Gary D. Bader, PhD  is a Professor at The Donnelly Centre at the University of Toronto, Toronto, Canada, and a leader in the field of Network Biology. Gary completed his postdoctoral work in Chris Sander's group in the Computational Biology Center (cBio) at Memorial Sloan-Kettering Cancer Center in New York. Gary completed his PhD in the laboratory of Christopher Hogue in the Department of Biochemistry at the University of Toronto and a BSc in Biochemistry at McGill University in Montreal. Dr. Bader uses molecular interaction, pathway, and -omics data to gain a “causal” mechanistic understanding of normal and disease phenotypes. His laboratory develops novel computational approaches that combine molecular interaction and pathway information with -omics data to develop clinically predictive models and identify therapeutically targetable pathways. He also helps lead the Cytoscape, GeneMANIA, and Pathway Commons pathway and network analysis projects.

Geoffrey J. Barton, PhD  is Professor of Bioinformatics and Head of the Division of Computational Biology at the University of Dundee School of Life Sciences, Dundee, UK. Before moving to Dundee in 2001, he was Head of the Protein Data Bank in Europe and the leader of the Research and Development Team at the EMBL European Bioinformatics Institute (EBI). Prior to joining EMBL-EBI, he was Head of Genome Informatics at the Wellcome Trust Centre for Human Genetics, University of Oxford, a position he held concurrently with a Royal Society University Research Fellowship in the Department of Biochemistry. Geoff's longest running research interest is using computational methods to study the relationship between a protein's sequence, its structure, and its function. His group has contributed many tools and techniques in the field of protein sequence and structure analysis and structure prediction. Two of the best known are the Jalview multiple alignment visualization and analysis workbench, which is in use by over 70 000 groups for research and teaching, and the JPred multi-neural net protein secondary structure prediction algorithm, which performs predictions on up to 500 000 proteins/month for users worldwide. In addition to his work related to protein sequence and structure, Geoff has collaborated on many projects that probe biological processes using proteomic and high-throughput sequencing approaches. Geoff's group has deep expertise in RNA-seq methods and has recently published a two-condition 48-replicate RNA-seq study that is now a key reference work for users of this technology.

Andreas D. Baxevanis, PhD  is the Director of Computational Biology for the National Institutes of Health's (NIH) Intramural Research Program. He is also a Senior Scientist leading the Computational Genomics Unit at the NIH's National Human Genome Research Institute, Bethesda, MD, USA. His research program is centered on probing the interface between genomics and developmental biology, focusing on the sequencing and analysis of invertebrate genomes that can yield insights of relevance to human health, particularly in the areas of regeneration, allorecognition, and stem cell biology. His accomplishments have been recognized by the Bodossaki Foundation's Academic Prize in Medicine and Biology in 2000, Greece's highest award for young scientists of Greek heritage. In 2014, he was elected to the Johns Hopkins Society of Scholars, recognizing alumni who have achieved marked distinction in their field of study. He was the recipient of the NIH's Ruth L. Kirschstein Mentoring Award in 2015, in recognition of his commitment to scientific training, education, and mentoring. In 2016, Dr. Baxevanis was elected as a Senior Member of the International Society for Computational Biology for his sustained contributions to the field and, in 2018, he was elected as a Fellow of the American Association for the Advancement of Science for his distinguished contributions to the field of comparative genomics.

Robert G. Beiko, PhD  is a Professor and Associate Dean for Research in the Faculty of Computer Science at Dalhousie University, Halifax, Nova Scotia, Canada. He is a former Tier II Canada Research Chair in Bioinformatics (2007–2017), an Associate Editor at mSystems and BMC Bioinformatics, and a founding organizer of the Canadian Bioinformatics Workshops in Metagenomics and Genomic Epidemiology. He is also the lead editor of the recently published book Microbiome Analysis in the Methods in Molecular Biology series. His research focuses on microbial genomics, evolution, and ecology, with concentrations in the area of lateral gene transfer and microbial community analysis.

Fiona S.L. Brinkman, PhD, FRSC is a Professor in Bioinformatics and Genomics in the Department of Molecular Biology and Biochemistry at Simon Fraser University, Vancouver, British Columbia, Canada, with cross-appointments in Computing Science and the Faculty of Health Sciences. She is best known for her research and development of widely used computer software that aids both microbe (PSORTb, IslandViewer) and human genomic (InnateDB) evolutionary/genomics analyses, along with her insights into pathogen evolution. She is currently co-leading a national effort – the Integrated Rapid Infectious Disease Analysis Project – the goal of which is to use microbial genomes as a fingerprint to better track and understand the spread and evolution of infectious diseases. She has also been leading the development of an approach to integrate very diverse data for the Canadian CHILD Study birth cohort, including microbiome, genomic, epigenetic, environmental, and social data. She coordinates community-based genome annotation and database development for resources such as the Pseudomonas Genome Database. She also has a strong interest in bioinformatics education, including developing the first undergraduate curricula used as the basis for the first White Paper on Canadian Bioinformatics Training in 2002. She is on several committees and advisory boards, including the Board of Directors for Genome Canada; she chairs the Scientific Advisory Board for the European Nucleotide Archive (EMBL-EBI). She has received a number of awards, including a TR100 award from MIT, and, most recently, was named as a Fellow of the Royal Society of Canada.

Andrew Emili, PhD  is a Professor in the Departments of Biochemistry (Medical School) and Biology (Arts and Sciences) at Boston University (BU), Boston, MA, USA, and the inaugural Director of the BU Center for Network Systems Biology (CNSB). Prior to Boston, Dr. Emili was a founding member and Principal Investigator for 18 years at the Donnelly Center for Cellular and Biomolecular Research at the University of Toronto, one of the premier research centers in integrative molecular biology. Dr. Emili is an internationally recognized leader in functional proteomics, systems biology, and precision mass spectrometry. His group develops and applies innovative technologies to systematically map protein interaction networks and macromolecular complexes of cells and tissues on a global scale, publishing “interactome” maps of unprecedented quality, scope, and resolution.

Tatyana Goldberg, PhD  is a postdoctoral scientist at the Technical University of Munich, Germany. She obtained her PhD in Bioinformatics under the supervision of Dr. Burkhard Rost. Her research focuses on developing models that can predict the localization of proteins within cells. The results of her study contribute to a variety of applications, including the development of pharmaceuticals for the treatment of Alzheimer disease and cancer.

Emma J. Griffiths, PhD is a research associate in the Department of Pathology and Laboratory Medicine at the University of British Columbia in Vancouver, Canada, working with Dr. William Hsiao. Dr. Griffiths received her PhD from the Department of Biochemistry and Biomedical Sciences at McMaster University in Hamilton, Canada, with her doctoral work focusing on the evolutionary relationships between different groups of bacteria. She has since pursued postdoctoral training in the fields of chemical and fungal genetics and microbial genomics with Dr. Fiona Brinkman in the Department of Molecular Biology and Biochemistry at Simon Fraser University in Vancouver, Canada. Her current work focuses on the development of ontology-driven applications designed to improve pathogen genomics contextual data (“metadata”) exchange during public health investigations.

Desmond G. Higgins, PhD is Professor of Bioinformatics at University College Dublin, Ireland, where his laboratory works on genomic data analysis and sequence alignment algorithms. He earned his doctoral degree in zoology from Trinity College Dublin, Ireland, and has worked in the field of bioinformatics since 1985. His group maintains and develops the Clustal package for multiple sequence alignment in collaboration with groups in France, Germany, and the United Kingdom. Dr. Higgins wrote the first version of Clustal in Dublin in 1988. He then moved to the EMBL Data Library group located in Heidelberg in 1990 and later to EMBL-EBI in Hinxton. This coincided with the release of ClustalW and, later, ClustalX, which have been extremely widely used and cited. Currently, he has run out of version letters so is working on Clustal Omega, specifically designed for making extremely large protein alignments.

Lynn B. Jorde, PhD  has been on the faculty of the University of Utah School of Medicine, Salt Lake City, UT, USA, since 1979 and holds the Mark and Kathie Miller Presidential Endowed Chair in Human Genetics. He was appointed Chair of the Department of Human Genetics in September 2009. Dr. Jorde's laboratory has published scientific articles on human genetic variation, high-altitude adaptation, the genetic basis of human limb malformations, and the genetics of common diseases such as hypertension, juvenile idiopathic arthritis, and inflammatory bowel disease. Dr. Jorde is the lead author of Medical Genetics, a textbook that is now in its fifth edition and translated into multiple foreign languages. He is the co-recipient of the 2008 Award for Excellence in Education from the American Society of Human Genetics (ASHG). He served two 3-year terms on the Board of Directors of ASHG and, in 2011, he was elected as president of ASHG. In 2012, he was elected as a Fellow of the American Association for the Advancement of Science.

Marieke L. Kuijjer, PhD  is a Group Leader at the Centre for Molecular Medicine Norway (NCMM, a Nordic EMBL partner), University of Oslo, Norway, where she runs the Computational Biology and Systems Medicine group. She obtained her doctorate in the laboratory of Dr. Pancras Hogendoorn in the Department of Pathology at the Leiden University Medical Center in the Netherlands. After this, she continued her scientific training as a postdoctoral researcher in the laboratory of Dr. John Quackenbush at the Dana-Farber Cancer Institute and Harvard T.H. Chan School of Public Health, during which she won a career development award and a postdoctoral fellowship. Dr. Kuijjer's research focuses on solving fundamental biological questions through the development of new methods in computational and systems biology and on implementing these techniques to better understand gene regulation in cancer. Dr. Kuijjer serves on the editorial board of Cancer Research.

David H. Mathews, MD, PhD  is a professor of Biochemistry and Biophysics and also of Biostatistics and Computational Biology at the University of Rochester Medical Center, Rochester, NY, USA. He also serves as the Associate Director of the University of Rochester's Center for RNA Biology. His involvement in education includes directing the Biophysics PhD program and teaching a course in Python programming and algorithms for doctoral students without a programming background. His group studies RNA biology and develops methods for RNA secondary structure prediction and molecular modeling of three-dimensional structure. His group developed and maintains RNAstructure, a widely used software package for RNA structure prediction and analysis.

Sean D. Mooney, PhD has spent his career as a researcher and group leader in biomedical informatics. He now leads Research IT for UW Medicine and is leading efforts to support and build clinical research informatics platforms as its first Chief Research Information Officer (CRIO) and as a Professor in the Department of Biomedical Informatics and Medical Education at the University of Washington, Seattle, WA, USA. Prior to being appointed as CRIO, he was an Associate Professor and Director of Bioinformatics at the Buck Institute for Research on Aging. As an Assistant Professor, he was appointed in Medical and Molecular Genetics at Indiana University School of Medicine and was the founding Director of the Indiana University School of Medicine Bioinformatics Core. In 1997, he received his BS with Distinction in Biochemistry and Molecular Biology from the University of Wisconsin at Madison. He received his PhD from the University of California in San Francisco in 2001, then pursued his postdoctoral studies under an American Cancer Society John Peter Hoffman Fellowship at Stanford University.

Stephen J. Mooney, PhD  is an Acting Assistant Professor in the Department of Epidemiology at the University of Washington, Seattle, WA, USA. He developed the CANVAS system for collecting data from Google Street View imagery as a graduate student, and his research focuses on contextual influences on physical activity and transport-related injury. He's a methods geek at heart.

Hunter N.B. Moseley, PhD  is an Associate Professor in the Department of Molecular and Cellular Biochemistry at the University of Kentucky, Lexington, KY, USA. He is also the Informatics Core Director within the Resource Center for Stable Isotope Resolved Metabolomics, Associate Director for the Institute for Biomedical Informatics, and a member of the Markey Cancer Center. His research interests include developing computational methods, tools, and models for analyzing and interpreting many types of biological and biophysical data that enable new understanding of biological systems and related disease processes. His formal education spans multiple disciplines including chemistry, mathematics, computer science, and biochemistry, with expertise in algorithm development, mathematical modeling, structural bioinformatics, and systems biochemistry, particularly in the development of automated analyses of nuclear magnetic resonance and mass spectrometry data as well as knowledge–data integration.

Yanay Ofran, PhD  is a Professor and head of the Laboratory of Functional Genomics and Systems Biology at Bar Ilan University in Tel Aviv, Israel. His research focuses on biomolecular recognition and its role in health and disease. Professor Ofran is also the founder of Biolojic Design, a biopharmaceutical company that uses artificial intelligence approaches to design epitope-specific antibodies. He is also the co-founder of Ukko, a biotechnology company that uses computational tools to design safe proteins for the food and agriculture sectors.

Joseph N. Paulson, PhD  is a Statistical Scientist within Genentech's Department of Biostatistics, San Francisco, CA, USA, working on designing clinical trials and biomarker discovery. Previously, he was a Research Fellow in the Department of Biostatistics and Computational Biology at the Dana-Farber Cancer Institute and Department of Biostatistics at the Harvard T.H. Chan School of Public Health. He graduated with a PhD in Applied Mathematics, Statistics, and Scientific Computation from the University of Maryland, College Park where he was a National Science Foundation Graduate Fellow. As a statistician and computational biologist, his interests include clinical trial design, biomarker discovery, development of computational methods for the analysis of high-throughput sequencing data while accounting for technical artifacts, and the microbiome.

Sadhna Phanse, MSc  is a Bioinformatics Analyst at the Donnelly Centre for Cellular and Biomolecular Research at the University of Toronto, Toronto, Canada. She has been active in the field of proteomics since 2006 as a member of the Emili research group. Her current work involves the use of bioinformatics methods to investigate biological systems and molecular association networks in human cells and model organisms.

John Quackenbush, PhD  is Professor of Computational Biology and Bioinformatics and Chair of the Department of Biostatistics at the Harvard T.H. Chan School of Public Health, Boston, MA, USA. He also holds appointments in the Channing Division of Network Medicine of Brigham and Women's Hospital and at the Dana-Farber Cancer Institute. He is a recognized expert in computational and systems biology and its applications to the study of a wide range of human diseases and the factors that drive those diseases and their responses to therapy. Dr. Quackenbush has long been an advocate for open science and reproducible research. As a founding member and past president of the Functional Genomics Data Society (FGED), he was a developer of the