Praise for the third edition of Bioinformatics
"This book is a gem to read and use in practice."
—Briefings in Bioinformatics
"This volume has a distinctive, special value as it offers an unrivalled level of details and unique expert insights from the leading computational biologists, including the very creators of popular bioinformatics tools."
—ChemBioChem
"A valuable survey of this fascinating field. . . I found it to be the most useful book on bioinformatics that I have seen and recommend it very highly."
—American Society for Microbiology News
"This should be on the bookshelf of every molecular biologist."
—The Quarterly Review of Biology
The field of bioinformatics is advancing at a remarkable rate. With the development of new analytical techniques that make use of the latest advances in machine learning and data science, today’s biologists are gaining fantastic new insights into the natural world’s most complex systems. These rapidly progressing innovations can, however, be difficult to keep pace with.
The expanded fourth edition of the best-selling Bioinformatics aims to remedy this by providing students and professionals alike with a comprehensive survey of the current field. Revised to reflect recent advances in computational biology, it offers practical instruction on the gathering, analysis, and interpretation of data, as well as explanations of the most powerful algorithms presently used for biological discovery. Bioinformatics, Fourth Edition offers the most readable, up-to-date, and thorough introduction to the field for biologists at all levels, covering both key concepts that have stood the test of time and the new and important developments driving this fast-moving discipline forwards.
This new edition features:
Bioinformatics is an indispensable companion for researchers, instructors, and students of all levels in molecular biology and computational biology, as well as investigators involved in genomics, clinical research, proteomics, and related fields.
Page count: 1504
Year of publication: 2020
Cover
Foreword
Preface
Contributors
About the Companion Website
1 Biological Sequence Databases
Introduction
Nucleotide Sequence Databases
Nucleotide Sequence Flatfiles: A Dissection
Protein Sequence Databases
Summary
Acknowledgments
Internet Resources
Further Reading
References
2 Information Retrieval from Biological Databases
Introduction
Integrated Information Retrieval: The Entrez System
Medical Databases
Organismal Sequence Databases Beyond NCBI
Summary
Further Reading
References
3 Assessing Pairwise Sequence Similarity: BLAST and FASTA
Introduction
Global Versus Local Sequence Alignments
Scoring Matrices
BLAST
BLAST 2 Sequences
MegaBLAST
PSI-BLAST
BLAT
FASTA
Summary
Further Reading
References
4 Genome Browsers
Introduction
The UCSC Genome Browser
UCSC Table Browser
ENSEMBL Genome Browser
Ensembl Biomart
JBrowse
Summary
Further Reading
References
5 Genome Annotation
Introduction
Gene Prediction Methods
Ab Initio Gene Prediction in Prokaryotic Genomes
Ab Initio Gene Prediction in Eukaryotic Genomes
How Well Do Gene Predictors Work?
Assessing Prokaryotic Gene Predictors
Assessing Eukaryotic Gene Predictors
Evidence Generation for Genome Annotation
Gene Annotation and Evidence Generation using Comparative Gene Prediction
Genome Annotation Pipelines
Summary
Acknowledgments
Internet Resources
Further Reading
References
6 Predictive Methods Using RNA Sequences
Introduction
Overview of RNA Secondary Structure Prediction Using Thermodynamics
Dynamic Programming
Accuracy of RNA Secondary Structure Prediction
Predicting the Secondary Structure Common to Multiple RNA Sequences
Practical Introduction to Single-Sequence Methods
Practical Introduction to Multiple Sequence Methods
Other Computational Methods to Study RNA Structure
Comparison of Methods
Predicting RNA Tertiary Structure
Summary
Further Reading
References
7 Predictive Methods Using Protein Sequences
Introduction
One-Dimensional Prediction of Protein Structure
Predicting Protein Function
Summary
Further Reading
References
8 Multiple Sequence Alignments
Introduction
Measuring Multiple Alignment Quality
Making an Alignment: Practical Issues
Commonly Used Alignment Packages
Viewing a Multiple Alignment
Summary
References
9 Molecular Evolution and Phylogenetic Analysis
Introduction
Early Classification Schemes
Sequences As Molecular Clocks
Background Terminology and the Basics
How to Construct a Tree
Marker-Based Evolution Studies
Phylogenetic Analysis and Data Integration
Future Challenges
References
10 Expression Analysis
Introduction
Step 0: Choose an Expression Analysis Technology
Step 1: Design the Experiment
Step 2: Collect and Manage the Data – and Metadata
Step 3: Data Pre-Processing
Step 4: Quality Control
Step 5: Normalization and Batch Effects
Step 6: Exploratory Data Analysis
Step 7: Differential Expression Analysis
Step 8: Exploring Mechanisms Through Functional Enrichment Analysis
Step 9: Developing a Classifier
Single-Cell Sequencing
Summary
Further Reading
References
11 Proteomics and Protein Identification by Mass Spectrometry
Introduction
Mass Spectrometry
Tandem Mass Spectrometry for Peptide Identification
Sample Preparation
Bioinformatics Analysis for MS-based Proteomics
Proteomics Strategies
Peptide Mass Fingerprinting
PMF on the Web
Proteomics and Tandem MS
PSM Software
PSM on the Web
Reporting Standards
Proteomics Data Repositories
Protein/Proteomics Databases
Selected Applications of Proteomics
Summary
Acknowledgments
Internet Resources
Further Reading
References
12 Protein Structure Prediction and Analysis
Introduction to Protein Structures
How Protein Structures are Determined
How Protein Structures are Described
Protein Structure Databases
Visualizing Proteins
Protein Structure Prediction
Protein Structure Evaluation
Protein Structure Comparison
Summary
Further Reading
References
13 Biological Networks and Pathways
Introduction
Pathway and Molecular Interaction Mapping: Experiments and Predictions
Pathway and Molecular Interaction Databases: An Overview
Pathway Databases
Molecular Interaction Databases
Functional Interaction Databases
Strategies for Navigating Pathway and Interaction Databases
Standard Data Formats for Pathways and Molecular Interactions
Pathway Visualization and Analysis
Network Visualization and Analysis
Summary
Acknowledgments
Internet Resources
Further Reading
References
14 Metabolomics
Introduction
Data Formats
Databases
Bioinformatics for Metabolite Identification
Multivariate Statistics
Bioinformatics for Metabolite Interpretation
Summary
Further Reading
References
15 Population Genetics
Introduction
Evolutionary Processes and Genetic Variation
Allele Frequencies and Population Variation
Display Methods
Demographic History Inference
Admixture and Ancestry Estimation
Detection of Natural Selection
Other Applications
Summary
References
16 Metagenomics and Microbial Community Analysis
Introduction
Why Study the Microbiome?
The Origins of Microbiome Analysis
Metagenomic Workflow
General Considerations in Marker-Gene and Metagenomic Data Analysis
Marker Genes
Metagenomic Data Analysis
Other Techniques to Characterize the Microbiome
Summary
Further Reading
References
17 Translational Bioinformatics
Introduction
Databases Describing the Genetics of Human Health
Prediction and Characterization of Impactful Genetic Variants from Sequence
Computing with Patient Phenotype Using Data in Electronic Health Records
Informatics and Precision Medicine
Ethical, Legal, and Social Implications of Translational Medicine
Summary
References
18 Statistical Methods for Biologists
Introduction
Descriptive Representations of Data
Statistical Inference and Statistical Hypothesis Testing
Summary
Acknowledgments
Internet Resources
Further Reading
References
Appendices
1.1 Example of a Flatfile Header in ENA Format
1.2 Example of a Flatfile Header in DDBJ/GenBank Format
1.3 Example of a Feature Table in ENA Format
1.4 Example of a Feature Table in GenBank/DDBJ Format
6.1 Dynamic Programming
Reference
Glossary
Index
End User License Agreement
Chapter 1
Table 1.1 Indicating locations within the feature table.
Chapter 2
Table 2.1 Entrez Boolean search statements.
Chapter 3
Table 3.1 Selecting an appropriate scoring matrix.
Table 3.2 BLAST algorithms.
Table 3.3 Main FASTA algorithms.
Chapter 7
Table 7.1 Disorder prediction performance.
Table 7.2 Performance of selected gene ontology term prediction methods in CAFA2...
Chapter 8
Table 8.1 Aligner performance on BAliBASE3 benchmark.
Chapter 9
Table 9.1 Some common software packages implementing different phylogenetic anal...
Chapter 11
Table 11.1 List of common sources of protein sequences (used in FASTA format).
Table 11.2 Standard search parameters used with sequence database search engines...
Chapter 12
Table 12.1 Relationship between backbone root mean square deviation (RMSD, i...
Chapter 14
Table 14.1 A list of freely available molecular editors and visualization tools.
Table 14.2 A list of open access chemical, spectral, pathway, and metabolomic da...
Chapter 15
Table 15.1 Examples of genes that have undergone natural selection in human popu...
Chapter 17
Table 17.1 Examples of commonly used biomedical ontologies and terminologies in ...
Chapter 18
Table 18.1 Common parametric statistical tests and their non-parametric equivale...
Chapter 1
Figure 1.1 The landing page for ENA record U54469.1, providing a graphical vie...
Figure 1.2 Results of a search for the human heterogeneous nuclear ribosomal p...
Figure 1.3 The Subcellular location and Pathology & Biotech sections of ...
Figure 1.4 The Feature viewer rendering of the record for the human heterogene...
Figure 1.5 Expanding the PTM, Structural features, and Variants sections withi...
Chapter 2
Figure 2.1 The exponential growth of GenBank in terms of number of nucleotides...
Figure 2.2 Results of a text-based Entrez query against PubMed using Boolean o...
Figure 2.3 An example of a PubMed record in Abstract format, as returned throu...
Figure 2.4 Neighbors to an entry found in PubMed. The original entry from Figu...
Figure 2.5 The Entrez Gene page for the DCC (deleted in colorectal carcinoma) ...
Figure 2.6 A section of the Database of Single Nucleotide Polymorphisms (dbSNP...
Figure 2.7 Entries in the RefSeq protein database corresponding to the origina...
Figure 2.8 The RefSeq entry for the netrin receptor, the protein product of th...
Figure 2.9 The same RefSeq entry for the netrin receptor shown in Figure 2.8, ...
Figure 2.10 Protein structures associated with the RefSeq entry for the human ...
Figure 2.11 The structure summary page for pdb:4URT, the crystal structure of ...
Figure 2.12 A list of structures deemed similar to pdb:4URT using VAST+. The t...
Figure 2.13 Online Mendelian Inheritance in Man (OMIM) entries related to the
Figure 2.14 The Online Mendelian Inheritance in Man (OMIM) entry for the DCC g...
Figure 2.15 An example of a list of allelic variants that can be found through...
Figure 2.16 The ClinicalTrials.gov page showing all actively recruiting clinic...
Figure 2.17 A clickable map showing where actively recruiting clinical trials ...
Figure 2.18 The Mouse Genome Informatics (MGI) entry for the Dcc gene in mouse...
Figure 2.19 The Zebrafish Information Network (ZFIN) gene page for the dcc gen...
Figure 2.20 An example of gene expression data available through the Zebrafish...
Chapter 3
Figure 3.1 The BLOSUM62 scoring matrix (Henikoff and Henikoff 1992). BLOSUM62 ...
Figure 3.2 A nucleotide scoring table. The scoring for the four nucleotide bas...
Figure 3.3 The initiation of a BLAST search. The search begins with query word...
Figure 3.4 BLAST search extension. Length of extension represents the number o...
Figure 3.5 The National Center for Biotechnology Information (NCBI) BLAST land...
Figure 3.6 The upper portion of the BLASTP query page. The first section in th...
Figure 3.7 The lower portion of the BLASTP query page, showing algorithm param...
Figure 3.8 Graphical display of BLASTP results. The query sequence is represen...
Figure 3.9 The BLASTP “hit list.” For each sequence found, the user is present...
Figure 3.10 Detailed information on a representative BLASTP hit. The header pr...
Figure 3.11 Performing a BLAST 2 Sequences alignment. Clicking the check box a...
Figure 3.12 Typical output from a BLAST 2 Sequences alignment, based on the qu...
Figure 3.13 Constructing a position-specific scoring matrix (PSSM). In the upp...
Figure 3.14 Performing a PSI-BLAST search. See text for details.
Figure 3.15 Selecting algorithm parameters for a PSI-BLAST search. See text fo...
Figure 3.16 Results of the first round of a PSI-BLAST search. For each sequenc...
Figure 3.17 Results of the second round of a PSI-BLAST search. New sequences i...
Figure 3.18 Submitting a BLAT query. A rat clone from the Cancer Genome Anatom...
Figure 3.19 Results of a BLAT query. Based on the query submitted in Figure 3....
Figure 3.20 The FASTA search strategy. (a) Once FASTA determines words of leng...
Figure 3.21 Search summary from a protein–protein FASTA search, using the sequ...
Figure 3.22 Hit list for the protein–protein FASTA search described in Figure ...
Chapter 4
Figure 4.1 The home page of the UCSC Genome Browser, showing a query for the g...
Figure 4.2 The default view of the UCSC Genome Browser, showing the genomic co...
Figure 4.3 The genomic context of the human HIF1A gene, after clicking on zoom ...
Figure 4.4 The RefSeq Track Settings page. The track settings pages are used t...
Figure 4.5 The genomic context of the human HIF1A gene, after displaying RefSe...
Figure 4.6 The Get Genomic Sequence page that provides an interface for users ...
Figure 4.7 The genomic context of the human HIF1A gene, after changing the dis...
Figure 4.8 Configuring the track settings for the Common SNPs(150) track. Set ...
Figure 4.9 The genomic context of the human HIF1A gene, after changing the col...
Figure 4.10 The GTEx Gene track, which depicts median gene expression levels i...
Figure 4.11 BLAT search at the UCSC Genome Browser. (a) This page shows the re...
Figure 4.12 Configuring the UCSC Table Browser. The link to the Table Browser ...
Figure 4.13 The home page of the Ensembl Genome Browser, showing a query for t...
Figure 4.14 The Gene tab for the human PAH gene. This landing page provides li...
Figure 4.15 Computationally predicted orthologs of the human PAH gene, from th...
Figure 4.16 The Location tab for the human PAH gene. The Location tab is divid...
Figure 4.17 Zooming in on the bottom section of the Location tab from Figure 4...
Figure 4.18 The Ensembl Variant tab. (a) To get more details about SNP rs76296...
Figure 4.19 The Ensembl Regulatory Build track. (a) Go to Configure this page ...
Figure 4.20 The Synteny view at Ensembl. (a) An overview of the syntenic block...
Figure 4.21 Ensembl BLAST output, showing an alignment between the human ADAM1...
Figure 4.22 Using BioMart to retrieve the mouse orthologs of the human RefSeqs...
Figure 4.23 JBrowse display of a predicted Mnemiopsis gene (ML05372a) from the...
Chapter 5
Figure 5.1 A simplified depiction of a prokaryotic gene or open reading frame ...
Figure 5.2 A simplified depiction of a eukaryotic gene illustrating the multi-...
Figure 5.3 A schematic illustration of the upstream regions of a eukaryotic ge...
Figure 5.4 A schematic illustration of the splice site regions around exons an...
Figure 5.5 Sample output from a GENSCAN analysis of the uroporphyrinogen decar...
Figure 5.6 Schematic representation of measures of gene prediction accuracy at...
Figure 5.7 The typical L-shaped structure of a tRNA molecule. This depicts the...
Figure 5.8 A screenshot montage of the PHASTER web server showing the website ...
Figure 5.9 A screenshot of a BASys bacterial genome annotation output for the ...
Chapter 6
Figure 6.1 The three levels of organization of RNA structure. (a) The primary ...
Figure 6.2 The RNA secondary structure of the 3′ untranslated region of the Dr...
Figure 6.3 An illustration of the equilibria of RNA structures in solution. (a...
Figure 6.4 Prediction of conformational free energy for a conformation of RNA ...
Figure 6.5 A simple RNA pseudoknot. This figure illustrates two representation...
Figure 6.6 The input form for the version 3.1 Mfold server. (a) The top and (b...
Figure 6.7 The output page for the Mfold server. Please refer to the text for ...
Figure 6.8 Sample output from the Mfold web server, version 3.1. (a) The secon...
Figure 6.9 RNAstructure web server input form. (a) The top and (b) the bottom ...
Figure 6.10 Sample output from the RNAstructure web server showing the predict...
Figure 6.11 Input form for the RNAstructure web server for multiple-sequence p...
Figure 6.12 Sample output from the RNAstructure web server for multiple-sequen...
Chapter 7
Figure 7.1 Dashboard of the PredictProtein web server. PredictProtein (Yachdav...
Figure 7.2 Protein secondary structure. Experimentally determined three-dimens...
Figure 7.3 Accessible surface area (ASA). The ASA describes the surface that i...
Figure 7.4 Protein secondary structure. Prediction of secondary structure, sol...
Figure 7.5 Types of transmembrane proteins. Experimentally determined three-di...
Figure 7.6 Transmembrane helix prediction by TMSEG. TMSEG (Bernhofer et al. 20...
Figure 7.7 Annotations of human tumor suppressor P53 (P53_HUMAN). (a) InterPro...
Figure 7.8 Prediction of subcellular localization. Visual output from LocTree3...
Figure 7.9 From predicting single amino acid sequence variant (SAV) effects to...
Chapter 8
Figure 8.1 An example multiple sequence alignment of seven globin protein sequ...
Figure 8.2 An outline of the simple progressive multiple alignment process. Th...
Figure 8.3 Aligner accuracy versus total single-threaded run time using the BA...
Figure 8.4 Total single-threaded execution time (y-axis) for different aligner...
Figure 8.5 Ratio of total run time relative to single-threaded execution (y-ax...
Figure 8.6 Protein and RNA multiple sequence alignments as visualized using Ja...
Figure 8.7 Linked coding sequence (CDS), protein, and three-dimensional struct...
Chapter 9
Figure 9.1 Different ways to visualize a tree. In this example, the same tree ...
Figure 9.2 Alignments illustrating sequence similarity versus sequence identit...
Figure 9.3 The differences between orthologs, paralogs, and xenologs. The ance...
Figure 9.4 The difference between phylogenetic signal and phylogenetic noise. ...
Figure 9.5 Character-based versus distance-based phylogenetic methods. Charact...
Figure 9.6 Rooting a tree with an outgroup. Escherichia coli bacteria are comm...
Figure 9.7 Workflow for a protein-based phylogenetic analysis using the PHYLIP...
Figure 9.8 Phylogenetic relationships can be visualized using different types ...
Figure 9.9 Excerpt of a Salmonella minimum spanning tree. Types of Salmonella ...
Chapter 10
Figure 10.1 Example of an MA plot before (a) and after (b) normalization. A, o...
Figure 10.2 Histogram of the base mismatch (MM) rate across multiple RNA-seq s...
Figure 10.3 Overview of quantile normalization. We start with the box on the t...
Figure 10.4 Batch effects principal components analysis (PCA) example. Boxplot...
Figure 10.5 A simple illustration of the process of hierarchical clustering. (...
Figure 10.6 Heatmap showing clustering of gene expression data of the 100 most...
Figure 10.7 First two components of principal component analysis (PCA) on the ...
Figure 10.8 Principal component analysis (PCA) is a dimensionality reduction m...
Figure 10.9 Illustration of how one can select k when performing consensus clu...
Figure 10.10 Receiver operating characteristic (ROC) curve for a model desig...
Chapter 11
Figure 11.1 Gene(s) to proteoforms. This figure illustrates the complexity of ...
Figure 11.2 Quadrupole mass analyzer. Schematic of a quadrupole mass analyzer,...
Figure 11.3 Time of flight (TOF) mass analyzer. Schematic of a TOF mass analyz...
Figure 11.4 (a) Tandem mass spectrometry (MS). Schematic of a triple quadrupol...
Figure 11.5 Fragmentation tandem mass spectrometry (MS/MS, or MS²) spectrum. A...
Figure 11.6 Polypeptide backbone cleavage produces different product ion speci...
Figure 11.7 Post-translational modifications (PTMs) take place at different am...
Figure 11.8 Data pre-processing workflow of a mass spectrum. Different steps i...
Figure 11.9 Shotgun proteomics workflow. Schematic showing different steps inv...
Figure 11.10 A schematic diagram comparing the label-free approach with the di...
Figure 11.11 Peptide mass fingerprinting (PMF) workflow. Schematic showing dif...
Figure 11.12 Mascot peptide mass fingerprinting (PMF). PMF submission screen a...
Figure 11.13 Peptide sequencing via tandem mass spectrometry (MS/MS) spectra i...
Figure 11.14 Peptide sequence tag searching. Schematic illustrating how a sequ...
Figure 11.15 Peptide spectrum match (PSM). Annotated MS² spectrum showing matc...
Figure 11.16 Mascot search engine. Mascot MS² database search submission windo...
Figure 11.17 Proteomics. A broad classification of proteomics and the biologic...
Chapter 12
Figure 12.1 A flow diagram illustrating the steps used to experimentally prepa...
Figure 12.2 An example of a nuclear magnetic resonance (NMR) “blurrogram” of a...
Figure 12.3 The different levels of protein structures illustrating: (a) prima...
Figure 12.4 Examples of different types of protein folds including (a) the fou...
Figure 12.5 An illustration of standard amino acid residue and peptide bond ge...
Figure 12.6 An example of a Protein Data Bank formatted file showing the first...
Figure 12.7 A Ramachandran plot for the thioredoxin protein (Protein Data Bank...
Figure 12.8 A screenshot of the Research Collaboratory for Structural Bioinfor...
Figure 12.9 A screenshot of an image of Escherichia coli thioredoxin as genera...
Figure 12.10 An illustration of the four major approaches to rendering protein...
Figure 12.11 An example of the high-quality images that can be created using a...
Figure 12.12 An illustration of a homology model (b) of Escherichia coli thior...
Figure 12.13 A schematic illustration of how threading is performed. (a) A que...
Figure 12.14 An example of the high-quality postscript output data from PROCHE...
Figure 12.15 An example of the CATH database description of Escherichia coli t...
Chapter 13
Figure 13.1 The Reactome database pathway view. The central view shows pathway...
Figure 13.2 The EcoCyc database cellular overview of Escherichia coli metaboli...
Figure 13.3 An example of metabolic pathway reconstruction from Kyoto Encyclop...
Figure 13.4 A BioGRID database record. A screenshot of the result page for a B...
Figure 13.5 An IntAct database search for the human MDM2 gene. A summary of al...
Figure 13.6 An example of the main STRING query result page. A network of rela...
Figure 13.7 A query result from GeneMANIA. Each node in the network represents...
Figure 13.8 The AKT pathway as represented by a traditional method (top left, ...
Figure 13.9 The main components of the Proteomics Standards Initiative–Molecul...
Figure 13.10 The valine biosynthesis pathway dynamically drawn by the Pathway ...
Figure 13.11 Output from the PathVisio software showing a portion of a human c...
Figure 13.12 The set of symbol types available in the Systems Biology Graphica...
Figure 13.13 The Drosophila melanogaster cell cycle drawn using Systems Biolog...
Figure 13.14 The results of pathway enrichment analysis using the g:Profiler t...
Figure 13.15 A Gene Set Enrichment Analysis (GSEA) enrichment figure. The bott...
Figure 13.16 An enrichment map showing two enriched themes. Each node represen...
Figure 13.17 An introduction to terminology and visual notation used in the co...
Figure 13.18 Zooming in on a network in Cytoscape shows part of a large connec...
Figure 13.19 An overview of a pathway analysis workflow, summarizing multiple ...
Chapter 14
Figure 14.1 A diagram illustrating the typical workflow for a metabolomic expe...
Figure 14.2 An example of a Molecular Design Limited (MDL) chemical fingerprin...
Figure 14.3 An example of a MOL file for a two-dimensional representation of L...
Figure 14.4 An example of an nmrML data file for L-alanine. The actual file is...
Figure 14.5 The JSpectraViewer image for L-alanine. JSpectraViewer is a Java a...
Figure 14.6 A selection of two screenshots from the PubChem web pages for the ...
Figure 14.7 Two screenshots of the gas chromatography–mass spectrometry (GC-MS...
Figure 14.8 Two screenshots from the Human Metabolome Database (HMDB) entry fo...
Figure 14.9 A simplified illustration of how spectral deconvolution works for ...
Figure 14.10 Two screenshots of the Bayesil web server. (a) A nuclear magnetic...
Figure 14.11 An illustration of how spectral deconvolution works for gas chrom...
Figure 14.12 An illustration of how principal component analysis can be though...
Figure 14.13 A three-dimensional principal component analysis (PCA) “scores” p...
Figure 14.14 The MetaboAnalyst Module Overview page. This page allows users to...
Figure 14.15 The MetaboAnalyst Data Upload page. This page allows users to upl...
Figure 14.16 The MetaboAnalyst Data Normalization page. The optimal normalizat...
Figure 14.17 The MetaboAnalyst Data Normalization and Scaling results, generat...
Figure 14.18 A two-dimensional principal component analysis (PCA) “scores” plo...
Figure 14.19 The principal component analysis (PCA) “loadings” plot, showing t...
Figure 14.20 The partial least squares discriminant analysis (PLS-DA) plot sho...
Figure 14.21 An example of an R²/Q² plot generated by MetaboAnalyst using the ...
Figure 14.22 A variable importance in projection plot showing which metabolite...
Figure 14.23 A pathway impact plot showing the importance of different pathway...
Chapter 15
Figure 15.1 Principal components analysis (PCA) of nine world populations and ...
Figure 15.2 The coalescent process. Although the ancestral population contains...
Figure 15.3 Multiple sequentially Markovian coalescent (MSMC) estimate of popu...
Figure 15.4 Admixture analysis of nine populations and three test samples. Ind...
Figure 15.5 A Manhattan plot of Composite of Multiple Signals (CMS) scores (Y-...
Chapter 16
Figure 16.1 General workflow for DNA-based microbiome analysis.
Figure 16.2 FastQC summary of DNA sequence read quality for an Illumina sequen...
Figure 16.3 Primary structure and variable regions of the 16S ribosomal RNA ge...
Figure 16.4 k-mer decomposition of a nucleotide sequence with k = 2. Two seque...
Figure 16.5 Rarefaction curves for microbial communities sampled from six diff...
Figure 16.6 Unweighted phylogenetic alpha- and beta-diversity measures. Left: ...
Figure 16.7 Principal coordinate analysis (a) vs. non-metric multidimensional ...
Figure 16.8 Visualizing the differences between two groups of gut microbiome s...
Chapter 17
Figure 17.1 ClinVar entry for a benign variant in the cystic fibrosis gene (CF...
Figure 17.2 Receiver operating characteristic (ROCs) curves of five submissi...
Chapter 18
Figure 18.1 Relationships between observation, data, information, and knowledg...
Figure 18.2 Types of variables and their hierarchical relationships.
Figure 18.3 Organization of an example dataset. (a) Part of a two-dimensional ...
Figure 18.4 Commonly used descriptive statistics for sample variables. Light b...
Figure 18.5 Covariance versus correlation. The red sample has higher sample va...
Figure 18.6 Example histogram demonstrating the frequency of black cherry tree...
Figure 18.7 Example boxplot and related variant graphs. (a) Schematic diagram ...
Figure 18.8 Anscombe's quartet. Scatterplots with regression lines for four fa...
Figure 18.9 Scatterplot of the first two principal components (PCs) from princ...
Figure 18.10 Example of how to make a graph descriptive.
Figure 18.11 The standard normal distribution.
Figure 18.12 Other well-described discrete and continuous distributions common...
Figure 18.13 Bond length and coordination angle histograms for coordinated met...
Figure 18.14 Overview of the process of statistical inference. FUV stands for ...
Figure 18.15 Truth table with descriptions of type I and II errors.
Figure 18.16 Diagram illustrating the relationships between a probability dens...
Figure 18.17 Using a Student's t-test to test a null hypothesis.
Figure 18.18 Relationship between population and sample mean distributions.
Figure 18.19 An approximate power analysis diagram for a Student's t-test.
Appendices
Figure 6.A.1 Pseudo-computer code for the fill order of V(i,j) and W(i,j). Thi...
Figure 6.A.2 The filled V(i,j) array for sequence GCGGGUACCGAUCGUCGC.
Figure 6.A.3 The filled W(i,j) array for sequence GCGGGUACCGAUCGUCGC.
Figure 6.A.4 Illustrations of maximum hydrogen bond conformations as found by ...
Figure 6.A.5 Flowchart for structure traceback. Traceback starts by placing 1,...
Figure 6.A.6 The secondary structure of rGCGGGUACCGAUCGUCGC with 17 hydrogen b...
Edited by
Andreas D. Baxevanis, Gary D. Bader, and David S. Wishart
Fourth Edition
This fourth edition first published 2020
© 2020 John Wiley & Sons, Inc.
Edition History
Wiley-Blackwell (1e, 2000), Wiley-Blackwell (2e, 2001), Wiley-Blackwell (3e, 2005)
All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, except as permitted by law. Advice on how to obtain permission to reuse material from this title is available at http://www.wiley.com/go/permissions.
The right of Andreas D. Baxevanis, Gary D. Bader, and David S. Wishart to be identified as the authors of the editorial material in this work has been asserted in accordance with law.
Registered Office
John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, USA
Editorial Office
John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, USA
For details of our global editorial offices, customer services, and more information about Wiley products visit us at www.wiley.com.
Wiley also publishes its books in a variety of electronic formats and by print-on-demand. Some content that appears in standard print versions of this book may not be available in other formats.
Limit of Liability/Disclaimer of Warranty
While the publisher and authors have used their best efforts in preparing this work, they make no representations or warranties with respect to the accuracy or completeness of the contents of this work and specifically disclaim all warranties, including without limitation any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives, written sales materials or promotional statements for this work. The fact that an organization, website, or product is referred to in this work as a citation and/or potential source of further information does not mean that the publisher and authors endorse the information or services the organization, website, or product may provide or recommendations it may make. This work is sold with the understanding that the publisher is not engaged in rendering professional services. The advice and strategies contained herein may not be suitable for your situation. You should consult with a specialist where appropriate. Further, readers should be aware that websites listed in this work may have changed or disappeared between when this work was written and when it is read. Neither the publisher nor authors shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.
Library of Congress Cataloging-in-Publication Data
Names: Baxevanis, Andreas D., editor. | Bader, Gary D., editor. | Wishart, David S., editor.
Title: Bioinformatics / edited by Andreas D. Baxevanis, Gary D. Bader, David S. Wishart.
Other titles: Bioinformatics (Baxevanis)
Description: Fourth edition. | Hoboken, NJ : Wiley, 2020. | Includes bibliographical references and index.
Identifiers: LCCN 2019030489 (print) | ISBN 9781119335580 (cloth) | ISBN 9781119335962 (adobe pdf) | ISBN 9781119335955 (epub)
Subjects: MESH: Computational Biology--methods | Sequence Analysis--methods | Base Sequence | Databases, Nucleic Acid | Databases, Protein
Classification: LCC QH324.2 (print) | LCC QH324.2 (ebook) | NLM QU 550.5.S4 | DDC 570.285–dc23
LC record available at https://lccn.loc.gov/2019030489
LC ebook record available at https://lccn.loc.gov/2019030490
Cover Design: Wiley
Cover Images: © David Wishart, background © Suebsiri/Getty Images
As I review the material presented in the fourth edition of Bioinformatics I am moved in two ways, related to both the past and the future.
Looking to the past, I am moved by the amazing evolution that has occurred in our field since the first edition of this book appeared in 1998. Twenty-one years is a long, long time in any scientific field, but especially so in the agile field of bioinformatics. To use the well-trodden metaphor of the “biology moonshot,” the launchpad at the beginning of the twenty-first century was the determination of the human genome. Discovery is not the right word for what transpired – we knew it was there and what was needed. Synergy is perhaps a better word; synergy of technological development, experiment, computation, and policy. A truly collaborative effort to continuously share, in a reusable way, the collective efforts of many scientists. Bioinformatics was born from this synergy and has continued to grow and flourish based on these principles.
That growth is reflected in both the scope and depth of what is covered in these pages. These attributes are a reflection of the increased complexity of the biological systems that we study (moving from “simple” model organisms to the human condition) and the scales at which those studies take place. As a community we have professed multiscale modeling without much to show for it, but it would seem to be finally here. We now have the ability to connect the dots from molecular interactions, through the pathways to which those molecules belong, to the cells they affect, to the interactions between those cells, and through to the effects they have on individuals within a population. Tools and methodologies that were novel in earlier editions of this book are now routine or obsolete, and newer, faster, and more accurate procedures are now with us. This will continue, and as such this book provides a valuable snapshot of the scope and depth of the field as it exists today.
Looking to the future, this book provides a foundation for what is to come. For me this is a field more aptly referred to (and perhaps a new subtitle for the next edition) as Biomedical Data Science. Sitting as I do now, as Dean of a School of Data Science which collaborates openly across all disciplines, I see rapid change akin to what happened to birth bioinformatics 20 or more years ago. It will not take 20 years for other disciplines to catch up; I predict it will take 2! The accomplishments outlined in this book can help define what other disciplines will accomplish with their own data in the years to come. Statistical methods, cloud computing, data analytics, notably deep learning, the management of large data, visualization, ethics policy, and the law surrounding data are generic. Bioinformatics has so much to offer, yet it will also be influenced by other fields in a way that has not happened before. Forty-five years in academia tells me that there is nothing to compare across campuses to what is happening today. This is both an opportunity and a threat. The editors and authors of this edition should be complimented for setting the stage for what is to come.
Philip E. Bourne, University of Virginia
In putting together this textbook, we hope that students from a range of fields – including biology, computer science, engineering, physics, mathematics, and statistics – benefit by having a convenient starting point for learning most of the core concepts and many useful practical skills in the field of bioinformatics, also known as computational biology.
Students interested in bioinformatics often ask how they should acquire training in such an interdisciplinary field. In an ideal world, students would become experts in all the fields mentioned above, but this is not necessary and, realistically, too much to ask. All that is required is to combine their scientific interests with a foundation in biology and any single quantitative field of their choosing. While the most common combination is to mix biology with computer science, incredible discoveries have been made through finding creative intersections with any number of quantitative fields. Indeed, many of these quantitative fields overlap a great deal, especially given their foundational use of mathematics and computer programming. These natural relationships between fields provide the foundation for integrating diverse expertise and insights, especially in the context of performing bioinformatic analyses.
While bioinformatics is often considered an independent subfield of biology, it is likely that the next generation of biologists will not consider bioinformatics as being separate and will instead acquire bioinformatics and data science skills as naturally as they learn how to use a pipette. They will learn how to program a computer, likely starting in elementary school. Other data science knowledge areas, such as math, statistics, machine learning, data processing, and data visualization, will also be part of any core curriculum. Indeed, the children of one of the editors recently learned how to construct bar plots and other data charts in kindergarten! The same editor is teaching programming in R (an important data science programming language) to all incoming biology graduate students at his university starting this year.
As bioinformatics and data science become more naturally integrated in biology, it is worth noting that these fields actively espouse a culture of open science. This culture is motivated by thinking about why we do science in the first place. We may be curious or like problem solving. We could also be motivated by the benefits to humanity that scientific advances bring, such as tangible health and economic benefits. Whatever the motivating factor, it is clear that the most efficient way to solve hard problems is to work together as a team, in a complementary fashion and without duplication of effort. The only way to make sure this works effectively is to efficiently share knowledge and coordinate work across disciplines and research groups. Presenting scientific results in a reproducible way, such as freely sharing the code and data underlying the results, is also critical. Fortunately, there are an increasing number of resources that can help facilitate these goals, including the bioRxiv preprint server, where papers can be shared before the very long process of peer review is completed; GitHub, for sharing computer code; and data science notebook technology that helps combine code, figures, and text in a way that makes it easier to share reproducible and reusable results.
We hope this textbook helps catalyze this transition of biology to a quantitative, data science-intensive field. As biological research advances become ever more built on interdisciplinary, open, and team science, progress will dramatically speed up, laying the groundwork for fantastic new discoveries in the future.
We also deeply thank all of the chapter authors for contributing their knowledge and time to help the many future readers of this book learn how to apply the myriad bioinformatic techniques covered within these pages to their own research questions.
Andreas D. Baxevanis
Gary D. Bader
David S. Wishart
Gary D. Bader, PhD is a Professor at The Donnelly Centre at the University of Toronto, Toronto, Canada, and a leader in the field of Network Biology. Gary completed his postdoctoral work in Chris Sander's group in the Computational Biology Center (cBio) at Memorial Sloan-Kettering Cancer Center in New York. Gary completed his PhD in the laboratory of Christopher Hogue in the Department of Biochemistry at the University of Toronto and a BSc in Biochemistry at McGill University in Montreal. Dr. Bader uses molecular interaction, pathway, and -omics data to gain a “causal” mechanistic understanding of normal and disease phenotypes. His laboratory develops novel computational approaches that combine molecular interaction and pathway information with -omics data to develop clinically predictive models and identify therapeutically targetable pathways. He also helps lead the Cytoscape, GeneMANIA, and Pathway Commons pathway and network analysis projects.
Geoffrey J. Barton, PhD is Professor of Bioinformatics and Head of the Division of Computational Biology at the University of Dundee School of Life Sciences, Dundee, UK. Before moving to Dundee in 2001, he was Head of the Protein Data Bank in Europe and the leader of the Research and Development Team at the EMBL European Bioinformatics Institute (EBI). Prior to joining EMBL-EBI, he was Head of Genome Informatics at the Wellcome Trust Centre for Human Genetics, University of Oxford, a position he held concurrently with a Royal Society University Research Fellowship in the Department of Biochemistry. Geoff's longest running research interest is using computational methods to study the relationship between a protein's sequence, its structure, and its function. His group has contributed many tools and techniques in the field of protein sequence and structure analysis and structure prediction. Two of the best known are the Jalview multiple alignment visualization and analysis workbench, which is in use by over 70 000 groups for research and teaching, and the JPred multi-neural net protein secondary structure prediction algorithm, which performs predictions on up to 500 000 proteins/month for users worldwide. In addition to his work related to protein sequence and structure, Geoff has collaborated on many projects that probe biological processes using proteomic and high-throughput sequencing approaches. Geoff's group has deep expertise in RNA-seq methods and has recently published a two-condition 48-replicate RNA-seq study that is now a key reference work for users of this technology.
Andreas D. Baxevanis, PhD is the Director of Computational Biology for the National Institutes of Health's (NIH) Intramural Research Program. He is also a Senior Scientist leading the Computational Genomics Unit at the NIH's National Human Genome Research Institute, Bethesda, MD, USA. His research program is centered on probing the interface between genomics and developmental biology, focusing on the sequencing and analysis of invertebrate genomes that can yield insights of relevance to human health, particularly in the areas of regeneration, allorecognition, and stem cell biology. His accomplishments have been recognized by the Bodossaki Foundation's Academic Prize in Medicine and Biology in 2000, Greece's highest award for young scientists of Greek heritage. In 2014, he was elected to the Johns Hopkins Society of Scholars, recognizing alumni who have achieved marked distinction in their field of study. He was the recipient of the NIH's Ruth L. Kirschstein Mentoring Award in 2015, in recognition of his commitment to scientific training, education, and mentoring. In 2016, Dr. Baxevanis was elected as a Senior Member of the International Society for Computational Biology for his sustained contributions to the field and, in 2018, he was elected as a Fellow of the American Association for the Advancement of Science for his distinguished contributions to the field of comparative genomics.
Robert G. Beiko, PhD is a Professor and Associate Dean for Research in the Faculty of Computer Science at Dalhousie University, Halifax, Nova Scotia, Canada. He is a former Tier II Canada Research Chair in Bioinformatics (2007–2017), an Associate Editor at mSystems and BMC Bioinformatics, and a founding organizer of the Canadian Bioinformatics Workshops in Metagenomics and Genomic Epidemiology. He is also the lead editor of the recently published book Microbiome Analysis in the Methods in Molecular Biology series. His research focuses on microbial genomics, evolution, and ecology, with concentrations in the area of lateral gene transfer and microbial community analysis.
Fiona S.L. Brinkman, PhD, FRSC is a Professor in Bioinformatics and Genomics in the Department of Molecular Biology and Biochemistry at Simon Fraser University, Vancouver, British Columbia, Canada, with cross-appointments in Computing Science and the Faculty of Health Sciences. She is most known for her research and development of widely used computer software that aids both microbe (PSORTb, IslandViewer) and human genomic (InnateDB) evolutionary/genomics analyses, along with her insights into pathogen evolution. She is currently co-leading a national effort – the Integrated Rapid Infectious Disease Analysis Project – the goal of which is to use microbial genomes as a fingerprint to better track and understand the spread and evolution of infectious diseases. She has also been leading development into an approach to integrate very diverse data for the Canadian CHILD Study birth cohort, including microbiome, genomic, epigenetic, environmental, and social data. She coordinates community-based genome annotation and database development for resources such as the Pseudomonas Genome Database. She also has a strong interest in bioinformatics education, including developing the first undergraduate curricula used as the basis for the first White Paper on Canadian Bioinformatics Training in 2002. She is on several committees and advisory boards, including the Board of Directors for Genome Canada; she chairs the Scientific Advisory Board for the European Nucleotide Archive (EMBL-EBI). She has received a number of awards, including a TR100 award from MIT, and, most recently, was named as a Fellow of the Royal Society of Canada.
Andrew Emili, PhD is a Professor in the Departments of Biochemistry (Medical School) and Biology (Arts and Sciences) at Boston University (BU), Boston, MA, USA, and the inaugural Director of the BU Center for Network Systems Biology (CNSB). Prior to Boston, Dr. Emili was a founding member and Principal Investigator for 18 years at the Donnelly Centre for Cellular and Biomolecular Research at the University of Toronto, one of the premier research centers in integrative molecular biology. Dr. Emili is an internationally recognized leader in functional proteomics, systems biology, and precision mass spectrometry. His group develops and applies innovative technologies to systematically map protein interaction networks and macromolecular complexes of cells and tissues on a global scale, publishing “interactome” maps of unprecedented quality, scope, and resolution.
Tatyana Goldberg, PhD is a postdoctoral scientist at the Technical University of Munich, Germany. She obtained her PhD in Bioinformatics under the supervision of Dr. Burkhard Rost. Her research focuses on developing models that can predict the localization of proteins within cells. The results of her study contribute to a variety of applications, including the development of pharmaceuticals for the treatment of Alzheimer disease and cancer.
Emma J. Griffiths, PhD is a research associate in the Department of Pathology and Laboratory Medicine at the University of British Columbia in Vancouver, Canada, working with Dr. William Hsiao. Dr. Griffiths received her PhD from the Department of Biochemistry and Biomedical Sciences at McMaster University in Hamilton, Canada, with her doctoral work focusing on the evolutionary relationships between different groups of bacteria. She has since pursued postdoctoral training in the fields of chemical and fungal genetics and microbial genomics with Dr. Fiona Brinkman in the Department of Biochemistry and Molecular Biology at Simon Fraser University in Vancouver, Canada. Her current work focuses on the development of ontology-driven applications designed to improve pathogen genomics contextual data (“metadata”) exchange during public health investigations.
Desmond G. Higgins, PhD is Professor of Bioinformatics at University College Dublin, Ireland, where his laboratory works on genomic data analysis and sequence alignment algorithms. He earned his doctoral degree in zoology from Trinity College Dublin, Ireland, and has worked in the field of bioinformatics since 1985. His group maintains and develops the Clustal package for multiple sequence alignment in collaboration with groups in France, Germany, and the United Kingdom. Dr. Higgins wrote the first version of Clustal in Dublin in 1988. He then moved to the EMBL Data Library group located in Heidelberg in 1990 and later to EMBL-EBI in Hinxton. This coincided with the release of ClustalW and, later, ClustalX, which have been extremely widely used and cited. Currently, having run out of version letters, he is working on Clustal Omega, specifically designed for making extremely large protein alignments.
Lynn B. Jorde, PhD has been on the faculty of the University of Utah School of Medicine, Salt Lake City, UT, USA, since 1979 and holds the Mark and Kathie Miller Presidential Endowed Chair in Human Genetics. He was appointed Chair of the Department of Human Genetics in September 2009. Dr. Jorde's laboratory has published scientific articles on human genetic variation, high-altitude adaptation, the genetic basis of human limb malformations, and the genetics of common diseases such as hypertension, juvenile idiopathic arthritis, and inflammatory bowel disease. Dr. Jorde is the lead author of Medical Genetics, a textbook that is now in its fifth edition and translated into multiple foreign languages. He is the co-recipient of the 2008 Award for Excellence in Education from the American Society of Human Genetics (ASHG). He served two 3-year terms on the Board of Directors of ASHG and, in 2011, he was elected as president of ASHG. In 2012, he was elected as a Fellow of the American Association for the Advancement of Science.
Marieke L. Kuijjer, PhD is a Group Leader at the Centre for Molecular Medicine Norway (NCMM, a Nordic EMBL partner), University of Oslo, Norway, where she runs the Computational Biology and Systems Medicine group. She obtained her doctorate in the laboratory of Dr. Pancras Hogendoorn in the Department of Pathology at the Leiden University Medical Center in the Netherlands. After this, she continued her scientific training as a postdoctoral researcher in the laboratory of Dr. John Quackenbush at the Dana-Farber Cancer Institute and Harvard T.H. Chan School of Public Health, during which she won a career development award and a postdoctoral fellowship. Dr. Kuijjer's research focuses on solving fundamental biological questions through the development of new methods in computational and systems biology and on implementing these techniques to better understand gene regulation in cancer. Dr. Kuijjer serves on the editorial board of Cancer Research.
David H. Mathews, MD, PhD is a professor of Biochemistry and Biophysics and also of Biostatistics and Computational Biology at the University of Rochester Medical Center, Rochester, NY, USA. He also serves as the Associate Director of the University of Rochester's Center for RNA Biology. His involvement in education includes directing the Biophysics PhD program and teaching a course in Python programming and algorithms for doctoral students without a programming background. His group studies RNA biology and develops methods for RNA secondary structure prediction and molecular modeling of three-dimensional structure. His group developed and maintains RNAstructure, a widely used software package for RNA structure prediction and analysis.
Sean D. Mooney, PhD has spent his career as a researcher and group leader in biomedical informatics. He now leads Research IT for UW Medicine and is leading efforts to support and build clinical research informatics platforms as its first Chief Research Information Officer (CRIO) and as a Professor in the Department of Biomedical Informatics and Medical Education at the University of Washington, Seattle, WA, USA. Prior to being appointed as CRIO, he was an Associate Professor and Director of Bioinformatics at the Buck Institute for Research on Aging. As an Assistant Professor, he was appointed in Medical and Molecular Genetics at Indiana University School of Medicine and was the founding Director of the Indiana University School of Medicine Bioinformatics Core. In 1997, he received his BS with Distinction in Biochemistry and Molecular Biology from the University of Wisconsin at Madison. He received his PhD from the University of California in San Francisco in 2001, then pursued his postdoctoral studies under an American Cancer Society John Peter Hoffman Fellowship at Stanford University.
Stephen J. Mooney, PhD is an Acting Assistant Professor in the Department of Epidemiology at the University of Washington, Seattle, WA, USA. He developed the CANVAS system for collecting data from Google Street View imagery as a graduate student, and his research focuses on contextual influences on physical activity and transport-related injury. He's a methods geek at heart.
Hunter N.B. Moseley, PhD is an Associate Professor in the Department of Molecular and Cellular Biochemistry at the University of Kentucky, Lexington, KY, USA. He is also the Informatics Core Director within the Resource Center for Stable Isotope Resolved Metabolomics, Associate Director for the Institute for Biomedical Informatics, and a member of the Markey Cancer Center. His research interests include developing computational methods, tools, and models for analyzing and interpreting many types of biological and biophysical data that enable new understanding of biological systems and related disease processes. His formal education spans multiple disciplines including chemistry, mathematics, computer science, and biochemistry, with expertise in algorithm development, mathematical modeling, structural bioinformatics, and systems biochemistry, particularly in the development of automated analyses of nuclear magnetic resonance and mass spectrometry data as well as knowledge–data integration.
Yanay Ofran, PhD is a Professor and head of the Laboratory of Functional Genomics and Systems Biology at Bar Ilan University in Tel Aviv, Israel. His research focuses on biomolecular recognition and its role in health and disease. Professor Ofran is also the founder of Biolojic Design, a biopharmaceutical company that uses artificial intelligence approaches to design epitope-specific antibodies. He is also the co-founder of Ukko, a biotechnology company that uses computational tools to design safe proteins for the food and agriculture sectors.
Joseph N. Paulson, PhD is a Statistical Scientist within Genentech's Department of Biostatistics, San Francisco, CA, USA, working on designing clinical trials and biomarker discovery. Previously, he was a Research Fellow in the Department of Biostatistics and Computational Biology at the Dana-Farber Cancer Institute and Department of Biostatistics at the Harvard T.H. Chan School of Public Health. He graduated with a PhD in Applied Mathematics, Statistics, and Scientific Computation from the University of Maryland, College Park where he was a National Science Foundation Graduate Fellow. As a statistician and computational biologist, his interests include clinical trial design, biomarker discovery, development of computational methods for the analysis of high-throughput sequencing data while accounting for technical artifacts, and the microbiome.
Sadhna Phanse, MSc is a Bioinformatics Analyst at the Donnelly Centre for Cellular and Biomolecular Research at the University of Toronto, Toronto, Canada. She has been active in the field of proteomics since 2006 as a member of the Emili research group. Her current work involves the use of bioinformatics methods to investigate biological systems and molecular association networks in human cells and model organisms.
John Quackenbush, PhD is Professor of Computational Biology and Bioinformatics and Chair of the Department of Biostatistics at the Harvard T.H. Chan School of Public Health, Boston, MA, USA. He also holds appointments in the Channing Division of Network Medicine of Brigham and Women's Hospital and at the Dana-Farber Cancer Institute. He is a recognized expert in computational and systems biology and its applications to the study of a wide range of human diseases and the factors that drive those diseases and their responses to therapy. Dr. Quackenbush has long been an advocate for open science and reproducible research. As a founding member and past president of the Functional Genomics Data Society (FGED), he was a developer of the
