There is an increasing need throughout the biomedical sciences for a greater understanding of knowledge-based systems and their application to genomic and proteomic research. This book discusses knowledge-based and statistical approaches, along with applications in bioinformatics and systems biology. The text emphasizes the integration of different methods for analysing and interpreting biomedical data. This, in turn, can lead to breakthrough biomolecular discoveries, with applications in personalized medicine.
Students, researchers, and industry professionals with a background in biomedical sciences, mathematics, statistics, or computer science will benefit from this book. It will also be useful for readers worldwide who want to master the application of bioinformatics to real-world situations and understand biological problems that motivate algorithms.
Table of Contents
Title Page
Copyright
Preface
List of Contributors
PART I FUNDAMENTALS
Section 1 Knowledge-Driven Approaches
Chapter 1: Knowledge-Based Bioinformatics
1.1 Introduction
1.2 Formal Reasoning for Bioinformatics
1.3 Knowledge Representations
1.4 Collecting Explicit Knowledge
1.5 Representing Common Knowledge
1.6 Capturing Novel Knowledge
1.7 Knowledge Discovery Applications
1.8 Semantic Harmonization: the Power and Limitation of Ontologies
1.9 Text Mining and Extraction
1.10 Gene Expression
1.11 Pathways and Mechanistic Knowledge
1.12 Genotypes and Phenotypes
1.13 The Web's Role in Knowledge Mining
1.14 New Frontiers
1.15 References
Chapter 2: Knowledge-Driven Approaches to Genome-Scale Analysis
2.1 Fundamentals
2.2 Challenges in Knowledge-Driven Approaches
2.3 Current Knowledge-Based Bioinformatics Tools
2.4 3R Systems: Reading, Reasoning and Reporting the Way Towards Biomedical Discovery
2.5 The Hanalyzer: a Proof of 3R Concept
2.6 Acknowledgements
2.7 References
Chapter 3: Technologies and Best Practices for Building Bio-Ontologies
3.1 Introduction
3.2 Knowledge Representation Languages and Tools for Building Bio-Ontologies
3.3 Best Practices for Building Bio-Ontologies
3.4 Conclusion
3.5 Acknowledgements
3.6 References
Chapter 4: Design, Implementation and Updating of Knowledge Bases
4.1 Introduction
4.2 Sources of Data in Bioinformatics Knowledge Bases
4.3 Design of Knowledge Bases
4.4 Implementation of Knowledge Bases
4.5 Updating of Knowledge Bases
4.6 Conclusions
4.7 References
Section 2 Data-Analysis Approaches
Chapter 5: Classical Statistical Learning in Bioinformatics
5.1 Introduction
5.2 Significance Testing
5.3 Exploratory Analysis
5.4 Classification and Prediction
5.5 References
Chapter 6: Bayesian Methods in Genomics and Proteomics Studies
6.1 Introduction
6.2 Bayes Theorem and Some Simple Applications
6.3 Inference of Population Structure from Genetic Marker Data
6.4 Inference of Protein Binding Motifs from Sequence Data
6.5 Inference of Transcriptional Regulatory Networks from Joint Analysis of Protein–DNA Binding Data and Gene Expression Data
6.6 Inference of Protein and Domain Interactions from Yeast Two-Hybrid Data
6.7 Conclusions
6.8 Acknowledgements
6.9 References
Chapter 7: Automatic Text Analysis for Bioinformatics Knowledge Discovery
7.1 Introduction
7.2 Information Needs for Biomedical Text Mining
7.3 Principles of Text Mining
7.4 Development Issues
7.5 Success Stories
7.6 Conclusion
7.7 References
PART II APPLICATIONS
Section 3 Gene and Protein Information
Chapter 8: Fundamentals of Gene Ontology Functional Annotation
8.1 Introduction
8.2 Gene Ontology (GO)
8.3 Comparative Genomics and Electronic Protein Annotation
8.4 Community Annotation
8.5 Limitations
8.6 Accessing GO Annotations
8.7 Conclusions
8.8 References
Chapter 9: Methods for Improving Genome Annotation
9.1 The Basis of Gene Annotation
9.2 The Impact of Next Generation Sequencing on Genome Annotation
9.3 References
Chapter 10: Sequences from Prokaryotic, Eukaryotic, and Viral Genomes Available Clustered According to Phylotype on a Self-Organizing Map
10.1 Introduction
10.2 Batch-Learning SOM (BLSOM) Adapted for Genome Informatics
10.3 Genome Sequence Analyses Using BLSOM
10.4 Conclusions and Discussion
10.5 References
Section 4 Biomolecular Relationships and Meta-Relationships
Chapter 11: Molecular Network Analysis and Applications
11.1 Introduction
11.2 Topology Analysis and Applications
11.3 Network Motif Analysis
11.4 Network Modular Analysis and Applications
11.5 Network Comparison
11.6 Network Analysis Software and Tools
11.7 Summary
11.8 Acknowledgement
11.9 References
Chapter 12: Biological Pathway Analysis: an Overview of Reactome and Other Integrative Pathway Knowledge Bases
12.1 Biological Pathway Analysis and Pathway Knowledge Bases
12.2 Overview of High-Throughput Data Capture Technologies and Data Repositories
12.3 Brief Review of Selected Pathway Knowledge Bases
12.4 How does Information Get into Pathway Knowledge Bases?
12.5 Introduction to Data Exchange Languages
12.6 Visualization Tools
12.7 Use Case: Pathway Analysis in Reactome Using Statistical Analysis of High-Throughput Data Sets
12.8 Discussion: Challenges and Future Directions of Pathway Knowledge Bases
12.9 References
Chapter 13: Methods and Challenges of Identifying Biomolecular Relationships and Networks Associated with Complex Diseases/Phenotypes, and their Application to Drug Treatments
13.1 Complex Traits: Clinical Phenomenology and Molecular Background
13.2 Why Is It Challenging to Infer Relationships between Genes and Phenotypes in Complex Traits?
13.3 Bottom-Up or Top-Down: Which Approach is More Useful in Delineating Complex Traits Key Drivers?
13.4 High-Throughput Technologies and their Applications in Complex Traits Genetics
13.5 Integrative Systems Biology: A Comprehensive Approach to Mining High-Throughput Data
13.6 Methods Applying Systems Biology Approach in the Identification of Functional Relationships from Gene Expression Data
13.7 Advantages of Networks Exploration in Molecular Biology and Drug Discovery
13.8 Practical Examples of Applying Systems Biology Approaches and Network Exploration in the Identification of Functional Modules and Disease-Causing Genes in Complex Phenotypes/Diseases
13.9 Challenges and Future Directions
13.10 References
Trends and Conclusion
Index
This edition first published 2010
© 2010 John Wiley & Sons Ltd
Registered office
John Wiley & Sons Ltd, The Atrium, Southern Gate, Chichester, West Sussex, PO19 8SQ, United Kingdom
For details of our global editorial offices, for customer services and for information about how to apply for permission to reuse the copyright material in this book please see our website at www.wiley.com.
The right of the author to be identified as the author of this work has been asserted in accordance with the Copyright, Designs and Patents Act 1988.
All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, except as permitted by the UK Copyright, Designs and Patents Act 1988, without the prior permission of the publisher.
Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic books.
Designations used by companies to distinguish their products are often claimed as trademarks. All brand names and product names used in this book are trade names, service marks, trademarks or registered trademarks of their respective owners. The publisher is not associated with any product or vendor mentioned in this book. This publication is designed to provide accurate and authoritative information in regard to the subject matter covered. It is sold on the understanding that the publisher is not engaged in rendering professional services. If professional advice or other expert assistance is required, the services of a competent professional should be sought.
Library of Congress Cataloging-in-Publication Data
Knowledge based bioinformatics : from analysis to interpretation / edited by Gil Alterovitz, Marco Ramoni.
p. ; cm.
Includes bibliographical references and index.
ISBN 978-0-470-74831-2 (cloth)
1. Bioinformatics. 2. Expert systems (Computer science) I. Alterovitz, Gil. II. Ramoni, Marco F.
[DNLM: 1. Computational Biology. 2. Expert Systems. 3. Medical Informatics.
4. Molecular Biology. QU 26.5 K725 2010]
QH324.25.K66 2010
572.80285 – dc22
2010010927
A catalogue record for this book is available from the British Library.
ISBN: 978-0-470-74831-2
Preface
The information generated by ongoing biomedical research is increasing rapidly, resulting in a tremendous growth of biological data resources, including protein and gene databases, model organism databases, annotation databases, biomolecular interaction databases, microarray data, scientific literature data, and much more. The challenge lies in the representation, integration, analysis and interpretation of the available knowledge and data. The book, Knowledge-Based Bioinformatics: From Analysis to Interpretation, is an endeavor to address these challenges. The driving force is the need for more background information and broader coverage of recent developments in the field of knowledge-based systems and data-analysis approaches, and their application to the issues that arise from the current growth of biological data in genomic and proteomic research. There is also an opportunity to use these vast amounts of valuable information for the benefit of health and the understanding of disease.
Knowledge-Based Bioinformatics: From Analysis to Interpretation introduces knowledge-driven approaches, methods, and implementation techniques for bioinformatics. The book includes coverage from data-driven Bayesian networks to ontology-based analysis, with applications in the field of bioinformatics. It is divided into four sections. The first section provides an overview of knowledge-driven approaches. Chapter 1, Knowledge-based bioinformatics, presents the current status of biomedical research and the significance of knowledge-driven approaches in analyzing the data generated. The focus is on current utilization of these approaches and the further enhancements required for advancing biomedical knowledge. Chapter 2, Knowledge-driven approaches to genome-scale analysis, further explains the concept and covers various systems used for supporting biomedical discovery in genome-scale data. It emphasizes the importance of knowledge-driven approaches for utilizing existing knowledge, and the challenges to overcome in their development and application. Chapter 3, Technologies and best practices for building bio-ontologies, reviews the process of building bio-ontologies, analyzing the benefits and problems of modeling biological knowledge axiomatically, especially with regard to automated reasoning. It also surveys various knowledge representation languages, tools, and community-level best practices to help the reader make informed decisions when building bio-ontologies. In Chapter 4, Design, implementation and updating of knowledge bases, the focus is on the architecture of knowledge bases. It describes various bioinformatics knowledge bases, the approaches taken to meet the challenges of acquisition, maintenance, and interpretation of large amounts of data, and the methodology to mine the data efficiently.
In the second section, the focus shifts from knowledge-driven approaches to data-analysis approaches. Chapter 5, Classical statistical learning in bioinformatics, reviews various statistical methods and recent advances in analysis and interpretation of the data. Also in this chapter, classical concerns with multiple testing with focus on the empirical Bayes method, practical issues to be considered in treatments for genomics, various investigative analysis procedures, and traditional and modern classification procedures are reviewed. Chapter 6, Bayesian methods in genomics and proteomics studies, provides further insight into the Bayesian methods. The chapter focuses on concepts in Bayesian methods, computational methods for statistical inference of Bayesian models, and their applications in genomics and proteomics. Chapter 7, Automatic text analysis for bioinformatics knowledge discovery, introduces the basic concepts and current methodologies applied in biomedical text mining. The chapter provides an outlook on recent advances in automatic literature analysis and the contribution to knowledge discovery in the biomedical domain as well as integration of bioinformatics knowledge bases and the results from automatic literature analysis.
The third section covers gene and protein information. Chapter 8, Fundamentals of gene ontology functional annotation, reviews the current approach to functional annotation with emphasis on Gene Ontology annotation. Also, the chapter reviews currently available mainstream GO browsers and methods to access GO annotations from some of the more specialized GO browsers, as well as the effect of functional gene annotation on biological data analysis. Chapter 9, Methods for improving genome annotation, focuses on recent progress in automated and manual annotations and their application to produce the human consensus coding sequence gene set, and also describes various types of non-coding loci found within the human genome. Chapter 10, Sequences from prokaryotic, eukaryotic, and viral genomes available clustered according to phylotype on a Self-Organizing Map, demonstrates a novel bioinformatics tool for large-scale comprehensive studies of phylotype-specific sequence characteristics for a wide range of genomes. The chapter discusses this interesting method of genome analysis, which could provide a new systematic strategy for revealing microbial diversity and the relative abundance of different phylotype members of uncultured microorganisms, and for unveiling genome signatures.
In the fourth and last section, the book moves to biomolecular relationships and meta-relationships. Chapter 11, Molecular network analysis and applications, provides an overview of current methods for analyzing large-scale biomolecular networks and major applications on biological problems using these network approaches. Also, this chapter addresses the current and next-generation network visualization and analysis tools and future challenges in analyzing the biomolecular networks. Chapter 12, Biological pathway analysis: an overview of Reactome and other integrative pathway knowledge bases, provides further insight into the use of pathway analysis tools to identify relevant biological pathways within large and complex data sets derived from various high-throughput technology platforms. The focus of the review is on the Reactome database and several closely related pathway knowledge bases. Chapter 13, Methods and challenges of identifying biomolecular relationships and networks associated with complex diseases/phenotypes, and their application to drug treatments, explores various interesting methods to infer regulatory biomolecular interactions as well as meta-relationships and molecular relationships in complex disorders and drug treatments. The chapter addresses the challenges involved in the mapping of disease symptoms, identifying novel drug targets, and tailoring patient treatments.
The book, Knowledge-Based Bioinformatics: From Analysis to Interpretation, is the outcome of an international effort, including contributors from 19 institutions located in 7 countries. It brings to light the pioneering research and cutting-edge technologies developed and used by leading experts, and their combined efforts to deal with large volumes of data and derive functional knowledge to enhance biomedical research. The extensive coverage of topics, from fundamental methods to applications, makes it a vital reference for researchers and industry professionals, and an essential text for upper-level undergraduate/first-year graduate students studying the subject.
For the publication of this book, the contribution of many people from this cross-disciplinary field of bioinformatics has been significant. The editors would like to thank the contributing authors including: Eric Karl Neumann (Ch. 1), Hannah Tipney (Ch. 2), Lawrence Hunter (Ch. 2), Mikel Egaña Aranguren (Ch. 3), Robert Stevens (Ch. 3), Erick Antezana (Ch. 3), Jesualdo Tomás Fernández-Breis (Ch. 3), Martin Kuiper (Ch. 3), Vladimir Mironov (Ch. 3), Sarah Hunter (Ch. 4), Rolf Apweiler (Ch. 4), Maria Jesus Martin (Ch. 4), Mark Reimers (Ch. 5), Ning Sun (Ch. 6), Hongyu Zhao (Ch. 6), Dietrich Rebholz-Schuhmann (Ch. 7), Jung-jae Kim (Ch. 7), Varsha K. Khodiyar (Ch. 8), Emily C. Dimmer (Ch. 8), Rachael P. Huntley (Ch. 8), Ruth C. Lovering (Ch. 8), Jonathan Mudge (Ch. 9), Jennifer Harrow (Ch. 9), Takashi Abe (Ch. 10), Shigehiko Kanaya (Ch. 10), Toshimichi Ikemura (Ch. 10), Minlu Zhang (Ch. 11), Jingyuan Deng (Ch. 11), Chunsheng V. Fang (Ch. 11), Xiao Zhang (Ch. 11), Long Jason Lu (Ch. 11), Robin A. Haw (Ch. 12), Marc E. Gillespie (Ch. 12), Michael A. Caudy (Ch. 12) and Mie Rizig (Ch. 13). The editors would also like to thank the book proposal and book draft anonymous reviewers. The editors would like to thank all the people who helped in reviewing the manuscript. The editors would like to acknowledge and thank Alpa Bajpai for her important role in editing this book.
Gil Alterovitz, Ph.D.
Marco Ramoni, Ph.D.
List of Contributors
Takashi Abe
Nagahama Institute of Bio-science and Technology, Japan [email protected]
Erick Antezana
Norwegian University of Science and Technology, Norway [email protected]
Rolf Apweiler
European Bioinformatics Institute, Cambridge, UK [email protected]
Mikel Egaña Aranguren
University of Murcia, Spain [email protected]
Michael A. Caudy
Gnomics Web Services, New York, USA [email protected]
Jingyuan Deng
Division of Biomedical Informatics, Cincinnati Children's Hospital Medical Center, USA [email protected]
Emily C. Dimmer
European Bioinformatics Institute, Cambridge, UK [email protected]
Chunsheng V. Fang
Division of Biomedical Informatics, Cincinnati Children's Hospital Medical Center, USA [email protected]
Jesualdo Tomás Fernández-Breis
University of Murcia, Spain [email protected]
Marc E. Gillespie
College of Pharmacy and Allied Health Professions, St. John's University, New York, USA [email protected]
Jennifer Harrow
Wellcome Trust Sanger Institute, Cambridge, UK [email protected]
Robin A. Haw
Department of Informatics and Bio-computing, Ontario Institute for Cancer Research, Canada [email protected]
Lawrence Hunter
University of Colorado Denver School of Medicine, USA [email protected]
Sarah Hunter
European Bioinformatics Institute, Cambridge, UK [email protected]
Rachael P. Huntley
European Bioinformatics Institute, Cambridge, UK [email protected]
Toshimichi Ikemura
Nagahama Institute of Bio-science and Technology, Japan [email protected]
Shigehiko Kanaya
Department of Bioinformatics and Genomes, Nara Institute of Science and Technology, Japan [email protected]
Varsha K. Khodiyar
Centre for Cardiovascular Genetics, University College London, UK [email protected]
Jung-jae Kim
School of Computer Engineering, Nanyang Technological University, Singapore [email protected]
Martin Kuiper
Norwegian University of Science and Technology, Norway [email protected]
Ruth C. Lovering
Centre for Cardiovascular Genetics, University College London, UK [email protected]
Long Jason Lu
Division of Biomedical Informatics, Cincinnati Children's Hospital Medical Center, USA [email protected]
Maria Jesus Martin
European Bioinformatics Institute, Cambridge, UK [email protected]
Vladimir Mironov
Norwegian University of Science and Technology, Norway [email protected]
Jonathan Mudge
Wellcome Trust Sanger Institute, Cambridge, UK [email protected]
Eric Karl Neumann
Clinical Semantics Group, Lexington, MA, USA [email protected]
Dietrich Rebholz-Schuhmann
European Bioinformatics Institute, Cambridge, UK [email protected]
Mark Reimers
Department of Biostatistics, Virginia Commonwealth University, USA [email protected]
Mie Rizig
Department of Mental Health Sciences, Windeyer Institute, London, UK [email protected]
Robert Stevens
University of Manchester, UK [email protected]
Ning Sun
Department of Epidemiology and Public Health, Yale University School of Medicine, USA [email protected]
Hannah Tipney
University of Colorado Denver School of Medicine, USA [email protected]
Minlu Zhang
Division of Biomedical Informatics, Cincinnati Children's Hospital Medical Center, USA [email protected]
Xiao Zhang
Division of Biomedical Informatics, Cincinnati Children's Hospital Medical Center, USA [email protected]
Hongyu Zhao
Department of Epidemiology and Public Health, Yale University School of Medicine, USA [email protected]
PART I
FUNDAMENTALS
Section 1
Knowledge-Driven Approaches
Chapter 1
Knowledge-Based Bioinformatics
Eric Karl Neumann
1.1 Introduction
Each day, biomedical researchers discover new insights about our biology, augmenting by leaps our collective understanding of how our bodies work and why they fail us at times. Today, in one minute we accumulate as much information as we would have from an entire year just three decades ago. Much of it is made available through publications and databases. However, no group can effectively comprehend this full complement of knowledge today; the stream of real-time publications and database uploads cannot yet be parsed and indexed as accessible, application-ready knowledge. Achieving this has become a major goal for the research community, so that we can utilize the gains made through all the funded research initiatives. This is what we mean by biomedical knowledge-driven applications (KDAs).
Knowledge is a powerful concept and is central to our scientific pursuits. However, knowledge is a term that too often has been loosely used to help sell an idea or a technology. One group argues that knowledge is a human asset, and that all attempts to digitally capture it are fruitless; another side argues that any specialized database containing curated information is a knowledge system. The label ‘knowledge’ comes to connote information contained by an agent or system that (we wish) appears to have significant value (enough to be purchased). Although the freedom to use labels and ideas should not be impeded, an agreed use of concepts like knowledge would help align community efforts, rather than obfuscate them. Without this consensus, we will not be able to define and apply principles of knowledge to relevant research and development issues that would serve the public. The definition for knowledge needs to be clear, uncomplicated, and practical:
(1) Some aspects of Knowledge can be digitized, since much of our lives depends on the use of computers and the Internet.
(2) Knowledge is different from data or stored information; it must include context and sufficient embedded semantics so that its relevancy to a problem can be determined.
(3) Information becomes Knowledge when it is applicable to more general problems.
Knowledge is about understanding acquired and annotated (sometimes validated) information in conjunction with the context in which it was originally observed and where it had significance. The basic elements in the content need to be appropriately abstracted (classification) into corresponding concepts (usually existing) so that they can be efficiently reapplied in more general situations. A future medical challenge may deal with different items (humans vs. animals), but nonetheless share some of the situational characteristics and generalized ideas of a previously captured biomedical insight. Finding this piece of knowledge at the right time so that it can be applied to an analogous but distinct situation is what separates knowledge from information. Since this is something humans have been doing by themselves for a long time, we have typically been associating knowledge exclusively with human endeavors and interactions (e.g., ‘sticky, local, and contextual,’ Prusak and Davenport, 2000).
KDA is essential for both industrial and academic biomedical research; the need to create and apply knowledge effectively is driven by economic incentives and the nature of how the world works together. In industry, the access to public and enterprise knowledge needs to be both available and in a form that allows for seamless combinations of the two sets. Concepts must enable the bridging between different sources, such that the connected union set provides a business advantage over competitors. Academic research is not that different in having internal and external knowledge, but once a novel combination has been found, validated and expounded, the knowledge is then submitted to peer review and published in an open community. Here, rather than supporting business drivers, scientific advancement occurs when researchers strive to be recognized for their contribution of novel and relevant scientific insights. The free and efficient (and sometimes open) flow of knowledge is key in both cases (Neumann and Prusak, 2007).
In preparation for the subsequent discussions, it is worth clarifying what will be meant by data, information, and knowledge. The experimentalists' definition of data will be used for the most part unless otherwise noted, and that is information measured or generated by experiments. Information will refer to all forms of digitized resources (aka data by other definitions) that can be stored and recalled from a program; it may or may not be structured. Finally, based on the above discussion, knowledge refers to information that can be applied to specific problems, usually separate from the sources and experiments from which they were derived. Knowledge can exist in both humans and digital systems, the former being more flexible to interpretation; the latter relies on the application of formal logic and well-defined semantics.
This chapter begins by providing a review of historical and contemporary knowledge discovery in bioinformatics, ranging from formal reasoning, to knowledge representation, to the issues surrounding common knowledge, and to the capture of new knowledge. Using this initial background as a framework, it then focuses on individual current knowledge discovery applications, organized by the various components and approaches: ontologies, text information extraction, gene expression analysis, pathways, and genotype–phenotype mappings. The chapter finishes by discussing the increasing relevance of the Web and the emerging use of Linked Data (Semantic Web) ‘data aggregative’ and ‘data articulative’ approaches. The potential impact of these new technologies on the ongoing pursuit of knowledge discovery in bioinformatics is described, and offered as practical direction for the research community.
1.2 Formal Reasoning for Bioinformatics
Computationally based knowledge applications originate from AI projects back in the late 1950s that were designed to perform reasoning and inferencing based on forms of first-order logic (FOL). Specifically, inferencing is the processing of available information to draw a conclusion that is either logically plausible (inconclusive support) or logically necessary (fully sufficient and necessary). This typically involves a large set of chained reasoning tasks that attempt to exhaustively infer precise conclusions by looking at all available information and applying specified rules.
Logical reasoning is divided into three main forms: deduction, induction, and abduction. These all involve working with preconditions (antecedents), conclusions (consequents), and the rules that associate these two parts. Each one tries to solve for one of these as unknowns given the other two knowns. Deduction is about solving for the consequent given the antecedent and the rule; induction is about finding the rule that determines the consequent based on the known precondition; and abduction is about determining the precondition based on the conclusions and the rules followed. Abduction is more prone to problems since multiple preconditions can give rise to the same conclusions, and is not as frequently employed; we will therefore focus only on deduction and induction here.
Deduction is what most people are familiar with, and is the basis for syllogisms: ‘All men are mortal; Socrates is a man: Therefore Socrates is mortal!’ Deductive reasoning requires no further observations; it simply requires applying rules to information on preconditions. The difficulty is that in order to perform some useful reasoning, one must have a lot of deep knowledge in the form of rules so that one can produce solid conclusions. Mathematics lends itself well here, but attempts to do this in biology are limited to simple problems: ‘P53 plays a role in cancer regulation; Gene X affects P53: Therefore Gene X may play a role in a cancer.’ The rule may be sound and generalized, but the main shortcoming here is that most people could have performed this kind of inference without invoking a computational reasoner. Evidence is still scant that such reasoning can be usefully applied to areas such as genetics and molecular biology.
Induction is more computationally challenging, but may have more real-world applications. It benefits from having lots of evidence and observations on which to create rules or entailments, which, of course, there is plenty of in research. Induction works by looking for patterns that are consistent, but the criteria can be relaxed using statistical significance to allow for imperfect data. For instance, if one regularly observes that most kinases downstream of NF-kB are up-regulated in certain lymphomas, one can propose a rule that specifies this up-regulation relation in these cancers. Induction produces rule statements that have antecedents and consequents. For induction to work effectively one must have (1) sufficient data, including negative facts (when things didn't happen); (2) sufficient associated data (metadata), describing the context and conditions (experimental design) under which the data were created; and (3) a listing of currently known associations which one can use to specifically focus on novel relations and avoid duplication. Induction by itself cannot determine cause and effect, but with sufficient experimental control, one can determine which rules are indeed causal. Indeed, induction can be used to generate hypotheses from previous data in order to design testable experiments.
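As a toy illustration of these two modes of reasoning (not part of the chapter's formalism), the following Python sketch applies a fixed deductive rule to a handful of invented facts and then induces a candidate rule only when enough consistent observations support it; all gene and predicate names are made up.

```python
# Minimal sketch contrasting deduction and induction over toy facts.
facts = {
    ("P53", "plays_role_in", "cancer_regulation"),
    ("GENE_X", "affects", "P53"),
    ("NFKB_KINASE_1", "downstream_of", "NF-kB"),
    ("NFKB_KINASE_2", "downstream_of", "NF-kB"),
    ("NFKB_KINASE_1", "up_regulated_in", "lymphoma"),
    ("NFKB_KINASE_2", "up_regulated_in", "lymphoma"),
}

def deduce(kb):
    """Deduction: apply a known rule to known preconditions.
    Rule: if ?g affects ?p and ?p plays_role_in cancer_regulation,
    then ?g may_play_role_in cancer."""
    inferred = set()
    for g, rel, p in kb:
        if rel == "affects" and (p, "plays_role_in", "cancer_regulation") in kb:
            inferred.add((g, "may_play_role_in", "cancer"))
    return inferred

def induce(kb, min_support=2):
    """Naive induction: propose 'downstream_of NF-kB => up_regulated_in lymphoma'
    only if every observed instance is consistent and support is sufficient."""
    downstream = {s for s, r, o in kb if r == "downstream_of" and o == "NF-kB"}
    consistent = {s for s in downstream if (s, "up_regulated_in", "lymphoma") in kb}
    if len(consistent) >= min_support and consistent == downstream:
        return "IF ?k downstream_of NF-kB THEN ?k up_regulated_in lymphoma"
    return None

print(deduce(facts))   # {('GENE_X', 'may_play_role_in', 'cancer')}
print(induce(facts))   # the proposed rule string
```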
Induction relies heavily on the available facts present in sources of knowledge. These change with time, and consequently inductive reasoning may yield different results depending on what information has recently been assimilated. In other words, as new facts come to light, new conclusions will arise out of induction, thereby extending knowledge. Indeed, a key reason that standardized databases such as Gene Expression Omnibus (GEO, www.ncbi.nlm.nih.gov/geo/) exist is so we can discover new knowledge by looking across many sets of experimental data, longitudinally and laterally.
Often, reasoning requires one to make ‘open world assumptions’ (OWAs) about the information (e.g., Ling-Ling is a panda), which means that if a relevant statement is missing (the statement that Ling-Ling is human is absent), it must be assumed plausible unless (1) proven false (Ling-Ling's parents are not human), (2) shown to be inconsistent (pandas and humans are disjoint), or (3) the negation of the statement is provided (Ling-Ling is not human). OWAs affect deduction by expanding the potential solution space, since some preconditions are unknown and therefore unbounded (not yet able to be fixed). Hence, a receptor with no discovered ligand should be treated as a potential receptor for many different signaling processes (ligands are often associated with biological processes). Once a ligand is determined, the signaling consequences of the receptor are narrowed according to the ligand.
With induction, inference under OWAs will usually be incomplete, since a rule cannot be exactly determined if relevant variables are unknown. Hence some partial patterns may be observed, but they will appear to have exceptions to the rule. For example, a drug target for colon cancer may not respond to inhibitors reliably due to regulation escape through a previously unknown alternative pathway branch. Once such a cross-talk path is uncovered, it becomes obvious to try to inhibit two targets together, one in each pathway, to prevent any regulatory escape (aka combinatoric therapy).
Another relevant illustration is the inclusion of Gene Ontology (GO) terms within gene records. Their presence suggests that evidence exists to recommend assigning a role or location to the gene. However, the absence of the attribute ‘regulation of cell communication’ could signify a few things: (1) the gene has yet to be assessed for involvement in ‘regulation of cell communication’; (2) the gene has been briefly reviewed, and no obvious evidence was found; and (3) the gene has been thoroughly assessed against sufficient inclusionary criteria. Since there is no way to determine, today, what the absence of a term implies, knowledge mining based on the presence or absence of GO terms will often be misleading.
OWAs often cannot be automatically applied to relational database management systems (RDBMSs), since the absence of an entry or fact in a record may indeed mean it was measured but not found. A relational database's logical consistency could be improved if it explicitly indicated which facts were always measured (i.e., lack of fact implies measured and not observed), and which ones were sometimes measured (i.e., if measured, always stated, therefore lack of fact implies not measured). The measurement attribute would need to include this semantic constraint in an accessible metamodel, such as an ontology.
Together, deduction and induction are the basis for most knowledge discovery systems, and can be invoked in a number of ways, including non-formal logic approaches, for example SQL (structured query language) in relational databases, or Bayesian statistical methods. Applying inference effectively to large corpora of knowledge requires careful planning and optimization, since the size of information can easily outpace the computation resources required due to combinatorial explosion. It should be noted that biology is notoriously difficult to generalize completely into rules; for example, the statement ‘P is a protein iff P is triplet-encoded by a Gene’ is almost always true, but not in the case of gramicidin D, a linear pentadecapeptide that is synthesized de novo by a multi-enzyme complex (Kessler et al., 2004). The failure of AI, 25 years ago, was in part due to not realizing this kind of real-world logic problem. We hope to have learned our lessons from this episode, and to apply logical reasoning to large sets of bioinformatic information more prudently.
1.3 Knowledge Representations
Knowledge Representations (KRs) are essential for the application of reasoning methodologies, providing a precise, formal structure (ontology) to describe instances or individuals, their relations to each other, and their classification into classes or kinds. In addition to these ontological elements, general axioms such as subsumption (class–subclass hierarchies) and property restrictions (e.g., P has Child C iff P is a Father ∨ P is a Mother) can be defined using common elements of logic. The emergence of the OWL Web ontology language from the W3C (World Wide Web Consortium) means that such logic expressions can be defined and applied to information resources (IRs) across the Web, enabling the establishment of KRs that span many sites over the Internet and many kinds of information resources. This is an attractive vision and could generate enormous benefits, but in order for all KRs to work together, there still needs to be coherence and consistency between the ontologies defined (in OWL) and used. Efforts such as the OBO (Open Biomedical Ontologies) Foundry are attempting to do this, but also illustrate how difficult this process is.
In the remainder of this chapter, we will take advantage of a W3C standard format known as N3 (www.w3.org/TeamSubmission/n3/) for describing knowledge representations and factual relations; the triple predicate form ‘A Brel C’ is to be interpreted as ‘Entity A has relation Brel with entity C.’ Any term of the form ‘?B’ signifies a named variable that can be anything that makes the predicate true; for example ‘?g a Gene’ means ?g could be any gene, and the double clause ‘?p a Protein. ?p is_expressed_in Liver’ means any protein is expressed in liver. Furthermore, ‘;’ signifies a conjunction between phrases with the same subject but multiple predicates (‘?p a Protein ; is_expressed_in Liver’ as in the above). Lastly, ‘[]’ brackets are used to specify any entity whose name is unknown (or doesn't matter) but which has relations contained within the brackets: ‘?p is_expressed_in [a Neural_Tissue; stage Embryonic].’ One should recognize that such sets of triples result in the formation of a system of entity nodes related to other entity nodes, better known as a graph.
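For readers unfamiliar with this notation, the following is a small, hypothetical sketch of the N3 patterns just described (the example namespace and terms are invented), parsed and queried here with the Python rdflib library:

```python
from rdflib import Graph

# Toy N3 data using the triple, ';' and '[]' patterns described above.
n3_data = """
@prefix : <http://example.org/bio#> .

:TP53  a :Gene .
:P53   a :Protein ;
       :is_expressed_in :Liver ;
       :is_expressed_in [ a :Neural_Tissue ; :stage :Embryonic ] .
"""

g = Graph()
g.parse(data=n3_data, format="n3")

# '?p a :Protein ; :is_expressed_in :Liver' expressed as a SPARQL query.
q = """
PREFIX : <http://example.org/bio#>
SELECT ?p WHERE { ?p a :Protein ; :is_expressed_in :Liver . }
"""
for (p,) in g.query(q):
    print(p)   # http://example.org/bio#P53
```

The parsed triples form exactly the kind of node-and-edge graph referred to in the text.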
1.4 Collecting Explicit Knowledge
A major prerequisite of knowledge-driven approaches is the need to collect and structure digital resources as KRs (a subset of IRs), to be stored in knowledge bases (KBs) and used in knowledge applications. Resources can include digital data, text-mined relations, common axioms (subsumption, transitivity), common knowledge, domain knowledge, specialized rules, and the Web in general. Such resources will often come from Internet-accessible sources, and it is assumed that they can be referenced similarly from different systems. Web accessibility requires the use of common and uniform resource identifiers (URIs) for each entity as well as the source system; the additional restriction of uniqueness is not as easy to implement, and can be deferred as long as it is possible to determine whether two or more identifiers refer to the same thing (e.g., owl:sameAs).
In biomedical research, recognizing where knowledge comes from is just as important as knowing it. Phenomena in biology cannot be rigorously proven as in mathematics, but rather are supported by layers of hypotheses and combinations of models. Since these are advanced by researchers with different working assumptions and based on evidence that often is local, keeping track of the context surrounding each hypothesis is essential for proper reasoning and knowledge management. Scientists have been working this way for centuries, and much of this has been done through the use of references in publications whenever (hypothetical) claims are compared, corroborated, or refuted. One recent activity that is bridging between the traditional publication model and the emerging KR approach is the SWAN project (Ciccarese et al., 2008), which has a strong focus on supporting evidence-based reasoning for the molecular and genetic causes of Alzheimer's disease.
Knowledge provenance is necessary when managing hypotheses as they either acquire additional supporting evidence (accumulating but never conclusive), or are disproved by a single critical fact that comes to light (single point of failure). Modal logic (see below), which allows one to define hypotheses (beliefs) based on partial and open world assumptions (Fagin et al., 1995), can dramatically alter a given knowledge base when a new assumption or fact is introduced to the reasoner (or researcher). As we begin to accumulate more hypotheses while at the same time having to review new information, our knowledge base will be subject to major and frequent inference-driven updates. This dependency argues strongly for employing a common and robust provenance framework for both scientific facts and (hypotheses) models. Without this capability, one will never know for sure on what specific arguments or facts a model is based, hence impeding effective Knowledge Discovery (KD). It goes without saying that this capability will need to work on and across the Web.
The biomedical research community has, to a large extent, a vast set of common knowledge that is openly shared. New abstracts and new data are put on public sites daily whenever they are approved or accepted, and many are indexed by search engines and associated with controlled vocabulary (e.g., MeSH). However, this collection is not automatically or easily assimilated into individual applications using knowledge representations, so that researchers cannot compare or infer new findings against their existing knowledge. This barrier to knowledge discovery could be removed by ensuring that new published reports and data are organized following principles of common knowledge.
1.5 Representing Common Knowledge
Common knowledge refers to knowledge that is generally known (and accessible) by everyone in a given community, and which can be formally described. Common knowledge usually differs from tacit knowledge (Prusak and Davenport, 2000) and common sense, both of which are virtually impossible to explicitly codify and which require assumptions that are non-deducible. For these reasons we will focus specifically on explicit common knowledge as it applies to bioinformatic applications.
An example of explicit common knowledge is ‘all living things require an energy source to live.’ More relevant to bioinformaticists is the central dogma of biology which states: ‘genes are transcribed into mRNA which translate into proteins; implying protein information cannot flow back to DNA,’ or formally:
∀ Protein ∃ Gene (Gene transcribes_into mRNA ∧ mRNA translates_into Protein) ⇒ ¬ (Protein reverse_translate Gene).
This is a very relevant chunk of common knowledge that not only maps proteins to genes, but even constrains the gene and protein sequences (up to codon ambiguity). In fact, it is so common, that it has been (for many years) hard-wired into most bioinformatic applications. The knowledge is therefore not only common, but pervasive and embedded, to the point where we have no further need to recode this in formal logic. However, this is not the case for more recent insights such as SNP (single nucleotide polymorphism) associations with diseases, where the polymorphism does not alter the codons directly, but the protein is either truncated or spliced differently. Since the set of SNPs is constantly evolving, it is essential to make these available using formal common knowledge. The following (simplified) example captures this at a high level:
∀ Genetic_Disease ∃ Gene ∃ Protein ∃ SNP (SNP within Gene ∧ Gene expresses Protein ∧ SNP modifies Protein ∧ SNP associated Genetic_Disease) ⇒ SNP root_cause_of Genetic_Disease.
Most of these relations (protein structure and expression changes) are being curated into databases along with their disease and gene (and sequence) associations. It would be a powerful supplement if such knowledge rules were also available to researchers and their applications. An immediate benefit would be to allow applications to extend their functionality without the need for software updates from vendors; one could simply download new rules based on common understanding and reason with local knowledge.
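As a hypothetical sketch of how such a downloadable rule might be applied locally (the SNP, gene, and disease identifiers below are invented), a single forward-chaining step over a small set of triples could look like this:

```python
# Toy forward chaining of the simplified SNP rule given above.
triples = {
    ("snp:rs0001", "within", "gene:BRCA2"),
    ("gene:BRCA2", "expresses", "prot:BRCA2"),
    ("snp:rs0001", "modifies", "prot:BRCA2"),
    ("snp:rs0001", "associated", "disease:breast_cancer_type2"),
}

def apply_snp_rule(kb):
    """(SNP within Gene) AND (Gene expresses Protein) AND (SNP modifies Protein)
    AND (SNP associated Disease)  =>  (SNP root_cause_of Disease)."""
    new = set()
    for snp, rel, gene in kb:
        if rel != "within":
            continue
        proteins = {o for s, r, o in kb if s == gene and r == "expresses"}
        diseases = {o for s, r, o in kb if s == snp and r == "associated"}
        for prot in proteins:
            if (snp, "modifies", prot) in kb:
                for d in diseases:
                    new.add((snp, "root_cause_of", d))
    return new

print(apply_snp_rule(triples))
# {('snp:rs0001', 'root_cause_of', 'disease:breast_cancer_type2')}
```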
Due to the vastness of common knowledge around all biomedical domains (including all instances of genes, diseases, and genotypes), it is very difficult to explicitly formalize all of it and place it in a single KB. However, if one considers public data sources as references of knowledge, then the amount of digitally encoded knowledge can be quickly and greatly augmented. This does require some mechanism for wrapping these sources with formal logic, for example associating entities with classes. Fortunately, the OWL-RDF (resource description framework) model is a standard that supports this kind of information system wrapping, whereby entities become identified with URIs and can be typed by classes defined in separate OWL documents. Any logical constraints presumed on database content (e.g., no GO process attribute means no evidence found to date for gene) can be explicitly defined using OWL (and other axiomatic descriptions); these would also be publicly accessible from the main source site.
Common knowledge is useful for most forms of reasoning, since it facilitates making connections between specific instances of (local) problems and generalized rules or facts. Novel relations could be deduced on a regular basis from the latest new findings, and deeper patterns induced from increasing numbers of data sets. Many believe that true inference is not possible without the proper encoding of complete common knowledge. Though it will take time to reach this level of common knowledge, it appears that there is interest in heading towards such open knowledge environments (see www.esi-bethesda.com/ncrrworkshops/kebr/index.aspx). If enough benefits are realized in biomedicine along the way, more organized support will emerge to accelerate the process.
The process for establishing common knowledge can be handled by a form of logic known as modal logic (Fagin et al., 1995), which allows different agents (or scientists) to reason with each other even though they may have different subsets of knowledge at a given time (i.e., each knows only part of the story). The goal here is to somehow make this disjoint knowledge become common to all. Here, common knowledge is (1) knowledge (φ) all members know about (EGφ), and importantly (2) something known by all members to be known to the other members. The last item applies to itself as well, forming an infinite chain of ‘he knows that she knows that he knows that…’, signifying complete awareness of the held knowledge.
Another way to understand this is that if Amy knows X about something, and Bob knows only Y, and X and Y are both required to solve a research problem (possibly unknown to Amy and Bob), then Amy and Bob need to combine their respective sets as common knowledge to solve the problem. In the real world this manifests itself as experts (or expert systems) who are called upon when there is a gap in knowledge, such as when an oncologist calls on a bioinformatician to help analyze biomarker results. Automating this knowledge expert process could greatly improve the efficiency for any researcher when trying to deduce whether their new experimental findings have uncovered new insights based on current knowledge.
In lieu of a formal method for accessing common knowledge, researchers typically resort to searching through local databases or using Google (discussed later) in hopes of filling their knowledge gaps. However, when searching a RDBMS with a query, one must know how to pose the query explicitly. This often results in not uncovering any new significant knowledge, since one requires sufficient prior knowledge to enter the right query in the first place, in which case one is searching only ‘under the street lamp.’ More likely, only particular instances of facts are uncovered, such as dates, numeric attributes and instance qualia. This is a distinguishing feature that separates databases from knowledge bases, and illustrates that databases can support at best very focused and constrained knowledge discovery. Specifically, queries using SQL produce limited knowledge, since they typically do not uncover generalized relations between things. Ontological relations rely on the ability to infer classes, groups of relations, transitivity (multi-joins), and rule satisfiability; these are the instruments by which general and usable knowledge can be uncovered. Standard relational databases (by themselves) are too restrictive for this kind of reasoning and do not properly encode (class and relation) information for practical knowledge discovery.
In many cases, the bioinformatics community has come to view such curated database content as knowledge. For example, the curated protein database Swiss-Prot/UniProt is accepted as a high quality source of reviewed and validated knowledge for proteins, including mutational and splice variants, relations to disorders, and the complexes which they constitute. In fact, it is often the case that curated sets of information are informally raised to the level of knowledge by the community. This definition is more about practice and interpretation than any formal logic definition. Nonetheless, it is relevant and valid for the community of researchers, who often do apply logic constraints on the application of this information: if a novel protein polymorphism not in Swiss-Prot is discovered and validated, it is accepted as real and eventually becomes included into Swiss-Prot.
Nonetheless, databases are full of valuable information and could be re-formatted or wrapped by an ontological layer that would support knowledge inference and discovery, defined here as implicit knowledge resources (IR → KR). If this were to happen, structured data stores could be federated into a system of biomedical common knowledge: knowledge agents could be created that apply modal logic reasoning to crawl across different knowledge resources on the Web in search of new insights. Suffice it to say, practical modal logic is still an emerging and incomplete area of research. Currently, humans are best at identifying which new facts should be incorporated into new knowledge regarding a specific subject or phenomenon; hence researchers would be best served by being provided with intelligent knowledge assistants that can help identify, review, compare, and assimilate new findings from these biomedical IRs. There is a lot of knowledge in biology that could be formally common, consequently there is a clear need to transform public biomedical information sources to work in concert with knowledge applications and practices. Furthermore, this includes the Web, which has already become a core component of all scientific research communities.
1.6 Capturing Novel Knowledge
Not everything can be considered common knowledge; there are large collections of local-domain knowledge consisting of works and models (published and pre-published) created by individual research groups. This is usually knowledge that has not been completely vetted or validated yet (hypotheses and beliefs), but nonetheless can be accessed by others who wish to refute or corroborate the proposed hypotheses as part of the scientific method. This knowledge is connected to and relies on the fundamentals of biology, which are themselves common knowledge (since they form the basis of scientific common ground). So this implies that we are looking for a model which allows connecting local knowledge easily with common knowledge.
Research information is knowledge that is in flux; it is comprised of assumptions and proposed models (mechanisms of action). In modal logic (Fagin et al., 1995) this is comparable to the KD45 axioms: an agent (individual or system) can believe in something not yet proven true, but if shown to be false, the agent cannot believe in it anymore; that is, logic contradictions are not allowed. KD45 succinctly sums up how the scientific process works with competing hypotheses, and how all parallel hypotheses can co-exist until evidence emerges that proves some to be incorrect.
Therefore, the finding (by a research group) that a mutation in the BRCA2 gene is always associated with type 2 breast cancer strongly argues against any other gene being the primary cause for type 2 susceptibility. Findings that involve strong causal relations, such as nucleotide-level changes and the phenotypes of the people who always carry them, are prime examples of how new-findings knowledge can work together with common knowledge. As more data is generated, this process will need to be streamlined and automated; and to prevent too many false positives from being retained, the balanced use of logic and statistics will be critical.
The onslaught of large volumes of information being generated by experiments and subsequent analyses requires proper data set tracking, including the capture of experimental conditions for each study. The key to managing all these associated facts is enforcing data provenance. Without context and provenance, most experimental data will be rendered unusable for other researchers, a problem already identified by research agencies (Nature Editorial, 2009). Provenance will ensure a reliable chain of evidence associated by conditions and working hypotheses that can be used to infer high-value knowledge associations from new findings.
1.7 Knowledge Discovery Applications
Once common and local knowledge are available to systems in a machine-interpretable form, the construction and use of knowledge-discovery applications that can work over these sources becomes practical and empowering. KDA has its roots in what has been labeled Knowledge Discovery and Data Mining (KDD, Fayyad et al., 1996), which consists of several computational approaches that came together in the mid 1990s under a common goal. The main objective of KDD is to turn collected data into knowledge, where knowledge is something of high value that can be applied directly to specific problems. Specifically, it had become apparent that analysts were ‘drowning in information, but starving for knowledge’ (Naisbitt, 1982). It was hoped that KDD would evolve into a formal process that could be almost entirely computationally driven. It was to assist knowledge workers during exploratory data analysis (EDA) when confronted by large sets of data. The extraction of interesting insights via KDD follows more or less inductive reasoning models.
KDD utilizes approaches such as first-order logic and data mining (DM) to extract patterns from data, but distinguishes itself from DM in that the patterns must be validated, made intelligently interpretable, and applicable to a problem. More formally, KDD defines a process that attempts to find expression patterns (E) in sets of Facts (F) that have a degree of certainty (C) and novelty (N) associated with them, while also being useful (U) and simple (S) enough to be interpreted. Statistics plays a strong role in KDD, since finding patterns with significance requires the proper application and interpretation of statistical theories. Some basic KDD tools include Decision Trees, Classification and Regression, Probabilistic Graphs, and Relational Learning. Most of these can be divided into supervised learning and unsupervised learning. Some utilize propositional logic more than others.
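To make one of these tools concrete, here is a minimal sketch (not taken from the chapter) of a supervised decision-tree classifier fitted to a tiny synthetic data set; it assumes the scikit-learn library is installed, and the feature names and values are invented.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Toy features: [expression_level, has_risk_allele]; toy label: 1 = responder.
X = [[0.2, 0], [0.4, 0], [0.9, 1], [0.8, 1], [0.3, 1], [0.7, 0]]
y = [0, 0, 1, 1, 0, 1]

clf = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# The learned tree is itself a small, interpretable graph of decision rules.
print(export_text(clf, feature_names=["expression_level", "has_risk_allele"]))
print(clf.predict([[0.85, 1]]))   # predicted class for a new sample
```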
Key issues that KDD was trying to address include:
- Data classification
- Interpretation of outcomes (uncovering relations, extraction of laws)
- Relevance and significance of data patterns
- Identification and removal of confounding effects (Simpson's paradox, http://plato.stanford.edu/entries/paradox-simpson/).

Patterns may be known (or hypothesized) in advance, but KDD is supposed to aid in the extraction of such patterns based on the statistical structure of the data and any available domain knowledge. Clearly, information comes in a few flavors: quantitative and qualitative (symbolic). KDD was intended to take advantage of both wherever possible. Symbolic relations embedded in both empirical data (e.g., what conditions were different samples subjected to?) and domain knowledge (e.g., patient outcomes are affected by their genotypes) begin to demonstrate the true symbolic nature of information. That is, data is about tying together relations and attributes, whether it is arranged as tables of values or sets of assertions. The question arises, how can we use this to more efficiently find patterns in data? The key here is understanding that relational data can be generalized as data graphs: collections of nodes connected by edges, analogous to how formal relational knowledge structures are to function (see above).
Indeed, all the KDD tools listed have some form of graph representation: decision trees, classification trees, regression weighted nodes, probabilistic graphs, and relational models (Getoor et al., 2007). The linked nodes can represent samples, observations, background factors, modeling components (e.g., θi), outcomes, dependencies, and hidden variables. It would follow that a common way to represent data and relational properties using graphs could help generalize KDD approaches and allow them to be used in concert with each other. This is substantial, since we now have a generalized way for any application to access and handle knowledge and facts using a common format system, based on graph representation (and serialized by the W3C standard formats, RDF-XML or RDF-N3).
More recently, work by Koller and others has shown that the structure of data models (relations between different tables or sets of things) can be exploited to help identify significant relations within the data (Getoor et al., 2007). That is, the data must already be in a graph-knowledge form in order to be effectively mined statistically. To give an example, if a table containing tested-subject responses for a treatment is linked to the treatment-dosing table and the genetic alleles table, then looking for causal response relations is a matter of following these links and calculating the appropriate aggregate statistics. Specifically, if one compares all the responses in conjunction with the drug and dosing used as well as the subject's genotype, then by applying Bayesian inference, strong interactions between both factors can be identified. Hence data graph structures can be viewed as first-order ‘hypotheses’ for potential interactions.
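As a rough illustration of this linked-table idea (not taken from the chapter), the sketch below merges three small synthetic tables on a shared subject identifier and computes an aggregate response per dose and genotype; it assumes the pandas library is installed, and all column names and values are invented.

```python
import pandas as pd

responses = pd.DataFrame({"subject": [1, 2, 3, 4], "response": [0.1, 0.9, 0.8, 0.2]})
dosing    = pd.DataFrame({"subject": [1, 2, 3, 4], "dose_mg": [10, 50, 50, 10]})
alleles   = pd.DataFrame({"subject": [1, 2, 3, 4], "genotype": ["AA", "AG", "AG", "AA"]})

# The links between the tables (via 'subject') mirror the data-graph structure
# and suggest which aggregate statistics are worth computing.
linked = responses.merge(dosing, on="subject").merge(alleles, on="subject")

print(linked.groupby(["dose_mg", "genotype"])["response"].mean())
```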
By the mid 1990s, the notion of publishing useful information on the Web began to take off, allowing it to be linked and accessed by other sites: the Web as a system of common knowledge took root, and applications began to work with it. This was followed by efforts to define ontologies in a way that would work from anywhere on the Web, and with anything located anywhere on the Web. A proposal to support the typing of things on the Web was eventually submitted to DARPA (the Defense Advanced Research Projects Agency); it was funded in 2000 and became known as the DAML (DARPA Agent Mark-up Language, www.daml.org/) project. DAML became the forerunner of the Semantic Web and eventually evolved into the OWL ontology language, which is based on Description Logic (DL).
Dozens of applications for KDD have been proposed across many different domains, but its relative effectiveness from one area to another is unclear. To this end, an open challenge, the KDD Cup (www.kdnuggets.com/datasets/kddcup.html), was initiated to see how well KDD can be applied to different problem spaces. It has gained a large following in bioinformatics, addressing such diverse areas as:
- Prediction of gene/protein function and localization
- Prediction of molecular bioactivity for drug design
- Information extraction from biomedical articles
- Yeast gene regulation prediction
- Identification of pulmonary embolisms from three-dimensional computed tomography data
- Computer-aided detection (CAD) of early stage breast cancer from X-ray images.

KDD was conceived during a time when the shortcomings of AI had surfaced and the Web's potential was just emerging. We are now in an age where documents can be linked to other documents anywhere in the world; where communities can share knowledge and experiences; where data can be linked to meaning. The most recent rendition of this progress is the Semantic Web.
1.8 Semantic Harmonization: the Power and Limitation of Ontologies
One of the most important requirements for data integration or aggregation is that all data producers and consumers utilize common semantics (Rubin et al., 2006). In the past, data integration was assumed to be about common formats (syntax), on the premise that if one knows the structure of the data, one can infer its meaning (semantics). This is now known to be grossly oversimplified: the semantics must also be clearly defined. RDF addresses the syntax issue by forcing all data relations to be binary, thereby modeling every statement as a triple (subject, relation/property, object).
The emergence of the W3C OWL ontology standard has enabled the formal definition of many biological and medical concepts and relations. OWL is based on description logic, a FOL-based formalism developed in the 1980s to support class (concept) subsumption and relations between instances of classes. Using OWL, a knowledge engineer can create class hierarchies of concepts that map to real-world observations; for instance, ‘Genes are encoded in DNA and themselves encode proteins.’ OWL's other key feature is that it can be referenced from anywhere on the Web (e.g., used by a database) and incorporated into other, non-local logical structures (ontology extension). It was designed so that any defined ontological components are identifiable nodes on the Web; that is, all users can refer to the same defined Class. The most current version of OWL is OWL 2, which is based on the more expressive SROIQ description logic (Horrocks et al., 2006). The OWL format is modeled after the Resource Description Framework (RDF), which will be described later.
The OWL standard allows knowledge systems to utilize ontologies defined by various groups, such as the Gene Ontology, UniProt, BioPAX, and the Disease Ontology. Data sets that one wishes to align with these ontologies can now use a well-specified mechanism: simply reference the ontology URIs from within the data documents and systems. By doing so, all the data in the system are formally associated with the concepts, as well as with the relations those concepts have with each other. Any third party looking at the data can also instantly find (over the Web) which ontologies were used to define the set.
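As a small, hedged example of this mechanism, the following rdflib snippet types a locally defined protein record with a Gene Ontology class simply by referencing the term's URI; the local namespace and protein name are hypothetical, while the GO URI follows the standard OBO PURL pattern.

from rdflib import Graph, Namespace, URIRef
from rdflib.namespace import RDF

DATA = Namespace('http://example.org/mydata#')      # hypothetical local data namespace
GO = 'http://purl.obolibrary.org/obo/GO_'           # OBO PURL prefix for GO terms

g = Graph()
g.bind('data', DATA)

# Associate a local record with an externally defined ontology class, by URI alone
g.add((DATA.proteinX, RDF.type, URIRef(GO + '0006915')))  # GO:0006915, apoptotic process

print(g.serialize(format='turtle'))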
Many of the current activities around developing ontologies in OWL are about defining common sets of concepts and relations for molecular biology (genes, proteins, and mechanisms) and biomedicine (diseases and symptoms). However, there is still no general agreement on how (completely) to define basic concepts (e.g., gene, protein), or on what upper-level biological ontologies should look like or do. It is not inconceivable that this process will take many more years.
1.9 Text Mining and Extraction
One common source of knowledge that many scientists wish to access is the unstructured text of scientific publications. Due to the increasing volume of published articles, it is widely recognized (Hunter and Cohen, 2006) that researchers are unable to keep up with the flow of new research findings. Much expectation is placed on using computers to help researchers deal with this imbalance by mining the content of each paper for its relevant findings. However, there are many different things that can be mined out of research papers, and to do so completely and accurately is not possible today. Therefore, we will focus here only on the extraction of specific subsets of embedded information, including gene and biomolecule causal effects, molecular interactions and compartments, phenotype–gene associations, and disease treatments.
One way to mine content is simply to index key words and phrases based on text patterns and usage frequency. This is essentially what search engines, including Google, do. It works quite well for finding significant occurrences of words; however, it fails to capture exactly what is being said about something, that is, the semantics. For instance, consider indexing the phrase ‘… cytokine modulation may be the basis for the therapeutic effects of both anti-estrogens in experimental SLE.’ One can readily identify cytokine modulation (CM) and its association with therapeutic effects (TE) or experimental SLE (xSLE), but the assertion that ‘CM is a TE for xSLE’ cannot be inferred from co-occurrence. Hence, only limited knowledge about the things being mentioned can be obtained using indexing, such as the fact that two concepts occur in the same sentence; the relation between them (if there is one) remains ambiguous.
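The limitation can be made concrete with a small sketch (the sentence and term list are taken from the example above; the code is purely illustrative): a co-occurrence index records that the concepts appear together, but says nothing about the relation between them.

from collections import Counter
from itertools import combinations

sentences = [
    'cytokine modulation may be the basis for the therapeutic effects '
    'of both anti-estrogens in experimental SLE',
]
terms = ['cytokine modulation', 'therapeutic effects', 'experimental SLE']

# Count which terms appear together in the same sentence
cooccurrence = Counter()
for sentence in sentences:
    present = [t for t in terms if t in sentence]
    for a, b in combinations(sorted(present), 2):
        cooccurrence[(a, b)] += 1

print(cooccurrence)
# All three pairs co-occur once, but nothing here says which term is a
# therapeutic effect for which condition: the relation itself is not captured.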
Word-phrase indexing is very practical for search, but for scientific knowledge inquiries it is insufficient; what is specifically needed is the extraction of relations R(A, B). Although this is more challenging, significant effort has been invested in mining relations about specific kinds of entities from natural language. This is referred to as Information Extraction (IE), and it relies much more heavily on understanding some aspects of phrase semantics. Clearly this hinges on predefining classes of entities and sets of relations that are semantically mapped to each other (an ontology). The objective is to quickly glean key relations about things like biomolecules, biostructures, and bioprocesses from research articles, thereby permitting the rapid creation of accessible knowledge bases (KBs) about such entities.
As an example, if one wanted to find out (from published research) whether a particular gene is associated with any known disease mechanisms, one would query the KB for any relation of the form ?gene ?rel [a Disease] (as well as [a Disease] ?rel ?gene). This form allows any relation involving a gene to be identified from the KB, where ?rel could mean ‘is associated with,’ ‘influences,’ ‘suppresses,’ or ‘is over-expressed in.’ These relations should be defined in an ontology, with the appropriate domain and range entity classes explicitly included. For IE to be most effective, it is useful to focus only on the specific kinds of relations one is interested in, rather than trying to support a universal set. This reduces the complexity of finding and interpreting specific relations from word-phrase patterns in natural language, a problem that is far from being solved generically. Hence, it is desirable to have modules or cartridges for different IE target tasks, which utilize different ontologies and controlled vocabularies.
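As a hedged sketch of such a query, the following populates a tiny rdflib graph with two assertions of the kind an IE pipeline might emit (the vocabulary and the gene-disease pairs are purely illustrative) and then runs the ?gene ?rel [a Disease] pattern as SPARQL.

from rdflib import Graph, Namespace
from rdflib.namespace import RDF

EX = Namespace('http://example.org/bio#')   # hypothetical extraction vocabulary

g = Graph()
# Toy assertions of the kind an IE system might extract from the literature
g.add((EX.BRCA1, EX.isAssociatedWith, EX.BreastCancer))
g.add((EX.BreastCancer, RDF.type, EX.Disease))
g.add((EX.TNF, EX.isOverexpressedIn, EX.RheumatoidArthritis))
g.add((EX.RheumatoidArthritis, RDF.type, EX.Disease))

# The ?gene ?rel [a Disease] pattern, expressed in SPARQL
q = '''
SELECT ?gene ?rel ?disease WHERE {
    ?gene ?rel ?disease .
    ?disease a ex:Disease .
}
'''
for gene, rel, disease in g.query(q, initNs={'ex': EX}):
    print(gene, rel, disease)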
Several open source or publicly accessible IE systems exist, including GATE, Geneways, OpenCalais, TextRunner, and OpenDMAP. OpenDMAP is specifically designed to extract predicates defined in the OBO system of ontologies (Relationship Ontology, RO), in particular those involved in protein transport, protein–protein interactions, and cell-specific gene expression (Hunter et al., 2008). Its developers applied it to articles from over 4000 journals, extracting 72 460 transport, 265 795 interaction, and 176 153 expression statements after accounting for errors (type 1 and type 2). Many of the errors are attributable to misidentification of gene and protein names. This issue will not be resolved by better semantic tools, since it is a more basic problem of entity identification.
One possibility being considered by the community is that future publications may explicitly include formal identifiers for entities in the text, as well as controlled vocabularies, linked ontologies, and a specific predicate statement regarding the conclusion of the paper. Automated approaches that create such embedded assignments are being investigated throughout the research community, but so far show varying degrees of completeness and correctness, that is, both type 1 and type 2 errors. Much of this could be avoided if authors included such embedded annotations while writing their papers. Attractive as this sounds, it will require the development of easy-to-use, non-invasive tools that do not impact authors' writing practices. It will be quite interesting to follow developments in this technology area over the next few years.
1.10 Gene Expression
Gene Expression Analytics (GEA) is one of the most widely applied methodologies in bioinformatics, mixing data mining with knowledge discovery. Its advantage is that it combines experimentally controlled conditions with large-scale genomic measurements; as a technology platform it has become commoditized, so it can be applied cost-effectively to large sample sets. Its weakness is that, at best, it measures an average over many cells and cell types, which may be in varied states, resulting in confounding effects; in addition, transcript levels usually do not correspond to protein levels. It has become one of a set of tools used to investigate and identify biomarkers that can be applied to the research and treatment of diseases. There is great expectation that knowledge-driven approaches can be applied successfully around these applications, justifying the enormous investment in funded research and development to create the knowledge needed to support the plethora of next-generation research.
GEA works with experimentally derived data, but shows best results when used in conjunction with gene annotations and sample-associated information; in essence, it examines the expression patterns of many genomic regions (including multiple transcript regions per gene, in order to handle splice variants) under various sample conditions (multiple affected individuals, genotypes, therapeutic perturbations, dosing, time course, recovery). The data construct produced is an N × K matrix, M, of expression levels, where N is the number of probes and K is the number of samples (conditions).
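For concreteness, a minimal sketch of such a matrix; the probe and sample labels and the simulated values are purely illustrative.

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical labels: N probe IDs (rows) and K sample conditions (columns)
probes = ['probe_%d' % i for i in range(1, 6)]             # N = 5
samples = ['ctrl_0h', 'ctrl_24h', 'drug_0h', 'drug_24h']   # K = 4

# The N x K expression matrix M, filled here with simulated log-intensities
M = pd.DataFrame(rng.normal(8.0, 1.5, size=(len(probes), len(samples))),
                 index=probes, columns=samples)
print(M)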
