The first comprehensive overview of preprocessing, mining, and postprocessing of biological data
Molecular biology is undergoing exponential growth in both the volume and complexity of biological data, and knowledge discovery offers the capacity to automate complex search and data analysis tasks. This book presents a broad overview of the most recent developments in techniques and approaches in the field of biological knowledge discovery and data mining (KDD), providing in-depth fundamental and technical information on the most important topics encountered. Written by top experts, Biological Knowledge Discovery Handbook: Preprocessing, Mining, and Postprocessing of Biological Data covers the three main phases of knowledge discovery (data preprocessing; data processing, also known as data mining; and data postprocessing) and analyzes both verification systems and discovery systems.
BIOLOGICAL DATA PREPROCESSING
* Part A: Biological Data Management
* Part B: Biological Data Modeling
* Part C: Biological Feature Extraction
* Part D: Biological Feature Selection
BIOLOGICAL DATA MINING
* Part E: Regression Analysis of Biological Data
* Part F: Biological Data Clustering
* Part G: Biological Data Classification
* Part H: Association Rules Learning from Biological Data
* Part I: Text Mining and Application to Biological Data
* Part J: High-Performance Computing for Biological Data Mining
Combining sound theory with practical applications in molecular biology, Biological Knowledge Discovery Handbook is ideal for courses in bioinformatics and biological KDD as well as for practitioners and professional researchers in computer science, life science, and mathematics.
Page count: 2299
Year of publication: 2015
Contents
Cover
Wiley Series on Bioinformatics: Computational Techniques and Engineering
Title Page
Copyright
Dedication
Preface
Contributors
Section I: Biological Data Preprocessing
Part A: Biological Data Management
Chapter 1: Genome and Transcriptome Sequence Databases for Discovery, Storage, and Representation of Alternative Splicing Events
1.1 Introduction
1.2 Splicing
1.3 Alternative Splicing
1.4 Alternative Splicing Databases
1.5 Data Mining from Alternative Splicing Databases
Acknowledgments
Web Resources
References
Chapter 2: Cleaning, Integrating, and Warehousing Genomic Data from Biomedical Resources
2.1 Introduction
2.2 Related Work
2.3 Typology of Data Quality Problems in Biomedical Resources
2.4 Cleaning, Integrating, and Warehousing Biomedical Data
2.5 Conclusions and Perspectives
Web Resources
References
Chapter 3: Cleansing of Mass Spectrometry Data for Protein Identification and Quantification
3.1 Introduction
3.2 Preprocessing Approach for Improving Protein Identification
3.3 Identification Filtering Approach for Improving Protein Identification
3.4 Evaluation Results
3.5 Conclusion
References
Chapter 4: Filtering Protein–Protein Interactions by Integration of Ontology Data
4.1 Introduction
4.2 Evaluation of Semantic Similarity
4.3 Identification of False Protein–Protein Interaction Data
4.4 Conclusion
References
Part B: Biological Data Modeling
Chapter 5: Complexity and Symmetries in DNA sequences
5.1 Introduction
5.2 Archaea
5.3 Patterns on Indicator Matrix
5.4 Measure of Complexity and Information
5.5 Complex Root Representation of DNA Words
5.6 DNA Walks
5.7 Wavelet Analysis
5.8 Algorithm of Short Haar Discrete Wavelet Transform
5.9 Conclusions
References
Chapter 6: Ontology-Driven Formal Conceptual Data Modeling for Biological Data Analysis
6.1 Introduction
6.2 Description Logics for Conceptual Data Modeling
6.3 Extensions
6.4 Automated Reasoning and Biological Knowledge Discovery
6.5 Conclusions and Outlook
References
Chapter 7: Biological Data Integration Using Network Models
7.1 Introduction
7.2 Biological Network Models
7.3 Network Models in Understanding Disease
7.4 Future Challenges
Acknowledgment
References
Chapter 8: Network Modeling of Statistical Epistasis
8.1 Introduction
8.2 Epistasis and Detection
8.3 Network
8.4 Gene-Association Interaction Network
8.5 Statistical Epistasis Networks
8.6 Concluding Remarks
Acknowledgment
References
Chapter 9: Graphical Models for Protein Function and Structure Prediction
9.1 Introduction
9.2 Graphical Models
9.3 Applications
9.4 Summary
Acknowledgments
References
Part C: Biological Feature Extraction
Chapter 10: Algorithms and Data Structures for Next-Generation Sequences
10.1 Aligners
10.2 Assemblers
References
Chapter 11: Algorithms for Next-Generation Sequencing Data
11.1 Introduction
11.2 Definitions and Notations
11.3 REAL: A Read Aligner for Mapping Short Reads to a Genome
11.4 CREAL: Mapping Short Reads to a Genome with Circular Structure
11.5 DynMap: Mapping Short Reads to Multiple Closely Related Genomes
11.6 Conclusion
References
Chapter 12: Gene Regulatory Network Identification with Qualitative Probabilistic Networks
12.1 Central Dogma: Gene Expression in a Cell
12.2 Measuring Expression Levels: Microarray Technology
12.3 Understanding Gene Regulatory Networks: Basic Concepts
12.4 Bayesian Networks for Learning GRNs
12.5 Toward Qualitative Modeling of GRNs
12.6 QPNs for Gene Regulation
12.7 Summary and Conclusions
References
Part D: Biological Feature Selection
Chapter 13: Comparing, Ranking, and Filtering Motifs with Character Classes: Application to Biological Sequences Analysis
13.1 Introduction
13.2 Motifs with Character Classes: A Characterization
13.3 Filtering by means of Underlying Motifs
13.4 Experimental Results and Discussion
13.5 Conclusion
Acknowledgments
References
Chapter 14: Stability of Feature Selection Algorithms and Ensemble Feature Selection Methods in Bioinformatics
14.1 Introduction
14.2 Feature Selection Algorithms and Instability
14.3 Ensemble Feature Selection Algorithms
14.4 Metrics for Stability Assessment
14.5 Conclusions
Acknowledgment
References
Chapter 15: Statistical Significance Assessment for Biological Feature Selection: Methods and Issues
15.1 Introduction
15.2 Statistical Significance Assessment
15.3 p-Value Distribution and π0 Estimation
15.4 Obtaining Control and Background Estimation
15.5 Statistical Significance in Integrative Analysis
15.6 Conclusions
Symbols
Acknowledgments
References
Chapter 16: Survey of Novel Feature Selection Methods for Cancer Classification
16.1 Biological Background
16.2 Introduction
16.3 Kernel-Based Feature Selection with Hilbert–Schmidt Independence Criterion
16.4 Redundancy-Based Gene Selection
16.5 Unsupervised Feature Selection
16.6 Summary of Algorithms
16.7 Conclusion
References
Chapter 17: Information-Theoretic Gene Selection in Expression Data
17.1 Introduction
17.2 Curse of Dimensionality
17.3 Variable Selection Exploration Strategies
17.4 Relevance, Redundancy, and Synergy
17.5 Information-Theoretic Filters
17.6 Fast Mutual Information Estimation
17.7 Conclusions
References
Chapter 18: Feature Selection and Classification for Gene Expression Data Using Evolutionary Computation
18.1 Introduction
18.2 Preliminaries
18.3 Evolutionary Reduct Generation
18.4 Experimental Results
18.5 Conclusion
References
Section II: Biological Data Mining
Part E: Regression Analysis of Biological Data
Chapter 19: Building Valid Regression Models for Biological Data Using Stata and R
19.1 Introduction
19.2 Fitting the Model
19.3 Validity of the Model
19.4 Nonconstant Variance and Variable Transformation
19.5 Marginal Model Plots
19.6 Patterns in Residual Plots
19.7 Variable Selection
References
Chapter 20: Logistic Regression in Genomewide Association Analysis
20.1 Introduction
20.2 Single Genetic Marker: Basic Concepts
20.3 Single Genetic Marker: Statistical Tests
20.4 Two Genetic Markers and Fisher's Nonadditivity Interaction
20.5 Many Genetic Markers in Genomewide Association Analysis: Variable Reduction and Penalized Regression
20.6 Latent Variables and Dimension Reduction: Partial Least-Squares Regression
20.7 Latent Variables: Logic Regression
20.8 Discussion
Appendix: Matrix Representation of Partial Least-Squares Regression
Acknowledgments
References
Chapter 21: Semiparametric Regression Methods in Longitudinal Data: Applications to AIDS Clinical Trial Data
21.1 Introduction
21.2 Modeling a Single Treatment Group Using a Semiparametric Partially Linear Model
21.3 Modeling Within-Subject Covariance
21.4 Modeling Multiple Treatment Groups
21.5 Summary
Acknowledgment
References
Part F: Biological Data Clustering
Chapter 22: The Three Steps of Clustering in the Post-Genomic Era
22.1 Introduction
22.2 Experimental Set-Up
22.3 Distances
22.4 Clustering Algorithms
22.5 Internal Validation Measures
22.6 Conclusions
Acknowledgment
References
Chapter 23: Clustering Algorithms of Microarray Data
23.1 Introduction
23.2 Geometric Clustering Algorithms
23.3 Model-Based Clustering Algorithms
23.4 Formal Concept–Based Clustering Algorithms
23.5 Clustering Webtools
23.6 Microarray Data Sets
23.7 Conclusion
References
Chapter 24: Spread of Evaluation Measures for Microarray Clustering
24.1 Introduction
24.2 Search Procedure and Classification of Evaluation Measures
24.3 Internal Measures
24.4 External Measures
24.5 Biological Measures
24.6 Discussion
24.7 Data Sets
24.8 Conclusions
References
Chapter 25: Survey on Biclustering of Gene Expression Data
25.1 Introduction
25.2 Types of Biclusters
25.3 Groups of Biclusters
25.4 Evaluation Functions
25.5 Systematic and Stochastic Biclustering Algorithms
25.6 Bicluster Validation
25.7 Conclusion
Acknowledgments
References
Chapter 26: Multiobjective Biclustering of Gene Expression Data with Bioinspired Algorithms
26.1 Introduction
26.2 Biclustering Problem in Microarray Data
26.3 Multiobjective Model for Biclustering in Gene Expression Data
26.4 Bioinspired Algorithms for Biclustering
26.5 Results and Discussions
26.6 Conclusion
References
Chapter 27: Coclustering Under Gene Ontology Derived Constraints for Pathway Identification
27.1 Introduction
27.2 Related Work
27.3 Constrained Coclustering
27.4 Parameterless Methodology for GO-driven Coclustering
27.5 Case Study
27.6 Conclusion
References
Part G: Biological Data Classification
Chapter 28: Survey on Fingerprint Classification Methods for Biological Sequences
28.1 Introduction
28.2 Basic Definitions and Problem Statements
28.3 Overview of Various Classification Approaches
28.4 Missing-Value Estimation Methods
28.5 Fingerprint Classification: Combinatorial Approach for Estimating Missing Values
Acknowledgments
References
Chapter 29: Microarray Data Analysis: From Preparation to Classification
29.1 Introduction
29.2 Experiment Design
29.3 Normalization
29.4 Ranking
29.5 Brief Review of Approaches of Microarray Data Classification
29.6 MIDClass: A Novel Approach to Effective Microarray Data Classification
29.7 Experimental Study
29.8 Conclusion
References
Chapter 30: Diversified Classifier Fusion Technique for Gene Expression Data
30.1 Introduction
30.2 Background Study
30.3 Preliminaries
30.4 Proposed Model
30.5 Experimental Evaluation
30.6 Conclusion
References
Chapter 31: RNA Classification and Structure Prediction: Algorithms and Case Studies
31.1 Introduction
31.2 Classification of RNA Sequences
31.3 In Silico Prediction of RNA Pseudoknots
31.4 Conclusion
References
Chapter 32: Ab Initio Protein Structure Prediction: Methods and Challenges
32.1 Introduction
32.2 Protein-Folding Problem Milestones at a Glance
32.3 Ab Initio Protein Structure Prediction
32.4 Pure Ab Initio Prediction
32.5 Ab Initio Prediction with Database Information
32.6 Discussion and Challenges
32.7 Appendix: CASP9
References
Chapter 33: Overview of Classification Methods to Support HIV/AIDS Clinical Decision Making
33.1 Predicting Resistance to Drugs
33.2 Predicting Coreceptor Usage
33.3 Identifying Subtype
33.4 Identifying Mutation Selection Pressure
33.5 Making Treatment-related Decisions
33.6 Future Directions
33.7 Conclusion
References
Part H: Association Rules Learning from Biological Data
Chapter 34: Mining Frequent Patterns and Association Rules from Biological Data
34.1 Introduction
34.2 Definition of AR Mining Problem
34.3 Algorithms for Mining ARs
34.4 Preprocessing and Postprocessing
34.5 Gene Expression Data Mining
34.6 Sequential Data Mining
34.7 Structural Data Mining
34.8 Protein Interactions: Graph Data Mining
34.9 Text Mining
34.10 Conclusion
References
Chapter 35: Galois Closure Based Association Rule Mining from Biological Data
35.1 Introduction
35.2 Association Rule Mining Frameworks
35.3 Condensed Representations of Association Rules
35.4 Interestingness Measures
35.5 Biological Applications
35.6 Conclusion
References
Chapter 36: Inference of Gene Regulatory Networks Based on Association Rules
36.1 Introduction
36.2 Data Mining and Inference of GRNs based on ARs
36.3 Techniques of Inference of GRNs based on AR
36.4 Concluding Remarks
Acknowledgments
References
Part I: Text Mining and Application to Biological Data
Chapter 37: Current Methodologies for Biomedical Named Entity Recognition
37.1 Introduction
37.2 Preliminaries
37.3 Dictionary-Based Approaches
37.4 ML-Based Approaches
37.5 Hybrid Approaches
37.6 Use Cases
37.7 Conclusion
References
Chapter 38: Automated Annotation of Scientific Documents: Increasing Access to Biological Knowledge
38.1 Introduction
38.2 Survey of Tools
38.3 Technologies and Techniques
38.4 Discussion
38.5 Future Perspectives
Glossary
Acknowledgments
References
Chapter 39: Augmenting Biological Text Mining with Symbolic Inference
39.1 Introduction
39.2 Identifying Implied Information
39.3 Predicting New Hypotheses
39.4 Text Mining with Distributional Analysis
39.5 Discussion and Conclusion
Acknowledgments
References
Chapter 40: Web Content Mining for Learning Generic Relations and their Associations from Textual Biological Data
40.1 Introduction
40.2 State-of-the-Art in Biological Relation Mining
40.3 Proposed Biological Relation-Mining System
40.4 Performance Evaluation
40.5 Uniqueness of Proposed Biological Relation-Mining System
40.6 Conclusion and Future Work
References
Chapter 41: Protein–Protein Relation Extraction from Biomedical Abstracts
41.1 Introduction
41.2 BioEve: BioMolecular Event Extractor
41.3 Sentence-Level Classification and Semantic Labeling
41.4 Event Extraction Using Dependency Parsing
41.5 Experiments and Evaluations
41.6 Conclusions
Acknowledgments
References
Part J: High-Performance Computing for Biological Data Mining
Chapter 42: Accelerating Pairwise Alignment Algorithms by Using Graphics Processor Units
42.1 Introduction
42.2 Pairwise Alignment Algorithms
42.3 Graphics Processor Units
42.4 Accelerating Pairwise Alignment Algorithms
42.5 Conclusion
References
Chapter 43: High-Performance Computing in High-Throughput Sequencing
43.1 Introduction
43.2 Next-Generation Sequencing Applications
43.3 High-Performance Computing Architectures: Short Summary
43.4 High-Performance Computing on Next-Generation Sequencing Data
43.5 Summary
References
Chapter 44: Large-scale clustering of short reads for metagenomics on GPUs
44.1 Introduction
44.2 Background
44.3 Pairwise Global Alignment
44.4 GPU programming
44.5 CRiSPy-CUDA
44.6 Experiments
44.7 Conclusions
References
Section III: Biological Data Postprocessing
Part K: Biological Knowledge Integration and Visualization
Chapter 45: Integration of Metabolic Knowledge for Genome-Scale Metabolic Reconstruction
45.1 Introduction
45.2 Omics ERA
45.3 Metabolic Network Modeling
45.4 History of Genome-Scale Models
45.5 How Genome-Scale Metabolic Models Can Be Generated
45.6 Applications
45.7 Biochemical Pathways and Genome Annotation Databases
45.8 Conclusion
References
Chapter 46: Inferring and Postprocessing Huge Phylogenies
46.1 Introduction
46.2 Recent Advances
46.3 Data Avalanche: Example with rbcL
46.4 Future Challenges and Opportunities
46.5 Conclusion
Acknowledgment
References
Chapter 47: Biological Knowledge Visualization
47.1 Introduction
47.2 Information Visualization and Visual Analytics
47.3 Biological Data Types
47.4 Biological Data Visualization Issues
47.5 Sequence Data Visualization
47.6 Relational and Functional Data Visualization
47.7 Expression Data Visualization
47.8 Structure Data Visualization
47.9 Conclusion and Future Perspectives
References
Chapter 48: Visualization of Biological Knowledge Based on Multimodal Biological Data
48.1 Introduction
48.2 Multimodal Biological Data
48.3 Approaches to Discover Knowledge from Multimodal Biological Data
48.4 Novel Approach for Visualization and Discovery of Biological Knowledge Based on Multimodal Biological Data
48.5 Conclusion
References
Index
Wiley Series on Bioinformatics: Computational Techniques and Engineering
A complete list of the titles in this series appears at the end of this volume.
Cover Design: Michael Rutkowski
Cover Image: ©iStockphoto/cosmin 4000
Copyright © 2014 by John Wiley & Sons, Inc. All rights reserved.
Published by John Wiley & Sons, Inc., Hoboken, New Jersey.
Published simultaneously in Canada.
No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning, or otherwise, except as permitted under Section 107 or 108 of the 1976 United States Copyright Act, without either the prior written permission of the Publisher, or authorization through payment of the appropriate per-copy fee to the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, (978) 750-8400, fax (978) 750-4470, or on the web at www.copyright.com. Requests to the Publisher for permission should be addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, (201) 748-6011, fax (201) 748-6008, or online at http://www.wiley.com/go/permission.
Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best efforts in preparing this book, they make no representations or warranties with respect to the accuracy or completeness of the contents of this book and specifically disclaim any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives or written sales materials. The advice and strategies contained herein may not be suitable for your situation. You should consult with a professional where appropriate. Neither the publisher nor author shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.
For general information on our other products and services or for technical support, please contact our Customer Care Department within the United States at (800) 762-2974, outside the United States at (317) 572-3993 or fax (317) 572-4002.
Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic formats. For more information about Wiley products, visit our web site at www.wiley.com.
Library of Congress Cataloging-in-Publication Data:
Elloumi, Mourad.
Biological knowledge discovery handbook: preprocessing, mining, and postprocessing of biological data / Mourad Elloumi, Albert Y. Zomaya.
pages cm. –(Wiley series in bioinformatics; 23)
ISBN 978-1-118-13273-9 (hardback)
1. Bioinformatics. 2. Computational biology. 3. Data mining. I. Zomaya, Albert Y. II. Title.
QH324.2.E45 2012
572.80285–dc23
2012042379
To my family for their patience and support.
Mourad Elloumi
To my mother for her many sacrifices over the years.
Albert Y. Zomaya
Preface
With the massive developments in molecular biology during the last few decades, we are witnessing an exponential growth of both the volume and the complexity of biological data. For example, the Human Genome Project provided the sequence of the 3 billion DNA bases that constitute the human genome. As a result, we also have the sequences of about 100,000 proteins. We are therefore entering the postgenomic era: having devoted so much effort to accumulating data, we must now devote as much effort, or even more, to analyzing it. Analyzing this huge volume of data is a challenging task not only because of its complexity and its many correlated factors but also because of the continuous evolution of our understanding of the biological mechanisms. Classical approaches to biological data analysis are no longer efficient and produce only a very limited amount of information compared to the numerous and complex biological mechanisms under study. Hence the need to use computer tools and to develop new in silico high-performance approaches to support us in the analysis of biological data and, in turn, to help us understand the correlations that exist between, on one hand, structures and functional patterns of biological sequences and, on the other hand, genetic and biochemical mechanisms. Knowledge discovery and data mining (KDD) are a response to these new trends.
Knowledge discovery is a field where we combine techniques from algorithmics, soft computing, machine learning, knowledge management, artificial intelligence, mathematics, statistics, and databases to deal with the theoretical and practical issues of extracting knowledge, that is, new concepts or concept relationships, hidden in volumes of raw data. The knowledge discovery process is made up of three main phases: data preprocessing, data processing (also called data mining), and data postprocessing. Knowledge discovery offers the capacity to automate complex search and data analysis tasks. We distinguish two types of knowledge discovery systems: verification systems and discovery systems. Verification systems are limited to verifying the user's hypothesis, while discovery systems autonomously predict and explain new knowledge. The biological knowledge discovery process should take into account both the characteristics of the biological data and the general requirements of the knowledge discovery process.
Data mining is the main phase in the knowledge discovery process. It consists of extracting nuggets of information, that is, pertinent patterns, pattern correlations, and estimations or rules, hidden in huge bodies of data. The extracted information will be used in the verification of the hypothesis or the prediction and explanation of knowledge. Biological data mining aims at extracting motifs, functional sites, or clustering/classification rules from biological sequences.
Biological KDD are complementary to laboratory experimentation and help to speed up and deepen research in modern molecular biology. They promise to bring us new insights into the growing volumes of biological data.
This book is a survey of the most recent developments in techniques and approaches in the field of biological KDD. It presents the results of the latest investigations in this field. The techniques and approaches presented deal with the most important and/or the newest topics encountered in this field. Some of these techniques and approaches represent improvements on old ones while others are completely new. Most other books on biological KDD either lack technical depth or focus on specific topics. This book is the first overview of techniques and approaches in biological KDD with both broad coverage of the field and enough depth to be of practical use to professionals. The biological KDD techniques and approaches presented here combine sound theory with truly practical applications in molecular biology. This book will be valuable to people interested in the growing field of biological KDD who wish to discover both the fundamentals behind biological KDD techniques and approaches and the applications of these techniques and approaches in this field. It can also serve as a reference for courses on bioinformatics and biological KDD. This book is therefore designed not only for practitioners and professional researchers in computer science, life science, and mathematics but also for graduate students and young researchers looking for promising directions in their work. It will point them to new techniques and approaches that may be the key to new and important discoveries in molecular biology.
This book is organized into 11 parts: Biological Data Management, Biological Data Modeling, Biological Feature Extraction, Biological Feature Selection, Regression Analysis of Biological Data, Biological Data Clustering, Biological Data Classification, Association Rules Learning from Biological Data, Text Mining and Application to Biological Data, High-Performance Computing for Biological Data Mining, and Biological Knowledge Integration and Visualization. The 48 chapters that make up the 11 parts were carefully selected to provide a wide scope with minimal overlap between chapters. Each contributor was asked to cover review material as well as current developments in his or her chapter. In addition, the authors chosen are leaders in their respective fields.
Mourad Elloumi and Albert Y. Zomaya
Contributors
Jad Abbass, Faculty of Science, Engineering and Computing, Kingston University, London, United Kingdom and Department of Computer Science and Mathematics, Lebanese American University, Beirut, Lebanon
Muhammad Abulaish, Center of Excellence in Information Assurance, King Saud University, Riyadh, Saudi Arabia and Department of Computer Science, Jamia Millia Islamia (A Central University), New Delhi, India
Syed Toufeeq Ahmed, Vanderbilt University Medical Center, Nashville, Tennessee
Shiva Akbari-Birgani, Laboratory of Systems Biology and Bioinformatics, Institute of Biochemistry and Biophysics, University of Tehran, Tehran, Iran
Ali Al Mazari, School of Information Technologies, The University of Sydney, Sydney, Australia
Mohamed Al Sayed Issa, Computers and Systems Department, Faculty of Engineering, Zagazig University, Egypt
Yazdan Asgari, Laboratory of Systems Biology and Bioinformatics, Institute of Biochemistry and Biophysics, University of Tehran, Tehran, Iran
Wassim Ayadi, Laboratory of Technologies of Information and Communication and Electrical Engineering (LaTICE) and LERIA, University of Angers, Angers, France
Haider Banka, Department of Computer Science and Engineering, Indian School of Mines, Dhanbad, India
Laure Berti-Équille, Institut de Recherche pour le Développement, Montpellier, France
Gianluca Bontempi, Machine Learning Group, Computer Science Department, Université Libre de Bruxelles, Brussels, Belgium
Nigel P. Brown, BioQuant, University of Heidelberg, Heidelberg, Germany
Giulia Bruno, Dipartimento di Ingegneria Gestionale e della Produzione, Politecnico di Torino, Torino, Italy
David Campos, DETI/IEETA, University of Aveiro, Aveiro, Portugal
Jessica Andrea Carballido, Laboratorio de Investigación y Desarrollo en Computación Científica (LIDeCC), Dept. Computer Science and Engineering, Universidad Nacional del Sur, Bahía Blanca, Argentina
Luciano Cascione, Department of Clinical and Molecular Biomedicine, University of Catania, Italy
Ümit V. Çatalyürek, Department of Biomedical Informatics, The Ohio State University, Columbus, Ohio
Carlo Cattani, Department of Mathematics, University of Salerno, Fisciano (SA), Italy
Meghana Chitale, Department of Computer Science, Purdue University, West Lafayette, Indiana
Young-Rae Cho, Department of Computer Science, Baylor University, Waco, Texas
Kwok Pui Choi, Department of Statistics and Applied Probability, National University of Singapore, Singapore
Matteo Comin, Department of Information Engineering, University of Padova, Padova, Italy
Francesca Cordero, Department of Computer Science, University of Torino, Turin, Italy
Suresh Dara, Department of Computer Science and Engineering, Indian School of Mines, Dhanbad, India
Bhaskar DasGupta, Department of Computer Science, University of Illinois at Chicago, Chicago, Illinois
Hasan Davulcu, Department of Computer Science and Engineering, Ira A. Fulton Engineering, Arizona State University, Tempe, Arizona
Mourad Elloumi, Laboratory of Technologies of Information and Communication and Electrical Engineering (LaTICE) and University of Tunis-El Manar, Tunisia
Juan Esquivel-Rodríguez, Department of Computer Science, Purdue University, West Lafayette, Indiana
Alfredo Ferro, Department of Clinical and Molecular Biomedicine, University of Catania, Italy
Alessandro Fiori, Dipartimento di Automatica e Informatica, Politecnico di Torino, Torino, Italy
Adelaide Valente Freitas, DMat/CIDMA, University of Aveiro, Portugal
Terry Gaasterland, Scripps Genome Center, University of California San Diego, San Diego, California
Cristian Andrés Gallo, Laboratorio de Investigación y Desarrollo en Computación Científica (LIDeCC), Dept. Computer Science and Engineering, Universidad Nacional del Sur, Bahía Blanca, Argentina
Roger J. Garsia, Department of Clinical Immunology, Royal Prince Alfred Hospital, Sydney, Australia
Raffaele Giancarlo, Department of Mathematics and Informatics, University of Palermo, Palermo, Italy
Rosalba Giugno, Department of Clinical and Molecular Biomedicine, University of Catania, Italy
Jin-Kao Hao, LERIA, University of Angers, Angers, France
Ayat Hatem, Department of Biomedical Informatics, The Ohio State University, Columbus, Ohio
Heiko Horn, Department of Disease Systems Biology, The Novo Nordisk Foundation Center for Protein Research, Faculty of Health Sciences, University of Copenhagen, Copenhagen, Denmark
Ting Hu, Computational Genetics Laboratory, Geisel School of Medicine, Dartmouth College, Lebanon, New Hampshire
Kun Huang, Department of Biomedical Informatics, The Ohio State University, Columbus, Ohio
Zina M. Ibrahim, Social Genetic and Developmental Psychiatry Centre, King's College London, London, United Kingdom
Dino Ienco, Institut de Recherche en Sciences et Technologies pour l'Environnement, Montpellier, France
Costas S. Iliopoulos, Department of Informatics, King's College London, Strand, London, United Kingdom and Digital Ecosystems & Business Intelligence Institute, Curtin University, Centre for Stringology & Applications, Perth, Australia
Jahiruddin, Department of Computer Science, Jamia Millia Islamia (A Central University), New Delhi, India
Laetitia Jourdan, INRIA Lille Nord Europe, Villeneuve d'Ascq, France
Lakshmi Kaligounder, Department of Computer Science, University of Illinois at Chicago, Chicago, Illinois
Radha Krishna Murthy Karuturi, Computational and Mathematical Biology, Genome Institute of Singapore, Singapore
Khairul A. Kasmiran, School of Information Technologies, The University of Sydney, Sydney, Australia
Ioannis Kavakiotis, Department of Informatics, Aristotle University of Thessaloniki, Thessaloniki, Greece
Kamer Kaya, Department of Biomedical Informatics, The Ohio State University, Columbus, Ohio
Catharina Maria Keet, School of Computer Science, University of KwaZulu-Natal, Durban, South Africa
Daisuke Kihara, Department of Computer Science, Purdue University, West Lafayette, Indiana and Department of Biological Sciences, Purdue University, West Lafayette, Indiana
Gaurav Kumar, Department of Chemistry and Biomolecular Sciences and ARC Centre of Excellence in Bioinformatics, Macquarie University, Sydney, Australia
Chee Keong Kwoh, School of Computer Engineering, Nanyang Technological University, Singapore
Giuseppe Lancia, Department of Mathematics and Informatics, University of Udine, Udine, Italy
Hee-Jin Lee, Department of Computer Science, Korea Advanced Institute of Science and Technology, Daejeon, South Korea
Juntao Li, Computational and Mathematical Biology, Genome Institute of Singapore, Singapore and Department of Statistics and Applied Probability, National University of Singapore, Singapore
Wentian Li, Robert S. Boas Center for Genomics and Human Genetics, Feinstein Institute for Medical Research, North Shore LIJ Health Systems, Manhasset, New York
Yehua Li, Department of Statistics and Statistical Laboratory, Iowa State University, Ames, Iowa
Charles Lindsey, StataCorp, College Station, Texas
Giosué Lo Bosco, Department of Mathematics and Informatics, University of Palermo, Palermo, Italy and I.E.ME.S.T., Istituto Euro Mediterraneo di Scienza e Tecnologia, Palermo, Italy
Nashat Mansour, Department of Computer Science and Mathematics, Lebanese American University, Beirut, Lebanon
Ali Masoudi-Nejad, Laboratory of Systems Biology and Bioinformatics, Institute of Biochemistry and Biophysics, University of Tehran, Tehran, Iran
Sérgio Matos, DETI/IEETA, University of Aveiro, Aveiro, Portugal
Patrick E. Meyer, Machine Learning Group, Computer Science Department, Université Libre de Bruxelles, Brussels, Belgium
Debahuti Mishra, Institute of Technical Education and Research, Siksha O Anusandhan University, Bhubaneswar, Odisha, India
Sashikala Mishra, Institute of Technical Education and Research, Siksha O Anusandhan University, Bhubaneswar, Odisha, India
Ahmed Mokaddem, Laboratory of Technologies of Information and Communication and Electrical Engineering (LaTICE) and University of Tunis-El Manar, El Manar, Tunisia
Kartick Chandra Mondal, Laboratory I3S, University of Nice Sophia-Antipolis, Sophia-Antipolis, France
Jason H. Moore, Computational Genetics Laboratory, Geisel School of Medicine, Dartmouth College, Lebanon, New Hampshire
Fouzia Moussouni, Université de Rennes 1, Rennes, France
Mohamed Nadif, LIPADE, University of Paris-Descartes, Paris, France
Radhika Nair, Department of Computer Science and Engineering, Ira A. Fulton Engineering, Arizona State University, Tempe, Arizona
Jean-Christophe Nebel, Faculty of Science, Engineering and Computing, Kingston University, London, United Kingdom
Alioune Ngom, School of Computer Science, University of Windsor, Windsor, Ontario, Canada
Thuy Diem Nguyen, School of Computer Engineering, Nanyang Technological University, Singapore
Oleg Okun, SMARTTECCO, Stockholm, Sweden
José Luis Oliveira, DETI/IEETA, University of Aveiro, Portugal
Hatice Gülçin Özer, Department of Biomedical Informatics, The Ohio State University, Columbus, Ohio
Evangelos Pafilis, Institute of Marine Biology Biotechnology and Aquaculture, Hellenic Centre for Marine Research, Heraklion, Crete, Greece
Jong C. Park, Department of Computer Science, Korea Advanced Institute of Science and Technology, Daejeon, South Korea
Nicolas Pasquier, Laboratory I3S, University of Nice Sophia-Antipolis, Sophia-Antipolis, France
Chintan Patel, Department of Computer Science and Engineering, Ira A. Fulton Engineering, Arizona State University, Tempe, Arizona
Yudi Pawitan, Department of Medical Epidemiology and Biostatistics, Karolinska Institute, Stockholm, Sweden
Ruggero G. Pensa, Department of Computer Science, University of Torino, Turin, Italy
Giuseppe Pigola, IGA Technology Services, Udine, Italy
Luca Pinello, Department of Biostatistics, Harvard School of Public Health, Boston, Massachusetts; Department of Biostatistics and Computational Biology, Dana-Farber Cancer Institute, Boston, Massachusetts; and I.E.ME.S.T., Istituto Euro Mediterraneo di Scienza e Tecnologia, Palermo, Italy
Solon P. Pissis, Department of Informatics, King's College London, Strand, London, United Kingdom
Alberto Policriti, Department of Mathematics and Informatics and Institute of Applied Genomics, University of Udine, Udine, Italy
Ignacio Ponzoni, Laboratorio de Investigación y Desarrollo en Computación Científica (LIDeCC), Dept. Computer Science and Engineering, Universidad Nacional del Sur, Bahía Blanca, Argentina and Planta Piloto de Ingeniería Química (PLAPIQUI) CONICET, Bahía Blanca, Argentina
Alfredo Pulvirenti, Department of Clinical and Molecular Biomedicine, University of Catania, Italy
Shoba Ranganathan, Department of Chemistry and Biomolecular Sciences and ARC Centre of Excellence in Bioinformatics, Macquarie University, Sydney, Australia
Hendrik Rohn, Leibniz Institute of Plant Genetics and Crop Plant Research (IPK), Gatersleben, Germany
Haifa Ben Saber, Laboratory of Technologies of Information and Communication and Electrical Engineering (LaTICE) and University of Tunis, Tunisia
Lee Sael, Department of Computer Science, Purdue University, West Lafayette, Indiana and Department of Biological Sciences, Purdue University, West Lafayette, Indiana
Ali Salehzadeh-Yazdi, Laboratory of Systems Biology and Bioinformatics, Institute of Biochemistry and Biophysics, University of Tehran, Tehran, Iran
Rodrigo Santamaría, Department of Computer Science and Automation, University of Salamanca, Salamanca, Spain
Bertil Schmidt, Institut für Informatik, Johannes Gutenberg University, Mainz, Germany
Falk Schreiber, Leibniz Institute of Plant Genetics and Crop Plant Research (IPK), Gatersleben, Germany and Institute of Computer Science, Martin Luther University Halle-Wittenberg, Halle, Germany
Khedidja Seridi, INRIA Lille Nord Europe, Villeneuve d'Ascq, France
Kailash Shaw, Department of CSE, Gandhi Engineering College, Bhubaneswar, Odisha, India
Simon J. Sheather, Department of Statistics, Texas A&M University, College Station, Texas
Stephen A. Smith, Department of Ecology and Evolutionary Biology, University of Michigan, Ann Arbor, Michigan
Junilda Spirollari, Department of Computer Science, New Jersey Institute of Technology, Newark, NJ
Alexandros Stamatakis, Scientific Computing Group, Heidelberg Institute for Theoretical Studies, Heidelberg, Germany
El-Ghazali Talbi, INRIA Lille Nord Europe, Villeneuve d'Ascq, France
Kean Ming Tan, Department of Statistics, Purdue University, West Lafayette, Indiana
Xin Lu Tan, Department of Statistics, Purdue University, West Lafayette, Indiana
Bahar Taneri, Department of Biological Sciences, Eastern Mediterranean University, Famagusta, North Cyprus and Institute for Public Health Genomics, Cluster of Genetics and Cell Biology, Faculty of Health, Medicine and Life Sciences, Maastricht University, The Netherlands
Mingjie Tang, Department of Computer Science, Purdue University, West Lafayette, Indiana
Ahmed Y. Tawfik, Information Systems Department, French University of Egypt, El-Shorouk, Egypt
Sukru Tikves, Department of Computer Science and Engineering, Ira A. Fulton Engineering, Arizona State University, Tempe, Arizona
George Tzanis, Department of Informatics, Aristotle University of Thessaloniki, Thessaloniki, Greece
Filippo Utro, Computational Genomics Group, IBM T.J. Watson Research Center, Yorktown Heights, New York
Davide Verzotto, Department of Information Engineering, University of Padova, Padova, Italy
Francesco Vezzi, Department of Mathematics and Informatics and Institute of Applied Genomics, University of Udine, Udine, Italy
Alessia Visconti, Department of Computer Science, University of Torino, Turin, Italy
Ioannis Vlahavas, Department of Informatics, Aristotle University of Thessaloniki, Thessaloniki, Greece
Jason T. L. Wang, Department of Computer Science, New Jersey Institute of Technology, Newark, NJ
Penghao Wang, School of Mathematics and Statistics, The University of Sydney, Sydney, Australia
Dongrong Wen, Department of Computer Science, New Jersey Institute of Technology, Newark, NJ
Pengyi Yang, School of Information Technologies, University of Sydney, Sydney, Australia
Jean Yee-Hwa Yang, School of Mathematics and Statistics, University of Sydney, Sydney, Australia
Yaning Yang, Department of Statistics and Finance, University of Science and Technology of China, Hefei, China
Zejun Zheng, Singapore Institute for Clinical Sciences, Singapore
Ling Zhong, Department of Computer Science, New Jersey Institute of Technology, Newark, NJ
Bing B. Zhou, School of Information Technologies, University of Sydney, Sydney, Australia
Albert Y. Zomaya, School of Information Technologies, University of Sydney, Sydney, Australia
Section I
Biological Data Preprocessing
Part A
Biological Data Management
Chapter 1
Genome and Transcriptome Sequence Databases for Discovery, Storage, and Representation of Alternative Splicing Events
Bahar Taneri1,2 and Terry Gaasterland3
1Department of Biological Sciences, Eastern Mediterranean University, Famagusta, North Cyprus
2Institute for Public Health Genomics, Cluster of Genetics and Cell Biology, Faculty of Health, Medicine and Life Sciences, Maastricht University, The Netherlands
3Scripps Genome Center, University of California San Diego, San Diego, California
Transcription is a critical cellular process through which RNA molecules specify which proteins are expressed from the genome within a given cell. DNA is transcribed into RNA, and RNA transcripts are then translated into proteins, which carry out numerous functions within cells. Prior to protein synthesis, RNA transcripts undergo several modifications including 5′ capping, 3′ polyadenylation, and splicing [1]. Precursor messenger RNA (pre-mRNA) processing determines the mature mRNA's stability, its localization within the cell, and its interaction with other molecules [2]. In addition to constitutive splicing, the majority of eukaryotic genes undergo alternative splicing and therefore code for proteins with diverse structures and functions.
In this chapter, we describe the process of RNA splicing and focus on RNA alternative splicing. As described in detail below, splicing removes noncoding introns from the pre-mRNA and ligates the coding exonic sequences to produce the mRNA transcript. Alternative splicing is a cellular process by which several different combinations of exon–intron architectures are achieved with different mRNA products from the same gene. This process generates several mRNAs with different sequences from a single gene by making use of alternative splice sites of exons and introns. This process is critical in eukaryotic gene expression and plays a pivotal role in increasing the complexity and coding potential of genomes. Since alternative splicing presents an enormous source of diversity and greatly elevates the coding capacity of various genomes [3–5], we devote this chapter to this cellular phenomenon, which is widespread across eukaryotic genomes.
In particular we explain the databases for Alternative Splicing Queries (dbASQ), a computational pipeline we used to generate alternative splicing databases for genome and transcriptome sequences of various organisms. dbASQ enables the use of genome and transcriptome sequence data of any given organism for database development. Alternative splicing databases generated via dbASQ not only store the sequence data but also facilitate the detection and visualization of alternative splicing events for each gene in each genome analyzed. Data mining of the alternative splicing databases, generated using the dbASQ system, enables further analysis of this cellular process, providing biological answers to novel scientific questions.
This chapter provides a general overview of alternative splicing, a widespread cellular phenomenon, and takes a computational approach to answering biological questions about it. We give a general introduction to splicing and alternative splicing along with their mechanism and regulation, and we briefly discuss the evolution and conservation of alternative splicing. Our main focus is the computational tools used in generating alternative splicing databases. We explain the content and the utility of alternative splicing databases for five different eukaryotic organisms: human, mouse, rat, fruit fly, and soil worm. We cover genomic and transcriptomic sequence analyses and data mining from alternative splicing databases in general.
A typical mammalian gene consists of multiple exons separated by introns. Exons are relatively short, about 145 nucleotides, and are interrupted by much longer introns of about 3300 nucleotides [6, 7]. In humans, the average number of exons per protein-coding gene is 8.8 [7]. Both introns and exons of a protein-coding gene are transcribed into a pre-mRNA molecule [1]. Approximately 90% of the pre-mRNA molecule is composed of introns, and these are removed before translation. Several processes must take place before the mRNA molecule transcribed from the gene can be translated into a protein molecule. While an average human protein-coding gene spans about 27,000 bp in the genome and in the pre-mRNA molecule, the processed mRNA contains only about 1300 coding nucleotides and 1000 nucleotides in the untranslated regions (UTRs) and polyadenylation (poly A) tail. The removal of introns and ligation of exons are referred to as the splicing process or the RNA splicing process [1, 7]. Splicing takes place in the nucleus. The final products of splicing, the ligated exonic sequences, are ready for translation and are exported out of the nucleus [1].
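The averages quoted above can be checked against each other with a little arithmetic. The following Python sketch is only a back-of-the-envelope illustration: it assumes that introns occur only between consecutive exons and that the quoted exon length applies to coding exons, with the UTR and poly A figure added separately. The function and variable names are hypothetical.

```python
# Averages quoted in the text for human protein-coding genes.
MEAN_EXON_LEN = 145        # nucleotides per exon
MEAN_INTRON_LEN = 3300     # nucleotides per intron
MEAN_EXONS_PER_GENE = 8.8  # exons per gene
UTR_AND_POLYA_LEN = 1000   # UTRs plus poly A tail, in nucleotides


def gene_structure_estimate(n_exons=MEAN_EXONS_PER_GENE,
                            exon_len=MEAN_EXON_LEN,
                            intron_len=MEAN_INTRON_LEN,
                            utr_len=UTR_AND_POLYA_LEN):
    """Estimate exonic, intronic, and total lengths from average values."""
    # Assumption: one intron sits between each pair of neighboring exons.
    n_introns = n_exons - 1
    exonic = n_exons * exon_len
    intronic = n_introns * intron_len
    total = exonic + intronic + utr_len
    return {
        "exonic_nt": exonic,
        "intronic_nt": intronic,
        "total_nt": total,
        "intron_fraction": intronic / total,
    }


estimate = gene_structure_estimate()
for key, value in estimate.items():
    print(f"{key}: {value:,.2f}")
```

Under these assumptions the exonic total comes out close to the 1300 coding nucleotides cited, the overall length lands near the quoted 27,000 bp, and introns account for roughly nine-tenths of the transcript.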
Simply, splicing refers to removal of intervening sequences from the pre-mRNA molecule and ligation of the exonic sequences. Each single splicing event removes one intron and ligates two exons. This process takes place via two steps of chemical reactions [1]. As shown in Figure 1.1, within the intronic sequence there is a particular adenine nucleotide which attacks the 5′ intronic splice site. A covalent bond is formed between the 5′ splice site of the intron and the adenine nucleotide releasing the exon upstream of the intron. In the second chemical reaction, the free 3′-OH group at the 3′ end of the upstream exon ligates with the 5′ end of the downstream exon. In this process, the intronic sequence, which contains an RNA loop, is released.
Figure 1.1 Illustration of two chemical reactions needed for one splicing reaction (A: adenine nucleotide at branch point of intron).
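At the sequence level, the net effect of a splicing event (one intron removed, two exons ligated) can be sketched as a string operation. The Python example below is a toy illustration and not a model of the chemistry: the function name `splice`, the half-open coordinate convention, and the short sequence are assumptions made for this example.

```python
def splice(pre_mrna, introns):
    """Remove introns and join the remaining exonic sequence.

    `introns` is a list of 0-based, half-open (start, end) intervals.
    """
    exons = []
    position = 0
    for start, end in sorted(introns):
        # Reject empty, overlapping, or out-of-range intervals.
        if start < position or start >= end or end > len(pre_mrna):
            raise ValueError(f"invalid intron interval: ({start}, {end})")
        exons.append(pre_mrna[position:start])  # exon upstream of this intron
        position = end                          # skip over the intron
    exons.append(pre_mrna[position:])           # last exon
    return "".join(exons)


# Toy pre-mRNA: exon 1, one intron (starts with GU, ends with AG), exon 2.
exon1, intron, exon2 = "AUGGCC", "GUAAGUCUAACUUUUCAG", "GAAUAA"
pre_mrna = exon1 + intron + exon2
mrna = splice(pre_mrna, [(len(exon1), len(exon1) + len(intron))])
print(mrna)
```

Each interval passed to `splice` corresponds to one splicing event in the sense used above.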
There are many cis-acting and trans-acting factors involved in splicing. The network of these factors facilitates splicing through exon definition and intron definition. Exon definition occurs early in splicing and involves interactions recognizing the exonic 5′ splice site and 3′ splice site, whereas for intron definition initial interactions take place across the intron for the recognition of 5′ and 3′ splice sites of the intron [8]. Splicing is regulated by a dynamic combinatorial network of RNA and protein molecules. The spliceosome, the splicing machinery, is a very complex system that includes five small nuclear RNAs (snRNAs), termed U1, U2, U4, U5, and U6 [1]. These are short RNA sequences about 200 nucleotides long. In addition to the snRNAs, about 100 proteins are part of the spliceosome. Assembly of snRNAs with the proteins forms small nuclear ribonucleoprotein complexes (snRNPs), which precisely bind to splice sites on the pre-mRNA to facilitate splicing [9]. Figure 1.2 shows the main steps of spliceosome assembly in the cell. Initially the 5′ intronic splice site interacts with U1. Then U2 interacts with the branch point. Next, U1 is replaced by the U4/U6, U5 complex, which then interacts with U2, initiating intronic lariat formation. It is thought that the complex molecular content and assembly of the spliceosome reflect the need for highly accurate splicing in order to prevent formation of malfunctional or nonfunctional protein molecules.
Figure 1.2 Spliceosome assembly (U1, U2, U4, U5, U6: snRNAs; GU: guanine and uracil nucleotides forming 5′ splice site signal; AG: adenine and guanine nucleotides forming 3′ splice site signal).
In addition to the complex splicing machinery in the cell, specific sequence signals are needed for splicing to occur. There are four main sequence signals on the pre-mRNA molecule which play important roles in splicing. As shown in Figure 1.3, these are the 5′ splice site (exon–intron junction at the 5′ end of the intron), the 3′ splice site (exon–intron junction at the 3′ end of the intron), the branch point (specific sequence slightly upstream of the 3′ splice site), and the polypyrimidine tract (between the branch point and the 3′ splice site). These sequences facilitate the two transesterification reactions involved in intron removal and exon ligation.
Figure 1.3 Splicing signals on pre-mRNA molecule (GU: guanine and uracil nucleotides forming 5′ splice site signal; AG: adenine and guanine nucleotides forming 3′ splice site signal; A: adenine nucleotide at branch point of intron; polypyrimidine tract: pyrimidine-rich short sequence close to 3′ splice site).
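The four signals can be turned into a naive scan for candidate introns. The Python sketch below is a deliberately simplified illustration, not a splice-site predictor: the thresholds, the function name `candidate_introns`, and the toy sequence are assumptions, and real introns are far longer than the ones accepted here.

```python
import re


def candidate_introns(pre_mrna, min_len=12, tract_len=6,
                      min_pyrimidine_fraction=0.75):
    """Return half-open (start, end) intervals that look like introns.

    A candidate must start with GU, end with AG, carry a pyrimidine-rich
    tract immediately upstream of the AG, and contain at least one A
    (a possible branch point) between the GU and that tract.
    """
    seq = pre_mrna.upper().replace("T", "U")
    donors = [m.start() for m in re.finditer("(?=GU)", seq)]
    acceptors = [m.start() for m in re.finditer("(?=AG)", seq)]
    hits = []
    for donor in donors:
        for acceptor in acceptors:
            end = acceptor + 2
            tract_start = acceptor - tract_len
            # Skip candidates that are too short to hold all four signals.
            if end - donor < min_len or tract_start < donor + 2:
                continue
            tract = seq[tract_start:acceptor]
            pyrimidines = sum(base in "CU" for base in tract) / tract_len
            branch_region = seq[donor + 2:tract_start]
            if pyrimidines >= min_pyrimidine_fraction and "A" in branch_region:
                hits.append((donor, end))
    return hits


toy = "AUGGCC" + "GUAAGUCUAACUUUUCAG" + "GAAUAA"
hits = candidate_introns(toy)
print(hits)
```

Even on this short toy sequence the scan reports two overlapping candidates that share a 3′ end, which illustrates why consensus signals alone leave splice site choice ambiguous.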
However, these sequences are not sufficient for alternative splice site selection. There are multiple other sequence signals involved in alternative splicing. There are several types of cis-acting regulatory sequences for splicing within the RNA molecule termed enhancers and silencers, which stimulate or suppress splicing, respectively. Exonic splicing enhancers (ESEs), exonic splicing silencers (ESSs), intronic splicing enhancers (ISEs), and intronic splicing silencers (ISSs) are among the cis-acting splicing regulatory sequences.
Here, we provide an example of ESE regulatory function. ESEs act as binding sites for regulatory RNA binding proteins (RBPs), particularly as binding sites for SR proteins (proteins rich in serine–arginine). SR proteins have two RNA recognition motifs (RRMs) and one arginine–serine rich domain (RS domain). SR proteins bind to RNA sequence motifs via their RRM domains [10], and they recruit the spliceosome to the splice site via their RS domain. By this process the SR proteins enable exon definition [6]. SR proteins recruit the basal splicing machinery to the RNA; therefore they are required for both constitutive and alternative splicing. Figure 1.4 illustrates SR protein binding to ESEs on the RNA molecule. In addition, SR proteins work as inhibitors of splicing inhibitory proteins binding to ESS sites close to ESEs, where SRs are bound (Figure 1.4). Many exons contain ESEs, which overall have varying sequences [8].
Figure 1.4 SR protein binding on pre-mRNA: SR inhibition of splicing inhibitory protein.
Though less well understood than ESEs, ESSs are known negative regulators of splicing. They interact with repressor heterogeneous nuclear ribonucleoproteins (hnRNPs) to silence splicing [11]. Certain trans-acting splicing regulatory proteins could bind to ESS sequences causing exon skipping [12]. Similarly, intronic sequences can act both as enhancers and silencers of splicing events. Certain intronic sequences function as ISEs and can enhance the splicing of their upstream exon [8]. Certain ISSs could signal for repressor protein binding. For example, specifically YCAY motifs, where Y denotes a pyrimidine (U or C), signal for NOVA binding (a neuron-specific splicing regulatory protein). These particular sequences can act as ISSs depending on their location within the pre-mRNA molecule [13]. ISSs are further discussed in Section 1.3.3.
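Short regulatory motifs such as YCAY lend themselves to simple pattern matching. The Python sketch below locates every YCAY occurrence in an RNA sequence, including overlapping ones. The function name and the example sequence are hypothetical, and finding a motif this way says nothing about whether NOVA binds there or whether the site acts as a silencer.

```python
import re

# Y denotes a pyrimidine (U or C), so YCAY expands to [CU]CA[CU].
# The lookahead lets overlapping occurrences be reported as well.
YCAY = re.compile(r"(?=([CU]CA[CU]))")


def find_ycay(rna):
    """Return (position, motif) pairs for every YCAY match (0-based)."""
    seq = rna.upper().replace("T", "U")  # accept DNA-style input too
    return [(m.start(), m.group(1)) for m in YCAY.finditer(seq)]


sites = find_ycay("GGUCAUCAUCACGG")
print(sites)
```

Clusters of such matches, together with their position relative to an alternative exon, would be the natural next thing to examine, in line with the location dependence noted above.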
Alternative splicing is a widespread phenomenon across and within the eukaryotic genomes. Of the estimated 25,000 protein-coding genes in human, ∼90% are predicted to be alternatively spliced [14]. The impact of alternative splicing is widespread on the eukaryotic organisms' gene expression in general [5]. Earlier studies have shown that the majority of the immune system and the nervous system genes exhibit alternative splicing [15]. We have previously shown that the majority of mouse transcription factors are alternatively spliced, leading to protein domain architecture changes [16]. Below, we detail different types of alternative splicing and the mechanism and regulation of this cellular process. We mention the evolution and conservation of alternative splicing across different genomes.
Types of Alternative Splicing
Alternative splicing of the pre-mRNA molecule can occur in several different ways. Figure 1.5 shows different types of alternative splicing events, which include the presence or absence of cassette exons, mutually exclusive exons, intron retention, and various forms of length variation. A given RNA transcript can contain multiple different types of alternative splicing.
Examples of Widespread Presence of Alternative Splicing in Eukaryotic Genes
Alternative splicing is a well-documented, widespread phenomenon across the eukaryotic genomes. Here, we provide two interesting examples of alternatively spliced genes, one from Drosophila melanogaster and the other from the human genome. One of the most interesting examples of alternative splicing involves the Down syndrome cell adhesion molecule (Dscam) gene of D. melanogaster. There are 95 cassette exons in this gene, and a total of 38,016 different RNA transcripts can potentially be generated from it through differential use of the exon–intron structure [5, 17]. The Dscam example illustrates the enormous coding-changing capacity of alternative splicing and its influence on the variation of gene expression within and across cells [5]. The human KCNMA1 gene presents another interesting case of alternative splicing. This gene exhibits both cassette exons and exons with length variation at 5′ and 3′ ends. These alternative exons generate over 500 different RNA transcripts [5].
Figure 1.5 Types of alternative splicing: (a) cassette exon, present in or absent from the RNA transcript in its entirety; (b) mutually exclusive exons, only one present in any given RNA transcript; (c) intron retention; (d) length-variant exon, nucleotide length variation possible on both 5′ and 3′ ends or on either end (only use of alternative 5′ splice site shown, use of alternative 3′ splice site not shown).
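The Dscam figures can be reproduced with simple combinatorics. The sketch below assumes the cluster sizes commonly reported in the wider literature for this gene (12, 48, 33, and 2 alternative exons in the exon 4, 6, 9, and 17 clusters); these sizes are not stated in this chapter and are used here as an assumption, along with the rule that each transcript takes exactly one exon from each cluster.

```python
from math import prod

# Assumed cluster sizes (from the wider literature, not from this chapter).
DSCAM_CLUSTERS = {"exon 4": 12, "exon 6": 48, "exon 9": 33, "exon 17": 2}

# Total number of alternative exons across the four clusters.
total_alternative_exons = sum(DSCAM_CLUSTERS.values())

# One exon is chosen per cluster, so the choices multiply.
potential_transcripts = prod(DSCAM_CLUSTERS.values())

print(total_alternative_exons, potential_transcripts)
```

Under these assumptions the sum gives 95 alternative exons and the product gives 38,016 potential transcripts, matching the numbers quoted above.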
The mechanism of alternative splicing mainly involves interaction of cis-acting and trans-acting splicing factors. Recruitment of the splicing machinery to the correct splice sites, blocking of certain splice sites, and enhancing the use of other splice sites all contribute to this process [5]. Furthermore, RNA splicing and transcription are temporally and spatially coordinated. As the pre-mRNA is transcribed, splicing starts to take place [2]. Alternative splicing co-occurs with transcription and may be dependent on the promoter region of the gene. Different promoters might recruit different amounts of SR proteins, or they might recruit fast- or slow-acting RNA polymerases, which changes the course of splicing. Promoters that recruit slow-acting polymerases give more opportunity for exon inclusion, whereas those that recruit fast-acting polymerases promote exon exclusion [18]. Furthermore, epigenetics plays a role in the process of alternative splicing. The dynamic chromatin structure, which affects transcription, is also implicated in alternative splicing [19]. In addition, it has been shown that histone modification takes place differentially in the areas with constitutive exons compared to those with alternative exons [20, 21].
Alternative splicing is tissue specific and dependent on developmental stage and/or physiological condition [5, 22], and it is regulated accordingly. Complex interactions between cis regulatory sequences and trans regulatory factors, such as RNA binding proteins, lead to tissue-specific, cell-specific, developmental stage–dependent, and physiological condition–dependent regulation of splicing [23–26]. An example of cis-acting regulation is the ISS-based alternative exon exclusion. Inclusion of an alternative exon depends on several factors, including the affinity and the concentrations of positive and negative regulators of splicing. ISSs flank the alternative exons on both sides and can bind the negative regulators of splicing. Protein–protein interaction among these negative regulators results in alternative exon skipping [6]. Figure 1.6 shows ISS regulation leading to exon exclusion from the mRNA.
Figure 1.6 ISS-based exon exclusion (black structure: regulatory protein).
The RNA splicing process is thought to have originated from Group II introns with autocatalytic function [47, 48]. Evolutionary advantages of splicing and alternative splicing stem from various exon–intron rearrangements, which would allow for emergence of new proteins with different functions [1]. The basic splicing machinery and alternative splicing are evolutionarily conserved across species [47, 49–51]. Bioinformatic analyses have shown that alternative exons and their flanking introns are conserved to higher levels than constitutive exons [52, 53]. When compared across species, alternative exons and their splice sites are conserved, indicating their functional roles [54, 55]. Similar sequence characteristics of alternative splicing events across different species likewise indicate that these events are functionally significant. Mouse and human genes are highly conserved: about 80% of the mouse genes have human orthologs. The Mouse Genome Sequencing Consortium (2002) indicated that more than 90% of the human and mouse genomes are within conserved syntenic regions. Cross-species analyses between these two species with whole-genome sequence alignments revealed the conserved splicing events [50].
In the genome era, availability of genomic sequences and the wide range of transcript sequence data enabled detailed bioinformatic analyses of alternative splicing. Multiple-sequence alignment approaches have been widely used within and across species in order to detect alternative exons and other alternative splicing events within transcriptomes [56–60]. In this section, we provide a brief overview of various alternative splicing databases and we focus on describing alternative splicing databases developed using the dbASQ system and a wide range of genome and transcriptome sequence data. The databases described here identify, classify, compute, and store alternative splicing events. In addition, they answer biological queries about current and novel splice variants within various genomes.
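Once transcripts have been aligned to the genome, detecting an alternative exon reduces to comparing exon coordinates between transcripts of the same gene. The Python sketch below shows one simplified way to flag cassette exons. It is not the dbASQ algorithm: the function name, the input layout, and the requirement that exon boundaries match exactly are assumptions made for illustration.

```python
def cassette_exons(transcripts):
    """Flag exons that are included in one transcript and skipped in another.

    `transcripts` maps a transcript name to its exons, given as
    (start, end) genomic coordinates in order along the gene.
    An internal exon counts as a cassette exon when some other transcript
    joins its two flanking exons directly.
    """
    # Pairs of consecutive exons, i.e. the splice junctions of each transcript.
    junctions = {
        name: set(zip(exons, exons[1:]))
        for name, exons in transcripts.items()
    }
    events = set()
    for name, exons in transcripts.items():
        for upstream, exon, downstream in zip(exons, exons[1:], exons[2:]):
            for other, other_junctions in junctions.items():
                if other != name and (upstream, downstream) in other_junctions:
                    events.add(exon)
    return sorted(events)


# Hypothetical gene with two aligned transcripts.
example = {
    "transcript_1": [(0, 100), (200, 260), (400, 500)],
    "transcript_2": [(0, 100), (400, 500)],
}
skipped = cassette_exons(example)
print(skipped)
```

Real alignments are noisier than this, so a practical pipeline would also need to tolerate small boundary differences and to handle the other event types, such as intron retention and length-variant exons.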
Over the last decade, bioinformatics tools have accelerated computational analyses of alternative splicing and data generation in this field. The storage and representation of sequence data, in particular, enabled the collection of alternative splicing data in the form of databases. Table 1.1 lists alternative splicing databases along with a literature source for each. (This list is extensive but may not be complete at the time of publication.) In the next section we detail the generation and utility of five specific alternative splicing databases, generally called splicing databases (SDBs), built using the computational pipeline system dbASQ.
Table 1.1 Alternative Splicing Databases.
Alternative SDB | Description | Reference
ASPicDB | Database of annotated transcript and protein variants generated by alternative splicing | [61]
TassDB2 | Comprehensive database of subtle alternative splicing events | [62]
H-DBAS | Human transcriptome database for alternative splicing | [63]
ASTD | Alternative splicing and transcript diversity database | [64]
AS-ALPS | Database for analyzing effects of alternative splicing on protein structure, interaction, and network in human and mouse | [65]
ASMD | Alternative Splicing Mutation Database | [66]
ProSAS | Database for analyzing alternative splicing in context of protein structures | [67]
Fast DB | Analysis of regulation of expression and function of human alternative splicing variants | [68]
EuSplice | Analysis of splice signals and alternative splicing in eukaryotic genes | [69]
SpliceMiner | Database implementation of the National Center for Biotechnology Information (NCBI) Evidence Viewer for microarray splice-variant analysis | [70]
ECgene | Provides functional annotation for alternatively spliced genes | [71]
ASAP II | Analysis and comparative genomics of alternative splicing in 15 animal species | [72]
HOLLYWOOD | Comparative relational database of alternative splicing | [73]
ASD | Bioinformatics resource on alternative splicing | [74]
MAASE | Alternative splicing database designed for supporting splicing microarray applications | [75]
ASHESdb | Database of exon skipping | [76]
AVATAR | Database for genomewide alternative splicing event detection | [77]
DEDB | Database of D. melanogaster exons in splicing graph form | [78]
ASG | Database of splicing graphs for human genes | [79]
EASED | Extended alternatively spliced expressed sequence tag (EST) database | [80]
PASDB | Plant alternative splicing database | [81]
ProSplicer | Database of putative alternative splicing information | [82]
AsMamDB | Alternative splice database of mammals | [83]
SpliceDB | Database of canonical and noncanonical mammalian splice sites | [84]
ASDB | Database of alternatively spliced genes | [85]
It should be noted that, in addition to alternative splicing databases, various computational tools and platforms such as AspAlt [86] and SpliceCenter [87] have been developed to analyze alternative splicing across various genomes. Another example is the work of Suyama et al. [88], who focus on conserved regulatory motifs of alternative splicing. We do not provide an exhaustive list of such computational tools and platforms, as this is outside the scope of this chapter.
