Biological Knowledge Discovery Handbook - Mourad Elloumi - E-Book

Biological Knowledge Discovery Handbook E-Book

Mourad Elloumi

Description

The first comprehensive overview of preprocessing, mining, and postprocessing of biological data

Molecular biology is undergoing exponential growth in both the volume and complexity of biological data--and knowledge discovery offers the capacity to automate complex search and data analysis tasks. This book presents a vast overview of the most recent developments on techniques and approaches in the field of biological knowledge discovery and data mining (KDD)--providing in-depth fundamental and technical field information on the most important topics encountered.

Written by top experts, Biological Knowledge Discovery Handbook: Preprocessing, Mining, and Postprocessing of Biological Data covers the three main phases of knowledge discovery (data preprocessing, data processing--also known as data mining--and data postprocessing) and analyzes both verification systems and discovery systems.

BIOLOGICAL DATA PREPROCESSING
* Part A: Biological Data Management
* Part B: Biological Data Modeling
* Part C: Biological Feature Extraction
* Part D: Biological Feature Selection

BIOLOGICAL DATA MINING
* Part E: Regression Analysis of Biological Data
* Part F: Biological Data Clustering
* Part G: Biological Data Classification
* Part H: Association Rules Learning from Biological Data
* Part I: Text Mining and Application to Biological Data
* Part J: High-Performance Computing for Biological Data Mining

Combining sound theory with practical applications in molecular biology, Biological Knowledge Discovery Handbook is ideal for courses in bioinformatics and biological KDD as well as for practitioners and professional researchers in computer science, life science, and mathematics.


Page count: 2299

Year of publication: 2015




Contents

Cover

Wiley Series on Bioinformatics: Computational Techniques and Engineering

Title Page

Copyright

Dedication

Preface

Contributors

Section I: Biological Data Preprocessing

Part A: Biological Data Management

Chapter 1: Genome and Transcriptome Sequence Databases for Discovery, Storage, and Representation of Alternative Splicing Events

1.1 Introduction

1.2 Splicing

1.3 Alternative Splicing

1.4 Alternative Splicing Databases

1.5 Data Mining from Alternative Splicing Databases

Acknowledgments

Web Resources

References

Chapter 2: Cleaning, Integrating, and Warehousing Genomic Data from Biomedical Resources

2.1 Introduction

2.2 Related Work

2.3 Typology of Data Quality Problems in Biomedical Resources

2.4 Cleaning, Integrating, and Warehousing Biomedical Data

2.5 Conclusions and Perspectives

Web Resources

References

Chapter 3: Cleansing of Mass Spectrometry Data for Protein Identification and Quantification

3.1 Introduction

3.2 Preprocessing Approach for Improving Protein Identification

3.3 Identification Filtering Approach for Improving Protein Identification

3.4 Evaluation Results

3.5 Conclusion

References

Chapter 4: Filtering Protein–Protein Interactions by Integration of Ontology Data

4.1 Introduction

4.2 Evaluation of Semantic Similarity

4.3 Identification of False Protein–Protein Interaction Data

4.4 Conclusion

References

Part B: Biological Data Modeling

Chapter 5: Complexity and Symmetries in DNA sequences

5.1 Introduction

5.2 Archaea

5.3 Patterns on Indicator Matrix

5.4 Measure of Complexity and Information

5.5 Complex Root Representation of DNA Words

5.6 DNA Walks

5.7 Wavelet Analysis

5.8 Algorithm of Short Haar Discrete Wavelet Transform

5.9 Conclusions

References

Chapter 6: Ontology-Driven Formal Conceptual Data Modeling for Biological Data Analysis

6.1 Introduction

6.2 Description Logics for Conceptual Data Modeling

6.3 Extensions

6.4 Automated Reasoning and Biological Knowledge Discovery

6.5 Conclusions and Outlook

References

Chapter 7: Biological Data Integration Using Network Models

7.1 Introduction

7.2 Biological Network Models

7.3 Network Models in Understanding Disease

7.4 Future Challenges

Acknowledgment

References

Chapter 8: Network Modeling of Statistical Epistasis

8.1 Introduction

8.2 Epistasis and Detection

8.3 Network

8.4 Gene-Association Interaction Network

8.5 Statistical Epistasis Networks

8.6 Concluding Remarks

Acknowledgment

References

Chapter 9: Graphical Models for Protein Function and Structure Prediction

9.1 Introduction

9.2 Graphical Models

9.3 Applications

9.4 Summary

Acknowledgments

References

Part C: Biological Feature Extraction

Chapter 10: Algorithms and Data Structures for Next-Generation Sequences

10.1 Aligners

10.2 Assemblers

References

Chapter 11: Algorithms for Next-Generation Sequencing Data

11.1 Introduction

11.2 Definitions and Notations

11.3 REAL: A Read Aligner for Mapping Short Reads to a Genome

11.4 CREAL: Mapping Short Reads to a Genome with Circular Structure

11.5 DynMap: Mapping Short Reads to Multiple Closely Related Genomes

11.6 Conclusion

References

Chapter 12: Gene Regulatory Network Identification with Qualitative Probabilistic Networks

12.1 Central Dogma: Gene Expression in a Cell

12.2 Measuring Expression Levels: Microarray Technology

12.3 Understanding Gene Regulatory Networks: Basic Concepts

12.4 Bayesian Networks for Learning GRNs

12.5 Toward Qualitative Modeling of GRNs

12.6 QPNs for Gene Regulation

12.7 Summary and Conclusions

References

Part D: Biological Feature Selection

Chapter 13: Comparing, Ranking, and Filtering Motifs with Character Classes: Application to Biological Sequences Analysis

13.1 Introduction

13.2 Motifs with Character Classes: A Characterization

13.3 Filtering by means of Underlying Motifs

13.4 Experimental Results and Discussion

13.5 Conclusion

Acknowledgments

References

Chapter 14: Stability of Feature Selection Algorithms and Ensemble Feature Selection Methods in Bioinformatics

14.1 Introduction

14.2 Feature Selection Algorithms and Instability

14.3 Ensemble Feature Selection Algorithms

14.4 Metrics for Stability Assessment

14.5 Conclusions

Acknowledgment

References

Chapter 15: Statistical Significance Assessment for Biological Feature Selection: Methods and Issues

15.1 Introduction

15.2 Statistical Significance Assessment

15.3 p-Value Distribution and π0 Estimation

15.4 Obtaining Control and Background Estimation

15.5 Statistical Significance in Integrative Analysis

15.6 Conclusions

Symbols

Acknowledgments

References

Chapter 16: Survey of Novel Feature Selection Methods for Cancer Classification

16.1 Biological Background

16.2 Introduction

16.3 Kernel-Based Feature Selection with Hilbert–Schmidt Independence Criterion

16.4 Redundancy-Based Gene Selection

16.5 Unsupervised Feature Selection

16.6 Summary of Algorithms

16.7 Conclusion

References

Chapter 17: Information-Theoretic Gene Selection in Expression Data

17.1 Introduction

17.2 Curse of Dimensionality

17.3 Variable Selection Exploration Strategies

17.4 Relevance, Redundancy, and Synergy

17.5 Information-Theoretic Filters

17.6 Fast Mutual Information Estimation

17.7 Conclusions

References

Chapter 18: Feature Selection and Classification for Gene Expression Data Using Evolutionary Computation

18.1 Introduction

18.2 Preliminaries

18.3 Evolutionary Reduct Generation

18.4 Experimental Results

18.5 Conclusion

References

Section II: Biological Data Mining

Part E: Regression Analysis of Biological Data

Chapter 19: Building Valid Regression Models for Biological Data Using Stata and R

19.1 Introduction

19.2 Fitting the Model

19.3 Validity of the Model

19.4 Nonconstant Variance and Variable Transformation

19.5 Marginal Model Plots

19.6 Patterns in Residual Plots

19.7 Variable Selection

References

Chapter 20: Logistic Regression in Genomewide Association Analysis

20.1 Introduction

20.2 Single Genetic Marker: Basic Concepts

20.3 Single Genetic Marker: Statistical Tests

20.4 Two Genetic Markers and Fisher's Nonadditivity Interaction

20.5 Many Genetic Markers in Genomewide Association Analysis: Variable Reduction and Penalized Regression

20.6 Latent Variables and Dimension Reduction: Partial Least-Squares Regression

20.7 Latent Variables: Logic Regression

20.8 Discussion

Appendix: Matrix Representation of Partial Least-Squares Regression

Acknowledgments

References

Chapter 21: Semiparametric Regression Methods in Longitudinal Data: Applications to AIDS Clinical Trial Data

21.1 Introduction

21.2 Modeling a Single Treatment Group Using a Semiparametric Partially Linear Model

21.3 Modeling Within-Subject Covariance

21.4 Modeling Multiple Treatment Groups

21.5 Summary

Acknowledgment

References

Part F: Biological Data Clustering

Chapter 22: The Three Steps of Clustering in the Post-Genomic Era

22.1 Introduction

22.2 Experimental Set-Up

22.3 Distances

22.4 Clustering Algorithms

22.5 Internal Validation Measures

22.6 Conclusions

Acknowledgment

References

Chapter 23: Clustering Algorithms of Microarray Data

23.1 Introduction

23.2 Geometric Clustering Algorithms

23.3 Model-Based Clustering Algorithms

23.4 Formal Concept–Based Clustering Algorithms

23.5 Clustering Webtools

23.6 Microarray Data Sets

23.7 Conclusion

References

Chapter 24: Spread of Evaluation Measures for Microarray Clustering

24.1 Introduction

24.2 Search Procedure and Classification of Evaluation Measures

24.3 Internal Measures

24.4 External Measures

24.5 Biological Measures

24.6 Discussion

24.7 Data Sets

24.8 Conclusions

References

Chapter 25: Survey on Biclustering of Gene Expression Data

25.1 Introduction

25.2 Types of Biclusters

25.3 Groups of Biclusters

25.4 Evaluation Functions

25.5 Systematic and Stochastic Biclustering Algorithms

25.6 Bicluster Validation

25.7 Conclusion

Acknowledgments

References

Chapter 26: Multiobjective Biclustering of Gene Expression Data with Bioinspired Algorithms

26.1 Introduction

26.2 Biclustering Problem in Microarray Data

26.3 Multiobjective Model for Biclustering in Gene Expression Data

26.4 Bioinspired Algorithms for Biclustering

26.5 Results and Discussions

26.6 Conclusion

References

Chapter 27: Coclustering Under Gene Ontology Derived Constraints for Pathway Identification

27.1 Introduction

27.2 Related Work

27.3 Constrained Coclustering

27.4 Parameterless Methodology for GO-driven Coclustering

27.5 Case Study

27.6 Conclusion

References

Part G: Biological Data Classification

Chapter 28: Survey on Fingerprint Classification Methods for Biological Sequences

28.1 Introduction

28.2 Basic Definitions and Problem Statements

28.3 Overview of Various Classification Approaches

28.4 Missing-Value Estimation Methods

28.5 Fingerprint Classification: Combinatorial Approach for Estimating Missing Values

Acknowledgments

References

Chapter 29: Microarray Data Analysis: From Preparation to Classification

29.1 Introduction

29.2 Experiment Design

29.3 Normalization

29.4 Ranking

29.5 Brief Review of Approaches of Microarray Data Classification

29.6 MIDClass: A Novel Approach to Effective Microarray Data Classification

29.7 Experimental Study

29.8 Conclusion

References

Chapter 30: Diversified Classifier Fusion Technique for Gene Expression Data

30.1 Introduction

30.2 Background Study

30.3 Preliminaries

30.4 Proposed Model

30.5 Experimental Evaluation

30.6 Conclusion

References

Chapter 31: RNA Classification and Structure Prediction: Algorithms and Case Studies

31.1 Introduction

31.2 Classification of RNA Sequences

31.3 In Silico Prediction of RNA Pseudoknots

31.4 Conclusion

References

Chapter 32: Ab Initio Protein Structure Prediction: Methods and Challenges

32.1 Introduction

32.2 Protein-Folding Problem Milestones at a Glance

32.3 Ab Initio Protein Structure Prediction

32.4 Pure Ab Initio Prediction

32.5 Ab Initio Prediction with Database Information

32.6 Discussion and Challenges

32.7 Appendix: CASP9

References

Chapter 33: Overview of Classification Methods to Support HIV/AIDS Clinical Decision Making

33.1 Predicting Resistance to Drugs

33.2 Predicting Coreceptor Usage

33.3 Identifying Subtype

33.4 Identifying Mutation Selection Pressure

33.5 Making Treatment-related Decisions

33.6 Future Directions

33.7 Conclusion

References

Part H: Association Rules Learning from Biological Data

Chapter 34: Mining Frequent Patterns and Association Rules from Biological Data

34.1 Introduction

34.2 Definition of AR Mining Problem

34.3 Algorithms for Mining ARs

34.4 Preprocessing and Postprocessing

34.5 Gene Expression Data Mining

34.6 Sequential Data Mining

34.7 Structural Data Mining

34.8 Protein Interactions: Graph Data Mining

34.9 Text Mining

34.10 Conclusion

References

Chapter 35: Galois Closure Based Association Rule Mining from Biological Data

35.1 Introduction

35.2 Association Rule Mining Frameworks

35.3 Condensed Representations of Association Rules

35.4 Interestingness Measures

35.5 Biological Applications

35.6 Conclusion

References

Chapter 36: Inference of Gene Regulatory Networks Based on Association Rules

36.1 Introduction

36.2 Data Mining and Inference of GRNs based on ARs

36.3 Techniques of Inference of GRNs based on AR

36.4 Concluding Remarks

Acknowledgments

References

Part I: Text Mining and Application to Biological Data

Chapter 37: Current Methodologies for Biomedical Named Entity Recognition

37.1 Introduction

37.2 Preliminaries

37.3 Dictionary-Based Approaches

37.4 ML-Based Approaches

37.5 Hybrid Approaches

37.6 Use Cases

37.7 Conclusion

References

Chapter 38: Automated Annotation of Scientific Documents: Increasing Access to Biological Knowledge

38.1 Introduction

38.2 Survey of Tools

38.3 Technologies and Techniques

38.4 Discussion

38.5 Future Perspectives

Glossary

Acknowledgments

References

Chapter 39: Augmenting Biological Text Mining with Symbolic Inference

39.1 Introduction

39.2 Identifying Implied Information

39.3 Predicting New Hypotheses

39.4 Text Mining with Distributional Analysis

39.5 Discussion and Conclusion

Acknowledgments

References

Chapter 40: Web Content Mining for Learning Generic Relations and their Associations from Textual Biological Data

40.1 Introduction

40.2 State-of-the-Art in Biological Relation Mining

40.3 Proposed Biological Relation-Mining System

40.4 Performance Evaluation

40.5 Uniqueness of Proposed Biological Relation-Mining System

40.6 Conclusion and Future Work

References

Chapter 41: Protein–Protein Relation Extraction from Biomedical Abstracts

41.1 Introduction

41.2 BioEve: BioMolecular Event Extractor

41.3 Sentence-Level Classification and Semantic Labeling

41.4 Event Extraction Using Dependency Parsing

41.5 Experiments and Evaluations

41.6 Conclusions

Acknowledgments

References

Part J: High-Performance Computing for Biological Data Mining

Chapter 42: Accelerating Pairwise Alignment Algorithms by Using Graphics Processor Units

42.1 Introduction

42.2 Pairwise Alignment Algorithms

42.3 Graphics Processor Units

42.4 Accelerating Pairwise Alignment Algorithms

42.5 Conclusion

References

Chapter 43: High-Performance Computing in High-Throughput Sequencing

43.1 Introduction

43.2 Next-Generation Sequencing Applications

43.3 High-Performance Computing Architectures: Short Summary

43.4 High-Performance Computing on Next-Generation Sequencing Data

43.5 Summary

References

Chapter 44: Large-scale clustering of short reads for metagenomics on GPUs

44.1 Introduction

44.2 Background

44.3 Pairwise Global Alignment

44.4 GPU programming

44.5 CRiSPy-CUDA

44.6 Experiments

44.7 Conclusions

References

Section III: Biological Data Postprocessing

Part K: Biological Knowledge Integration and Visualization

Chapter 45: Integration of Metabolic Knowledge for Genome-Scale Metabolic Reconstruction

45.1 Introduction

45.2 Omics ERA

45.3 Metabolic Network Modeling

45.4 History of Genome-Scale Models

45.5 How Genome-Scale Metabolic Models Can Be Generated

45.6 Applications

45.7 Biochemical Pathways and Genome Annotation Databases

45.8 Conclusion

References

Chapter 46: Inferring and Postprocessing Huge Phylogenies

46.1 Introduction

46.2 Recent Advances

46.3 Data Avalanche: Example with rbcL

46.4 Future Challenges and Opportunities

46.5 Conclusion

Acknowledgment

References

Chapter 47: Biological Knowledge Visualization

47.1 Introduction

47.2 Information Visualization and Visual Analytics

47.3 Biological Data Types

47.4 Biological Data Visualization Issues

47.5 Sequence Data Visualization

47.6 Relational and Functional Data Visualization

47.7 Expression Data Visualization

47.8 Structure Data Visualization

47.9 Conclusion and Future Perspectives

References

Chapter 48: Visualization of Biological Knowledge Based on Multimodal Biological Data

48.1 Introduction

48.2 Multimodal Biological Data

48.3 Approaches to Discover Knowledge from Multimodal Biological Data

48.4 Novel Approach for Visualization and Discovery of Biological Knowledge Based on Multimodal Biological Data

48.5 Conclusion

References

Index

Wiley Series on Bioinformatics: Computational Techniques and Engineering

A complete list of the titles in this series appears at the end of this volume.

Cover Design: Michael Rutkowski

Cover Image: ©iStockphoto/cosmin 4000

Copyright © 2014 by John Wiley & Sons, Inc. All rights reserved.

Published by John Wiley & Sons, Inc., Hoboken, New Jersey.

Published simultaneously in Canada.

No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning, or otherwise, except as permitted under Section 107 or 108 of the 1976 United States Copyright Act, without either the prior written permission of the Publisher, or authorization through payment of the appropriate per-copy fee to the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, (978) 750-8400, fax (978) 750-4470, or on the web at www.copyright.com. Requests to the Publisher for permission should be addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, (201) 748-6011, fax (201) 748-6008, or online at http://www.wiley.com/go/permission.

Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best efforts in preparing this book, they make no representations or warranties with respect to the accuracy or completeness of the contents of this book and specifically disclaim any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives or written sales materials. The advice and strategies contained herein may not be suitable for your situation. You should consult with a professional where appropriate. Neither the publisher nor author shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.

For general information on our other products and services or for technical support, please contact our Customer Care Department within the United States at (800) 762-2974, outside the United States at (317) 572-3993 or fax (317) 572-4002.

Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic formats. For more information about Wiley products, visit our web site at www.wiley.com.

Library of Congress Cataloging-in-Publication Data:

Elloumi, Mourad.

Biological knowledge discovery handbook: preprocessing, mining, and postprocessing of biological data / Mourad Elloumi, Albert Y. Zomaya.

pages cm. – (Wiley series in bioinformatics; 23)

ISBN 978-1-118-13273-9 (hardback)

1. Bioinformatics. 2. Computational biology. 3. Data mining. I. Zomaya, Albert Y. II. Title.

QH324.2.E45 2012

572.80285–dc23

2012042379

To my family for their patience and support.

Mourad Elloumi

To my mother for her many sacrifices over the years.

Albert Y. Zomaya

Preface

With the massive developments in molecular biology during the last few decades, we are witnessing an exponential growth of both the volume and the complexity of biological data. For example, the Human Genome Project provided the sequence of the 3 billion DNA bases that constitute the human genome. Consequently, we also have the sequences of about 100,000 proteins. Therefore, we are entering the postgenomic era: After having focused so much effort on the accumulation of data, we must now focus as much effort, and even more, on the analysis of the data. Analyzing this huge volume of data is a challenging task not only because of its complexity and its multiple and numerous correlated factors but also because of the continuous evolution of our understanding of the biological mechanisms. Classical approaches to biological data analysis are no longer efficient and produce only a very limited amount of information, compared to the numerous and complex biological mechanisms under study. Hence the necessity to use computer tools and develop new in silico high-performance approaches to support us in the analysis of biological data and, hence, to help us in our understanding of the correlations that exist between, on the one hand, structures and functional patterns of biological sequences and, on the other hand, genetic and biochemical mechanisms. Knowledge discovery and data mining (KDD) are a response to these new trends.

Knowledge discovery is a field where we combine techniques from algorithmics, soft computing, machine learning, knowledge management, artificial intelligence, mathematics, statistics, and databases to deal with the theoretical and practical issues of extracting knowledge, that is, new concepts or concept relationships, hidden in volumes of raw data. The knowledge discovery process is made up of three main phases: data preprocessing, data processing, also called data mining, and data postprocessing. Knowledge discovery offers the capacity to automate complex search and data analysis tasks. We distinguish two types of knowledge discovery systems: verification systems and discovery ones. Verification systems are limited to verifying the user's hypothesis, while discovery ones autonomously predict and explain new knowledge. The biological knowledge discovery process should take into account both the characteristics of the biological data and the general requirements of the knowledge discovery process.

Data mining is the main phase in the knowledge discovery process. It consists of extracting nuggets of information, that is, pertinent patterns, pattern correlations, and estimations or rules, hidden in huge bodies of data. The extracted information will be used in the verification of the hypothesis or the prediction and explanation of knowledge. Biological data mining aims at extracting motifs, functional sites, or clustering/classification rules from biological sequences.

Biological KDD are complementary to laboratory experimentation and help to speed up and deepen research in modern molecular biology. They promise to bring us new insights into the growing volumes of biological data.

This book is a survey of the most recent developments on techniques and approaches in the field of biological KDD. It presents the results of the latest investigations in this field. The techniques and approaches presented deal with the most important and/or the newest topics encountered in this field. Some of these techniques and approaches represent improvements of old ones while others are completely new. Most of the other books on biological KDD either lack technical depth or focus on specific topics. This book is the first overview on techniques and approaches in biological KDD with both a broad coverage of this field and enough depth to be of practical use to professionals. The biological KDD techniques and approaches presented here combine sound theory with truly practical applications in molecular biology. This book will be extremely valuable and fruitful for people interested in the growing field of biological KDD, to discover both the fundamentals behind biological KDD techniques and approaches, and the applications of these techniques and approaches in this field. It can also serve as a reference for courses on bioinformatics and biological KDD. So, this book is designed not only for practitioners and professional researchers in computer science, life science, and mathematics but also for graduate students and young researchers looking for promising directions in their work. It will certainly point them to new techniques and approaches that may be the key to new and important discoveries in molecular biology.

This book is organized into 11 parts: Biological Data Management, Biological Data Modeling, Biological Feature Extraction, Biological Feature Selection, Regression Analysis of Biological Data, Biological Data Clustering, Biological Data Classification, Association Rules Learning from Biological Data, Text Mining and Application to Biological Data, High-Performance Computing for Biological Data Mining, and Biological Knowledge Integration and Visualization. The 48 chapters that make up the 11 parts were carefully selected to provide a wide scope with minimal overlap between the chapters so as to reduce duplication. Each contributor was asked to cover review material as well as current developments in his or her chapter. In addition, the authors chosen are leaders in their respective fields.

Mourad Elloumi and Albert Y. Zomaya

Contributors

Jad Abbass, Faculty of Science, Engineering and Computing, Kingston University, London, United Kingdom and Department of Computer Science and Mathematics, Lebanese American University, Beirut, Lebanon

Muhammad Abulaish, Center of Excellence in Information Assurance, King Saud University, Riyadh, Saudi Arabia and Department of Computer Science, Jamia Millia Islamia (A Central University), New Delhi, India

Syed Toufeeq Ahmed, Vanderbilt University Medical Center, Nashville, Tennessee

Shiva Akbari-Birgani, Laboratory of Systems Biology and Bioinformatics, Institute of Biochemistry and Biophysics, University of Tehran, Tehran, Iran

Ali Al Mazari, School of Information Technologies, The University of Sydney, Sydney, Australia

Mohamed Al Sayed Issa, Computers and Systems Department, Faculty of Engineering, Zagazig University, Egypt

Yazdan Asgari, Laboratory of Systems Biology and Bioinformatics, Institute of Biochemistry and Biophysics, University of Tehran, Tehran, Iran

Wassim Ayadi, Laboratory of Technologies of Information and Communication and Electrical Engineering (LaTICE) and LERIA, University of Angers, Angers, France

Haider Banka, Department of Computer Science and Engineering, Indian School of Mines, Dhanbad, India

Laure Berti-Équille, Institut de Recherche pour le Développement, Montpellier, France

Gianluca Bontempi, Machine Learning Group, Computer Science Department, Université Libre de Bruxelles, Brussels, Belgium

Nigel P. Brown, BioQuant, University of Heidelberg, Heidelberg, Germany

Giulia Bruno, Dipartimento di Ingegneria Gestionale e della Produzione, Politecnico di Torino, Torino, Italy

David Campos, DETI/IEETA, University of Aveiro, Aveiro, Portugal

Jessica Andrea Carballido, Laboratorio de Investigación y Desarrollo en Computación Científica (LIDeCC), Dept. Computer Science and Engineering, Universidad Nacional del Sur, Bahía Blanca, Argentina

Luciano Cascione, Department of Clinical and Molecular Biomedicine, University of Catania, Italy

Ümit V. Çatalyürek, Department of Biomedical Informatics, The Ohio State University, Columbus, Ohio

Carlo Cattani, Department of Mathematics, University of Salerno, Fisciano (SA), Italy

Meghana Chitale, Department of Computer Science, Purdue University, West Lafayette, Indiana

Young-Rae Cho, Department of Computer Science, Baylor University, Waco, Texas

Kwok Pui Choi, Department of Statistics and Applied Probability, National University of Singapore, Singapore

Matteo Comin, Department of Information Engineering, University of Padova, Padova, Italy

Francesca Cordero, Department of Computer Science, University of Torino, Turin, Italy

Suresh Dara, Department of Computer Science and Engineering, Indian School of Mines, Dhanbad, India

Bhaskar DasGupta, Department of Computer Science, University of Illinois at Chicago, Chicago, Illinois

Hasan Davulcu, Department of Computer Science and Engineering, Ira A. Fulton Engineering, Arizona State University, Tempe, Arizona

Mourad Elloumi, Laboratory of Technologies of Information and Communication and Electrical Engineering (LaTICE) and University of Tunis-El Manar, Tunisia

Juan Esquivel-Rodríguez, Department of Computer Science, Purdue University, West Lafayette, Indiana

Alfredo Ferro, Department of Clinical and Molecular Biomedicine, University of Catania, Italy

Alessandro Fiori, Dipartimento di Automatica e Informatica, Politecnico di Torino, Torino, Italy

Adelaide Valente Freitas, DMat/CIDMA, University of Aveiro, Portugal

Terry Gaasterland, Scripps Genome Center, University of California San Diego, San Diego, California

Cristian Andrés Gallo, Laboratorio de Investigación y Desarrollo en Computación Científica (LIDeCC), Dept. Computer Science and Engineering, Universidad Nacional del Sur, Bahía Blanca, Argentina

Roger J. Garsia, Department of Clinical Immunology, Royal Prince Alfred Hospital, Sydney, Australia

Raffaele Giancarlo, Department of Mathematics and Informatics, University of Palermo, Palermo, Italy

Rosalba Giugno, Department of Clinical and Molecular Biomedicine, University of Catania, Italy

Jin-Kao Hao, LERIA, University of Angers, Angers, France

Ayat Hatem, Department of Biomedical Informatics, The Ohio State University, Columbus, Ohio

Heiko Horn, Department of Disease Systems Biology, The Novo Nordisk Foundation Center for Protein Research, Faculty of Health Sciences, University of Copenhagen, Copenhagen, Denmark

Ting Hu, Computational Genetics Laboratory, Geisel School of Medicine, Dartmouth College, Lebanon, New Hampshire

Kun Huang, Department of Biomedical Informatics, The Ohio State University, Columbus, Ohio

Zina M. Ibrahim, Social Genetic and Developmental Psychiatry Centre, King's College London, London, United Kingdom

Dino Ienco, Institut de Recherche en Sciences et Technologies pour l'Environnement, Montpellier, France

Costas S. Iliopoulos, Department of Informatics, King's College London, Strand, London, United Kingdom and Digital Ecosystems & Business Intelligence Institute, Curtin University, Centre for Stringology & Applications, Perth, Australia

Jahiruddin, Department of Computer Science, Jamia Millia Islamia (A Central University), New Delhi, India

Laetitia Jourdan, INRIA Lille Nord Europe, Villeneuve d'Ascq, France

Lakshmi Kaligounder, Department of Computer Science, University of Illinois at Chicago, Chicago, Illinois

Radha Krishna Murthy Karuturi, Computational and Mathematical Biology, Genome Institute of Singapore, Singapore

Khairul A. Kasmiran, School of Information Technologies, The University of Sydney, Sydney, Australia

Ioannis Kavakiotis, Department of Informatics, Aristotle University of Thessaloniki, Thessaloniki, Greece

Kamer Kaya, Department of Biomedical Informatics, The Ohio State University, Columbus, Ohio

Catharina Maria Keet, School of Computer Science, University of KwaZulu-Natal, Durban, South Africa

Daisuke Kihara, Department of Computer Science, Purdue University, West Lafayette, Indiana and Department of Biological Sciences, Purdue University, West Lafayette, Indiana

Gaurav Kumar, Department of Chemistry and Biomolecular Sciences and ARC Centre of Excellence in Bioinformatics, Macquarie University, Sydney, Australia

Chee Keong Kwoh, School of Computer Engineering, Nanyang Technological University, Singapore

Giuseppe Lancia, Department of Mathematics and Informatics, University of Udine, Udine, Italy

Hee-Jin Lee, Department of Computer Science, Korea Advanced Institute of Science and Technology, Daejeon, South Korea

Juntao Li, Computational and Mathematical Biology, Genome Institute of Singapore, Singapore and Department of Statistics and Applied Probability, National University of Singapore, Singapore

Wentian Li, Robert S. Boas Center for Genomics and Human Genetics, Feinstein Institute for Medical Research, North Shore LIJ Health Systems, Manhasset, New York

Yehua Li, Department of Statistics and Statistical Laboratory, Iowa State University, Ames, Iowa

Charles Lindsey, StataCorp, College Station, Texas

Giosuè Lo Bosco, Department of Mathematics and Informatics, University of Palermo, Palermo, Italy and I.E.ME.S.T., Istituto Euro Mediterraneo di Scienza e Tecnologia, Palermo, Italy

Nashat Mansour, Department of Computer Science and Mathematics, Lebanese American University, Beirut, Lebanon

Ali Masoudi-Nejad, Laboratory of Systems Biology and Bioinformatics, Institute of Biochemistry and Biophysics, University of Tehran, Tehran, Iran

Sérgio Matos, DETI/IEETA, University of Aveiro, Aveiro, Portugal

Patrick E. Meyer, Machine Learning Group, Computer Science Department, Université Libre de Bruxelles, Brussels, Belgium

Debahuti Mishra, Institute of Technical Education and Research, Siksha O Anusandhan University, Bhubaneswar, Odisha, India

Sashikala Mishra, Institute of Technical Education and Research, Siksha O Anusandhan University, Bhubaneswar, Odisha, India

Ahmed Mokaddem, Laboratory of Technologies of Information and Communication and Electrical Engineering (LaTICE) and University of Tunis-El Manar, El Manar, Tunisia

Kartick Chandra Mondal, Laboratory I3S, University of Nice Sophia-Antipolis, Sophia-Antipolis, France

Jason H. Moore, Computational Genetics Laboratory, Geisel School of Medicine, Dartmouth College, Lebanon, New Hampshire

Fouzia Moussouni, Université de Rennes 1, Rennes, France

Mohamed Nadif, LIPADE, University of Paris-Descartes, Paris, France

Radhika Nair, Department of Computer Science and Engineering, Ira A. Fulton Engineering, Arizona State University, Tempe, Arizona

Jean-Christophe Nebel, Faculty of Science, Engineering and Computing, Kingston University, London, United Kingdom

Alioune Ngom, School of Computer Science, University of Windsor, Windsor, Ontario, Canada

Thuy Diem Nguyen, School of Computer Engineering, Nanyang Technological University, Singapore

Oleg Okun, SMARTTECCO, Stockholm, Sweden

José Luis Oliveira, DETI/IEETA, University of Aveiro, Portugal

Hatice Gülçin Özer, Department of Biomedical Informatics, The Ohio State University, Columbus, Ohio

Evangelos Pafilis, Institute of Marine Biology Biotechnology and Aquaculture, Hellenic Centre for Marine Research, Heraklion, Crete, Greece

Jong C. Park, Department of Computer Science, Korea Advanced Institute of Science and Technology, Daejeon, South Korea

Nicolas Pasquier, Laboratory I3S, University of Nice Sophia-Antipolis, Sophia-Antipolis, France

Chintan Patel, Department of Computer Science and Engineering, Ira A. Fulton Engineering, Arizona State University, Tempe, Arizona

Yudi Pawitan, Department of Medical Epidemiology and Biostatistics, Karolinska Institute, Stockholm, Sweden

Ruggero G. Pensa, Department of Computer Science, University of Torino, Turin, Italy

Giuseppe Pigola, IGA Technology Services, Udine, Italy

Luca Pinello, Department of Biostatistics, Harvard School of Public Health, Boston, Massachusetts; Department of Biostatistics and Computational Biology, Dana-Farber Cancer Institute, Boston, Massachusetts; and I.E.ME.S.T., Istituto Euro Mediterraneo di Scienza e Tecnologia, Palermo, Italy

Solon P. Pissis, Department of Informatics, King's College London, Strand, London, United Kingdom

Alberto Policriti, Department of Mathematics and Informatics and Institute of Applied Genomics, University of Udine, Udine, Italy

Ignacio Ponzoni, Laboratorio de Investigación y Desarrollo en Computación Científica (LIDeCC), Dept. Computer Science and Engineering, Universidad Nacional del Sur, Bahía Blanca, Argentina and Planta Piloto de Ingeniería Química (PLAPIQUI) CONICET, Bahía Blanca, Argentina

Alfredo Pulvirenti, Department of Clinical and Molecular Biomedicine, University of Catania, Italy

Shoba Ranganathan, Department of Chemistry and Biomolecular Sciences and ARC Centre of Excellence in Bioinformatics, Macquarie University, Sydney, Australia

Hendrik Rohn, Leibniz Institute of Plant Genetics and Crop Plant Research (IPK), Gatersleben, Germany

Haifa Ben Saber, Laboratory of Technologies of Information and Communication and Electrical Engineering (LaTICE) and University of Tunis, Tunisia

Lee Sael, Department of Computer Science, Purdue University, West Lafayette, Indiana and Department of Biological Sciences, Purdue University, West Lafayette, Indiana

Ali Salehzadeh-Yazdi, Laboratory of Systems Biology and Bioinformatics, Institute of Biochemistry and Biophysics, University of Tehran, Tehran, Iran

Rodrigo Santamaría, Department of Computer Science and Automation, University of Salamanca, Salamanca, Spain

Bertil Schmidt, Institut für Informatik, Johannes Gutenberg University, Mainz, Germany

Falk Schreiber, Leibniz Institute of Plant Genetics and Crop Plant Research (IPK), Gatersleben, Germany and Institute of Computer Science, Martin Luther University Halle-Wittenberg, Halle, Germany

Khedidja Seridi, INRIA Lille Nord Europe, Villeneuve d'Ascq, France

Kailash Shaw, Department of CSE, Gandhi Engineering College, Bhubaneswar, Odisha, India

Simon J. Sheather, Department of Statistics, Texas A&M University, College Station, Texas

Stephen A. Smith, Department of Ecology and Evolutionary Biology, University of Michigan, Ann Arbor, Michigan

Junilda Spirollari, Department of Computer Science, New Jersey Institute of Technology, Newark, NJ

Alexandros Stamatakis, Scientific Computing Group, Heidelberg Institute for Theoretical Studies, Heidelberg, Germany

El-Ghazali Talbi, INRIA Lille Nord Europe, Villeneuve d'Ascq, France

Kean Ming Tan, Department of Statistics, Purdue University, West Lafayette, Indiana

Xin Lu Tan, Department of Statistics, Purdue University, West Lafayette, Indiana

Bahar Taneri, Department of Biological Sciences, Eastern Mediterranean University, Famagusta, North Cyprus and Institute for Public Health Genomics, Cluster of Genetics and Cell Biology, Faculty of Health, Medicine and Life Sciences, Maastricht University, The Netherlands

Mingjie Tang, Department of Computer Science, Purdue University, West Lafayette, Indiana

Ahmed Y. Tawfik, Information Systems Department, French University of Egypt, El-Shorouk, Egypt

Sukru Tikves, Department of Computer Science and Engineering, Ira A. Fulton Engineering, Arizona State University, Tempe, Arizona

George Tzanis, Department of Informatics, Aristotle University of Thessaloniki, Thessaloniki, Greece

Filippo Utro, Computational Genomics Group, IBM T.J. Watson Research Center, Yorktown Heights, New York

Davide Verzotto, Department of Information Engineering, University of Padova, Padova, Italy

Francesco Vezzi, Department of Mathematics and Informatics and Institute of Applied Genomics, University of Udine, Udine, Italy

Alessia Visconti, Department of Computer Science, University of Torino, Turin, Italy

Ioannis Vlahavas, Department of Informatics, Aristotle University of Thessaloniki, Thessaloniki, Greece

Jason T. L. Wang, Department of Computer Science, New Jersey Institute of Technology, Newark, NJ

Penghao Wang, School of Mathematics and Statistics, The University of Sydney, Sydney, Australia

Dongrong Wen, Department of Computer Science, New Jersey Institute of Technology, Newark, NJ

Pengyi Yang, School of Information Technologies, University of Sydney, Sydney, Australia

Jean Yee-Hwa Yang, School of Mathematics and Statistics, University of Sydney, Sydney, Australia

Yaning Yang, Department of Statistics and Finance, University of Science and Technology of China, Hefei, China

Zejun Zheng, Singapore Institute for Clinical Sciences, Singapore

Ling Zhong, Department of Computer Science, New Jersey Institute of Technology, Newark, NJ

Bing B. Zhou, School of Information Technologies, University of Sydney, Sydney, Australia

Albert Y. Zomaya, School of Information Technologies, University of Sydney, Sydney, Australia

Section I

Biological Data Preprocessing

Part A

Biological Data Management

Chapter 1

Genome and Transcriptome Sequence Databases for Discovery, Storage, and Representation of Alternative Splicing Events

Bahar Taneri1,2 and Terry Gaasterland3

1Department of Biological Sciences, Eastern Mediterranean University, Famagusta, North Cyprus

2Institute for Public Health Genomics, Cluster of Genetics and Cell Biology, Faculty of Health, Medicine and Life Sciences, Maastricht University, The Netherlands

3Scripps Genome Center, University of California San Diego, San Diego, California

1.1 Introduction

Transcription is a critical cellular process through which the RNA molecules specify which proteins are expressed from the genome within a given cell. DNA is transcribed into RNA and RNA transcripts are then translated into proteins, which carry out numerous functions within cells. Prior to protein synthesis, RNA transcripts undergo several modifications including 5′ capping, 3′ polyadenylation, and splicing [1]. Precursor messenger RNA (pre-mRNA) processing determines the mature mRNA's stability, its localization within the cell, and its interaction with other molecules [2]. In addition to constitutive splicing, the majority of eukaryotic genes undergo alternative splicing and therefore code for proteins with diverse structures and functions.

In this chapter, we describe the process of RNA splicing and focus on RNA alternative splicing. As described in detail below, splicing removes noncoding introns from the pre-mRNA and ligates the coding exonic sequences to produce the mRNA transcript. Alternative splicing is a cellular process by which several different combinations of exon–intron architectures are achieved with different mRNA products from the same gene. This process generates several mRNAs with different sequences from a single gene by making use of alternative splice sites of exons and introns. This process is critical in eukaryotic gene expression and plays a pivotal role in increasing the complexity and coding potential of genomes. Since alternative splicing presents an enormous source of diversity and greatly elevates the coding capacity of various genomes [3–5], we devote this chapter to this cellular phenomenon, which is widespread across eukaryotic genomes.

In particular we explain the databases for Alternative Splicing Queries (dbASQ), a computational pipeline we used to generate alternative splicing databases for genome and transcriptome sequences of various organisms. dbASQ enables the use of genome and transcriptome sequence data of any given organism for database development. Alternative splicing databases generated via dbASQ not only store the sequence data but also facilitate the detection and visualization of alternative splicing events for each gene in each genome analyzed. Data mining of the alternative splicing databases, generated using the dbASQ system, enables further analysis of this cellular process, providing biological answers to novel scientific questions.

This chapter gives a general overview of alternative splicing and takes a computational approach to answering biological questions about it. We introduce splicing and alternative splicing along with their mechanism and regulation, and we briefly discuss the evolution and conservation of alternative splicing. Our main focus is the computational tools used in generating alternative splicing databases. We explain the content and the utility of alternative splicing databases for five different eukaryotic organisms: human, mouse, rat, fruit fly, and soil worm. We also cover genomic and transcriptomic sequence analyses and data mining from alternative splicing databases in general.

1.2 Splicing

A typical mammalian gene is a multiexon gene separated by introns. Exons are relatively short, about 145 nucleotides, and are interrupted by much longer introns of about 3300 nucleotides [6, 7]. In humans, the average number of exons per protein coding gene is 8.8 [7]. Both introns and exons of a protein-coding gene are transcribed into a pre-mRNA molecule [1]. Approximately 90% of the pre-mRNA molecule is composed of the introns and these are removed before translation. Before the mRNA molecule transcribed from the gene can be translated into a protein molecule, there are several processes that need to take place. While in total an average protein-coding gene in human is about 27,000 bp in the genome and in the pre-mRNA molecule, the processed mRNA contains only about 1300 coding nucleotides and 1000 nucleotides in the untranslated regions (UTRs) and polyadenylation (poly A) tail. The removal of introns and ligation of exons are referred to as the splicing process or the RNA splicing process [1, 7]. Splicing takes place in the nucleus. Final products of splicing which are the ligated exonic sequences are ready for translation and are exported out of the nucleus [1].
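The averages quoted above can be checked against each other with a little arithmetic. The following is a minimal sketch, assuming one intron between each pair of neighbouring exons; the function name `estimate_gene_composition` is hypothetical and only the three input averages come from the text.

```python
# Averages quoted in the text for a human protein-coding gene.
EXONS_PER_GENE = 8.8      # average number of exons per gene
EXON_LENGTH_NT = 145      # typical exon length in nucleotides
INTRON_LENGTH_NT = 3300   # typical intron length in nucleotides


def estimate_gene_composition(exons, exon_len, intron_len):
    """Return (exonic_nt, intronic_nt, total_nt).

    Assumes exactly one intron between each pair of neighbouring exons,
    so a gene with n exons has n - 1 introns.
    """
    exonic = exons * exon_len
    intronic = (exons - 1) * intron_len
    return exonic, intronic, exonic + intronic


exonic_nt, intronic_nt, total_nt = estimate_gene_composition(
    EXONS_PER_GENE, EXON_LENGTH_NT, INTRON_LENGTH_NT
)
intron_share = intronic_nt / total_nt

print(f"exonic: {exonic_nt:.0f} nt")
print(f"intronic: {intronic_nt:.0f} nt")
print(f"total: {total_nt:.0f} nt")
print(f"intron share: {intron_share:.0%}")
```

Under these assumptions the totals land close to the roughly 27,000 bp gene and 1300 coding nucleotides quoted above. The intron share comes out somewhat above 90%, in part because this simple model leaves out the roughly 1000 nucleotides of untranslated sequence.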

1.2.1 Mechanism of Splicing

Simply, splicing refers to removal of intervening sequences from the pre-mRNA molecule and ligation of the exonic sequences. Each single splicing event removes one intron and ligates two exons. This process takes place via two steps of chemical reactions [1]. As shown in Figure 1.1, within the intronic sequence there is a particular adenine nucleotide which attacks the 5′ intronic splice site. A covalent bond is formed between the 5′ splice site of the intron and the adenine nucleotide releasing the exon upstream of the intron. In the second chemical reaction, the free 3′-OH group at the 3′ end of the upstream exon ligates with the 5′ end of the downstream exon. In this process, the intronic sequence, which contains an RNA loop, is released.

Figure 1.1 Illustration of two chemical reactions needed for one splicing reaction (A: adenine nucleotide at branch point of intron).

1.2.2 Regulation of Splicing

There are many cis-acting and trans-acting factors involved in splicing. The network of these factors facilitates splicing through exon definition and intron definition. Exon definition occurs early in splicing and involves interactions recognizing the exonic 5′ splice site and 3′ splice site, whereas for intron definition initial interactions take place across the intron for the recognition of 5′ and 3′ splice sites of the intron [8]. Splicing is regulated by a dynamic combinatorial network of RNA and protein molecules. The spliceosome, the splicing machinery, is a very complex system that includes five small nuclear RNAs (snRNAs), termed U1, U2, U4, U5, and U6 [1]. These are short RNAs, each about 200 nucleotides long. In addition to the snRNAs, about 100 proteins are parts of the spliceosome. Assembly of snRNAs with the proteins forms small nuclear ribonucleoprotein complexes (snRNPs), which precisely bind to splice sites on the pre-mRNA to facilitate splicing [9]. Figure 1.2 shows the main steps of spliceosome assembly in the cell. Initially the 5′ intronic splice site interacts with U1. Then U2 interacts with the branch point. Next, U1 is replaced by the U4/U6, U5 complex, which then interacts with U2, initiating intronic lariat formation. It is thought that the complex molecular content and assembly of the spliceosome are due to the need for highly accurate splicing in order to prevent formation of malfunctional or nonfunctional protein molecules.

Figure 1.2 Spliceosome assembly (U1, U2, U4, U5, U6: snRNAs; GU: guanine and uracil nucleotides forming 5′ splice site signal; AG: adenine and guanine nucleotides forming 3′ splice site signal).

In addition to the complex splicing machinery in the cell, specific sequence signals are needed for splicing to occur. There are four main sequence signals on the pre-mRNA molecule which play important roles in splicing. As shown in Figure 1.3, these are the 5′ splice site (exon–intron junction at the 5′ end of the intron), the 3′ splice site (exon–intron junction at the 3′ end of the intron), the branch point (specific sequence slightly upstream of the 3′ splice site), and the polypyrimidine tract (between the branch point and the 3′ splice site). These sequences facilitate the two transesterification reactions involved in intron removal and exon ligation.

Figure 1.3 Splicing signals on pre-mRNA molecule (GU: guanine and uracil nucleotides forming 5′ splice site signal; AG: adenine and guanine nucleotides forming 3′ splice site signal; A: adenine nucleotide at branch point of intron; polypyrimidine tract: pyrimidine-rich short sequence close to 3′ splice site).
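To make the layout of the four signals concrete, here is a minimal sketch that looks for them in a single intron sequence. It is an illustration only: the function name `find_splice_signals`, the `min_tract` threshold, and the requirement of an uninterrupted pyrimidine run are assumptions, and real introns are more variable than this toy rule allows.

```python
def find_splice_signals(intron, min_tract=8):
    """Locate the four canonical splicing signals in one intron sequence.

    `intron` runs from the first intronic base to the last (5' to 3').
    `min_tract` is an illustrative threshold, not a value from the text.
    """
    seq = intron.upper().replace("T", "U")  # accept DNA or RNA alphabet

    has_5ss = seq.startswith("GU")  # 5' splice site signal
    has_3ss = seq.endswith("AG")    # 3' splice site signal

    # Polypyrimidine tract: walk upstream from the 3' AG while bases are C or U.
    tract_end = len(seq) - 2
    tract_start = tract_end
    while tract_start > 0 and seq[tract_start - 1] in "CU":
        tract_start -= 1
    tract = seq[tract_start:tract_end]
    has_tract = len(tract) >= min_tract

    # Branch point: nearest adenine upstream of the tract, if a tract was found.
    branch = seq.rfind("A", 0, tract_start) if has_tract else -1

    return {
        "five_prime_GU": has_5ss,
        "three_prime_AG": has_3ss,
        "polypyrimidine_tract": tract if has_tract else None,
        "branch_point_index": branch if branch >= 0 else None,
    }


# Toy intron: GU start, spacer, branch-point A, pyrimidine tract, AG end.
toy_intron = "GUAAGU" + "GCGCGCGC" + "A" + "CUUUCUUCCUUUC" + "AG"
signals = find_splice_signals(toy_intron)
print(signals)
```

The scan works backward from the 3′ end because the text places both the polypyrimidine tract and the branch point relative to the 3′ splice site.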

However, these sequences are not sufficient for alternative splice site selection. There are multiple other sequence signals involved in alternative splicing. There are several types of cis-acting regulatory sequences for splicing within the RNA molecule termed enhancers and silencers, which stimulate or suppress splicing, respectively. Exonic splicing enhancers (ESEs), exonic splicing silencers (ESSs), intronic splicing enhancers (ISEs), and intronic splicing silencers (ISSs) are among the cis-acting splicing regulatory sequences.

Here, we provide an example of ESE regulatory function. ESEs act as binding sites for regulatory RNA binding proteins (RBPs), particularly as binding sites for SR proteins (proteins rich in serine–arginine). SR proteins have two RNA recognition motifs (RRMs) and one arginine–serine rich domain (RS domain). SR proteins bind to RNA sequence motifs via their RRM domains [10], and they recruit the spliceosome to the splice site via their RS domain. By this process the SR proteins enable exon definition [6]. SR proteins recruit the basal splicing machinery to the RNA; therefore they are required for both constitutive and alternative splicing. Figure 1.4 illustrates SR protein binding to ESEs on the RNA molecule. In addition, SR proteins work as inhibitors of splicing inhibitory proteins binding to ESS sites close to ESEs, where SRs are bound (Figure 1.4). Many exons contain ESEs, which overall have varying sequences [8].

Figure 1.4 SR protein binding on pre-mRNA: SR inhibition of splicing inhibitory protein.

Though less well understood than ESEs, ESSs are known negative regulators of splicing. They interact with repressor heterogeneous nuclear ribonucleoproteins (hnRNPs) to silence splicing [11]. Certain trans-acting splicing regulatory proteins could bind to ESS sequences causing exon skipping [12]. Similarly, intronic sequences can act both as enhancers and silencers of splicing events. Certain intronic sequences function as ISEs and can enhance the splicing of their upstream exon [8]. Certain ISSs could signal for repressor protein binding. For example, specifically YCAY motifs, where Y denotes a pyrimidine (U or C), signal for NOVA binding (a neuron-specific splicing regulatory protein). These particular sequences can act as ISSs depending on their location within the pre-mRNA molecule [13]. ISSs are further discussed in Section 1.3.3.
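The YCAY motif lends itself to a simple pattern search. The sketch below finds every YCAY occurrence in an RNA sequence and then groups nearby hits into clusters. The motif definition follows the text; the grouping rule (`max_gap`, `min_motifs`) and the function names are assumptions made for illustration and are not the criteria used in the cited work.

```python
import re

# Y = pyrimidine (U or C). The lookahead lets overlapping motifs be reported.
YCAY = re.compile(r"(?=([UC]CA[UC]))")


def find_ycay(rna):
    """Return (start_index, motif) for every YCAY occurrence."""
    seq = rna.upper().replace("T", "U")
    return [(m.start(), m.group(1)) for m in YCAY.finditer(seq)]


def ycay_clusters(rna, max_gap=10, min_motifs=3):
    """Group motif hits into clusters.

    Hits whose start positions lie within `max_gap` nucleotides of the
    previous hit are chained together; chains with at least `min_motifs`
    hits are reported as (start, end, motif_count), end exclusive.
    Both thresholds are hypothetical.
    """
    clusters, current = [], []
    for pos, _ in find_ycay(rna):
        if current and pos - current[-1] > max_gap:
            if len(current) >= min_motifs:
                clusters.append((current[0], current[-1] + 4, len(current)))
            current = []
        current.append(pos)
    if len(current) >= min_motifs:
        clusters.append((current[0], current[-1] + 4, len(current)))
    return clusters


# Toy sequence: three motifs close together, then an isolated fourth one.
toy_rna = "GGUCAUCAUUCAC" + "G" * 16 + "CCAUGG"
hits = find_ycay(toy_rna)
clusters = ycay_clusters(toy_rna)
print(hits)
print(clusters)
```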

1.3 Alternative Splicing

1.3.1 Introduction to Alternative Splicing

Alternative splicing is a widespread phenomenon across and within the eukaryotic genomes. Of the estimated 25,000 protein-coding genes in human, ∼90% are predicted to be alternatively spliced [14]. The impact of alternative splicing is widespread on the eukaryotic organisms' gene expression in general [5]. Earlier studies have shown that the majority of the immune system and the nervous system genes exhibit alternative splicing [15]. We have previously shown that the majority of mouse transcription factors are alternatively spliced, leading to protein domain architecture changes [16]. Below, we detail different types of alternative splicing and the mechanism and regulation of this cellular process. We mention the evolution and conservation of alternative splicing across different genomes.

Types of Alternative Splicing

Alternative splicing of the pre-mRNA molecule can occur in several different ways. Figure 1.5 shows different types of alternative splicing events which include the presence and absence of cassette exons, mutually exclusive exons, intron retention, and various forms of length variation. A given RNA transcript can contain multiple different types of alternative splicing.

Examples of Widespread Presence of Alternative Splicing in Eukaryotic Genes

Alternative splicing is a well-documented, widespread phenomenon across the eukaryotic genomes. Here, we provide two interesting examples of alternatively spliced genes, one from Drosophila melanogaster and the other from the human genome. One of the most interesting examples of alternative splicing involves the Down syndrome cell adhesion molecule (Dscam) gene of D. melanogaster. There are 95 cassette exons in this gene and a total of 38,016 different RNA transcripts can potentially be generated from this gene through differential use of the exon–intron structure [5, 17]. The Dscam example illustrates the enormous coding-changing capacity of alternative splicing and its influence on the variation of gene expression within and across cells [5]. The KCNMA1 human gene presents another interesting case of alternative splicing. This gene exhibits both cassette exons and exons with length variation at 5′ and 3′ ends. These alternative exons generate over 500 different RNA transcripts [5].
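The two Dscam figures quoted above, 95 alternative exons and 38,016 potential transcripts, are consistent with a simple combinatorial count. The sketch below assumes the alternative exons fall into four clusters of 12, 48, 33, and 2 variants, with exactly one variant chosen per cluster. These per-cluster counts come from the wider Dscam literature rather than from this chapter, so treat them as an outside assumption.

```python
from math import prod

# Alternative exon variants per Dscam cluster.
# Assumption: per-cluster counts are taken from the Dscam literature,
# not from this chapter.
DSCAM_CLUSTERS = {"exon 4": 12, "exon 6": 48, "exon 9": 33, "exon 17": 2}


def count_isoforms(variants_per_cluster):
    """One variant is chosen from each cluster, so the counts multiply."""
    return prod(variants_per_cluster.values())


total_alternative_exons = sum(DSCAM_CLUSTERS.values())
potential_isoforms = count_isoforms(DSCAM_CLUSTERS)

print(total_alternative_exons, potential_isoforms)
```

The sum of the cluster sizes gives the number of alternative exons, while the product gives the number of distinct exon combinations.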

Figure 1.5 Types of alternative splicing: (a) cassette exon, present in or absent from the RNA transcript in its entirety; (b) mutually exclusive exons, only one present in any given RNA transcript; (c) intron retention; (d) length-variant exon, nucleotide length variation possible on both 5′ and 3′ ends or on either end (only use of alternative 5′ splice site shown, use of alternative 3′ splice site not shown).
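Two of the event types in Figure 1.5 can be detected by comparing the exon coordinates of two transcripts from the same gene. The following is a minimal sketch under stated assumptions: each transcript is a list of `(start, end)` exon tuples in genomic order with half-open coordinates, and the function names are hypothetical. It covers only cassette exons and intron retention; mutually exclusive exons and length variants would need further rules.

```python
def cassette_exons(transcript_a, transcript_b):
    """Exons of A that B skips while keeping both flanking exons adjacent."""
    found = []
    b_junctions = set(zip(transcript_b, transcript_b[1:]))
    triples = zip(transcript_a, transcript_a[1:], transcript_a[2:])
    for left, middle, right in triples:
        # The middle exon is a cassette if B joins its neighbours directly.
        if middle not in transcript_b and (left, right) in b_junctions:
            found.append(middle)
    return found


def retained_introns(transcript_a, transcript_b):
    """Introns of A that B keeps, seen as one B exon spanning two A exons."""
    found = []
    for left, right in zip(transcript_a, transcript_a[1:]):
        if (left[0], right[1]) in transcript_b:
            found.append((left[1], right[0]))  # intron coordinates
    return found


# Toy transcripts of one gene (coordinates are made up).
full = [(0, 100), (200, 300), (400, 500)]
skipping = [(0, 100), (400, 500)]    # middle exon skipped
retaining = [(0, 100), (200, 500)]   # second intron retained

print(cassette_exons(full, skipping))
print(retained_introns(full, retaining))
```

Exact coordinate matching keeps the sketch short; alignments of real transcripts usually need some tolerance at exon boundaries.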

1.3.2 Mechanism of Alternative Splicing

The mechanism of alternative splicing mainly involves interaction of cis-acting and trans-acting splicing factors. Recruitment of the splicing machinery to the correct splice sites, blocking of certain splice sites, and enhancing the use of other splice sites all contribute to this process [5]. Furthermore, RNA splicing and transcription are temporally and spatially coordinated. As the pre-mRNA is transcribed, splicing starts to take place [2]. Alternative splicing co-occurs with transcription and may be dependent on the promoter region of the gene. Different promoters might recruit different amounts of SR proteins. Alternatively, different promoters might recruit fast- or slow-acting RNA polymerases, which changes the course of splicing. Promoters that recruit slow-acting polymerases give more opportunity for exon inclusion, whereas those that recruit fast-acting ones promote exon exclusion [18]. Furthermore, epigenetics plays a role in the process of alternative splicing. The dynamic chromatin structure, which affects transcription, is also implicated in alternative splicing [19]. In addition, it has been shown that histone modification takes place differentially in the areas with constitutive exons compared to those with alternative exons [20, 21].

1.3.3 Regulation of Alternative Splicing

Alternative splicing depends on tissue, developmental stage, and/or physiological condition [5, 22] and is regulated accordingly. Complex interactions between cis regulatory sequences and trans regulatory factors of RNA binding proteins lead to a tissue-specific, cell-specific, developmental stage and physiological condition–dependent regulation of splicing [23–26]. An example of cis-acting regulation is the ISS-based alternative exon exclusion. Inclusion of an alternative exon depends on several factors, including the affinity and the concentrations of positive and negative regulators of splicing. ISSs flank the alternative exons on both sides and could bind the negative regulators of splicing. Protein–protein interaction among these negative regulators results in alternative exon skipping [6]. Figure 1.6 shows ISS regulation leading to exon exclusion from the mRNA.

Figure 1.6 ISS-based exon exclusion (black structure: regulatory protein).

Splicing Regulatory Proteins

Splicing regulatory proteins which control tissue-specific alternative splicing are expressed in certain cell types [24]. The best known of these splicing factors are the neuron-specific Nova1 and Nova2 proteins [27]. Importantly, splicing can be regulated by different isoforms of a splicing factor [28]. Here, we provide a partial list of splicing regulatory proteins: polypyrimidine tract binding (PTB) protein [29], various SR proteins [30–32], various hnRNPs [33–36], ASF/SF2 [37], transformer-2 (tra-2) [38], Sam68 [39], CELF [40], muscleblind-like (MBNL) [41], Hu [42], Fox-1 and Fox-2 [43], and sex-lethal [44]. Long and Caceres [31] provide an extensive review of SR proteins and SR protein–related regulators of splicing and alternative splicing.

Tissue-Specific Isoform Expression

It is well established that alternative splicing is a tissue-specific cellular process. Since an increased number of alternatively spliced isoforms has been shown to be expressed in the brain of mammals [45], we choose to illustrate the tissue specificity of alternative splicing by discussing a case of neuron-specific regulation of this process. Several trans-acting regulatory factors for splicing are proteins providing tissue-specific regulation of alternative splicing. Nova1 and Nova2 proteins were the first tissue-specific splicing regulators identified in vertebrates [46]. Nova proteins are neuron-specific regulators of alternative splicing. The cis regulatory elements to which Nova proteins bind have been identified as YCAY clusters, where Y denotes either U or C, within the sequence of the pre-mRNA [13]. Nova proteins can promote or prevent exon inclusion in their target RNAs, depending on where they bind in relation to the exon–intron architecture of the RNA molecule. When Nova binds within exonic YCAY clusters, the exon is skipped, whereas intronic binding of Nova enhances exon inclusion. Nova promotes removal of introns containing YCAY clusters and those introns close to YCAY clusters [13]. Ule et al. [13] define a genomewide map of cis regulatory elements of the neuron-specific alternative splicing regulatory protein Nova. They combine bioinformatics with cross-linking and immunoprecipitation (CLIP) technology and splicing microarrays to identify target exons of Nova. Spliceosome assembly is differentially altered by Nova binding to different locations of cis-acting elements within the genome. Nova-regulated exons are enriched in YCAY clusters (on average ∼28 nucleotides) near the splice junctions. This is well conserved among human and mouse alternative exons regulated by Nova [13].

1.3.4 Evolution and Conservation of Splicing and Alternative Splicing

The RNA splicing process is thought to have originated from Group II introns with autocatalytic function [47, 48]. Evolutionary advantages of splicing and alternative splicing stem from various exon–intron rearrangements, which would allow for emergence of new proteins with different functions [1]. The basic splicing machinery and alternative splicing are evolutionarily conserved across species [47, 49–51]. Bioinformatic analyses have shown that alternative exons and their flanking introns are conserved to higher levels than constitutive exons [52, 53]. When compared across species, alternative exons and their splice sites are conserved, indicating their functional roles [54, 55]. Similar sequence characteristics of alternative splicing events across different species indicate that these events are functionally significant. Mouse and human genes are highly conserved. About 80% of the mouse genes have human orthologs. The Mouse Genome Sequencing Consortium (2002) reported that more than 90% of the human and mouse genomes are within conserved syntenic regions. Cross-species analyses between these two species with whole-genome sequence alignments revealed the conserved splicing events [50].

1.4 Alternative Splicing Databases

1.4.1 Genomic and Transcriptomic Sequence Analyses

In the genome era, availability of genomic sequences and the wide range of transcript sequence data enabled detailed bioinformatic analyses of alternative splicing. Multiple-sequence alignment approaches have been widely used within and across species in order to detect alternative exons and other alternative splicing events within transcriptomes [56–60]. In this section, we provide a brief overview of various alternative splicing databases and we focus on describing alternative splicing databases developed using the dbASQ system and a wide range of genome and transcriptome sequence data. The databases described here identify, classify, compute, and store alternative splicing events. In addition, they answer biological queries about current and novel splice variants within various genomes.

1.4.2 Literature Overview of Various Alternative Splicing Databases

Over the last decade, bioinformatics tools have accelerated computational analyses of alternative splicing and data generation in this field. In particular, storage and representation of sequence data have enabled collection of alternative splicing data in the form of databases. Table 1.1 provides an extensive list of alternative splicing databases and a literature source for each database. (This list may not be complete at the time of publication.) In the next section we detail the generation and utility of five specific alternative splicing databases, generally called splicing databases (SDBs), built using the computational pipeline system dbASQ.

Table 1.1 Alternative Splicing Databases.

Alternative SDBs | Description | Reference
ASPicDB | Database of annotated transcript and protein variants generated by alternative splicing | [61]
TassDB2 | Comprehensive database of subtle alternative splicing events | [62]
H-DBAS | Human transcriptome database for alternative splicing | [63]
ASTD | Alternative splicing and transcript diversity database | [64]
AS-ALPS | Database for analyzing effects of alternative splicing on protein structure, interaction, and network in human and mouse | [65]
ASMD | Alternative Splicing Mutation Database | [66]
ProSAS | Database for analyzing alternative splicing in context of protein structures | [67]
Fast DB | Analysis of regulation of expression and function of human alternative splicing variants | [68]
EuSplice | Analysis of splice signals and alternative splicing in eukaryotic genes | [69]
SpliceMiner | Database implementation of the National Center for Biotechnology Information (NCBI) Evidence Viewer for microarray splice-variant analysis | [70]
ECgene | Provides functional annotation for alternatively spliced genes | [71]
ASAP II | Analysis and comparative genomics of alternative splicing in 15 animal species | [72]
HOLLYWOOD | Comparative relational database of alternative splicing | [73]
ASD | Bioinformatics resource on alternative splicing | [74]
MAASE | Alternative splicing database designed for supporting splicing microarray applications | [75]
ASHESdb | Database of exon skipping | [76]
AVATAR | Database for genomewide alternative splicing event detection | [77]
DEDB | Database of D. melanogaster exons in splicing graph form | [78]
ASG | Database of splicing graphs for human genes | [79]
EASED | Extended alternatively spliced expressed sequence tag (EST) database | [80]
PASDB | Plant alternative splicing database | [81]
ProSplicer | Database of putative alternative splicing information | [82]
AsMamDB | Alternative splice database of mammals | [83]
SpliceDB | Database of canonical and noncanonical mammalian splice sites | [84]
ASDB | Database of alternatively spliced genes | [85]

In addition to alternative splicing databases, various computational tools and platforms, such as AspAlt [86] and SpliceCenter [87], have been developed to analyze alternative splicing across genomes. Another example is the work of Suyama et al. [88], which focuses on conserved regulatory motifs of alternative splicing. An exhaustive list of such tools and platforms is beyond the scope of this chapter.

1.4.3 SDBs

dbASQ: Computational Pipeline for Construction of SDBs

SDBs were built using a computational pipeline referred to as the dbASQ system, which is based on the AutoDB system previously reported by Zavolan et al. [89]. Figure 1.7 illustrates the dbASQ pipeline used to develop the SDBs. Input transcripts are obtained from UniGene and aligned to the University of California at Santa Cruz (UCSC) genomes using BLAT [90] and SIM4 [91]. dbASQ then filters each transcript on two criteria. First, each transcript must have at least 75% identity to the genome; transcripts with lower sequence identity are excluded from the final versions of the databases. Second, each exon of a transcript that passes the initial filter is individually screened for sequence identity to the genome and must have at least 95% identity; transcripts with one or more exons below this threshold are excluded. In addition, transcripts with only one exon are excluded because they contain no splice sites.

The remaining transcripts are clustered (Figure 1.7). Each group of transcripts that maps to a given locus in the genome is termed a splice cluster. dbASQ further filters each splice cluster by the number of transcripts it contains: a cluster must contain at least three transcripts to be included in the final version of the database, and clusters with fewer than three transcripts are excluded (Figure 1.7). After transcripts and clusters are filtered, transcript sequence data are loaded into the databases using PostgreSQL-7.4.

Database Terminology: Genomic Exons and Other Database Terms

To carry out the alternative splicing analyses using the SDBs, we defined several terms unique to our databases and our analyses. Some of these terms have been introduced by Taneri et al.
[16] and are defined as follows. A transcript is a sequence transcribed as pre-mRNA from the genomic DNA sequence and processed into mature mRNA. A splice cluster is a set of overlapping transcripts that map to the same genomic region. If a splice cluster contains differently spliced transcripts, it is termed a variant cluster. An invariant cluster contains no variant transcripts. An exon is a continuous sequence of a transcript that is mapped to the genome sequence.

To facilitate the alternative splicing analysis, in this study we define a unique notion called the genomic exon. This notion is novel to our analysis and differentiates SDBs from existing alternative splicing databases. A genomic exon is an uninterrupted genomic region aligned to one or more overlapping transcript exons. Based on the genomic exon notion, we define an intron as the genomic region located between two neighboring genomic exons. The genomic exon map of a splice cluster contains all the genomic exons and introns of that cluster. Identification and labeling of any alternative exon in a splice cluster rely on the genomic exon map of that cluster.

A constitutive exon is an exon that is present in all transcripts of a given splice cluster, and its genomic coordinates match or are contained within the corresponding genomic exon. In a variant cluster, a cassette exon is present in some transcripts and absent from others. In previous studies, these exons have been termed cryptic, facultative, or skipped. A length-invariant exon has the same splice donor and acceptor sites in all transcripts in which it is present. Length-variant exons have alternative 5′ or 3′ splice sites or both; they are therefore called 5′ variant, 3′ variant, or 5′, 3′ variant, respectively. Importantly, the coordinates of the genomic exon for a length-variant exon reflect the outermost splice sites. An exon can be both cassette and length variant.
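The definitions above can be illustrated with a minimal Python sketch. It is not the dbASQ implementation: the function names, the toy coordinates, and the use of inclusive (start, end) pairs are assumptions made for illustration. Genomic exons are obtained by merging overlapping transcript exons, introns are the gaps between neighboring genomic exons, and each genomic exon is labeled from the transcript exons it contains. The sketch treats a transcript as lacking an exon whenever none of its exons falls inside the genomic exon, which ignores partial transcripts and more complex events such as intron retention.

```python
from typing import Dict, List, Set, Tuple

Interval = Tuple[int, int]  # (start, end), inclusive genomic coordinates (assumed)


def genomic_exons(transcripts: Dict[str, List[Interval]]) -> List[Interval]:
    """Merge overlapping transcript exons into uninterrupted genomic exons."""
    exons = sorted(e for exon_list in transcripts.values() for e in exon_list)
    merged: List[Interval] = []
    for start, end in exons:
        if merged and start <= merged[-1][1]:
            # Overlaps the current genomic exon: keep the outermost end.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def introns(gexons: List[Interval]) -> List[Interval]:
    """Introns are the regions between neighboring genomic exons."""
    return [(a_end + 1, b_start - 1)
            for (_, a_end), (b_start, _) in zip(gexons, gexons[1:])]


def classify(gexon: Interval, transcripts: Dict[str, List[Interval]]) -> Set[str]:
    """Label a genomic exon by presence and by splice-site variation."""
    g_start, g_end = gexon
    # For every transcript, collect its exons contained in this genomic exon.
    hits = [[e for e in exon_list if g_start <= e[0] and e[1] <= g_end]
            for exon_list in transcripts.values()]
    present = [h for h in hits if h]
    labels = {"constitutive" if len(present) == len(transcripts) else "cassette"}
    # More than one distinct (start, end) pair means alternative splice sites.
    boundaries = {e for h in present for e in h}
    labels.add("length-invariant" if len(boundaries) == 1 else "length-variant")
    return labels


# Hypothetical splice cluster with three transcripts.
cluster = {
    "T1": [(100, 200), (300, 400), (500, 600)],
    "T2": [(100, 200), (500, 600)],              # skips the middle exon
    "T3": [(100, 200), (300, 420), (500, 600)],  # alternative end of the middle exon
}

exon_map = genomic_exons(cluster)
print(exon_map)           # genomic exons of the cluster
print(introns(exon_map))  # introns between neighboring genomic exons
for gexon in exon_map:
    print(gexon, sorted(classify(gexon, cluster)))
```

In this toy cluster the middle genomic exon spans (300, 420), the outermost splice sites, and is labeled both cassette and length variant, matching the last definition above.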
A variant exon is either cassette or length variant or both. Genomic exons to which at least portions of protein-coding regions are projected are called coding exons. Joined genomic exons (JGEs) are concatenations of all genomic exon sequences, without the intronic sequences, within a given splice cluster. JGEs are designed to facilitate the homology analyses.

Data Tables of SDBs

SDBs created using dbASQ contain six data tables; the data schema of the SDBs is shown in Table 1.2. These tables are called Cluster Table, Clone Table, Clone Exon Table, Clone Intron Table, Cds Table, and Genomic Exon Table. Cluster Table contains cluster identification numbers (IDs), chromosome IDs, and the cluster type (variant or invariant). Clone Table contains transcript IDs, cluster IDs, chromosome IDs, clone lengths, data sources of transcripts, their libraries and annotations, transcript sequences, and the number of exons of each transcript. Both Cluster Table and Clone Table contain the genomic orientation and the beginning and end genomic coordinates of transcripts. Clone Exon Table contains exon IDs, clone IDs, exon numbers, chromosome IDs, orientation, beginning and end coordinates of transcripts, transcript sequences, chromosome sequences, 5′ and 3′ splice junction sites, variation types of alternative exons, and data sources of transcripts. Clone Intron Table contains intron IDs, intron numbers, clone IDs, chromosome IDs, orientation, and data sources of transcripts. Cds Table contains clone IDs, chromosome IDs, orientation, beginning and end coordinates of chromosomes, beginning and end coordinates of transcripts, and data sources of transcripts.
Genomic Exon Table contains exon numbers, cluster IDs, chromosome IDs, orientation, and exon types (Table 1.2).

Construction of SDBs for Five Eukaryotic Organisms

Using the dbASQ system, we have constructed five relational databases for the Homo sapiens (human), Mus musculus (mouse), Rattus norvegicus (rat), Drosophila melanogaster (fruit fly), and Caenorhabditis elegans (soil worm) transcriptomes and genomes, called HumanSDB3, MouSDB5, RatSDB2, DmelSDB5, and CeleganSDB5, respectively. These databases contain expressed sequences precisely mapped to the genomic sequences using the methods described above. UCSC genome builds hg17, mm5, rn3, dm2, and ce2 were used as input genome sequences for human, mouse, rat, fruit fly, and soil worm, respectively. UniGene database versions 173, 139, and 134 provided the input transcript sequences for human, mouse, and rat, respectively. For D. melanogaster and C. elegans, the full-length transcript nucleotide sequences were downloaded via Entrez query. The query limited results to mRNA molecules and excluded expressed sequence tags (ESTs), sequence-tagged sites (STSs), genome survey sequences (GSSs), third-party annotation (TPA), working drafts, and patents. In addition, ESTs were downloaded from dbEST entries for the organisms of choice.

All sequence sets were initially localized within genomes using BLAT [90]. The BLAT suite was installed from jksrc444 dated July 15, 2002. SIM4 was then used to generate a more refined alignment of the top 10% of BLAT matches [91]. SIM4 transcript-genome alignments were included in the final splicing databases if they satisfied the criteria described above: at least 75% transcript-genome identity, at least 95% exon-genome identity, and at least two exons in the transcript. The SIM4 alignment provided exon splice sites.
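The inclusion criteria can be summarized in a short Python sketch. This is an illustration under stated assumptions, not the dbASQ code: the Alignment record, its field names, and the example values are hypothetical, and the grouping by a precomputed locus key stands in for the actual clustering of transcripts by overlapping genomic coordinates. The thresholds (75%, 95%, two exons, three transcripts per cluster) are taken from the text.

```python
from dataclasses import dataclass
from typing import Dict, List

# Thresholds stated in the text.
MIN_TRANSCRIPT_IDENTITY = 75.0  # percent identity of the whole transcript
MIN_EXON_IDENTITY = 95.0        # percent identity required of every exon
MIN_EXONS = 2                   # single-exon transcripts have no splice sites
MIN_CLUSTER_SIZE = 3            # transcripts required per splice cluster


@dataclass
class Alignment:
    """Hypothetical summary of one transcript-genome alignment."""
    transcript_id: str
    locus: str                    # stand-in for the mapped genomic region
    identity: float               # transcript-genome identity, percent
    exon_identities: List[float]  # per-exon identity, percent


def passes_filters(aln: Alignment) -> bool:
    """Apply the transcript-level and exon-level criteria."""
    return (aln.identity >= MIN_TRANSCRIPT_IDENTITY
            and len(aln.exon_identities) >= MIN_EXONS
            and all(x >= MIN_EXON_IDENTITY for x in aln.exon_identities))


def build_clusters(alignments: List[Alignment]) -> Dict[str, List[str]]:
    """Group passing transcripts by locus and drop clusters that are too small."""
    clusters: Dict[str, List[str]] = {}
    for aln in alignments:
        if passes_filters(aln):
            clusters.setdefault(aln.locus, []).append(aln.transcript_id)
    return {locus: ids for locus, ids in clusters.items()
            if len(ids) >= MIN_CLUSTER_SIZE}


# Made-up example values.
alignments = [
    Alignment("A1", "locus1", 99.0, [99.0, 98.5]),
    Alignment("A2", "locus1", 97.0, [96.0, 99.0, 97.5]),
    Alignment("A3", "locus1", 98.0, [99.0, 95.0]),
    Alignment("A4", "locus1", 98.0, [99.0, 90.0]),  # one exon below 95%
    Alignment("A5", "locus1", 99.5, [99.5]),        # single exon
    Alignment("B1", "locus2", 99.0, [99.0, 99.0]),
    Alignment("B2", "locus2", 70.0, [99.0, 99.0]),  # transcript below 75%
]

splice_clusters = build_clusters(alignments)
print(splice_clusters)
```

Here locus1 keeps three transcripts and is retained, while locus2 is left with a single passing transcript and is dropped by the cluster-size filter.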
Following the SIM4 alignment, software developed by our group was used to cluster the transcripts, compute genomic exons, and determine the variation classification for each exon, each transcript, and each locus. Database schemas represent genomic positions of transcribed subsequences with indications of variation types.

Web Access to SDBs

Online access to the PostgreSQL-7.4 SDBs is provided via the dbASQ website at the Scripps Genome Center (SGC). HumanSDB3, MouSDB5, RatSDB2, DmelSDB5, and CeleganSDB5 web pages are dynamically generated by PHP scripts deployed on the Apache-2.0 web server. PostgreSQL database connections are made via built-in PHP database functions. Each SDB has been supplemented with additional tables that provide faster online access to the SDB statistical analyses described above. General information about splice clusters and individual chromosomes is also provided. When a particular splice cluster is accessed for the first time through the Web interface, graphical cluster maps are generated as PNG files by either PHP scripts or a Perl script using the GD library. Graphical splice cluster files display the positions of color-coded genomic exons and the individual transcripts of the cluster, with projections of their exons onto the genomic map. Graphical files are cached for faster subsequent access to the splice cluster.

SDBs can be browsed by individual chromosome or by lists of splice clusters. Gene annotation keywords, splice cluster IDs, GenBank accession numbers, UniGene IDs, chromosome numbers, and the variation status of the splice clusters can be used as search parameters. Pairs of orthologous and potentially orthologous human, mouse, and rat splice clusters can be identified using any of the following parameters: keyword, gene symbol, splice cluster ID, GenBank accession number, and UniGene cluster ID.
If a pairwise comparison of particular splice clusters is requested, a PHP script generates a graphical map with lines that connect homologous genomic exons. Pairwise cluster maps are cached for faster subsequent access to a given homologous splice cluster pair. Figures 1.8–1.12