Description

Deep learning has shown remarkable promise in the field of genomics; however, there is a lack of a skilled deep learning workforce in this discipline. This book will help researchers and data scientists to stand out from the rest of the crowd and solve real-world problems in genomics by developing the necessary skill set. Starting with an introduction to the essential concepts, this book highlights the power of deep learning in handling big data in genomics. First, you’ll learn about conventional genomics analysis, then transition to state-of-the-art machine learning-based genomics applications, and finally dive into deep learning approaches for genomics. The book covers all of the important deep learning algorithms commonly used by the research community and goes into the details of what they are, how they work, and their practical applications in genomics. The book dedicates an entire section to operationalizing deep learning models, which will provide the necessary hands-on tutorials for researchers and any deep learning practitioners to build, tune, interpret, deploy, evaluate, and monitor deep learning models from genomics big data sets.
By the end of this book, you’ll have learned about the challenges, best practices, and pitfalls of deep learning for genomics.




Deep Learning for Genomics

Data-driven approaches for genomics applications in life sciences and biotechnology

Upendra Kumar Devisetty

BIRMINGHAM—MUMBAI

Deep Learning for Genomics

Copyright © 2022 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author(s), nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

Publishing Product Manager: Dhruv Jagdish Kataria

Content Development Editor: Priyanka Soam

Technical Editor: Rahul Limbachiya

Copy Editor: Safis Editing

Project Coordinator: Farheen Fathima

Proofreader: Safis Editing

Indexer: Rekha Nair

Production Designer: Mohamed Huzair

Marketing Coordinators: Shifa Ansari, Abeer Riyaz Dawe

First published: October 2022

Production reference: 1311022

Published by Packt Publishing Ltd.

Livery Place

35 Livery Street

Birmingham

B3 2PB, UK.

ISBN 978-1-80461-544-7

www.packt.com

Contributors

About the author

Upendra Kumar Devisetty has a Ph.D. in agriculture and over 12 years of experience working with next-generation sequencing. He has a deep background in genomics and bioinformatics with a specialization in applying predictive analytics across a varied set of genomics problems in life sciences. Dr. Devisetty is currently working as a senior data science manager at Greenlight Biosciences, where he leads a team of bioinformatics scientists and data scientists supporting the company's bioinformatics and data science projects, with a mission to create mRNA-based solutions for a cleaner environment and healthier people.

About the reviewer

Urminder Singh is a computer scientist and bioinformatician. His diverse research interests include understanding novel gene evolution, cancer genomics, machine learning in medicine, sociogenomics, and algorithms for big heterogeneous data. You can find him online at urmi-21.github.io.

Table of Contents

Preface

Part 1 – Machine Learning in Genomics

1

Introducing Machine Learning for Genomics

What is machine learning?

Why machine learning for genomics?

Machine learning for genomics in life sciences and biotechnology

Exploring machine learning software

Python programming language

Visualization

Biopython

Scikit-learn

Summary

2

Genomics Data Analysis

Technical requirements

Installing Biopython

Matplotlib

What is a genome?

Genome sequencing

Sanger sequencing of nucleic acids

Evolution of next-generation sequencing

Analysis of genomic data

Steps in genomics data analysis

Introduction to Biopython for genomic data analysis

What is Biopython?

Genomic data analysis use case – Sequence analysis of Covid-19

Calculating GC content

Calculating nucleotide content

Dinucleotide content

Modeling

Motif finder

Summary

3

Machine Learning Methods for Genomic Applications

Technical requirements

Python packages

ML libraries

Genomics big data

Supervised and unsupervised ML

Supervised ML

Unsupervised ML

ML for genomics

The basic workflow of ML in genomics

An ML use case for genomics – Disease prediction

Data collection

Data preprocessing

EDA

Data transformation

Data splitting

Model training

Model evaluation

ML challenges in genomics

Summary

Part 2 – Deep Learning for Genomic Applications

4

Deep Learning for Genomics

Understanding what deep learning is and how it works

Neural network definition

Anatomy of deep neural networks

Key concepts of DNNs

An example of how neural networks work

DNN architectures

DNNs for genomics

Deep learning workflow for genomics

Broad application of DNNs in genomics

Protein structure predictions

Regulatory genomics

Gene regulatory networks

Single-cell RNA sequencing

Introducing deep learning algorithms and Python libraries

General deep learning libraries

Deep learning libraries for genomics

Summary

5

Introducing Convolutional Neural Networks for Genomics

Introduction to CNNs

What are CNNs?

Transfer Learning

CNNs for genomics

Applications of CNNs in genomics

DeepBind

DeepInsight

DeepChrome

DeepVariant

Summary

6

Recurrent Neural Networks in Genomics

What are RNNs?

Introducing RNNs

How do RNNs work?

Different RNN architectures

Bidirectional RNNs (BiLSTM)

LSTMs and GRUs

Different types of RNNs

Applications and use cases of RNNs in genomics

DeepNano

ProLanGo

DanQ

Understanding RNNs through Transcription Factor Binding Site (TFBS) predictions

Summary

7

Unsupervised Deep Learning with Autoencoders

What is unsupervised DL?

Types of unsupervised DL

Clustering

Anomaly detection

Association

What are autoencoders?

Properties of autoencoders

How do autoencoders work?

Architecture of autoencoders

Types of autoencoders

Autoencoders for genomics

Gene expression

Use case – Predicting gene expression from TCGA pan-cancer RNA-Seq data using denoising autoencoders

Summary

8

GANs for Improving Models in Genomics

What are GANs?

Differences between Discriminative and Generative models

Intuition about GANs

How do GANs work?

Challenges working with genomics datasets

What is synthetic data?

How can GANs help improve models?

Practical applications of GANs in genomics

Analysis of ScRNA-Seq data

Generation of DNA

Using GANs for augmenting population-scale genomics data

Summary

Part 3 – Operationalizing models

9

Building and Tuning Deep Learning Models

Technical requirements

DL life cycle

Data processing

Data collection

Data wrangling

Feature engineering

Developing models

Selecting an appropriate algorithm

Model training

Tuning the models

Hyperparameter tuning

Hyperparameter tuning libraries

Classification metrics or performance statistics

Visualizing performance

Regression metrics

Use case – Predicting the binding site location of the JunD TF

Framing the TFBS prediction problem in terms of DL

Processing the data

Model training

Summary

10

Model Interpretability in Genomics

What is model interpretability?

Black-box model interpretability

Unlocking business value from model interpretability

Better business decisions

Building trust

Profitability

Model interpretability methods in genomics

Partial dependence plot

Individual conditional expectation

Permuted feature importance

Global surrogate

LIME

Shapley value

ExSum

Saliency map

Use case – Model interpretability for genomics

Data collection

Feature extraction

Target labels

Train-test split

Creating a CNN architecture

Summary

11

Model Deployment and Monitoring

Technical requirements

Streamlit

Hugging Face

Introducing model deployment

Steps in model deployment

Types of model deployment

Deploying models as services

A use case for deploying a DL model as a web service – building a Streamlit application of the CNN model

Monitoring models using advanced tools

Why monitor models?

Reasons for model degradation

How to monitor DL models

Advanced tools for model monitoring

Addressing drifts

Summary

12

Challenges, Pitfalls, and Best Practices for Deep Learning in Genomics

Deep learning challenges regarding genomics

Lack of flexible tools

Fewer biological samples

Computational resource requirements

Expertise in DL frameworks

Lack of high-quality labeled data

Lack of model interpretability

Common pitfalls for applying deep learning to genomics

Confounding

Data leakage

Imbalanced data

Improper model comparisons

Best practices for applying deep learning to genomics

Understand the problem and know your data better

A simple model for a simple problem

Establish a baseline for your model

Ensure reproducibility

Using pre-existing models for genomics

Do not reinvent the wheel

Tune hyperparameters automatically

Focus on feature engineering

Normalize the data

Always perform model interpretation

Avoid overfitting

Summary

Index

Other Books You May Enjoy

Preface

Deep learning is a subset of machine learning based on artificial neural networks that perform representation learning using vast amounts of data. Machine learning, in turn, is a subfield of artificial intelligence, which uses sophisticated algorithms that enable machines to mimic human intelligence and perform human tasks automatically. Both deep learning and machine learning help automatically detect meaningful patterns in data without explicit programming. Machine learning and deep learning have completely changed the way we live. We rely on them so much that it is hard to imagine a day without using them in some way or another, whether through email spam filtering, product recommendations, or speech recognition. Both machine learning and, more recently, deep learning have been adopted by the scientific community in areas such as biology, genomics, bioinformatics, and computational biology. High-throughput sequencing (HTS) technologies such as next-generation sequencing (NGS) have made it possible to study complex biological phenomena at single-base-pair resolution on an unprecedented scale, ushering in an era of big data genomics. To extract meaningful and novel biological insights from this big data, most current algorithms are based on machine learning and, lately, deep learning methodologies, which provide higher accuracy on specific genomics tasks than state-of-the-art rule-based algorithms. Given the growing application of machine learning and deep learning in genomics, research professionals, scientists, and managers need a good understanding of this exciting field, along with the necessary tools, technologies, and general guidelines for selecting machine learning and deep learning methods for handling genomics data and accelerating data-driven decision-making in the life sciences and biotechnology industries.

Throughout this book, we will learn how to apply deep learning approaches to solve real-world problems in genomics, interpret biological insights from deep learning models built from genomic datasets, and finally, operationalize deep learning models using open source tools to enable predictions for end users.

Who is this book for?

This book aims to practically introduce machine learning and deep learning for genomic applications that can transform genomics data into novel biological insights. It provides both the theoretical fundamentals and hands-on sections to give a taste of how machine learning and deep learning can be leveraged in real-world applications in the life sciences and biotech industries. This book covers a range of topics that are not currently available in other textbooks. The book also includes the challenges, pitfalls, and best practices when applying machine learning and deep learning to real-world scenarios. Each chapter of the book has code written in Python with industry-standard machine learning and deep learning libraries and frameworks such as Keras that the audience can reproduce in their working environment. This book is designed to cater to the needs of researchers, bioinformaticians, and data scientists in both academia and industry who want to leverage machine learning and deep learning technologies in genomic applications to extract insights from sets of big data. Managers and leaders who are already established in the life sciences and biotechnology sectors will not only find this book useful but can also adopt these methodologies to identify patterns, come up with predictions, and thereby contribute to data-driven decision-making in their respective companies.

The book is divided into three parts. The first part introduces the fundamentals of genomic data analysis and machine learning. In this part, we will introduce the basic concepts of genomic data analysis and discuss what machine learning is, why it is important for genomics, and what value it brings to the life sciences and biotechnology industries. The second part will transition readers from machine learning to deep learning, introducing the basic concepts of deep learning and diverse deep learning algorithms and using real-world examples to transform raw genomics data into biological insights. The final part will describe how to operationalize deep learning models using open source tools to enable predictions for end users. In this part, you will learn how to build and tune state-of-the-art machine learning models using Python and industry-standard libraries to derive biological insights from large amounts of multimodal genomic datasets and how to deploy these models on several cloud platforms such as AWS and Azure. The last chapter of the final part is fully dedicated to the current challenges of applying deep learning to genomics, the potential pitfalls, and how to avoid them using best practices.

What this book covers

Chapter 1, Introducing Machine Learning for Genomics, provides a brief history of the field of genomics and the practical application of machine learning methods to genomics, in addition to some of the technologies that this book will use.

Chapter 2, Genomics Data Analysis, gives readers a quick primer on data analysis in genomics. Using the Python programming language, readers will be able to make sense of the vast amounts of genomics data available and extract biological insights.

Chapter 3, Machine Learning Methods for Genomic Applications, introduces the reader to the two most important machine learning methods (supervised and unsupervised) and some of the important elements of standard machine learning pipelines. It also includes the practical real-world applications of supervised and unsupervised algorithms for genomics data analysis in the life sciences and biotechnology industries.

Chapter 4, Deep Learning for Genomics, will teach the reader about the fundamental concepts of deep learning, different types of deep learning models, and different deep learning Python libraries.

Chapter 5, Introducing Convolutional Neural Networks for Genomics, gives the reader a taste of Convolutional Neural Networks (CNNs), a type of deep neural network that is primarily used for sequence data, and shows how CNNs have superior performance compared to other deep learning methods.

Chapter 6, Recurrent Neural Networks in Genomics, introduces sequence modeling techniques such as Recurrent Neural Networks (RNNs) and LSTMs and shows how they are currently being applied in several genomics applications.

Chapter 7, Unsupervised Deep Learning with Autoencoders, introduces unsupervised deep learning and its different methods, specifically autoencoders, and their application in genomics.

Chapter 8, GANs for Improving Models in Genomics, introduces Generative Adversarial Networks (GANs) and how they can be used to improve deep neural networks trained on genomics datasets for predictive modeling.

Chapter 9, Building and Tuning Deep Learning Models, describes how to build and tune machine learning and deep learning models and deploy the final models across various computational systems and several platforms.

Chapter 10, Model Interpretability in Genomics, introduces the reader to how to interpret machine learning and deep learning models. The model interpretability introduced here helps readers to understand a model’s decision and why businesses are interested in model interpretability for creating trust, gaining profitability, and so on.

Chapter 11, Model Deployment and Monitoring, teaches the reader how to take the model they built on Google Colab and deploy it for predictions using open source tools such as Streamlit and Hugging Face. In addition, this chapter also describes how to monitor models using advanced tools and how monitoring is a key metric for businesses.

Chapter 12, Challenges, Pitfalls, and Best Practices for Deep Learning in Genomics, informs the reader of the challenges and pitfalls associated with applying machine learning and deep learning methodologies to genomics applications. It also covers the best practices for building end-to-end machine learning and deep learning models and applying them to genomic datasets.

To get the most out of this book

The book aims to be as self-contained as possible. To extract the maximum value out of this book, basic to intermediate knowledge of Python programming is recommended, and a background in genomics, statistics, and bioinformatics, along with some knowledge of data science, is a must. In addition, readers are expected to know the basics of machine learning and associated algorithms, such as regression and classification. The book provides a hands-on approach to implementation and the associated deep learning methodologies that will have you up and running and productive in no time. By the end of the book, you will be able to put your knowledge to work with this practical guide.

If you are using the digital version of this book, we advise you to type the code yourself or access the code from the book’s GitHub repository. Doing so will help you avoid any potential errors related to the copying and pasting of code.

Download the example code files

You can download the example code files for this book from GitHub at https://github.com/PacktPublishing/Deep-Learning-for-Genomics-. Any updates to the code will be reflected in the GitHub repository. We also have other code bundles from our rich catalog of books and videos available at: https://github.com/PacktPublishing/. Check them out!

Conventions used

There are several text conventions used throughout this book.

Code in text: Indicates code words in the text, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles. Here is an example: “Mount the downloaded WebStorm-10*.dmg disk image file as another disk in your system.”

A block of code is set as follows:

# covid19_features.py
from Bio import SeqIO

When we wish to draw your attention to a particular part of a code block, the relevant lines or items are set in bold: “First, import all the relevant libraries:”

>>> from Bio import SeqIO

Any command-line input or output is written as follows:

>>> from Bio import SeqIO

Bold: Indicates a new term, an important word, or words that you see onscreen. For instance, words in menus or dialog boxes appear in bold. Here is an example: “In the Create the default IAM role pop-up window, select Any S3 bucket.”

Tips or important notes

Appear like this.

Get in touch

Feedback from our readers is always welcome.

General feedback: If you have questions about any aspect of this book, email us at [email protected] and mention the book title in the subject of your message.

Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you would report this to us. Please visit www.packtpub.com/support/errata and fill in the form.

Piracy: If you come across any illegal copies of our works in any form on the internet, we would be grateful if you would provide us with the location address or website name. Please contact us at [email protected] with a link to the material.

If you are interested in becoming an author: If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, please visit authors.packtpub.com.

Reviews

Please leave a review. Once you have read and used this book, why not leave a review on the site that you purchased it from? Potential readers can then see and use your unbiased opinion to make purchase decisions, we at Packt can understand what you think about our products, and our authors can see your feedback on their book. Thank you!

For more information about Packt, please visit packt.com.

Share Your Thoughts

Once you’ve read Deep Learning for Genomics, we’d love to hear your thoughts! Please click here to go straight to the Amazon review page for this book and share your feedback.

Your review is important to us and the tech community and will help us make sure we’re delivering excellent quality content.

Download a free PDF copy of this book

Thanks for purchasing this book!

Do you like to read on the go but are unable to carry your print books everywhere?

Is your eBook purchase not compatible with the device of your choice?

Don’t worry, now with every Packt book you get a DRM-free PDF version of that book at no cost.

Read anywhere, any place, on any device. Search, copy, and paste code from your favorite technical books directly into your application.

The perks don’t stop there. You can get exclusive access to discounts, newsletters, and great free content in your inbox daily.

Follow these simple steps to get the benefits:

Scan the QR code or visit the link below

https://packt.link/free-ebook/9781804615447

Submit your proof of purchase

That’s it! We’ll send your free PDF and other benefits to your email directly

Part 1 – Machine Learning in Genomics

This part will describe genomics data analysis and machine learning approaches to genomics. You will use state-of-the-art machine learning methods to transform raw genomics data into insights utilizing real-life examples in the life sciences and biotechnology industries.

This section comprises the following chapters:

Chapter 1, Introducing Machine Learning for Genomics

Chapter 2, Genomics Data Analysis

Chapter 3, Machine Learning Methods for Genomic Applications

1

Introducing Machine Learning for Genomics

Machine learning (ML) is the field of science that deals with developing computer algorithms and models that can perform certain tasks without being explicitly programmed. That is to say, it teaches machines to “learn” from the input data provided to them rather than specifying explicit “rules”. The machine can then convert that learning into expertise or knowledge and use it for predictions. ML is an important tool for leveraging technologies around artificial intelligence (AI), a subfield of computer science that aims to automatically perform tasks that we, as humans, are naturally good at. ML is an important aspect of all modern businesses and research. The adoption of ML for genomics applications has accelerated recently because of the availability of large genomic datasets, improvements in algorithms, and, most importantly, superior computational power. More and more scientific research organizations and industries are expanding the use of ML across vast volumes of genomic data for predictive diagnostics, as well as to get biological insights at the scale of population health.

Genomics, the study of the genetic constitution of organisms, holds promise for understanding and diagnosing human diseases and improving our agriculture and livestock. The field of genomics has seen exponential growth in the last 15 years, mainly due to recent technological advances in high-throughput sequencing, also known as next-generation sequencing (NGS), which generates exponentially growing amounts of genomics data. It is estimated that between 100 million and as many as 2 billion human genomes could be sequenced by 2025 (https://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.1002195), representing an astounding growth of four to five orders of magnitude in 10 years and far exceeding the growth of many big data domains. This complexity and the sheer amount of data generated create roadblocks not only for acquisition, storage, and distribution but also for genomic data analysis. The current tools used in genomic analysis are built on deterministic approaches and rely on encoded rules to perform a particular task. To keep up with this data growth, we need new and innovative approaches in genomics, such as ML, to enrich our understanding of basic biology and put it to use in applied research. In this chapter, we’ll learn what ML is, why ML is essential for genomics, and what value ML brings to life sciences and biotechnology industries that leverage genome data for the development of genomic-based products. By the end of this chapter, you will understand the limitations of the current conventional algorithms for genomic data analysis, how solving problems with ML differs from conventional approaches, and how ML approaches can fill those gaps and make generating biological insights much easier.

As such, in this chapter, we’re going to cover the following main topics:

What is machine learning?

Why machine learning for genomics?

Machine learning for genomics in life sciences and biotechnology

What is machine learning?

Before we talk about ML, let’s understand what AI is. In the simplest terms, AI is the ability of a machine to mimic human intelligence and iteratively improve itself based on the information it collects. The goal of AI is to build systems to perform actions that are routinely done by humans such as problem-solving, pattern matching, image recognition, knowledge acquisition, and so on. ML, a subset of AI, is the process of training a model to learn and improve from experience. Deep learning (DL), in turn, is a subfield of ML, in which we leverage artificial neural networks (ANNs) to mimic the human brain and find the nonlinear relationships between the input and output to generate predictions (Figure 1.1):

Figure 1.1 – AI versus ML versus DL – how they are related

In ML, a model is built based on input data and an underlying algorithm to make useful predictions from real-world data. In a simplified ML setting, “features” that represent individual measurable properties of the data are provided as input, and “labels” are returned as the predictions. Suppose we want to predict whether a particular DNA sequence has a binding site for a transcription factor (TF) of interest. Using the traditional approach, we would use a position weight matrix (PWM) to scan the sequence and identify potential motifs that are overrepresented. Even though this works, it is labor-intensive, manual, and hard to scale. Using an ML-based approach, we would give an ML model plenty of DNA sequences until it learns the mathematical relationship between the features of those DNA sequences and whether or not they have binding sites (labels), based on experimental results. It then uses this knowledge to make decisions on new data and make informed predictions. For example, we could give the ML model an unknown DNA sequence, and it would predict whether the binding site motif is present. This is one example of why ML is a good fit for genomics problems. Some other ways in which ML can be used in genomics include identifying genetic disorders, predicting the type of cancer from genetic variants, improving disease prognosis, and so on.
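To make the features-and-labels framing concrete, here is a minimal sketch (not the book’s exact pipeline) of treating binding site prediction as binary classification with scikit-learn. The sequences, the “GATA” motif, and the labels are made-up toy data for illustration only:

# A minimal sketch (toy data, hypothetical "GATA" motif): one-hot encode short
# DNA sequences and train a scikit-learn classifier to predict binding sites.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def one_hot_encode(seq):
    # Encode a DNA sequence as a flat one-hot feature vector
    mapping = {"A": [1, 0, 0, 0], "C": [0, 1, 0, 0],
               "G": [0, 0, 1, 0], "T": [0, 0, 0, 1]}
    return np.array([mapping[base] for base in seq]).flatten()

# Toy 8-bp sequences; label 1 means the sequence contains the made-up motif "GATA"
sequences = ["AAGATACC", "TTGATAGG", "CCGATATT", "AAAACCCC", "TTTTGGGG", "ACGTACGT"]
labels = [1, 1, 1, 0, 0, 0]

X = np.array([one_hot_encode(s) for s in sequences])
y = np.array(labels)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42, stratify=y)
model = LogisticRegression().fit(X_train, y_train)
print(model.predict(X_test))  # predicted labels for unseen sequences

A real workflow would use thousands of experimentally labeled sequences and a carefully held-out test set, but the shape of the problem (features in, labels out) is the same.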

Why machine learning for genomics?

One of the most important events in the field of biology was the completion of the human genome sequence in 2003, which is considered one of the most significant milestones in genomics. Since then, genomics has been evolving rapidly, from research to clinical practice at scale, especially in oncology and infectious diseases. Genomics, because of its ability to identify the root causes of diseases from tiny changes in the genome, fueled the discovery of many important disease genes – particularly rare disease genes – which brought clinical decision-making one step closer to personalized medicine. As a result, sequencing efforts have exploded globally, and the amount of genomics data being generated has shot up. Along with sequencing efforts, biological techniques have increased in complexity and number, resulting in large-scale genomics data being generated. It is estimated that there will be between 2 and 40 exabytes of genomics data generated in the next decade (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4494865/). This is far more data than the current computational and bioinformatics tools can handle, extract, interpret, and turn into biological insights. ML, with its inherent nature of learning from experience, holds incredible promise for analyzing this large and complex genomic data. Since ML algorithms can detect patterns in data automatically, they are well suited to interpreting this large trove of genomic data.

ML has a strong place in genomics since it uses mathematical and data analysis techniques, applied to complex multi-dimensional datasets such as genomic datasets, to build predictive models and uncover insights from those models. ML can transform heterogeneous and large-scale genomic datasets into biological insights. ML approaches rely on sophisticated statistical and computational algorithms to make biological predictions. They do this either by mapping the complex associations between the input features and the labels (supervised methods) or by finding complex patterns in the input features and grouping samples based on similarity (unsupervised methods). They can learn useful new patterns from data that are hard for experts to find. There is now a huge demand for applying ML to genomic datasets because of its huge success in other domains.

Machine learning for genomics in life sciences and biotechnology

Because of the incredible promise that ML has shown for genomics applications such as drug discovery, diagnostics, precision medicine, agriculture, and biological research, more and more life science and biotech organizations are leveraging ML to analyze genomic data for population health and predictive analytics. According to a market research study that takes into account technology, functionality, application, and region, the global AI in genomics market is forecast to grow from $202 million in 2020 to $1.671 billion by 2025 (https://www.marketsandmarkets.com/Market-Reports/artificial-intelligence-in-genomics-market-36649899.html). The main drivers for this growth are the need to control spiraling drug costs, increasing public and private investments, and, most importantly, the adoption of AI solutions in precision medicine. The recent COVID-19 pandemic has also played its part in accelerating the adoption of AI for genomics (https://www.jmir.org/2021/3/e22453/). Even though the outlook for ML in genomics is exciting, there is a lack of a skilled workforce to develop, manage, and apply these ML methodologies in genomics. Additionally, integrating these ML systems into existing systems is a challenging task that requires a proper understanding of the concepts and techniques. For researchers to stand out from the crowd and contribute to data-driven decisions by their company, they must have the necessary skill set.

This book will address the skill gap that currently exists in the market. It is a Swiss Army knife for any research professional, data scientist, or manager who is getting started with genomic data analysis using ML. It highlights the power of ML approaches in handling genomics big data by introducing key concepts and employing real-life business examples, use cases, best practices, and so on to help fill the gaps in both the technical skill set and the general mindset within the field.

Exploring machine learning software

Before we start the tutorials, we will need some tools. To accommodate users with different operating system requirements, we will use ML software that is compatible with all major operating systems, whether Windows, macOS, or Linux. We will be using the Python programming language and Python libraries such as Biopython for genomic data analysis, scikit-learn for building ML models, and Keras for training our DL models. Let’s take a closer look at these pieces of ML software.

Python programming language

We will be using the Python programming language throughout this book. Python is a popular programming language among researchers because of its user-friendliness and the wealth of packages that support all types of data analysis. More importantly, the ML, DL, and genomics communities routinely use Python for their own analysis needs. Throughout this book, we will use Python version 3.7 and look at a few ways of installing Python using pip, Conda, and Anaconda.

Visualization

We will be using the Matplotlib and Seaborn Python packages, which are the two most popular visualization libraries in Python. They are quick to install, easy to use, and easy to import in the Python script. They both come with a variety of functions and methods to use on the data. Throughout this book, we will use Matplotlib version 3.5.1 and Seaborn version 0.11.2. We will look at a few ways of installing these libraries in the subsequent chapters.

Biopython

We will also be using Biopython, a Python module that provides a collection of Python tools for processing genomic data. It provides high-quality, reusable modules for analyzing complex genomic data and has built-in interfaces for connecting to databases such as Swiss-Prot, NCBI, Ensembl, and so on. We will use Biopython version 1.78 and look at separate ways of installing Biopython using pip, Conda, and Anaconda.

Scikit-learn

Scikit-learn is a Python package written for the sole purpose of performing ML and is one of the most popular ML libraries used by data scientists. It has a rich collection of ML algorithms, extensive tutorials, good documentation, and, most importantly, an excellent user community. For this introductory chapter, we will use scikit-learn for developing ML models in Python. Wherever applicable, we will use scikit-learn version 1.0.2 and look at separate ways of installing scikit-learn in the subsequent chapters.
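As a quick sanity check (assuming the packages above are already installed), the following optional snippet prints the installed versions so that you can compare them against the ones used in this book:

import sys
import matplotlib
import seaborn
import Bio
import sklearn

# Versions used in this book: Python 3.7, Matplotlib 3.5.1, Seaborn 0.11.2,
# Biopython 1.78, and scikit-learn 1.0.2
print("Python      :", sys.version.split()[0])
print("Matplotlib  :", matplotlib.__version__)
print("Seaborn     :", seaborn.__version__)
print("Biopython   :", Bio.__version__)
print("scikit-learn:", sklearn.__version__)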

Summary

In this first chapter, you were introduced to the concept of ML for genomics. We gained a brief understanding of ML in several genomic applications in the life science, pharma, clinical, and biotechnology industries. We also looked at the rapid strides that NGS has made in the last 15 years and how it contributed to the production of genomic big data. Then, we understood how ML can be used to analyze genomic data for the development of genomic-based products.

Finally, we looked at the programming language, the most popular genomics library, and the ML software that we will be using throughout this book. You will mainly use Python and scikit-learn for developing models, Biopython for genomic data analysis, and some open source tools for training models and putting them into production.

In the next chapter, we will introduce the fundamentals of genomic data analysis.

2

Genomics Data Analysis

Genomics gained mainstream attention when the Human Genome Project published the complete sequence of the human genome in 2003. Over the last decade, genomics has become the backbone of drug discovery, targeted therapeutics, disease diagnosis, and precision medicine, increasing the chances of successful clinical trials. For example, in 2021, over 33% of newly approved FDA drugs were personalized medicines, a trend that has been sustained over the past five years (https://www.foley.com/en/insights/publications/2022/03/personalized-medicine-2021-fda-guideposts-progress). This growing use of genomics can be mainly attributed to the drastic decrease in the cost and turnaround time of DNA sequencing. For instance, while the first human genome sequence was reported to cost around $3 billion and took 13 years to complete, today, you can get your genome sequenced in a day for less than $200 (https://www.medtechdive.com/news/illumina-ushers-in-200-genome-with-the-launch-of-new-sequencers/633133/). Because of this incredible success of the genomics industry, more research professionals and scientists than ever before are now routinely generating genomic data to understand how genome function affects human health and disease. It is estimated that over 100 million genomes will be sequenced by 2025 (https://www.biorxiv.org/content/10.1101/203554v1), and with the right analysis and interpretation, this massive amount of information could pave the way to a new golden age of precision medicine.

To find and interpret the biological information hidden within this data, it is important to have a solid foundation in genomics data analysis methods and algorithms. The objective of this chapter is to provide the fundamentals of genomics data analysis using Biopython, one of the most popular Python modules, which provides a suite of commands for working with sequence data. Using Biopython, you can make sense of the vast amounts of available genomics data and extract biological insights. If you are a genomic scientist or a researcher working in the area of genome biology, or someone already familiar with these concepts, then please feel free to skip this chapter or quickly skim through it for a refresher. By the end of this chapter, you will know the fundamentals of genomics, genome sequencing, and genome data analysis, how to use the Biopython module for genomics data analysis, and how to prepare the data in a way that is compatible with machine learning (ML).

As such, the following topics are covered in this chapter:

What is a genome?

Genome sequencing

Analysis of genomic data

Introduction to Biopython for genomic data analysis

Technical requirements

This book assumes that you have a basic knowledge of Python programming, so we will not introduce Python here.

Note

For a quick refresher on Python fundamentals, please refer to https://www.freecodecamp.org/news/python-fundamentals-for-data-science/.

Instead, you will be introduced to Biopython, which is a powerful library in Python that has tools for computational molecular biology for performing genomics data analysis.

Installing Biopython

Installing Biopython is very easy, and it will not take more than a few minutes on any operating system.

Step 1 – Verifying Python installation

Before we install Biopython, first check to see whether Python is installed using the following command in your command prompt:

$ python --version

Python 3.7.4

Note

The $ character represents the command prompt.

If your command prompt returns something like this, it shows that Python is installed and that 3.7.4 is the version of Python on your computer.

Note

Biopython only works with Python version 2.5 and above. If your Python version is <2.5, you should upgrade your Python version.

Alternatively, if you get an error like this, then you should download the latest version of Python from https://www.python.org/downloads/, install it, and then run the preceding command again:

-bash: python: command not found

Step 2 – Installing Biopython using pip

The easiest way to install Biopython is through the pip package manager, and the command to install is as follows:

$ pip install Biopython==1.79

Collecting Biopython

  Using cached https://files.pythonhosted.org/packages/4a/28/19014d35446bb00b6783f098eb86f24440b9c099b1f1ded3 3814f48afbea/Biopython-1.79-cp37-cp37m-macosx_10_9_x86_64.whl

Collecting numpy (from Biopython)

  Downloading https://files.pythonhosted.org/packages/09/8c/ae037b8643aaa405b666c167f48550c1ce6b7c589fe5540de6d83e5931ca/numpy-1.21.5-cp37-cp37m-macosx_10_9_x86_64.whl (16.9MB)

    100% |████████████████████████████████| 16.9MB 3.4MB/s

Installing collected packages: numpy, Biopython

Successfully installed Biopython-1.79 NumPy-1.21.5

The preceding response indicates that Biopython version 1.79 is successfully installed on your computer.

If you have an older version of Biopython, try running the following command to update the Biopython version:

$ pip install Biopython --upgrade

Requirement already up-to-date: Biopython in python3.7/site-packages (1.79)

Requirement already satisfied, skipping upgrade: numpy in python3.7/site-packages (from Biopython) (1.21.5)

The preceding message indicates that you already have the latest version of Biopython on your computer. If your installed Biopython is older than the latest release, the old versions of Biopython and NumPy (a dependency of Biopython) will be replaced by the new ones.

Step 3 – Verifying Biopython installation

After you have successfully installed Biopython, you can verify Biopython installation on your machine by running the following command in your Python console:

$ python3.7

Python 3.7.4 (default, Aug 13 2019, 15:17:50)

[Clang 4.0.1 (tags/RELEASE_401/final)] :: Anaconda, Inc. on darwin

Type "help", "copyright", "credits" or "license" for more information.

>>> import Bio

>>> print(Bio.__version__)

1.79

Note

>>> here represents the Python prompt where you would enter code expressions. Please note the underscores before and after the version. You can exit the console by using the exit() command or pressing Ctrl + D (which works in Linux and macOS only).

The preceding output shows the version of Biopython, which, in this case, is 1.79. If that command fails, your Biopython version is very out of date and you should upgrade it to the new version as indicated before.

There are alternate ways of installing Biopython, such as installing from the source, and you will find more information on it here: https://biopython.org/wiki/Packages

Matplotlib

We will be using Matplotlib, a very popular Python library for visualization. It is one of the easiest libraries to install and use. To install Matplotlib, simply run pip install matplotlib in your terminal. Then, you can include import matplotlib.pyplot as plt in your Python script, which you will see later in the hands-on section of this chapter.
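As a small taste of what that looks like (using a made-up sequence rather than data from this book’s use case), the following sketch plots the nucleotide composition of a short DNA string as a bar chart:

import matplotlib.pyplot as plt

sequence = "ATGCGCGATTACAGGCTA"  # made-up example sequence
bases = ["A", "T", "G", "C"]
counts = [sequence.count(base) for base in bases]

plt.bar(bases, counts)            # one bar per nucleotide
plt.xlabel("Nucleotide")
plt.ylabel("Count")
plt.title("Nucleotide composition")
plt.show()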

What is a genome?

Before we discuss genomes, let’s do a quick Genetics 101. A cell represents the fundamental structural and functional unit of life. DNA contains the instructions that are needed to perform different activities of the cell. DNA is the basis of genetic studies and consists of four building blocks called nucleotides – adenine (A), guanine (G), cytosine (C), and thymine (T), which store information about life. The sequence of DNA is a string of these building blocks, also referred to as bases. DNA has a double-helix structure with two complementary polymers interlaced with each other. In the complementary strand of DNA, A matches with T, and G matches with C, to form base pairs.

A genome represents the full DNA sequence of a cell that contains all the hereditary information. The genome consists of information that is needed to build and maintain the whole living organism. The size of genomes is different from species to species. For example, the human genome is made up of 3 billion base pairs spread across 46 chromosomes, whereas the bread wheat genome consists of 42 chromosomes and ~ 17 gigabases. A region of a genome that transcribes into a functional RNA molecule, or transcribes into an RNA and then encodes a functional protein, is called a gene. This is the simplest of definitions, and there are several definitions of a gene, but overall, a gene constitutes the fundamental unit of heredity of a living organism. By analogy, you can imagine the four nucleotides (A, G, C, and T) that make up the gene as letters in a sentence, genes as sentences in a book, and the genome as the actual book consisting of tens of thousands of words.
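To see these ideas in code, here is a minimal Biopython sketch (with a made-up DNA fragment) that treats a sequence as a string of bases and derives the complementary strand from the A-T and G-C pairing rules:

from Bio.Seq import Seq

dna = Seq("ATGGTACCGATTA")  # a made-up DNA fragment
print("Length            :", len(dna))
print("Complement        :", dna.complement())          # A<->T, G<->C pairing
print("Reverse complement:", dna.reverse_complement())  # the opposite strand, read 5' to 3'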

Genome sequencing

After the discovery of the DNA structure, scientists were curious to determine the exact sequence of DNA (that is, to read the whole book). A lot of pioneering discoveries paved the way for the sequencing of DNA, starting with Walter Gilbert, who published the first nucleotide sequence of DNA, the 24-base-pair lac operator, in 1973 (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC427284). This was followed by Frederick Sanger, who for the first time sequenced the complete DNA genome of the phi X174 bacteriophage (https://pubmed.ncbi.nlm.nih.gov/731693/). Sanger pioneered the first-ever sequencing of genes through the method of DNA sequencing with chain-terminating inhibitors. For much of the last 50 years, available sequencing technologies were restricted to relatively small genomes, but advances in DNA sequencing technologies such as next-generation sequencing (NGS) have revolutionized genome sequencing because of their low cost, speed, throughput, and accuracy. As more and more genomes are sequenced, this extended knowledge can be utilized for the development of personalized medicines to prevent, diagnose, and treat diseases and, ultimately, support clinical decision-making and healthcare. It is beyond the scope of this chapter to go into the historical background of DNA sequencing technologies. Here, we will briefly present the different DNA sequencing technologies that have led to major milestones in genome sequencing.

Sanger sequencing of nucleic acids

Sanger sequencing, also known as the “chain termination method”, is the first-generation sequencing method and was developed by Frederick Sanger and his colleagues in 1977 (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC431765/). It was used by the Human Genome Project to sequence the first draft of the human genome. This method relies on the natural mechanism of DNA replication. Sanger sequencing involves the random incorporation of chain-terminating bases called dideoxyribonucleoside triphosphates (ddNTPs) by DNA polymerase into strands during copying, resulting in the termination of in vitro DNA synthesis. The bases in the resulting short fragments are then read out based on the presence of a dye molecule attached to the special base at the end of each fragment. Despite being replaced by NGS technologies, Sanger sequencing remains the gold standard sequencing method and is routinely used by research labs throughout the world for quick verification of short sequences generated by the polymerase chain reaction (PCR) and other methods.

Evolution of next-generation sequencing

Next-generation sequencing first became available at the start of the 21st century, and it completely transformed the biological sciences, allowing any research lab to perform sequencing for a wide variety of applications and to study biological systems at a level unimaginable before. NGS aims to determine the order of nucleotides in the entire genome or in targeted regions of DNA or RNA in a high-throughput and massively parallel manner. The biggest advance that NGS offered compared to traditional Sanger sequencing and other approaches is the ability to produce ultra-high-throughput genomic data at unprecedented scale and speed. NGS technologies can be broadly divided into second-generation and third-generation sequencing technologies, depending on the type of sequencing methodology.

Second-generation DNA sequencing technologies

Alongside the developments in large-scale dideoxy sequencing, the other family of sequencing technologies that emerged in the last 15 years and revolutionized DNA sequencing is NGS. There are many technologies under the NGS umbrella, starting with pyrosequencing by 454 Life Sciences, but the most important of them are the bridge amplification technologies brought forth by Solexa, which was later acquired by Illumina. In contrast to Sanger sequencing, Illumina leverages sequencing by synthesis (SBS) technology, which tracks labeled nucleotides as the DNA is copied in a massively parallel fashion, generating output ranging from 300 kilobases up to several terabases in a single run. The typical fragment sizes generated by Illumina are in the range of 50-300 bases. Illumina dominates the NGS market.

Third-generation DNA sequencing technologies

These technologies are capable of sequencing single DNA molecules at full length without amplification, and they produce sequences (also called reads) that are much longer than those from second-generation sequencing technologies such as Illumina. Pacific Biosciences and Oxford Nanopore Technologies dominate this sector with their systems, called single-molecule real-time (SMRT) sequencing and nanopore sequencing, respectively. Each of these technologies can rapidly generate very long reads of up to 15,000 bases from single molecules of DNA and RNA.

Analysis of genomic data

Genomic data analysis aims to provide biological interpretation and insights from genomics data and help drive innovation. It is like any other data analysis, with the exception that it requires domain-specific knowledge and tools. With the advances in NGS technologies, it is estimated that genomic research will generate significant amounts of data in the next decade. However, our ability to mine insights from this big data is lagging behind the pace at which the data is being generated. As new and larger volumes of high-throughput genomic data are generated, data analysis capabilities are sought-after skills for researchers and other scientific professionals in both academia and industry.

Steps in genomics data analysis

Genomics data is generally complex in nature and size. Researchers are currently facing an exciting yet challenging time with this available data, which needs to be analyzed and understood. Analyzing this big genomics data can be extremely challenging. A natural course of action is ML, which is fast becoming the go-to method for analyzing this data to mine biological insights. We will discuss the application of ML to genomics data in the next chapter, but for now, let’s understand the different steps involved in the analysis of genomics data. The main aim of genomics data analysis is the biological interpretation of large volumes of raw genomics data. It is very similar to any other kind of data analysis, but it often requires domain-specific knowledge and tools. Here, we will discuss the steps for analyzing genomics data. A typical analysis of genomic data consists of multiple interdependent extract, transform, and load (ETL) steps that transform raw genomic data from sequencing machines into more meaningful downstream data used by researchers and clinicians.

A typical genomics data analysis workflow consists of the following steps: data collection, data cleaning, data processing, exploratory data analysis (EDA), modeling, visualization, and reporting. You might have seen several versions of this workflow, but ultimately it boils down to these five main steps – raw data collection, transforming the data, EDA, modeling (statistical or machine learning), and biological interpretation, as shown in Figure 2.1:

Figure 2.1 – Steps in a typical genomic data analysis workflow

Even though the steps in genomic data analysis are linear, there are instances when you will go back and repeat many of these steps to answer questions related to data quality, add new datasets to the analysis, or optimize parameters.

Data collection

Data collection refers to any source, experiment, or survey that provides raw data. The genomic data analysis starts right after the experimentation stage, where data is generated if no data is available or if the question cannot be solved with available data. Many times, you don’t need to generate the data and instead can use publicly available datasets and specialized databases. The type and amount of data that needs to be collected are entirely dependent on the questions along with the technical and biological variability of the experimental system under study.

Data transformation

Data transformation consists of converting raw data into data that can be used for downstream analysis, such as EDA and modeling. It consists of quality control and data processing.

Quality control and cleaning

The quality control and cleaning step aims to identify quality issues in the data and remove them from the raw dataset, generating high-quality data for modeling purposes. Quite frequently, the analysis starts with raw genomics data (or processed data if you are lucky), and it is mostly messy. Like any other raw data, genomic data contains missing values, outliers, invalid entries, and other noise. Since it is a well-known fact that the quality of the output is determined by the quality of the input (garbage in, garbage out), it is super important to subject the data to quality control and cleaning before using it for downstream analysis.

Data processing

The goal of data processing is to convert raw data into a format that is amenable to visualization or EDA or modeling. The data processing step starts after the raw data was collected and cleaned. Briefly, the data processing steps include the following:

Data munging, which is transforming data from one format to another

Data transformation, which includes either normalizing the data or log transformation of the data

Data filtering, which is a key step to remove data points that have unreliable measurements or missing values (a minimal sketch of these three steps follows this list)
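Here is a small, hedged sketch (using a made-up expression matrix; the gene and sample names are illustrative only) of what munging, transformation, and filtering can look like with pandas and NumPy:

import numpy as np
import pandas as pd

# Made-up counts for four genes across two samples (one missing value)
counts = pd.DataFrame(
    {"sample1": [1200, 5, np.nan, 300],
     "sample2": [980, 7, 15.0, 260]},
    index=["geneA", "geneB", "geneC", "geneD"])

tidy = counts.reset_index().rename(columns={"index": "gene"})   # munging: reshape and rename
logged = np.log2(counts + 1)                                    # transformation: log2(x + 1)
filtered = logged.dropna().loc[lambda df: df.mean(axis=1) > 3]  # filtering: drop missing and low values

print(filtered)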

Exploratory data analysis

EDA is the most crucial step of any data