E-Book
32,99 €

Genomics in the AWS Cloud E-Book

Catherine Vacher

Publisher: John Wiley & Sons
Category: Specialist literature
Language: English

Description

Perform genome analysis and sequencing of data with Amazon Web Services

Genomics in the AWS Cloud: Analyzing Genetic Code Using Amazon Web Services enables a person who has moderate familiarity with the AWS Cloud to perform full genome analysis and research. Using the information in this book, you'll be able to take a FASTQ file containing raw data from a lab or a BAM file from a service provider and perform genome analysis on it. You'll also be able to identify potentially pathogenic gene sequences.

* Get an introduction to Whole Genome Sequencing (WGS)
* Make sense of WGS on AWS
* Master AWS services for genome analysis

A key advantage of using AWS for genomic analysis is that researchers can draw on a wide choice of compute services to process diverse datasets in analysis pipelines. The genomic sequencers that generate raw data files are located in on-premises labs, and AWS provides solutions that make it easy for customers to transfer these files to AWS reliably and securely. Storing genomic and medical (e.g., imaging) data at different stages requires enormous amounts of cost-effective storage. Amazon Simple Storage Service (Amazon S3), Amazon Glacier, and Amazon Elastic Block Store (Amazon EBS) provide the necessary solutions to securely store, manage, and scale genomic file storage. Moreover, the storage services can interface with various compute services from AWS to process these files.

Whether you're just getting started or have already been analyzing genomics data using the AWS Cloud, this book provides you with the information you need in order to use AWS services and features in the ways that will make the most sense for your genomic research.

Details

Page count: 454

Year of publication: 2023

Sample excerpt

Cover

Title Page

Introduction

Who Should Read This Book

Genomics

Cloud Computing and AWS

What You'll Learn from This Book

Our Story

Getting Under Way

CHAPTER 1: Why Do Genome Analysis Yourself When Commercial Offerings Exist?

Commercial Sequencing Services

Typical Results

Summary

CHAPTER 2: A Crash Course in Molecular Biology

DNA

DNA at Work: RNA and Proteins

Inheritance

Summary

CHAPTER 3: Obtaining Your Genome

Preparing to Have Your Genome Sequenced

Specifying Lab Work

Engaging a Laboratory

Getting a Tissue Sample for DNA Extraction

Shipping the Sample

Receiving the Results

Summary

CHAPTER 4: The Bioinformatics Workflow

Extraction of DNA

FASTA Files

FASTQ Files

Alignment to a Reference Genome

Reference Genomes

Quality Control

Trimming

The Alignment Process

Marking Duplicates

Recalibrating Base Quality Score

Calling SNVs and Indel Variants

Annotating SNVs and Indel Variants

Prioritizing Variants

Inheritance Analysis

Identifying SVs and CNVs

Bioinformatics Workflow

Summary

CHAPTER 5: AWS Services for Genome Analysis

General Concepts

Custom Environments

Summary

CHAPTER 6: Building Your Environment in the AWS Cloud

Setting Up a Virtual Private Cloud

Setting Up and Launching an EC2 Instance

Setting Up S3 Buckets

Configuring Your Account Securely

Creating Groups

Creating Users

Setting Up Your Client Environment

Summary

CHAPTER 7: Linux and AWS Command-Line Basics for Genomics

Selecting a Linux Distribution

Accessing Your AWS Linux Instance from Your Local Computer

Getting Familiar with the Command Line

Transferring Files to and from Your AWS Instance

Running Programs in the Background

Understanding File Permissions

Compressing and Archiving Files

Managing Linux

The AWS Command-Line Interface

AWS CLI Essentials

An Alternative Approach: AWS Systems Manager

Summary

CHAPTER 8: Processing the Sequencing Data

Getting from Data to Information

Setting Up AWS Services and Data Storage

Summary

CHAPTER 9: Visualizing the Genome

Introducing Genome Visualizers

Installing the IGV Desktop Visualizer

Analyzing Variants in IGV

Summary

CHAPTER 10: Containerizing Your Workflow on the Desktop

Introducing Containerization

Understanding and Using Docker

Summary

CHAPTER 11: Variants and Applications

Polygenic Risk Scores

Metagenomics

AlphaFold

Summary

CHAPTER 12: Cancer Genomics

Somatic Genomes

Cancer

The Promise and Reality of Cancer Precision Medicine

Samples

Somatic Variant Analysis

Copy Number Changes

Measuring Tumor Genomic Instability

Summary

Notes

Index

Dedication

Acknowledgments

About the Authors

End User License Agreement

List of Tables

Chapter 7

Table 7.1: Main Linux Server Distributions

Table 7.2: Linux Permissions, Octal Code, and Interpretation

Table 7.3: Redirection Operators

Chapter 9

Table 9.1: Setting hg38 Options in the IGV Browser

Table 9.2: Setting hg19 Options in the IGV Browser

Table 9.3: Mapping Quality (MAPQ) Scores and Their Corresponding Probabiliti...

Chapter 11

Table 11.1: Example of PGS Catalog Scoring File

List of Illustrations

Chapter 1

Figure 1.1: My ancestry timeline

Figure 1.2: My wife's ancestry timeline

Figure 1.3: 23andMe raw data

Chapter 2

Figure 2.1: DNA molecule representation

Figure 2.2: Human karyotype

Figure 2.3: Transcription and translation

Figure 2.4: The genetic code

Figure 2.5: The protein alcohol dehydrogenase

Figure 2.6: Family pedigree

Chapter 3

Figure 3.1: The four raw data files resulting from sequencing cellular mater...

Figure 3.2: A graph showing how many times each base pair in a sequence was ...

Chapter 4

Figure 4.1: Phred scores

Figure 4.2: Alignment of paired reads to the reference genome

Figure 4.3: Bioinformatics workflow

Chapter 5

Figure 5.1: Selecting a VPC configuration

Figure 5.2: Configuring the VPC with a single public subnet

Figure 5.3: Confirming the VPC creation

Chapter 6

Figure 6.1: The main console page, where you should search for VPC

Figure 6.2: Existing VPCs and other network resources

Figure 6.3: The initial step of the VPC wizard

Figure 6.4: Specifying network characteristics of the new VPC in the VPC Wiz...

Figure 6.5: The second step of the VPC wizard, with VPC name and address blo...

Figure 6.6: VPC creation success

Figure 6.7: The newly created VPC among other VPCs

Figure 6.8: The main EC2 status page

Figure 6.9: Selecting an image for the EC2 instance

Figure 6.10: Choosing an AMI image to use as a base for the new server

Figure 6.11: Specifying the virtual hardware for the EC2 instance

Figure 6.12: Attaching the new EC2 instance to the VPC created earlier

Figure 6.13: Adding storage

Figure 6.14: Adding tags

Figure 6.15: Configuring a security group

Figure 6.16: Reviewing the instance launch details

Figure 6.17: An EC2 instance starting up, showing in Pending state.

Figure 6.18: Stopping an instance to save running costs

Figure 6.19: A list of all existing buckets

Figure 6.20: Naming a new bucket

Figure 6.21: Choosing bucket options

Figure 6.22: Declining the option to make the new bucket public

Figure 6.23: The new bucket in the list of all buckets

Figure 6.24: The main IAM page

Figure 6.25: Activating multifactor authentication

Figure 6.26: Choosing an MFA device

Figure 6.27: Getting a QR code for scanning

Figure 6.28: IAM wants you to set up a password policy for enhanced security...

Figure 6.29: Setting a password policy

Figure 6.30: Specifying characteristics of the password policy

Figure 6.31: A selection of password policy characteristics

Figure 6.32: Naming your group

Figure 6.33: Assigning the standard PowerUserAccess policy to the new Genomi...

Figure 6.34: Creating two new users

Figure 6.35: Assigning the new users to a group

Figure 6.36: Two new users, successfully created

Figure 6.37: Connecting to an EC2 instance via MacOS Terminal

Figure 6.38: Creating a .ppk file with PuTTY Key Generator

Figure 6.39: Adding a .ppk file to PuTTY

Figure 6.40: Specifying an EC2 instance in PuTTY

Figure 6.41: A successful PuTTY connection

Figure 6.42: Attaching an S3 location to Windows

Chapter 7

Figure 7.1: Configuration window of PuTTY

Figure 7.2: Configuration of PuTTY with support for an X server

Figure 7.3: Ubuntu desktop running as a virtual machine (with VMware) on Win...

Figure 7.4: Configuration of an Ubuntu Virtual machine with VMware on Window...

Figure 7.5: FTP client interface (FileZilla) that allows the transfer of fil...

Figure 7.6: Linux file permissions

Chapter 8

Figure 8.1: Signing into the AWS Console

Figure 8.2: Creating an EFS storage volume

Figure 8.3: Adding a file system

Figure 8.4: Assigning a name

Figure 8.5: Successful creation of a storage volume

Figure 8.6: Getting the name of the security group

Figure 8.7: The AWS EC2 administration console

Figure 8.8: Launching EC2 instances

Figure 8.9: Choosing an operating system

Figure 8.10: Selecting an EC2 instance type

Figure 8.11: EC2 default characteristics

Figure 8.12: Adding storage to an EC2 instance

Figure 8.13: Skipping ahead without creating tags

Figure 8.14: Reviewing the characteristics of a new instance

Figure 8.15: Launching a new EC2 instance

Figure 8.16: Getting the security key pair

Figure 8.17: Launching a new instance

Figure 8.18: A running EC2 instance

Figure 8.19: Clicking View Instances

Figure 8.20: Seeing that an EC2 instance is in the Running state

Figure 8.21: Stopping an EC2 instance

Figure 8.22: Confirming EC2 stop

Figure 8.23: Seeing that an EC2 instance is in the Stopped state

Figure 8.24: Restarting an EC2 instance

Figure 8.25: Noting the IP address of an EC2 instance

Figure 8.26: Adding a storage volume to an EC2 instance

Figure 8.27: Creating a Security Group

Figure 8.28: Confirming the characteristics of the security group

Figure 8.29: A newly created security group

Figure 8.30: Selecting the EFS mount security group

Figure 8.31: Editing inbound security group rules

Figure 8.32: Adding a rule

Figure 8.33: Noting the creation of the new rule

Figure 8.34: Selecting EFS storage

Figure 8.35: Attaching an EFS volume

Figure 8.36: Information about the mounted volume

Figure 8.37: Specifying a host in PuTTY

Figure 8.38: Selecting a PPK key file in PuTTY

Figure 8.39: A successful connection to a remote EC2 instance, via PuTTY

Figure 8.40: Observing the contents of an EFS volume

Figure 8.41: Changing the type of an EC2 instance

Figure 8.42: Applying the type change

Figure 8.43: Restarting the modified EC2 instance

Chapter 9

Figure 9.1: The UCSC Genome Browser displaying the tumor necrosis factor gen...

Figure 9.2: The third exon of the TNF gene

Figure 9.3: The IGV desktop main page

Figure 9.4: The appearance of the IGV browser on initial startup

Figure 9.5: Adding a user in an AWS account

Figure 9.6: Granting S3 storage permissions to the newly created user

Figure 9.7: AWS reporting that the new user has been set up correctly

Figure 9.8: Opening a data file in the IGV browser

Figure 9.9: Choosing a genome reference

Figure 9.10: Specifying the hg38 human genome as a reference

Figure 9.11: Loading a BAM file into the IGV browser

Figure 9.12: Setting up the UCSC Table Browser to download functional elemen...

Figure 9.13: The UCSC Genome Browser showing the ACE2 gene

Figure 9.14: User settings and preferences in the UCSC Genome Browser

Figure 9.15: Resetting everything to default values

Figure 9.16: Choosing a publicly available BAM file representing a populatio...

Figure 9.17: Navigating a genome in the UCSC Genome Browser

Figure 9.18: UCSC Genome Browser indicating that a particular DNA fragment w...

Figure 9.19: An indication that the aligner is having trouble mapping a sequ...

Figure 9.20: IGV indicating an exon

Figure 9.21: IGV indicating an inserted sequence

Chapter 10

Figure 10.1: Virtual machines each have their own operating systems.

Figure 10.2: Containers have their own configurations on top of the host's o...

Figure 10.3: A Docker image can spawn many (potentially slightly different) ...

Figure 10.4: Docker showing a running container

Figure 10.5: Docker showing a newly added container in operation

Figure 10.6: A Docker image for running the MySQL database

Figure 10.7: The Broad Institute's Genomes in the Cloud image

Figure 10.8: Downloading the GITC image

Chapter 11

Figure 11.1: A view of an uploaded protein 3D structure model

Figure 11.2: The sequence of PDB structure 2BBO with residue Phe508 selected...

Figure 11.3: A comparison of the structures of wild-type (PDB 2BBO shown lig...

Figure 11.4: A comparison of the structures of wild-type (PDB 2BBO, lighter)...

Figure 11.5: The surface of wild-type CFTR in the vicinity of Phe508

Figure 11.6: The surface of Phe508Del mutant CFTR in the vicinity of where P...

Chapter 12

Figure 12.1: Revised hallmarks of cancer (adapted from Hallmarks of Cancer: ...

Figure 12.2: RAS/RAF/MEK/ERK/PI3K signaling pathways and associated cancer d...

Figure 12.3: TCGA GDC Data Portal, the most frequently mutated Cancer Census...

Figure 12.4: Plotting of read depth using CNVpytor. The squared dark line in...

Genomics in the AWS® Cloud

Analyzing Genetic Code Using Amazon Web Services

Catherine Vacher

David Wall

Introduction

Welcome to Genomics in the AWS Cloud!

From its title, you can conclude that this book is about two things: genomics (the science of sequencing and interpreting genetic data) and Amazon Web Services (one of the three big hosted computing platforms). Genomics in the AWS Cloud, therefore, is meant to appeal either to people from a biology background who want to learn how to do genomics work with AWS or to people with a computer background who want to find out how to apply their skills to genomics.

Both of these areas, genomics and cloud computing, are evolving constantly, and practically no one can claim to be completely au fait with either. This book, therefore, aims at not one but two separate moving targets. Our goal as authors is not to teach you everything there is to know about AWS and genomics—or even about the intersection of the two fields—but rather to show you the following:

Enough of the general concepts of cloud computing and genomics that you understand the problems to be solved and the technologies available to work on those problems

Enough specifics to enable you to work through actual genomics tasks and see results

Who Should Read This Book

This book is intended for people who aren't content to use commercial genome sequencing services and want to do their own analysis. We walk you through the process of getting raw data from a blood sample via a lab and then using the AWS services to analyze it—learning which genes are present in the sample and what they might say about you and your health. This will enable you to investigate aspects of your genome that commercial services don't explore because they are not allowed to give medical advice.

As well, this book is suited to people who want to learn about the AWS cloud and want to structure their study around a useful field—genomics.

Genomics

At the core of genomics is genome sequencing, which is the process of taking some biological material, such as blood or tissue, and converting it to pure information. This is a complicated process that combines the traditional work of a biologist (which is to say, manipulating actual cells in a “wet lab” environment) with information technology. Cells go into the process; a computer-readable data file comes out.
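To make "a computer-readable data file" a little more concrete, here is a minimal sketch of reading a FASTQ file, the raw-read format this book works with later. It assumes the common four-line record layout and Phred+33 quality encoding; the function names and the example read are invented for illustration and are not taken from the book's own workflow.

```python
import io
from typing import Iterator, List, NamedTuple


class FastqRecord(NamedTuple):
    read_id: str
    sequence: str
    qualities: List[int]  # one Phred score per base


def parse_fastq(handle) -> Iterator[FastqRecord]:
    """Yield records from a text handle with four lines per read."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of file (or a trailing blank line)
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # the '+' separator line is ignored
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@"):
            raise ValueError(f"unexpected header line: {header!r}")
        # Assumption: Phred+33 encoding, so score = ASCII code - 33.
        scores = [ord(char) - 33 for char in quality]
        yield FastqRecord(header[1:], sequence, scores)


def error_probability(phred_score: int) -> float:
    """Convert a Phred score Q into the chance that the base call is wrong."""
    return 10 ** (-phred_score / 10)


# A made-up single read, held in memory instead of a real file.
example = io.StringIO("@read_1\nGATTACA\n+\nIIIIII#\n")
for record in parse_fastq(example):
    print(record.read_id, record.sequence, record.qualities)
```

Real sequencing output is usually gzip-compressed and very large, so in practice you would open the file with `gzip.open(path, "rt")` and stream it rather than load it into memory.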

Genome sequencing took a long time to figure out. Crude, expensive methods were first employed in the 1970s and 1980s. More automated methods became available in the late 1990s, and these enabled the sequencing of relatively simple organisms: yeasts, bacteria, and a nearly microscopic nematode worm (Caenorhabditis elegans, long popular in biology labs as an experimental subject). Uncomplicated plants (notably Arabidopsis thaliana—a European weed with a particularly small genome) and modest insects (Drosophila melanogaster—the fruit fly), both longtime standard experimental subjects, soon followed around the turn of the century.

Two research groups described the first draft human genome sequences in articles in the journals Nature and Science in 2001. Scientists have worked to refine the human genome sequence since then and also have worked to sequence the genomes of thousands of other organisms.

Key to their work has been a continuous drop in the cost of full-genome sequencing. The first human genome sequence in 2001 cost roughly 2.7 billion U.S. dollars to produce; it required funding of the sort only national governments could provide. Within a few years, by 2005, the cost had fallen by more than three orders of magnitude to something like $1 million, which is still quite a lot. At this writing, in 2022, it is possible to have a human genome sequenced for less than the cost of a high-end smartphone, and plenty of companies are attracting funding for their plans to bring the cost to less than $100. By the time the asteroid 99942 Apophis makes its closest approach to Earth in 2029, sequencing a full genome will almost certainly cost about the same as the simplest routine medical blood test costs now. The cost of knowing everything about your genetic makeup will be trivial (assuming Apophis doesn't render this unimportant, which the latest reports assure us it won't).
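As a quick check on the scale of that price drop, the sketch below computes how many powers of ten separate the dollar figures quoted in this section. The helper function and variable names are ours and purely illustrative.

```python
import math


def orders_of_magnitude(start_cost: float, end_cost: float) -> float:
    """Return how many powers of ten separate two costs."""
    return math.log10(start_cost / end_cost)


first_genome = 2.7e9  # USD, the figure quoted for the first human genome
cost_by_2005 = 1e6    # USD, "something like $1 million"
target_cost = 100     # USD, the price point companies are aiming for

print(f"2001 to 2005: {orders_of_magnitude(first_genome, cost_by_2005):.1f}")
print(f"2001 to $100: {orders_of_magnitude(first_genome, target_cost):.1f}")
```

Run as written, this reports roughly 3.4 orders of magnitude for the drop to $1 million and roughly 7.4 for the drop to a $100 genome.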

The dramatic fall of the price of genome sequencing, from billions of dollars at the turn of the twenty-first century to a few hundred dollars today, makes it possible for almost all of us to explore our genetic makeup. While we all, as humans, share practically all of our genetic code (upwards of 99.9 percent), the differences make all the difference.

The tiny fraction of our individual genomes that differs from other humans is what accounts for whether we are male or female, all of our physical characteristics, many of our personality traits, and our propensity to health or various kinds of disease.

The availability of low-cost genome sequencing has revolutionized medical and pharmaceutical research and has started to change the practice of medicine. It also enables us to start to understand the building blocks of life and how much, or rather how little, we differ from other life forms.

Genomics in the AWS Cloud is about discovering and studying those differences and learning from them. But there is another part to the equation, which is to say the other set of tools and techniques identified in the title.

Cloud Computing and AWS

Almost in parallel with the advances in genomics that took place between 1995 and the present day, cloud computing evolved enormously, and today it represents a standard way of designing, deploying, and operating information processing systems.

Now, the idea of computing resources that are not local to the people who need them is not new at all. The earliest commercial and scientific computers were, of course, mainframes that were shared across many users—and more than a few of these remain in place today. Servers in organizational or co-location data centers, providing storage and computing resources to privileged users and the general public, have long been part of information technology. In the case of mainframes and client-server systems, users access remote computer systems (often not knowing or caring where they actually are). Functionally, that's cloud computing, and it's not a new thing.

What is new is the ease with which modern cloud computing platforms allow rapid construction and cost-effective use of complex and powerful systems. You can quickly set up elaborate workflows, test them with minimal computing power, and then scale them up enormously when it's time for a production run. More or less, you pay only for the computing power you use, and there are ways to schedule the use of processor cycles for times of low demand, when computing is cheaper. With the exception of storage and data transfer—meaning machine images that can be turned into working compute resources, as well as input and output data—systems configured in the cloud can cost practically nothing when they are not doing useful work. Such efficient use of expensive resources isn't possible with on-premises or traditionally hosted solutions.

There are three main players in the cloud-computing industry:

Amazon Web Services (AWS)

Google Cloud Platform (GCP)

Microsoft Azure

Each of them has its points of strength and weakness, and those relative merits are beyond the scope of this publication. Most organizations mix and match parts of each anyway to take advantage of relative technical superiorities and to maintain leverage with vendors. Suffice it to say that we chose to do our genomics work on AWS.

Considering that genome analysis is essentially a workflow to be carried out on storage and computing resources, AWS is well suited to the job. Here are some of the tools we will use:

Elastic Compute Cloud (EC2) for building and running the Linux servers that actually run the software required for genome analysis

Elastic Block Store (EBS) for maintaining updated disk images, ready to attach to a working machine when needed

Simple Storage Service (S3) for storing input and output files when ready access to them is needed

Simple Workflow Service (SWF) for automating processes

Glacier when files need to be archived at low cost

Identity and Access Management (IAM) for maintaining security and appropriate user privileges
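As a rough illustration of how the storage services in this list fit together, the sketch below assembles (but does not run) two AWS CLI commands: one that uploads a raw data file to S3 and one that writes a finished file to a lower-cost archival storage class. The bucket name and file names are hypothetical placeholders, and this is only a sketch of the general idea, not the procedure the later chapters follow.

```python
import shlex
from typing import List


def s3_copy_command(source: str, destination: str,
                    storage_class: str = "STANDARD") -> List[str]:
    """Build an `aws s3 cp` command line as a list of arguments."""
    return ["aws", "s3", "cp", source, destination,
            "--storage-class", storage_class]


# Hypothetical bucket and file names, used only for illustration.
upload = s3_copy_command(
    "sample_R1.fastq.gz",
    "s3://example-genomics-bucket/raw/sample_R1.fastq.gz",
)
archive = s3_copy_command(
    "sample.bam",
    "s3://example-genomics-bucket/archive/sample.bam",
    storage_class="GLACIER",
)

# Print the commands for inspection instead of executing them.
print(shlex.join(upload))
print(shlex.join(archive))
```

Building the command as a list keeps each argument separate, so the same list could later be passed to `subprocess.run` on a machine where the AWS CLI is installed and configured with suitable IAM credentials.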

What You'll Learn from This Book

This book is intended to educate its readers in two areas: the science of genomics and the technology of Amazon Web Services. The idea is that you use the latter as a tool to explore the former.

Since it's unlikely that many readers are familiar with both genomics and AWS, this book is meant to teach you either subject—or both if you are familiar with neither—and how they work together.

How This Book Is Organized

Here is a quick introduction to each chapter in this book. You can skip directly to the parts that interest you most, or you can read from beginning to end to get a complete picture.

Chapter 1: Why Do Genome Analysis Yourself When Commercial Offerings Exist?

This chapter explains what turnkey commercial services (such as 23andMe) exist and what they are good for. It then explains what they do not do and why you might want to do your own genomics work.

Chapter 2: A Crash Course in Molecular Biology

This chapter brings you up to speed on the state of biological science as it pertains to genomics. Use this to refresh your knowledge from school or to gain a new understanding.

Chapter 3: Obtaining Your Genome

This chapter explains how to get a sample of your blood suitable for sequencing and how to send it off to a lab for conversion into raw genome data.

Chapter 4: The Bioinformatics Workflow

This chapter approaches the sequencing process from a biological point of view, explaining how one step of data processing feeds into the next to ultimately produce the results you want.

Chapter 5: AWS Services for Genome Analysis

This chapter represents a deep dive into the AWS services you can use for genomics work. If you know biology but aren't familiar with AWS, you might want to begin here.

Chapter 6: Building Your Environment in the AWS Cloud

This chapter goes into more detail about how to set up AWS services for genomics work.

Chapter 7: Linux and AWS Command-Line Basics for Genomics

All major bioinformatics tools run under Linux, so you'll need to understand that operating environment and how to get things working together within it.

Chapter 8: Processing the Sequencing Data

This chapter explains how to get from raw sequencing data to useful information about a genome.

Chapter 9: Visualizing the Genome

This chapter introduces you to tools you can use to depict genomic information graphically, enabling you to understand it better.

Chapter 10: Containerizing Your Workflow on the Desktop

This chapter explains how to use Docker containers, both locally and in AWS, to process data efficiently and scalably.

Chapter 11: Variants and Applications

This chapter explores certain aspects of genome analysis, allowing you to dig deeper into the information you have derived from your sample.

Chapter 12: Cancer Genomics

This chapter discusses the analysis of somatic mutations, specifically those found in cancerous tumors, although the workflow could also be applied to any tissue in the body.

How to Use This Book

You can approach this book by reading straight through, from beginning to end. That is probably the best way to approach it if you have knowledge of neither genomics nor AWS. Alternatively, if you feel you have a good handle on one or the other, you can focus on those chapters that cover your weak area and then study the sections on how AWS can be used to create and automate a genomics workflow.

Our Story

This book is not about us, but it's probably fair to explain to you, our reader, who we are and where we are coming from as we write this book.

We are a married couple and the parents of two children. Catherine is a working scientist and bioinformatician, attached to a medical research group in Sydney. David is an information technology and communications consultant who designs and builds AWS solutions, among other things.

A key fact is that one of our two children isn't alive anymore. Our daughter, Floriane, died of a sudden cardiac arrest in April 2015. She was nine years old at the time.

Floriane's death and its consequences are not the subject of this book. However, the way she lived and died inspired us to take our knowledge of genomics and cloud computing and look inward at our family. We wanted to know what happened to Floriane and what bearing that had on the rest of us. One way we honor her memory is by attempting to understand what happened to her. As well, we want to keep the rest of us—particularly her younger brother, Roland—safe.

Floriane was a fit and apparently healthy girl of nine. She had no ongoing medical concerns and was by all appearances developing normally. She was active and happy.

One evening, after dinner, she approached David on the sofa and asked him to chase her around the house—one of her favorite activities that they had indulged in many times before. David chased Floriane around, theatrically waving his arms. Floriane giggled excitedly and ran a few meters. David caught her and flipped her upside-down, which excited her further because she knew it led to tickling. At that point, she went completely limp.

David was confused. He thought he'd accidentally hit Floriane's head on something. Catherine, nearby, thought it was a seizure, even though Floriane had never had one previously. It quickly became clear that Floriane was in cardiac arrest.

Not having a defibrillator on hand, we called for an ambulance and did our best in the seconds before she died with manual cardiopulmonary resuscitation (CPR) that proved ineffective. Paramedics showed up with a defibrillator and adrenaline, and used both, but it was all over by then. Floriane died at home, a few minutes after the onset of her cardiac arrest.

(Again, it's not the subject of this book, but please take this opportunity to contemplate the fleetingness of life and to appreciate the people you love. We'll wait.)

And so we were left with questions: What just happened? Why did it happen? Is it going to happen to any of the rest of us? There were many more questions, far more existential in nature, and those questions remain, but those questions are subjects for other venues. Looking at the situation in a limited way, our daughter had just died, completely unexpectedly, and we wanted to know why.

Speculation on the cause of Floriane's death began almost immediately. She hadn't been visibly sick at all. She hadn't exhibited the symptoms—fever and so on—of an infection, such as meningitis. She was far too mature—nine years old, tall, and apparently fit—to have succumbed to the strange and almost random things that claim the lives of babies and are grouped under the catchall descriptor Sudden Infant Death Syndrome (SIDS). She didn't have any known allergies. She wasn't taking any medicines. She hadn't suffered an injury, hadn't eaten anything unusual, hadn't traveled anywhere dangerous—she hadn't done anything outside the realm of what is normal for a fifth-grade girl living in Australian suburbia. And yet she was no longer alive.

The emergency-room doctor offered a broad guess that turned out to be right: cardiomyopathy, or a disease of the heart muscle. Further investigation by the medical examiner showed some characteristics of hypertrophic cardiomyopathy (HCM), which is one of several kinds of cardiomyopathy, but also revealed some characteristics of arrhythmogenic right ventricular cardiomyopathy (ARVC), another disease of the heart. Genetic testing showed two interesting genes: MYH7 (associated with HCM) and RYR2 (associated with ARVC).

The medical examiner officially attributed Floriane's death to HCM but noted that her case had “unusual features”—a reference to the traces of ARVC that had been found. The stated cause of death could just as well have been ARVC, we were told.

We began to research and to think. We discovered that RYR2 is associated with disturbances to the heart rhythm due to adrenaline. Floriane had gone into cardiac arrest when she was wound up with excitement, playing with David. She had received an injection of adrenaline from the paramedics. Could the undeniable structural defects caused by HCM, which usually do prove fatal, but not until age 30 or so, have been amplified by some form of ARVC?

The question gained urgency when we learned that Floriane's MYH7 variant was a de novo mutation: she hadn't inherited it from us, and her brother didn't have it.

So began our research into our family's genomes.

Getting Under Way

Our goal in Genomics in the AWS Cloud is to make the study of your genome, or whatever genome you choose (some are available on the Internet for practice; see Chapter 4), as simple and straightforward as possible. Chapter 2 describes things you should consider before starting. As genome analysis requires computer resources that exceed those of the average desktop, we set up the analysis environment on the AWS Cloud, which is the subject of Chapters 5 through 7. This book does not presuppose any particular knowledge of molecular biology (which we introduce in Chapter 3) or computers (we talk about Linux in Chapter 8). We took particular care to explain how to make sense of genetic variants in Chapters 11 and 12.

Now, let's start the journey.

How to Contact Wiley and the Authors

Sybex strives to keep you supplied with the latest tools and information you need for your work. If you believe you have found a mistake in this book, please bring it to our attention. At John Wiley & Sons, we understand how important it is to provide our customers with accurate content, but even with our best efforts an error may occur.

In order to submit your possible errata, please email it to our Customer Service Team at [email protected] with the subject line “Possible Book Errata Submission.”

You can email David Wall with your comments or questions at [email protected].

CHAPTER 1: Why Do Genome Analysis Yourself When Commercial Offerings Exist?

As you begin to explore sequencing a genome and analyzing the results, you will no doubt become aware of a number of commercial operations that offer to do the job for you, neat and tidy, in exchange for a modest amount of money. Why would you want to go through the time and hassle involved in doing the work yourself when such convenient offerings are so easy to use? Why not engage a service to do your genetic analysis and enjoy the benefits of something that “just works”?

The answer has, essentially, two parts.

The first is that the commercial services may not provide all the information you want, not least because they are hamstrung by regulations that govern the provision of medical advice. They tend to provide “novelty” information—Larry Lightbulb kinds of things about hair color and finger length, as well as about racial, national, and tribal heritage. They are great for educating children about the sorts of information that DNA can carry and for talking about heritability of characteristics good, bad, and neutral. When you do the sequencing and analysis yourself, you can extract whatever information you want.

The second is that you are the kind of person who likes to do things yourself, either just for the satisfaction of it or because you want to understand how everything works and fits together. Working on your personal genome in Amazon Web Services (AWS) is an excellent way to learn about those services, and that knowledge can then be put to use for fun and profit.

Commercial Sequencing Services

As you survey the Web, you will find that there are several popular consumer-grade genome sequencing services, including 23andMe, Ancestry, and MyHeritage.

Others include the following:

African Ancestry: This service is marketed to Africans and people of African heritage. Some users have regretted that the service does not show DNA broken down by origin in the various regions of the African continent.

Athletigen: This service focuses on markers related to physical fitness, such as those affecting endurance and speed of recovery after exertion.

DNAFit: This service provides diet plans designed around certain nutrition-related markers.

Fitness Genes: This service offers training regimes that fit genomic markers related to physical exertion.

GEDmatch: This service focuses on genealogy and links with others who have submitted their sequences.

Genome Link: This service provides information on a series of characteristics, such as physical endurance and skin color.

Genopalate: Focused on nutrition, this service aims to help its customers optimize their diets.

Living DNA: This is an ancestry-focused service with a user community primarily from the United Kingdom and Ireland.

MyHeritageDNA: Ancestry focused, this service connects its users with possible relatives and suggests possible genetic risks to health.

Nebula Genomics: Offering a monthly subscription that entitles its users to monthly updates as new information becomes available, this service includes data on the oral microbiome (i.e., the bacteria found in your saliva).

Promethease: This is a modestly priced service that detects a number of single nucleotide polymorphisms (SNPs, or “snips”).

Sano Genetics: This free service concentrates on SNPs related to autism and mathematical reasoning.

SelfDecode: This is a general-purpose detector of several thousand genetic markers.

Vitagene: Focused on fitness and athletic performance, this service includes ancestry information as well.

Xcode Life: This service offers several low-cost specialty tests including one on skin care and another on metabolic diseases.

Typical Results

Of the aforementioned direct-to-consumer genetic testing services, perhaps the best known is 23andMe, a California company that pioneered the industry in 2007. When someone places an order with 23andMe, the company sends out a kit containing materials needed for the collection of saliva, which is then sent back for analysis. (The idea is that everyone's saliva contains cells that have been shed from the interior of the mouth.) The company presents its report to the customer via its website.

The company ran afoul of the U.S. Food and Drug Administration (FDA) in 2013, when the regulator objected to 23andMe (and other genetics services providers) advertising that its service provided its customers with information on their susceptibility to various genetically linked conditions, such as male-pattern baldness and certain kinds of cancer. This, the FDA said, constituted medical advice of the sort that should be formulated and delivered by a qualified doctor. The 23andMe tests were medical devices and should be regulated as such.

After going quiet for several years, 23andMe applied to the FDA for permission to include information in its reports about a number of mutations and alleles that are well-understood to be associated with pathogenic conditions, including Alzheimer's disease, Parkinson's disease, celiac disease, and a number of BRCA1 and BRCA2 mutations associated with breast cancer. The company argued—and the FDA ultimately agreed—that the test methods used by 23andMe were sufficiently reliable and understood as to not require the involvement of a medical professional. As well, the entities agreed that the relationships between the tested sequences and the various diseases were adequately proven, and that if an individual was found to have a sequence known to be pathogenic, there was no need to hide the truth behind the medical establishment.

With the shift toward presentation of information about ancestry rather than medical conditions, direct-to-consumer online genetic analysis services have, perhaps predictably, begun to appeal to those for whom ancestry is of more than passing interest.

So, what's in a set of 23andMe reports? If you undergo the saliva test and log into the 23andMe website today, you will get a lot of novelty information about probable hair color, the shapes of certain body parts, and aspects of aging.

Sometimes, 23andMe gets things right. For example, 23andMe predicted the following for me:

A 67 percent probability of little to no back hair. (Correct!)

A 32 percent probability of a bald spot on the top rear of the head. (Correct, pretty much. It's kind of thinning there, but certainly not bald. Certainly not.)

A 74 percent chance that the earlobes are separate from the sides of the face. (Hear, hear!)

A 71 percent chance that the ring fingers are longer than the index fingers. (Yep.)

A 1 percent chance of red hair. (It's brown.)

A 62 percent chance that dandruff is sometimes a problem. (It is.)

A 25 percent chance of being afraid of heights. (I am a qualified pilot.)

Sometimes, it gets things wrong. In my case, it forecast the following:

A 51 percent chance of blue eye color. (They're green.)

A 66 percent chance of wavy hair. (It's straight.)

A 33 percent chance of a widow's peak across the front of the scalp. (There is definitely a prominent one.)

A 4 percent chance of bunions in the feet. (A substantial one on the right foot has been causing trouble since high school.)

From there, the 23andMe report can get a little comical. For example, my report states that I am likely to wake up at precisely 7:34 a.m. Apparently, the company uses statistical analysis of surveys conducted on people with certain similar genetic markers (those associated with early or late rising) to arrive at the precise time. The suggestion that genetics determine your wake-up time to the minute is a manufactured novelty “fact,” and it's not correct: I almost always get out of bed before 6 a.m. Hilariously, my wife Catherine's 23andMe report has her waking up significantly earlier than I do, which has happened exactly zero times.

The company provides an ancestry report, which attempts to describe which part of the world your forebears came from.

Some background here: my wife and I live in Australia, to which we both immigrated (Catherine from France and me from the United States) around the turn of the 21st century. We are both now people of the New World, which is largely characterized by its relatively recent immigrants.

In my case, the ancestry timeline (shown in Figure 1.1) appears to agree with much of what I know about my family history. It shows ancestors from Scandinavia as recently as 1940. Family lore states that my maternal grandmother was born in Sweden and was brought to Chicago as a baby in the 1920s, so that makes sense. Similarly, my family records have my maternal great-great-grandfather emigrating from Hesse, in the western part of what had recently become a unified German Empire, in 1879. That fits with 23andMe's report of French and German ancestry in the late 19th century.

Figure 1.1: My ancestry timeline

My paternal side is less well documented, probably because those ancestors have been in the United States for several generations. The surname Wall fits with heritage from the British Isles, though, and there was always vague talk about my paternal grandmother having ancestors from the eternally contested regions of Central and Eastern Europe.

The report includes Cypriot heritage in the 18th century. I have no idea what that is about, having never heard anything about ancestors on an island of the Ottoman Empire in the Eastern Mediterranean. The 1700s are the distant past for us, as far removed in memory as the original ancestors that first distinguished themselves from the other apes.

Conclusion: no one really knows, but it kind of fits.

As for Catherine, she understands her heritage to be French for many generations back, as far as anyone has been able to trace genealogy. I get scolded when I suggest that the family tree may be particularly branchless. The report from 23andMe (shown in Figure 1.2) concurs, sort of. It shows French heritage back into the 19th century, as well as British and Irish ancestors during the same time period. This latter characteristic may be explained by Catherine's maternal antecedents being from Brittany, a maritime region in the extreme west of France. The language and culture there are Celtic and distinct from those of mainstream France—a distinction that was even more pronounced in the past. The Bretons have a lot in common with other Celtic people, including the Irish, Welsh, Manx, and Cornish. A number of her ancestors were sailors, which probably contributed to the genetic intermixing.

It's all interesting in a bourgeois sort of way, in which people living comfortably enough to spend a hundred bucks on a saliva test can look back at their ancestors, whose lives were foreign (in time, if not always location) and therefore conventionally romantic and interesting. In the case of people who find that their ancestors migrated from place to place, it becomes possible to conclude that the lives of the ancestors were unstable or otherwise crappy enough to motivate them to relocate, which allows the patrons of 23andMe to burnish their claims to modest origins. It's all good fun and harmless enough, facilitating European stuff on North American Christmas trees and a bottomless market for generic Irish music among Boomers in the former colonies.

Figure 1.2: My wife's ancestry timeline

There's a dark side to genetic ancestry services like that of 23andMe, though. Some people assign extreme importance to their personal ancestry and believe that certain types of genetic heritage are inherently better than others. A scan of white supremacist websites, for example, will reveal screen shots of ancestry reports from 23andMe and similar services, posted by people attempting to fit in with the group. It's simultaneously risible and sad—or would be so, if it weren't for the undercurrent of violence—to see genetic profiling used in this way. Genetic testing is like any other power tool, though, capable of being used for ill as well as for good.

Perhaps the best part of 23andMe is the fact that it makes available what it calls raw data, which is a text file of 7–10 MB containing information about base pair sequences (see Figure 1.3).

You can take the 23andMe raw data file and use it as input to many of the other services listed earlier in this chapter.

Figure 1.3: 23andMe raw data
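A raw data file of this kind can be read with a few lines of Python. The sketch below assumes the commonly seen layout of such exports: header lines starting with #, followed by tab-separated columns for marker ID (rsid), chromosome, position, and genotype, with "--" marking a position that could not be read. The sample rows and the function name parse_raw_data are invented for illustration; check the header of your own file before relying on this.

```python
import io
from collections import Counter

# Hypothetical sample in the assumed layout; rsids and genotypes are invented.
SAMPLE = (
    "# rsid\tchromosome\tposition\tgenotype\n"
    "rs0000001\t1\t1000\tAA\n"
    "rs0000002\t1\t2000\tAG\n"
    "rs0000003\tX\t3000\t--\n"
)

def parse_raw_data(handle):
    """Yield (rsid, chromosome, position, genotype) tuples from a raw data file."""
    for line in handle:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and header comments
        rsid, chromosome, position, genotype = line.split("\t")
        yield rsid, chromosome, int(position), genotype

records = list(parse_raw_data(io.StringIO(SAMPLE)))

# Count markers per chromosome and list markers without a genotype call.
calls_per_chromosome = Counter(chrom for _, chrom, _, _ in records)
no_calls = [rsid for rsid, _, _, genotype in records if genotype == "--"]
```

To use it on a real export, replace io.StringIO(SAMPLE) with an open file handle, for example open("my_raw_data.txt"), where the file name is whatever you saved your download as.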

Summary

In this chapter, we considered the fact that there are a number of commercial genome-analysis services available for use. Most of them ask you to send in some genetic material (usually in the form of cells suspended in saliva) and then test for various markers. The services then present their customers with a report, which usually focuses on ancestry and certain physical characteristics (such as probable hair color and ability to perceive certain flavors) rather than serious medical information. This is due to regulations that govern who may offer medical advice, and how. It is one of the main reasons to pursue your own genome sequencing work on the AWS cloud: so that you have the data and tools you need to answer all the medical questions that interest you.

CHAPTER 2: A Crash Course in Molecular Biology

To properly analyze genomes, you need to understand some key biology concepts. Biology is a vast field, and this chapter focuses on areas that are fundamental for extracting meaningful information from genomes.

DNA

As you are probably aware, your genome encodes all the information necessary to build and maintain your body. It is composed of a series of about three billion letters, which can be either A, G, C, or T, and is stored inside the nucleus of each of your three trillion nucleated cells. Red blood cells do not have a nucleus and therefore are your only cells without a copy of your genome.

Note that all living organisms (such as animals, plants, fungi, or bacteria) also have a genome, composed of the same letters A, G, C, and T. This book focuses on the analysis of the human genome, but all concepts and techniques are applicable to other species. You can apply them to the analysis of the genome of your cat, goldfish, potted plant, or the slime mold growing on your front lawn. In genomics, it is sobering to see that humans are just one species among many others.

Physically speaking, your genome is encoded into chains of repeating subunits called nucleotides. Each nucleotide consists of a sugar group, a phosphate group, and a base that can be either adenine (A), guanine (G), cytosine (C), or thymine (T). The sugar group is called deoxyribose; hence, these long molecules are named deoxyribonucleic acid (DNA). A chain of nucleotides is referred to as a DNA strand. To make this chemical structure more stable against mutations, your genome is actually stored as two complementary DNA strands. The second strand does not provide any more information and is simply a copy of the first strand, where A has been replaced by T, G by C, T by A, and C by G. The two DNA strands bind via weak chemical bonds called hydrogen bonds (the bases A pair with T, the bases G pair with C), coil around each other, and form the elegant and well-known DNA helix structure.

DNA is a convenient and compact way of storing information. It has already been used to store small, compressed MPEG movies and data of up to 200 MB.
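To get a feel for why DNA is compact, note that four possible bases mean each base can carry two bits of information. The toy sketch below packs bytes into bases and back. The mapping (A=00, C=01, G=10, T=11) and the function names are assumptions chosen for illustration; this is not the encoding used in real DNA-storage experiments, which need additional safeguards such as error correction.

```python
BASES = "ACGT"  # assumed mapping: A=00, C=01, G=10, T=11

def bytes_to_dna(data: bytes) -> str:
    """Encode each byte as four bases, most significant bit pair first."""
    return "".join(
        BASES[(byte >> shift) & 0b11]
        for byte in data
        for shift in (6, 4, 2, 0)
    )

def dna_to_bytes(sequence: str) -> bytes:
    """Decode a sequence produced by bytes_to_dna back into bytes."""
    if len(sequence) % 4:
        raise ValueError("sequence length must be a multiple of 4")
    out = bytearray()
    for i in range(0, len(sequence), 4):
        value = 0
        for base in sequence[i:i + 4]:
            value = (value << 2) | BASES.index(base)
        out.append(value)
    return bytes(out)

encoded = bytes_to_dna(b"Hi")   # "CAGACGGC"
decoded = dna_to_bytes(encoded)  # b"Hi"
```

At two bits per base, the roughly three billion letters of a human genome correspond to about 750 MB of raw information.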

In living organisms, the long DNA molecule is broken into segments called chromosomes. Species have various numbers of chromosomes: the fruit fly has 8 chromosomes, spinach and slime mold have 12, humans have 46, dogs and chickens have 78, and carp have 100. Bacteria stand apart as they have a single chromosome of circular shape, which is free floating as bacteria do not have a nucleus. The 46 human chromosomes are divided into two sets of 23 chromosomes, with one set inherited from each parent. In each set, 22 chromosomes are numbered from 1 to 22, and the last one is a sex chromosome, X or Y; together, the two sex chromosomes determine our sex.

Not to be forgotten, each of our cells also contains between 80 and 2,000 small oval structures called mitochondria that produce the energy required for the cell. Mitochondria are thought to be bacteria that were internalized by the cells of our remote primitive ancestors some 1.45 billion years ago. Each of our mitochondria has its own DNA, which has a circular shape (as in bacteria), unlike our linear chromosomes. We inherit our mitochondrial DNA exclusively from our mother. This allows genealogical researchers to trace our maternal lineage back many generations.

The total length of our genome, if we put our chromosomes end to end, would be nearly 2 meters. To fit into the tiny nucleus of each of our cells, our DNA is first wound around proteins called histones and then tightly folded and coiled. This packing complex of DNA and histone is called chromatin. When the chromatin is tightly packed (also called condensed), the DNA cannot be accessed. When a region of DNA needs to be read, the chromatin is unpacked and is said to be open.
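The 2-meter figure can be checked with simple arithmetic. The sketch below assumes a helix length of about 0.34 nanometers per base pair (a commonly quoted value for the standard double helix that is not stated in this chapter) and counts both sets of chromosomes in a cell.

```python
BASE_PAIRS_PER_SET = 3_000_000_000  # about three billion letters per set of chromosomes
SETS_PER_CELL = 2                   # one set inherited from each parent
RISE_PER_BASE_PAIR_M = 0.34e-9      # assumed helix length per base pair, in meters

total_length_m = BASE_PAIRS_PER_SET * SETS_PER_CELL * RISE_PER_BASE_PAIR_M
print(f"Approximate DNA length per cell: {total_length_m:.2f} m")
```

Under these assumptions the result is a little over 2 meters, in line with the figure quoted above.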

Now, let's look at some chemical nuts and bolts to clarify an important concept that is sometimes confusing when interpreting genomics databases. DNA strands have an orientation: at one end of the strand, a phosphate group is attached to the fifth carbon atom of the sugar, whereas at the other end, a free hydroxyl group is attached to the third carbon. The two ends of a DNA strand are therefore nicknamed 5' and 3'. By convention, DNA sequences are represented from 5' to 3'. Because the second, complementary strand does not carry any additional information, usually only the first strand (called the coding strand, Plus strand, or + strand) is represented, whereas the complementary strand (Minus strand or – strand) is omitted (see Figure 2.1). Note that the Minus strand, read from 5' to 3' as per the convention, is the reverse complement of the Plus strand. Each paired group of letters of the double-stranded DNA molecule is referred to as a base pair (bp). So, the sequence shown in Figure 2.1 consists of 25 bp. Confusingly, although genomics databases usually describe DNA mutations or variants on the Plus strand, they sometimes describe them on the Minus strand. In that case, it is necessary to transform the nucleotide information into its complement (A to T, G to C, and vice versa).

Figure 2.1: DNA molecule representation
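Converting between the Plus and Minus strands is a mechanical operation. A minimal sketch follows; the function names and the example sequence are made up for illustration, and the code handles only the four unambiguous bases.

```python
# Translation table implementing the pairing rules: A<->T and G<->C.
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def complement(sequence: str) -> str:
    """Return the base-by-base complement, in the same left-to-right order."""
    return sequence.upper().translate(COMPLEMENT)

def reverse_complement(sequence: str) -> str:
    """Return the opposite strand, read 5' to 3' as per the convention."""
    return complement(sequence)[::-1]

plus_strand = "ATGGCC"
minus_strand = reverse_complement(plus_strand)  # "GGCCAT"
```

For a single-base variant only the complement matters: a change from A to G reported on the Minus strand corresponds to a change from T to C on the Plus strand.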

Crucially, as the principle of life revolves around replicating itself, DNA can easily be replicated via convenient cell enzymes called DNA polymerases. This enables cells to receive an identical copy of the genome after each cell division. In practice, the chromatin of the chromosomes is decondensed, the two strands of their double helix are unwound, and the DNA polymerase synthesizes a new complementary DNA strand for each of the two initial strands. At the end of the process, the DNA is fully duplicated into identical double helices. The chromosomes then align in the center of the cell. Fibers attach to the center of each chromosome (called the centromere) and pull one copy of each chromosome to opposite poles of the cell. This process, by which two new nuclei are formed with a view to creating two identical daughter cells, is called mitosis. Finally, the cell itself divides into two cells, each with an identical genome.

How long does it take for a cell to replicate? Bacteria have very small genomes that can be replicated quickly, as well as short growth phases; in consequence, bacteria can divide in as little as 15 minutes. In contrast, fast-growing human cells take about 24 hours to divide. The cell starts with a growth phase (called G1) of about 11 hours, where the cell secretes enzymes and nutrients that will be required for the cell division. The cell then undergoes a phase of DNA synthesis (called S phase) to duplicate the genome, which takes about 8 hours. It is followed by a second growth phase (called G2) of about 4 hours. Finally, the nucleus division (mitosis) and cell division can occur. This last phase, called M phase, is the most visual and spectacular and takes only one hour.

The chromosomes are not usually visible under the microscope, as they are all bundled together in the nucleus. During mitosis, the chromosomes are condensed and neatly align in the center of the cell. It is during this phase that the chromosomes can be seen and photographed under a light microscope when stained with a dye. This technique is called karyotyping. Figure 2.2
