Perform genome analysis and sequencing of data with Amazon Web Services
Genomics in the AWS Cloud: Analyzing Genetic Code Using Amazon Web Services enables a person who has moderate familiarity with the AWS Cloud to perform full genome analysis and research. Using the information in this book, you'll be able to take a FASTQ file containing raw data from a lab or a BAM file from a service provider and perform genome analysis on it. You'll also be able to identify potentially pathogenic gene sequences.
* Get an introduction to Whole Genome Sequencing (WGS)
* Make sense of WGS on AWS
* Master AWS services for genome analysis
A key advantage of using AWS for genomic analysis is that researchers can draw on a wide choice of compute services that can process diverse datasets in analysis pipelines. Genomic sequencers that generate raw data files are located in on-premises labs, and AWS provides solutions that make it easy for customers to transfer these files to AWS reliably and securely. Storing genomics and medical (e.g., imaging) data at different stages requires enormous storage capacity at a reasonable cost. Amazon Simple Storage Service (Amazon S3), Amazon Glacier, and Amazon Elastic Block Store (Amazon EBS) provide the necessary solutions to securely store, manage, and scale genomic file storage. Moreover, these storage services can interface with various AWS compute services to process the files.
Whether you're just getting started or have already been analyzing genomics data using the AWS Cloud, this book provides you with the information you need in order to use AWS services and features in the ways that will make the most sense for your genomic research.
Page count: 454
Year of publication: 2023
Cover
Title Page
Introduction
Who Should Read This Book
Genomics
Cloud Computing and AWS
What You'll Learn from This Book
Our Story
Getting Under Way
CHAPTER 1: Why Do Genome Analysis Yourself When Commercial Offerings Exist?
Commercial Sequencing Services
Typical Results
Summary
CHAPTER 2: A Crash Course in Molecular Biology
DNA
DNA at Work: RNA and Proteins
Inheritance
Summary
CHAPTER 3: Obtaining Your Genome
Preparing to Have Your Genome Sequenced
Specifying Lab Work
Engaging a Laboratory
Getting a Tissue Sample for DNA Extraction
Shipping the Sample
Receiving the Results
Summary
CHAPTER 4: The Bioinformatics Workflow
Extraction of DNA
FASTA Files
FASTQ Files
Alignment to a Reference Genome
Reference Genomes
Quality Control
Trimming
The Alignment Process
Marking Duplicates
Recalibrating Base Quality Score
Calling SNVs and Indel Variants
Annotating SNVs and Indel Variants
Prioritizing Variants
Inheritance Analysis
Identifying SVs and CNVs
Bioinformatics Workflow
Summary
CHAPTER 5: AWS Services for Genome Analysis
General Concepts
Custom Environments
Summary
CHAPTER 6: Building Your Environment in the AWS Cloud
Setting Up a Virtual Private Cloud
Setting Up and Launching an EC2 Instance
Setting Up S3 Buckets
Configuring Your Account Securely
Creating Groups
Creating Users
Setting Up Your Client Environment
Summary
CHAPTER 7: Linux and AWS Command-Line Basics for Genomics
Selecting a Linux Distribution
Accessing Your AWS Linux Instance from Your Local Computer
Getting Familiar with the Command Line
Transferring Files to and from Your AWS Instance
Running Programs in the Background
Understanding File Permissions
Compressing and Archiving Files
Managing Linux
The AWS Command-Line Interface
AWS CLI Essentials
An Alternative Approach: AWS Systems Manager
Summary
CHAPTER 8: Processing the Sequencing Data
Getting from Data to Information
Setting Up AWS Services and Data Storage
Summary
CHAPTER 9: Visualizing the Genome
Introducing Genome Visualizers
Installing the IGV Desktop Visualizer
Analyzing Variants in IGV
Summary
CHAPTER 10: Containerizing Your Workflow on the Desktop
Introducing Containerization
Understanding and Using Docker
Summary
CHAPTER 11: Variants and Applications
Polygenic Risk Scores
Metagenomics
AlphaFold
Summary
CHAPTER 12: Cancer Genomics
Somatic Genomes
Cancer
The Promise and Reality of Cancer Precision Medicine
Samples
Somatic Variant Analysis
Copy Number Changes
Measuring Tumor Genomic Instability
Summary
Notes
Index
Copyright
Dedication
Acknowledgments
About the Authors
End User License Agreement
Chapter 7
Table 7.1: Main Linux Server Distributions
Table 7.2: Linux Permissions, Octal Code, and Interpretation
Table 7.3: Redirection Operators
Chapter 9
Table 9.1: Setting hg38 Options in the IGV Browser
Table 9.2: Setting hg19 Options in the IGV Browser
Table 9.3: Mapping Quality (MAPQ) Scores and Their Corresponding Probabiliti...
Chapter 11
Table 11.1: Example of PGS Catalog Scoring File
Chapter 1
Figure 1.1: My ancestry timeline
Figure 1.2: My wife's ancestry timeline
Figure 1.3: 23andMe raw data
Chapter 2
Figure 2.1: DNA molecule representation
Figure 2.2: Human karyotype
Figure 2.3: Transcription and translation
Figure 2.4: The genetic code
Figure 2.5: The protein alcohol dehydrogenase
Figure 2.6: Family pedigree
Chapter 3
Figure 3.1: The four raw data files resulting from sequencing cellular mater...
Figure 3.2: A graph showing how many times each base pair in a sequence was ...
Chapter 4
Figure 4.1: Phred scores
Figure 4.2: Alignment of paired reads to the reference genome
Figure 4.3: Bioinformatics workflow
Chapter 5
Figure 5.1: Selecting a VPC configuration
Figure 5.2: Configuring the VPC with a single public subnet
Figure 5.3: Confirming the VPC creation
Chapter 6
Figure 6.1: The main console page, where you should search for VPC
Figure 6.2: Existing VPCs and other network resources
Figure 6.3: The initial step of the VPC wizard
Figure 6.4: Specifying network characteristics of the new VPC in the VPC Wiz...
Figure 6.5: The second step of the VPC wizard, with VPC name and address blo...
Figure 6.6: VPC creation success
Figure 6.7: The newly created VPC among other VPCs
Figure 6.8: The main EC2 status page
Figure 6.9: Selecting an image for the EC2 instance
Figure 6.10: Choosing an AMI image to use as a base for the new server
Figure 6.11: Specifying the virtual hardware for the EC2 instance
Figure 6.12: Attaching the new EC2 instance to the VPC created earlier
Figure 6.13: Adding storage
Figure 6.14: Adding tags
Figure 6.15: Configuring a security group
Figure 6.16: Reviewing the instance launch details
Figure 6.17: An EC2 instance starting up, showing in Pending state.
Figure 6.18: Stopping an instance to save running costs
Figure 6.19: A list of all existing buckets
Figure 6.20: Naming a new bucket
Figure 6.21: Choosing bucket options
Figure 6.22: Declining the option to make the new bucket public
Figure 6.23: The new bucket in the list of all buckets
Figure 6.24: The main IAM page
Figure 6.25: Activating multifactor authentication
Figure 6.26: Choosing an MFA device
Figure 6.27: Getting a QR code for scanning
Figure 6.28: IAM wants you to set up a password policy for enhanced security...
Figure 6.29: Setting a password policy
Figure 6.30: Specifying characteristics of the password policy
Figure 6.31: A selection of password policy characteristics
Figure 6.32: Naming your group
Figure 6.33: Assigning the standard PowerUserAccess policy to the new Genomi...
Figure 6.34: Creating two new users
Figure 6.35: Assigning the new users to a group
Figure 6.36: Two new users, successfully created
Figure 6.37: Connecting to an EC2 instance via macOS Terminal
Figure 6.38: Creating a .ppk file with PuTTY Key Generator
Figure 6.39: Adding a .ppk file to PuTTY
Figure 6.40: Specifying an EC2 instance in PuTTY
Figure 6.41: A successful PuTTY connection
Figure 6.42: Attaching an S3 location to Windows
Chapter 7
Figure 7.1: Configuration window of PuTTY
Figure 7.2: Configuration of PuTTY with support for an X server
Figure 7.3: Ubuntu desktop running as a virtual machine (with VMware) on Win...
Figure 7.4: Configuration of an Ubuntu Virtual machine with VMware on Window...
Figure 7.5: FTP client interface (FileZilla) that allows the transfer of fil...
Figure 7.6: Linux file permissions
Chapter 8
Figure 8.1: Signing into the AWS Console
Figure 8.2: Creating an EFS storage volume
Figure 8.3: Adding a file system
Figure 8.4: Assigning a name
Figure 8.5: Successful creation of a storage volume
Figure 8.6: Getting the name of the security group
Figure 8.7: The AWS EC2 administration console
Figure 8.8: Launching EC2 instances
Figure 8.9: Choosing an operating system
Figure 8.10: Selecting an EC2 instance type
Figure 8.11: EC2 default characteristics
Figure 8.12: Adding storage to an EC2 instance
Figure 8.13: Skipping ahead without creating tags
Figure 8.14: Reviewing the characteristics of a new instance
Figure 8.15: Launching a new EC2 instance
Figure 8.16: Getting the security key pair
Figure 8.17: Launching a new instance
Figure 8.18: A running EC2 instance
Figure 8.19: Clicking View Instances
Figure 8.20: Seeing that an EC2 instance is in the Running state
Figure 8.21: Stopping an EC2 instance
Figure 8.22: Confirming EC2 stop
Figure 8.23: Seeing that an EC2 instance is in the Stopped state
Figure 8.24: Restarting an EC2 instance
Figure 8.25: Noting the IP address of an EC2 instance
Figure 8.26: Adding a storage volume to an EC2 instance
Figure 8.27: Creating a Security Group
Figure 8.28: Confirming the characteristics of the security group
Figure 8.29: A newly created security group
Figure 8.30: Selecting the EFS mount security group
Figure 8.31: Editing inbound security group rules
Figure 8.32: Adding a rule
Figure 8.33: Noting the creation of the new rule
Figure 8.34: Selecting EFS storage
Figure 8.35: Attaching an EFS volume
Figure 8.36: Information about the mounted volume
Figure 8.37: Specifying a host in PuTTY
Figure 8.38: Selecting a PPK key file in PuTTY
Figure 8.39: A successful connection to a remote EC2 instance, via PuTTY
Figure 8.40: Observing the contents of an EFS volume
Figure 8.41: Changing the type of an EC2 instance
Figure 8.42: Applying the type change
Figure 8.43: Restarting the modified EC2 instance
Chapter 9
Figure 9.1: The UCSC Genome Browser displaying the tumor necrosis factor gen...
Figure 9.2: The third exon of the TNF gene
Figure 9.3: The IGV desktop main page
Figure 9.4: The appearance of the IGV browser on initial startup
Figure 9.5: Adding a user in an AWS account
Figure 9.6: Granting S3 storage permissions to the newly created user
Figure 9.7: AWS reporting that the new user has been set up correctly
Figure 9.8: Opening a data file in the IGV browser
Figure 9.9: Choosing a genome reference
Figure 9.10: Specifying the hg38 human genome as a reference
Figure 9.11: Loading a BAM file into the IGV browser
Figure 9.12: Setting up the UCSC Table Browser to download functional elemen...
Figure 9.13: The UCSC Genome Browser showing the ACE2 gene
Figure 9.14: User settings and preferences in the UCSC Genome Browser
Figure 9.15: Resetting everything to default values
Figure 9.16: Choosing a publicly available BAM file representing a populatio...
Figure 9.17: Navigating a genome in the UCSC Genome Browser
Figure 9.18: UCSC Genome Browser indicating that a particular DNA fragment w...
Figure 9.19: An indication that the aligner is having trouble mapping a sequ...
Figure 9.20: IGV indicating an exon
Figure 9.21: IGV indicating an inserted sequence
Chapter 10
Figure 10.1: Virtual machines each have their own operating systems.
Figure 10.2: Containers have their own configurations on top of the host's o...
Figure 10.3: A Docker image can spawn many (potentially slightly different) ...
Figure 10.4: Docker showing a running container
Figure 10.5: Docker showing a newly added container in operation
Figure 10.6: A Docker image for running the MySQL database
Figure 10.7: The Broad Institute's Genomes in the Cloud image
Figure 10.8: Downloading the GITC image
Chapter 11
Figure 11.1: A view of an uploaded protein 3D structure model
Figure 11.2: The sequence of PDB structure 2BBO with residue Phe508 selected...
Figure 11.3: A comparison of the structures of wild-type (PDB 2BBO shown lig...
Figure 11.4: A comparison of the structures of wild-type (PDB 2BBO, lighter)...
Figure 11.5: The surface of wild-type CFTR in the vicinity of Phe508
Figure 11.6: The surface of Phe508Del mutant CFTR in the vicinity of where P...
Chapter 12
Figure 12.1: Revised hallmarks of cancer (adapted from Hallmarks of Cancer: ...
Figure 12.2: RAS/RAF/MEK/ERK/PI3K signaling pathways and associated cancer d...
Figure 12.3: TCGA GDC Data Portal, the most frequently mutated Cancer Census...
Figure 12.4: Plotting of read depth using CNVpytor. The squared dark line in...
Catherine Vacher
David Wall
Welcome to Genomics in the AWS Cloud!
From its title, you can conclude that this book is about two things: genomics (the science of sequencing and interpreting genetic data) and Amazon Web Services (one of the three big hosted computing platforms). Genomics in the AWS Cloud, therefore, is meant to appeal either to people from a biology background who want to learn how to do genomics work with AWS or to people with a computer background who want to find out how to apply their skills to genomics.
Both of these areas, genomics and cloud computing, are evolving constantly, and practically no one can claim to be completely au fait with either. This book, therefore, aims at not one but two separate moving targets. Our goal as authors is not to teach you everything there is to know about AWS and genomics—or even about the intersection of the two fields—but rather to show you the following:
Enough of the general concepts of cloud computing and genomics that you understand the problems to be solved and the technologies available to work on those problems
Enough specifics to enable you to work through actual genomics tasks and see results
This book is intended for people who aren't content to use commercial genome sequencing services and want to do their own analysis. We walk you through the process of getting raw data from a blood sample via a lab and then using the AWS services to analyze it—learning which genes are present in the sample and what they might say about you and your health. This will enable you to investigate aspects of your genome that commercial services don't explore because they are not allowed to give medical advice.
This book is also suited to people who want to learn about the AWS Cloud and want to structure their study around a useful field: genomics.
At the core of genomics is genome sequencing, which is the process of taking some biological material, such as blood or tissue, and converting it to pure information. This is a complicated process that combines the traditional work of a biologist (which is to say, manipulating actual cells in a “wet lab” environment) with information technology. Cells go into the process; a computer-readable data file comes out.
Genome sequencing took a long time to figure out. Crude, expensive methods were first employed in the 1970s and 1980s. More automated methods became available in the late 1990s, and these enabled the sequencing of relatively simple organisms: yeasts, bacteria, and a nearly microscopic nematode worm (Caenorhabditis elegans, long popular in biology labs as an experimental subject). Uncomplicated plants (notably Arabidopsis thaliana—a European weed with a particularly small genome) and modest insects (Drosophila melanogaster—the fruit fly), both longtime standard experimental subjects, soon followed around the turn of the century.
Two biologists described the first draft human genome sequence in an article in the journal Science in 2001. Scientists have worked to refine the human genome sequence since then and also have worked to sequence the genomes of thousands of other organisms.
Key to their work has been a continuous drop in the cost of full-genome sequencing. The first human genome sequence in 2001 cost roughly 2.7 billion U.S. dollars to produce; it required funding of the sort only national governments could provide. Within less than a decade, by 2005, the cost had fallen by more than three orders of magnitude to something like $1 million, which is still quite a lot. At this writing, in 2022, it is possible to have a human genome sequenced for less than the cost of a high-end smartphone, and plenty of companies are attracting funding for their plans to bring the cost to less than $100. By the time the asteroid 99942 Apophis makes its closest approach to Earth in 2029, sequencing a full genome will almost certainly cost about the same as the simplest routine medical blood test costs now. The cost of knowing everything about your genetic makeup will be trivial (assuming Apophis doesn't render this unimportant, which the latest reports assure us it won't).
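As a rough check on the scale of that price drop, the short Python sketch below computes how many orders of magnitude separate the two dollar figures quoted above. The function name is ours and purely illustrative, and the inputs are the approximate figures from the text, not precise accounting.

```python
import math


def orders_of_magnitude(before: float, after: float) -> float:
    """Return the base-10 logarithm of the ratio before/after."""
    return math.log10(before / after)


# Approximate figures quoted in the text (U.S. dollars).
first_genome_cost = 2.7e9  # roughly 2.7 billion
later_cost = 1e6           # roughly 1 million

drop = orders_of_magnitude(first_genome_cost, later_cost)
print(f"{drop:.2f} orders of magnitude")  # prints "3.43 orders of magnitude"
```

Applying the same function to a hypothetical $100 genome gives a drop of a little over seven orders of magnitude from the original figure.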
The dramatic fall of the price of genome sequencing, from billions of dollars at the turn of the twenty-first century to a few hundred dollars today, makes it possible for almost all of us to explore our genetic makeup. While we all, as humans, share practically all of our genetic code (upwards of 99.9 percent), the differences make all the difference.
The tiny fraction of our individual genomes that differs from other humans is what accounts for whether we are male or female, all of our physical characteristics, many of our personality traits, and our propensity to health or various kinds of disease.
The availability of low-cost genome sequencing has revolutionized medical and pharmaceutical research and has started to change the practice of medicine. It also enables us to start to understand the building blocks of life and how much, or rather how little, we differ from other life forms.
Genomics in the AWS Cloud is about discovering and studying those differences and learning from them. But there is another part to the equation, which is to say the other set of tools and techniques identified in the title.
Almost in parallel with the advances in genomics that took place between 1995 and the present day, so-called cloud computing evolved enormously and today represents a standard way of designing, deploying, and operating information processing systems.
Now, the idea of computing resources that are not local to the people who need them is not new at all. The earliest commercial and scientific computers were, of course, mainframes that were shared across many users—and more than a few of these remain in place today. Servers in organizational or co-location data centers, providing storage and computing resources to privileged users and the general public, have long been part of information technology. In the case of mainframes and client-server systems, users access remote computer systems (often not knowing or caring where they actually are). Functionally, that's cloud computing, and it's not a new thing.
What is new is the ease with which modern cloud computing platforms allow rapid construction and cost-effective use of complex and powerful systems. You can quickly set up elaborate workflows, test them with minimal computing power, and then scale them up enormously when it's time for a production run. More or less, you pay only for the computing power you use, and there are ways to schedule the use of processor cycles for times of low demand, when computing is cheaper. With the exception of storage and data transfer—meaning machine images that can be turned into working compute resources, as well as input and output data—systems configured in the cloud can cost practically nothing when they are not doing useful work. Such efficient use of expensive resources isn't possible with on-premises or traditionally hosted solutions.
There are three main players in the cloud-computing industry:
Amazon Web Services (AWS)
Google Cloud Platform (GCP)
Microsoft Azure
Each of them has its points of strength and weakness, and those relative merits are beyond the scope of this publication. Most organizations mix and match parts of each anyway to take advantage of relative technical superiorities and to maintain leverage with vendors. Suffice it to say that we chose to do our genomics work on AWS.
Considering that genome analysis is essentially a workflow to be carried out on storage and computing resources, AWS is well suited to the job. Here are some of the tools we will use:
Elastic Compute Cloud (EC2) for building and running the Linux servers that actually run the software required for genome analysis
Elastic Block Store (EBS) for maintaining updated disk images, ready to attach to a working machine when needed
Simple Storage Service (S3) for storing input and output files when ready access to them is needed
Simple Workflow Service (SWF) for automating processes
Glacier when files need to be archived at low cost
Identity and Access Management (IAM) for maintaining security and appropriate user privileges
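To make the storage part of that list a little more concrete, here is a minimal Python sketch of how a sequencing file might be sent to S3, with an option to write it in the Glacier storage class for low-cost archiving. It is an illustration under stated assumptions, not the workflow this book builds later: the function names, the bucket name, and the `stage/sample/file` key layout are all hypothetical. The only AWS call relied on is the `upload_file` method of an S3 client, such as the one returned by `boto3.client("s3")`.

```python
from pathlib import Path


def s3_key_for(sample_id: str, local_path: Path, stage: str = "raw") -> str:
    """Build an object key like 'raw/sample01/reads_R1.fastq.gz' (hypothetical layout)."""
    return f"{stage}/{sample_id}/{local_path.name}"


def upload_genomic_file(s3_client, local_path, bucket, sample_id,
                        stage="raw", archive=False):
    """Upload one file through an S3 client object and return the key used.

    s3_client is expected to behave like boto3.client("s3").
    When archive is True, the object is stored in the GLACIER storage class.
    """
    path = Path(local_path)
    key = s3_key_for(sample_id, path, stage)
    # StorageClass is passed via ExtraArgs; None leaves the default class.
    extra_args = {"StorageClass": "GLACIER"} if archive else None
    s3_client.upload_file(str(path), bucket, key, ExtraArgs=extra_args)
    return key
```

With boto3 installed and credentials configured, a call might look like `upload_genomic_file(boto3.client("s3"), "reads_R1.fastq.gz", "my-genome-bucket", "sample01")`, where the bucket name is a placeholder. Passing the client in as an argument keeps the sketch easy to try out with a stand-in object before touching a real account.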
This book is intended to educate its readers in two areas: the science of genomics and the technology of Amazon Web Services. The idea is that you use the latter as a tool to explore the former.
Since it's unlikely that many readers are familiar with both genomics and AWS, this book is meant to teach you either subject—or both if you are familiar with neither—and how they work together.
Here is a quick introduction to each chapter in this book. You can skip directly to the parts that interest you most, or you can read from beginning to end to get a complete picture.
Chapter 1: Why Do Genome Analysis Yourself When Commercial Offerings Exist?
This chapter explains what turnkey commercial services (such as 23andMe) exist and what they are good for. It then explains what they do not do and why you might want to do your own genomics work.
Chapter 2: A Crash Course in Molecular Biology
This chapter brings you up to speed on the state of biological science as it pertains to genomics. Use this to refresh your knowledge from school or to gain a new understanding.
Chapter 3: Obtaining Your Genome
This chapter explains how to get a sample of your blood suitable for sequencing and how to send it off to a lab for conversion into raw genome data.
Chapter 4: The Bioinformatics Workflow
This chapter approaches the sequencing process from a biological point of view, explaining how one step of data processing feeds into the next to ultimately produce the results you want.
Chapter 5: AWS Services for Genome Analysis
This chapter represents a deep dive into the AWS services you can use for genomics work. If you know biology but aren't familiar with AWS, you might want to begin here.
Chapter 6: Building Your Environment in the AWS Cloud
This chapter goes into more detail about how to set up AWS services for genomics work.
Chapter 7: Linux and AWS Command-Line Basics for Genomics
All major bioinformatics tools run under Linux, so you'll need to understand that operating environment and how to get things working together within it.
Chapter 8: Processing the Sequencing Data
This chapter explains how to get from raw sequencing data to useful information about a genome.
Chapter 9: Visualizing the Genome
This chapter introduces you to tools you can use to depict genomic information graphically, enabling you to understand it better.
Chapter 10: Containerizing Your Workflow on the Desktop
This chapter explains how to use Docker containers, both locally and in AWS, to process data efficiently and scalably.
Chapter 11: Variants and Applications
This chapter explores certain aspects of genome analysis, allowing you to dig deeper into the information you have derived from your sample.
Chapter 12: Cancer Genomics
This chapter discusses the analysis of somatic mutations, specifically those found in cancerous tumors, although the workflow could also be applied to any tissue in the body.
You can approach this book by reading straight through, from beginning to end. That is probably the best way to approach it if you have knowledge of neither genomics nor AWS. Alternatively, if you feel you have a good handle on one or the other, you can focus on those chapters that cover your weak area and then study the sections on how AWS can be used to create and automate a genomics workflow.
This book is not about us, but it's probably fair to explain to you, our reader, who we are and where we are coming from as we write this book.
We are a married couple and the parents of two children. Catherine is a working scientist and bioinformatician, attached to a medical research group in Sydney. David is an information technology and communications consultant who designs and builds AWS solutions, among other things.
A key fact is that one of our two children isn't alive anymore. Our daughter, Floriane, died of a sudden cardiac arrest in April 2015. She was nine years old at the time.
Floriane's death and its consequences are not the subject of this book. However, the way she lived and died inspired us to take our knowledge of genomics and cloud computing and look inward at our family. We wanted to know what happened to Floriane and what bearing that had on the rest of us. One way we honor her memory is by attempting to understand what happened to her. As well, we want to keep the rest of us—particularly her younger brother, Roland—safe.
Floriane was a fit and apparently healthy girl of nine. She had no ongoing medical concerns and was by all appearances developing normally. She was active and happy.
One evening, after dinner, she approached David on the sofa and asked him to chase her around the house—one of her favorite activities that they had indulged in many times before. David chased Floriane around, theatrically waving his arms. Floriane giggled excitedly and ran a few meters. David caught her and flipped her upside-down, which excited her further because she knew it led to tickling. At that point, she went completely limp.
David was confused. He thought he'd accidentally hit Floriane's head on something. Catherine, nearby, thought it was a seizure, even though Floriane had never had one previously. It quickly became clear that Floriane was in cardiac arrest.
Not having a defibrillator on hand, we called for an ambulance and, in the seconds before she died, did our best with manual cardiopulmonary resuscitation (CPR), which proved ineffective. Paramedics showed up with a defibrillator and adrenaline, and used both, but it was all over by then. Floriane died at home, a few minutes after the onset of her cardiac arrest.
(Again, it's not the subject of this book, but please take this opportunity to contemplate the fleetingness of life and to appreciate the people you love. We'll wait.)
And so we were left with questions: What just happened? Why did it happen? Is it going to happen to any of the rest of us? There were many more questions, far more existential in nature, and they remain, but they are subjects for other venues. Looking at the situation in a limited way, our daughter had just died, completely unexpectedly, and we wanted to know why.
Speculation on the cause of Floriane's death began almost immediately. She hadn't been visibly sick at all. She hadn't exhibited the symptoms—fever and so on—of an infection, such as meningitis. She was far too mature—nine years old, tall, and apparently fit—to have succumbed to the strange and almost random things that claim the lives of babies and are grouped under the catchall descriptor Sudden Infant Death Syndrome (SIDS). She didn't have any known allergies. She wasn't taking any medicines. She hadn't suffered an injury, hadn't eaten anything unusual, hadn't traveled anywhere dangerous—she hadn't done anything outside the realm of what is normal for a fifth-grade girl living in Australian suburbia. And yet she was no longer alive.
The emergency-room doctor offered a broad guess that turned out to be right: cardiomyopathy, or a disease of the heart muscle. Further investigation by the medical examiner showed some characteristics of hypertrophic cardiomyopathy (HCM), which is one of several kinds of cardiomyopathy, but also revealed some characteristics of arrhythmogenic right ventricular cardiomyopathy (ARVC), another disease of the heart. Genetic testing showed two interesting genes: MYH7 (associated with HCM) and RYR2 (associated with ARVC).
The medical examiner officially attributed Floriane's death to HCM but noted that her case had “unusual features”—a reference to the traces of ARVC that had been found. The stated cause of death could just as well have been ARVC, we were told.
We began to research and to think. We discovered that RYR2 is associated with disturbances to the heart rhythm due to adrenaline. Floriane had gone into cardiac arrest when she was wound up with excitement, playing with David. She had received an injection of adrenaline from the paramedics. Could the undeniable structural defects caused by HCM, which usually do prove fatal, but not until age 30 or so, have been amplified by some form of ARVC?
The question gained urgency when we learned that Floriane's MYH7 variant was a de novo mutation: she hadn't inherited it from us, and her brother didn't have it.
So began our research into our family's genomes.
Our goal in Genomics in the AWS Cloud is to make the study of your genome, or whatever genome you choose (some are available on the Internet for practice; see Chapter 4), as simple and straightforward as possible. Chapter 3 describes things you should consider before starting. As genome analysis requires computer resources that exceed those of the average desktop, we set up the analysis environment on the AWS Cloud; that's the subject of Chapters 5 through 7. This book does not presuppose any particular knowledge of molecular biology (which we introduce in Chapter 2) or computers (we talk about Linux in Chapter 7). We took particular care to explain how to make sense of our genetic variants in Chapters 11 and 12.
Now, let's start the journey.
Sybex strives to keep you supplied with the latest tools and information you need for your work. If you believe you have found a mistake in this book, please bring it to our attention. At John Wiley & Sons, we understand how important it is to provide our customers with accurate content, but even with our best efforts an error may occur.
In order to submit your possible errata, please email it to our Customer Service Team at [email protected] with the subject line “Possible Book Errata Submission.”
You can email David Wall with your comments or questions at [email protected].
As you begin to explore sequencing a genome and analyzing the results, you will no doubt become aware of a number of commercial operations that offer to do the job for you, neat and tidy, in exchange for a modest amount of money. Why would you want to go through the time and hassle involved in doing the work yourself when such convenient offerings are so easy to use? Why not engage a service to do your genetic analysis and enjoy the benefits of something that “just works”?
The answer has, essentially, two parts.
The first is that the commercial services may not provide all the information you want, not least because they are hamstrung by regulations that govern the provision of medical advice. They tend to provide “novelty” information—Larry Lightbulb kinds of things about hair color and finger length, as well as about racial, national, and tribal heritage. They are great for educating children about the sorts of information that DNA can carry and for talking about heritability of characteristics good, bad, and neutral. When you do the sequencing and analysis yourself, you can extract whatever information you want.
The second is that you are the kind of person who likes to do things yourself, either just for the satisfaction of it or because you want to understand how everything works and fits together. Working on your personal genome in Amazon Web Services (AWS) is an excellent way to learn about those services, and that knowledge can then be put to use for fun and profit.
As you survey the Web, you will find that there are several popular consumer-grade genome sequencing services, including 23andMe, Ancestry, and MyHeritage.
Others include the following:
African Ancestry: This service is marketed to Africans and people of African heritage. Some users have regretted that the service does not show DNA broken down by origin in the various regions of the African continent.
Athletigen: This service focuses on markers related to physical fitness, such as those affecting endurance and speed of recovery after exertion.
DNAFit: This service provides diet plans designed around certain nutrition-related markers.
Fitness Genes: This service offers training regimes that fit genomic markers related to physical exertion.
GEDmatch: This service focuses on genealogy and links with others who have submitted their sequences.
Genome Link: This service provides information on a series of characteristics, such as physical endurance and skin color.
Genopalate: Focused on nutrition, this service aims to help its customers optimize their diets.
Living DNA: This is an ancestry-focused service with a user community primarily from the United Kingdom and Ireland.
MyHeritageDNA: Ancestry focused, this service connects its users with possible relatives and suggests possible genetic risks to health.
Nebula Genomics: Offering a monthly subscription that entitles its users to monthly updates as new information becomes available, this service includes data on the oral microbiome (i.e., the bacteria found in your saliva).
Promethease: This is a modestly priced service that detects a number of single nucleotide polymorphisms (SNPs, or “snips”).
Sano Genetics: This free service concentrates on SNPs related to autism and mathematical reasoning.
SelfDecode: This is a general-purpose detector of several thousand genetic markers.
Vitagene: Focused on fitness and athletic performance, this service includes ancestry information as well.
Xcode Life: This service offers several low-cost specialty tests including one on skin care and another on metabolic diseases.
Of the aforementioned services, perhaps the best known is 23andMe, a California company that pioneered the direct-to-consumer genetic testing industry in 2007. When someone places an order with 23andMe, the company sends out a kit containing materials needed for the collection of saliva, which is then sent back for analysis. (The idea is that everyone's saliva contains cells that have been shed from the interior of the mouth.) The company presents its report to the customer via its website.
The company ran afoul of the U.S. Food and Drug Administration (FDA) in 2013, when the regulator objected to 23andMe (and other genetics services providers) advertising that its service provided its customers with information on their susceptibility to various genetically linked conditions, such as male-pattern baldness and certain kinds of cancer. This, the FDA said, constituted medical advice of the sort that should be formulated and delivered by a qualified doctor. The 23andMe tests were medical devices and should be regulated as such.
After going quiet for several years, 23andMe applied to the FDA for permission to include information in its reports about a number of mutations and alleles that are well-understood to be associated with pathogenic conditions, including Alzheimer's disease, Parkinson's disease, celiac disease, and a number of BRCA1 and BRCA2 mutations associated with breast cancer. The company argued—and the FDA ultimately agreed—that the test methods used by 23andMe were sufficiently reliable and understood as to not require the involvement of a medical professional. As well, the entities agreed that the relationships between the tested sequences and the various diseases were adequately proven, and that if an individual was found to have a sequence known to be pathogenic, there was no need to hide the truth behind the medical establishment.
With the shift toward presentation of information about ancestry rather than medical conditions, direct-to-consumer online genetic analysis services have, perhaps predictably, begun to appeal to those for whom ancestry is of more than passing interest.
So, what's in a set of 23andMe reports? If you undergo the saliva test and log into the 23andMe website today, you will get a lot of novelty information about probable hair color, the shapes of certain body parts, and aspects of aging.
Sometimes, 23andMe gets things right. For example, 23andMe predicted the following for me:
A 67 percent probability of little to no back hair. (Correct!)
A 32 percent probability of a bald spot on the top rear of the head. (Correct, pretty much. It's kind of thinning there, but certainly not bald. Certainly not.)
A 74 percent chance that the earlobes are separate from the sides of the face. (Hear, hear!)
A 71 percent chance that the ring fingers are longer than the index fingers. (Yep.)
A 1 percent chance of red hair. (It's brown.)
A 62 percent chance that dandruff is sometimes a problem. (It is.)
A 25 percent chance of being afraid of heights. (I am a qualified pilot.)
Sometimes, it gets things wrong. In my case, it forecast the following:
A 51 percent chance of blue eye color. (They're green.)
A 66 percent chance of wavy hair. (It's straight.)
A 33 percent chance of a widow's peak across the front of the scalp. (There is definitely a prominent one.)
A 4 percent chance of bunions in the feet. (A substantial one on the right foot has been causing trouble since high school.)
From there, the 23andMe report can get a little comical. For example, my report states that I am likely to wake up at precisely 7:34 a.m. Apparently, the company uses statistical analysis of surveys conducted on people with certain similar genetic markers (those associated with early or late rising) to arrive at the precise time. The suggestion is that genetics determines your wake-up time to the minute, which is just not correct. This is a manufactured novelty “fact,” and in my case it is also wrong: I almost always get out of bed before 6 a.m. Hilariously, my wife Catherine's 23andMe report has her waking up significantly earlier than I do, which has happened exactly zero times.
The company provides an ancestry report, which attempts to describe which part of the world your forebears came from.
Some background here: my wife and I live in Australia, to which we both immigrated (Catherine from France and me from the United States) around the turn of the 21st century. We are both now people of the New World, which is largely characterized by its relatively recent immigrants.
In my case, the ancestry timeline (shown in Figure 1.1) appears to agree with much of what I know about my family history. It shows ancestors from Scandinavia as recently as 1940. Family lore states that my maternal grandmother was born in Sweden and was brought to Chicago as a baby in the 1920s, so that makes sense. Similarly, my family records have my maternal great-great-grandfather emigrating from Hesse, in the western part of what had recently become a unified German Empire, in 1879. That fits with 23andMe's report of French and German ancestry in the late 19th century.
Figure 1.1: My ancestry timeline
My paternal side is less well documented, probably because those ancestors have been in the United States for several generations. The surname Wall fits with heritage from the British Isles, though, and there was always vague talk about my paternal grandmother having ancestors from the eternally contested regions of Central and Eastern Europe.
The report includes Cypriot heritage in the 18th century. I have no idea what that is about, having never heard anything about ancestors on an island of the Ottoman Empire in the Eastern Mediterranean. The 1700s are the distant past for us, as far removed in memory as the original ancestors that first distinguished themselves from the other apes.
Conclusion: no one really knows, but it kind of fits.
As for Catherine, she understands her heritage to be French for many generations back, as far as anyone has been able to trace genealogy. I get scolded when I suggest that the family tree may be particularly branchless. The report from 23andMe (shown in Figure 1.2) concurs, sort of. It shows French heritage back into the 19th century, as well as British and Irish ancestors during the same time period. This latter characteristic may be explained by Catherine's maternal antecedents being from Brittany, a maritime region in the extreme west of France. The language and culture there are Celtic and distinct from those of mainstream France—a distinction that was even more pronounced in the past. The Bretons have a lot in common with other Celtic people, including the Irish, Welsh, Manx, and Cornish. A number of her ancestors were sailors, which probably contributed to the genetic intermixing.
It's all interesting in a bourgeois sort of way, in which people living comfortably enough to spend a hundred bucks on a saliva test can look back at their ancestors, whose lives were foreign (in time, if not always location) and therefore conventionally romantic and interesting. In the case of people who find that their ancestors migrated from place to place, it becomes possible to conclude that the lives of the ancestors were unstable or otherwise crappy enough to motivate them to relocate, which allows the patrons of 23andMe to burnish their claims to modest origins. It's all good fun and harmless enough, facilitating European stuff on North American Christmas trees and a bottomless market for generic Irish music among Boomers in the former colonies.
Figure 1.2: My wife's ancestry timeline
There's a dark side to genetic ancestry services like that of 23andMe, though. Some people assign extreme importance to their personal ancestry and believe that certain types of genetic heritage are inherently better than others. A scan of white supremacist websites, for example, will reveal screen shots of ancestry reports from 23andMe and similar services, posted by people attempting to fit in with the group. It's simultaneously risible and sad—or would be so, if it weren't for the undercurrent of violence—to see genetic profiling used in this way. Genetic testing is like any other power tool, though, capable of being used for ill as well as for good.
Perhaps the best part of 23andMe is the fact that it makes available what it calls raw data, which is a text file of 7–10 MB containing information about base pair sequences (see Figure 1.3).
You can take the 23andMe raw data file and use it as input to many of the other services listed earlier in this chapter.
Figure 1.3: 23andMe raw data
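For readers who want to look inside such a file programmatically, here is a minimal Python sketch. It assumes the commonly distributed layout of a consumer raw data export: comment lines starting with `#`, followed by tab-separated rows of marker identifier, chromosome, position, and genotype. That layout, the `Genotype` and `read_raw_data` names, and the three sample rows (with made-up identifiers) are assumptions for illustration; check them against your own file before relying on them.

```python
import io
from typing import Dict, Iterator, NamedTuple, TextIO


class Genotype(NamedTuple):
    rsid: str        # marker identifier
    chromosome: str  # kept as text because of values such as X, Y, and MT
    position: int    # coordinate on the chromosome
    genotype: str    # the observed letters, e.g. "AG"


def read_raw_data(handle: TextIO) -> Iterator[Genotype]:
    """Yield one Genotype per data row, skipping comment and blank lines."""
    for line in handle:
        line = line.rstrip("\r\n")
        if not line or line.startswith("#"):
            continue
        rsid, chromosome, position, genotype = line.split("\t")
        yield Genotype(rsid, chromosome, int(position), genotype)


# Hypothetical three-row sample standing in for a real multi-megabyte file.
SAMPLE = (
    "# rsid\tchromosome\tposition\tgenotype\n"
    "rs0000001\t1\t1000\tAA\n"
    "rs0000002\t1\t2000\tAG\n"
    "rs0000003\tMT\t300\tC\n"
)

calls = list(read_raw_data(io.StringIO(SAMPLE)))
by_rsid: Dict[str, Genotype] = {call.rsid: call for call in calls}
print(len(calls), by_rsid["rs0000002"].genotype)
```

To read a real export, pass an open file handle (for example, `open(path, encoding="utf-8")`) instead of the `io.StringIO` sample.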
In this chapter, we considered the fact that there are a number of commercial genome-analysis services available for use. Most of them ask you to send in some genetic material (usually in the form of cells suspended in saliva) and then test for various markers. The services then present their customers with a report, which usually focuses on ancestry and certain physical characteristics (such as probable hair color and ability to perceive certain flavors) rather than serious medical information. This is due to regulations that govern who may offer medical advice, and how. This is one of the main reasons to pursue your own genome sequencing work on the AWS Cloud: you will have the data and tools you need to answer all the medical questions that interest you.
To properly analyze genomes, you need to understand some key biology concepts. Biology is a vast field, and this chapter focuses on areas that are fundamental for extracting meaningful information from genomes.
As you are probably aware, your genome encodes all the information necessary to build and maintain your body. It is composed of a series of about three billion letters, which can be either A, G, C, or T, and is stored inside the nucleus of each of your three trillion nucleated cells. Red blood cells do not have a nucleus and therefore are your only cells without a copy of your genome.
Note that all living organisms, such as animals, plants, fungi, and bacteria, also have a genome composed of the same letters A, G, C, and T. This book focuses on the analysis of the human genome, but all concepts and techniques are applicable to other species. You can apply them to the analysis of the genome of your cat, goldfish, potted plant, or slime mold growing in your front lawn. In genomics, it is sobering to see that humans are just one species among many others.
Physically speaking, your genome is encoded into chains of repeating subunits called nucleotides. Each nucleotide consists of a sugar group, a phosphate group, and a base that can be either adenine (A), guanine (G), cytosine (C), or thymine (T). The sugar group is called deoxyribose; hence, these long molecules are named deoxyribonucleic acid (DNA). A chain of nucleotides is referred to as a DNA strand. To make this chemical structure more stable against mutations, your genome is actually stored as two complementary DNA strands. The second strand does not provide any more information and is simply a copy of the first strand, where A has been replaced by T, G by C, T by A, and C by G. The two DNA strands bind via weak chemical bonds called hydrogen bonds (the bases A pair with T, the bases G pair with C), coil around each other, and form the elegant and well-known DNA helix structure.
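The pairing rule just described (A with T, G with C) is simple enough to express in a few lines of Python. The sketch below is only an illustration: the function names and the example sequence are made up, and the sequence is not taken from any real genome.

```python
# Base pairing as described in the text: A pairs with T, and G pairs with C.
PAIRS = {"A": "T", "T": "A", "G": "C", "C": "G"}


def complement(strand: str) -> str:
    """Return the base-by-base complement of a strand, in the same left-to-right order."""
    return "".join(PAIRS[base] for base in strand.upper())


def is_paired(first: str, second: str) -> bool:
    """True when every position of two aligned strands forms a valid base pair."""
    return len(first) == len(second) and all(
        PAIRS.get(a) == b for a, b in zip(first.upper(), second.upper())
    )


first_strand = "GATTACA"  # hypothetical example sequence
second_strand = complement(first_strand)
print(second_strand)  # CTAATGT
print(is_paired(first_strand, second_strand))  # True
```

Because the second strand can always be rebuilt from the first in this way, it carries no additional information, which is the point made above.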
DNA is a convenient and compact way of storing information. It has already been used to store small, compressed MPEG movies and data of up to 200 MB.
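To get a feel for why DNA is such a compact medium, note that four possible letters carry two bits of information per base. The sketch below uses a deliberately naive mapping (A = 00, C = 01, G = 10, T = 11) that we chose for illustration. It is an assumption, not the scheme used in the storage experiments mentioned above; practical schemes add error correction and avoid awkward sequences.

```python
# Hypothetical 2-bit code chosen for illustration only.
BASES = "ACGT"  # index 0..3 corresponds to bit pairs 00, 01, 10, 11


def bytes_to_dna(data: bytes) -> str:
    """Encode each byte as four bases, most significant bit pair first."""
    return "".join(
        BASES[(byte >> shift) & 0b11] for byte in data for shift in (6, 4, 2, 0)
    )


def dna_to_bytes(sequence: str) -> bytes:
    """Decode a sequence produced by bytes_to_dna back into bytes."""
    if len(sequence) % 4:
        raise ValueError("sequence length must be a multiple of 4")
    out = bytearray()
    for i in range(0, len(sequence), 4):
        byte = 0
        for base in sequence[i : i + 4]:
            byte = (byte << 2) | BASES.index(base)
        out.append(byte)
    return bytes(out)


encoded = bytes_to_dna(b"Hi")
print(encoded)                # CAGACGGC
print(dna_to_bytes(encoded))  # b'Hi'
```

Under this naive mapping, each byte takes four bases, so 200 MB of data would need roughly 800 million bases.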
In living organisms, the long DNA molecule is broken into segments called chromosomes. Species have various numbers of chromosomes: the fruit fly has 8 chromosomes, spinach and slime mold have 12, humans have 46, dogs and chickens have 78, and carp have 100. Bacteria stand apart as they have a single chromosome of circular shape, which is free floating as bacteria do not have a nucleus. The 46 human chromosomes are divided into two sets of 23 chromosomes, with one set inherited from each parent. Our chromosomes are numbered from 1 to 22, with the final pair, the sex chromosomes X and Y, determining our sex.
Not to be forgotten, each of our cells also contains between 80 and 2,000 small oval structures called mitochondria that produce the energy required for the cell. Mitochondria are thought to descend from bacteria that were internalized by the cells of our remote primitive ancestors some 1.45 billion years ago. Each of our mitochondria has its own DNA, which has a circular shape (as in bacteria), unlike our linear chromosomes. We inherit our mitochondrial DNA exclusively from our mother. This allows genealogical researchers to trace our maternal lineage back many generations.
The total length of our genome, if we put our chromosomes end to end, would be nearly 2 meters. To fit into the tiny nucleus of each of our cells, our DNA is first wound around proteins called histones and then tightly folded and coiled. This packing complex of DNA and histone is called chromatin. When the chromatin is tightly packed (also called condensed), the DNA cannot be accessed. When a region of DNA needs to be read, the chromatin is unpacked and is said to be open.
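The figure of nearly 2 meters can be checked with simple arithmetic. The sketch below assumes a spacing of about 0.34 nanometers between consecutive base pairs, a commonly quoted value for the standard double helix that is not given in the text above, and counts both chromosome sets present in a cell.

```python
BASE_PAIRS_PER_SET = 3_000_000_000  # "about three billion letters" per chromosome set
CHROMOSOME_SETS = 2                 # one set inherited from each parent
RISE_PER_BASE_PAIR_NM = 0.34        # assumed spacing between base pairs, in nanometers

total_base_pairs = BASE_PAIRS_PER_SET * CHROMOSOME_SETS
length_m = total_base_pairs * RISE_PER_BASE_PAIR_NM * 1e-9  # nanometers to meters
print(f"{length_m:.2f} m")  # 2.04 m
```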
Now, let's look at some chemical nuts and bolts to clarify an important concept that is sometimes confusing when interpreting genomics databases. DNA strands have an orientation: at one end of the strand, the free phosphate group is attached to the fifth carbon atom of the sugar, whereas at the other end, the free hydroxyl group sits on the third carbon. Each end of a DNA strand is therefore nicknamed 5' or 3'. By convention, DNA sequences are represented from 5' to 3'. Because the second complementary strand does not represent any additional information, usually only the first strand (called the coding strand or Plus strand or + strand) is represented, whereas the complementary strand (Minus strand or – strand) is omitted (see Figure 2.1). Note that the Minus strand, read from 5' to 3' as per the convention, is the reverse complement of the Plus strand. Each pair of letters of the double-stranded DNA molecule is referred to as a base pair (bp). So, the sequence shown in Figure 2.1 consists of 25 bp. Confusingly, although genomics databases usually describe DNA mutations or variants on the Plus strand, they sometimes describe them on the Minus strand. It is necessary in that case to transform the nucleotide information into its complement (A to T, G to C, and vice versa).
Figure 2.1: DNA molecule representation
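The strand conventions above translate directly into code. The following Python sketch shows how a Plus-strand sequence relates to its Minus strand, and how a variant reported on the Minus strand can be re-expressed on the Plus strand. The function names and example sequence are hypothetical; real databases each have their own way of flagging which strand a variant is reported on.

```python
from typing import Tuple

# Translation table for base complements (A<->T, G<->C), preserving case.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(sequence: str) -> str:
    """Return the opposite strand of a sequence, also written 5' to 3'."""
    return sequence.translate(COMPLEMENT)[::-1]


def to_plus_strand(ref: str, alt: str, strand: str) -> Tuple[str, str]:
    """Express a variant's reference and alternate alleles on the Plus strand."""
    if strand == "+":
        return ref, alt
    if strand == "-":
        return reverse_complement(ref), reverse_complement(alt)
    raise ValueError("strand must be '+' or '-'")


plus = "ATGGCCTTA"  # hypothetical Plus-strand sequence, 5' to 3'
minus = reverse_complement(plus)
print(minus)                          # TAAGGCCAT
print(to_plus_strand("G", "A", "-"))  # ('C', 'T')
```

For a single-letter variant, the reverse complement is just the complement, which is why a G-to-A change reported on the Minus strand reads as C-to-T on the Plus strand.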
Crucially, as the principle of life revolves around replicating itself, DNA can easily be replicated via convenient cell enzymes called DNA polymerases. This enables cells to receive an identical copy of the genome after each cell division. In practice, the chromatin of the chromosomes is decondensed, the two strands of their double helix are unwound, and the DNA polymerase synthesizes a new complementary DNA strand for each of the two initial strands. At the end of the process, the DNA is fully duplicated into identical double helices. The chromosomes then align in the center of the nucleus. Fibers attach to the center (called the centromere) of the chromosomes and pull one copy of each chromosome to opposite poles of the cell. This process, by which two new nuclei are formed with a view to creating two identical daughter cells, is called mitosis. Finally, the cell itself divides into two cells, each with an identical genome.
How long does it take for a cell to replicate? Bacteria have very small genomes that can be replicated quickly, as well as short growth phases; in consequence, bacteria can divide in as little as 15 minutes. In contrast, fast-growing human cells take about 24 hours to divide. The cell starts with a growth phase (called G1) of about 11 hours, where the cell secretes enzymes and nutrients that will be required for the cell division. The cell then undergoes a phase of DNA synthesis (called S phase) to duplicate the genome, which takes about 8 hours. It is followed by a second growth phase (called G2) of about 4 hours. Finally, the nucleus division (mitosis) and cell division can occur. This last phase, called M phase, is the most visual and spectacular and takes only one hour.
The chromosomes are not usually visible under the microscope, as they are all bundled together in the nucleus. During mitosis, the chromosomes are condensed and neatly align in the center of the cell. It is during this phase that the chromosomes can be seen and photographed under a light microscope when stained with a dye. This technique is called karyotyping. Figure 2.2