Junk DNA - Nessa Carey - E-Book

Nessa Carey

Description

From the author of the acclaimed The Epigenetics Revolution ('A book that would have had Darwin swooning' – Guardian) comes another thrilling exploration of the cutting edge of human science. For decades after the structure of DNA was identified, scientists focused purely on genes, the regions of the genome that contain codes for the production of proteins. Other regions – 98% of the human genome – were dismissed as 'junk'. But in recent years researchers have discovered that variations in this 'junk' DNA underlie many previously intractable diseases, and they can now generate new approaches to tackling them. Nessa Carey explores, for the first time for a general audience, the incredible story behind a controversy that has generated unusually vituperative public exchanges between scientists. She shows how junk DNA plays an important role in areas as diverse as genetic diseases, viral infections, sex determination in mammals, human biological complexity, disease treatments, even evolution itself – and reveals how we are only now truly unlocking its secrets, more than half a century after Crick and Watson won their Nobel prize for the discovery of the structure of DNA in 1962.

Page count: 501

Year of publication: 2015

JUNK

DNA

Also by Nessa Carey

The Epigenetics Revolution

JUNK

DNA

A Journey Through the Dark Matter of the Genome

NESSA CAREY

Published in the UK in 2015 by Icon Books Ltd, Omnibus Business Centre, 39–41 North Road, London N7 9DP

Sold in the UK, Europe and Asia by Faber & Faber Ltd, Bloomsbury House, 74–77 Great Russell Street, London WC1B 3DA or their agents

Distributed in the UK, Europe and Asia by TBS Ltd, TBS Distribution Centre, Colchester Road, Frating Green, Colchester CO7 7DW

Distributed in Australia and New Zealand by Allen & Unwin Pty Ltd, PO Box 8500, 83 Alexander Street, Crows Nest, NSW 2065

Distributed in South Africa by Jonathan Ball, Office B4, The District, 41 Sir Lowry Road, Woodstock 7925

Distributed in India by Penguin Books India, 7th Floor, Infinity Tower – C, DLF Cyber City, Gurgaon 122002, Haryana

ISBN: 978-184831-826-7

Text copyright © 2015 Nessa Carey

The author has asserted her moral rights.

No part of this book may be reproduced in any form, or by any means, without prior permission in writing from the publisher.

Typeset in Janson Text by Marie Doherty

Printed and bound in the UK by Clays Ltd, St Ives plc

For Abi Reynolds, who is always by my side

And for Sheldon – good to see you again

Contents

Acknowledgements

Notes on Nomenclature

An Introduction to Genomic Dark Matter

1. Why Dark Matter Matters

2. When Dark Matter Turns Very Dark Indeed

3. Where Did All the Genes Go?

4. Outstaying an Invitation

5. Everything Shrinks When We Get Old

6. Two is the Perfect Number

7. Painting with Junk

8. Playing the Long Game

9. Adding Colour to the Dark Matter

10. Why Parents Love Junk

11. Junk with a Mission

12. Switching It On, Turning It Up

13. No Man’s Land

14. Project ENCODE – Big Science Comes to Junk DNA

15. Headless Queens, Strange Cats and Portly Mice

16. Lost in Untranslation

17. Why LEGO is Better Than Airfix

18. Mini Can Be Mighty

19. The Drugs Do Work (Sometimes)

20. Some Light in the Darkness

Notes

Appendix: Human Diseases in which Junk DNA Has Been Implicated

Index

Acknowledgements

I am lucky that for my second book I continue to have the support of a great agent, Andrew Lownie, and of lovely publishers. At Icon Books I’d particularly like to thank Duncan Heath, Andrew Furlow and Robert Sharman, but not forgetting their former colleagues Simon Flynn and Henry Lord. At Columbia University Press I’m very grateful to Patrick Fitzgerald, Bridget Flannery-McCoy and Derek Warker.

As always, entertainment and enlightenment have been obtained from some unusual quarters. Conor Carey, Finn Carey and Gabriel Carey all played a role in this, and outside the genetic clan I’d also like to thank Iona Thomas-Wright. Endless support and lots of biscuits have been provided by my ever-patient, delightful mother-in-law, Lisa Doran.

I’ve had a blast delivering lots of science talks to non-specialist audiences since my first book was published. The various organisations that have invited me to speak are too many to namecheck but they know who they are and I’ve enjoyed the privilege immensely. It’s been very inspiring. Thank you all.

And finally Abi. Who is mercifully forgiving of the fact that, despite my promises, I still haven’t had that ballroom dancing lesson yet.

Notes on Nomenclature

There’s a bit of a linguistic difficulty in writing a book on junk DNA, because it is a constantly shifting term. This is partly because new data change our perception all the time. Consequently, as soon as a piece of junk DNA is shown to have a function, some scientists will say (logically enough) that it’s not junk. But that approach runs the risk of losing perspective on how radically our understanding of the genome has changed in recent years.

Rather than spend time trying to knit a sweater with this ball of fog, I have adopted the most hard-line approach. Anything that doesn’t code for protein will be described as junk, as it originally was in the old days (second half of the twentieth century). Purists will scream, and that’s OK. Ask three different scientists what they mean by the term ‘junk’, and we would probably get four different answers. So there’s merit in starting with something straightforward.

I also start by using the term ‘gene’ to refer to a stretch of DNA that codes for a protein. This definition will evolve through the course of the book.

After my first book The Epigenetics Revolution was published, I realised the readership was quite binary with respect to gene names. Some people love knowing which gene is being discussed, but for other readers it disrupts the flow horribly. So this time I have only used specific gene names in the text where absolutely necessary. But if you want to know them, they are in the footnotes, and the citations for the original references are at the back of the book.

An Introduction to Genomic Dark Matter

Imagine a written script for a play, or film, or television programme. It is perfectly possible for someone to read a script just as they would a book. But the script becomes so much more powerful when it is used to produce something. It becomes more than just a string of words on a page when it is spoken aloud, or better yet, acted.

DNA is rather similar. It is the most extraordinary script. Using a tiny alphabet of just four letters it carries the code for organisms from bacteria to elephants, and from brewer’s yeast to blue whales. But DNA in a test tube is pretty boring. It does nothing. DNA becomes far more exciting when a cell or an organism uses it to stage a production. The DNA is used as the code for creating proteins and these proteins are vital for breathing, feeding, getting rid of waste, reproducing and all the other activities that characterise living organisms.

Proteins are so important that in the twentieth century scientists used them to define what they meant by a gene. A gene was described as a sequence of DNA that codes for a protein.

Let’s think about the most famous scriptwriter in history, William Shakespeare. It can take a while for us to tune in to Shakespeare’s writings because of the way the English language has changed in the centuries since his death. But even so, we are always confident that the Bard only wrote the words he needed his actors to speak.

Shakespeare did not, for example, write the following:

vjeqriugfrhbvruewhqoerahcxnqowhvgbutyunyhewqicxhjafvurytnpemxoqp[etjhnuvrwwwebcxewmoipzowqmroseuiednrcvtycuxmqpzjmoimxdcnibyrwvytebanyhcuxqimokzqoxkmdcifwrvjhentbubygdecftywerftxunihzxqwemiuqwjiqpodqeotherpowhdymrxnamehnfeicvbrgytrchguthhhhhhhgcwouldupaizmjdpqsmellmjzufernnvgbyunasechuxhrtgcnionytuiongdjsioniodefnionihyhoniosdreniokikiniourvjcxoiqweopapqsweetwxmocviknoitrbiobeierrrrrrruorytnihgfiwoswakxdcjdrfuhrqplwjkdhvmogmrfbvhncdjiwemxsklowe

Instead, he just wrote the words which are shown in capitals:

vjeqriugfrhbvruewhqoerAhcxnqowhvgbutyunyhewqicxhjafvurytnpemxoqp[etjhnuvrwwwebcxewmoipzowqmROSEuiednrcvtycuxmqpzjmoimxdcniBYrwvytebANYhcuxqimokzqoxkmdcifwrvjhentbubygdecftywerftxunihzxqwemiuqwjiqpodqeOTHERpowhdymrxNAMEhnfeicvbrgytrchguthhhhhhhgcWOULDupaizmjdpqSMELLmjzufernnvgbyunASechuxhrtgcnionytuiongdjsioniodefnionihyhoniosdreniokikiniourvjcxoiqweopapqSWEETwxmocviknoitrbiobeierrrrrrruorytnihgfiwoswakxdcjdrfuhrqplwjkdhvmogmrfbvhncdjiwemxsklowe

That is, ‘A rose by any other name would smell as sweet’.

But if we look at our DNA script it is not sensible and compact, like Shakespeare’s line. Instead, each protein-coding region is like a single word adrift in a sea of gibberish.

For years, scientists had no explanation for why so much of our DNA doesn’t code for proteins. These non-coding parts were dismissed with the term ‘junk DNA’. But gradually this position has begun to look less tenable, for a whole host of reasons.

Perhaps the most fundamental reason for the shift in emphasis is the sheer volume of junk DNA that our cells contain. One of the biggest shocks when the human genome sequence was completed in 2001 was the discovery that over 98 per cent of the DNA in a human cell is junk. It doesn’t code for any proteins. The Shakespeare analogy used above is in fact a simplification. In genome terms, the ratio of gibberish to text is about four times as high as shown. There are over 50 letters of junk for every one letter of sense.
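For readers who like to check the arithmetic, the ratio of junk to sense can be worked out in a few lines. This is only an illustrative sketch: the function name `junk_to_sense_ratio` is invented for the example, and the only input taken from the text is the 98 per cent figure.

```python
def junk_to_sense_ratio(junk_fraction: float) -> float:
    """Return how many non-coding letters there are per protein-coding letter."""
    if not 0 <= junk_fraction < 1:
        raise ValueError("junk_fraction must be at least 0 and below 1")
    return junk_fraction / (1 - junk_fraction)


# At exactly 98 per cent junk there are about 49 junk letters per coding letter.
print(round(junk_to_sense_ratio(0.98), 1))

# The ratio climbs quickly as the junk share edges above 98 per cent.
print(round(junk_to_sense_ratio(0.985), 1))
```

Because the coding share sits in the denominator, even a small rise above 98 per cent pushes the ratio past 50 to 1.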

There are other ways of envisaging this. Let’s imagine we visit a car factory, perhaps for something high-end like a Ferrari. We would be pretty surprised if for every two people who were building a shiny red sports car, there were another 98 who were sitting around doing nothing. This would be ridiculous, so why would it be reasonable in our genomes? While it’s a very fair point that it’s the imperfections in organisms that are often the strongest evidence for descent from common ancestors – we humans really don’t need an appendix – this seems like taking imperfection rather too far.

A much more likely scenario in our car factory would be that for every two people assembling a car, there are 98 others doing all the things that keep a business moving. Raising finance, keeping accounts, publicising the product, processing the pensions, cleaning the toilets, selling the cars etc. This is probably a much better model for the role of junk in our genome. We can think of proteins as the final end points required for life, but they will never be properly produced and coordinated without the junk. Two people can build a car, but they can’t maintain a company selling it, and certainly can’t turn it into a powerful and financially successful brand. Similarly, there’s no point having 98 people mopping the floors and staffing the showrooms if there’s nothing to sell. The whole organisation only works when all the components are in place. And so it is with our genomes.

The other shock from the sequencing of the human genome was the realisation that the extraordinary complexities of human anatomy, physiology, intelligence and behaviour cannot be explained by referring to the classical model of genes. In terms of numbers of genes that code for proteins, humans contain pretty much the same quantity (around 20,000) as simple microscopic worms. Even more remarkably, most of the genes in the worms have directly equivalent genes in humans.

As researchers deepened their analyses of what differentiates humans from other organisms at the DNA level, it became apparent that genes could not provide the explanation. In fact, only one genetic factor generally scaled with complexity. The only genomic features that increased in number as animals became more complicated were the regions of junk DNA. The more sophisticated an organism, the higher the percentage of junk DNA it contains. Only now are scientists really exploring the controversial idea that junk DNA may hold the key to evolutionary complexity.

In some ways, the question raised by these data is pretty obvious. If junk DNA is so important, what is it actually doing? What is its role in a cell, if it isn’t coding for proteins? It’s becoming apparent that junk DNA actually has a multiplicity of different functions, perhaps unsurprisingly given how much of it there is.

Some of it forms specific structures in the chromosomes, the enormous molecules into which our DNA is packaged. This junk prevents our DNA from unravelling and becoming damaged. As we age, these regions decrease in size, finally declining below a critical minimum. After that, our genetic material becomes susceptible to potentially catastrophic rearrangements that can lead to cell death or cancers. Other structural regions of junk DNA act as anchor points when chromosomes are shared equally between different daughter cells during cell division. (The term ‘daughter cell’ means any cell created by division of a parental cell. It doesn’t imply that the cell is female.) Yet others act as insulation regions, restricting gene expression to specific regions of chromosomes.

But a great deal of our junk DNA is not simply structural. It doesn’t code for proteins, but it does code for a different type of molecule, called RNA. A large class of this junk RNA forms factories in the cell, helping to produce proteins. Other types of RNA molecules transport the raw material for protein production to the factory sites.

Other regions of junk DNA are genetic interlopers, derived from the genomes of viruses and other microorganisms that have integrated into human chromosomes, like genetic sleeper agents. These remnants of long-dead organisms carry potential dangers to the cell, the individual and sometimes even to wider populations. Mammalian cells have developed multiple mechanisms to keep these viral elements silent, but these systems can break down. When they do, the effects can range from relatively benign – changing the coat colour of a particular strain of mice – to much more dramatic, such as an increased risk of cancer.

A major role of junk DNA, only recognised in the main in the last few years, is to regulate gene expression. Sometimes this can have a huge and noticeable effect in an individual. One particular piece of junk DNA is absolutely vital for ensuring healthy gene expression patterns in female animals. Its effects are seen in a whole range of situations. A mundane example is the control of the colour patterns of tortoiseshell cats. At its most extreme, the same mechanism also explains why female identical twins may present with different symptoms of a genetically inherited disease. In some cases, this can be so extreme that one twin is severely affected with a life-threatening disorder while the other is completely healthy.

Thousands and thousands of regions of junk DNA are suspected to regulate networks of gene expression. They act like the stage directions for the genetic script, but directions of a complexity we could never envisage in the theatre. Forget about ‘Exit, pursued by a bear’. These would be more along the lines of ‘If performing Hamlet in Vancouver and The Tempest in Perth, then put the stress on the fourth syllable of this line of Macbeth. Unless there’s an amateur production of Richard III in Mombasa and it’s raining in Quito.’

Researchers are only just beginning to unravel the subtleties and interconnections in the vast networks of junk DNA. The field is controversial. At one extreme we have scientists claiming experimental proof is lacking to support sometimes sweeping claims. At the other are those who feel there is a whole generation of scientists (if not more) trapped in an outdated model and unable to see or understand the new world order.

Part of the problem is that the systems we can use to probe the functions of junk DNA are still relatively underdeveloped. This can sometimes make it hard for researchers to use experimental approaches to test their hypotheses. We have only been working on this for a relatively short space of time. But sometimes we need to remember to step back from the lab bench and the machines that go ping. Experiments surround us every day, because nature and evolution have had billions of years to try out all sorts of changes. Even the brief geological moment that represents the emergence and spread of our own species has been sufficient time to create a greater range of experiments than those of us who wear lab coats could ever dream of testing. Consequently, throughout much of this book we will explore the darkness by using the torch of human genetics.

There are many ways to begin shining a light on the dark matter of our genome, so let’s start with an odd but unassailable fact to anchor us. Some genetic diseases are caused by mutations in junk DNA, and there is probably no better starting point for our journey into the hidden genomic universe than this.

1. Why Dark Matter Matters

Sometimes life seems to be cruel in the troubles it piles onto a family. Consider this example. A baby boy was born; let’s call him Daniel. He was strangely floppy at birth, and had trouble breathing unassisted. With intensive medical care Daniel survived and his muscle tone improved, allowing him to breathe unaided and to develop mobility. But as he grew older it became apparent that Daniel had pronounced learning disabilities that would hold him back throughout life.

His mother Sarah loved Daniel and cared for him every day. As she entered her mid-30s this became more difficult because Sarah developed strange symptoms. Her muscles became very stiff, to the extent that she would have trouble releasing items after grasping them. She had to give up her highly skilled part-time job as a ceramics restorer. Her muscles also began to waste away noticeably. Yet she found ways to cope. But when she was only 42 years old Sarah died suddenly from a cardiac arrhythmia, a catastrophic disruption in the electrical signals that keep the heart beating in a coordinated way.

It fell to Sarah’s mother, Janet, to look after Daniel. This was challenging for her, and not just because of her grandson’s difficulties and the grief she was suffering over the early death of her daughter. Janet had developed cataracts in her early 50s and as a consequence her vision wasn’t that great.

It seemed as if the family had suffered a very unfortunate combination of unrelated medical problems. But specialists began to notice something rather unusual. This pattern – cataracts in one individual, muscle stiffness and cardiac defects in their daughter and floppy muscles and learning disabilities in the grandchildren – occurred in multiple families. These individual families lived all over the world and none of them were related to each other.

Scientists realised they were looking at a genetic disease. They named it myotonic dystrophy (myotonic means muscle tone, dystrophy means wasting). The condition occurred in every generation of an affected family. On average there was a one in two chance of a child being affected if their parent had the condition. Males and females were equally at risk and either could pass it on to their children.1

These inheritance characteristics are very typical of diseases caused by mutations in a single gene. A mutation is simply a change from the normal DNA sequence. We typically inherit two copies of every gene in our cells, one from our mother and one from our father. The pattern of inheritance in myotonic dystrophy, where the disease appears in each generation, is referred to as dominant. In dominant disorders, only one of the two copies of a gene carries the mutation. It is the copy inherited from the affected parent. This mutated gene is able to cause the disease even though the cells also contain a normal copy. The mutated gene somehow ‘dominates’ the action of the normal gene.
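The one in two chance follows directly from the way copies of a gene are handed down, and it can be sketched by listing every combination a child might inherit. The code below is a toy model, not a clinical tool: the function name and the letters 'M' (mutated copy) and 'n' (normal copy) are made up for the illustration, and it assumes each parent passes on either of their two copies with equal likelihood.

```python
from itertools import product


def affected_fraction(parent_a: str, parent_b: str) -> float:
    """Fraction of equally likely offspring genotypes carrying at least one 'M'.

    Each parent is a two-letter string: 'M' is the mutated copy, 'n' the normal one.
    In a dominant disorder a single 'M' is enough to cause the disease.
    """
    offspring = [a + b for a, b in product(parent_a, parent_b)]
    affected = [genotype for genotype in offspring if "M" in genotype]
    return len(affected) / len(offspring)


# One affected parent (one mutated copy, one normal) and one unaffected parent.
print(affected_fraction("Mn", "nn"))  # 0.5
```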

But myotonic dystrophy also had characteristics that were very different from a typical dominant disorder. For a start, dominant disorders don’t normally get worse as they are passed on from parent to child. There is no reason why they should, because the affected child inherits the same mutation as the affected parent. Patients with myotonic dystrophy also developed symptoms at earlier ages as the disorder was passed on down the generations, which again is unusual.

There was another way in which myotonic dystrophy was different from the normal genetic pattern. The severe congenital form of the disease, the one that affected Daniel, was only ever found in the children of affected mothers. Fathers never passed on this really severe form.

In the early 1990s a number of different research groups identified the genetic change that causes myotonic dystrophy. Fittingly for an unusual disease, it was a very unusual mutation. The myotonic dystrophy gene contains a small sequence of DNA that is repeated multiple times.2 The small sequence is made from three of the four ‘letters’ that make up the genetic alphabet used by DNA. In the myotonic dystrophy gene, this repeated sequence is formed by the letters C, T and G (the other letter in the genetic alphabet is A).

In people without the myotonic dystrophy mutation, there can be anything from five to around 30 copies of this CTG motif, one after the other. Children inherit the same number of repeats as their parents. But when the number of repeats gets larger, greater than 35 or thereabouts, the sequence becomes a bit unstable and may change in number when it is passed on from parent to child. Once it gets above 50 copies of the motif, the sequence becomes really unstable. When this happens, parents can pass on much bigger repeats to their children than they themselves possess. As the repeat length increases, the symptoms become more severe and are obvious at an earlier age. That’s why the disease gets worse as it passes down the generations, such as in the family that opened this chapter. It also became apparent that usually only mothers passed on the really big repeats, the ones that led to the severe congenital phenotype.
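The counting involved can be sketched in code. What follows is a rough illustration only: the function names and the toy sequence are invented, and the cut-offs simply mirror the approximate figures quoted above (more than 35 copies, more than 50 copies). The text does not say how to treat 31 to 35 copies, so lumping them in with 'stable' is an assumption of this sketch.

```python
import re


def longest_repeat_run(sequence: str, motif: str = "CTG") -> int:
    """Count the copies in the longest uninterrupted run of `motif`."""
    runs = re.findall(f"(?:{re.escape(motif)})+", sequence.upper())
    return max((len(run) // len(motif) for run in runs), default=0)


def stability(copies: int) -> str:
    """Label a repeat count using the approximate thresholds from the text."""
    if copies > 50:
        return "very unstable"
    if copies > 35:
        return "somewhat unstable"
    return "stable"  # assumption: anything up to 35 copies is treated as stable


# A made-up stretch of DNA containing twelve CTG repeats in a row.
toy_gene = "AAGT" + "CTG" * 12 + "GGAC"
copies = longest_repeat_run(toy_gene)
print(copies, stability(copies))  # 12 stable
```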

This ongoing expansion of a repeated sequence of DNA was a very unusual mutation mechanism. But the identification of the expansion that causes myotonic dystrophy shone a light on something even more unusual.

Knitting with DNA

Until quite recently, mutations in gene sequences were thought to be important not because of the change in the DNA itself but because of their downstream consequences. It’s a little like a mistake in a knitting pattern. The mistake doesn’t matter when it’s just a notation on a piece of paper. The mistake only becomes a problem when you knit something and end up with a hole in your sweater or three sleeves on your cardigan because of the error in the knitting code.

A gene (the knitting pattern) ultimately codes for a protein (the sweater). It’s proteins that we think of as the molecules in our cells that do all the work. They carry out an enormous number of functions. These include the haemoglobin in our red blood cells that carries oxygen around our bodies. Another protein is insulin, which is released from the pancreas to encourage muscle cells to take in glucose. Thousands and thousands of other proteins carry out the dizzying range of functions that underlie life.

Proteins are made from building blocks called amino acids. Mutations generally change the sequence of these amino acids. Depending on the mutation and where it lies in the gene, this can lead to a number of consequences. The abnormal protein may carry out the wrong function in a cell, or may not be able to work at all.

But the myotonic dystrophy mutation doesn’t change the amino acid sequence. The mutated gene still codes for exactly the same protein. It was incredibly difficult to understand how the mutation led to a disease, when there was nothing wrong with the protein.

It would be tempting to write off the myotonic dystrophy mutation as some bizarre outlier with no impact for the majority of biological circumstances. That way we could put it to one side and forget about it. But it’s not alone.

Fragile X syndrome is the commonest form of inherited learning disability. Mothers don’t usually have any symptoms but they pass the condition on to their sons. The mothers carry the mutation but are not affected by it. Like myotonic dystrophy, this disorder is also caused by increases in the length of a three-letter sequence. In this case, the sequence is CCG. And just like myotonic dystrophy, this increase doesn’t change the sequence of the protein encoded by the Fragile X gene.

Friedreich’s ataxia is a form of progressive muscle wasting in which symptoms normally appear in late childhood or early adolescence. In contrast to myotonic dystrophy, the parents are usually unaffected by the disorder. Both the mother and father are carriers. Each parent possesses one normal and one abnormal copy of the relevant gene. But if a child inherits a mutated copy from each parent, the child develops the disease. Friedreich’s ataxia is also caused by an increase in a three-letter sequence, GAA in this case. And once again it doesn’t change the sequence of the protein encoded by the affected gene.3

These three genetic diseases, so different in their family histories, symptoms and inheritance patterns, nevertheless told scientists something quite consistent: there are mutations that can cause disease without changing the amino acid sequence of proteins.

An impossible disease

An even more startling discovery was made a few years later. There is another inherited wasting disorder in which the muscles of the face, shoulders, and upper arms gradually weaken and degenerate. The disease is named after this pattern – it’s called facioscapulohumeral muscular dystrophy. Perhaps unsurprisingly, this is usually shortened to FSHD. Symptoms are usually detectable by the time a patient is in their early 20s. Like myotonic dystrophy, the disease is dominant and passed from affected parent to child.4

Scientists spent years looking for the mutation that causes FSHD. Eventually, they tracked it down to a repeated DNA sequence. But in this case the mutation is very different from the three-letter repeats found in myotonic dystrophy, fragile X syndrome and Friedreich’s ataxia. It is a stretch of over 3,000 letters. We can call this a block. In people who don’t suffer from FSHD, there are from eleven to about 100 blocks, one after another. But patients with FSHD have a small number of blocks, ten at most. That was unexpected. But the real shock for the researchers was that they really struggled to find a gene near the mutation.

Genetic diseases have given us great new insights into biology over the last hundred years or so. It’s easy to underestimate how hard-won some of that knowledge was. The identification of the mutations described here usually represented over a decade of work for significant numbers of people. It was entirely dependent on access to families who were willing to give blood samples and trace their family histories to help scientists home in on the key individuals to analyse.

This kind of analysis was so difficult because researchers were normally looking for a very small change in a very large landscape, hunting for a single specific acorn in a forest. This all became much easier from 2001 onwards, after the release of the human genome sequence. The genome is the entire sequence of DNA in our cells.

Because of the Human Genome Project, we know where all the genes are positioned relative to one another, and their sequences. This, together with enormous improvements in the technologies used to sequence DNA, has made it much faster and cheaper to find the mutations underlying even very rare genetic diseases.

But the completion of the human genome sequence has had an impact far beyond identifying the mutations that cause disease. It’s changing many of the most fundamental ideas that have held sway in biology since we first understood that DNA was our genetic material.

When considering how our cells work, almost every scientist over the last six decades has been focused on the impacts of proteins. But from the moment the human genome was sequenced, scientists have had to face a rather puzzling dilemma. If proteins are so all-important, why is only 2 per cent of our DNA devoted to coding for amino acids, the building blocks of proteins? What on earth is the other 98 per cent doing?

2. When Dark Matter Turns Very Dark Indeed

The astonishing percentage of the genome that didn’t code for proteins was a shock. But it was the scale of the phenomenon that was surprising, not the phenomenon itself. Scientists had known for many years that there were stretches of DNA that didn’t code for proteins. In fact, this was one of the first big surprises after the structure of DNA itself was revealed. But hardly anyone anticipated how important these regions would prove to be, nor that they would provide the explanation for certain genetic diseases.

At this point it’s worth looking in a little more detail at the building blocks of our genome. DNA is an alphabet, and a very simple one at that. It is formed of just four letters – A, C, G and T. These are also known as bases. But because our cells contain so much DNA, this simple alphabet carries an incredible amount of information. Humans inherit 3 billion of the bases that make up our genetic code from our mother, and a similar set from our father. Imagine DNA as a ladder, with each base representing a rung, and each rung being 25cm from the next. The ladder would stretch 750,000 kilometres, roughly twice the distance from the earth to the moon.

To think of it another way, the complete works of Shakespeare are reported to contain 3,695,990 letters.1 This means we inherit the equivalent of just over 811 books the length of the Bard’s canon from mum and the same number from dad. That’s a lot of information.
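The comparison is a single division, sketched below for anyone who wants to check it. The function name is invented for the example; the two numbers are the ones quoted in the text.

```python
def equivalent_copies(bases: int, letters_per_work: int) -> float:
    """How many works of a given length would hold the same number of letters."""
    return bases / letters_per_work


BASES_PER_PARENT = 3_000_000_000    # bases inherited from each parent
LETTERS_IN_SHAKESPEARE = 3_695_990  # reported letter count of the complete works

copies = equivalent_copies(BASES_PER_PARENT, LETTERS_IN_SHAKESPEARE)
print(f"{copies:.1f} copies of the complete works from each parent")
```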

If we extend our alphabet analogy a bit further, the DNA alphabet encodes words of just three letters each. Each three-letter word acts as the placeholder for a specific amino acid, one of the building blocks of proteins. A gene can be thought of as a sentence of three-letter words, which acts as the code for a sequence of amino acids forming a protein. This is summarised in Figure 2.1.
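Reading a gene three letters at a time is easy to mimic in code. The sketch below uses a handful of entries from the standard genetic code, written in DNA letters; the table is deliberately incomplete, the function name and the example sequence are invented, and the 'STOP' word that ends a protein goes slightly beyond what has been described so far.

```python
# A small excerpt of the standard genetic code (DNA spelling), not the full table.
CODON_TABLE = {
    "ATG": "Met", "GCC": "Ala", "AAA": "Lys",
    "TGG": "Trp", "GAA": "Glu", "CTG": "Leu",
    "TAA": "STOP",
}


def translate(gene: str) -> list[str]:
    """Read a gene three letters at a time and look up each amino acid."""
    protein = []
    usable_length = len(gene) - len(gene) % 3  # ignore any incomplete final word
    for i in range(0, usable_length, 3):
        amino_acid = CODON_TABLE.get(gene[i:i + 3], "???")  # '???' = not in this excerpt
        if amino_acid == "STOP":
            break
        protein.append(amino_acid)
    return protein


# Six three-letter words: five amino acids followed by a stop signal.
print(translate("ATGGCCAAACTGTGGTAA"))  # ['Met', 'Ala', 'Lys', 'Leu', 'Trp']
```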

Each cell usually contains two copies of any given gene. One was inherited from the mother and one from the father. But although there are only two copies of each gene in a cell, that same cell can create thousands and thousands of the protein molecules encoded by a specific gene.

This is because there are two amplification mechanisms built into gene expression. The sequence of bases in the DNA doesn’t act as the direct template for the protein. Instead, the cell makes copies of the gene. These copies are very similar to the DNA gene itself, but not identical. They have a slightly different chemical composition and are known as RNA (ribonucleic acid, instead of the deoxyribonucleic acid in DNA). Another difference is that in RNA, the base T is replaced by the base U. DNA is formed of two strands joined together via pairs of bases. We could visualise this as looking a little like a railway track. The two rails are held together by a base on one rail linking to a base on the other, as if the bases were holding hands. They only link up in a set pattern. T holds hands with A, C holds hands with G. Because of this arrangement, we tend to refer to DNA in terms of base pairs. RNA is a single-stranded molecule, just one rail. The key differences between DNA and RNA are shown in Figure 2.2. A cell can make thousands of RNA copies of a DNA gene really quickly, and this is the first amplification step in gene expression.
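The pairing rule and the swap of T for U can both be written down very compactly. This is a simplified sketch with invented function names and a made-up sequence; it ignores real-world details such as the direction in which each strand is read.

```python
# Which base holds hands with which across the two rails of DNA.
PAIRS = {"A": "T", "T": "A", "C": "G", "G": "C"}


def partner_strand(strand: str) -> str:
    """Return the bases on the opposite rail, position by position."""
    return "".join(PAIRS[base] for base in strand)


def rna_copy(dna_sequence: str) -> str:
    """Spell the same sequence in RNA letters: every T becomes U."""
    return dna_sequence.replace("T", "U")


strand = "ATGCTGCTG"
print(partner_strand(strand))  # TACGACGAC
print(rna_copy(strand))        # AUGCUGCUG
```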

Figure 2.1 The relationship between a gene and a protein. Each three-letter sequence in the gene codes for one building block in the protein.

The RNA copies of a gene are transported away from the DNA to a different part of the cell, called the cytoplasm. In this distinct region of the cell, the RNA molecules act as the placeholders for the amino acids that form a protein. Each RNA molecule can act as a template multiple times, and this introduces the second amplification step in gene expression. This is shown diagrammatically in Figure 2.3.
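As a rough illustration of how the two amplification steps compound, the sketch below simply multiplies them together. The function name and every number in it are invented for the example; the text gives no figures beyond 'thousands' of RNA copies.

```python
def protein_molecules(gene_copies: int, rna_per_gene: int, proteins_per_rna: int) -> int:
    """Multiply the two amplification steps to get a total protein count."""
    messenger_rnas = gene_copies * rna_per_gene   # first amplification step
    return messenger_rnas * proteins_per_rna      # second amplification step


# Hypothetical numbers: 2 gene copies, 1,000 RNA copies each, 50 proteins per RNA.
print(protein_molecules(2, 1000, 50))  # 100000
```

Even with modest made-up values, two gene copies end up directing the production of a very large number of protein molecules.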

Figure 2.2 The upper panel represents DNA, which is double-stranded. The bases – A, C, G and T – hold the two strands together by pairing up. A always pairs with T, and C always pairs with G. The lower panel represents RNA, which is single-stranded. The backbone of the strand has a slightly different composition from DNA, as indicated by the different shading. In RNA, the base T is replaced by the base U.

We can visualise this using the analogy of the knitting pattern from Chapter 1. The DNA gene is the original knitting pattern. This pattern can be photocopied multiple times, akin to producing the RNA. The copies can be sent to lots of people who can each knit the same pattern multiple times, just like creating the protein. It’s a simple but efficient operating model and it works – one original pattern resulted in lots of soldiers with warm feet in the Second World War.

Figure 2.3 A single copy of a DNA gene in the nucleus is used as the template to create multiple copies of a messenger RNA molecule. These multiple RNA molecules are exported out of the nucleus. Each can then act as the instructions for production of a protein. Multiple copies of the same protein can be produced from each messenger RNA molecule. There are therefore two amplification steps in generating protein from a DNA code. For simplicity, only one copy of the gene is shown, although usually there will be two – one inherited from each parent.

The RNA molecule acts as a messenger molecule, carrying a gene sequence from the DNA to the protein assembly factory. Rather logically it is therefore known as messenger RNA.

Taking out the nonsense

So far, things might seem very straightforward but scientists discovered quite some time ago that there is a strange complication. Most genes are split up into bits that code for the amino acids in a protein and intervening bits that don’t. The bits that don’t are like gobbledegook in the middle of a string of sensible words. These intervening bits of nonsense are known as introns.

When the cell makes RNA, it originally copies all of the DNA letters in a gene, including the bits that don’t code for amino acids. But then the cell removes all the bits that don’t code for protein, so that the final messenger RNA is a good instruction set for the final protein. This process is known as splicing, and Figure 2.4 shows diagrammatically how this happens.

As Figure 2.4 shows, a protein is encoded from modular blocks of information. This modularity gives the cell a lot of flexibility in how it processes the RNA. It can vary the modules which it joins together from a messenger RNA molecule, creating a range of final messengers that code for related but non-identical proteins. This is shown in Figure 2.5.

Figure 2.4 In step 1, DNA is copied into RNA. In step 2, the RNA is processed so that only the amino acid-coding regions, denoted by boxes containing letters, are joined together. The intervening junk regions are removed from the mature messenger RNA molecule.

The bits of gobbledegook between the parts of a gene that code for amino acids were originally considered to be nothing but nonsense or rubbish. They were referred to as junk or garbage DNA, and pretty much dismissed as irrelevant. As mentioned earlier, from here on in, we’ll use the term ‘junk’ to denote any DNA that doesn’t code for protein.

Figure 2.5 An RNA molecule can be processed in different ways. As a result, different amino acid-coding regions can be joined together. This allows different versions of a protein molecule to be produced from one original DNA gene.

But we now know that these junk regions can have a very big impact. Friedreich’s ataxia, which we met in Chapter 1, is caused by an abnormally expanded stretch of GAA repeats in one of the junk regions, between two sections that encode amino acids. This raised the perfectly reasonable question – if the mutation doesn’t affect the amino acid sequence, why do people with this mutation develop such debilitating symptoms?

The mutation in the Friedreich’s ataxia gene occurs in the junk region between the first two amino acid-coding regions. In Figure 2.5, this would be between regions ‘D’ and ‘E’. A normal gene contains from five to 30 GAA repeats but a mutated gene contains from 70 up to 1,000 repeated GAA motifs.2 Researchers showed that when cells contained this expanded repeat, they stopped producing the messenger RNA encoded by the gene. Because they didn’t make messenger RNA, they couldn’t make the protein either. If you don’t send out the copies of the knitting patterns, the soldiers don’t get socks.
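Counting repeats of this kind is a simple pattern-matching exercise. The sketch below is illustrative only: the function names and the test sequences are invented, and the two ranges are the ones quoted above (five to 30 repeats for a normal gene, 70 to 1,000 for a mutated one). Counts that fall outside both ranges are reported as such rather than guessed at.

```python
def longest_repeat_run(sequence: str, unit: str) -> int:
    """Count the longest uninterrupted run of `unit` anywhere in `sequence`."""
    if not unit:
        raise ValueError("unit must not be empty")
    best = 0
    for start in range(len(sequence)):
        count = 0
        position = start
        # Keep stepping forward while the next letters spell the unit again.
        while sequence.startswith(unit, position):
            count += 1
            position += len(unit)
        best = max(best, count)
    return best


def classify_gaa(repeats: int) -> str:
    """Place a GAA repeat count in the ranges quoted in the text."""
    if 5 <= repeats <= 30:
        return "normal range"
    if 70 <= repeats <= 1000:
        return "expanded range"
    return "outside the ranges quoted in the text"


# Invented sequences: short flanking letters around a block of GAA repeats.
typical = "CCT" + "GAA" * 8 + "TTG"
expanded = "CCT" + "GAA" * 120 + "TTG"
print(classify_gaa(longest_repeat_run(typical, "GAA")))   # normal range
print(classify_gaa(longest_repeat_run(expanded, "GAA")))  # expanded range
```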

In fact, the cells didn’t even make the long, unprocessed RNA copy of the gene.3 The big GAA expansion acts as a ‘sticky’ region, which prevents good copying of the DNA. It’s analogous to trying to photocopy a 50-page document, when pages four to twelve have been glued together. They won’t feed into the copier, and the process grinds to a halt, for that particular document. In the case of the Friedreich’s ataxia gene, no copying means no RNA, which means no protein.

It’s not completely clear why lack of the protein encoded by the Friedreich’s ataxia gene causes the disease symptoms. The protein seems to be involved in preventing iron overload in the parts of the cell that generate energy.4 When a cell fails to produce the protein, the iron rises to toxic levels. Some cell types seem to be more sensitive than others to iron levels, and these include the ones affected in the disease.

A related but different mechanism accounts for Fragile X syndrome, the form of learning disability we encountered in Chapter 1. The mutation in Fragile X syndrome is the expansion of a CCG three-base repeat. Similarly to the Friedreich’s ataxia mutation, there are usually fifteen to 65 copies of the repeat on a normal chromosome. On a chromosome carrying the Fragile X mutation there are from around 200 to several thousand copies.5,6 But the expansion lies in a different part of the gene in Fragile X compared with Friedreich’s ataxia. The mutation is found before the first amino acid-coding region, essentially in the junk to the left of block ‘D’ in Figure 2.5. When the junk repeat gets very large, no messenger RNA is produced, and consequently there is no protein produced from this gene.7

The function of the Fragile X protein is to carry lots of different RNA molecules around in the cell. This gets them to the correct locations, influences how these RNAs are processed and how they generate proteins. If there is no Fragile X protein, the other RNA molecules aren’t properly regulated, and this plays havoc with the normal functioning of the cell.8 For reasons that aren’t clear, the neurons in the brain seem particularly sensitive to this effect, hence the learning disability in this disorder.

An everyday analogy may help with visualising this. In the UK, a relatively small amount of snow can incapacitate the transport networks. The snow covers the roads and the railway tracks, preventing cars and trains from moving. When this happens, people can’t get to their place of work and this creates all sorts of problems. Schools can’t open, deliveries aren’t made, banks can’t dispense cash, etc. One starting event – the snow – has all sorts of consequences because it ruins the transport systems in society. A similar thing happens in Fragile X syndrome. Just like snow on the roads and railway tracks, the effect of the mutation is to mess up a transport system in the cell, with multiple knock-on effects.

Switching off the expression of a specific gene is the key step in the pathology of both Friedreich’s ataxia and Fragile X syndrome. Support for this hypothesis has been provided by very rare cases of both disorders. There are small numbers of patients where the repeat in the junk regions is of the same small size found in most healthy people. In these patients, there are mutations that change the sequence in the amino acid-coding regions. These particular amino acid sequence changes actually make it impossible for the cell to produce the protein. In other words, it doesn’t matter why the protein isn’t expressed. If it’s not expressed, the patients have the symptoms.

Just when you have a nice theory

So far it might seem like there’s a nice straightforward theme emerging. We could speculate that expansions in the junk regions are only important because they create abnormal DNA. This DNA isn’t handled properly by the cells, resulting in a lack of specific important proteins. We could suggest that normally these junk regions are unimportant, with no significant role in the cell.

But there is something that argues against this. The normal range of repeats in both the Fragile X and Friedreich’s ataxia genes is found in all human populations, and has been retained throughout human evolution. If these regions were completely nonsensical we would expect them to have changed randomly over time, but they haven’t. This suggests that the normal repeats have some function.

But the real grit in this genetic oyster comes from myotonic dystrophy, the disorder that opened Chapter 1. The myotonic dystrophy expansion gets bigger as it passes down the generations. A parent’s chromosome may contain the sequence CTG repeated 100 times, one after another. But by the time the parent passes this on to their child, the repeat may have expanded so that the child’s chromosome has the sequence CTG repeated 500 times. As the number of CTG repeats gets larger, the disease becomes more and more severe. This isn’t what we would expect if the expansion just switches off the nearby gene. All cells of someone with myotonic dystrophy contain two copies of the gene. One carries the normal number of repeats, and the other carries the expanded number. So, one copy of the gene should always be producing the normal amount of protein. That would mean that the overall level of the protein should drop by about 50 per cent at most.

We could hypothesise that as the repeat gets longer there is progressively less gene expression from the mutant version of the gene. This could lead to a gradual decline in the amount of protein produced overall. This could range from a 1 per cent drop overall for fairly small expansions, to a 50 per cent final decrease for the large ones. This could lead to different symptoms. The problem is that there aren’t really any inherited genetic diseases like this. We just don’t see disorders where very minor variations in expression have such a big effect (all patients with the expansion develop symptoms), but with such fine tuning between patients (the symptoms becoming more extreme as the expansion lengthens).
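The 50 per cent ceiling in this hypothesis follows from simple arithmetic. The sketch below assumes, purely for illustration, that each of the two gene copies normally contributes half of the protein and that the normal copy is unaffected; the function name is invented.

```python
def overall_drop(mutant_output: float) -> float:
    """Fractional drop in total protein when the normal copy works fully.

    `mutant_output` is the mutant copy's output as a fraction of normal
    (1.0 = unaffected, 0.0 = completely switched off).
    """
    normal_output = 1.0
    total = (normal_output + mutant_output) / 2
    return 1.0 - total


for mutant_output in (0.98, 0.5, 0.0):
    print(f"mutant copy at {mutant_output:.0%} -> overall drop of {overall_drop(mutant_output):.0%}")
```

A mutant copy running at 98 per cent of normal gives the 1 per cent overall drop mentioned above, and a completely silent copy gives the 50 per cent maximum.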

It’s worth looking at where the expansion occurs in the myotonic dystrophy gene. It’s right at the far end, after the last amino acid-coding region. In Figure 2.5, this would be on the horizontal line to the right of box ‘G’. This means that the entire amino acid-coding region can be copied into RNA before the copying machinery encounters the expansion.

It’s now clear that the expansion itself gets copied into RNA. It is even retained when the long RNA is processed to form the messenger RNA. The myotonic dystrophy messenger RNA does something unusual. It binds lots of protein molecules that are present in the cell. The bigger the expansion, the more protein molecules get bound. The mutant myotonic dystrophy messenger RNA acts like a kind of sponge, mopping up more and more of these proteins. The proteins that bind to the expansion in the myotonic dystrophy messenger RNA are normally involved in regulating lots of other messenger RNA molecules. They influence how well messenger RNA molecules are transported in the cell, how long the messenger RNA molecules survive in the cell and how efficiently they encode proteins. But if all these regulators are mopped up by the expansion in the myotonic dystrophy gene messenger RNA, they aren’t available to do their normal job.9 This is shown in Figure 2.6.

Again an analogy may help. Imagine a city where every member of the police force is engaged in controlling a riot in a single location. There will be no officers left for normal policing, and burglars and car thieves may run amok elsewhere in the city. It’s the same principle in the cells of people with the myotonic dystrophy mutation. The CTG repeat sequence expansion in a single gene – the myotonic dystrophy gene – ultimately leads to mis-regulation of a whole number of other genes in the cell.

Figure 2.6 The upper panel shows the normal situation. Specific proteins, represented by the chevron, bind to the CTG repeat region on the myotonic dystrophy messenger RNA. There are plenty of these protein molecules available to bind to other messenger RNAs to regulate them. In the lower panel, the CTG sequence is repeated many times on the mutated myotonic dystrophy messenger RNA. This mops up the specific proteins, and there aren’t enough left to regulate other messenger RNAs. For clarity, only a small number of repeats have been represented. In severely affected patients, they may number in the thousands.

This mechanism also accounts for the link between repeat length and severity, because the expansion mops up more and more of the binding proteins as it gets larger. This leads to disruption of a greater quantity of other messenger RNAs, causing problems for increasing numbers of cellular functions. This eventually results in the wide range of symptoms found in patients carrying the myotonic dystrophy mutation, and explains why the patients with the largest repeats have the most severe clinical problems.
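The sponge effect can be captured in a toy model. Everything in it is an assumption made for illustration: the size of the protein pool, the idea that each repeat binds a fixed number of protein molecules, and the function name. Real binding is unlikely to be this tidy.

```python
def free_regulators(pool: int, repeats: int, bound_per_repeat: int = 1) -> int:
    """Regulator proteins left over after the repeat region has bound its share."""
    bound = min(pool, repeats * bound_per_repeat)  # the sponge cannot bind more than exists
    return pool - bound


# Hypothetical pool of 1,000 regulator molecules in one cell.
for repeats in (20, 100, 500, 2000):
    print(f"{repeats} repeats -> {free_regulators(1000, repeats)} regulators left free")
```

In this toy version the number of free regulators falls steadily as the repeat grows and eventually reaches zero, which mirrors the pattern of worsening symptoms with longer expansions.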

Just as we saw in Friedreich’s ataxia and Fragile X syndrome, the normal CTG repeat sequences in the myotonic dystrophy gene have been highly conserved in human evolution. This is consistent with them having a healthy and important functional role. We are even more convinced this is the case for the myotonic dystrophy gene because of the proteins that bind to the repeat in the messenger RNA. These also bind to shorter repeat lengths, of the size that are present in normal genes. They just don’t bind in the same abundance as they do when the repeat has expanded.

It’s clear from the myotonic dystrophy example that there is a reason why messenger RNA molecules contain regions that don’t code for proteins. These regions are critical for regulating how the messenger RNAs are used by the cells, and create yet another level of control, fine-tuning the amount of protein ultimately produced from a DNA gene template. But what no one appreciated when the myotonic dystrophy mutation was identified, almost ten years before the release of the human genome sequence, was just how extraordinarily complex and variable this fine-tuning would turn out to be.

3. Where Did All the Genes Go?

On 26 June 2000, it was announced that the initial draft of the sequence of the human genome had been completed. In February 2001, the first papers describing this draft sequence in detail were released. It was the culmination of years of work and technological breakthroughs, and more than a little rivalry. The National Institutes of Health in the USA and the Wellcome Trust in the UK had poured in the majority of the approximately $2.7 billion1 required to fund the research. This was carried out by an international consortium, and the first batch of papers detailing the findings included over 2,500 authors from more than 20 laboratories worldwide. The bulk of the sequencing was carried out by five laboratories, four of them in the US and one in the UK. Simultaneously, a private company called Celera Genomics was attempting to sequence and commercialise the human genome. But by releasing their data on a daily basis as soon as it was generated, the publicly funded consortium was able to ensure that the sequence of the human genome entered the public domain.2

An enormous hoopla accompanied the declaration that the draft human genome had been completed. Perhaps the most flamboyant statement was from US President Bill Clinton, who declared that ‘Today we are learning the language in which God created life’.3 We can only speculate on the inner feelings of some of the scientists who had played such a major role in the project as a politician invoked a deity at the moment of technological triumph. Luckily, researchers tend to be a shy lot, especially when confronted by celebrities and TV cameras, so few expressed any disquiet publicly.

Michael Dexter was the Director of the Wellcome Trust, which had poured enormous sums of money into the Human Genome Project. He was not much less fulsome, albeit somewhat less theistic, when he defined the completion of the draft sequence as ‘The outstanding achievement not only of our lifetime, but in terms of human history’.4

You might not be alone in thinking that perhaps other discoveries have given the Human Genome Project a run for its money in terms of impact. Fire, the wheel, the number zero and the written alphabet spring to mind, and you probably have others on your own list. It could also be claimed that the human genome sequence has not yet delivered on some of the claims that were made about how quickly it would impact on human disease. For instance, David Sainsbury, the then UK Science Minister, stated that ‘We now have the possibility of achieving all we ever hoped for from medicine’.5

Most scientists knew, however, that these claims should be taken with whole shovelfuls of salt, because we have been taught this by the history of genetics. Consider a couple of relatively well-known genetic diseases. Duchenne muscular dystrophy is a desperately sad disorder in which affected boys gradually lose muscle mass, degenerate physically, lose mobility and typically die in adolescence. Cystic fibrosis is a genetic condition in which the lungs can’t clear mucus, and the sufferers are prone to severe life-threatening infections. Although some cystic fibrosis patients now make it to the age of about 40, this is only with intensive physical therapy to clear their lungs every day, plus industrial levels of antibiotics.

The gene that is mutated in Duchenne muscular dystrophy was identified in 1987 and the one that is mutated in cystic fibrosis was identified in 1989. Despite the fact that mutations in these genes were shown to cause disease over a decade before the completion of the human genome sequence, there are still no effective treatments for these diseases after 20-plus years of trying. Clearly, there’s going to be a long gap between knowing the sequence of the human genome, and developing life-saving treatments for common diseases. This is especially the case when diseases are caused by more than one gene, or by the interplay of one or more genes with the environment, which is the case for most illnesses.

But we shouldn’t be too harsh on the politicians we have quoted. Scientists themselves drove quite a lot of the hype. If you are requesting the better part of $3 billion of funding from your paymasters, you need to make a rather ambitious pitch. Knowing the human genome sequence is not really an end in itself, but that doesn’t make it unimportant as a scientific endeavour. It was essentially an infrastructure project, providing a dataset without which vast quantities of other questions could never be answered.

There is, of course, not just one human genome sequence. The sequence varies between individuals. In 2001, it cost just under $5,300 to sequence a million base pairs of DNA. By April 2013, this cost had dropped to six cents. This means that if you had wanted to have your own genome sequenced in 2001, it would have cost you just over $95 million. Today, you could generate the same sequence for just under $6,000,6 and at least one company is claiming that the era of the $1,000 genome is here.7 Because the cost of sequencing has decreased so dramatically, it’s now much easier for scientists to study the extent of variation between individual humans, which has led to a number of benefits. Researchers are now able to identify rare mutations that cause severe diseases but only occur in a small number of patients, often in genetically isolated populations such as the Amish communities in the United States.8 It’s possible to sequence tumour cells from patients to identify mutations that are driving the progression of a cancer. In some cases, this results in patients receiving specific therapies that are tailored for their cancer.9 Studies of human evolution and human migration have been greatly enhanced by analysing DNA sequences.10

Honey, I lost the genes

But all this was for the future. In 2001, amidst all the hoopla, scientists were poring over the data from the human genome sequence and pondering a simple question: where on earth were all the genes? Where were all the sequences to code for the proteins that carry out the functions of cells and individuals? No other species is as complex as humans. No other species builds cities, creates art, grows crops or plays ping-pong. We may argue philosophically about whether any of this makes us ‘better’ than other species. But the very fact that we can have this argument is indicative of our undoubtedly greater complexity than any other species on earth.

What is the molecular explanation for our complexity and sophistication as organisms? There was a reasonable degree of consensus that the explanation would lie in our genes. Humans were expected to possess a greater number of protein-coding genes than simpler organisms such as worms, flies or rabbits.

By the time the draft human genome sequence was released, scientists had completed the sequencing of a number of other organisms. They had focused on ones with smaller and simpler genomes than humans, and by 2001 had sequenced hundreds of viruses, tens of bacteria, two simple animal species, one fungus and one plant. Researchers had used data from these species, along with data from a variety of other experimental approaches, to estimate how many genes would be found in the human genome. Estimates ranged from 30,000 to 120,000, revealing a considerable degree of uncertainty. A figure of about 100,000 was frequently bandied about in the popular press, even though this had not been intended as a definitive estimate. A value in the region of 40,000 was probably considered reasonable by most researchers.

But when the draft human sequence was released in February 2001, researchers couldn’t find 40,000 protein-coding genes, let alone 100,000. The scientists from Celera Genomics identified 26,000 protein-coding genes, and tentatively identified an additional 12,000. The scientists from the public consortium identified 22,000 and predicted there would be 31,000 in total. In the years since the publication of the draft sequence, the number has consistently decreased and it is now generally accepted that the human genome contains about 20,000 protein-coding genes.11

It might seem odd that scientists didn’t immediately agree on the numbers of genes as soon as the draft sequence was released. But that’s because identifying genes relies on analysing sequence data and isn’t as easy as it sounds. It’s not as if genes are colour-coded, or use a different set of genetic letters from the other parts of the genome. To identify a protein-coding gene, you have to analyse specific features such as sequences that can code for a stretch of amino acids.

As we saw in Chapter 2, protein-coding genes aren’t formed from one continuous sequence of DNA. They are constructed in a modular fashion, with protein-coding regions interrupted by stretches of junk. In general, human genes are much longer than the genes in fruit flies or the microscopic worm called C. elegans, which are very common model systems in genetic studies. But human proteins are usually about the same size as the equivalent proteins in the fly or the worm. It’s the junk interruptions in the human genes that are very big, not the bits that code for protein. In humans, these intervening sequences are often ten times as long as in simpler organisms, and some can be tens of thousands of base pairs in length.

This creates a big signal-to-noise problem when analysing genes in human sequences. Even within one gene there’s just a small region that codes for protein, embedded in a huge stretch of junk.

So, back to the original problem. Why are humans such complicated organisms, if our protein-coding genes are similar to those from flies and worms? Some of the explanation lies in the splicing that we saw in Chapter 2. Human cells are able to generate a greater variety of protein variants from one gene than simpler organisms. Over 60 per cent of human genes generate multiple splicing variants. Look again at Figure 2.5 (page 18). A human cell could produce the proteins DEPARTING, DEPART, DEAR, DART, EAT and PARTING. It might produce these proteins in different ratios in different tissues. For example, DEPARTING, DEAR and EAT could all be produced at high levels in the brain, but the kidney might only express DEPARTING and DART. And the kidney cells might produce 20 times as much of DART as of DEPARTING. In lower organisms, cells may only be able to produce DEPARTING and PARTING, and they may produce them at relatively fixed ratios in different cells. This splicing flexibility allows human cells to produce a much greater diversity of protein molecules than lower organisms.
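The splicing choices in this example can be sketched in code. The sketch assumes, for illustration, that the gene in Figure 2.5 consists of nine coding blocks spelling DEPARTING, each holding one letter and separated by junk (shown here as '-'); the function name and the way blocks are selected by position are invented.

```python
# Assumption: nine one-letter coding blocks separated by junk ("-").
UNPROCESSED_RNA = ["D", "-", "E", "-", "P", "-", "A", "-", "R", "-",
                   "T", "-", "I", "-", "N", "-", "G"]


def splice(rna, keep=None):
    """Drop the junk and join the coding blocks, optionally keeping only some.

    `keep` is a set of block positions (0 = D, 1 = E, ... 8 = G).
    """
    blocks = [segment for segment in rna if segment != "-"]
    if keep is not None:
        blocks = [block for i, block in enumerate(blocks) if i in keep]
    return "".join(blocks)


print(splice(UNPROCESSED_RNA))                      # DEPARTING
print(splice(UNPROCESSED_RNA, keep={0, 1, 3, 4}))   # DEAR
print(splice(UNPROCESSED_RNA, keep={0, 3, 4, 5}))   # DART
print(splice(UNPROCESSED_RNA, keep={1, 3, 5}))      # EAT
print(splice(UNPROCESSED_RNA, keep=set(range(2, 9))))  # PARTING
```

Each variant keeps the blocks in their original order and only changes which ones are included, which is why a single gene can yield several related proteins.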

The scientists analysing the human genome had speculated that there might be protein-coding genes that are specific to humans, which could account for our increased complexity. But this doesn’t seem to be the case. There are nearly 1,300 gene families in the human genome. Almost all of these gene families occur through all branches of the kingdom of life, from the simplest organisms upwards. There is a subset of about 100 families that are specific to animals with backbones but even these were generated very early in vertebrate evolution. These vertebrate-specific gene families tend to be involved in complex processes such as the parts of the immune system that remember an infection; sophisticated brain connections; blood clotting; signalling between cells.

It’s a little as if our protein-coding genome has been built from a giant LEGO kit. Most LEGO kits, especially the large starter boxes, contain a selection of bricks that are variations on a small number of themes. Rectangles and squares, some sloping pieces, perhaps a few arches. Various colours, proportions and thicknesses, but all basically similar. And from these you can build pretty much all basic structures, from a two-brick step to an entire housing development. It’s only when you need to build something extremely specialist, like the Death Star, that it’s necessary to have very unusual pieces that don’t fit the basic LEGO templates.

Throughout evolution, genomes have developed by building out from a standard set of LEGO templates, and only very rarely have they created something completely new. So we can’t explain human complexity by claiming we have lots of unusual human-specific protein-coding genes. We simply don’t.

But where this all becomes odd is when we compare the size of the human genome with that of other organisms. Looking at Figure 3.1, we can see that the human genome is much bigger than that of C. elegans and much, much bigger than that of yeast. But in terms of numbers of protein-coding genes, there isn’t anything like as great a difference.

Figure 3.1 In the upper panel, the areas of the circle represent the relative sizes of the genomes in humans, a microscopic worm and single-celled yeast. The human genome is much bigger than those from the simpler organisms. The lower panel represents the relative numbers of protein-coding genes in each of the three species. The disparity here between humans and the other two organisms is much less than in the top panel. The large relative size of the human genome clearly can’t be explained solely in terms of numbers of protein-coding genes.

These data demonstrated convincingly that the human genome contains an extraordinary amount of DNA that doesn’t code for proteins. Ninety-eight per cent of our genetic material doesn’t act as the template for those all-important molecules believed to carry out the key functions of a cell or an organism. Why do we have so much junk?

Poisonous fish and genetic insulation

One possibility is that the question is irrelevant or inappropriate. Maybe the junk has no function or biological significance. It can be a mistake to assume that because something is present, it has a reason to be there. The human appendix serves no useful purpose; it’s just an evolutionary hangover from our ancestral lineages. Some scientists speculated back in 2001 that this might also be true of most of the junk DNA in the human genome.