Human genome: End of the beginning
Nature 431, 915-916 (21 October 2004) | doi: 10.1038/4319
Lincoln D. Stein
This issue of Nature features an article (ref. 1) entitled "Finishing
the euchromatic sequence of the human genome". It has been
authored by members of the International Human Genome Sequencing
Consortium (IHGSC), and appears on page 931. The article marks
the latest, but by no means the last, milestone in this historic
project. But readers can be forgiven for being a bit confused
by the announcement. Wasn't the human genome 'finished' several
years ago?

The answer is 'yes', and 'no'. Early in 2001, the duelling IHGSC
(public) and Celera Corporation (private) groups published papers
in Nature (ref. 2) and Science (ref. 3) describing the completion of so-called
'draft' sequences. These sequences have revolutionized molecular
biology by largely eliminating the need to clone and sequence
genes involved in human health and disease. Instead of going to
the bench, biologists now go to the web to look up gene sequences
in public online databases.

But despite their immediate usefulness, the draft sequences were far
from perfect. Both drafts were missing some 10% of the so-called
'euchromatin' — the gene-rich portion of the genome — and some
30% of the genome as a whole (which includes the gene-poor regions
of 'heterochromatin'). The drafts contained hundreds of thousands
of gaps, and had misassembled regions where portions of the genome
were flipped or misplaced. As a result, any large-scale analyses
of the genome, such as studies of the mechanisms of gene evolution
or the long-range structure of the genome, had to contend with
numerous uncertainties and artefacts. For example, studies of
'pseudogenes', the dying remnants of genes whose accumulated
mutations have rendered them non-functional, had to allow for
the possibility that any apparent pseudogene was instead the result
of a sequencing error.

Since the publication of the drafts, the IHGSC sequencing centres have
quietly undertaken a laborious 'finishing' process, in which each
gap in the draft was individually examined and subjected to a
battery of steps involving cloning and resequencing stretches
of DNA. The sequence announced today has just 341 gaps remaining,
and consists of contiguous runs of sequence averaging 38 million
base pairs. The authors estimate that the finished sequence covers
99% of the euchromatic portion of the genome and that the overall
error rate is less than 1 error per 100,000 base pairs. This substantially
exceeds the original goals for the project.
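
To put the quoted error rate in perspective, the short sketch below converts it to a Phred-style quality score and to an upper bound on the total number of errors. The Phred formula is standard sequencing practice rather than something stated in this article, and the sequence length of 2.85 billion base pairs is an assumption taken from outside the text. The function names are illustrative only.

```python
import math

def phred_quality(error_rate: float) -> float:
    """Phred-scaled quality: Q = -10 * log10(error probability)."""
    return -10.0 * math.log10(error_rate)

def max_expected_errors(sequence_length_bp: int, errors_per_bp: float) -> float:
    """Upper bound on errors if the per-base error rate holds across the sequence."""
    return sequence_length_bp * errors_per_bp

# Upper bound quoted in the text: fewer than 1 error per 100,000 base pairs.
FINISHED_ERROR_RATE = 1 / 100_000

# Assumption (not from this article): roughly 2.85 billion bases of finished sequence.
ASSUMED_LENGTH_BP = 2_850_000_000

print(round(phred_quality(FINISHED_ERROR_RATE)))                           # 50
print(round(max_expected_errors(ASSUMED_LENGTH_BP, FINISHED_ERROR_RATE)))  # 28500
```

Under these assumptions the finished sequence would contain fewer than about 28,500 erroneous bases in total.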

The finishing procedure roughly doubled the total time and cost of
the project. Does it contribute anything new to our understanding
of the genome? It does indeed, and to prove the point the authors
of the current paper (ref. 1) describe several large-scale analyses of
the genome that would have been difficult to perform on the draft
sequence. One analysis examines the processes of gene birth and
death. The authors find 1,183 human genes that show evidence of
having been recently 'born' by a process of gene duplication and
divergence. They also find 37 genes that seem to have recently
'died' by acquiring a mutation that rendered the gene non-functional.
The resulting pseudogene then slowly degrades and disappears.

In a second analysis, the authors use the finished sequence to map
out segmental duplications — large regions of the genome that
have duplicated in recent evolution. They find that 5% of the
genome is involved in segmental duplications, and that the distribution
of these regions varies widely across the chromosomes. Knowing
the nature and extent of such duplications is important for understanding
the evolution of the human genome, and for studying the many medically
relevant disorders that involve segmental duplications,
such as DiGeorge syndrome and Charcot-Marie-Tooth syndrome.

Another paper in this issue, by She et al. (ref. 4; page 927), directly compares
the outcomes of this second analysis with results obtained on
an unfinished version of the human genome (an improved version
of the Celera draft). She et al. find that the draft version artefactually
'simplifies' the genome by eliminating many duplicated regions.
Their results bear on one of the highly publicized differences
between the public and private genome projects. The public project
used an older strategy in which the genome was first cloned into
bacterial artificial chromosomes (BACs); the clones were then
mapped, and each clone was sequenced and assembled
individually. Celera championed an untested technique, 'whole-genome
shotgun' (WGS), in which the entire genome was shattered into
bite-size pieces, sequenced, and then assembled by software in
one conceptually simple step.
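
A toy illustration, under heavy simplifying assumptions, of why duplicated regions are hard for a shotgun-only assembly: any read that falls wholly inside one copy of an exact duplication is identical to a read from the other copy, so software cannot tell which copy it came from. This is not Celera's assembler or the analysis of She et al. The genome is random, the duplication is perfect, reads are error-free, and all names are hypothetical.

```python
import random

def random_dna(length, rng):
    """Random DNA string of the given length (toy data only)."""
    return "".join(rng.choice("ACGT") for _ in range(length))

def shotgun_reads(genome, read_length):
    """Every error-free window of the genome: an idealized, exhaustive 'shotgun'."""
    return [genome[i:i + read_length]
            for i in range(len(genome) - read_length + 1)]

rng = random.Random(0)
READ_LENGTH = 25

# A made-up genome: unique flanks around two identical copies of one segment.
duplicated = random_dna(200, rng)
genome = (random_dna(300, rng) + duplicated +
          random_dna(300, rng) + duplicated +
          random_dna(300, rng))

reads = shotgun_reads(genome, READ_LENGTH)
distinct_reads = set(reads)

# Reads lying entirely within one copy have an identical twin in the other copy.
inside_one_copy = len(duplicated) - READ_LENGTH + 1
ambiguous = len(reads) - len(distinct_reads)

print(f"{len(reads)} reads, {len(distinct_reads)} distinct")
print(f"at least {inside_one_copy} reads cannot be assigned to a single copy")
```

In a clone-by-clone approach, each read is already tied to a mapped BAC, which is one way such ambiguity can be reduced.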

Celera proved that the WGS technique is technically feasible and
provides a dramatic cost saving over the clone-by-clone approach.
Although the Celera draft has languished because the availability
of public data in free online databases undermined the company's
business plan to sell genome-database subscriptions, the effort
left a permanent mark on the public project. Almost all genome-sequencing
projects since then have used some form of WGS. The cautionary
results contained in the new papers from the IHGSC (ref. 1) and She et
al. (ref. 4) argue for a hybrid strategy in which WGS is supplemented
by a modest amount of BAC cloning and mapping. This would protect
draft WGS sequences from some of the 'simplification' reported
by She et al. and provide the clones needed for finishing selected
regions of special interest.

What is next for the human genome project? Even with a finished sequence
in hand there is much still to do. Surprisingly, one task is to
develop the definitive catalogue of protein-coding genes. In the
current paper (ref. 1), the number is estimated to be between 20,000 and
25,000. This wide range reflects limitations to state-of-the-art
gene-prediction software that leave doubts about the validity
of many predicted genes. One promising approach is to use comparative
genomics to align the human genome with the genomes of other animals.
Because natural selection ensures that functional regions are
more highly conserved than non-functional ones, this approach
highlights candidate protein-coding regions. The same approach
shows promise for finding other functional elements such as gene
promoters, which control the timing and level of expression of
genes, and micro-RNAs, which have been implicated as regulatory
agents of many developmental processes.
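
The idea behind the comparative approach can be sketched in a few lines: given two sequences that have already been aligned, score each window by the fraction of identical positions and flag the highly conserved ones. This is a minimal sketch, not the method used by any particular genome project. The sequences are invented, the alignment is assumed to be gap-free, and the window size and threshold are arbitrary.

```python
def window_identity(seq_a, seq_b, window):
    """Fraction of matching positions in each non-overlapping window of two
    pre-aligned, equal-length sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    scores = []
    for start in range(0, len(seq_a) - window + 1, window):
        a = seq_a[start:start + window]
        b = seq_b[start:start + window]
        matches = sum(x == y for x, y in zip(a, b))
        scores.append(matches / window)
    return scores

def conserved_windows(seq_a, seq_b, window, threshold):
    """Start positions of windows whose identity meets the threshold."""
    scores = window_identity(seq_a, seq_b, window)
    return [i * window for i, score in enumerate(scores) if score >= threshold]

# Invented example: the first and last windows are conserved, the middle is not.
human = "ATGGCCAAGT" + "TTACGGATCA" + "GGCTTAACCG"
other = "ATGGCCAAGT" + "CAGTTACGTG" + "GGCTTAACCA"

print(window_identity(human, other, 10))         # [1.0, 0.0, 0.9]
print(conserved_windows(human, other, 10, 0.9))  # [0, 20]
```

Real analyses must first compute the alignment and must handle gaps, and they usually compare several species at once. The windows flagged here are only candidates for functional regions.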

Much farther in the future is the task of sequencing the remaining
20% of the genome that lies within heterochromatin, the gene-poor,
highly repetitive sequence that is implicated in the processes
of chromosome replication and maintenance. The repetitiveness
of heterochromatin means that it cannot be tackled using current
sequencing methods, and new technologies will have to be developed
to attack it. So don't be shocked to see another paper announcing
the 'finishing' of the human genome in 2010 — it will describe
how the heterochromatin problem has been cracked.

In sequencing the human genome, researchers have already climbed
mountains and travelled a long and winding road. But we are only
at the end of the beginning: ahead lies another mountain range
that we will need to map out and explore as we seek to understand
how all the parts revealed by the genome sequence work together
to make life.
References
1. International Human Genome Sequencing Consortium. Nature 431, 931-945 (2004).
2. International Human Genome Sequencing Consortium. Nature 409, 860-921 (2001).
3. Venter, J. C. et al. Science 291, 1304-1351 (2001).
4. She, X. et al. Nature 431, 927-930 (2004).
Lincoln D. Stein is at Cold Spring Harbor Laboratory, 1 Bungtown
Road, Cold Spring Harbor, New York 11724, USA.
e-mail: lstein@cshl.edu