6 Bacteria and Archaea

W. Ford Doolittle

The Triumph of Molecular Phylogeny

The collection of chapters in this volume and the symposium

for which they were assembled celebrate one of the signal

achievements of 20th century biology: the integration of

molecular sequence analyses with more traditional comparative

and paleontological approaches in the construction of a

universal Tree of Life. Integration is one of the key words here.

Without molecular data, we would still find it easy to tell

birds from bees or to distinguish any bird or bee from broccoli,

brewer’s yeast, or bacteria. But we would have no strong

basis for deciding, as we have (see Baldauf et al., ch. 4 in this

vol.), that all birds and bees are closer kin to yeast than to

broccoli. Nor would we have much reason to be as confident

as we are that, despite the manifest differences in size, shape,

and lifestyle, organisms in the first four groups—all eukaryotes,

with nucleated cells—share a common ancestor with

the nonnucleated prokaryotes (Bacteria and Archaea). For

all the very deep branchings, only molecular data—in the

form of DNA or protein sequence, or sometimes three-dimensional

protein structure—can provide unarguable evidence

for common ancestry and define lines of descent.

Unarguable is another key word. Of course, biologists

have never been at a loss for theories about how one type of

living thing might be evolutionarily related to another, and

what features might be important for deciding this. I remember

being taught in high school that brewer’s yeast and other

fungi were really a complex kind of bacterium, because of

their shared absorptive mode of taking nutrients, cell walls,

and general cellular simplicity, for instance. By the time I

started college, this view had been replaced by the synthesis

known as Whittaker’s Five Kingdoms (Animals, Plants,

Fungi, Protozoa, and Bacteria each as separate assemblages).

Such theories were always fluid and arguable, because there

were few commonly agreed upon grounds for formulating

or proving them. One difficulty was in knowing which shared

features are truly homologous (similar because they derive

from such a feature in a common ancestor) and which are

analogous (independently evolved for similar purposes, e.g.,

the wings of birds and bats, or the aquatic habits of fishes

and whales). Claims for evolutionary relatedness can only

be made on the basis of homologous traits. Another difficulty

was in converting data about shared features (if homologous)

into quantifiable measures of overall organismal

similarity. How do we combine data about biochemical

pathways, cellular ultrastructure, and behavior, which are

so profoundly different in quality, into a single quantity

measuring relatedness?

Molecular sequence data, at least at first blush, obviate

both problems. There are 20^100 possible proteins 100 amino

acids long. Anything more than about 15% sequence identity

between two proteins cannot be mere coincidence and

is unlikely to be the result of evolution independently rediscovering

the same solution twice (convergence), because one

of evolutionary biology’s best-learned lessons is that there are

many different ways to solve the same challenge. So significant

sequence similarity can be taken as significant evidence

of homology. It is also eminently quantifiable: we have only

to line two sequences up so as to optimize the match, and

count the identical amino acids (or nucleotides, for an RNA

or DNA sequence). These advantages of molecular sequence

data were first recognized by Emile Zuckerkandl and Linus

Pauling, whose 1965 papers founded the now flourishing

discipline of molecular phylogeny (Zuckerkandl and Pauling

1965). Further, Zuckerkandl and Pauling argued that gene

sequence data (or its direct read-out in RNA or protein sequence)

deserve our attention more than features of organismal

form and function, because they are more fundamental. DNA

sequence determines organismal form and function, and not

the other way round. Indeed, the latter contain no evolutionary

information that is not encoded in the former.
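The counting step described above, and the size of the sequence space, can be sketched in a few lines of Python. This is only an illustration, not the procedure of Zuckerkandl and Pauling: it assumes the two sequences have already been aligned to equal length (with "-" marking gaps), and the function name `percent_identity` and the toy sequences are hypothetical.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent of aligned, non-gap positions at which two sequences match."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("sequences must be aligned to the same length")
    # Keep only alignment columns in which neither sequence has a gap.
    columns = [(a, b) for a, b in zip(aligned_a, aligned_b)
               if a != "-" and b != "-"]
    if not columns:
        return 0.0
    matches = sum(1 for a, b in columns if a == b)
    return 100.0 * matches / len(columns)


# Number of distinct proteins 100 amino acids long (20 choices per position).
SEQUENCE_SPACE = 20 ** 100

if __name__ == "__main__":
    # Toy, made-up sequences: 6 of 8 positions agree.
    print(percent_identity("MKTAYIAK", "MKSAYLAK"))
    # Number of decimal digits in 20^100.
    print(len(str(SEQUENCE_SPACE)))
```

The alignment itself (lining the sequences up "so as to optimize the match") is the hard part and is deliberately left out of this sketch.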

Implicit in Zuckerkandl and Pauling’s arguments, and

embodied in the molecular phylogenetic work they inspired,

was the assumption that, in picking a gene to do phylogeny

with, all we needed to worry about was the ease with which

it (or its RNA or protein product) can be isolated and sequenced,

and the breadth of its distribution. (Hemoglobins

are marvelous for doing vertebrate phylogeny, but plants and

bacteria don’t have them.) What we didn’t have to concern

ourselves with was the possibility that different genes in a

genome might have different phylogenetic histories. This assumption

is depicted in figure 6.1A and could be summarized

as individual gene trees = genome tree = organism tree.

Carl Woese made something like this assumption near the

end of the 1960s, when he chose small subunit (SSU) ribosomal

RNA (rRNA) as a “universal molecular chronometer,”

a stand-in for all genes. SSU rRNA was one of the few ubiquitously

distributed gene products that could be easily isolated

and (at least partially) sequenced at that time (Woese

1987). It still would be one of the best all around choices

(see Pace, ch. 5 in this vol.).

Ironically, a strong violation of the principle illustrated

by figure 6.1A was proposed by Lynn Margulis, at very nearly

the same time, and provided one of the first hypotheses about

deep phylogeny that the infant discipline of molecular phylogeny

could cut its teeth on (Margulis 1970). She dusted

off and made modern the endosymbiont hypothesis for the

origin of chloroplasts and mitochondria, first proposed by

Mereschkowsky in the late 19th century. According to this

notion, these energy-generating organelles (the first responsible

for photosynthesis in all plants and algae, and the second

for respiration in almost all eukaryotes) were once

free-living bacteria that had become trapped in the cytoplasm

of ancient eukaryotic cells, as permanent endosymbionts

(fig. 6.2). In this sheltered and nutrient-rich environment,

many genes useful only for independent life were lost,

whereas many producing proteins still needed for photosynthesis

or respiration were transferred to the nucleus (so that

their products would thenceforth have to be transported back

into the organelle). A few genes were retained on the tiny

residual genomes found in mitochondria and plastids, however,

and these could unequivocally be used to trace the

evolutionary origins of these organelles.

Among such retained genes were those for organellar

versions of SSU rRNA. By the mid 1970s, several groups had

shown that chloroplast and mitochondrial SSU rRNA genes

Figure 6.1. Three models for the relationships between organismal, genome, and gene phylogenies, for four imaginary species

(labeled A, B, C, and D). (A) shows the “standard model”: no genes are exchanged between genomes, so the gene complements of any

genome can change only through loss of genes or duplication of genes, followed by divergence in sequence and function. (B) shows

the “stable core”: some, possibly even most, genes can be exchanged between genomes over evolutionary time, but a core of genes is

immune to this process, and the (congruent) phylogenies of these genes can be used to trace organismal phylogeny, and construct the

true Tree of Life. In (C), the “shifting core” model, no two genes need have the same phylogeny throughout all of life’s history.

Nevertheless, within restricted regions of the tree, most genes might evolve in a coherent fashion, showing congruent phylogenies.

88 The Origin and Radiation of Life on Earth

were indeed of independent bacterial origin (cyanobacteria

and α-proteobacteria, respectively), exhibiting phylogenies

clearly different from each other (Gray and Doolittle 1982).

More to the point, their phylogenies also differed from that

of the nuclear-gene-encoded SSU rRNA of cytoplasmic ribosomes—

a marker for the evolutionary history of the protoeukaryotic

host that first harbored the symbionts (fig. 6.2).

So this very important idea about cellular evolution was also

the first serious counterexample to the assumption that all

of an organism’s genes should have the same phylogeny.

Indeed, it was the fact that they don’t that proved the endosymbiont

hypothesis.

In the rest of this chapter I show that there are very

many other genes like this, genes that show different phylogenies

from SSU rRNA and from each other (and have

nothing to do with the endosymbiont hypothesis). Within

the prokaryotic domains (Bacteria and Archaea), in particular,

much coding DNA can be and demonstrably has been

exchanged across species, genus, phylum, or even domain

boundaries—so many genes, indeed, that the pattern of

relationships defined by SSU rRNA genes may not be exhibited

by the majority of the genes in any genome. For

prokaryotes, the appropriate model for typical relationships

between gene phylogenies might look more like B or C than

A in figure 6.1. This is probably not so much a problem for

eukaryotes, especially complex multicellular ones, and I will

confine myself to the topic assigned me, Bacteria and

Archaea. But because there seems to be so much gene sharing

between the two, my title might more appropriately have

been Bacteriandarchaea.

None of this necessarily means that Darwin was fundamentally

wrong, or that the concept of a unique and universal

organismal Tree of Life is passé, or that, if certain

assumptions hold, rRNA does not track this tree best. But

there is not a unique universal genomic tree, and we need to

develop more sophisticated (but also much more interesting

and exciting) ways of thinking about what we mean by

the Tree of Life.

Superbugs, Drugs, and Lateral Gene Transfer

The mid 1960s also saw the discovery of lateral gene transfer

(LGT), the process (or rather, collection of processes)

underlying microbial gene sharing. Infectious disease microbiologists,

mostly in the United States and Japan, found that

the rapid rise of resistance to commonly (and often excessively)

used antibiotics among human pathogens (especially

in hospitals) was not due to the expected Darwinian mechanism

of random mutation followed by natural selection

(Falkow 1975). Instead, genes determining resistance to

antibiotics (by a variety of mechanisms) had been recruited

from preexisting natural reservoirs and were being passed

around among pathogens on small circular DNA molecules

(plasmids), themselves well adapted to spreading infectiously

between bacterial species (fig. 6.3). Selection is still involved—

pathogens receiving the resistance-conferring plasmids produce

more progeny because they have them. So the process

is Darwinian. But it was not mutations occurring within genes

within species, but whole genes (or suites of genes) transferred

across species boundaries, on which selection was

acting. Indeed, we now know that plasmids can carry several

different genes for resistance to several different kinds

of antibiotics simultaneously, and that special mechanisms

and genetic devices (insertion sequences, transposons, and

integrons, to give some names) have evolved to facilitate the

assembly and transmission of such genes (Bushman 2002).

We were also soon to learn that antibiotic resistance determinants

were not the only kinds of coding sequences that

plasmids could carry. Clusters of several genes involved in

the synthesis of unusual and inessential metabolites or the

degradation of unusual and rarely available substrates were

also exchanged in this way. Two Canadians (Sorin Sonea and

Maurice Panniset) and an Australian (Darryl Reanney) soon

constructed a bold if inchoate theory on this foundation

(Sonea and Panniset 1976, Reanney 1976). They asserted that

because of between-species gene transfer—mediated not only

by plasmids but also by bacterial viruses (phages) and

through cell-to-cell contact (conjugation) or DNA uptake

(transformation)—all bacteria might be viewed as one species,

responding to environmental challenges (over evolutionary

time) as a single “global superorganism.” As I recall it,

these claims were widely dismissed during the 1970s and

Figure 6.2. The endosymbiont hypothesis for the origin of

mitochondria. A respiring α-proteobacterium was acquired by a

nonrespiring host (the protoeukaryote) as an endosymbiont,

conferring the benefits of respiration (efficient metabolism). The

endosymbiont lost genes needed for independent growth and

transferred many other genes to the nucleus. A small mitochondrial

genome (sometimes only a dozen genes) remains in the

organelle. A similar hypothesis would have chloroplasts derive

from cyanobacteria (blue-green algae). Both hypotheses are

considered proven (Gray and Doolittle 1982).

1980s—they were so hopelessly radical! Most of the genes

then known to be transferred by plasmids could be viewed

as somehow “specialized” and, under most circumstances,

dispensable. Genes for core informational functions (replication,

transcription, and translation) were not known to be

subject to LGT, nor were genes of basic and widely conserved

metabolic pathways. So LGT was seen as a genetic add-on,

not a fundamental evolutionary force. It might even have

appeared on the scene recently, as the microbes’ way of coping

with human activity, namely, antibiotic use and the flooding

of microbial environments with many unusual pollutants,

some highly toxic but some of novel nutritional value (for

bacteria).

Pathogenicity (and Other) Islands

As we acquired the ability to characterize and especially to

sequence longer and longer stretches of DNA, however, we

could begin to see that still much more complex genetic

packages could be delivered across species boundaries by

LGT. And chromosomes as well as plasmids could harbor the

transferred genes. In particular, pathogenic bacteria often

differ from harmless relatives by the possession of large functionally

specialized clusters, called pathogenicity islands,

some containing more than 100 genes (Hacker and Kaper

2000). These include virulence factors of many sorts, facilitating

survival within, protection from, or attack on the host,

as well as genes promoting the islands’ transfer as units. Often,

pathogenicity islands are inserted within a particular type

of chromosomal sequence (a gene for transfer RNA) and have

different compositional characteristics (relative composition

of G, C, A, and T) than the surrounding genes (fig. 6.3). Most

cogently, the genes of which they are composed may be found

in very similar form in very distantly related bacterial (or even

archaeal) genomes, but not in the pathogen’s closest relatives.

Clearly, they have been transferred into the genomes in which

we find them, although we don’t generally know the transfer

mechanism. So, very complex and important (for bacteria

and for us) suites of biochemical/physiological/behavioral

characteristics can be acquired in “one fell swoop” by LGT.

And recently, we’ve come to realize that there are also “symbiosis

islands” (promoting cooperation with hosts), “saprophytic

islands” (facilitating decay), and “ecological islands”

(metabolism in unusual circumstances).

Genomic Diversity: The Iceberg of Which

Phylotypic Diversity Is but the Tip

Still, resistance factors and complex multigene determinants

of interactions (benign or malign) with hosts and environments

might be seen as “specialized.” Surely, they constitute

no serious threat to our understandings of the evolutionary

histories of the everyday genes comprising the bulk of most

genomes, or to our ability to reconstruct the universal tree

using a nontransferable marker, like SSU rRNA.

Genomics and, in particular, the appearance of complete

bacterial and archaeal genomic sequences now call even this

view into question. More than 100 such sequences will soon

be publicly available, and these will demolish the notion that

genomes in general contain just a few genes (or gene clusters)

of foreign origin, and these only for specialized functions.

Particularly striking are the comparisons that can be

drawn between different isolates of the very same bacterial

species. Consider for instance Escherichia coli, the laboratory

workhorse of molecular biologists and biotechnologists for

the last five decades. The complete genome sequence of K12,

their favorite strain, was reported in 1997 (Blattner et al.

1997). Many of its 4405 genes were already familiar from

genetic experiments or piecemeal gene sequencing studies.

The community therefore thought that it had this species

under wraps, genomically—until four years later, when the

genome of another E. coli isolate, O157:H7, was completed

(Perna et al. 2001). This is the strain that first attracted popular

attention in 1993 through the death of three young customers

of a fast-food restaurant in California, and two years

ago killed seven people who drank from contaminated wells in

Ontario. The sequencing showed that it has 1387 genes that

K12 doesn’t have, whereas K12 itself has 528 genes not found

in O157:H7—numbers corresponding to 26% of the genome

of O157 and 12% of K12’s. Many of these differences can

only be explained by LGT, verifiable through similarity to

homologous genes in evolutionarily distant bacteria (or even

archaea) and, most persuasively, through the construction

of phylogenetic trees for each gene. These many differences

are also clearly the consequence of many different LGT

events, not just the acquisition of a few large pathogenicity

islands. In fact there are 177 physically separated “O islands”

Figure 6.3. Bacterial antibiotic resistance genes found on

plasmids have been the major cause of the rise in drug-resistant

“superbugs.” Their spread is one form of LGT. Also, genes for

many functions related to pathogenicity are clustered in

transferable regions of bacterial chromosomes.

(genes or gene clusters present in O157 but not K12) and about

234 “K islands.” Although many of the strain-specific genes of

O157:H7 are likely to be specialized determinants of virulence,

many are not. They encode seemingly pedestrian microbial

functions (e.g., carbohydrate transfer, glutamate fermentation,

or aromatic compound degradation).
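The percentages quoted for the two E. coli strains can be checked with a little arithmetic. The sketch below uses only the three counts given in the text; the total for O157:H7 is not stated there, so it is derived here under the simplifying assumption that every gene is either shared by both strains or specific to one of them. The variable names are illustrative.

```python
# Counts taken from the text.
K12_TOTAL = 4405   # genes in the K12 genome
O157_ONLY = 1387   # genes in O157:H7 that K12 lacks
K12_ONLY = 528     # genes in K12 that O157:H7 lacks

# Assumption (not from the text): genes are either shared or strain-specific,
# so the O157:H7 total can be derived from the shared set.
shared = K12_TOTAL - K12_ONLY
o157_total = shared + O157_ONLY

o157_specific_pct = 100 * O157_ONLY / o157_total
k12_specific_pct = 100 * K12_ONLY / K12_TOTAL

if __name__ == "__main__":
    print(f"shared genes:            {shared}")
    print(f"derived O157:H7 total:   {o157_total}")
    print(f"O157:H7-specific share:  {o157_specific_pct:.1f}%")
    print(f"K12-specific share:      {k12_specific_pct:.1f}%")
```

Under that assumption the two shares round to the 26% and 12% given above; the derived O157:H7 total is a consequence of the assumption, not a reported gene count.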

Preliminary data for other E. coli strains show the O157:H7

versus K12 difference to be typical, not aberrant. Similar studies

based on similar information on other pathogens produce

similar results. Strains of the same “species” often differ from

each other by up to 25% in gene content. Simple logic (with

the assumption that, on average, bacterial genomes are getting

neither larger nor smaller) dictates that about half of this

difference can be attributed to acquisition of new genes by one

or the other strain, after their joint separation from a common

ancestor. (The other half could be explained by loss, from one

or the other strain, of genes present in that ancestor.)

We know about genomic variability in pathogens because

it is easy to obtain funding to study the biology of pathogens.

Data on nonpathogens are scant. Recently Camilla Nesbø in

my lab, with Karen Nelson at The Institute for Genomic

Research, has been looking at genomic diversity within

Thermotoga maritima, a nonpathogen par excellence. This

hyperthermophilic bacterium grows best at 80°C and was

isolated from the seafloor in a geothermal area near Vulcano,

Italy. Preliminary data suggest that here, too, there will be

something like 20% variability in gene content, between otherwise

very similar isolates. If this turns out to be generally

true for “environmental microbes” (including Archaea), then

we cannot explain away within-species genomic variation as

a by-product of intense host–parasite warfare: we must accept

it as a fact of prokaryotic life. We must also accept, then,

that the microbial world is even more wildly diverse than those

who use “phylotyping” (amplification and sequencing of SSU

rRNA genes from environmental DNA samples; see Pace,

ch. 5 in this vol.) have already told us. Such studies have revealed,

through a plethora of new twigs on the branches of

the SSU rRNA tree, a hitherto unimaginable diversity of relatives

of known groups. They have also led to the discovery of

completely new groups, without previously known relatives.

For each isolate identified by a single SSU rRNA sequence

(“phylotype”), however, there may now be many more genomic

variants, differing in their content of truly different

(nonhomologous) genes by more than, say, the genomes of all

the animals. (Animals do, of course, vary in gene content, but

through duplication and functional divergence of genes they

already have, or through gene loss—scarcely ever through

the introduction of genuinely novel genes by LGT.)

How Much Exchange over Life’s Whole History?

There is no easy way to know how old any bacterial species

is, or (which is almost the same question) how long strains

within a species have been diverging—and surely there is no

uniform age. Howard Ochman and Isaac Jones estimate that

various E. coli strains began to diverge about 25–40 Myr

(million years) ago, based on an often quoted but largely

unverified estimate of the divergence of Escherichia from

Salmonella at 100–150 Myr ago (Ochman and Jones 2000).

In contrast, Yersinia pestis, the cause of plague, may be only

a few thousand years old (Achtman et al. 1999)! But however

ancient bacterial species in general may be, their ages

will be dwarfed by that of life itself. So, if 10–20% of a genome

can “turn over” because of LGT and gene loss within

(generously) 100 Myr, what fraction would we expect to have

been affected by LGT over 3.8 billion years? No one thinks

that all genes are equally exchangeable, but still it is reasonable

to ask what fraction of any contemporary genomes’ genes

has been affected by LGT. There are several ways one might

try to do this.
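To get a feel for the scale of the question, one can run a deliberately naive extrapolation. The sketch below assumes that a fixed fraction of genes turns over in every 100 Myr interval and that every gene is equally likely to be hit, which is exactly the assumption the text says no one believes; it also lumps LGT and gene loss together, as the turnover figure does. The function name and parameters are hypothetical, and the result is an illustration of the arithmetic, not an estimate.

```python
def fraction_untouched(turnover_per_interval: float,
                       total_myr: float,
                       interval_myr: float = 100.0) -> float:
    """Fraction of genes never replaced after total_myr, assuming the same
    random fraction of genes turns over in every interval (naive model)."""
    if not 0.0 <= turnover_per_interval <= 1.0:
        raise ValueError("turnover must be a fraction between 0 and 1")
    intervals = total_myr / interval_myr
    return (1.0 - turnover_per_interval) ** intervals


if __name__ == "__main__":
    # 10% and 20% turnover per 100 Myr, extended over 3.8 billion years.
    for turnover in (0.10, 0.20):
        remaining = fraction_untouched(turnover, total_myr=3800)
        print(f"turnover {turnover:.0%}: {remaining:.4%} of genes untouched")
```

Because real genes differ greatly in how exchangeable they are, the true untouched fraction could be far larger than this model suggests, which is why the empirical approaches that follow are needed.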

Ochman and Jeff Lawrence look at basic compositional

features of genes, in particular, the relative frequencies of A,

T, G, and C and the choice among alternative codings for the

same amino acids (Ochman et al. 2000). Prokaryotic species

differ significantly in these parameters, which tend to be similar

within a genome. Thus, a recently transferred gene might

“stick out like a sore thumb” from the surrounding long-term

residents. (With time—perhaps a few hundred million

years—genome-specific mutational and selectional pressures

will attenuate and ultimately erase the differences.) With

analyses based on these premises, Ochman and collaborators

find foreign gene contents from 0.0% (for Mycoplasma

genitalium or Rickettsia prowazekii, intracellular human parasites)

to 16.6% for the cyanobacterium Synechocystis, with

E. coli boasting 12.8% transfers.
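The "sore thumb" idea can be illustrated with a very reduced sketch. It is not the procedure of Ochman and collaborators, which also weighs codon choice: here a gene is flagged only if its G+C fraction lies more than a chosen number of standard deviations from the genome-wide mean. The function names, the cutoff of 2.0, and the toy sequences are all assumptions made for illustration.

```python
from statistics import mean, pstdev
from typing import Dict, List


def gc_content(seq: str) -> float:
    """Fraction of G and C nucleotides in a DNA sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def flag_atypical_genes(genes: Dict[str, str], z_cutoff: float = 2.0) -> List[str]:
    """Names of genes whose G+C fraction deviates from the genome mean
    by more than z_cutoff standard deviations."""
    gc = {name: gc_content(seq) for name, seq in genes.items()}
    values = list(gc.values())
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return []  # all genes identical in composition, nothing stands out
    return [name for name, value in gc.items()
            if abs(value - mu) / sigma > z_cutoff]


# Toy genome: nine "resident" genes at 50% G+C and one G+C-rich newcomer.
TOY_GENES = {f"resident{i}": "ATGC" * 5 for i in range(9)}
TOY_GENES["newcomer"] = "GCGCGCGCGA" * 2

if __name__ == "__main__":
    print(flag_atypical_genes(TOY_GENES))
```

A composition test of this kind can only see recent arrivals, since, as noted above, the signal fades as the transferred gene adapts to its new genome.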

Eugene Koonin and his colleagues employ a completely

different method (called BLAST) that makes all possible

pairwise comparisons between each of a genome’s genes and

all homologous genes in other genomes (or the larger databases),

and calculates sequence similarity (Koonin et al.

2001). Genes that have greatest sequence similarity to genes

in species that are distant on the rRNA tree (rather than to

genes in species that are close) are likely transfers. The most

easily detected transfers would be those involving the greatest

distances: genes in an archaeal genome that are most

similar to homologs in the bacterial domain, and vice versa.

Koonin finds up to 15.6% interdomain transfer (for an

archaean, Halobacterium salinarum). Rumor in the field now

has it that similar analyses will show that one-third of the

genes in the yet-to-be-published genome sequence of the

methane-producing archaean Methanosarcina mazei are of

bacterial provenance—an astonishing result!
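The logic of the best-match approach can be sketched as follows. This is not Koonin's pipeline and it does not run or parse BLAST: it assumes the similarity searches have already been done and summarized, for each gene, as a list of (domain, score) pairs for homologs in other genomes. The data structure, function name, and toy scores are hypothetical.

```python
from typing import Dict, List, Tuple

# (domain of the genome containing the homolog, similarity score)
Hit = Tuple[str, float]


def candidate_interdomain_transfers(hits: Dict[str, List[Hit]],
                                    own_domain: str) -> List[str]:
    """Genes whose highest-scoring homolog lies in a different domain."""
    flagged = []
    for gene, gene_hits in hits.items():
        if not gene_hits:
            continue  # no homologs found, so nothing to compare
        best_domain, _score = max(gene_hits, key=lambda hit: hit[1])
        if best_domain != own_domain:
            flagged.append(gene)
    return flagged


# Toy summary for three genes from an imaginary archaeal genome.
TOY_HITS = {
    "geneA": [("Archaea", 310.0), ("Bacteria", 120.0)],
    "geneB": [("Archaea", 95.0), ("Bacteria", 280.0)],
    "geneC": [],
}

if __name__ == "__main__":
    flagged = candidate_interdomain_transfers(TOY_HITS, own_domain="Archaea")
    print(flagged, f"{len(flagged) / len(TOY_HITS):.1%} of genes")
```

A best match in the wrong domain is only a hint of transfer; it can also arise from gene loss or unequal rates of change, which is one reason tree building is the more reliable test.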

The third and best way to assess a genome’s origins is to

construct phylogenetic trees for each of its genes, by state-of-the-art

methods. For many individual genes, compelling

cases can be developed. My favorite example is the gene for

HMGCoA reductase (3-hydroxy-3-methylglutaryl coenzyme

A reductase), a key enzyme in the synthesis of isoprenoid

compounds (sterols, e.g.) in all three domains (and the target

of the statins that many people take to reduce endogenous

cholesterol synthesis). Our attention was first drawn to

HMGCoA reductase because BLAST analyses showed that the

version of this gene in Archaeoglobus fulgidus (a hyperthermophilic

archaean sometimes found in undersea oil wells) was

very like homologous genes in bacteria and unlike the versions

found in other Archaea. In fact, most Archaea have an

HMGCoA reductase very similar to that of eukaryotes, so for

them statins are antibiotics! A tree prepared by Yan Boucher

for HMGCoA reductases (fig. 6.4) not only confirmed this

result but identified other transfers—Bacteria to Giardia

intestinalis (a single-celled pathogenic eukaryote), Archaea to

Vibrio cholerae (a bacterial pathogen), and Archaea to Streptomyces

species (bacteria that produce antibacterial antibiotics).

Gene-by-gene analyses are time consuming, because

human judgment is still often required. Less reliable but very

rapid programs for preparing, by simple automatic methods,

all the trees for all the genes in a genome are being developed.

That of Thomas Sicheritz-Ponten and Siv Andersson

shows, not unlike Koonin’s BLAST studies, interdomain

(Bacteria to Archaea or Archaea to Bacteria) transfers

amounting to up to about 20% of a genome (Sicheritz-

Ponten and Andersson 2001).

Is this about the limit? Are 70–80% of most genomes well

behaved in the long-term evolutionary sense, as well as the

short? Probably not. Foreign gene estimates are all likely

to be underestimates. Ochman’s analyses, for instance, can

only look back a few hundred million years. Koonin’s and

Sicheritz-Ponten’s results described interdomain transfers

(Bacteria to Archaea or vice versa). Because Bacteria and

Archaea have dissimilar gene expression machinery and control

signals, genes transferred between them should often be

poorly read. Harder to detect, intradomain transfers should

be much more frequent.

Hunting Down the Core

There is another way to skin this cat. Instead of asking what

fraction of genes in a given genome have clearly different histories

than the majority (or than SSU rRNA), we can ask if

we can find, by comparing all genomes, a stable core of shared

genes (fig. 6.1B) that have the same history. There is a general

belief that such a core should exist, based on a hypothesis

and an observation.

The hypothesis, first articulated by Woese when he decided

to settle on SSU rRNA as a “universal molecular chronometer,”

has come to be called the “complexity hypothesis”

(Jain et al. 1999). The idea is simple: genes whose protein

(or RNA) products must interact in the cell will coevolve.

Mutations that affect the structure of one gene product (call

it A) will be compensated by mutations that affect another,

interacting, gene product (B) in a compensatory way, so that

the essential interactions between A and B are preserved

throughout the evolutionary history of a species or lineage.

Meanwhile, in another, related lineage, the homologous gene

products A' and B' will also be coevolving, but likely along a

somewhat different path. If the B gene of the first lineage were

replaced by the B' of the second lineage, there might be

problems: the A gene product might not interact as effectively

with the B' product (and similarly, A' might not be effective

with B). This seems a very reasonable conjecture, and the

corollary—that genes involved in even more complex interactions

(A + B + C + D + E . . .) should be very hard to exchange

for homologous genes in different lineages, without

detriment to growth—seems inescapable.

SSU rRNA is the central part of an enormously complex

structure, the ribosome. This factory for translation (the RNA →

protein part of DNA → RNA → protein) also requires two

other RNAs and more than 50 proteins, in order to do its

vital and always essential job. The complexity hypothesis

would predict that the genes encoding these RNAs and proteins

could not be transferred across even very short evolutionary

distances. Similarly, the various genes encoding the

machineries of transcription (DNA → RNA) and replication

(copying of DNA) should be hard to transfer. Certainly, it is

the case that the genes identified as foreign in individual

sequenced bacterial or archaeal genomes are seldom genes

of these informational classes. But there are now several reliable

reports of transfer of “informational genes,” especially

those involved in translation and (in a few cases) SSU rRNA

itself (Yap et al. 1999)!

The observation on which confidence in a stable core rests

is what some of us call “coherence.” Many individual genes,

when known from a sufficient number of species, do re-create

the same major groups—Archaea (and within them euryarchaeotes

and crenarchaeotes) or Bacteria (and within them

the known bacterial phyla, such as cyanobacteria, α-, β-, and

γ-proteobacteria, and so forth). There is no published systematic

survey that says how many “many” is, however, or

that compares a large number of well-resolved trees for congruence

of topology. And few genes agree on branching order

of bacterial phyla (even though they do distinguish Bacteria

and Archaea). Pace (ch. 5 in this vol.) suggests that the poor

resolution at the base of the bacteria bespeaks a rapid radiation

some 3.5 billion years ago, perhaps caused by a key innovation.

This is one explanation, but not the only one.

Surely the most rigorous test of the stable core idea would

be to compare all bacterial and archaeal genomes, distill out

the set of genes of which all genomes have a copy, make trees,

and tally up how many subscribe to which topology. Efforts

to do this have failed: there are very few genes shared by all

genomes (even all bacterial or all archaeal genomes)—perhaps

50 or fewer (Teichmann and Mitchison 1999). Few of

these genes give statistically robust trees, so we simply cannot

say whether their topologies are congruent or not. The

assumption that there might be a stable core of genes for all

prokaryotes is not disproved by this, but neither is it proven:

it remains a hypothesis. In an effort to test the stable core

idea on a more limited basis, we looked at the core of genes

shared by four sequenced euryarchaeotes, asking if these all

produced the same tree (Nesbø et al. 2001). Several hundred

genes could be looked at and, because there are only three

unrooted phylogenetic trees for four taxa, easily scored for

agreement or disagreement. It turns out that each of the three

possible trees is significantly represented among the 263

shared genes we looked at. In other words, although there is

a core of genes shared by the four genomes, it does not seem

to be a stable core. The shared genes often appear to have

different phylogenetic histories. This could mean that genes

are not infrequently replaced by homologous but possibly

quite different versions of themselves, transferred in across

species lines.
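The statement that four taxa admit only three unrooted trees follows from the standard count of unrooted, fully bifurcating trees, (2n - 5)!! for n taxa. The sketch below computes that count and lists the three possible groupings for four taxa, against which each gene's tree could be scored. The function names and taxon labels are illustrative and are not taken from the study described above.

```python
from typing import List, Sequence, Tuple

Split = Tuple[Tuple[str, str], Tuple[str, str]]


def count_unrooted_trees(n_taxa: int) -> int:
    """Number of unrooted, fully bifurcating trees: (2n - 5)!! for n >= 3."""
    if n_taxa < 3:
        raise ValueError("need at least three taxa")
    count = 1
    # Multiply the odd numbers 3, 5, ..., 2n - 5.
    for k in range(3, 2 * n_taxa - 4, 2):
        count *= k
    return count


def quartet_topologies(taxa: Sequence[str]) -> List[Split]:
    """The three ways to pair up four taxa across the tree's internal branch."""
    if len(taxa) != 4:
        raise ValueError("exactly four taxa are required")
    a, b, c, d = taxa
    return [((a, b), (c, d)), ((a, c), (b, d)), ((a, d), (b, c))]


if __name__ == "__main__":
    print(count_unrooted_trees(4))
    for topology in quartet_topologies(["A", "B", "C", "D"]):
        print(topology)
    # The count grows very quickly with the number of taxa.
    print(count_unrooted_trees(10))
```

With only three alternatives, scoring hundreds of genes is a matter of tallying which grouping each gene supports; the rapid growth of the count for larger sets of taxa is what makes the same exercise much harder beyond four genomes.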

So it is not possible to prove that there is any sizable stable

core, even within a relatively restricted group such as the

euryarchaeotes. Hervé Philippe and collaborators have tried

another approach (Brochier et al. 2002). Individual trees

constructed for 57 translational proteins shared by 45 bacterial

species mostly disagree, as expected: there is too much

noise and too little phylogenetic signal. But if they strung

all gene sequences together to obtain one concatenated sequence,

then a statistically robust tree could be obtained, and

44 of the 57 genes did not significantly contradict this result.

(The 13 others showed significant evidence for transfer.)

So perhaps these comprise a true core for all of Bacteria.

But 44 is only a few percent of the number of genes in a typical

bacterial genome. And when Brown and collaborators

(2001) included members of Archaea in a similar study, they

were obliged to reduce the apparent stable core even further,

to only 14 genes. Woese may be correct in asserting, “An

organismal genealogical trace of some kind does seem to exist

. . . but that trace is carried clearly almost exclusively in the

componentry of the cellular information processing systems”

(Woese 2000:8393). However, when it comes to prokaryotes,

and the deepest branches of the universal tree, proving

even this modest claim is surprisingly difficult!
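
The idea of stringing gene sequences together can be illustrated with a minimal sketch. It assumes each gene has already been aligned separately and is supplied as a mapping from species name to aligned sequence. The gene and species names are hypothetical, and padding a missing gene with gap characters is a choice made here for illustration, not something taken from Brochier et al. (2002).

```python
def concatenate(alignments, species):
    """Join per-gene alignments into one long sequence per species.

    alignments: {gene_name: {species_name: aligned_sequence}}
    species:    list of species to include in the result
    """
    parts = {sp: [] for sp in species}
    for gene in sorted(alignments):  # fixed gene order for reproducibility
        aln = alignments[gene]
        widths = {len(seq) for seq in aln.values()}
        if len(widths) != 1:
            raise ValueError(f"sequences for {gene} are not aligned to one length")
        width = widths.pop()
        for sp in species:
            # A species lacking this gene contributes only gap characters.
            parts[sp].append(aln.get(sp, "-" * width))
    return {sp: "".join(chunks) for sp, chunks in parts.items()}


# Hypothetical toy data: two genes, two species.
alignments = {
    "geneB": {"X": "MK", "Y": "MR"},
    "geneA": {"X": "AAA", "Y": "AGA"},
}
print(concatenate(alignments, ["X", "Y"]))
```

The concatenated sequence is then analyzed as if it were a single long gene, which is why it can give a robust tree even when no individual gene does.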

Figure 6.4. Phylogeny of genes

encoding HMGCoA reductase, a

key enzyme in the synthesis of

sterols and related lipids. The

predominant bacterial form

(class 2) and predominant

eukaryotic/archaeal form (class

1) are unquestionably homologous

but with different functional

characteristics. Four LGT

events are very strongly supported

by the phylogenetic

analysis. The boxed numbers are

bootstrap values, measures of

statistical robustness, for a tree

obtained by maximum likelihood,

maximum parsimony, and

distance methods. Archaeal

names are italicized, eukaryotic

names are underlined, and

bacterial names are in regular

letters.

Other Models

Absence of evidence is not evidence of absence. A conservative

summary of what I’ve said so far is that the existence of

a stable core is hard to prove. The signal-to-noise ratio in the

data we need to decide about events occurring three or

more billion years ago is too low, and our methods are still

too crude. “Hard to prove” is not “disproven.” But all parties

to the debate now accept that the core of genes that has been

stably associated in all prokaryotic genomes since the first

genome is far smaller than we used to think. And, just maybe,

there might be no such core.

What if there weren’t? Could there be some other model

than those depicted in figure 6.1, A and B, to explain the

undeniable fact that we can classify bacteria and archaea into

groups that have many shared defining features—that the

entire edifice of Linnaean hierarchical classification has been

more or less successfully imposed on microbial systematics?

Jeff Lawrence, Peter Gogarten, and I have been working on

such a model, which is still in the verbal stages (no formal

mathematics) and has as yet no fixed name (Gogarten et al.

2002). Here I call it the model of the “shifting core” or, alternatively,

the model of “nested gene pools.” In fact, it’s

not much different from what Woese himself now believes

(Woese 2000), although we would probably disagree on the

values of its parameters.

Imagine that all genes are potentially exchangeable but

that the frequency or likelihood of exchange varies tremendously.

Many factors would affect this. Complexity of interactions

of the gene’s product, and whether or not it was

genetically linked (and so could be co-transferred) with other

interacting genes would be important factors, related to the

genes themselves. So would essentiality: genes that must always

be present can only be replaced through an intermediate

stage in which both the originally resident and the incoming

foreign gene are found in the same genome. (Such intermediates

are well known.) Biochemistry of the donor and recipient

organism would be a key determinant. Transferred genes

for various components of the photosynthetic apparatus are

only likely to be of any use to species that already do photosynthesis.

If of no use, transferred genes will soon be lost and

we will never know that a transfer occurred. Similarly, the

differences in gene expression systems between Bacteria and

Archaea must reduce the frequency of successfully fixed

transfers between them. Environmental niche matters, too:

genes from thermophiles make proteins that work best in

other thermophiles. Finally, donors and recipients must be

found in close proximity in nature, and physical and genetic

mechanisms to pass DNA between them (including “accidental”

mechanisms) must exist.

Imagine that we ourselves create hundreds of different

bacterial species, with genes and genomes made from

scratch by machine, and then set them up in various niches

and allow them to transfer genes according to such rules.

Although there would initially have been no deep “phylogenetic”

relationships between these human-made species

or their genes, patterns of shared genes and similarities in

sequences would eventually emerge, because of recurring

transfers at different frequencies. In other words, LGT itself

can create and maintain the patterns we seek to explain

by the model depicted in figure 6.1A, but the underlying

process would be as shown in figure 6.1C. According to

this model, organisms that exchange genes most frequently

would comprise “species.” Different species whose organisms

share genes somewhat less frequently would comprise

genera, and so on up the Linnaean ladder. Bacteria are coherent

as a domain because they more frequently exchange

genes with other bacteria than with members of Archaea

(and vice versa), but still, interdomain transfer does occasionally

happen.
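
This thought experiment is easy to caricature in code. The sketch below is a toy, not the model itself, which remains verbal: the group sizes, the transfer probabilities, and especially the mutation step are assumptions introduced here. Mutation is added only so that the pattern persists instead of every genome eventually becoming identical.

```python
import random
from itertools import combinations


def simulate(n_groups=2, per_group=4, n_genes=100, steps=40000,
             p_within=0.95, mutation=0.1, seed=0):
    """Toy gene exchange among made-from-scratch genomes.

    Needs at least two groups with at least two organisms each.
    Returns (genomes, group), where genomes[i][g] is an allele label.
    """
    rng = random.Random(seed)
    n = n_groups * per_group
    group = [i // per_group for i in range(n)]
    # Every organism starts with its own unrelated version of every gene.
    genomes = [list(range(i * n_genes, (i + 1) * n_genes)) for i in range(n)]
    next_allele = n * n_genes
    for _ in range(steps):
        recipient = rng.randrange(n)
        gene = rng.randrange(n_genes)
        if rng.random() < mutation:
            # Assumed divergence step: the gene changes into a new version.
            genomes[recipient][gene] = next_allele
            next_allele += 1
            continue
        # Transfer: the donor is usually, but not always, from the same group.
        same = rng.random() < p_within
        donors = [d for d in range(n)
                  if d != recipient and (group[d] == group[recipient]) == same]
        genomes[recipient][gene] = genomes[rng.choice(donors)][gene]
    return genomes, group


def mean_identity(genomes, group, within):
    """Average fraction of identical genes over within- or between-group pairs."""
    values = []
    for i, j in combinations(range(len(genomes)), 2):
        if (group[i] == group[j]) == within:
            same = sum(x == y for x, y in zip(genomes[i], genomes[j]))
            values.append(same / len(genomes[i]))
    return sum(values) / len(values)


genomes, group = simulate()
print("within groups: ", round(mean_identity(genomes, group, True), 2))
print("between groups:", round(mean_identity(genomes, group, False), 2))
```

Although the starting genomes share nothing, organisms that exchange genes often end up more alike than those that exchange rarely, so a nested pattern appears without any common descent having been built in.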

This model may not be correct in its extreme form (no

stable core at all), but something like it must apply in the

long run to most of the genes that make up prokaryotic genomes.

In the short run (corresponding to the divergence of

strains in a species or species in a genus, perhaps), it may

most accurately describe only the 20% of a genome’s worth

of genes that are found in some genomes but not others.

[However, recombination within genes—which I have not

discussed—may have a similarly confounding effect, at this

level (see Maynard Smith et al. 2000).]

The One True Tree

Darwin did describe the relationships of all organisms as a

tree and thought that the patterns of similarities and differences

between all contemporary species could be explained

as the result of successive bifurcative speciation events, going

back to one, or just a few first living things. If we had a videotape

of all that (and 3.8 billion years to sit down and watch

it!), we could trace all the bifurcations, and that tracing would

be the universal Tree of Life. But there is no video, so we have

been trying to reconstruct these bifurcations by comparing

the sequences of genes, initially on the assumption that any

gene would in principle do, but more recently with the belief

that only some genes will tell the true story. But even if

none do, and figure 6.1C shows how genomes truly evolve,

the situation need not be seen as hopeless. Some kind of

consensus of the phylogenies of all genes of all genomes,

weighted perhaps in favor of those least frequently transferred,

might still have a good chance of recreating the pattern

of speciation events recorded on our imaginary videotape.

We don’t yet know how best to make such consensus phylogenies.

Some investigators want to call them “genome phylogenies,”

a misleading term, I believe. Frequent LGT does

not mean there is no single true universal Tree of Life for

organisms, only that reconstructing this tree has become

more problematic. But frequent LGT does mean that there

is no single true universal tree of genomes, because these are

made up of parts that have different phylogenies!
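
One naive way such a weighted consensus might be computed is a weighted majority rule over the groupings (splits) found in individual gene trees. The sketch below is only that, a sketch: the gene names, the weights, and the 50% threshold are assumptions for illustration, and nothing here settles how such phylogenies are best made.

```python
from collections import defaultdict


def canonical(side, taxa):
    """Name a split by the side that lacks the first taxon in sorted order."""
    side = frozenset(side)
    return frozenset(taxa) - side if min(taxa) in side else side


def weighted_majority_splits(gene_trees, weights, threshold=0.5):
    """Keep splits whose share of the total gene weight exceeds the threshold.

    gene_trees: {gene: set of canonical splits}
    weights:    {gene: weight}, e.g. larger for rarely transferred genes
    """
    total = sum(weights[gene] for gene in gene_trees)
    support = defaultdict(float)
    for gene, splits in gene_trees.items():
        for split in splits:
            support[split] += weights[gene]
    return {s: w / total for s, w in support.items() if w / total > threshold}


# Hypothetical example: five taxa, three genes, one of them down-weighted.
taxa = "ABCDE"
gene_trees = {
    "gene1": {canonical("AB", taxa), canonical("ABC", taxa)},
    "gene2": {canonical("AB", taxa), canonical("ABD", taxa)},
    "gene3": {canonical("AC", taxa), canonical("ABC", taxa)},
}
weights = {"gene1": 1.0, "gene2": 1.0, "gene3": 0.2}
print(weighted_majority_splits(gene_trees, weights))
```

Splits that each carry more than half the total weight must occur together in at least one gene tree, so the retained splits are mutually compatible and define a tree, though possibly an incompletely resolved one.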

Cold Comfort to Creationism

Advocates of Biblical interpretations of life’s history and proponents

of “intelligent design” like to cite disagreement

within the evolutionary community and, in particular, claims

to have “overthrown Darwin” as support for their views.

Therefore, early publications asserting that evidence for extensive

LGT was “uprooting the Tree of Life” have found

popularity with them. Perhaps some of us (especially me)

were not careful enough in stating that what was being uprooted

was the tree of genomes. Our acceptance of the video

version of the organismal tree remains steadfast, regardless

of problems in constructing it.

Even so, there is a challenge to Darwinism, as it has itself

evolved over the last century. Darwinists (more properly,

neo-Darwinists) see adaptation happening as the result of

selection among mutations that have arisen in genes within

populations of species, and speciation as most commonly the

result of divergent (and ultimately incompatible) adaptations

being fixed in different populations. Explicitly or implicitly,

figure 6.1A is the model of genome evolution most compatible

with this neo-Darwinian view. This, I assert, is what

Darwin himself would have expected, had he lived to see the

centenary of the publication of The Origin of Species. If adaptations

are instead often due to acquisition of genes from

different species, then figure 6.1C might be the more relevant

model. I’d hope that Darwin, had he hung on for still another

half century, would have found this at least amusing

and recognized the profound difference.

In any case, what does it matter what Darwin would think?

Evolutionary biologists are committed to materialistic,

nonsupernatural explanations of the patterns of similarity and

difference we see in the living world, not to the correctness of

Darwin’s own particular explanations. If we substitute one

materialistic, nonsupernatural explanation for another, this is

a sign of paradigmatic health, not weakness. Sometimes I think

we ourselves forget this, and defend Darwin and neo-Darwinism

(and, indeed, the gene-based Tree of Life) as if they were

received truth, not provisional interpretations of a fascinatingly

complex world. We should stop doing that!

Literature Cited

Achtman, M., K. Zurth, G. Morelli, G. Torrea, A. Guiyoule, and

E. Carniel. 1999. Yersinia pestis, the cause of plague, is a

recently emerged clone of Yersinia pseudotuberculosis. Proc. Natl.

Acad. Sci. USA 96:14043–14048.

Blattner, F. R., G. Plunkett, III, C. A. Bloch, N. T. Perna,

V. Burland, M. Riley, J. Collado-Vides, J. D. Glasner, C. K.

Rode, G. F. Mayhew, et al. 1997. The complete genome

sequence of Escherichia coli K-12. Science 277:1453–1474.

Brochier, C., E. Bapteste, D. Moreira, and H. Philippe. 2002.

Eubacterial phylogeny based on translational apparatus

proteins. Trends Genet. 18:1–5.

Brown, J. R., C. J. Douady, M. J. Italia, W. E. Marshall, and M. J.

Stanhope. 2001. Universal trees based on large combined

protein sequence datasets. Nat. Genet. 28:281–285.

Bushman, F. 2002. Lateral DNA transfer. Cold Spring Harbor

Laboratory Press, Cold Spring Harbor, NY.

Falkow, S. 1975. Infectious multiple drug resistance. Pion Ltd.,

London.

Gogarten, J. P., W. F. Doolittle, and J. G. Lawrence. 2002.

Prokaryotic evolution in the light of gene transfer. Mol. Biol.

Evol. 19:2226–2238.

Gray, M. W., and W. F. Doolittle. 1982. Has the endosymbiont

hypothesis been proven? Microbiol. Rev. 46:1–42.

Hacker, J., and J. B. Kaper. 2000. Pathogenicity islands and the

evolution of microbes. Annu. Rev. Microbiol. 54:641–679.

Jain, R. C., M. C. Rivera, and J. A. Lake. 1999. Horizontal gene

transfer among genomes: the complexity hypothesis. Proc.

Natl. Acad. Sci. USA 96:3801–3806.

Koonin, E. V., K. S. Makarova, and L. Aravind. 2001. Horizontal

gene transfer in prokaryotes: quantification and classification.

Annu. Rev. Microbiol. 55:709–742.

Margulis, L. 1970. Origin of eukaryotic cells. Yale University

Press, New Haven, CT.

Maynard Smith, J., E. J. Feil, and N. H. Smith. 2000. Population

structure and evolutionary dynamics of pathogenic bacteria.

Bioessays 22:1115–1122.

Nesbø, C. L., Y. Boucher, and W. F. Doolittle. 2001. Defining

the core of nontransferable prokaryotic genes: the euryarchaeal

core. J. Mol. Evol. 53:340–350.

Ochman, H., and I. B. Jones. 2000. Evolutionary dynamics of

full genome content in Escherichia coli. EMBO J. 19:6637–

6643.

Ochman, H., J. G. Lawrence, and E. A. Groisman. 2000. Lateral

gene transfer and the nature of bacterial innovation. Nature

405:299–304.

Perna, N. T., G. Plunkett, III, V. Burland, B. Mau, J. D. Glasner,

D. J. Rose, G. F. Mayhew, P. S. Evans, J. Gregor, et al. 2001.

Genome sequence of enterohaemorrhagic Escherichia coli

O157:H7. Nature 409:529–532.

Reanney, D. C. 1976. Extrachromosomal elements as possible

agents of adaptation and development. Bacteriol. Rev.

40:552–590.

Sicheritz-Ponten, T., and S. G. Andersson. 2001. A phylogenomic

approach to microbial evolution. Nucleic Acids

Res. 29:545–552.

Sonea, S., and M. Panisset. 1976. Manifesto for a new bacteriology.

Rev. Can. Biol. 35:103–167.

Teichmann, S. A., and G. Mitchison. 1999. Is there phylogenetic

signal in prokaryotic proteins? J. Mol. Evol. 49:98–107.

Woese, C. R. 1987. Bacterial evolution. Microbiol. Rev. 51:221–

271.

Woese, C. R. 2000. Interpreting the universal phylogenetic tree.

Proc. Natl. Acad. Sci. USA 97:8392–8396.

Yap, W. H., Z. Zhang, and Y. Wang. 1999. Distinct types of

rRNA operons exist in the genomes of the actinomycete

Thermomonospora chromogena and evidence for horizontal

transfer of an entire rRNA operon. J. Bacteriol. 181:5201–

5209.

Zuckerkandl, E., and L. Pauling. 1965. Evolutionary divergence

and convergence in proteins. Pp. 97–166 in Evolving genes

and proteins (V. Bryson and H. J. Vogel, eds.). Academic

Press, New York.