Main

The amoebozoa are a richly diverse group of organisms whose genomes remain largely unexplored. The soil-dwelling social amoeba Dictyostelium discoideum has been actively studied for the past 50 years and has contributed greatly to our understanding of cellular motility, signalling and interaction1. For example, studies in Dictyostelium provided the first descriptions of a eukaryotic cell chemoattractant and a cell–cell adhesion protein2,3.

Dictyostelium amoebae inhabit forest soil and consume bacteria and yeast, which they track by chemotaxis. Starvation, however, prompts the solitary cells to aggregate and develop as a true multicellular organism, producing a fruiting body composed of a cellular, cellulosic stalk supporting a bolus of spores. Thus, Dictyostelium has evolved mechanisms that direct the differentiation of a homogeneous population of cells into distinct cell types, regulate the proportions between tissues and orchestrate the construction of an effective structure for the dispersal of spores4. Many of the genes necessary for these processes in Dictyostelium were also inherited by Metazoa and fashioned through evolution for use within many different modes of development.

The amoebozoa are also noteworthy as representing one of the earliest branches from the last common ancestor of all eukaryotes. Each of the surviving branches of the crown group of eukaryotes provides an example of the ways in which the ancestral genome has been sculpted and adapted by lineage-specific gene duplication, divergence and deletion. Comparison between representatives of these branches promises to shed light not only on the nature and content of the ancestral eukaryotic genome, but on the diversity of ways in which its components have been adapted to meet the needs of complex organisms. The genome of Dictyostelium, as the first free-living protozoan to be fully sequenced, should be particularly informative for these analyses.

Mapping, sequencing and assembly

An international initiative to sequence the genome of Dictyostelium discoideum AX4 (refs 5, 6) was launched in 1998. The high repeat content and (A + T)-richness of the genome (the latter rendering large-insert bacterial clones unstable) posed severe challenges for sequencing and assembly. The response to these challenges was to use a whole-chromosome shotgun (WCS) strategy, partially purifying each chromosome electrophoretically and treating it as a separate project. This approach was supported by novel statistical tools to recover chromosome specificity from the impure WCS libraries, and by highly detailed HAPPY maps that provided a framework for sequence assembly. These approaches have enabled the completion of this difficult genome to a high standard, and are likely to be valuable in tackling the many other genomes that present challenges of composition and complexity.

Genome mapping

To support sequence assembly, we made high-resolution maps of the chromosomes using HAPPY mapping7,8,9, which relies on analysing the sequence content of single DNA molecules prepared by limiting dilution. A total of 3,902 markers selected mostly from the emerging shotgun data were mapped, and maps of all six chromosomes were assembled (see Methods and Table 1; see also Supplementary Fig. 1 and Supplementary Table 1).

Table 1 Sequence assembly details

Genome sequencing and assembly

Two strategies were used to recover chromosome-specific data from impure WCS libraries (see Methods). The first (for chromosomes 1, 2 and 3) used enrichment of the respective libraries as the main statistical indicator of the chromosomal assignment of contigs, and HAPPY maps were used to guide assembly. The second strategy (for chromosomes 4, 5 and most of 6) used mapping data to assign sequences to chromosomes initially, with detailed HAPPY maps being used to validate final assemblies. A 1,508-kilobase (kb) portion of chromosome 6 was sequenced as a pilot project using a combination of approaches (see Methods).
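The first strategy can be illustrated with a minimal sketch. It assumes, hypothetically, that each contig comes with a count of reads from each whole-chromosome library, and it measures enrichment as the contig's share of reads from a library divided by that library's share of all reads. The statistical tools actually used are described in Methods and are not reproduced here; the function name, the enrichment ratio and the numbers below are illustrative assumptions.

```python
from typing import Dict, Tuple


def assign_contig(read_counts: Dict[str, int],
                  library_sizes: Dict[str, int]) -> Tuple[str, float]:
    """Return the library (chromosome) most enriched among a contig's reads.

    Enrichment = (fraction of the contig's reads drawn from a library)
               / (fraction of all reads contributed by that library).
    This ratio is an illustrative assumption, not the published statistic.
    """
    contig_total = sum(read_counts.values())
    all_reads = sum(library_sizes.values())
    if contig_total == 0 or all_reads == 0:
        raise ValueError("no reads to evaluate")

    best_lib, best_score = "", float("-inf")
    for lib, size in library_sizes.items():
        if size == 0:
            continue  # an empty library cannot be enriched
        observed = read_counts.get(lib, 0) / contig_total
        expected = size / all_reads
        score = observed / expected
        if score > best_score:
            best_lib, best_score = lib, score
    return best_lib, best_score


# Hypothetical library sizes and read counts for a single contig.
library_sizes = {"chr1": 100_000, "chr2": 150_000, "chr3": 120_000}
contig_reads = {"chr1": 4, "chr2": 30, "chr3": 6}
chrom, enrichment = assign_contig(contig_reads, library_sizes)
print(chrom, round(enrichment, 2))
```

Because the libraries are impure, a contig typically has reads from every library; normalizing by library size keeps a large library from winning by depth alone.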

Repetitive tracts complicated assembly. For chromosomes 1, 2 and 3, inspection of polymorphisms, combined with HAPPY maps, allowed unambiguous assembly in many cases. For chromosomes 4, 5 and 6, low-coverage sequencing of AX4-derived yeast artificial chromosomes (YACs) alleviated the problems by providing a local data set within which the troublesome repeat element was present as a single copy. Nevertheless, some repeat tracts proved intractable and remain as gaps. Thirty-four unlinked (floating) contigs of >1 kb, totalling 225,339 base pairs (bp), remain unpositioned in the genome, but can be provisionally assigned to specific chromosomes based on their content of reads from the WCS libraries. Most or all of these floating contigs are bounded by repetitive regions. The chromosome 2 sequence in the current assembly supersedes that previously published9, having benefited from further HAPPY mapping and manual sequence finishing.

The six chromosomal assemblies span 33,817 kb (Table 1), including 156 kb in the form of clone-, sequence- and repeat gaps. Assuming that most of the floating contigs lie beyond the termini of the assemblies, the total genome size is estimated at 34,042,810 bp. In estimating the completeness of the sequence, we note that of 967 well-characterized D. discoideum genes, 957 (99%) were found initially in the assemblies. Of the remaining ten, seven (cupE, trxA, trxB, trxC, staA, staB and cinB) have close matches, suggesting that their GenBank entries may contain errors or represent alternative alleles. Only three (fcpA, wasA and roco5) had no matches in the initial assemblies, although the first two of these were recovered by searches of unincorporated sequence followed by local reassembly. Of 133,168 ‘qualified’ D. discoideum AX4 expressed sequence tags (ESTs of >200 bp and >20% G + C, and not matching mitochondrial sequence; ref. 10 and H. Urushihara et al., unpublished data), 128,207 (96.3%) are found in the assemblies (the higher proportion of missing sequences among the ESTs probably reflects the higher error rate inherent in EST data).

We conclude that the current assembly represents 95% of the chromosomal sequence (less than 1% of which is in floating contigs) and ≥99% of genes, with most of the missing sequence comprising complex or simple repeats. The most stringent test of the medium- to long-range accuracy of the assembly comes from comparison with the HAPPY maps. This is particularly true for chromosomes 4, 5 and 6, where HAPPY markers were used to nucleate contigs but not to guide their assembly or ordering, specifically to allow such a comparison to be made without circularity of argument. As can be seen, good agreement between map and sequence confirms the accuracy of the assembly (Fig. 1).

Figure 1: Chromosomal assemblies compared against HAPPY map data.

The locations of markers as found in the sequence (y axis) are plotted against their location in HAPPY maps (x axis) for chromosomes 1–6. Markers mapped to one chromosome but found in the assembled sequence of another are indicated by diamonds on the x axis. The dashed box indicates a large inverted duplication on chromosome 2: markers in this region are shown at one of their two possible map locations but are found at two points in the sequence.

Sequence characteristics of the genome

The genome is (A + T)-rich (77.57%) and has a broadly uniform composition, apart from the more (G + C)-rich repeat-dense regions (Fig. 2). On a finer scale, nucleotide composition tracks the distribution of exons (see below). Among dinucleotides, CpG is under-represented, not just in absolute terms but also relative to its isomer GpC (the former occurring only 62% as often as the latter). This bias normally reflects cytosine methylation at CpG sequences, promoting their mutation to TpG (which is over-represented relative to GpT by 38%). Hence, these observations suggest that cytosine methylation may occur in Dictyostelium, contrary to earlier findings11.
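The dinucleotide comparison above can be sketched as a simple count of overlapping dinucleotides on one strand. This is a minimal illustration on a toy string, not the genome; the function names are hypothetical and the way the original analysis handled strands or masked regions is not stated here.

```python
from collections import Counter


def dinucleotide_counts(seq: str) -> Counter:
    """Count overlapping dinucleotides along a single strand."""
    seq = seq.upper()
    return Counter(seq[i:i + 2] for i in range(len(seq) - 1))


def ratio(counts: Counter, numerator: str, denominator: str) -> float:
    """Frequency of one dinucleotide relative to another (e.g. CG vs GC)."""
    if counts[denominator] == 0:
        raise ValueError(f"{denominator} does not occur in the sequence")
    return counts[numerator] / counts[denominator]


# Toy sequence for illustration only.
toy = "GCGCATTGTGCA"
counts = dinucleotide_counts(toy)
cpg_vs_gpc = ratio(counts, "CG", "GC")  # CpG relative to its isomer GpC
tpg_vs_gpt = ratio(counts, "TG", "GT")  # TpG relative to GpT
print(cpg_vs_gpc, tpg_vs_gpt)
```

Comparing each dinucleotide with its isomer (same two bases, opposite order) controls for base composition, which matters in a genome this (A + T)-rich.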

Figure 2: The genome of Dictyostelium discoideum.

On each of the chromosomal assemblies (numbered 1–6) the diameter of the tube represents coding density (proportion of coding bases summed over both strands; centre-weighted sliding window of 100 kb; scale on right). The coloured bands on the chromosomes represent tRNAs (red), complex repeats (blue), gaps (black) and ribosomal DNA sequences (yellow). G + C content is plotted above each chromosome (centre-weighted sliding window of 100 kb; scale on left). The locations of HAPPY markers are indicated by short green ticks immediately below the distance scale. Immediately beneath each chromosome, the locations (short vertical ticks) of genes known to be upregulated (red), downregulated (blue) or whose level of expression does not change significantly (grey) in the transition from solitary to aggregative existence (expression data from ref. 91) are indicated; coloured horizontal bars below this indicate significant clusters of genes that are preferentially expressed in germinating spores (red), de-differentiating cells (green), pre-spore cells (blue) or in pre-stalk cells (yellow). The translucent ‘hourglass’ shape on chromosome 2 is centred on a large inverted duplication. The translucent cylinder on chromosome 3 indicates a typical 300-kb region, which is shown in expanded form in inset a to illustrate the clustering of identical tRNA genes (red arrows indicate polarity of tRNA genes); a 50-kb section of this region is expanded further in inset panel b, revealing the close association of TRE elements (specific family named above) with tRNAs. The translucent yellow disc on chromosome 4 indicates the location of the presumed chromosomal master copy of the rDNA element. In inset panel c, the structure of the palindromic extrachromosomal element is shown schematically. (I) Magenta bands indicate rDNA genes; green bands indicate G + C-rich regions; red end caps indicate short repetitive telomere structures; the translucent hoop indicates the central region of asymmetry. 
(II) Two chromosomal sequence contigs, each carrying an rDNA-like sequence (green or yellow; dotted lines indicate corresponding part of element) flanked by complex repeats (blue). From these contigs, we infer the probable structure (III) of the genomic master copy (grey indicates flanking sequence on chromosome 4). This structure suggests a mechanism for regenerating the extrachromosomal copies by transcription of a single strand (IV), hairpin formation and strand extension (V; broken line indicates synthesis of complementary strand), unfolding of the hairpin and synthesis of a fully complementary strand (VI; broken line indicates synthesis of second strand; telomeric caps added post-synthetically).

Simple sequence repeats are abundant and unusual

Simple sequence repeats (SSRs) are more abundant in Dictyostelium than in any other genome sequenced so far, comprising >11% of bases (Supplementary Fig. 2). In non-coding sequence, tracts of dinucleotides or longer motifs occur every 392 bp on average and comprise 6.4% of the bases. There is a bias towards repeat units of 3–6 bases, whereas dinucleotide tracts predominate in most other genomes. Homopolymer tracts are also abundant, comprising a further 16% of non-coding sequence. The base composition of non-coding SSRs and homopolymer tracts (99.2% A + T content) is even more biased than that of the surrounding sequence, suggesting that either selection or the mechanism of repeat expansion favours (A + T)-rich repeats.

Notably, SSRs are also abundant in protein-coding sequence, occurring on average every 724 bp within exons. We consider these coding SSRs in further detail below, in the context of proteins.
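As a rough illustration of how tandem repeats of dinucleotides or longer motifs can be located, the sketch below scans a sequence with a back-referencing regular expression. The thresholds (motifs of 2 to 6 bases, at least five perfect copies) and the function name are assumptions chosen for the example; the criteria used for the published figures are not given in this section. Homopolymer runs are skipped so that they can be tallied separately, as in the text.

```python
import re
from typing import List, Tuple

# Shortest motif of 2-6 bases followed by at least four further perfect copies.
SSR_PATTERN = re.compile(r"([ACGT]{2,6}?)\1{4,}")


def find_ssrs(seq: str) -> List[Tuple[int, int, str]]:
    """Return (start, end, motif) for each perfect tandem repeat found.

    Coordinates are 0-based and half-open. Runs of a single base are
    excluded because homopolymer tracts are counted separately.
    """
    hits = []
    for match in SSR_PATTERN.finditer(seq.upper()):
        motif = match.group(1)
        if len(set(motif)) == 1:
            continue  # homopolymer, e.g. AAAAAAAAAA
        hits.append((match.start(), match.end(), motif))
    return hits


# Toy sequence: an (AT)6 tract and a (CAA)5 tract separated by spacer bases.
toy = "GGCC" + "AT" * 6 + "CGCG" + "CAA" * 5 + "GG"
ssrs = find_ssrs(toy)
print(ssrs)
```

Only perfect repeats are detected by this pattern; tracts interrupted by a substitution would be reported as separate shorter hits or missed.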

Transposable elements are clustered

The genome is rich in transposable elements9,12. Completion of the sequence confirms the earlier observation that transposable elements of the same type are clustered, suggesting their preferential insertion within similar resident elements. However, none of the elements appears to use a specific sequence as a target for insertion: they insert at random within other elements of the same type. Non-long terminal repeat (LTR) retrotransposons are known to insert next to transfer RNA genes; we find many such instances (Fig. 2), but again no specific sequences were identified as insertion targets.

tRNAs are numerous and paired by specificity

The sequenced genome encodes 390 tRNAs, a number at the upper end of the eukaryotic spectrum (for example, Plasmodium falciparum = 43, Drosophila melanogaster = 284, Homo sapiens = 496). Allowing for the normal wobble rules in codon–anticodon pairing13,14, every sense codon can be decoded, apart from the rare alanine codon GCG; we infer that the missing tRNA(s) lie in one or more gaps in the sequence. We also find a possible selenocysteine tRNA in the genome, as well as corresponding selenocysteine insertion targets in two predicted proteins (see Supplementary Fig. 3).

Dictyostelium, in common only with Acanthamoeba castellanii15, has been shown to lack certain apparently essential tRNAs in its mitochondrial genome16. It therefore seems likely that at least some chromosomally encoded tRNAs (those for valine, threonine, asparagine and glycine, as well as one arginine and two serine tRNAs) are imported into mitochondria.

Although the gross distribution of tRNAs is uniform, organization of tRNAs on a finer scale is striking: about 20% occur as pairs or triplets with identical anticodons (and usually 100% sequence identity), separated by <20 kb and often by <5 kb (Fig. 2). There are 41 such groups in the genome; a random distribution would produce few, if any. This pattern is unique among sequenced genomes, and suggests a wave of recent duplications. However, tRNA pairs are found in tandem, converging and diverging orientations with comparable frequencies, suggesting no straightforward duplication mechanism; nor is there usually duplication of extensive flanking sequences. Whether the preference of TRE elements for inserting adjacent to tRNAs is related to the large number and unusual distribution of tRNAs is unclear.
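The grouping criterion described above (identical anticodons, neighbours separated by less than 20 kb) can be sketched as follows. The record layout, function name and coordinates are hypothetical, and the sketch ignores sequence identity and gene orientation, which the text also considers.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# (chromosome, start position in bp, anticodon)
TRNA = Tuple[str, int, str]


def cluster_trnas(trnas: List[TRNA], max_gap: int = 20_000) -> List[List[TRNA]]:
    """Group tRNA genes that share a chromosome and an anticodon and lie
    less than max_gap bp from the previous member of the group.

    Only groups with two or more members are returned.
    """
    by_key: Dict[Tuple[str, str], List[TRNA]] = defaultdict(list)
    for gene in trnas:
        by_key[(gene[0], gene[2])].append(gene)

    groups: List[List[TRNA]] = []
    for members in by_key.values():
        members.sort(key=lambda gene: gene[1])
        current = [members[0]]
        for gene in members[1:]:
            if gene[1] - current[-1][1] < max_gap:
                current.append(gene)
            else:
                if len(current) > 1:
                    groups.append(current)
                current = [gene]
        if len(current) > 1:
            groups.append(current)
    return groups


# Hypothetical coordinates for illustration.
example = [
    ("3", 100_000, "GTT"),
    ("3", 104_000, "GTT"),  # 4 kb from the previous GTT gene: same group
    ("3", 400_000, "GTT"),  # too distant: not grouped
    ("3", 105_000, "TTC"),  # different anticodon
    ("4", 101_000, "GTT"),  # different chromosome
]
groups = cluster_trnas(example)
print(groups)
```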

A chromosomal master copy of the extrachromosomal rDNA element

In Dictyostelium, ribosomal RNA genes lie on an 88-kb palindromic extrachromosomal element17, present at 100 copies per nucleus (Fig. 2). Evidence also exists of chromosomal copies: at least the central 3.2 kb of the element is located17 on chromosome 4, whereas chromosome 2 carries both a partial rDNA sequence and a 5S rRNA pseudogene9,18.

In this study, two unanchored contigs assigned to chromosomes 4 and 5 contained junctions between rDNA sequences and complex repeats—attempts to extend the sequence and integrate these contigs into the assemblies failed owing to the highly repetitive nature of the adjoining sequences. We postulate that these contigs represent the junctions between a ‘master copy’ of the rDNA and the remainder of chromosome 4 (Fig. 2). One contig contains sequence matching a region of (G + C)-rich repeats near the centre of the palindrome, whereas the other matches sequence near the tip of the palindrome arm, adjacent to the one unclosed gap in the rDNA element sequence17. This gap is believed to represent a tandem array of short repeats, probably added post-synthetically to the extrachromosomal elements.

The structure of this master copy suggests a mechanism for generating the extrachromosomal copies by a process of transcription, hairpin formation and second-strand synthesis (Fig. 2). This process would account for the complete absence of sequence variation between the two arms of the palindrome.

Centromeres, telomeres and rearrangements

Repeat clusters may serve as centromeres

Centromeres mobilize eukaryotic chromosomes during cell division but vary widely in their structure and organization19, making them difficult to identify. Each Dictyostelium chromosome carries a single cluster of repeats rich in DIRS (Dictyostelium intermediate repeat sequence) elements20,21 near one end22, and this sole but striking structural consistency suggests that these clusters may serve as centromeres. Although the repetitive nature of the chromosomal termini impeded their assembly, most of the cluster on chromosome 1 was assembled (Fig. 3) and shows a complex pattern of DIRS and related Skipper elements, each preferentially associated with others of the same type. Frequent insertions and partial deletions have created a mosaic with little long-range order.

Figure 3: DIRS repeat region of chromosome 1.

Complete complex repeat units are represented by coloured triangles whose size corresponds to the sequence length of the repeat unit (see key at top of figure). The bottom-left and top-right corners of each triangle represent 5′ and 3′ ends of the repeat, respectively. The arrangement of complete and partial repeat units within the first 187 kb of D. discoideum chromosome 1 is shown (bottom) by corresponding portions of the triangles; the orientation of the triangles indicates the direction in which each repeat unit lies. The vertical scale (sizes of repeat units) is the same as the horizontal scale (chromosomal distances).

In Dictyostelium cells displaying the condensed chromosomes characteristic of mitosis, DIRS-element probes hybridize to one end of each chromosome (Supplementary Fig. 4), consistent with the mapping data. DIRS-like elements in other species are more uniformly scattered along the chromosomes23, suggesting that their restricted distribution in Dictyostelium chromosomes is functionally important. Furthermore, the DIRS-containing ends of the chromosomes cluster not only during mitosis, but also during interphase (Supplementary Fig. 4), as has been observed for centromeres in Schizosaccharomyces pombe24.

rDNA sequences seem to act as telomeres

No (G + T)-rich telomere-like motifs were identified in the sequence; however, earlier findings22 suggested that the chromosomes terminate in the same (G + A)-rich repeat motif that caps the extrachromosomal rDNA element. We therefore surveyed all shotgun sequence to identify reads containing a junction between complex repetitive elements and rDNA-like sequence. Only 556 such reads were identified, of which 221 could be built into 13 contigs, which we refer to as C/R (complex-repeat/rDNA) junctions.

Of the 13 junctions, two represent known regions lying internally within the chromosomal assemblies. Of the remaining 11, one had twice the sequence coverage of the others, suggesting that it represents two distinct but identical portions of the genome (a possibility supported by the fact that another two of the junctions differed from each other by only two bases). Hence, we infer that the 11 remaining contigs represent 12 distinct junctions between repetitive elements and rDNA-like sequences—potentially one for every chromosomal end.

On the basis of their content of sequence reads from each of the whole-chromosome libraries, we assigned two of the C/R junctions to each of the chromosomes. Chromosomes 4 and 5 cannot be distinguished in this way, but three junctions, including the one believed to be present as two copies, are assigned to this chromosome pair. The point in the rDNA palindrome that is represented differs from one junction to the next (Supplementary Fig. 5), but several junctions fall at common parts of the palindrome. This may reflect a preference in the mechanism that forms or maintains the junctions, or may result from a homogenizing recombination between them or with other rDNA sequences. Certainly the low frequency of differences between the rDNA components of the junction fragments and the extrachromosomal rDNA element argues for some process that limits or rectifies mutation. At each junction, we see only the rDNA sequence that immediately adjoins the complex repeat, as further assembly is precluded by the multi-copy nature of rDNA. Therefore we cannot tell whether each junctional rDNA sequence extends to the telomere-repeat-carrying tip of the rDNA palindrome sequence, nor whether other sequences lie beyond the rDNA components.

HAPPY mapping of markers derived from six of these C/R junctions confirmed not only the chromosomal assignments that had been made based on the origins of their component sequences, but also their locations at the termini of the mapped regions of the chromosomes. For the other junctions, the absence of unique sequence features precluded such mapping. Taken as a whole, this evidence strongly suggests that rDNA-like elements form part of the telomere structure in D. discoideum, and that common mechanisms stabilize both the extrachromosomal rDNA element and the chromosomal termini.

Chromosome 2 duplication

Chromosome 2 of D. discoideum AX4 carries a perfect inverted 1.51-megabase (Mb) duplication (Fig. 2; see also refs 9, 25). This duplication, containing 608 genes, is known25 to be absent from the wild-type isolate NC4 and from one of its direct descendants (AX2), but present in another (AX3); AX4 in turn is derived from AX3. The sequences adjoining the right-hand end of the duplication—a partial copy of a DIRS element (and a partial DDT-A element) and a region identical to part of the rDNA palindrome, both at about 3.74 Mb (Fig. 2)—have been implicated in centromeric and telomeric functions, respectively, elsewhere in the genome.

We propose that this duplication arose from a ‘breakage-fusion-bridge’ cycle as first described in maize26 and since observed in many genomes. The nearby DIRS and rDNA components, in this view, represent abortive attempts to stabilize the halves of the broken chromosome by establishing new telomeres and centromeres, followed by re-fusion of the pieces to create a restored and enlarged chromosome (Supplementary Fig. 6).

Chromosome 2 (the largest of the chromosomes, even discounting the duplication in AX4) may be prone to breakage: in the Bonner isolate of NC4, maintained in vegetative growth for 50 years, chromosome 2 is represented by two smaller fragments27. Comparison with more recent data22 indicates that the break point in NC4-Bonner lies in the same region as the duplication in AX4, suggesting that NC4-Bonner underwent the early stages of this process, but that the chromosome fragments were stabilized and maintained after the initial breakage. Preliminary results (data not shown) from HAPPY mapping also suggest that although wild-type isolates V12M2 and NC4 both lack the duplication seen in AX4, NC4 may carry a duplication of 300 kb near the opposite end of chromosome 2.

Content and organization of the proteome

Prediction of protein-coding genes (see Methods) was performed on the complete set of chromosomes and floating contigs (Table 2). In assessing the completeness and accuracy of the predictions, we find that of the 957 well-characterized D. discoideum genes that are present in the current sequence, 823 (86%) are predicted as transcripts with structures matching the experimentally determined ones. For a further 123 (13%), the predicted transcript differs from the experimentally determined one, about one-half of these differing only in their 5′ boundary; the remaining 11 (1%), although present in the sequence, were not predicted as transcripts. Similarly, of the 128,207 qualified ESTs present in the current sequence, 127,097 (99.1%) fall within predicted transcripts. Combining our estimate of sequence coverage (above) with these estimates of the success of gene prediction, we infer that approximately 98% of all D. discoideum genes are present in the predicted set.

Table 2 Comparison between the predicted protein-coding gene set of D. discoideum and those of other organisms

The level of overprediction, conversely, is harder to estimate: prediction was performed generously to ensure that most true genes were represented. Of the 13,541 predicted proteins, 47.5% are represented by qualified ESTs, reflecting the inevitable bias in EST sampling. Among the shortest predicted proteins, fewer are represented by ESTs (for example, 21% of those of <60 amino acids); this is at least partly due to a higher level of overprediction. On the basis of the simplifying assumption that 50% of all genes coding for proteins of <100 amino acids are mis-predictions, we estimate the true number of genes at roughly 12,500. This number is closer to that seen in multicellular organisms than to that of most unicellular eukaryotes (Table 2). The same relative complexity is seen in the total number of amino acids encoded by the respective genomes; this measure of complexity is less affected by the inclusion of shorter (and hence more dubious) gene predictions. Introns in Dictyostelium are few and short, and intergenic regions are small, producing a compact genome of which 62% encodes protein.

Genes are distributed approximately uniformly across the genome (Fig. 2). Although we do not see widespread clustering of genes with coordinated expression patterns (see Methods), we do find statistically significant (P < 0.01) clusters of genes expressed predominantly at some developmental stages or in specific cell types (Fig. 2).

(A + T)-richness influences protein composition and codon usage

Codon usage in Dictyostelium favours codons of the form NNT or NNA over their NNG or NNC synonyms, the bias being even greater than for the (A + T)-rich Plasmodium genome. Comparison of tRNA and codon frequencies (Supplementary Table 2) reveals a similar picture to that in human28 and other eukaryotes, suggesting that the same use is made of ‘wobble’ and of base modifications (for example, of adenine to inosine in some tRNAs) to expand the effective repertoire of tRNAs.

As in Plasmodium29, the extreme (A + T)-richness is reflected not just in the choice of synonymous codons, but also in the amino acid composition of the proteins. Amino acids encoded solely by codons of the form WWN (where W indicates A or T and N indicates any base; these are Asn, Lys, Ile, Tyr and Phe) are much more common in Dictyostelium proteins than in human ones; the reverse is true for those encoded solely by SSN codons (where S indicates C or G; these are Pro, Arg, Ala and Gly).

Geometry reflects phylogeny—duplications in the genome

The predicted gene set of Dictyostelium is rich in relatively recently duplicated genes. Of the 13,498 predicted proteins analysed, 3,663 fall into 889 families clustered by BLASTP similarities of e < 10⁻⁴⁰. Most (538) families contain only two members, but 351 families contain between three and 81 proteins (Supplementary Table 3). Hence, 2,774 (20%) of all predicted proteins have arisen by relatively recent duplication, potentially accounting for much of Dictyostelium's excess gene number compared with typical unicellular eukaryotes.
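The step from 3,663 family members to 2,774 duplicates follows from counting one founding gene per family: a family of n members implies n − 1 duplications. The short check below reproduces that arithmetic from the figures quoted in the text; the variable names are illustrative. Because it is not stated which total the 20% refers to, the fraction is computed against both the 13,498 proteins analysed and the 13,541 predicted proteins.

```python
# Figures quoted in the text.
proteins_analysed = 13_498
proteins_predicted = 13_541
proteins_in_families = 3_663
families = 889
two_member_families = 538

# Families with three or more members.
larger_families = families - two_member_families

# One founder per family; every other member is counted as a duplicate.
duplicates = proteins_in_families - families

fraction_of_analysed = duplicates / proteins_analysed
fraction_of_predicted = duplicates / proteins_predicted

print(larger_families, duplicates)
print(f"{fraction_of_analysed:.1%} of analysed, "
      f"{fraction_of_predicted:.1%} of predicted proteins")
```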

We tried to infer the mechanisms by which such duplications arise and propagate in the genome. Where members of a family are clustered on one chromosome, the physical distance between family members often (23 out of 86 families examined) correlates strongly with their evolutionary divergence (see Methods). Where a family is split between different chromosomes, members on the same chromosome are often (23 out of 50 families examined) more related to each other than to members on different chromosomes; the reverse is never observed.
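One way to picture the first comparison is as a correlation, over all pairs of genes in a clustered family, between physical separation and sequence divergence. The sketch below uses a plain Pearson coefficient on hypothetical values; the measure of divergence and the test actually applied are described in Methods and may differ. Pairwise distances are not independent of one another, so a coefficient computed this way is descriptive only.

```python
from itertools import combinations
from math import sqrt
from typing import Dict, List, Tuple


def pearson(xs: List[float], ys: List[float]) -> float:
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation undefined for constant input")
    return sxy / sqrt(sxx * syy)


def distance_divergence_correlation(
        positions: Dict[str, int],
        divergence: Dict[Tuple[str, str], float]) -> float:
    """Correlate physical distance with divergence over all gene pairs.

    `divergence` is keyed by (gene_a, gene_b) with gene_a < gene_b
    in sorted order of gene names.
    """
    physical, diverged = [], []
    for a, b in combinations(sorted(positions), 2):
        physical.append(abs(positions[a] - positions[b]))
        diverged.append(divergence[(a, b)])
    return pearson(physical, diverged)


# Hypothetical family of three genes on one chromosome.
positions = {"g1": 0, "g2": 5_000, "g3": 12_000}
divergence = {("g1", "g2"): 0.10, ("g1", "g3"): 0.30, ("g2", "g3"): 0.18}
r = distance_divergence_correlation(positions, divergence)
print(round(r, 3))
```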

These findings suggest that three processes combine to account for most of the duplications in Dictyostelium: tandem duplication, local inversion and interchromosomal exchange. In this model, gene families expand by tandem duplication of either single genes or blocks containing several consecutive genes, as in an earlier model30; inversions within these expanding clusters may reverse local gene order. An elegant illustration of these two processes is provided by a cluster of acetyl-CoA synthetases on chromosome 2 (Fig. 4). The third process (exchange of segments between chromosomes) may fragment these clusters at any stage. If such an interchromosomal exchange splits a gene family early in its expansion, then each of the two resulting subfamilies has a long subsequent period of evolution independent of the other, so similarities will be greatest between genes on the same chromosome. If, conversely, the split occurs later, then all family members, whether on the same chromosome or on different chromosomes, will tend to resemble each other equally closely. We cannot exclude the possibility of duplication occasionally creating a second copy of a gene, or group of genes, directly on a different chromosome from the first. However, all instances that we have examined can be accounted for without such intermolecular duplication.

Figure 4: Phylogeny of gene family members compared to their physical order.

The optimally parsimonious phylogenetic tree of 11 acetyl-CoA synthase genes, computed using the PHYLIP module ‘Protpars’ (http://evolution.gs.washington.edu/phylip/doc/protpars.html), is shown to the left; dictyBase identification numbers are shown at the end of each branch. The graph (right) indicates the arrangement on chromosome 2 of these genes (solid black boxes; gaps indicate introns; pointed ends indicate direction of transcription). Chromosomal distance scale is given along the bottom and other unrelated genes in the same region are indicated in grey above the x axis. The correspondence between phylogeny and physical order implies that the cluster has arisen by a series of segmental tandem duplications and local inversions in parallel with sequence divergence.

Amino acid repeats

Tandem repeats of trinucleotides (and of motifs of 6, 9, 12, and so on, bases) are unusually abundant in Dictyostelium exons and naturally correspond to repeated sequences of amino acids. However, at the protein level the situation is even more extreme: there are many further amino acid repeats that use different synonymous codons, and so do not arise from perfect nucleotide repeats. Among the predicted proteins, there are 9,582 SSRs of amino acids (homopolymers of length ≥10, or ≥5 consecutive repeats of a motif of two or more amino acids). Of these, the most striking are polyasparagine and polyglutamine tracts of ≥20 residues, present in 2,091 of the predicted proteins. Also abundant are low-complexity regions such as QLQLQQQQQQQLQLQQ: there are 2,379 tracts of ≥15 residues composed of only two different amino acids. In total, repeats or simple-sequence tracts of amino acids (even by these conservative definitions) occur in 34% of predicted proteins and encode 3.3% of all amino acids.

It seems likely that these repeats have arisen through nucleotide expansion, but have been selected at the protein level. Evidence for selection at the protein level is that any given trinucleotide repeat occurs predominantly in only one of the three reading frames. For example, the repeat …ACAACAACAACA… is usually translated as polyglutamine ([CAA]n) rather than polythreonine ([ACA]n) or polyasparagine ([AAC]n). Further evidence comes from the many trinucleotide repeats that have apparently mutated to produce only synonymous codons (for example, …GATGACGATGATGAC…, translated as polyaspartate). Moreover, the distribution of repeats and simple-sequence tracts is nonrandom: most proteins either have no such features (66% of proteins) or have two or more (18% of proteins), suggesting that they are tolerated only in certain types of protein. The polyasparagine- and polyglutamine-containing proteins appear to be over-represented in protein kinases, lipid kinases, transcription factors, RNA helicases and messenger RNA binding proteins such as spliceosome components (Supplementary Fig. 9). Protein kinases and transcription factors are also over-represented in the polyasparagine- and polyglutamine-containing proteins of Saccharomyces cerevisiae, so it is possible that these homopolymers serve some functional role in these protein classes. A more detailed analysis of amino acid homopolymers is given in Supplementary Tables 4–6 and Supplementary Figs 7–10.
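The reading-frame argument can be checked directly with the standard genetic code. The sketch below translates the two example repeats from the text; only the five codons needed are included in the table, and the function name is illustrative.

```python
from typing import List

# Subset of the standard genetic code covering the codons in the examples.
CODONS = {
    "ACA": "Thr",
    "CAA": "Gln",
    "AAC": "Asn",
    "GAT": "Asp",
    "GAC": "Asp",
}


def translate(seq: str, frame: int) -> List[str]:
    """Translate complete codons of seq starting at offset frame (0, 1 or 2)."""
    return [CODONS[seq[i:i + 3]] for i in range(frame, len(seq) - 2, 3)]


repeat = "ACAACAACAACA"
frame0 = translate(repeat, 0)  # ACA codons
frame1 = translate(repeat, 1)  # CAA codons
frame2 = translate(repeat, 2)  # AAC codons
print(frame0, frame1, frame2)

# An imperfect nucleotide repeat that still encodes a perfect homopolymer.
mixed = translate("GATGACGATGATGAC", 0)
print(mixed)
```

The same nucleotide tract therefore encodes polythreonine, polyglutamine or polyasparagine depending only on the frame, which is why a strong preference for one frame points to selection acting on the protein rather than on the DNA.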

Phylogeny, evolution and comparative proteomics

The organisms that diverged from the last common ancestor of all eukaryotes followed different evolutionary paths, but all retained the basic properties of eukaryotic cells. Their genomes have been sculpted by chromosomal deletions and duplications that led to lineage-specific gene family expansions, reductions and losses, as well as genes with new functions31,32. Our analysis of Dictyostelium's proteome shows that similar mechanisms have shaped its genome, augmented by horizontal gene transfer from bacterial species.

Phylogeny of eukaryotes based on complete proteomes

Using morphological criteria, early workers were unsure whether to classify Dictyostelids as fungi or protozoa33. Molecular methods indicated that they were amoebozoa and also suggested that Dictyostelium diverged from the line leading to animals at about the same time as plants34,35. A study of more than 100 proteins suggested that Dictyostelium diverged after the plant–animal split, but before the divergence of the fungi36. The recent finding of a gene fusion encoding three pyrimidine biosynthetic enzymes, shared only by Dictyostelium, fungi and Metazoa, indicates that the amoebozoa are a true sister group of the fungi and Metazoa37.

To examine the phylogeny of Dictyostelium on a genomic scale, we applied an improved method for predicting orthologous protein clusters to complete eukaryotic proteomes38 (for details, see Supplementary Information). The data were used to construct a phylogenetic tree that confirms the divergence of Dictyostelium along the branch leading to the Metazoa soon after the plant–animal split (Fig. 5). Despite the earlier divergence of Dictyostelium, many of its proteins are more similar to human orthologues than are those of S. cerevisiae, probably due to higher rates of evolutionary change along the fungal lineage. Whether the greater similarity between amoebozoa and Metazoa proteins translates into a generally higher degree of functional conservation between them compared to the fungi remains to be seen.

Figure 5: Proteome-based eukaryotic phylogeny.

The phylogenetic tree was reconstructed from a database of 5,279 orthologous protein clusters drawn from the proteomes of the 17 eukaryotes shown, and was rooted on 159 protein clusters that had representatives from six archaebacterial proteomes. Tree construction, the database of protein clusters and a model of protein divergence used for maximum likelihood estimation are described in Supplementary Information. The relative lengths of the branches are given as Darwins (where 1 Darwin = 1/2,000 of the divergence between S. cerevisiae and humans). Species that are not specified are Plasmodium falciparum (malaria parasite), Chlamydomonas reinhardtii (green alga), Oryza sativa (rice), Zea mays (maize), Takifugu rubripes (fish) and Anopheles gambiae (mosquito).

Proteins shared by Dictyostelium and major organism groups

To examine shared functions, we identified eukaryote-specific Superfamily and Pfam protein domains, and sorted them according to their presence or absence within 12 completely sequenced genomes to arrive at their distribution among the major organismal groups (see Supplementary Tables 7–10 and Supplementary Fig. 11). Plants, Metazoa, fungi and Dictyostelium all share 32% of the eukaryotic Pfam domains (Fig. 6). The protein domains present in Dictyostelium, Metazoa and fungi, but absent in plants, are interesting because they probably arose soon after plants diverged and before Dictyostelium diverged from the line leading to animals. The major classes of domains in this group of proteins include those involved in small and large G-protein signalling (for example, RGS proteins), cell cycle control and other domains involved in signalling (Supplementary Tables 8 and 9). It also appears that glycogen storage and usage arose as a metabolic strategy soon after the plant–animal divergence, because glycogen synthetase seems to have appeared in this evolutionary interval.
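The sorting step described above amounts to reducing each domain to a presence/absence pattern across the four organismal groups and counting the patterns. The following is a minimal sketch under stated assumptions: the grouping follows the genomes named in the legend to Fig. 6, a domain is treated as present in a group if it occurs in at least one member genome, and the function names and input layout are hypothetical.

```python
from collections import Counter

# Group membership for the 12 genomes, following the legend to Fig. 6.
GROUPS = {
    "Dictyostelium": {"D. discoideum"},
    "Metazoa": {"H. sapiens", "T. rubripes", "C. elegans", "D. melanogaster"},
    "fungi": {"N. crassa", "A. nidulans", "S. pombe", "S. cerevisiae"},
    "plants": {"A. thaliana", "O. sativa", "C. reinhardtii"},
}


def group_pattern(genomes_with_domain):
    """Return the set of groups in which a domain occurs in at least one genome (assumed rule)."""
    return frozenset(
        group for group, members in GROUPS.items() if members & genomes_with_domain
    )


def distribution(domain_to_genomes):
    """Count domains per presence/absence pattern across the four groups."""
    return Counter(group_pattern(genomes) for genomes in domain_to_genomes.values())
```

Each key of the resulting counter corresponds to one region of a four-set diagram; for example, the pattern containing Dictyostelium, Metazoa and fungi but not plants identifies the domains discussed above.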

Figure 6: Distribution of Pfam domains among eukaryotes.

The number of eukaryote-specific Pfam domains present in each group of eukaryotic organisms is shown. The boxed numbers are the domains that are present in Dictyostelium and the other numbers are those domains that are absent from Dictyostelium. The animals are H. sapiens, T. rubripes, C. elegans, D. melanogaster; the fungi are N. crassa, Aspergillus nidulans, S. pombe and S. cerevisiae; and the plants are Arabidopsis thaliana, O. sativa and C. reinhardtii. A complete listing of the domains can be found in the Supplementary Information.

Particularly notable are the cases where otherwise ubiquitous domains appear to be completely absent in one group or another. For instance, Dictyostelium seems to have lost the genes that encode collagen domains, the circadian rhythm control protein timeless and basic helix–loop–helix transcription factors (Supplementary Table 7). Metazoa, on the other hand, appear to have lost receptor histidine kinases that are common in bacteria, plants and fungi, whereas Dictyostelium has retained and expanded its complement to 14 members39.

Orthologues of human disease genes

An important motivation for sequencing the Dictyostelium genome was to aid the discovery of proteins that would facilitate studies of orthologues in humans, with possible implications for human health. Although orthologues of human genes implicated in disease are of course present in many species, Dictyostelium provides a potentially valuable vehicle for studying their functions in a system that is experimentally tractable and intermediate in complexity between the yeasts and the higher multicellular eukaryotes. To assess the usefulness of Dictyostelium for investigating the functions of genes related to human disease, we used the protein sequences of 287 confirmed human disease genes as queries and carried out a systematic search for putative orthologues in the Dictyostelium proteome40. At a stringent threshold value of e ≤ 10^-40, we identified 64 such proteins. Of these, 33 were similar in length to the human protein and had similarity extending over >70% of the two proteins (Table 3). The number of Dictyostelium orthologues of human disease genes is lower than in D. melanogaster or Caenorhabditis elegans but higher than in S. cerevisiae or S. pombe. Of the 33 putative orthologues of confirmed human disease genes in Dictyostelium, five are absent in both S. cerevisiae and S. pombe (e-value ≤ 10^-30), a further four are absent from S. cerevisiae and two are not found in S. pombe.
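The two-stage filter described above can be sketched as follows. The record layout, the function names and the length-ratio cutoff used to express "similar in length" are assumptions made for illustration; only the e-value threshold and the 70% coverage requirement come from the text.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """Best alignment of one human query to a Dictyostelium protein (hypothetical layout)."""
    query_len: int        # length of the human protein
    subject_len: int      # length of the Dictyostelium protein
    query_aligned: int    # residues of the query covered by the alignment
    subject_aligned: int  # residues of the subject covered by the alignment
    evalue: float


def passes_evalue(hit, threshold=1e-40):
    """First stage: stringent e-value threshold."""
    return hit.evalue <= threshold


def is_full_length_match(hit, min_cover=0.70, max_len_ratio=1.5):
    """Second stage: similarity over >70% of both proteins and comparable lengths.

    max_len_ratio is an assumed stand-in for 'similar in length'.
    """
    cover_query = hit.query_aligned / hit.query_len
    cover_subject = hit.subject_aligned / hit.subject_len
    ratio = max(hit.query_len, hit.subject_len) / min(hit.query_len, hit.subject_len)
    return cover_query > min_cover and cover_subject > min_cover and ratio <= max_len_ratio


def putative_orthologues(hits):
    """Keep hits that pass both stages."""
    return [h for h in hits if passes_evalue(h) and is_full_length_match(h)]
```

Applying passes_evalue alone corresponds to the larger set of 64 proteins; adding is_full_length_match corresponds to the stricter set of 33.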

Table 3 Dictyostelium genes related to human disease genes

Horizontal gene transfer

The acquisition of genes by horizontal transfer from one species to another (HGT) has become increasingly recognized as a mechanism of genome evolution41,42,43. We identified 18 potential instances of HGTs, by screening Dictyostelium protein domains that are similar to bacteria-specific Pfam domains and have phyletic relationships consistent with HGT (see Supplementary Information). The transferred domains appear to have replaced functions, added new functions or evolved into new functions (Table 4). The thy1 gene, which encodes an alternative form of thymidylate synthase (ThyX), appears to have replaced the endogenous gene, as the conventional thymidylate synthase (ThyA) is not present44. Other HGT domains also have established functions, which are presumably retained and give Dictyostelium the ability to degrade bacterial cell walls (dipeptidase), scavenge iron (siderophore), or resist the toxic effects of tellurite in the soil (terD). Still other horizontally transferred domains have become embedded within Dictyostelium genes that encode larger proteins. An example of this is the Cna B domain that is found within four large predicted proteins, one of which, colossin A, is predicted to be 1.2 MDa (Supplementary Fig. 12).

Table 4 Candidate horizontal gene transfers from bacteria

Dictyostelium ecology

Dictyostelium faces many complex ecological challenges in the soil. Amoebae, fungi and bacteria compete for limited resources in the soil while defending themselves against predation and toxins. For instance, the nematode C. elegans is a competitor for bacterial food and a predator of Dictyostelium amoebae, but also a potential dispersal agent for Dictyostelium spores45. Dictyostelium has expanded its repertoire of several protein classes that are probably crucial for such interspecies interactions and for survival and motility in this complex ecosystem.

Polyketide synthases

A small number of natural products have already been identified from Dictyostelium, but the gene content suggests that it is a prolific producer of such molecules. Some of them may act as signals during development, such as the dichlorohexanophenone DIF-1, but others are likely to mediate currently unknown ecological interactions46. Many antibiotics and secondary metabolites destined for export are produced by polyketide synthases, modular proteins of around 3,000 amino acids47. We identified 43 putative polyketide synthases in Dictyostelium (see Supplementary Information). By contrast, S. cerevisiae completely lacks polyketide synthases and Neurospora crassa has only seven. Furthermore, two of the Dictyostelium proteins have an additional chalcone synthase domain, representing a type of polyketide synthase most typical of higher plants and found to be exclusively shared by Dictyostelium, fungi and plants. In addition to polyketide synthases, the predicted proteome has chlorinating and dechlorinating enzymes as well as O-methyl transferases, which could increase the diversity of natural products made. Thus, Dictyostelium appears to have a large secondary metabolism, which warrants further investigation.

ABC transporters

ATP-binding cassette (ABC) transporters are prevalent in the proteomes of soil microorganisms and are thought to provide resistance to xenobiotics through their ability to translocate small-molecule substrates across membranes against a substantial concentration gradient48,49,50,51. There are 66 ABC transporters encoded by the genome, which can be classified according to the subfamilies defined in humans (ABCA, ABCB, ABCC, ABCD, ABCE, ABCF and ABCG) based on domain arrangement and signature sequences52. At least 20 of them are expressed during growth and are probably involved in detoxification and the export of endogenous secondary metabolites.

Cellulose degradation

Many of the predicted cellulose-degrading enzymes in the proteome (see Supplementary Information) that have secretion signals are expressed in growing cells that do not produce cellulose53. The proteome also contains one xylanase enzyme that can degrade the xylan polymers that are often found associated with the cellulose of higher plants. Perhaps Dictyostelium uses these enzymes to degrade plant tissue into particles that are then taken up by cells. These enzymes may also aid in the breakdown of cellulose-containing microorganisms upon which Dictyostelium feeds. Alternatively, these enzymes may promote the growth of bacteria that can serve as food, because Dictyostelium's habitat also contains cellulose-degrading bacteria.

Specializations for cell motility

During both growth and development, Dictyostelium amoebae display motility that is characteristic of human leukocytes54. As a consequence, studies of Dictyostelium have contributed significantly to cytoskeleton research55. Dictyostelium's survival depends on an ability to efficiently sense, track and consume soil bacteria using sophisticated systems for chemotaxis and phagocytosis. Its multicellular development depends on chemotactic aggregation of individual amoebae and the coordinated movement of thousands of cells during fruiting body morphogenesis. The proteome reveals an astonishing assortment of proteins that are used for robust, dynamic control of the cytoskeleton during these processes. As suggested by functional parallels to human cells, these proteins are most similar to metazoan proteins in their variety and domain arrangements (Fig. 7; see also Supplementary Table 11). Surprisingly, although the actin cytoskeleton has been studied for over 25 years, 71 putative actin-binding proteins apparently escaped classical methods of discovery. For example, actobindins had not been previously recognized in Dictyostelium. Curiously, the actin depolymerization factor (ADF) and calponin homology (CH) domain proteins appear to have diversified by domain shuffling, a substantial fraction having domain combinations unique to Dictyostelium (Supplementary Table 12 and Supplementary Fig. 13). In addition to 30 actin genes, there are also orthologues of all actin-related protein (ARP) classes present in mammals, as well as three founding members of a new class (Supplementary Fig. 14).

Figure 7: Microfilament system proteins.

Proteins with probable interactions with the actin cytoskeleton are tabulated by their documented or predicted functions. Coloured boxes indicate the presence of a protein related to the Dictyostelium (D) protein in Metazoa (M), fungi (F) or plants (P). Dictyostelium-specific proteins have no recognizable relatives or differ from relatives due to extensions or unusual domain compositions. For details see Supplementary Information. Actin-binding modules: ACT, actin fold; ADF, actin depolymerization factor/cofilin-like domain; CAP, capping protein fold; CH, calponin homology domain; EVH, Ena/VASP homology domain 2; FH2, formin homology 2 domain; GEL, gelsolin repeat domain; KELCH, Kelch repeat domain; MYO, myosin motor domain; PRO, profilin fold; TAL, the I/LWEQ actin-binding domain of talin and related proteins; TRE, trefoil domain; VHP, villin head piece; WH2, Wiskott Aldrich syndrome homology region 2.

Cytoskeletal remodelling during chemotaxis and phagocytosis is regulated by a considerable number of upstream signalling components. Of the 18 Rho family GTPases in Dictyostelium, some are clear Rac orthologues and one belongs to the RhoBTB subfamily56. However, the Cdc42 and Rho subfamilies characteristic of Metazoa and fungi are absent, as are the Rho subfamily effector proteins. The activities of these GTPases are regulated by two members of the RhoGDI family, by components of ELMO1–DOCK180 complexes and by a large number of proteins carrying RhoGEF and RhoGAP domains (> 40 of each), most of which show domain compositions not found in other organisms. Remarkably, Dictyostelium appears to be the only lower eukaryote that possesses class I phosphatidylinositol-3-OH kinases, which are at the crossroad of several critical signalling pathways (for details of the regulators and their effectors, see Supplementary Table 13)57. The diverse array of these regulators and the discovery of many additional actin-binding proteins suggest that there are many aspects of cytoskeletal regulation that have yet to be explored.

Multicellularity and development

The evolution of multicellularity was arguably as significant as the origin of the eukaryotic cell in enabling the diversification of life. The common unicellular ancestor of the crown group of organisms must have possessed the basic machinery to regulate nutrient uptake, metabolism, cellular defence and reproduction, and it is likely that these mechanisms were adapted to integrate the functions of cells in multicellular organisms. Dictyostelium achieved multicellularity through a different evolutionary route compared with plants and animals, yet the ancestors of these respective groups probably started with the same endowment of genes and faced the same problem of achieving cell specialization and tissue organization.

When starved, Dictyostelium develops as a true multicellular organism, organizing distinct tissues within a motile slug and producing a fruiting body comprised of a cellular, cellulosic stalk supporting a bolus of spores4. Thus, Dictyostelium has evolved differentiated cell types and the ability to regulate their proportions and morphogenesis. A broad survey of proteins required for multicellular development shows that Dictyostelium has retained cell adhesion and signalling modules normally associated exclusively with animals, whereas the structural elements of the fruiting body and terminally differentiated cells clearly derive from the control of cellulose deposition and metabolism now associated with plants. The Dictyostelium genome offers a first glimpse of how multicellularity evolved in the amoebozoan lineage. In the following sections, we consider some of the systems that are particularly relevant to cellular differentiation and integration in a multicellular organism.

Signal transduction through G-protein-coupled receptors

The needs of multicellular development add greatly to those of chemotaxis in demanding dynamically controlled and highly selective signalling systems. G-protein-coupled cell surface receptors (GPCRs) form the basis of such systems in many species, allowing the detection of a variety of environmental and intra-organismal signals such as light, Ca2+, odorants, nucleotides and peptides. They are subdivided into six families, which, despite their conserved secondary domain structure, do not share significant sequence similarity58. Until recently, in Dictyostelium only the seven CAR/CRL (cAMP receptor/ cAMP receptor-like) family GPCRs had been examined in detail59,60. Surprisingly, a detailed search uncovered 48 additional putative GPCRs of which 43 can be grouped into the secretin (family 2), metabotropic glutamate/GABAB (family 3) and the frizzled/smoothened (family 5) families of receptors (Fig. 8; see also Supplementary Information). The presence of family 2, 3 and 5 receptors in Dictyostelium was surprising because they had been thought to be specific to animals. Their occurrence in Dictyostelium suggests that they arose before the divergence of the animals and fungi and were later lost in fungi, and that the radiation of GPCRs pre-dates the divergence of the animals and fungi. The secretin family is particularly interesting because these proteins were thought to be of relatively recent origin, appearing closer to the time of the divergence of animals61. The putative Dictyostelium secretin GPCR does not contain the characteristic GPCR proteolytic site, but its transmembrane domains are clearly more closely related to secretin GPCRs than to other families (Fig. 8). 
Many downstream signalling components that transduce GPCR signals could also be recognized in the proteome, including heterotrimeric G-protein subunits (fourteen Gα, two Gβ and one Gγ proteins) and seven regulators of G-protein signalling (RGS) that share highest similarity with the R4 subfamily of mammalian RGS proteins.

Figure 8: The G-protein-coupled receptors.

A CLUSTALX alignment of the sequences encompassing the seven transmembrane domains of all Dictyostelium GPCRs, and selected GPCRs from other organisms, was used to create an unrooted dendrogram with the TreeView program. A black circle marks the innermost node of each branch supported by >60% bootstraps. The hash symbol indicates that this gene model has to be split, and the asterisk indicates a putative pseudogene. DictyBase identifiers (DDB) were used for the newly discovered Dictyostelium receptors and SwissProt identifiers for all other receptors. A.th., A. thaliana; B.t., Bos taurus; CAR/CRL, cAMP receptor/cAMP receptor-like; C.e., C. elegans; DICDI, D. discoideum; D.m., D. melanogaster; G.c., Geodia cydonium; P.p., Polysphondylium pallidum; X.l., Xenopus laevis.

SH2 domain signalling

In animals, SH2 domains act as regulatory modules of proteins in intracellular signalling cascades, interacting with phosphotyrosine-containing peptides in a sequence-specific manner. Dictyostelium is the only organism, outside of the animal kingdom, where SH2 domain phosphotyrosine signalling has been shown to occur62. What has been lacking in Dictyostelium is evidence of the other components of such signalling pathways; that is, equivalents of the metazoan SH2-domain-containing receptors, adaptors and targeting proteins. Three newly predicted proteins are strong candidates for these roles (Supplementary Fig. 15). One of them, CblA, is highly related to the metazoan Cbl proto-oncogene product. This is entirely unexpected because it is the first time that a Cbl homologue has been observed outside the animal kingdom. The Cbl protein is a ‘RING finger’ ubiquitin-protein ligase that recognizes activated receptor tyrosine kinases and various molecular adaptors63. Remarkably, the Cbl SH2 domain went unrecognized in the protein sequence, but it was revealed when the crystal structure of the protein was determined64. Thus, although SH2 domain proteins are less prevalent in Dictyostelium, there is the potential for the kind of complex interactions that typify metazoan SH2 signalling pathways.

ABC transporter signalling

Dictyostelium, like other organisms, has adapted ABC transporters to control various developmental signalling events. Several ABC transporters (TagA, TagB and TagC) are used for peptide-based signalling, similar to that previously observed for mating in S. cerevisiae and antigen presentation in human T cells65,66,67. The novel domain arrangement of the Tag proteins—a serine protease domain fused to a single transporter domain—suggests that they have been selected for improved efficiency in signal production. Additional ABC transporters are needed for cell fate determination in Dictyostelium, suggesting that this ubiquitous protein family may be used in similar developmental contexts within many different species68.

Kinases and transcription factors

Much cellular signal transduction involves the regulation of protein function through phosphorylation by protein kinases, often leading to the reprogramming of gene transcription in response to extracellular signals. The Dictyostelium proteome contains 295 predicted protein kinases, representing as wide a spectrum of kinase families as that observed in Metazoa (Supplementary Tables 14–16 and Supplementary Fig. 16). Given the presence of SH2-domain-based signalling it was surprising that no receptor tyrosine kinases could be recognized in the genome. However, Dictyostelium has a number of other receptor kinases, such as the histidine kinases and a group of eight novel putative receptor serine/threonine kinases, which are involved in nutrient and starvation sensing69. Most of the ubiquitous families of transcription factors are represented in Dictyostelium, with the notable exception of the otherwise ubiquitous basic helix–loop–helix proteins (Supplementary Table 17 and Supplementary Fig. 17). Compared with other eukaryotes, Dictyostelium appears to have fewer transcription factors relative to the total number of genes, suggesting that many transcription factors have yet to be defined, or that the activities of a smaller repertoire of factors are combined and controlled to achieve complex regulation (Supplementary Table 18 and Supplementary Fig. 18).

Cell adhesion

Throughout Dictyostelium development, cells must modulate their adhesiveness to the substrate, to the extracellular matrix and to other cells in order to create tissues and carry out morphogenesis. To accomplish this, Dictyostelium uses a surprising number of components that have been normally only associated with animals. For example, disintegrin proteins regulate cell adhesiveness and differentiation in a number of Metazoa, and at least one Dictyostelium disintegrin, AmpA, is needed throughout development for cell fate specification70. We also identified distant relatives of vinculin and α-catenin—normally associated with adherens junctions—which support the idea that the epithelium-like sheet of cells that surrounds the stalk tube contains such junctions71. Consistent with this, the Dictyostelium genome encodes numerous proteins previously described as components of adherens junctions in Metazoa, such as β-catenin (Aardvark), α-actinin, formins, VASP and myosin VII.

In animals, tandem repeats of immunoglobulin, cadherin, fibronectin III or E-set domains are often present in cell adhesion proteins, although their common protein fold pre-dates the emergence of eukaryotes. EGF/laminin domains are also found in adhesion proteins but, before the analysis of the Dictyostelium genome, no non-metazoan was known to have more than two EGF repeats in a single predicted protein. Dictyostelium has 61 predicted proteins containing repeated E-set or EGF/laminin domains, and many of these contain additional domains that suggest they have roles in cell adhesion or cell recognition, such as mannose-6-phosphate receptor, fibronectin III, or growth factor receptor domains and transmembrane domains (Fig. 9). In support of this idea, four of these proteins (LagC, LagD, AmpA and ComC) have been shown to be required for cell adhesion and signalling during development70,72,73,74.

Figure 9: Putative adhesion/signalling proteins.

Proteins containing repeated EGF/laminin and/or E-set SCOP Superfamily domains are classified into groups containing mannose-6-phosphate receptor, mainly EGF/laminin, mainly E-set, or combinations of domains. Most of these proteins have predicted transmembrane domains and so are expected to be cell surface proteins. ComC, LagC and LagD are proteins that have been characterized to have adhesion and/or signalling functions during multicellular development72–74. C2, calcium-dependent lipid binding; Fn 3, fibronectin type III; GFR, growth factor receptor; LDL, L domain-like leucine-rich repeat; M-6-P R, mannose-6-phosphate receptor; RNI, RNI-like.

Cellulose-based structures

During development, Dictyostelium cells produce a number of cellulose-based structural elements. Dictyostelium slugs synthesize an extracellular matrix, or sheath, around themselves that is comprised of proteins and cellulose. Several of the smaller sheath proteins bind cellulose and are believed to have a role in slug migration, whereas the larger, cysteine-rich EcmA protein is essential for full integrity of the sheath and for establishing correct slug shape75,76. During terminal differentiation, cellulose is deposited in the stalk and in the cell walls of the stalk and spore cells77,78,79. The first confirmed eukaryotic gene for cellulose synthase was discovered in Dictyostelium and this gene has since been recognized in many plants, N. crassa and the ascidian Ciona intestinalis80. The fungal and urochordate enzymes are more closely related to the Dictyostelium homologue than to plant or bacterial cellulose synthases, indicating that the common ancestor of fungi and animals carried a gene for cellulose synthase that was subsequently lost in most animals. The Dictyostelium genome encodes more than 40 additional proteins that are likely to be involved in cellulose synthesis or degradation, and are probably involved in the production and remodelling of cellulose fibres of the slug sheath, stalk tube and cell walls (see Supplementary Information).

The fundamental similarities in cellular cooperation found in Dictyostelium and in the Metazoa clearly resulted in a parallel positive selection for structural and regulatory genes required for cell motility, adhesion and signalling. Dictyostelium uses a set of signals and adhesion proteins that are distinct from those employed for similar purposes in Metazoa but, like the Metazoa, Dictyostelium has maintained a diversity of GPCRs, protein kinases and ABC transporters that enable it to respond to those signals. Dictyostelium has also retained and modified an organizational strategy perfected in plants, basing several structural elements on cellulose. At one level, Dictyostelium has achieved multicellularity by using strategies that are similar to those of plants and Metazoa, but the differences between them suggest convergent evolution, rather than lineal descent from an ancestor with overt or latent multicellular capacities.

Conclusion

The complete protein repertoire of Dictyostelium provides a new perspective for studying its cellular and developmental biology. At a systems level, Dictyostelium offers a level of complexity that is greater than that of the yeasts, but much lower than that of plants or animals. Thus, high-resolution molecular analyses in this system may reveal control networks that are difficult to study in more complex systems, and may presage regulatory strategies used by higher organisms81,82,83. At a practical level, the comparative genomics of Dictyostelium and related pathogens, such as Entamoeba histolytica, should aid in the functional definition of amoebozoa-specific genes that may open new avenues of research aimed at controlling amoebic diseases. Dictyostelium's adeptness at hunting bacteria also renders it susceptible to infections by intracellular bacterial pathogens84,85. Dictyostelium and human macrophages display fundamental similarities in their cell biology, which has spurred the use of Dictyostelium as a model host for bacterial pathogenesis. It is also an attractive model in which to study other disease processes: for a number of human disease-related proteins, it provides a test-bed for studying their functions in a model organism that has greater similarity to higher eukaryotes than do the yeasts, yet shares the latter's experimental tractability.

The high frequency of repeated amino acid tracts in Dictyostelium proteins has long been known anecdotally, but we can now survey their precise nature and number, and find them to be more abundant than in any other sequenced genome. Many human diseases result from the expansion of triplet nucleotide repeats, some of which encode polyglutamine tracts that cause cell degeneration86,87. Learning how Dictyostelium cells tolerate so many proteins with amino acid homopolymers will, we hope, help to elucidate the roles of these motifs in protein function and dysfunction.

Comparative genomic studies in eukaryotes are providing the raw material for global examinations of the evolution of cellular regulation and developmental mechanisms88. Many genes have been lost in one species but retained in others, such that each new genome sequence adds to our understanding of the genetic complement of the eukaryotic progenitor. Thus, our understanding of eukaryotes will continue to be refined as more genome sequences become available from representatives of large groups of organisms whose genomes remain largely unexplored, such as the amoebozoa. The surprising molecular diversity of the Dictyostelium proteome, which includes protein assemblages usually associated with fungi, plants or animals, suggests that their last common ancestor had a greater number of genes than had been previously appreciated.

Methods

Details on the availability of reagents can be found in the Supplementary Information. All analyses described here were performed on Version 2.0 of the genome sequence. Updates to the sequence and annotation are available at http://www.dictybase.org and http://www.genedb.org/genedb/dicty/index.jsp. Further details of analyses not explicitly described below can be found in the Supplementary Information.

HAPPY mapping

A short-range (~100-kb), high-resolution (~8.54-kb) mapping panel was prepared as described9. Briefly, 96 aliquots each containing ~0.52 haploid genome equivalents of sheared AX4 genomic DNA were pre-amplified by PEP (primer extension pre-amplification89). A total of 4,913 STS markers (Supplementary Table 1) were typed by two-phase hemi-nested polymerase chain reaction (PCR; multiplexed for up to 1,200 markers in the first phase) on aliquots of the diluted PEP products. Maps were assembled from good-quality data essentially as described previously8. A second, longer-range (~150 kb) mapping panel was used to confirm some linkages on chromosomes 2 and 5. HAPPY map analysis and PCR primer design for HAPPY mapping were performed using various custom programs (P.H.D. and A.T.B., unpublished).
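The choice of about 0.52 haploid genome equivalents per aliquot can be illustrated with a simple calculation. The sketch below is not the custom software referred to above: it assumes that fragments are sampled independently (Poisson statistics), and the function names and the co-segregation summary are hypothetical.

```python
import math


def marker_presence_probability(genome_equivalents=0.52):
    """Probability that an aliquot holds at least one copy of a marker (Poisson assumption)."""
    return 1.0 - math.exp(-genome_equivalents)


def cosegregation(typing_a, typing_b):
    """Return (observed, expected) fraction of aliquots positive for both markers.

    typing_a and typing_b are equal-length lists of booleans, one entry per aliquot.
    The expected value assumes the two markers segregate independently (unlinked).
    """
    n = len(typing_a)
    both = sum(1 for a, b in zip(typing_a, typing_b) if a and b)
    p_a = sum(typing_a) / n
    p_b = sum(typing_b) / n
    return both / n, p_a * p_b


p = marker_presence_probability()
expected_positive_aliquots = 96 * p
```

Under this assumption a given marker is expected in roughly 40% of the 96 aliquots, so markers that lie close together on the same DNA fragments appear together far more often than the product of their individual frequencies, and that excess is the linkage signal used for mapping.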

Chromosome purification

Genomic DNA from D. discoideum strain AX4 was prepared and separated by pulsed field gel electrophoresis essentially as described27,9, except that gels were run in stacked pairs; one member of each pair was stained with ethidium bromide, and bands were excised from its unstained counterpart by alignment with the stained gel.

WCS and YAC subclone libraries

For WCS libraries, gel slices (above) were disrupted by several passages through a 30-gauge syringe needle, digested with β-agarase (NEB) and phenol-extracted. DNA was concentrated by ethanol precipitation, sonicated, end-blunted using mung bean nuclease and size-fractionated on 0.8% low-melting-point agarose gels. Fractions of 1.4–2 kb and 2–4 kb were excised, DNA extracted as before and ligated into the SmaI site of pUC18 or pUC19. Clone propagation and template preparation followed standard protocols.

For YAC subclone libraries, AX4-derived YACs were identified (and their position and integrity confirmed) by screening the set described by ref. 22 using markers from the HAPPY map. Subclones were prepared from PFG-purified YACs essentially as for the WCS libraries; contaminating yeast-derived sequences were filtered out in silico.

Sequencing and assembly

Details of the sequencing and assembly methods can be found in Supplementary Information. Generally, mapped sequence features were used to nucleate sequence contigs assembled from the WCS data, and extended using read-pair information and iterative searches for overlapping sequences, followed by directed gap closure using a range of approaches.

Fluorescent in situ hybridization

In situ hybridization was performed as in ref. 17.

Gene prediction and identification of sequence features

Full details are provided in the Supplementary Information. Briefly, automated gene prediction was performed using a combination of programs that had been trained on well-characterized D. discoideum genes, and the results integrated with reference to D. discoideum complementary DNA sequences and homology to genes in other species. Other features in the predicted proteins, and other sequence features, were identified using a variety of software packages.

Analysis of functional gene clustering

Microarray targets (refs 53, 90, 91; and N. Van Driessche and G. Shaulsky, unpublished data) and gene models were mapped onto the genome sequence using BLAST92 and the modified LIS algorithm93. To look for clustering of genes with correlated temporal expression profiles, pairwise correlation coefficients were calculated for genes with known expression profiles on each chromosome91. Blocks of ≥6 consecutive genes were sought, for which either (1) all pairwise correlation coefficients were positive and ≥70% were >0.2 (genes with similar developmental trajectories) or (2) each gene had a partner with an absolute correlation coefficient value of >0.6 (tightly co-regulated genes); no statistically significant clusters met these criteria.
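The two block criteria can be expressed compactly as follows. This is a minimal sketch, assuming Pearson correlation over the expression time series and treating a flat profile as having zero correlation with everything; `block_meets_criteria` is a hypothetical name and is not the software used for the analysis.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length profiles (0.0 if either is flat)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)


def block_meets_criteria(profiles):
    """Test a block of consecutive genes against criteria (1) and (2).

    profiles: list of expression profiles, in chromosomal order.
    Returns (similar_trajectories, tightly_coregulated).
    """
    n = len(profiles)
    if n < 6:
        return (False, False)
    r = {(i, j): pearson(profiles[i], profiles[j])
         for i, j in combinations(range(n), 2)}
    values = list(r.values())
    # (1) all coefficients positive and at least 70% of them above 0.2
    similar = (all(v > 0 for v in values)
               and sum(v > 0.2 for v in values) >= 0.7 * len(values))
    # (2) every gene has at least one partner with |r| > 0.6
    tight = all(
        any(abs(r[(min(i, j), max(i, j))]) > 0.6 for j in range(n) if j != i)
        for i in range(n)
    )
    return (similar, tight)
```

Criterion (2) uses the absolute value of the coefficient, so a block containing an anticorrelated gene can satisfy it while failing criterion (1).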

To look for clustering of genes associated with specific developmental stages94,95 or cell types90,96, the genome was scanned with windows of various sizes97 for regions with significant (P < 0.01) over-representation of genes in any one of these groups.
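A window scan of this kind might look like the sketch below. The statistical test is not specified here, so the example assumes a one-sided hypergeometric test per window and applies no multiple-testing correction; both are illustrative choices, and the function names are hypothetical.

```python
from math import comb


def hypergeom_upper_tail(k, n, K, N):
    """P(X >= k) when n genes are drawn from N, of which K belong to the group."""
    total = comb(N, n)
    favourable = sum(comb(K, i) * comb(N - K, n - i)
                     for i in range(k, min(n, K) + 1))
    return favourable / total


def scan_windows(in_group, window, alpha=0.01):
    """Slide a fixed-size window along genes in chromosomal order.

    in_group: list of 0/1 flags, one per gene, marking group membership.
    Returns (start_index, members_in_window, p_value) for each window
    whose over-representation p-value is below alpha.
    """
    N = len(in_group)
    K = sum(in_group)
    hits = []
    for start in range(N - window + 1):
        k = sum(in_group[start:start + window])
        p = hypergeom_upper_tail(k, window, K, N)
        if p < alpha:
            hits.append((start, k, p))
    return hits


# Toy chromosome: 45 genes, with a run of five group members in the middle.
flags = [0] * 20 + [1] * 5 + [0] * 20
for start, k, p in scan_windows(flags, window=5):
    print(start, k, f"{p:.2e}")
```

Repeating the scan for several window sizes, as in the analysis, multiplies the number of tests, so in practice the per-window threshold would need to be interpreted with that in mind.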

Analysis of duplicated genes

Predicted protein sequences were clustered using TribeMCL98, using a BLASTP expectation of <10⁻⁴⁰ as a cutoff. A χ² test invalidated the hypothesis that members of a family are randomly distributed in the genome. Within each family, protein divergences (similarity distances computed using the ‘Protdist’ module of PHYLIP; http://evolution.genetics.washington.edu/phylip.html) and physical intergenic distances between all pairs of family members were tabulated, and the correlation coefficient between the former and latter values was calculated. Analysis was performed on the 86 gene families (representing 155 gene pairs) with at least 10 intrachromosomal distance pairings to provide robust statistical confidence.
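The per-family tabulation can be sketched as follows. The input structures (a mapping from gene name to chromosome and position, and a mapping from gene pairs to a divergence value) and the function name are assumptions made for illustration; the divergence values themselves would come from a tool such as Protdist.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length lists (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)


def divergence_distance_correlation(members, divergence, min_pairs=10):
    """Correlate protein divergence with physical distance within one family.

    members: dict mapping gene name -> (chromosome, position_bp)
    divergence: dict mapping frozenset({gene_a, gene_b}) -> divergence value
    Only pairs on the same chromosome are used. Returns None when the
    family has fewer than min_pairs intrachromosomal pairings.
    """
    physical, diverged = [], []
    for a, b in combinations(sorted(members), 2):
        (chrom_a, pos_a), (chrom_b, pos_b) = members[a], members[b]
        if chrom_a != chrom_b:
            continue
        physical.append(abs(pos_a - pos_b))
        diverged.append(divergence[frozenset((a, b))])
    if len(physical) < min_pairs:
        return None
    return pearson(diverged, physical)
```

A positive coefficient for a family indicates that members lying closer together on a chromosome tend to be more similar in sequence.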

Other sequence analyses and graphical representation

Other sequence analyses (nucleotide and dinucleotide composition; identification of simple-sequence repeats in nucleotide and protein sequence; coding density computation; tRNA cluster identification) were performed using a range of custom software (P.H.D. and A.T.B., unpublished). Graphical representation of chromosomes in Fig. 2 was done primarily using Cinema4D-8.5 (Maxon Computer GmbH) after pre-processing using custom software (P.H.D.).
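As an indication of what the composition analyses involve, the sketch below computes nucleotide and overlapping dinucleotide frequencies for a sequence. It is an illustrative stand-in, not the custom software referred to above, and the function name is hypothetical.

```python
from collections import Counter


def composition(sequence):
    """Return (nucleotide_freqs, dinucleotide_freqs) for a DNA sequence.

    Dinucleotides are counted in overlapping fashion, so a sequence of
    length n contributes n - 1 dinucleotides. Characters other than
    A, C, G and T are skipped, as are dinucleotides containing them.
    """
    seq = sequence.upper()
    bases = set("ACGT")
    mono = Counter(b for b in seq if b in bases)
    di = Counter(seq[i:i + 2] for i in range(len(seq) - 1)
                 if seq[i] in bases and seq[i + 1] in bases)
    mono_total = sum(mono.values())
    di_total = sum(di.values())
    mono_freq = {b: c / mono_total for b, c in mono.items()} if mono_total else {}
    di_freq = {d: c / di_total for d, c in di.items()} if di_total else {}
    return mono_freq, di_freq


mono, di = composition("AATTAATTGC")
at_content = mono.get("A", 0.0) + mono.get("T", 0.0)
print(f"(A + T) content: {at_content:.2f}")
```

Comparing observed dinucleotide frequencies with the products of the corresponding nucleotide frequencies is one common way to look for compositional biases beyond simple (A + T)-richness.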