A Maximum Likelihood Method for Analyzing Pseudogene Evolution: Implications for Silent Site Evolution in Humans and Rodents

Bustamante, Carlos D.; Nielsen, Rasmus; Hartl, Daniel L.

doi:10.1093/oxfordjournals.molbev.a003975

Abstract

We present a new likelihood method for detecting constrained evolution at synonymous sites and other forms of nonneutral evolution in putative pseudogenes. The model is applicable whenever the DNA sequence is available from a protein-coding functional gene, a pseudogene derived from the protein-coding gene, and an orthologous functional copy of the gene. Two nested likelihood ratio tests are developed to test the hypotheses that (1) the putative pseudogene has equal rates of silent and replacement substitutions; and (2) the rate of synonymous substitution in the functional gene equals the rate of substitution in the pseudogene. The method is applied to a data set containing 74 human processed-pseudogene loci, 25 mouse processed-pseudogene loci, and 22 rat processed-pseudogene loci. Using the informatics resources of the Human Genome Project, we localized 67 of the human-pseudogene pairs in the genome and estimated the GC content of a large surrounding genomic region for each. We find that, for pseudogenes deposited in GC regions similar to those of their paralogs, the assumption of equal rates of silent and replacement site evolution in the pseudogene is upheld; in these cases, the rate of silent site evolution in the functional genes is ∼70% the rate of evolution in the pseudogene. On the other hand, for pseudogenes located in genomic regions of much lower GC than their functional gene, we see a sharp increase in the rate of silent site substitutions, leading to a large rate of rejection for the pseudogene equality likelihood ratio test.

Introduction

It has long been held that pseudogenes provide the molecular evolutionist with an important tool for studying the rate and pattern of neutral evolution (Li, Gojobori, and Nei 1981 ). To the extent that pseudogenes evolve without selective constraint, the rate and pattern of substitutions in pseudogenes will faithfully reflect the underlying mutational process. Pseudogenes have, therefore, been used to infer the mutational process for nucleotide changes within species (Gojobori, Li, and Graur 1982 ; Li, Wu, and Luo 1984 ) and to compare this process between species (Petrov and Hartl 1999 ), as well as to study deletion rates among taxa (Graur, Shuali, and Li, 1989 ; Petrov, Lozovskaya, and Hartl 1996 ; Ophir and Graur 1997) and the effect that rates of DNA loss have on genome size (Petrov et al. 2000 ; Bensasson et al. 2001) .

The question of whether silent sites are evolving at the neutral mutation rate is far from resolved. The analysis of polymorphism data and codon frequencies in Escherichia coli and Salmonella enterica suggest that there is considerable weak selection operating on silent sites (Sharp and Li 1986 ; Andersson and Kurland 1990 ; Hartl, Moriyama, and Sawyer 1994 ; Hartl et al. 2000 ). Recent work in the comparative analysis of Drosophila genes has also revealed evidence for constraint on synonymous site evolution, presumably because of codon bias (Akashi 1996, 1997 ; Akashi and Schaeffer 1997 ; McVean and Vieira 2001 ). In rodent genes, however, there seems to be no relationship between the rate of synonymous substitution and codon bias (Smith and Hurst 1999 ). It has been suggested that a balance between mutation and selection may account for genomic variation in GC content; this would affect both the apparent codon bias and rate of substitution at silent sites (for a review see Bernardi 2000) .

Pseudogenes provide the molecular evolutionist with a direct opportunity to infer the strength of selection on changes at synonymous sites. Ophir et al. (1999) employed a distance method to analyze a set of 12 human and murid (rat and mouse) pseudogene gene trees; they found that, on average, murid and human third-position sites evolve at 40% the rate of pseudogene third-position sites. Because not all changes at third-position sites are synonymous, it is difficult to extrapolate from this result as to how much selection is acting on synonymous changes. The issue of selection on synonymous sites is of considerable practical importance in the study of molecular evolution because the ratio (ω) of the number of replacement substitutions per replacement site (dn) to the number of synonymous substitutions per synonymous site (ds) is widely used to detect adaptive evolution at the protein level.

The methods we present in this paper to study pseudogene evolution exploit recent statistical developments in molecular phylogenetics. In particular, we develop codon-based models for gene trees that contain a (processed) pseudogene, the functional gene from which the pseudogene was derived, and the ortholog of the functional gene from a closely related species (fig. 1 ). The models are implemented in a maximum likelihood framework and lead to two likelihood ratio tests of neutral evolution assuming a shared mutational process for all lineages—one to test the equality of silent and replacement substitution rates in a putative pseudogene, and another to test if the synonymous sites in the functional gene are evolving at the same rate as the pseudogene. We apply this method to a data set consisting of 121 processed pseudogenes (74 from human, 22 from rat, and 25 from mouse [Ophir and Graur 1997]) to test the assumption that processed pseudogenes evolve neutrally and to estimate the strength of selection on synonymous sites in the functional paralogs of pseudogenes.

Using the informatics resources of the Human Genome Project, we localized 67 of the gene-pseudogene pairs and estimated the GC content of the surrounding genomic areas for both. We used this data as well as the GC content of fourfold redundant sites (GC4) in the functional genes to assess whether rejection of pseudogene or silent site neutrality is correlated with the GC content.

It is important to note that the data sets presented here contain only retropseudogenes (i.e., pseudogenes generated by reverse transcription of mRNAs and insertion in the genome). These genes lack a promoter for expression and can, therefore, safely be termed “dead on arrival.” Furthermore, they generally insert far from their functional paralog and thus are expected to avoid the concerted evolution frequently observed for duplicated genes arranged in tandem.

Statistical Methods

The Model

The method we use to analyze pseudogene evolution is based on the likelihood models developed by Goldman and Yang (1994) , Muse and Gaut (1994) , Nielsen and Yang (1998) , Yang (1998) , and Yang and Nielsen (1998) . In these models, the DNA sequence is treated as a sequence of triplets of nucleotides (codons). We will assume that the substitution processes in each codon site are independent and can be described by a continuous time Markov chain with state space on the 61 codons of the standard genetic code (excluding the stop codons). Furthermore, we assume that the process can be parameterized in terms of the transition-transversion rate ratio (κ), the dn/ds ratio (ω), and the stationary frequencies of each codon (π_i). The transition rate matrix Q = {q_ij} is then defined as

The transition probabilities of the process can be calculated by exponentiating the rate matrix using standard numerical methods (e.g., numerical diagonalization of Q). The codon frequencies are estimated from the data using the method of moments to reduce the number of parameters in the model and to save computational time. The estimates are obtained from the observed base frequencies in the three codon positions (see Yang and Nielsen 1998 for details).

For the purpose of analyzing the pseudogenes, we assume a phylogenetic tree in which for each pseudogene there is a functional paralog from the same species and an ortholog of the functional gene from a related species (outgroup) (fig. 1 ). In the most general model, we will assume that the values of ω and the branch length of the three branches (t_f, t_o, and t_ψ) are independent parameters, and that the codon frequencies and κ do not vary among branches. The set of parameters that will be estimated by maximum likelihood is then Θ = {t_f, t_o, t_ψ, ω_f, ω_o, ω_ψ, κ}. The likelihood function to be optimized is proportional to

where n is the number of codons in each sequence, S is the 3 × n matrix of sequence data containing the codon in sequence k at position j in entry S_kj, and P_Θ(c_i → S_kj) is the transition probability along the appropriate branch of the phylogeny from codon c_i (i = 1, 2, …, 61) in the internal node to codon S_kj. Parameter estimates are obtained by maximizing the logarithm of equation (2) with respect to Θ. In this model, branch lengths are not scaled by the expected number of substitutions. Hence, for a particular lineage in the phylogeny, the maximum likelihood estimate of the number of synonymous substitutions per synonymous site (ds) and the number of replacement substitutions per replacement site (dn) can be calculated from the maximum likelihood estimates of t, κ, and ω, using the methods described in Yang and Nielsen (1998) .

Likelihood Ratio Tests

The first hypothesis we will test is whether the putative pseudogene has been an untranscribed pseudogene or a neutrally evolving gene in the entire evolution of the pseudogene lineage. If a putative pseudogene is truly a pseudogene, then ω_ψ ought to equal 1; that is, the rate of synonymous substitution should equal the rate of replacement substitution in the pseudogene. To do so, we compare the log maximum likelihood of the general model with seven parameters (t_f, t_o, t_ψ, ω_f, ω_o, ω_ψ, κ) to the log maximum likelihood of the nested six-parameter model (t_f, t_o, t_ψ, ω_f, ω_o, κ), which assumes that ω_ψ = 1. Appealing to the usual large sample results for nested hypotheses (e.g., Stuart, Ord, and Arnold 1987 , p. 246), two times the logarithm of the maximum likelihood ratio of the two hypotheses (LRT[6,7]) is asymptotically distributed as a χ² random variable with one degree of freedom (df). In particular, if the maximum log likelihood ratio under the seven-parameter model is more than 1.92 (3.84/2) log likelihood units larger than the maximum likelihood value under the six-parameter model, we reject the null hypothesis (ω_ψ = 1) at the 5% significance level. Throughout the rest of the paper, we will refer to LRT[6,7] as the pseudogene equality likelihood ratio test (PELRT).

Once we have established that the rates of silent and replacement substitutions are statistically similar in the pseudogene, suggesting that the pseudogene rate of substitution reflects the underlying neutral mutation rate, we will wish to test whether silent sites in the functional gene evolve at the same rate as the pseudogene rate. To do so, we will exploit the genealogical relationship between the pseudogene, the functional gene, and the functional ortholog. If the pseudogene arose after the two species diverged, the pseudogene sequence and the functional ortholog will be more closely related to each other than either of them would be to the outgroup sequence. Therefore, if the rate of evolution of the pseudogene equals the rate of synonymous evolution in the functional gene, then t_f = t_ψ. To test this hypothesis, we will compare the maximum likelihood under the constrained model with six parameters {t_f, t_o, t_ψ, ω_f, ω_o, κ} to the maximum likelihood under the nested model with five parameters {t_f = t_ψ, t_o, ω_f, ω_o, κ}. Again, significance is tested by comparing twice the log likelihood ratio, LRT[5,6], to a χ²₁-distribution. Throughout the rest of this paper, we will refer to LRT[5,6] as the silent site equality likelihood ratio test (SSELRT).

Both these tests can be sensitive to the codon frequencies. Biased estimates of ω will be obtained if codon frequency biases are not properly accounted for (e.g., Yang and Nielsen 2000 ). This should cause little problem in the current context because the codon frequencies are similar. However, if the genomic context differs widely between pseudogenes and their functional paralog, the codon frequencies might be affected. In such cases, the result could potentially be an excess of significant results of the PELRT.

Pooled Likelihood Ratio Test

One issue of interest is whether we can reject the null hypotheses when combining the data across loci. If we assume that the genes are independent of one another, we can sum the log likelihoods under each model and perform the SSELRTs and the PELRTs on the pooled data. SSELRT applied to a combined data set would test whether all pseudogenes in the data set conform to ω_ψ = 1. PELRT applied to the whole data set would test whether the average rate of silent site evolution in each gene is the same rate as in the respective pseudogene for all genes. As models 6 and 7 differ by one df, the PELRT for n genes will be distributed as χ²_n. Likewise, the combined SSELRT statistic will be χ²_{n − r} distributed, where r is the number of genes that reject the PELRT; the PELRTs are relevant because the SSELRT on a gene is justified only if the gene fails to reject the PELRT.

Binomial Likelihood Ratio Test

A separate (but not independent) analysis is based on the fraction of genes that reject a particular test. We perform this test to corroborate that rejection of the pooled test is not because of a handful of genes that have very large test statistics. If the null hypothesis, H₀, for n independent tests of the same hypothesis performed at the α significance level (e.g., 0.05) is true, the number of tests that reject the null hypothesis, k, is binomially distributed with parameters n and α. To test the hypothesis that the observed proportion of tests that reject the null hypothesis, k/n, is equal to α, we use a binomial likelihood ratio test (BLRT) which has a test statistic

BLRT is approximately χ²₁ if H₀ is true for all tests. A confidence interval for the true proportion of tests that reject H₀ is defined as the range of values of p for which one would not reject the BLRT.

Logistic Regression

To explore whether the GC content correlates with the rate of rejection of either the PELRT or the SSELRT, we used the logistic regression framework. This is an appropriate analysis to employ because our response variable is dichotomous (whether a particular test rejects the null hypothesis or not), and our predictor variable is continuous (GC content). We ran four separate logistic regressions using the genomic GC content or the GC content of fourfold redundant sites as explanatory variables, and whether or not the SSELRT or the PELRT rejects the null hypothesis as the outcome variables. We use the notation β_{Genomic GC} and β_GC4 to refer to the maximum likelihood estimates of the slope parameter for regressions using the genomic GC content and the GC4 content as predictor variables, respectively, and α to refer to the intercept parameter for each regression. The P value reported for each regression is for a likelihood ratio test comparing the fit of the model with the maximum likelihood estimate of the slope parameter to a model with slope equal to 0 (Christensen 1997 ). The 95% confidence intervals for the predicted rate of rejection of each test as a function of GC content were found using nonparametric bootstrap sampling.

Bootstrap Test

To test whether the average number of replacement substitutions per replacement site in pseudogenes, d̅n̅, equaled the average number of silent substitutions per silent site in pseudogenes, d̅s̅, we used a paired bootstrapping procedure. Namely, we sampled (dn_ψ, ds_ψ) pairs as estimated from our data with replacement and calculated d̅n̅ and d̅s̅ for each sample of pseudogenes 100,000 times. The P value reported is twice the proportion of times d̅n̅ was less than d̅s̅ (because we were performing a two-sided test). Again, this test is not independent of the pooled likelihood ratio test.

Estimating Effective Level of Selection on Synonymous Sites

The difference in substitution rates at silent sites in functional genes and pseudogenes can be used to estimate S, the effective level of selection on mutations at silent sites; that is, the level of selection that would produce the observed reduction in substitution rate, assuming that all mutations at silent sites in functional genes have the same selective effect and that the mutation rates are the same for silent sites in functional genes and in pseudogenes. It can be shown (Kimura 1962 , result [11]) that if mutations at silent sites have a selective effect s and that pseudogene sites evolve neutrally, the ratio of the rates of substitution at silent sites in functional genes, d̅s̅_f̅, to the rate of substitution at pseudogene sites, d̅s̅, is given by

where S = 4N_es and N_e is the effective population size.

To generate confidence intervals for d̅s̅_f̅ /d̅s̅, we used 100,000 nonparametric bootstrap samples generated by sampling with replacement (ds_f,ds_ψ) pairs estimated from our data. Using equation (4) , we can transform the distribution of bootstrap estimates of d̅s̅_f̅ /d̅s̅ into a distribution for estimates of S.

Materials and Methods

Data

For each pseudogene reported in Ophir and Graur (1997), we searched the nonredundant database (NR) of the NCBI server (http://www.ncbi.nlm.nih.gov/BLAST/) using the program (BLASTN) (Altschul et al. 1990 ) to find the closest available paralogous sequence from the same species. We then ran BLASTN on the paralog to find the single closest murid or human ortholog for the paralog-pseudogene pair (GenBank accession numbers available via the MBE website). Genes for which no functional ortholog could be found—as evidenced by the intercalation of phylogenetically incongruent sequences in the search results—were omitted (e.g., if a search using a functional human gene revealed a higher homology between the human gene and a crocodile gene rather than the human gene and the mouse gene, the pseudogene-ortholog pair was omitted).

Sequences were obtained, edited, and aligned by CLUSTALW using various versions (3.5–4.0.30) of the program DAMBE (Xia 2000 ). All stop codons and codons containing nucleotides with gaps in the alignment were removed and the reading frame was set to the reading frame of the functional gene. Pseudogenes that were identical to their functional paralog, except for gaps, were omitted from the analysis. Likewise, genes for which the estimated pseudogene, t, was longer than the outgroup, t, in model 7 were omitted because such a phylogeny indicated that the assumption of pseudogene emergence after the primate-rodent split was not met. We used BLASTN to identify the genomic position of 67 of the human pseudogene-functional gene pairs using the NCBI-curated sequences of the Human Genome Project. All genomic GC content measures are for the NCBI GenBank entry for the BAC (>100,000 bp) containing the functional gene.

Results and Discussion

In the Additional Information, which can be accessed via the Molecular Biology and Evolution web site, we summarize the maximum likelihood parameter estimates and the results of the individual likelihood ratio tests for each gene. The estimated number of silent substitutions per silent site between human and murid genes, as estimated from the full model, ranged from 0.143 to 2.427. Likewise, the estimated number of replacement substitutions per replacement site ranged from 0 to 0.447. The mean and standard deviation of these distances (d̅s̅_f̅+̅d̅s̅_o̅ = 0.596 ± 0.295; d̅n̅_f̅+̅d̅n̅_o̅ = 0.052 ± 0.064) are comparable to those previously cited for the human-mouse comparison (Matassi, Sharp, and Gautier 1999 ). The estimated distance between a pseudogene and its respective functional gene under the full model ranged from 0 to 0.684 silent substitutions per silent site (d̅s̅_f̅+̅d̅n̅_ψ̅ = 0.128 ± 0.106) with 0–0.257 replacement substitutions per replacement site (d̅n̅_f̅+̅d̅n̅_ψ̅ = 0.0645 ± 0.0513).

One key conclusion from our analysis is that, overall, the majority of pseudogene-gene pairs (>70%) do not reject either test. This suggests that, as a first approximation, the most constrained model (i.e., equal codon frequencies and mutation rates in all lineages accompanied by neutral evolution in both pseudogenes and silent sites in functional genes) seems to fit the majority of the data. We did find that the PELRT is significant at the 5% significance level for 13 of the 121 genes (11%). Among those genes for which the PELRT is not significant (n = 108), 25 genes (23%) have significant SSELRTs. Given the large number of tests, we would expect approximately 5% of the genes to reject either null hypothesis if the deviations were simply the result of chance. However, for both tests the proportion of genes that reject the null hypothesis is significantly greater than 5% by the BLRT (PELRT: P < 0.012; SSELRT: P ≪ 0.0000001). The source of these discrepancies is discussed in the following sections.

Implications of the Results for Pseudogene Evolution

To learn if the GC content correlates with an increase in the rejection rate of either test, we divided the data set into two equal groups based on the GC content of fourfold redundant sites in the functional paralogs of the pseudogenes (GC4). The rationale for using GC4 rather than overall GC content is that replacement sites are presumably constrained by purifying selection on amino acid changes. For 67 of the human genes, we were also able to estimate the GC content of the surrounding genomic region for the functional gene using the NCBI sequences of the Human Genome Project. Not surprisingly, GC4 was highly correlated with the surrounding genomic GC content (r = 0.72), and the conclusions are the same if one analyzes the flanking GC content of the functional gene rather than GC4.

In table 1 , we report the rejection rates, 95% confidence intervals for the rejection rates, and the results of the pooled likelihood ratio tests for gene pairs grouped on the basis of the GC content of the functional gene for both the PELRT and the SSELRT. We find, by applying Fisher's Exact Test of homogeneity, that regardless of whether we classify humans genes on the basis of the regional GC content or GC4 of the functional gene, the processed pseudogenes derived from GC-rich genes reject the PELRT more frequently than do pseudogenes derived from GC-poor genes (Genomic GC, P < 0.027; GC4, P < 0.011). Likewise, a pooled likelihood ratio test implies that pseudogenes derived from GC-rich human genes, in general, have unequal rates of substitution at silent and replacement sites (P < 0.0001). This result is unexpected, but the same phenomenon is also observed in the rodent pseudogenes (P < 0.0032).

The inequality in substitution rates between silent and replacement sites is not observed in pseudogenes derived from GC-poor regions. From table 1 , we see that the rejection rate of the PELRT is not significantly different from that expected by chance for human genes as classified by genomic GC (1/33; P ≈ 0.5766), for human genes classified by GC4 (1/39; P ≈ 0.4430), or for rodent genes (1/22; P ≈ 0.9209). Furthermore, we do not reject the pooled PELRT for GC-poor genes in either humans or rodents (P > 0.45 for all; see table 1 for details). This tentatively suggests that in the cases where pseudogenes land in genomic regions similar to those in which their respective functional genes have evolved, the equal codon frequencies model and pseudogene neutrality is supported.

To investigate whether these results are because of our dichotomous classification of GC groups based on GC4, we used a logistic regression with GC4 in the functional gene as a continuous predictor variable and whether the pseudogene rejected or failed to reject PELRT as the outcome variable. As illustrated in figure 2 , we found a strong positive association between GC4 in the functional gene and the probability of rejecting the PELRT (β_GC4 = 6.374, α = −6.156; P < 0.0016). However, for pseudogenes derived from genes with GC4 < 0.60, the 95% confidence intervals for the probability of rejecting the PELRT contain 5%. A logistic regression for human pseudogenes based on the local GC content yielded comparable results (β_{Genomic GC} = 21.118, α = −11.864; P < 0.0058).

Why should pseudogenes derived from high-GC functional genes tend to have unequal rates of silent and replacement substitutions, whereas pseudogenes derived from low-GC functional genes conform to the expectations of neutral theory? One logical hypothesis is that selection for regional GC content (or differential mutation rates away from G or C) accounts for the difference in substitution rates at silent and replacement sites in processed pseudogenes derived from high-GC functional genes. Because mammalian functional genes tend to be nonrandomly distributed with respect to the GC content, such that a majority of genes are found in GC-rich regions, although a majority of the genome is composed of AT-rich regions (Mouchiroud et al. 1991 ; Saccone et al. 1997 ), if processed pseudogenes are randomly incorporated into the genome, the majority of pseudogenes will move to areas of lower GC content. Consequently, pseudogenes derived from genes in relatively GC-poor coding regions will usually land in genomic regions that are similar to the region in which their functional gene has evolved, whereas pseudogenes derived from GC-rich regions will usually land in regions of much lower GC content.

A prediction of this hypothesis is that, in the pseudogene lineage, the substitution rate per silent site should be elevated relative to the substitution rate per replacement site. In other words, the silent sites in the pseudogenes that move from regions of high GC content to regions of low GC content should be changing faster than the neutral rate. The reason that silent sites in pseudogenes are disproportionately affected is that it is the silent sites in functional genes that are most affected by local GC content because replacement sites are predominantly under purifying selection. Therefore, processed pseudogenes that move from an area of high GC to one of low GC have an excess of G or C at their silent sites, relative to their new genomic neighborhoods. This will cause silent sites to be more strongly affected by either mutation or selection pressure for GC content, leading to an excess of silent substitutions vis à vis replacement substitutions. The same phenomenon would hold for genes that move from low- to high-GC regions, but given the high abundance of AT-rich regions and the fact that most genes are in GC-rich regions, these events would occur infrequently.

Our data afford strong support for this hypothesis. For pseudogenes derived from high-GC genes, the average number of replacement substitutions per replacement site, d̅n̅, relative to the average number of silent substitutions per silent site, d̅s̅, is 0.0537/0.0935 = 0.5743; this is significantly different from unity by the bootstrap test (P < 0.00001). For pseudogenes derived from low-GC genes, d̅n̅ = 0.0624 and d̅s̅ = 0.0619, yielding a ratio of 1.008, which is extremely close to the expected ratio of 1.0 (P ≈ 0.9900). Furthermore, a higher rate of silent substitution in pseudogenes derived from high-GC functional genes is evident from application of an unpaired t-test. The average replacement substitution rate does not vary between pseudogenes derived from high- and low-GC groups (t = −1.027; P ≈ 0.3062), whereas the average silent substitution rate for pseudogenes derived from high-GC genes is significantly higher than the average rate for pseudogenes derived from low-GC genes (t = 2.1808; P < 0.0312). Together, these results suggest that silent and replacement sites in pseudogenes derived from low-GC genes, as well as replacement sites in genes derived from high-GC genes, are evolving at or near the neutral mutation rate, whereas silent sites in pseudogenes derived from high-GC genes are evolving at a rate faster than the neutral mutation rate.

In addition, pseudogenes derived from genes found in high-GC regions tend to decrease in GC4 content (Wilcoxon signed rank test [WSRT]: Z = 3.561, P < 0.0005) relative to their respective functional genes, whereas pseudogenes derived from low-GC regions tend to maintain the same level of GC4 content (WSRT: Z = 0.950, P ≈ 0.3422). These findings corroborate the hypothesis that positive selection for GC content, or large differences in mutation rates among genomic regions, drives the observed difference in substitution rate between silent and replacement sites in a significant number of processed pseudogenes.

Implications of Results for Silent Site Evolution

We find that, for functional genes derived from both GC-rich and GC-poor regions, a significant number of pseudogenes evolve faster than silent sites in the genes from which they were derived. This comparison is statistically significant whether one considers the proportion of genes that reject the SSELRT or whether one considers the pooled SSELRT across genes (table 1 ). For functional genes in GC-rich regions, the high rate of rejection is, in part, attributable to the excess of silent site fixation events along the pseudogene branch in the gene tree because of selection or mutation bias in the novel genomic region. This explanation does not apply to relatively GC-poor genes because we have shown that pseudogenes derived from these genes do not reject the PELRT more than expected by chance. In this class of genes, the substitution rates of silent and replacement sites in the pseudogenes are roughly equal. Furthermore, a logistic regression with GC4 of the functional gene as a predictor variable and whether the gene rejects the SSELRT as the outcome variable is not significant (β = 1.471, α = −2.040; P ≈ 0.3434), suggesting that GC4 alone cannot account for the overall level (∼25%) of rejection for the SSELRT.

How does the detection of non-neutral evolution at silent sites in functional genes relate to the rate of substitution? To estimate the overall difference in the average rates of substitution between silent sites and pseudogene sites, we compare the ratio of the average rate of substitution per synonymous site in functional paralogs to the average rate of substitution per site in the pseudogenes (d̅s̅_f̅ /d̅s̅). Figure 3 illustrates the distributions of 100,000 nonparametric bootstrap estimates of d̅s̅_f̅ /d̅s̅ from model 6 (pseudogene neutrality model) and model 7 (free model) for gene-pseudogene pairs derived from GC-poor regions, and from model 7 only for gene-pseudogene pairs derived from GC-rich regions. We did not perform the bootstrap analysis for parameter estimates based on model 6 for gene-pseudogene pairs derived from GC-rich regions because this class of genes rejects model 6 as a group based on the pooled likelihood ratio test. Table 2 reports the mean and the 2.5% and 97.5% quantiles of these distributions. There is approximately a 30% reduction in the average substitution rate of silent sites in functional genes relative to silent sites in pseudogenes. We also note that the mean of the distribution of d̅s̅_f̅ /d̅s̅ from GC-rich regions for model 7 (0.62) differs slightly from the mean of the distribution for models 6 and 7 for GC-poor regions (0.69–0.70), which reflects the accelerated substitution rate at silent sites in the pseudogene noted previously. The means of these distributions are all significantly different from unity because the interval between the 2.5% and 97.5% quantiles of these distributions does not contain 1 (table 2 ).

Using Kimura's result presented in the Statistical Methods section, one can transform the distributions of d̅s̅_f̅ /d̅s̅ in figure 3 into a distribution for S. From table 2 , we see that the means of these distributions are in the interval [−0.6753, −0.8814]. This suggests that the average strength of selection at synonymous sites is in the range where genetic drift and weak negative selection are important in determining the rate of substitution. However, if there is considerable variation within genes in selection intensity at silent sites, or if the mutation rates at silent sites differ considerably between pseudogenes and functional genes, then the above result underestimates the effect of selection. To estimate the effect of selection in these more complex scenarios, comparative data within and between species will be needed.

Adam Eyre-Walker, Reviewing Editor

Keywords: pseudogene evolution synonymous site evolution codon-based models maximum likelihood neutral evolution

Abbreviations: BLRT, binomial likelihood ratio test; PELRT, pseudogene equality likelihood ratio test; SSELRT, silent site equality likelihood ratio test; WSRT, Wilcoxon signed rank test.

Address for correspondence and reprints: Daniel L. Hartl, 16 Divinity Avenue, Cambridge, Massachusetts 02138. dhartl@oeb.harvard.edu .

Table 1 Proportion of Genes that Reject Null Hypothesis for Pooled Data and Results of Pooled Likelihood Ratio Tests

Open in new tab

Table 2 Average Ratio of Substitution Rates at Synonymous Sites in Functional Genes and in Pseudogenes and Corresponding values for 4Ns

Open in new tab

Open in new tab Download slide

Fig. 1.—A graphical illustration of the model assumed in this paper. f is the codon sequence in the functional gene, ψ is the codon sequence in the pseudogene, o is the codon sequence in the outgroup and t_f, t_ψ, and t_o are the respective branch lengths of the lineages leading to these sequences

Open in new tab Download slide

Fig. 2.—Probability of rejecting the PELRT as a function of GC4 content of the functional gene. The solid line is the predicted rate of rejection from the logistic regression analysis and the gray lines are the 95% confidence intervals. The locations of the genes that reject the test are denoted by open triangles, and the locations of the genes that do not reject the test are denoted by hatched marks. The dotted line traces out the expected 5% rejection rate

Open in new tab Download slide

Fig. 3.—Distributions of d̅s̅_f̅ /d̅s̅ estimated from 100,000 nonparametric bootstrap samples of estimates from model 6 (solid line) and model 7 (dotted line) from GC-poor gene-pseudogene pairs, and from model 7 (thick solid line) from the GC-rich pairs

The authors thank H. Akashi and D. Petrov as well as two anonymous reviewers for comments on earlier drafts of this manuscript. This work was supported by a Howard Hughes Medical Institute award to C.D.B.

References

Altschul S. F., W. Gish, W. Miller, E. W. Myers, D. J. Lipman,

1990

Basic local alignment search tool

J. Mol. Biol

215

:

403

-410

Andersson S. G. E., C. G. Kurland,

1990

Codon preferences in free-living microorganisms

Microbiol. Rev

54

:

198

-210

Akashi H.,

1996

Molecular evolution between Drosophila melanogaster and D. simulans: reduced codon bias, faster rates of amino acid substitution, and larger proteins in D. melanogaster

Genetics

144

:

1297

-1307

———.

1997

Codon bias evolution in Drosophila. Population genetics of mutation-selection drift

Gene

205

:

269

-278

Akashi H., S. W. Schaeffer,

1997

Natural selection and the frequency distributions of “silent” DNA polymorphism in Drosophila

Genetics

146

:

295

-307

Bensasson D., D. A. Petrov, D. Zhang, D. L. Hartl, G. M. Hewitt,

2001

Genomic gigantism: DNA loss is slow in mountain grasshoppers

Mol. Biol. Evol

18

:

246

-253

Bernardi B.,

2000

The compositional evolution of vertebrate genomes

Gene

259

:

31

-43

Christensen R.,

1997

Log-linear model and logistic regression Springer, New York

Goldman N., Z. Yang,

1994

A codon-based model of nucleotide substitution for protein-coding DNA sequences

Mol. Biol. Evol

11

:

725

-736

Gojobori T., W. H. Li, D. Graur,

1982

Patterns of nucleotide substitution in pseudogenes and functional genes

J. Mol. Evol

18

:

360

-369

Graur D., Y. Shuali, W.-H. Li,

1989

Deletions in processed pseudogenes accumulate faster in murids than in humans

J. Mol. Evol

28

:

279

-285

Hartl D. L., M. Moriyama, S. A. Sawyer,

1994

Selection intensity for codon bias

Genetics

138

:

227

-234

Hartl D. L., E. F. Boyd, C. D. Bustamante, S. A. Sawyer,

2000

The glean machine: what can we learn from DNA sequence polymorphism

Symp. Genomics Proteomics

4

:

37

-49

Kimura M.,

1962

On the probability of fixation of mutant genes in a population

Genetics

47

:

713

-719

Li W. H., T. Gojobori, M. Nei,

1981

Pseudogenes as a paradigm of neutral evolution

Nature

292

:

237

-239

Li W. H., C. I. Wu, C. C. Luo,

1984

Nonrandomness of point mutation as reflected in nucleotide substitutions in pseudogenes and its evolutionary implications

J. Mol. Evol

21

:

58

-71

Matassi G., P. M. Sharp, C. Gautier,

1999

Chromosomal location effects on gene sequence evolution in mammals

Curr. Biol

9

:

786

-791

McVean G. A., J. Vieira,

2001

Inferring parameters of mutation, selection and demography from patterns of synonymous site evolution in Drosophila

Genetics

157

:

245

-257

Mouchiroud D., G. D'Onofrio, B. Aissani, G. Macaya, C. Gautier, G. Bernardi,

1991

The distribution of genes in the human genome

Gene

100

:

181

-187

Muse S. V., B. S. Gaut,

1994

A likelihood approach for comparing synonymous and nonsynonymous nucleotide substitution rates, with application to the chloroplast genome

Mol. Biol. Evol

11

:

715

-724

Nielsen R., Z. Yang,

1998

Likelihood models for detecting positively selected amino acid sites and applications to the HIV-1 envelope gene

Genetics

148

:

929

-936

Ophir R., D. Graur,

1997

Patterns and rates of indel evolution in processed pseudogenes from humans and murids

Gene

205

:

191

-202

Ophir R., T. Itoh, D. Graur, T. Gojobori,

1999

A simple method for estimating the intensity of purifying selection in protein-coding genes

Mol. Biol. Evol

16

:

49

-53

Petrov D. A., E. Lozovskaya, D. L. Hartl,

1996

High intrinsic rate of DNA loss in Drosophila

Nature

384

:

346

-349

Petrov D. A., D. L. Hartl,

1999

Patterns of nucleotide substitution in Drosophila and mammalian genomes

Proc. Natl. Acad. Sci. USA

96

:

1475

-1479

Petrov D. A., T. A. Sangster, J. S. Johnston, D. L. Hartl, K. L. Shaw,

2000

Evidence for DNA loss as a determinant of genome size

Nature

287

:

1060

-1062

Saccone S., S. Caccio, P. Perani, L. Andreozzi, A. Rapisarda, S. Motta, G. Bernardi,

1997

Compositional mapping of mouse chromosomes and identification of the gene-rich regions

Chromosome Res

5

:

293

-300

Sharp P. M., W.-H. Li,

1986

An evolutionary perspective on synonymous codon usage in unicellular organims

J. Mol. Evol

24

:

28

-38

Smith N. G., L. D. Hurst,

1999

The causes of synonymous rate variation in the rodent genome. Can substitution rates be used to estimate the sex bias in mutation rate?

Genetics

152

:

661

-673

Stuart A., J. K. Ord, S. Arnold,

1987

Kendall's advanced theory of statistics Vol. 2A: classical inference and the linear model. Sixth edition Oxford University Press, New York

Xia X.,

2000

DAMBE: data analysis in molecular biology and evolution Department of Ecology and Biodiversity, University of Hong Kong

Yang Z.,

1998

Likelihood ratio tests for detecting positive selection and application to primate lysozyme evolution

Mol. Biol. Evol

15

:

568

-573

Yang Z., R. Nielsen,

1998

Synonymous and nonsynonymous rate variation in nuclear genes of mammals

J. Mol. Evol

46

:

409

-418

———.

2000

Estimating synonymous and nonsynonymous substitution rates under realistic evolutionary models

Mol. Biol. Evol

17

:

32

-43

Download all slides

Month:	Total Views:
November 2016	1
February 2017	7
March 2017	4
April 2017	4
May 2017	1
July 2017	3
September 2017	2
October 2017	2
November 2017	1
December 2017	20
January 2018	8
February 2018	27
March 2018	32
April 2018	16
May 2018	23
June 2018	12
July 2018	18
August 2018	12
September 2018	21
October 2018	1
November 2018	7
December 2018	7
January 2019	4
February 2019	15
March 2019	14
April 2019	11
May 2019	12
June 2019	7
July 2019	19
August 2019	8
September 2019	12
October 2019	18
November 2019	15
December 2019	7
January 2020	16
February 2020	10
March 2020	7
April 2020	14
May 2020	8
June 2020	4
July 2020	8
August 2020	9
September 2020	16
October 2020	17
November 2020	11
December 2020	11
January 2021	8
February 2021	9
March 2021	8
April 2021	15
May 2021	10
June 2021	8
July 2021	9
August 2021	18
September 2021	9
October 2021	28
November 2021	22
December 2021	26
January 2022	14
February 2022	10
March 2022	8
April 2022	13
May 2022	31
June 2022	14
July 2022	23
August 2022	8
September 2022	12
October 2022	10
November 2022	23
December 2022	8
January 2023	13
February 2023	12
March 2023	9
April 2023	8
May 2023	18
June 2023	13
July 2023	3
August 2023	8
September 2023	10
October 2023	18
November 2023	10
December 2023	25
January 2024	15
February 2024	11
March 2024	17
April 2024	4

Article Contents

A Maximum Likelihood Method for Analyzing Pseudogene Evolution: Implications for Silent Site Evolution in Humans and Rodents

Abstract

Introduction

Statistical Methods

The Model

Likelihood Ratio Tests

Pooled Likelihood Ratio Test

Binomial Likelihood Ratio Test

Logistic Regression

Bootstrap Test

Estimating Effective Level of Selection on Synonymous Sites

Materials and Methods

Data

Results and Discussion

Implications of the Results for Pseudogene Evolution

Implications of Results for Silent Site Evolution

References

Citations

Views

Altmetric

Email alerts

Email alerts

Citing articles via

Latest

Most Read

Most Cited

Article Contents

A Maximum Likelihood Method for Analyzing Pseudogene Evolution: Implications for Silent Site Evolution in Humans and Rodents

Abstract

Introduction

Statistical Methods

The Model

Likelihood Ratio Tests

Pooled Likelihood Ratio Test

Binomial Likelihood Ratio Test

Logistic Regression

Bootstrap Test

Estimating Effective Level of Selection on Synonymous Sites

Materials and Methods

Data

Results and Discussion

Implications of the Results for Pseudogene Evolution

Implications of Results for Silent Site Evolution

References

Citations

Views

Altmetric

Email alerts

Email alerts

Citing articles via

Latest

Most Read

Most Cited

This Feature Is Available To Subscribers Only