CodeML is a program from the package PAML.
It estimates various parameters (Ts/Tv, dN/dS, branch length) on the codon (nucleotide) alignment, according to the phylogenetic tree. The parameter of interest to us is ω = dN/dS, which indicates selective pressure.
Selectome uses the branch-site model, which estimates different dN/dS values among branches and among sites.
A branch of interest is selected and called "foreground branch". All other branches in the tree are the "background" branches. The background branches share the same distribution of ω = dN/dS value among sites, whereas different values can apply to the foreground branch.
Two models are computed: a null model (H0), in which the foreground branch may have different proportions of sites under neutral selection than the background (i.e. relaxed purifying selection), and an alternative model (H1), in which the foreground branch may have a proportion of sites under positive selection.
As the alternative model is the general case, it is easier to present it first.
Alternative model (H1): three classes of sites on the foreground branch
ω0: dN/dS < 1
ω1: dN/dS = 1
ω2: dN/dS ≥ 1
K0 : Proportion of sites that are under purifying selection (ω0 < 1) on both foreground and background branches.
K1 : Proportion of sites that are under neutral evolution (ω1 = 1) on both foreground and background branches.
K2a: Proportion of sites that are under positive selection (ω2 ≥ 1) on the foreground branch and under purifying selection (ω0 < 1) on background branches.
K2b: Proportion of sites that are under positive selection (ω2 ≥ 1) on the foreground branch and under neutral evolution (ω1 = 1) on background branches.
For each category, we get the proportion of sites and the associated dN/dS values.
Null model (H0) (ω2 fixed to 1):
K0 : Sites that are under purifying selection (ω0 < 1) on both foreground and background branches.
K1 : Sites that are under neutral evolution (ω1 = 1) on both foreground and background branches.
K2a: Sites that are under neutral evolution (ω2 = 1) on the foreground branch and under purifying selection (ω0 < 1) on background branches.
K2b: Sites that are under neutral evolution (ω2 = 1) on the foreground branch and under neutral evolution (ω1 = 1) on background branches.
For each model, we get the log likelihood value (lnL1 for the alternative and lnL0 for the null models), from which we compute the Likelihood Ratio Test (LRT).
The 2times(lnL1-lnL0) follows a χ² curve with degree of freedom of 1, so we can get a p-value for this LRT.
Four files are needed to run CodeML:
1. The multiple nucleotide (CDS) alignment, in PHYLIP format: ENSGT00390000005194.Euteleostomi.001.phy
2. The phylogenetic tree with the branch of interest specified by "#1": ENSGT00390000005194.Euteleostomi.001.034.nwk
3. A command file where all parameters to run CodeML under the alternative model are specified: ENSGT00390000005194.Euteleostomi.001.034.H1.ctl
4. A command file where all parameters to run CodeML under the null model are specified: ENSGT00390000005194.Euteleostomi.001.034.H0.ctl
In Phylip format. CodeML will strictly remove any position that contains an undefined "N" nucleotide.
The tree file is in classical NEWICK forms. We added a "#1" on the estimated foreground branch.
((((((((a001:0.080601,a002:0.087761):0.098029,a003:0.064073):0.128789,a004:0.151837):0.000005, a005:0.110686):0.121064,((a006:0.057627,a007:0.105676):0.062503,(((((((((a008:0.067174,a009:0.048846):0.074180, a010:0.025564):0.034071,a011:0.022204):0.054541,a012:0.022975):0.028882,a013:0.061145):0.022390,a014:0.073039):0.067948, a015:0.061092):0.029120,(((a016:0.068823,a017:0.038036):0.058143,a018:0.049684):0.026616,(a019:0.050394, a020:0.123071):0.084815):0.077086):0.018279,(((((a021:0.075107,a022:0.084427):0.048606,a023:0.100325):0.060762, ((a024:0.073681,a025:0.088203):0.074288,a026:0.097719):0.081522):0.060130,(a027:0.024757,a028:0.053627):0.049310):0.074308, a029:0.036030):0.023405):0.059574):0.090379):0.071366,a030:0.204436):0.037369,a031:0.118374):0.035100,((((a032:0.068424, a033:0.026283):0.027761,(a034:0.054111,a035:0.056049):0.090568):0.101341,a036:0.031963)#1:0.075484,a037:0.149411):0.055173):0.0;
We estimate the Ts/Tv ratio (fix_kappa = 0) and the dN/dS (fix_omega = 0).
The branch-site model is specified by setting the model parameter to 2 (different dN/dS for branches) and the NSsites values to 2 (which allows 3 categories for sites: purifying, neutral and positive selection).
seqfile = ENSGT00390000005194.Euteleostomi.001.phy * sequence data file name treefile = ENSGT00390000005194.Euteleostomi.001.034.nwk * tree structure file name outfile = ENSGT00390000005194.Euteleostomi.001.034.H1.mlc * main result file name noisy = 0 * 0,1,2,3,9: how much rubbish on the screen verbose = 1 * 1: detailed output, 0: concise output runmode = 0 * 0: user tree; 1: semi-automatic; 2: automatic * 3: StepwiseAddition; (4,5):PerturbationNNI; -2: pairwise seqtype = 1 * 1:codons; 2:AAs; 3:codons-->AAs CodonFreq = 2 * 0:1/61 each, 1:F1X4, 2:F3X4, 3:codon table ndata = 1 * specifies the number of separate data sets in the file clock = 0 * 0: no clock, unrooted tree, 1: clock, rooted tree aaDist = 0 * 0:equal, +:geometric; -:linear, 1-6:G1974,Miyata,c,p,v,a * 7:AAClasses model = 2 * models for codons: * 0:one, 1:b, 2:2 or more dN/dS ratios for branches * models for AAs or codon-translated AAs: * 0:poisson, 1:proportional, 2:Empirical, 3:Empirical+F * 6:FromCodon, 8:REVaa_0, 9:REVaa(nr=189) NSsites = 2 * 0:one w; 1:neutral; 2:positive selection; 3:discrete; 4:freqs; * 5:gamma; 6:2gamma; 7:beta; 8:beta&w; 9:betaγ * 10:beta&gamma+1; 11:beta&normal>1; 12:0&2normal>1; * 13:3normal>0 icode = 0 * 0:universal code; 1:mammalian mt; 2-11:see below Mgene = 0 * 0:rates, 1:separate; fix_kappa = 0 * 1: kappa fixed, 0: kappa to be estimated kappa = 2.05154 * initial or fixed kappa fix_omega = 0 * 1: omega or omega_1 fixed, 0: estimate omega = 1 getSE = 0 * 0: don't want them, 1: want S.E.s of estimates RateAncestor = 0 * (0,1,2): rates (alpha>0) or ancestral states (1 or 2) Small_Diff = .5e-6 * small value used in the difference approximation of derivatives cleandata = 0 * remove sites with ambiguity data (1:yes, 0:no)? fix_blength = 1 * 0: ignore, -1: random, 1: initial, 2: fixed method = 0 * 0: simultaneous; 1: one branch at a time
The command file for the null model is the same as for the alternative model, except for two parameters (in red):
1. The name of the outfile is different.
2. The dN/dS ratio is fixed to 1 (fix_omega = 1).
seqfile = ENSGT00390000005194.Euteleostomi.001.phy * sequence data file name treefile = ENSGT00390000005194.Euteleostomi.001.034.nwk * tree structure file name outfile = ENSGT00390000005194.Euteleostomi.001.034.H0.mlc * main result file name noisy = 9 * 0,1,2,3,9: how much rubbish on the screen verbose = 1 * 1: detailed output, 0: concise output runmode = 0 * 0: user tree; 1: semi-automatic; 2: automatic * 3: StepwiseAddition; (4,5):PerturbationNNI; -2: pairwise seqtype = 1 * 1:codons; 2:AAs; 3:codons-->AAs CodonFreq = 2 * 0:1/61 each, 1:F1X4, 2:F3X4, 3:codon table ndata = 1 * specifies the number of separate data sets in the file clock = 0 * 0: no clock, unrooted tree, 1: clock, rooted tree aaDist = 0 * 0:equal, +:geometric; -:linear, 1-6:G1974,Miyata,c,p,v,a * 7:AAClasses model = 2 * models for codons: * 0:one, 1:b, 2:2 or more dN/dS ratios for branches * models for AAs or codon-translated AAs: * 0:poisson, 1:proportional, 2:Empirical, 3:Empirical+F * 6:FromCodon, 8:REVaa_0, 9:REVaa(nr=189) NSsites = 2 * 0:one w; 1:neutral; 2:positive selection; 3:discrete; 4:freqs; * 5:gamma; 6:2gamma; 7:beta; 8:beta&w; 9:betaγ * 10:beta&gamma+1; 11:beta&normal>1; 12:0&2normal>1; * 13:3normal>0 icode = 0 * 0:universal code; 1:mammalian mt; 2-11:see below Mgene = 0 * 0:rates, 1:separate; fix_kappa = 0 * 1: kappa fixed, 0: kappa to be estimated kappa = 2.05154 * initial or fixed kappa fix_omega = 1 * 1: omega or omega_1 fixed, 0: estimate omega = 1 getSE = 0 * 0: don't want them, 1: want S.E.s of estimates RateAncestor = 0 * (0,1,2): rates (alpha>0) or ancestral states (1 or 2) Small_Diff = .5e-6 * small value used in the difference approximation of derivatives cleandata = 0 * remove sites with ambiguity data (1:yes, 0:no)? fix_blength = 1 * 0: ignore, -1: random, 1: initial, 2: fixed method = 0 * 0: simultaneous; 1: one branch at a time
In Unix (Linux, MacOSX), this will look like:
codeml ./ENSGT00390000005194.Euteleostomi.001.034.H1.ctl codeml ./ENSGT00390000005194.Euteleostomi.001.034.H0.ctl
1. Assign significance of the detection of positive selection on the selected branch:
Two outfiles are produced:
ENSGT00390000005194.Euteleostomi.001.034.H1.mlc (alternative model) and ENSGT00390000005194.Euteleostomi.001.034.H0.mlc (null model).
In the ENSGT00390000005194.Euteleostomi.001.034.H0.mlc, we retrieve only the likelihood lnL0 value in the following line:
lnL(ntime: 72 np: 76): -23384.936701 +0.000000
lnL(ntime: 72 np: 77): -23372.806179 +0.000000
We can construct the LRT: 2×(lnL1 - lnL0) = 2×(-23372.806179 - (-23384.936701)) = 24.261044
The degree of freedom is 1 (np1 - np0 = 77-76).
p-value = 0.00000084123 (under χ²) => significant.
(In SELECTOME, we add another test to control the False Discovery Rate (FDR) < 0.10: the q-value. But it is relevant only for multiple analysis.)
2. If significant, retrieve sites under positive selection detected by Bayes Empirical Bayes (BEB):
In the ENSGT00390000005194.Euteleostomi.001.034.H1.mlc, we can retrieve sites under positive selection using the BEB method:Bayes Empirical Bayes (BEB) analysis (Yang, Wong & Nielsen 2005. Mol. Biol. Evol. 22:1107-1118) Positive sites for foreground lineages Prob(w>1): 257 A 0.947 293 T 0.620 467 C 0.998** 616 D 0.991** 617 R 0.989* 647 G 0.693 676 P 0.971* 688 S 0.741 738 E 0.532 866 C 0.997** 903 N 0.948 992 I 0.570
Amino acids refer to the first sequence in the alignment.
Position 467 has high probability (99.8%) to be under positive selection. Position 688 has a lower probability (between 50% & 95%) to be under positive selection.
Only positions above 50% are reported.
Different kinds of quality filtering are used in Selectome to improve CodeML computations and discard low quality families and MSA regions: