CodeML:Behind Selectome
Theoretical principles:

CodeML is a program from the package PAML.
It estimates various parameters (Ts/Tv, dN/dS, branch length) on the codon (nucleotide) alignment, according to the phylogenetic tree. The parameter of interest to us is ω = dN/dS, which indicates selective pressure.

Selectome uses the branch-site model, which estimates different dN/dS values among branches and among sites.
A branch of interest is selected and called "foreground branch". All other branches in the tree are the "background" branches. The background branches share the same distribution of ω = dN/dS value among sites, whereas different values can apply to the foreground branch.
Two models are computed: a null model (H0), in which the foreground branch may have different proportions of sites under neutral selection than the background (i.e. relaxed purifying selection), and an alternative model (H1), in which the foreground branch may have a proportion of sites under positive selection.
As the alternative model is the general case, it is easier to present it first.

Alternative model (H1): three classes of sites on the foreground branch
ω0: dN/dS < 1
ω1: dN/dS = 1
ω2: dN/dS ≥ 1

K0 : Proportion of sites that are under purifying selection (ω0 < 1) on both foreground and background branches.
K1 : Proportion of sites that are under neutral evolution (ω1 = 1) on both foreground and background branches.
K2a: Proportion of sites that are under positive selection (ω2 ≥ 1) on the foreground branch and under purifying selection (ω0 < 1) on background branches.
K2b: Proportion of sites that are under positive selection (ω2 ≥ 1) on the foreground branch and under neutral evolution (ω1 = 1) on background branches.

For each category, we get the proportion of sites and the associated dN/dS values.

Null model (H0) (ω2 fixed to 1):

K0 : Sites that are under purifying selection (ω0 < 1) on both foreground and background branches.
K1 : Sites that are under neutral evolution (ω1 = 1) on both foreground and background branches.
K2a: Sites that are under neutral evolution (ω2 = 1) on the foreground branch and under purifying selection (ω0 < 1) on background branches.
K2b: Sites that are under neutral evolution (ω2 = 1) on the foreground branch and under neutral evolution (ω1 = 1) on background branches.

For each model, we get the log likelihood value (lnL1 for the alternative and lnL0 for the null models), from which we compute the Likelihood Ratio Test (LRT).
The 2times(lnL1-lnL0) follows a χ² curve with degree of freedom of 1, so we can get a p-value for this LRT.


File Preparation

Four files are needed to run CodeML:

1. The multiple nucleotide (CDS) alignment, in PHYLIP format: ENSGT00390000005194.Euteleostomi.001.phy

2. The phylogenetic tree with the branch of interest specified by "#1": ENSGT00390000005194.Euteleostomi.001.034.nwk

3. A command file where all parameters to run CodeML under the alternative model are specified: ENSGT00390000005194.Euteleostomi.001.034.H1.ctl

4. A command file where all parameters to run CodeML under the null model are specified: ENSGT00390000005194.Euteleostomi.001.034.H0.ctl

Alignment File

In Phylip format. CodeML will strictly remove any position that contains an undefined "N" nucleotide.

Tree File

The tree file is in classical NEWICK forms. We added a "#1" on the estimated foreground branch.

    ((((((((a001:0.080601,a002:0.087761):0.098029,a003:0.064073):0.128789,a004:0.151837):0.000005,
    a005:0.110686):0.121064,((a006:0.057627,a007:0.105676):0.062503,(((((((((a008:0.067174,a009:0.048846):0.074180,
    a010:0.025564):0.034071,a011:0.022204):0.054541,a012:0.022975):0.028882,a013:0.061145):0.022390,a014:0.073039):0.067948,
    a015:0.061092):0.029120,(((a016:0.068823,a017:0.038036):0.058143,a018:0.049684):0.026616,(a019:0.050394,
    a020:0.123071):0.084815):0.077086):0.018279,(((((a021:0.075107,a022:0.084427):0.048606,a023:0.100325):0.060762,
    ((a024:0.073681,a025:0.088203):0.074288,a026:0.097719):0.081522):0.060130,(a027:0.024757,a028:0.053627):0.049310):0.074308,
    a029:0.036030):0.023405):0.059574):0.090379):0.071366,a030:0.204436):0.037369,a031:0.118374):0.035100,((((a032:0.068424,
    a033:0.026283):0.027761,(a034:0.054111,a035:0.056049):0.090568):0.101341,a036:0.031963)#1:0.075484,a037:0.149411):0.055173):0.0;
Run Command File (alternative model H1)

We estimate the Ts/Tv ratio (fix_kappa = 0) and the dN/dS (fix_omega = 0).
The branch-site model is specified by setting the model parameter to 2 (different dN/dS for branches) and the NSsites values to 2 (which allows 3 categories for sites: purifying, neutral and positive selection).

  • We keep sites with ambiguity data (cleandata = 1). CodeML treats them as missing data.
  • We adjust genetic code if required. E.g. mammalian mitochondrial genes will use icode = 1.
  • In order to improve CodeML speed and convergence, we run an M0 model before the branch-site model.
    This allows us to use an initial/starting-point value for the kappa parameter, as well as already estimated branch lengths for the tree fix_blength = 1.
     seqfile = ENSGT00390000005194.Euteleostomi.001.phy        * sequence data file name
    treefile = ENSGT00390000005194.Euteleostomi.001.034.nwk    * tree structure file name
     outfile = ENSGT00390000005194.Euteleostomi.001.034.H1.mlc * main result file name

       noisy = 0   * 0,1,2,3,9: how much rubbish on the screen
     verbose = 1   * 1: detailed output, 0: concise output
     runmode = 0   * 0: user tree;  1: semi-automatic;  2: automatic
                   * 3: StepwiseAddition; (4,5):PerturbationNNI; -2: pairwise

     seqtype = 1   * 1:codons; 2:AAs; 3:codons-->AAs
   CodonFreq = 2   * 0:1/61 each, 1:F1X4, 2:F3X4, 3:codon table
       ndata = 1   * specifies the number of separate data sets in the file
       clock = 0   * 0: no clock, unrooted tree, 1: clock, rooted tree

      aaDist = 0   * 0:equal, +:geometric; -:linear, 1-6:G1974,Miyata,c,p,v,a
                   * 7:AAClasses

       model = 2   * models for codons:
                        * 0:one, 1:b, 2:2 or more dN/dS ratios for branches
                   * models for AAs or codon-translated AAs:
                        * 0:poisson, 1:proportional, 2:Empirical, 3:Empirical+F
                        * 6:FromCodon, 8:REVaa_0, 9:REVaa(nr=189)

     NSsites = 2   * 0:one w; 1:neutral; 2:positive selection; 3:discrete; 4:freqs;
                   * 5:gamma; 6:2gamma; 7:beta; 8:beta&w; 9:beta&gamma;
                   * 10:beta&gamma+1; 11:beta&normal>1; 12:0&2normal>1;
                   * 13:3normal>0

       icode = 0   * 0:universal code; 1:mammalian mt; 2-11:see below
       Mgene = 0   * 0:rates, 1:separate;

   fix_kappa = 0   * 1: kappa fixed, 0: kappa to be estimated
       kappa = 2.05154 * initial or fixed kappa

   fix_omega = 0   * 1: omega or omega_1 fixed, 0: estimate
       omega = 1

       getSE = 0   * 0: don't want them, 1: want S.E.s of estimates
RateAncestor = 0   * (0,1,2): rates (alpha>0) or ancestral states (1 or 2)
  Small_Diff = .5e-6 * small value used in the difference approximation of derivatives

   cleandata = 0   * remove sites with ambiguity data (1:yes, 0:no)?
 fix_blength = 1   * 0: ignore, -1: random, 1: initial, 2: fixed
      method = 0   * 0: simultaneous; 1: one branch at a time
Run Command File (null model H0)

The command file for the null model is the same as for the alternative model, except for two parameters (in red):
1. The name of the outfile is different.
2. The dN/dS ratio is fixed to 1 (fix_omega = 1).

     seqfile = ENSGT00390000005194.Euteleostomi.001.phy        * sequence data file name
    treefile = ENSGT00390000005194.Euteleostomi.001.034.nwk    * tree structure file name
     outfile = ENSGT00390000005194.Euteleostomi.001.034.H0.mlc * main result file name

       noisy = 9   * 0,1,2,3,9: how much rubbish on the screen
     verbose = 1   * 1: detailed output, 0: concise output
     runmode = 0   * 0: user tree;  1: semi-automatic;  2: automatic
                   * 3: StepwiseAddition; (4,5):PerturbationNNI; -2: pairwise

     seqtype = 1   * 1:codons; 2:AAs; 3:codons-->AAs
   CodonFreq = 2   * 0:1/61 each, 1:F1X4, 2:F3X4, 3:codon table
       ndata = 1   * specifies the number of separate data sets in the file
       clock = 0   * 0: no clock, unrooted tree, 1: clock, rooted tree

      aaDist = 0   * 0:equal, +:geometric; -:linear, 1-6:G1974,Miyata,c,p,v,a
                   * 7:AAClasses

       model = 2   * models for codons:
                        * 0:one, 1:b, 2:2 or more dN/dS ratios for branches
                   * models for AAs or codon-translated AAs:
                        * 0:poisson, 1:proportional, 2:Empirical, 3:Empirical+F
                        * 6:FromCodon, 8:REVaa_0, 9:REVaa(nr=189)

     NSsites = 2   * 0:one w; 1:neutral; 2:positive selection; 3:discrete; 4:freqs;
                   * 5:gamma; 6:2gamma; 7:beta; 8:beta&w; 9:beta&gamma;
                   * 10:beta&gamma+1; 11:beta&normal>1; 12:0&2normal>1;
                   * 13:3normal>0

       icode = 0   * 0:universal code; 1:mammalian mt; 2-11:see below
       Mgene = 0   * 0:rates, 1:separate;

   fix_kappa = 0   * 1: kappa fixed, 0: kappa to be estimated
       kappa = 2.05154 * initial or fixed kappa

   fix_omega = 1   * 1: omega or omega_1 fixed, 0: estimate
       omega = 1

       getSE = 0   * 0: don't want them, 1: want S.E.s of estimates
RateAncestor = 0   * (0,1,2): rates (alpha>0) or ancestral states (1 or 2)
  Small_Diff = .5e-6 * small value used in the difference approximation of derivatives

   cleandata = 0   * remove sites with ambiguity data (1:yes, 0:no)?
 fix_blength = 1   * 0: ignore, -1: random, 1: initial, 2: fixed
      method = 0   * 0: simultaneous; 1: one branch at a time
Launch CodeML

In Unix (Linux, MacOSX), this will look like:

    codeml ./ENSGT00390000005194.Euteleostomi.001.034.H1.ctl
    codeml ./ENSGT00390000005194.Euteleostomi.001.034.H0.ctl
Analyze results

1. Assign significance of the detection of positive selection on the selected branch:

Two outfiles are produced:
ENSGT00390000005194.Euteleostomi.001.034.H1.mlc (alternative model) and ENSGT00390000005194.Euteleostomi.001.034.H0.mlc (null model).

In the ENSGT00390000005194.Euteleostomi.001.034.H0.mlc, we retrieve only the likelihood lnL0 value in the following line:

lnL(ntime: 72  np: 76): -23384.936701      +0.000000

In the ENSGT00390000005194.Euteleostomi.001.034.H1.mlc, we retrieve again the likelihood value lnL1 in the following line:
lnL(ntime: 72  np: 77): -23372.806179      +0.000000

We can construct the LRT: 2×(lnL1 - lnL0) = 2×(-23372.806179 - (-23384.936701)) = 24.261044
The degree of freedom is 1 (np1 - np0 = 77-76).
p-value = 0.00000084123 (under χ²) => significant.

(In SELECTOME, we add another test to control the False Discovery Rate (FDR) < 0.10: the q-value. But it is relevant only for multiple analysis.)

2. If significant, retrieve sites under positive selection detected by Bayes Empirical Bayes (BEB):

In the ENSGT00390000005194.Euteleostomi.001.034.H1.mlc, we can retrieve sites under positive selection using the BEB method:
Bayes Empirical Bayes (BEB) analysis (Yang, Wong & Nielsen 2005. Mol. Biol. Evol. 22:1107-1118)
Positive sites for foreground lineages Prob(w>1):
   257 A 0.947
   293 T 0.620
   467 C 0.998**
   616 D 0.991**
   617 R 0.989*
   647 G 0.693
   676 P 0.971*
   688 S 0.741
   738 E 0.532
   866 C 0.997**
   903 N 0.948
   992 I 0.570

Amino acids refer to the first sequence in the alignment.
Position 467 has high probability (99.8%) to be under positive selection. Position 688 has a lower probability (between 50% & 95%) to be under positive selection.
Only positions above 50% are reported.

Filtering steps

Different kinds of quality filtering are used in Selectome to improve CodeML computations and discard low quality families and MSA regions:

  • Discard families without targeted taxa sequences (E.g. without Euteleostomi taxa)
  • Discard families with less than 6 sequences
    • Increase statistical power of codeml and the following methods
  • Filter by sequence with MaxAlign
    • Replace "X" amino acids by gaps "-" in all sequences for this step
    • Then MaxAlign removes short sequences that disrupt the alignment
  • Re-do 2
  • Align sequences with Pagan, a phylogeny(-indel)-aware alignment method using partial-order graphs
    • Isolation of non-homologous regions
  • Compute MCoffee scores on the Pagan alignment
    • Replace residues with low score (< 8, from 0 to 7; and keep from 8 to 9) by "x" (in lowercase for distinction with original X/undefined bases)
    • Follow the same MCoffee methods used by Ensembl Compara
  • Compute Guidance scores on the Pagan alignment, with Pagan aligner
    • Replace residues with low score (≤ 0.93) by "x"
  • Merge MCoffee and Guidance scores
  • Map masking on related nucleotide MSA
    • Replace Selenocysteins/stop codons by "x" because codeml cannot handled them
  • TrimAl removes columns with less than 4 residues
    • Discard poor alignment columns
  • Run M0 model
    • Pre-compute some parameters for faster branch-site model convergence
    • Shift to mitochondrial genetic code if required
  • Run Branch-site model with slimcodeml
    • Use branch lengths from the M0 model now on, as well as M0 initial parameters for branch-site model computations
Software releases