Methods - Selectome

Godon - Behind Selectome

Theoretical principles:

Godon estimates various parameters (Ts/Tv, dN/dS, branch length) on the codon (nucleotide) alignment, according to the phylogenetic tree. The parameter of interest to us is ω = dN/dS, which indicates selective pressure. The model we use also estimates rate variation among codons, to account for selection at the DNA level (Davydov et al. 2019).

Godon is a codon models software. Godon strengths: rate variation in the Branch-Site model; a heuristic avoids LRT statistics overestimation, which can cause false positives in other implementations, and also corrects for LRT underestimation; multithreading; checkpoints for long computations; and it efficiently implements evolutionary rate variation at the DNA level (see above).

Selectome uses the branch-site model, which estimates different dN/dS values among branches and among sites.
A branch of interest is selected and called "foreground branch". All other branches in the tree are the "background" branches. The background branches share the same distribution of ω = dN/dS value among sites, whereas different values can apply to the foreground branch.
Two models are computed: a null model (H0), in which the foreground branch may have different proportions of sites under neutral selection than the background (i.e. relaxed purifying selection); and an alternative model (H1), in which the foreground branch may have a proportion of sites under positive selection.
As the alternative model is the general case, it is easier to present it first.

Alternative model (H1): three classes of sites on the foreground branch
ω₀: dN/dS < 1
ω₁: dN/dS = 1
ω₂: dN/dS ≥ 1

K0: Proportion of sites that are under purifying selection (ω₀ < 1) on both foreground and background branches.
K1: Proportion of sites that are under neutral evolution (ω₁ = 1) on both foreground and background branches.
K2a: Proportion of sites that are under positive selection (ω₂ ≥ 1) on the foreground branch and under purifying selection (ω₀ < 1) on background branches.
K2b: Proportion of sites that are under positive selection (ω₂ ≥ 1) on the foreground branch and under neutral evolution (ω₁ = 1) on background branches.

For each category, we get the proportion of sites and the associated dN/dS values.

Null model (H0) (ω₂ fixed to 1):

K0: Sites that are under purifying selection (ω₀ < 1) on both foreground and background branches.
K1: Sites that are under neutral evolution (ω₁ = 1) on both foreground and background branches.
K2a: Sites that are under neutral evolution (ω₂ = 1) on the foreground branch and under purifying selection (ω₀ < 1) on background branches.
K2b: Sites that are under neutral evolution (ω₂ = 1) on the foreground branch and under neutral evolution (ω₁ = 1) on background branches.

For each model, we get the log likelihood value (lnL₁ for the alternative and lnL₀ for the null models), from which we compute the Likelihood Ratio Test (LRT).
The 2×(lnL₁-lnL₀) follows a χ² curve with a degree of freedom of 1, so we can get a p-value for this LRT.

File Preparation

Two files are needed to run Godon:

1. The multiple nucleotide (CDS) alignment, in PHYLIP or FASTA format: ENSGT00550000074985.Euteleostomi.001.cds.fas

2. The related phylogenetic tree, in Newick format: ENSGT00550000074985.Euteleostomi.001.nwk

Alignment File

In Phylip format or FASTA format. Godon will strictly remove any position that contains an undefined "N" nucleotide.

Tree File

The tree file is in classical NEWICK format. You can tag a specific branch with #1 to force Godon to evaluate only that foreground branch. Otherwise Godon will evaluate branches one after the other.

Godon command arguments and options

By default Godon will compute H1 and H0 at the same time.
1. Specify the branch-site model by using the BSG model.
2. Correct for LRT statistics overestimation with godon test.
3. Specify the number of discrete codon categories for the gamma distribution with --ncat-codon-rate 4.
4. Compute quickly branch lengths with --m0-tree.
5. Estimate only BEB posterior probabilities with --no-neb.
6. Evaluate all internal branches but no terminal branches (leaves), which are the most sensitive to sequencing or annotation errors, with --all-branches --no-leaves.
7. Adjust genetic code if required, e.g. for mammalian mitochondrial genes with --gcode=2.

Launch Godon

In Unix (Linux, MacOSX), this will look like:

    godon test BSG --ncat-codon-rate 4 \
         --json=ENSGT00550000074985.Euteleostomi.001.godonBSG.json \
         --m0-tree --no-neb --all-branches --no-leaves \
         ENSGT00550000074985.Euteleostomi.001.cds.fas \
         ENSGT00550000074985.Euteleostomi.001.nwk \
          >ENSGT00550000074985.Euteleostomi.001.godonBSG.std.out \
         2>ENSGT00550000074985.Euteleostomi.001.godonBSG.std.err

Analyze results

1. Output files and estimated tree:

Main information is sent to the stdout file (ENSGT00550000074985.Euteleostomi.001.godonBSG.std.out); iteration steps and errors to the stderr file (ENSGT00550000074985.Euteleostomi.001.godonBSG.std.err).
Complete and advanced information is provided in the JSON file (ENSGT00550000074985.Euteleostomi.001.godonBSG.json) if you ask for it on the command line.

The tree with estimated branch lengths is available in the JSON output: "globalOptimizations" -> [0] -> "finalTree"

2. Assign significance of the detection of positive selection on the selected branch:

Estimated likelihoods (H0 and H1) for each branch are available in the stdout file, as well as in the JSON output.
For branch 17 (labelled with a #1 sign in the branch 17 tree in the stdout; test number 16 (17-1) in the JSON output) we retrieve those likelihoods:

"tests" -> [16] -> "H0" -> "maxLnL" -> -68562.13268087893
"tests" -> [16] -> "H1" -> "maxLnL" -> -68559.1967296804

We can construct the LRT: 2×(lnL₁ - lnL₀) = 2×(-68559.1967296804 - (-68562.13268087893)) = 5.87190239706
The LRT value is provided in the stdout file under the name Final D= for each branch.
The degree of freedom is 1 (number_of_parameters₁ - number_of_parameters₀).
So in that case p-value = 0.015384 (under χ²) => significant.

In SELECTOME, we add another test to control the False Discovery Rate (FDR) < 0.10: the q-value. But it is relevant only for multiple analysis.

3. If significant, retrieve sites under positive selection detected by Bayes Empirical Bayes (BEB):

We can retrieve sites under positive selection using the BEB method.
The JSON file contains all sites with their posterior probability; the stdout file contains only positions above 0.5.
In Selectome amino acids with posterior probability between 0.5 and 0.95 are reported as weakly significant; those between 0.95 and 0.99 are reported as significant; and those >= 0.99 are reported as very significant.

BEB analysis
pos  codon  aa  p
55   GGA    G   0.946*
613  GTA    V   0.995**
722  GTT    V   0.691
779  ACG    T   0.986*
794  GTC    V   0.645
1060 AAA    K   0.683
1182 TTT    F   0.725
1276 GTT    V   0.697
1337 GTC    V   0.606

aa (Amino Acids) refer to the first sequence in the alignment.
Position 613 has a high probability (0.995) to be under positive selection. Position 722 has a lower probability (between 0.5 and 0.95) to be under positive selection.

References

Filtering steps

Different kinds of quality filtering are used in Selectome to improve Godon computations and discard low quality families and MSA regions:

Discard families without targeted taxa sequences (E.g. without Euteleostomi taxa)
Discard families with less than 10 sequences
- Increase statistical power of Godon and the following methods
Filter by sequence with MaxAlign
- Replace "X" amino acids by gaps "-" in all sequences for this step
- Then MaxAlign removes short sequences that disrupt the alignment
Re-do 2
Align sequences with MAFFT and the allowshift option to better isolate non-homologous regions
Compute MCoffee scores on the MAFFT alignment
- Replace residues with low score (< 6, from 0 to 5; and keep from 6 to 9) by "x" (in lowercase for distinction with original X/undefined bases)
- MCoffee uses: mafft_msa, muscle_msa, clustalo_msa, t_coffee_msa
Map masking on related nucleotide MSA
- Replace Selenocysteins/stop codons by "x" because Godon cannot handled them
TrimAl removes columns with less than 4 residues
- Discard poor alignment columns
Run Branch-Site model with Godon
- Shift to mitochondrial genetic code if required

Software releases