Godon estimates various parameters (Ts/Tv, dN/dS, branch length) on the codon (nucleotide) alignment, according to the phylogenetic tree. The parameter of interest to us is ω = dN/dS, which indicates selective pressure. The model we use also estimates rate variation among codons, to account for selection at the DNA level (Davydov et al. 2019).
Godon is a codon models software. Godon strengths: rate variation in the Branch-Site model; a heuristic avoids LRT statistics overestimation, which can cause false positives in other implementations, and also corrects for LRT underestimation; multithreading; checkpoints for long computations; and it efficiently implements evolutionary rate variation at the DNA level (see above).
Selectome uses the branch-site model, which estimates different dN/dS values among branches and among sites.
A branch of interest is selected and called "foreground branch". All other branches in the tree are the "background" branches. The background branches share the same distribution of ω = dN/dS value among sites, whereas different values can apply to the foreground branch.
Two models are computed: a null model (H0), in which the foreground branch may have different proportions of sites under neutral selection than the background (i.e. relaxed purifying selection); and an alternative model (H1), in which the foreground branch may have a proportion of sites under positive selection.
As the alternative model is the general case, it is easier to present it first.
Alternative model (H1): three classes of sites on the foreground branch
ω0: dN/dS < 1
ω1: dN/dS = 1
ω2: dN/dS ≥ 1
K0: Proportion of sites that are under purifying selection (ω0 < 1) on both foreground and background branches.
K1: Proportion of sites that are under neutral evolution (ω1 = 1) on both foreground and background branches.
K2a: Proportion of sites that are under positive selection (ω2 ≥ 1) on the foreground branch and under purifying selection (ω0 < 1) on background branches.
K2b: Proportion of sites that are under positive selection (ω2 ≥ 1) on the foreground branch and under neutral evolution (ω1 = 1) on background branches.
For each category, we get the proportion of sites and the associated dN/dS values.
Null model (H0) (ω2 fixed to 1):
K0: Sites that are under purifying selection (ω0 < 1) on both foreground and background branches.
K1: Sites that are under neutral evolution (ω1 = 1) on both foreground and background branches.
K2a: Sites that are under neutral evolution (ω2 = 1) on the foreground branch and under purifying selection (ω0 < 1) on background branches.
K2b: Sites that are under neutral evolution (ω2 = 1) on the foreground branch and under neutral evolution (ω1 = 1) on background branches.
For each model, we get the log likelihood value (lnL1 for the alternative and lnL0 for the null models), from which we compute the Likelihood Ratio Test (LRT).
The 2×(lnL1-lnL0) follows a χ² curve with a degree of freedom of 1, so we can get a p-value for this LRT.
Two files are needed to run Godon:
1. The multiple nucleotide (CDS) alignment, in PHYLIP or FASTA format: ENSGT00550000074985.Euteleostomi.001.cds.fas
2. The related phylogenetic tree, in Newick format: ENSGT00550000074985.Euteleostomi.001.nwk
In Phylip format or FASTA format. Godon will strictly remove any position that contains an undefined "N" nucleotide.
The tree file is in classical NEWICK format. You can tag a specific branch with #1 to force Godon to evaluate only that foreground branch. Otherwise Godon will evaluate branches one after the other.
By default Godon will compute H1 and H0 at the same time.
1. Specify the branch-site model by using the BSG model.
2. Correct for LRT statistics overestimation with godon test.
3. Specify the number of discrete codon categories for the gamma distribution with --ncat-codon-rate 4.
4. Compute quickly branch lengths with --m0-tree.
5. Estimate only BEB posterior probabilities with --no-neb.
6. Evaluate all internal branches but no terminal branches (leaves), which are the most sensitive to sequencing or annotation errors, with --all-branches --no-leaves.
7. Adjust genetic code if required, e.g. for mammalian mitochondrial genes with --gcode=2.
In Unix (Linux, MacOSX), this will look like:
godon test BSG --ncat-codon-rate 4 \ --json=ENSGT00550000074985.Euteleostomi.001.godonBSG.json \ --m0-tree --no-neb --all-branches --no-leaves \ ENSGT00550000074985.Euteleostomi.001.cds.fas \ ENSGT00550000074985.Euteleostomi.001.nwk \ >ENSGT00550000074985.Euteleostomi.001.godonBSG.std.out \ 2>ENSGT00550000074985.Euteleostomi.001.godonBSG.std.err
1. Output files and estimated tree:
Main information is sent to the stdout file (ENSGT00550000074985.Euteleostomi.001.godonBSG.std.out); iteration steps and errors to the stderr file (ENSGT00550000074985.Euteleostomi.001.godonBSG.std.err).
Complete and advanced information is provided in the JSON file (ENSGT00550000074985.Euteleostomi.001.godonBSG.json) if you ask for it on the command line.
The tree with estimated branch lengths is available in the JSON output: "globalOptimizations" -> [0] -> "finalTree"
Estimated likelihoods (H0 and H1) for each branch are available in the stdout file, as well as in the JSON output.
For branch 17 (labelled with a #1 sign in the branch 17 tree in the stdout; test number 16 (17-1) in the JSON output) we retrieve those likelihoods:
"tests" -> [16] -> "H0" -> "maxLnL" -> -68562.13268087893
"tests" -> [16] -> "H1" -> "maxLnL" -> -68559.1967296804
We can construct the LRT: 2×(lnL1 - lnL0) = 2×(-68559.1967296804 - (-68562.13268087893)) = 5.87190239706
The LRT value is provided in the stdout file under the name Final D= for each branch.
The degree of freedom is 1 (number_of_parameters1 - number_of_parameters0).
So in that case p-value = 0.015384 (under χ²) => significant.
In SELECTOME, we add another test to control the False Discovery Rate (FDR) < 0.10: the q-value. But it is relevant only for multiple analysis.
3. If significant, retrieve sites under positive selection detected by Bayes Empirical Bayes (BEB):
BEB analysis pos codon aa p 55 GGA G 0.946* 613 GTA V 0.995** 722 GTT V 0.691 779 ACG T 0.986* 794 GTC V 0.645 1060 AAA K 0.683 1182 TTT F 0.725 1276 GTT V 0.697 1337 GTC V 0.606
aa (Amino Acids) refer to the first sequence in the alignment.
Position 613 has a high probability (0.995) to be under positive selection. Position 722 has a lower probability (between 0.5 and 0.95) to be under positive selection.
Different kinds of quality filtering are used in Selectome to improve Godon computations and discard low quality families and MSA regions: