1. What are PTP and bPTP?
PTP produces species delimitation hypotheses from a gene tree inferred from molecular sequences. PTP stands for the Poisson Tree Processes model. In PTP, speciation and branching events are modeled in terms of the number of mutations, so the only input required is a phylogenetic tree, for example the output of RAxML, whose branch lengths represent the number of mutations. Our numerous tests show that PTP outperforms GMYC. More importantly, PTP is very easy to use, because it works on the phylogenetic tree directly, without the difficult and error-prone time calibration required by GMYC. bPTP is a Bayesian implementation of the PTP model.
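To make the idea concrete, here is a minimal Python sketch of the kind of likelihood PTP evaluates for one candidate delimitation. It assumes, following the published description of the model, that branch lengths between species and branch lengths within species each follow their own exponential distribution. The function names and the toy branch lengths are made up for illustration; this is not the code used by the server.

```python
import math

def exponential_loglik(branch_lengths):
    """Log-likelihood of branch lengths under an exponential distribution
    whose rate is set to its maximum likelihood estimate (count / sum)."""
    n = len(branch_lengths)
    total = sum(branch_lengths)
    if n == 0 or total <= 0:
        # Simplification for this sketch: an empty or all-zero class adds nothing.
        return 0.0
    rate = n / total
    return n * math.log(rate) - rate * total

def delimitation_loglik(between_species, within_species):
    """Score one delimitation: one exponential per branch class (assumed model)."""
    return exponential_loglik(between_species) + exponential_loglik(within_species)

# Hypothetical branch lengths: long branches between species, short ones within.
long_branches = [0.10, 0.12, 0.09]
short_branches = [0.001, 0.002, 0.0015]

clean_split = delimitation_loglik(long_branches, short_branches)
mixed_split = delimitation_loglik([0.10, 0.001, 0.12], [0.09, 0.002, 0.0015])
print(clean_split > mixed_split)
```

A delimitation that keeps long and short branches in separate classes scores higher than one that mixes them, which is the signal the model relies on.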
2. How can I get the phylogenetic tree?
Google "RAxML", "PhyML" or "MrBayes". Want to use a web server? Try http://embnet.vital-it.ch/raxml-bb/ or http://www.atgc-montpellier.fr/phyml/
The server accepts only Newick or NEXUS format, with no annotations on the tree.
bPTP supports multifurcating trees and tolerates zero-length branches.
Here is an example of Newick format and an example of NEXUS format.
Special note on taxon names: a valid taxon name should contain only letters, numbers, _ and -, and should NOT consist of numbers only. Any other characters, in particular spaces, $, #, %, ( and ), will cause errors!
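If you want to check your names before uploading, the rule above can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, the example tree is invented, and the leaf-name extraction is a simple regular expression that assumes a plain Newick string without quoted labels or annotations.

```python
import re

# Letters, numbers, underscore and hyphen only (the rule stated above).
VALID_NAME = re.compile(r"[A-Za-z0-9_-]+")

def is_valid_taxon_name(name):
    """True if the name uses only allowed characters and is not purely numeric."""
    return VALID_NAME.fullmatch(name) is not None and not name.isdigit()

def leaf_names(newick):
    """Rough extraction of leaf labels: text following '(' or ',' up to the
    next ':', ',', ')' or ';'. Not a full Newick parser."""
    found = re.findall(r"[(,]\s*([^():,;\s][^():,;]*)", newick)
    return [name.strip() for name in found]

# Hypothetical tree with two problematic names.
tree = "((Homo sapiens:0.1,12345:0.2):0.05,Pan_troglodytes-1:0.3);"
bad_names = [n for n in leaf_names(tree) if not is_valid_taxon_name(n)]
print(bad_names)
```

Names that contain ( or ) cannot be recovered reliably by this simple pattern, so it is best used as a quick screen rather than a guarantee.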
3. How many generations of MCMC should I run?
As long as possible, but the server has a limit of 500,000 generations. For small trees (<50 taxa), 100,000 generations is usually enough. For large trees, you should run the MCMC analysis much longer, and always check for convergence (see question 5). If your tree is larger, say a few hundred taxa, you can start with 100,000 generations and the analysis will be fast (< 30 min); if that is not enough, change the seed and run longer. If you have very large trees, with more than a thousand taxa, you should download the stand-alone version of bPTP and try at least 1 million generations. Alternatively, you can use the maximum likelihood solution (see question 6).
4. Why should I always, always check for convergence?
The answer is simple: if the MCMC chains did not converge, the results are wrong. All the Bayesian support values are meaningless if the MCMC chains did not reach the equilibrium distribution.
5. How do I check for convergence?
We only care about the species delimitation, so visually checking the likelihood plot of each delimitation is sufficient. Due to the nature of this model and my implementation, mixing is not an issue. Upon convergence, the chain should stay at high-likelihood locations most of the time and only sometimes explore low-likelihood locations. I will show you two examples so you can get an intuitive feeling for it.
A typical example of a converged MCMC chain:
A typical example of an MCMC chain that has NOT converged:
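If you have the sampled log-likelihood values as a list of numbers, a rough numerical companion to the visual check can be sketched as follows. This is not something the server does; the function name, the 10% burn-in and the tolerance are arbitrary assumptions, and the check does not replace looking at the plot. It simply asks whether the second half of the trace sits at about the same level as the first half.

```python
import statistics

def looks_stationary(log_likelihoods, burn_in=0.1, tolerance=0.5):
    """Heuristic: after discarding burn-in, compare the medians of the first
    and second halves of the trace, relative to the overall spread."""
    start = int(len(log_likelihoods) * burn_in)
    kept = log_likelihoods[start:]
    if len(kept) < 4:
        raise ValueError("trace is too short to compare two halves")
    half = len(kept) // 2
    first, second = kept[:half], kept[half:]
    spread = statistics.pstdev(kept)
    if spread == 0:
        return True  # a perfectly flat trace has no drift
    drift = abs(statistics.median(second) - statistics.median(first))
    return drift <= tolerance * spread

# Made-up traces: one fluctuating around a fixed level, one still climbing.
flat_trace = [100.0 + (i % 5) for i in range(1000)]
climbing_trace = [float(i) for i in range(1000)]
print(looks_stationary(flat_trace), looks_stationary(climbing_trace))
```

A trace that is still climbing at the end of the run fails the check, which matches the picture of a chain that has not yet reached the high-likelihood region.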
6. What if my MCMC chain does not converge no matter how I try?
This can happen when (a) your tree is really large and the server limits the number of MCMC generations, (b) PTP is not a good model for your data, or (c) the likelihood surface is really rough and you had bad luck. If getting the Bayesian results is frustrating, remember that you can always fall back on the maximum likelihood solution. A flat prior is given to every possible delimitation, so if a single tree is used, there is always a maximum likelihood solution. However, remember that Bayesian support values for the delimited species are meaningless without a converged MCMC chain.
7. What if I do not like the Bayesian approach?
See question 6. Note that there is also a bootstrap version of PTP on my GitHub.
8. How can I get the maximum likelihood solution?
See question 6.
9. How long does it take for the analysis to finish?
This depends on the size and shape of your tree and on the number of MCMC generations you specified. 30 taxa and 100,000 generations need only 1 min; 300 taxa and 500,000 generations should take no more than a few hours; 1000 taxa and 500,000 generations might take days; and 5000 taxa might take weeks.
10. What does the Bayesian support value mean?
Support values shown on the tree plot are computed as "number of MCMC samples in which all the descendants under this node occur as one species" / "number of samples from MCMC sampling". They are the posterior probabilities that those taxa form one species under the PTP model and a flat prior. In tests on simulated data, support values were strongly correlated with the accuracy of the delimitation (r = 0.91).
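The ratio described above can be illustrated with a small Python sketch. The data layout (each MCMC sample stored as a list of sets of taxon names) and the function name are assumptions made for this example, not the format used by bPTP.

```python
def support(taxa, samples):
    """Fraction of sampled delimitations in which exactly these taxa
    form one species."""
    group = frozenset(taxa)
    hits = sum(
        1 for delimitation in samples
        if group in {frozenset(species) for species in delimitation}
    )
    return hits / len(samples)

# Four hypothetical MCMC samples over the taxa A, B and C.
samples = [
    [{"A", "B"}, {"C"}],
    [{"A", "B"}, {"C"}],
    [{"A"}, {"B"}, {"C"}],
    [{"A", "B", "C"}],
]
print(support({"A", "B"}, samples))  # 0.5
print(support({"C"}, samples))       # 0.75
```

Here A and B are delimited together as one species in two of the four samples, so the node joining them gets a support of 0.5.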
11. How do I cite PTP and bPTP?
12. Questions, bugs and suggestions?
Please post them on the Google group. If you think there is a bug, please e-mail me your job ID, your e-mail address and the parameters you used to run the job, and attach your input tree file.