The goal of this project is to facilitate the use and development of Bayesian methods to infer phylogenies. One of the least understood aspects of many Bayesian methods is that they rely on stochastic processes that need to converge. Simply put, results from simulations that have failed to converge cannot be trusted, yet most users pay little or no attention to this critical element of Bayesian inference and do not take the available measures to ensure that a Bayesian analysis has converged. To accomplish the goal of this project, we are focusing our efforts on the following four areas.
- Create easy-to-use software tools designed to diagnose convergence of Monte Carlo simulations used for Bayesian phylogenetic inference.
- Create software tools that convert the output of Markov chain Monte Carlo (MCMC) simulations into formats that can easily be imported into other software packages used to diagnose MCMC convergence. Easy access to these data formats is intended to promote greater participation and collaboration by theoreticians in non-biology domains.
- Assemble and analyze diverse empirical data sets in order to better understand the limits of this popular method for inferring phylogeny and to draw attention to best practices in its use.
- Implement new methods for diagnosing convergence of Markov chains used for phylogeny inference.
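As a rough illustration of the conversion step in the second area, the sketch below rewrites a tab-delimited parameter log as CSV. The input layout (a header row, optional comment lines starting with `[` or `#`, one sampled generation per row) and the function name `tab_log_to_csv` are assumptions made for this example, not a description of any particular package's output or of the project's own tools.

```python
import csv
import io


def tab_log_to_csv(source, dest, burnin=0):
    """Copy a tab-delimited MCMC parameter log to CSV.

    Assumed (hypothetical) input layout: optional comment lines starting
    with '[' or '#', then a header row, then one row per sampled generation.
    The first `burnin` sampled rows are dropped. Returns the rows written.
    """
    reader = csv.reader(source, delimiter="\t")
    writer = csv.writer(dest, lineterminator="\n")
    header_seen = False
    sampled = 0   # data rows read so far
    written = 0   # data rows kept after burn-in
    for row in reader:
        # Skip blank lines and comment lines.
        if not row or row[0].startswith(("[", "#")):
            continue
        if not header_seen:
            writer.writerow(row)
            header_seen = True
            continue
        sampled += 1
        if sampled > burnin:
            writer.writerow(row)
            written += 1
    return written


# Hypothetical sample log, used only to show the call.
sample_log = (
    "[ID: 123]\n"
    "Gen\tLnL\tTL\n"
    "0\t-5000.1\t1.2\n"
    "100\t-4800.5\t1.1\n"
    "200\t-4799.9\t1.0\n"
)
converted = io.StringIO()
rows_written = tab_log_to_csv(io.StringIO(sample_log), converted, burnin=1)
print(converted.getvalue())
```

CSV is used here only because most statistical packages can import it; any other target format would follow the same read, filter, and write pattern.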
Background
Bayesian methods are being used by a growing number of researchers to infer phylogenies because of the computational efficiency of these methods (Huelsenbeck et al. 2001). However, the stochastic process underlying the implementation of these methods is not "one-size-fits-all," and one of the least understood aspects of these statistical methods is the concept of convergence. In practice, Bayesian inference of phylogeny relies on Markov chain Monte Carlo (MCMC) methods to estimate the posterior distribution of free parameters (e.g., the tree topology and the parameters of the substitution model). In simple terms, if the MCMC algorithm used to produce the Markov chain is allowed to run long enough, then samples drawn from the chain will approximate the correct posterior probabilities for the parameters of interest. On the other hand, if the Markov chain fails to converge in a reasonable amount of time, then the results cannot be trusted and the purported value of this increasingly popular method is lost.
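One widely used way to check whether independent chains have reached the same distribution is the Gelman-Rubin potential scale reduction factor (PSRF), which compares the variance between chains with the variance within them. The sketch below is a minimal, illustrative version for a single scalar parameter (for example, the log-likelihood); it is not the project's own diagnostic, the function name and sample values are hypothetical, and it assumes that burn-in samples have already been discarded and that all chains have the same length.

```python
from statistics import mean, variance


def potential_scale_reduction(chains):
    """Return the Gelman-Rubin PSRF for one scalar parameter.

    `chains` is a list of equal-length lists of post-burn-in samples,
    one list per independent MCMC run.
    """
    m = len(chains)
    if m < 2:
        raise ValueError("need at least two chains")
    n = len(chains[0])
    if n < 2 or any(len(c) != n for c in chains):
        raise ValueError("chains must have equal length of at least 2")

    # W: average of the within-chain sample variances.
    within = mean(variance(c) for c in chains)
    if within == 0:
        raise ValueError("chains have no within-chain variation")

    # B: between-chain variance, n times the variance of the chain means.
    between = n * variance([mean(c) for c in chains])

    # Pooled estimate of the posterior variance of the parameter.
    pooled = (n - 1) / n * within + between / n
    return (pooled / within) ** 0.5


# Hypothetical samples: two runs that agree, and two that do not.
agree = [[1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 4.0]]
disagree = [[0.0, 1.0, 0.0, 1.0], [10.0, 11.0, 10.0, 11.0]]
print(potential_scale_reduction(agree))
print(potential_scale_reduction(disagree))
```

Values close to 1 are consistent with convergence, and a common rule of thumb treats values above roughly 1.1 as a warning sign. A statistic like this applies to continuous parameters only; tree topology is not a scalar and needs other diagnostics.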
To obtain a Markov chain that converges in a reasonable time, most Bayesian phylogenetic software packages provide ways for the user to tune the MCMC algorithm. The problem is that most practitioners are unfamiliar with the pitfalls of applying Bayesian methods to infer phylogenies and remain ill-equipped to diagnose problems when they arise. Work conducted under this proposal will put easy-to-use tools into the hands of practitioners, so that problems with MCMC implementations can be diagnosed and the MCMC algorithm can be modified before results are published. The same software will be used to study convergence for diverse empirical data sets, and results from this work will be published in biology and CSE journals in order to draw further attention to the need for caution when using Bayesian methods to infer phylogenies.
The individuals identified in this proposal are particularly well suited to accomplish the proposed work. The investigators have firsthand experience in the life sciences, in the development of software for Bayesian inference of phylogeny, and in the exploration of MCMC convergence in Bayesian phylogenetics. In addition, the team has at its disposal powerful computational and visualization resources, as well as access to a talented pool of graduate students, system administrators, and faculty in a newly formed interdisciplinary computational sciences program.
This project is being funded by the National Science Foundation (EF-0849861). 


