The genetic basis of evolution is now accessible via whole genome sequencing. Not only is it feasible to sequence a mammalian genome, it is also possible to sequence in depth the transcripts expressed in living material, whether whole organisms or cell lines. To understand such data, phylogenetic, or evolutionary tree-based, methods are a natural model. I am particularly interested in developing and understanding such methods. Earlier work included the development of the first likelihood methods to model unequal rates across sites (e.g., Steel et al. 1993, Waddell and Penny 1996).
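The idea of unequal rates across sites can be sketched with a toy likelihood calculation. The sketch below assumes a Jukes-Cantor substitution model and four equal-weight rate categories; the category values are hypothetical illustrations, not those of any published method:

```python
import math

# Jukes-Cantor probability that two sequences differ at a site after
# divergence t, with relative rate multiplier r (JC69 assumed for brevity).
def p_diff(t, r=1.0):
    return 0.75 * (1.0 - math.exp(-4.0 * r * t / 3.0))

# Discrete approximation to rates varying across sites:
# four equal-weight rate categories (hypothetical values).
RATES = [0.2, 0.6, 1.2, 2.0]

def site_lik(t, differs):
    # Average the site likelihood over the rate categories.
    return sum((p_diff(t, r) if differs else 1.0 - p_diff(t, r))
               for r in RATES) / len(RATES)

def loglik(t, n_diff, n_same):
    # Log-likelihood of a toy pairwise alignment summarized as
    # counts of differing and identical sites.
    return (n_diff * math.log(site_lik(t, True))
            + n_same * math.log(site_lik(t, False)))
```

Maximizing such a log-likelihood over t (and, in practice, over the shape of the rate distribution) is the basic fitting step; ignoring rate variation biases distance estimates downward.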
An ongoing interest is in robust ways to assess the fit of data to model, taking into account both model and sampling uncertainty (e.g., Waddell et al. 2002, Waddell 2006). A related theme is exploring patterns of coevolution amongst genes (Waddell et al. 2007).
Mammalian Comparative Genomics
There remain unresolved issues in the phylogeny, divergence times, and ancestral population genetics of the deepest parts of the placental mammal tree. My work with colleagues in Japan erected the first predominantly correct classification and phylogeny of the placental mammals. This included the first identification of the four main groups of placental mammals, Laurasiatheria, Supraprimates (also called Euarchontoglires), Afrotheria, and Xenarthra (Waddell et al. 1999, 2001), and the relationships between them. Sequence data for mammals typically do not fit phylogenetic models well, and consequently the inferred trees are often wrong, even when millions of base pairs are used. Thus, in order to test our hypotheses of mammalian evolution, we developed a number of tests based on more conservative characters, such as SINE/LINE insertions (Waddell et al. 2001).
There remains considerable uncertainty as to which characters are most reliable for inferring and testing deep relationships (e.g., Swofford et al. 1996, Waddell and Shelly 2003), and current research includes examining and comparing different types of slowly evolving data. Equally uncertain are the exact divergence times and how best to model the process of rate change and integrate fossil evidence (e.g., Waddell and Penny 1996, Kitazoe et al. 2007).
Bioinformatics/Computational Biology
Bioinformatics means different things to different people, but phylogenetics is one core area. It sometimes involves elements of computer science, such as analyzing the number of operations required to solve a problem. An example is least squares fitting of distances to trees, an interesting area with close parallels to regression analysis. One result with David Bryant was to describe algorithms for this problem that are time optimal (Bryant and Waddell 1998).
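The regression parallel can be made concrete: on a tree, each pairwise distance is a sum of branch lengths, so the design matrix is 0/1 and ordinary least squares applies directly. Below is a minimal NumPy sketch for an unrooted four-taxon tree ((A,B),(C,D)); the branch lengths and noise are made-up numbers, and Bryant and Waddell's contribution was fast algorithms for this problem, not the formulation itself:

```python
import numpy as np

# Unrooted 4-taxon tree ((A,B),(C,D)) with branches [a, b, c, d, e],
# where e is the internal branch.  Each pairwise path distance is a
# 0/1 combination of branches, so fitting is ordinary linear regression.
# Rows: AB, AC, AD, BC, BD, CD.
A = np.array([
    [1, 1, 0, 0, 0],   # d(A,B) = a + b
    [1, 0, 1, 0, 1],   # d(A,C) = a + c + e
    [1, 0, 0, 1, 1],   # d(A,D) = a + d + e
    [0, 1, 1, 0, 1],   # d(B,C) = b + c + e
    [0, 1, 0, 1, 1],   # d(B,D) = b + d + e
    [0, 0, 1, 1, 0],   # d(C,D) = c + d
], dtype=float)

# Observed distances: true branch lengths plus a little noise
# (hypothetical numbers for illustration).
true = np.array([0.1, 0.2, 0.3, 0.4, 0.05])
d = A @ true + np.array([0.01, -0.01, 0.0, 0.0, 0.01, -0.01])

# Solve the least squares problem, exactly as in regression.
branch_lengths, *_ = np.linalg.lstsq(A, d, rcond=None)
```

With six observed distances and five branch parameters the system is overdetermined, and the residual sum of squares plays the same diagnostic role as in regression.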
A separate aspect of bioinformatics I have researched is gene expression analysis, including work for private industry. Joint research with Hirohisa Kishino developed a number of important methods for the analysis of microarray data. One of these is correspondence analysis, which was first used for expression data by Kishino and Waddell (2000). This paper also showed how partial correlations, the basis of graphical modeling, could be robustly estimated when the number of observations was less than the number of variables.
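The n &lt; p problem mentioned above can be illustrated with one standard workaround: shrinking the sample covariance matrix toward the identity so that it becomes invertible, then reading partial correlations off the precision matrix. The shrinkage weight below is an arbitrary illustration, not the estimator used in the paper, and the data are simulated:

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy expression matrix: 8 arrays (rows) x 20 genes (columns),
# so n < p and the sample covariance is singular (simulated data).
X = rng.normal(size=(8, 20))

S = np.cov(X, rowvar=False)          # 20 x 20, rank-deficient
# Shrink toward the identity to make the matrix invertible
# (lam = 0.5 is an assumed value for illustration only).
lam = 0.5
S_shrunk = (1 - lam) * S + lam * np.eye(S.shape[0])

# Partial correlations come from the scaled, sign-flipped inverse.
P = np.linalg.inv(S_shrunk)          # precision matrix
d = np.sqrt(np.diag(P))
partial_corr = -P / np.outer(d, d)
np.fill_diagonal(partial_corr, 1.0)
```

Each off-diagonal entry estimates the correlation between two genes after conditioning on all the others, which is the quantity graphical modeling is built on.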
Another important technique in expression analysis is graphical modeling, which was first used on this type of data by Waddell and Kishino (2000). A surprising and cautionary result of these analyses was that, while graphical modeling uses fit statistics, biological data sets that clearly could not fit the model were not rejected by the likelihood ratio statistic. Another part of the paper examines the meta-analysis of gene expression clustering and the importance of considering a wide variety of distance and clustering methods (in contrast to the then-popular approach of basing everything on a single UPGMA analysis).
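The meta-analysis point, that conclusions should be checked across distance measures and linkage rules rather than read off a single UPGMA run, can be sketched with SciPy's hierarchical clustering ("average" linkage is UPGMA); the expression profiles below are simulated toy data:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

rng = np.random.default_rng(1)
# Toy expression profiles: 6 genes x 5 conditions, drawn from two
# well-separated groups (simulated data for illustration).
X = np.vstack([rng.normal(0.0, 1.0, (3, 5)),
               rng.normal(4.0, 1.0, (3, 5))])

# Cluster under several distance measures and linkage rules, instead of
# trusting a single UPGMA ("average" linkage) analysis.
results = {}
for metric in ("euclidean", "correlation"):
    for method in ("average", "complete", "single"):
        tree = linkage(pdist(X, metric=metric), method=method)
        results[(metric, method)] = fcluster(tree, t=2, criterion="maxclust")
```

Comparing the label vectors in `results` shows which groupings are stable across methods and which are artifacts of one particular distance/linkage choice.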
Statistics is essential to understanding biological data, including DNA sequences and gene expression. I am interested in developing tests that more reliably show the expected errors of inferences from data. One of the major problems in analyzing genomic data is that our models are too simple and often do not fit the data very well. In a Bayesian analysis, for example, this results in credibility intervals that are typically far too narrow (Waddell et al. 2001). It is in situations like this that resampling techniques such as the bootstrap have utility (Waddell et al. 2002). However, these too, as typically applied to phylogenetic data, produce intervals that rapidly shrink towards zero width as more data are added. They record one type of stochastic error but ignore potentially huge systematic errors. The field needs techniques that monitor both types of error and give at least a semi-realistic estimate of the scale of the problems. Mammalian phylogenetics is littered with estimates of phylogeny that not only contain errors, but also carry extremely inaccurate and misleading estimates of their own accuracy.
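The bootstrap behaviour described here is easy to demonstrate: as the number of sites grows, the bootstrap interval narrows, regardless of any systematic bias in the underlying model. A minimal sketch, with invented per-site indicators standing in for real character data:

```python
import random

random.seed(0)

def mean(xs):
    return sum(xs) / len(xs)

def bootstrap_ci(data, stat, reps=1000):
    # Nonparametric bootstrap: resample sites with replacement and take
    # the 2.5% and 97.5% percentiles of the statistic's distribution.
    vals = sorted(stat(random.choices(data, k=len(data)))
                  for _ in range(reps))
    return vals[int(0.025 * reps)], vals[int(0.975 * reps)]

# Toy data: per-site indicators (1 = site supports a given split),
# invented for illustration.  The larger set is just 50 copies.
small = [1] * 60 + [0] * 40
large = small * 50

lo_s, hi_s = bootstrap_ci(small, mean)
lo_l, hi_l = bootstrap_ci(large, mean)
```

The interval for the larger data set is far narrower, yet both are centred on the same estimate; if that estimate is systematically biased, neither interval gives any hint of it, which is exactly the problem described above.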