HPC Solutions for Studying Genome Macroevolution
Phylogenetic comparative methods have become essential computational tools for studying problems in biology, linguistics, and paleontology, among other fields. They can be used to investigate important questions, such as the tempo (rate) and mode (gradual or punctuated) of evolution, correlated change, directional evolution, adaptive radiations, and ancestral character states.
A major challenge for current comparative methods is the vast growth in the size of datasets. Researchers are currently limited to models and software originally designed for datasets one to two orders of magnitude smaller. In the early 2000s, a dataset of 100 to 200 taxa was considered large. Now researchers have access to datasets of 10,000 birds and 55,000 plants, for example. Datasets for genomic traits are even larger. Big data allows evolution to be studied in exceptional detail, but it inherently contains greater amounts of heterogeneity. For example, the gene trees of vertebrate genomes form a massive dataset whose comparative analysis could determine how much evolutionary change is clock-like versus punctuated. This is a basic, yet unaddressed, question in biology, in part because scalable models and high-performance computing (HPC) software are not available. Standard comparative methods assume uniform evolutionary processes. This assumption will misestimate model parameters in even moderately sized datasets and lead researchers to erroneous conclusions about how their systems are evolving, with no way to identify and correct the error. Both variable rate models and HPC code are needed to keep pace with accelerating data generation. We are working with Andrew Mead and Chris Venditti at the University of Reading (UK) to develop HPC solutions to address this problem.
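To illustrate why a uniform-rate assumption can mislead, the following is a minimal toy sketch, not the model or software described here. It assumes a simple Brownian-motion process in which the change along a branch of length t is normally distributed with variance rate × t, and it assumes the per-branch changes are directly observed (in real analyses only the tips are). The rate values, sample sizes, and the function names `simulate_changes` and `estimate_rate` are hypothetical choices made for illustration.

```python
import random


def simulate_changes(rate, branch_lengths, rng):
    """Draw one Brownian-motion change per branch: N(0, rate * t)."""
    return [rng.gauss(0.0, (rate * t) ** 0.5) for t in branch_lengths]


def estimate_rate(changes, branch_lengths):
    """Maximum-likelihood estimate of a single rate shared by all branches."""
    scaled = [d * d / t for d, t in zip(changes, branch_lengths)]
    return sum(scaled) / len(scaled)


rng = random.Random(1)  # fixed seed so the sketch is reproducible
n = 5000                # hypothetical number of branches per regime

# Two groups of branches evolving at very different (made-up) rates.
slow_lengths = [rng.uniform(0.5, 2.0) for _ in range(n)]
fast_lengths = [rng.uniform(0.5, 2.0) for _ in range(n)]
slow_changes = simulate_changes(0.1, slow_lengths, rng)
fast_changes = simulate_changes(2.0, fast_lengths, rng)

# Variable-rate view: estimate each regime separately.
slow_rate = estimate_rate(slow_changes, slow_lengths)
fast_rate = estimate_rate(fast_changes, fast_lengths)

# Uniform-rate view: force one rate onto all branches.
pooled_rate = estimate_rate(slow_changes + fast_changes,
                            slow_lengths + fast_lengths)

print(f"slow regime:  {slow_rate:.3f}")
print(f"fast regime:  {fast_rate:.3f}")
print(f"single rate:  {pooled_rate:.3f}")
```

In this toy setting the single pooled rate falls between the two regime-specific rates and describes neither group well, which is the kind of misestimation that variable rate models are intended to avoid.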
As an example, we used a novel variable rate model to study Dmrt1 exon data. Dmrt1 plays a role in sex differentiation in animals as distantly related as fruit flies, nematodes, and humans (Ferguson-Smith 2007). We found substantial support for a variable rate model with most rapid shifts in evolution occurring in terminal branches, suggesting that the number of exons is part of species-specific adaptation. Notably, the rate of exon gain/loss is much slower in fish (top, blue in Figure) than in amniotes (bottom, red in Figure) (p < 0.0001). This analysis suggests that Dmrt1 exons are being recruited more slowly in fish during the evolution of sex determination, despite the greater amount of nucleotide evolution in fish than in amniotes.