# Day 3 first session - continuous trait evolution.
# This tutorial will hopefully provide an introduction to analyzing the evolution of continuous traits using the geiger and ape packages in R.
library(geiger)# note this will automatically load ape if it is not already loaded.
mytree<-read.nexus("parrottree.nex")
mydata<-read.table("parrotfishes.txt")
# It is important to have names associated with the data that match the tip labels, if not ape will assume they are in the same order as the tip labels in the tree and you may or may not get a warning this is what it has done! This is one of the most common mistakes when running Phylogenetic Comparative Methods in R - if you are getting errors or odd answers one of the first things to check is that the function is reading the species names correctly from the data object and they match the tip names in the tree - the name.check function in geiger as used in Intro2Phylo_S2.R is useful for this. There is a variety of ways to sort your data to match the tip labels, this is one of them:
mysorteddata <- mydata[mytree$tip.label, ]
# The most basic model for trait evolution is a Brownian model. Under this model a trait evolves by small increments, the direction is random i.e. for each time step there is an equal probability for a decrease or an increase in trait value. when a trait evolving on a tree reaches a bifurcation (node), each branch starts evolving independently, starting from the common value at the node.
# I have created a simple simulation of Brownian evolution on a tree in the file "Brownian trait simulations.r" which you can find on the course website http://www.eve.ucdavis.edu/~wainwrightlab/links.html
#simple predictions can be derived from this model: the difference in trait values between each species pair will be proportional to their distance on the phylogeny (i.e. the time separating them) and the "step size" of the Brownian model. this latter parameter is the rate of BM (Brownian motion) evolution: it is the average time-corrected change in trait values.
#you can intuitively see the connection between the rate of BM evolution, the tree age, and the phylogenetic signal in a trait. for a young ("short") tree with low BM rate, the phylogenetic signal is expected to be strong (differences in trait values correpond to phylogenetic distance), whereas for an old ("long") tree with high BM rate, the phylogenetic signal is expected to be weak.
# (1) Fitting BM models using maximum likelihood with the fitContinues function in geiger
# With continuous traits, one can ask what is the rate of BM evolution along the tree. The statistics of interest is the trait variance/covariance. given a tree and a vector of trait values, we can use:
log_gape.BM<-fitContinuous(mytree, mysorteddata$log_gape, model="BM")
# This returns a list that includes $lnl the log likelihood of the model, and $beta which is the rate of BM evolution, the AIC score and the AICc - AIC corrected for small sample size.
# note that the rate here is constant with respect to time and the location on the tree.
# fitContinuous will accept multiple traits, as long as they are continuous. in this case, it will "spit" an output for your n traits
all.BM<-fitContinuous(mytree, mysorteddata[,1:5], model="BM")
#this will not work because we have NAs in our data
all.BM<-fitContinuous(mytree, mysorteddata[,1:4], model="BM") #note that we cannot have NAs
#(2) Testing for changes in BM rates over time with the fitContinuous function in geiger. If rates decrease over time it is a signature of adaptive radiation if the traits are ecologically relevant.
# What if the rate changes through time? fitContinuous can fit a time-variable BM model. as we saw in the discrete character session, this is done by transforming the tree.
log_gape.delta<-fitContinuous(mytree, mysorteddata$log_gape, model="delta")
# the output includes Pagel’s delta which raises all node depths to the power delta. If delta is less than one, evolution in concentrated early in the tree; delta > 1 concentrates evolution towards the tips. you can visualize the transformation to the tree:
my.delta.tree<-deltaTree(mytree, log_gape.delta$Trait1$delta)#transforms the tree according to the delta parameter
par(mfrow=c(1,2))#remember from day1 session2 that this sets the graphical parameters so that the plotting device has 1 row and 2 columns, so we can now plot two trees next to each other.
plot(mytree)
plot(my.delta.tree)
# How is this model doing compared to BM model?
p.delta=1-pchisq(2*(log_gape.delta$Trait1$lnl-log_gape.BM$Trait1$lnl),1)
c.delta=log_gape.delta$Trait1$aic-log_gape.BM$Trait1$aic
# which one would you prefer?
#another time-dependent model is the early burst model- which fits a model where the rate of evolution increases or decreases exponentially through time, under the model r(t) = ro * exp(r * t), where ro is the initial rate and r is the rate change parameter. The actual parameter estimated, endRate, is the proportion of the initial rate represented by the end rate.
log_gape.eb<-fitContinuous(mytree, mysorteddata$log_gape, model="EB")
# How is this model doing compared to BM model?
p.eb=1-pchisq(2*(log_gape.eb$Trait1$lnl-log_gape.BM$Trait1$lnl),1)
c.eb=log_gape.eb$Trait1$aic-log_gape.BM$Trait1$aic
# which one would you prefer?
# (3) Testing for phylogenetic signal using Pagel's lambda with the fitContinuous function in geiger.
# Pagel's lambda is a tree transformation that assesses the degree of phylogenetic signal within the trait by multiplying the internal branches of the tree by values between 0 and 1 - which is the lambda parameter. A lambda of 1 indicates complete phylogenetic patterning as it returns the branch lengths untransformed while 0 indicates no patterning as it leads the tree collapsing to a single large polytomy. To visualize this we can transform the tree with lambda=0 using the lambdaTree function in geiger and compare it to the original tree
log_gape.lam<-fitContinuous(mytree, mysorteddata$log_gape, model="lambda")
my.lambda.tree<-lambdaTree(mytree, log_gape.lam$Trait1$lambda)#transforms the tree topology to one that has all internal branch lengths multiplied by 0 (i.e. lambda=0) creating one giant basal polytomy
par(mfrow=c(1,2))#remember from day1 session2 that this sets the graphical parameters so that the plotting device has 1 row and 2 columns, so we can now plot two trees next to each other.
plot(mytree)
plot(my.lambda.tree)
# Likelihood ratio test approximated by a chi-squared distribution
p.lambda=1-pchisq(2*(log_gape.lam$Trait1$lnl-log_gape.BM$Trait1$lnl),1)
c.lambda=log_gape.lam$Trait1$aic-log_gape.BM$Trait1$aic
# (4) Testing for a speciational rate change with the fitContinous function in geiger.
# This one manipulates the tree as a test of "speciational" models. Each branch length in the tree is raised to the power kappa. Kappa = 1 is a Brownian motion model (so the tree is returned unchanged), while kappa = 0 is a speciational model where all branch lengths are equal.
log_gape.kap<-fitContinuous(mytree, mysorteddata$log_gape, model="kappa")
my.kappa.tree<-kappaTree(mytree, log_gape.lam$Trait1$lambda)#transforms the tree topology according to the observed kappa value
par(mfrow=c(1,2))#remember from day1 session2 that this sets the graphical parameters so that the plotting device has 1 row and 2 columns, so we can now plot two trees next to each other.
plot(mytree)
plot(my.kappa.tree)
# Likelihood ratio test approximated by a chi-squared distribution
p.kappa=1-pchisq(2*(log_gape.kap$Trait1$lnl-log_gape.BM$Trait1$lnl),1)
c.kappa=log_gape.kap$Trait1$aic-log_gape.BM$Trait1$aic
# (5) Other models in fitContinous function in geiger.
# a model of white noise- brownian motion with no phylogenetic structure
log_gape.WH<-fitContinuous(mytree, mysorteddata$log_gape, model="white")
p.WH=1-pchisq(2*(log_gape.WH$Trait1$lnl-log_gape.BM$Trait1$lnl),1)
c.WH=log_gape.WH$Trait1$aic-log_gape.BM$Trait1$aic
########lets compare the models and select the best one
{cat("tlambda\tdelta\tkappa\teb\twhite_noise\n")
cat(p.lambda,p.delta,p.kappa,p.eb,p.WH,'\n')
cat(c.lambda,c.delta,c.kappa,c.eb,c.WH)}
# (4) Ancestral State Reconstruction using maximum likelihood with the ace function in ape.
#the default model is Brownian motion where characters evolve randomly following a random walk. This model can be fitted by maximum likelihood (the default,Schluter et al. 1997), least squares (method = "pic", Felsenstein 1985), or generalized least squares (method = "GLS", Martins and Hansen 1997, Cunningham et al. 1998). In the latter case, the specification of phy and model are actually ignored: it is instead given through a correlation structure with the option corStruct.
anc_log_gape<-ace( mysorteddata$log_gape, mytree, type="cont")
# You may have received a warning message about NaNs (Not-a-Number) being produced and the standard error was not calculated for this reason. This should obviously inspire further investigation. It may have been caused by an underflow issue while calculating likelihoods.
# The object anc_log_gape is a list that contains the overall likelihood score ($loglik), the maximum likelihood estimate of the Brownian parameter ($sigma2), and the confidence intervals for ace ($95CI). note the confidence intervals at the nodes...
library(Hmisc)
errbar(branching.times(mytree),anc_log_gape$ace,anc_log_gape$CI95[,2],anc_log_gape$CI95[,1])
points(rep(0,length(mysorteddata$log_gape)),mysorteddata$log_gape,pch=21,cex = .5,bg='black')
# a simple visualization of the ancestral reconstruction
my.disp.tree=mytree
my.disp.tree$tip.label=format(mysorteddata$log_gape,digits=2)
plot(my.disp.tree)
nodelabels(format(anc_log_gape$ace,digits=2), frame="none", bg="white")
## phylogenetic independent contrasts:
# When we have species-level data on any set of traits, we are often interested in the correlation between traits. such correlation (given a model) can help us understand the causes and effects. for example, we can test the correlation between body size and brain size, assuming that larger body size provides the resources needed to evolve large brains. However, species are not independent of each other, and this dependence (summeriesed by the phylogeny) is validating one of the core assumptions of regression/correlation. Felsenstein (1985. Phylogenies and the comparative method. American Naturalist 125: 1–15) proposed a method that fully takes phylogeny into account in the analysis of comparative data. The idea behind the “contrasts” method is that, if we assume that a continuous trait evolves randomly in any direction (i.e., the Brownian motion model), then the “contrast” between two species is expected to have a distribution centered on zero and a variance proportional to the time since divergence. If the contrasts are scaled with the latter, then they have a variance equal to one. This can be done also for internal nodes because under the assumptions of the Brownian model the ancestral state of the variable can be calculated; a rescaling of the internal branches eventually occurs.
#In this formulation, the tree needs to be binary (fully dichotomous), and a contrast is computed for each node. Thus for n species, n − 1 contrasts will be computed. The contrasts are independent with respect to the phylogeny (unlike the original values of x), and standard statistical methods for continuous variables can be used.
pic.log_gape=pic(mysorteddata$log_gape,mytree)
#pic's are calculated for each trait separately. they are point estimates for the rate of change in trait values at the node, so the mean of pic.log_gape should be similar to the brownian rate estimated from the fitContinous model
mean(pic.log_gape)
log_gape.BM$Trait1$beta
#pics are used in comparative studies, typically when comparing two traits in a regression-like analysis. let's add a second variable
pic.log_protrusion=pic(mysorteddata$log_protrusion,mytree)
# Garland et al. (1993, Phylogenetic analysis of covariance by computer simulation. Systematic Biology 42: 265–292) recommended that linear regressions with PICs should be done through the origin (i.e. the intercept is set to zero). this can be done using linear regression model (GLM) or using major axis regression using the smatr module in R
# linear regression analysis of the raw data. slope is NOT forced through the origin. the order of variables is important as residuals are calculated on the Y dimension only
lm.raw=lm(mysorteddata$log_gape~mysorteddata$log_protrusion)
summary(lm.raw)
# linear regression analysis of the PICs. slope is forced through the origin (by adding -1 in the formula). the order of variables is important as residuals are calculated on the Y dimension only
lm.pic=lm(pic.log_gape~pic.log_protrusion-1)
summary(lm.pic)
#an alternative test is the major axis regression, recommended when there is no cause and effect. residuals are calculated based on X and Y distance from the regression line. intercept is forced through the origin (intercept=F). Y variable comes first. the method by which regression line is fitted can be 'OLS' or 0 for linear regression, 'SMA' or 1 for standardized major axis (this is the default), and 'MA' or 2 for major axis
library(smatr)
ma.pic=slope.test(pic.log_protrusion,pic.log_gape, test.value = 0, intercept = F,method = 2 )
ma.pic
# positive correlation between the variables would be interpreted as a correlated evolution between the two traits such that evolutionary increases in one trait are associated with evolutionary increases in another. here, we see no such correlation, indicating that the observed correlation is an artifact of the phylogeny. however often with contrasts, very short branches turn out to be outliers. you can try to remove contrasts for very short branches
short.pic.gape=pic.log_gape[-which((pic.log_gape)==min(pic.log_gape))]
short.pic.prot=pic.log_protrusion[-which((pic.log_gape)==min(pic.log_gape))]
ma.pic.short=slope.test(short.pic.prot,short.pic.gape, test.value = 0, intercept = F,method = 2 )
par(mfrow=c(2,2))#remember from day1 session2 that this sets the graphical parameters so that the plotting device has 2 row and 2 columns, so we can now plot the different contrast plots.
plot(pic.log_gape~pic.log_protrusion)
#we can plot the regression line from the lm (full line)
abline(a=0,b=summary(lm.pic)$coefficients[1])
#and plot the regression line from the MA (dashed line)
abline(a=0,b=ma.pic$b,lty = 2)
plot(mysorteddata$log_gape~mysorteddata$log_protrusion)
#we can plot the regression line
abline(a=summary(lm.raw)$coefficients[1,1], b=summary(lm.raw)$coefficients[2,1])
plot(short.pic.prot,short.pic.gape)
#we can plot the regression line
abline(a=0, b=ma.pic.short$b)
#####
##simulating traits on a tree- it could be useful to simulate traits under your tree to generate null distributions of trait values, or to test the generality of your findings. evolve.phylo (in ape) allows simulation of the (independent) evolution of one or several continuous characters along a given phylogenetic tree under a homogeneous Brownian model. you would have to specify a tree, value for the ancestral state (or a vector of n states for n traits) and trait variance (or a vector of n states for n traits)
my.BM.traits=evolve.phylo(mytree,c(0,1),c(0.2,0.3))
str(my.BM.traits)
plot(my.BM.traits)
# the output is the tree, with nodes ($node.character) and trait ($trait.character) values
# you can access the node and tip values of the simulated traits
my.BM.traits$node.character$V1
my.BM.traits$tip.character$V2
# OK, but what if you want to account for correlations between traits? Geiger will let you do that using sim.char
my.sim.chars=sim.char(mytree,ic.sigma(mytree,mysorteddata[1:4]),nsims=2)
# the arguments are the tree, the evolutionary variance-covariance matrix from my original data, and the number of simulations.
#the output is a MULTIDIMENSIONAL array with [,,1] being the result for the first simulation
my.sim.chars[,,1]
#and
my.sim.chars[,,2]
#the result of the second one