Skip to content

Ecology and Multivariate Statistics Redundancy Analysis (RDA)

Notifications You must be signed in to change notification settings

katarinastuart/EnvComputing_RDA

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

20 Commits
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Ecology and Multivariate Statistics

Redundancy Analysis (RDA)

Redundancy analysis (RDA) is a dimension reduction statistical approach for multivariate data sets. Dimension reduction techniques are often used on large, noisy, data sets to strip away any unnecessary components, leaving only those part that best capture any explanatory/predictor power. Even if you have not heard of RDA, it is very possible that you have encounter one of the main linear techniques for dimensionality reduction; principle component analysis (PCA).

Some background reading relevant to what we will be doing:

A quick PCA and RDA overview

PCA is used to discern underlying structures (i.e. principle components) that capture variance in a data set. For more information on the fundamentals of PCA I would recommend visiting Principal Component Analysis 4 Dummies: Eigenvectors, Eigenvalues and Dimension Reduction.

RDA is similar to PCA, however it incorporates the uses of response and explanatory variables, whereas analysis such as PCA doesn’t incorporate the explanatory variables.

RDA is traditionally used on response variables such as:

For this example, I have used the vignettes available on the above two links, as well as additional notes from Natalie Hofmeister and myself.

How can we use RDA in other ways?

Below I apply the approach to a multivariate phenotypic dataset. The aim is to see if the environmental data associated with the areas from which the wild-caught toads were collected carry any explanatory power for various phenotypic measures. I.e. we want to explore the link between environment and phenotype. As listed above, RDA can help you pick out environmental correlations in genotype data, and could conceivably be used to link genotype measures to that of phenotype.

ScreenShot

Use of RDA for the below analysis can be either quantitative or qualitative. The direct RDA outputs may be useful for you, or they may simply serve as an exploratory tool to direct further analysis.

Introducing the data set: Cane Toads!

The following data is one generated from my honours research. It contains phenotype variables for a number of cane toads sourced from one of three states: QLD, NT, or WA.

Let’s import the data and have a quick look at it

setwd("C:/Users/z5188231/Desktop/Coding/Files/EnvironmentalComputing")
ToadData <- read.csv("CaneToad_Phenotypes.csv",stringsAsFactors=TRUE,sep=",")
str(ToadData)
## 'data.frame':    121 obs. of  19 variables:
##  $ Exercise                               : int  1 0 1 1 1 1 1 1 1 0 ...
##  $ Diet                                   : int  0 1 1 1 1 0 1 0 0 0 ...
##  $ Source.Location                        : Factor w/ 3 levels "NT","QLD","WA": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Clutch                                 : Factor w/ 8 levels "Innisfail 1",..: 4 4 4 5 5 5 5 5 5 5 ...
##  $ Individual.number                      : int  442 445 453 402 403 404 405 406 407 409 ...
##  $ Body.Length                            : num  28.7 25.4 29.3 28.9 32 32.1 31.9 54.3 24.1 37.5 ...
##  $ Leg.length..mm.                        : num  0.376 0.343 0.334 0.36 0.347 ...
##  $ Arm.length..mm.                        : num  0.254 0.22 0.242 0.235 0.234 ...
##  $ Head.Width..mm.                        : num  0.345 0.331 0.341 0.356 0.338 ...
##  $ Heart.Mass..g.                         : num  0.000627 0.000276 0.000614 0.000519 0.000812 ...
##  $ Liver.Mass..g.                         : num  0.003066 0.000906 0.002628 0.001419 0.002875 ...
##  $ Struggle..number.of.kicks.             : int  8 3 9 3 0 2 3 3 2 4 ...
##  $ Righting.time..s.                      : int  1 1 1 1 3 1 3 1 1 1 ...
##  $ Distance.traveled.in.1.min..cm.        : int  573 496 721 775 534 890 365 1140 450 512 ...
##  $ Spontaneous.Acitivty...95.quantile..cm.: num  14.4 10.3 31.1 62.6 32.8 ...
##  $ exhaustion.score                       : num  4.33 4.96 3.89 4.42 4.45 ...
##  $ Stamina.score                          : num  -0.263 0.301 -0.316 -0.132 -0.135 ...
##  $ Recovery.score                         : num  0.1553 -0.2401 0.1488 -0.0193 0.2098 ...
##  $ ln.RadioTelemetry.distance             : num  NA NA NA NA 6.8 NA 7.1 10.2 NA NA ...

We see there is a total of 121 toads in this data set. The first 5 columns of these data denote information about the treatment levels (exercise/diet), source location, clutch, and and individual ID for each toad. The remaining 14 columns the contain phenotypic data.

Prepare Response and Explanatory data sets

Step 1.1 Preparing the response dataset (phenotype data)

First we check the data for any missingness, and make sure our matrix does not have any NA’s.

sum(is.na(ToadData)) # we see there is 91 missing data
## [1] 91
colnames(ToadData)[colSums(is.na(ToadData)) > 0] #explore data to identify where they are located
## [1] "ln.RadioTelemetry.distance"
ToadData <- ToadData[c(1:18)] #remove offending data
sum(is.na(ToadData)) #0 missing data
## [1] 0

In the above chunk we discovered 91 pieces of missing information. This is because some variables were not measured across all toads. It appears all the NA’s were in the final variable column - ln.RadioTelemetry.distance. So to prepare our data for the RDA, we need to remove this offending variable. Make sure to find the most appropriate way to resolve this should your data set have missing values. You may be able to remove rows, remove columns, generate an average etc. but you don’t want to reduce your data set drastically or remove too many variables/individuals.

Next we retain only the phenotype data columns, as these are the ones we are interested on conducting the RDA on.

ToadVars <- ToadData[c(6:18)]
str(ToadVars)
## 'data.frame':    121 obs. of  13 variables:
##  $ Body.Length                            : num  28.7 25.4 29.3 28.9 32 32.1 31.9 54.3 24.1 37.5 ...
##  $ Leg.length..mm.                        : num  0.376 0.343 0.334 0.36 0.347 ...
##  $ Arm.length..mm.                        : num  0.254 0.22 0.242 0.235 0.234 ...
##  $ Head.Width..mm.                        : num  0.345 0.331 0.341 0.356 0.338 ...
##  $ Heart.Mass..g.                         : num  0.000627 0.000276 0.000614 0.000519 0.000812 ...
##  $ Liver.Mass..g.                         : num  0.003066 0.000906 0.002628 0.001419 0.002875 ...
##  $ Struggle..number.of.kicks.             : int  8 3 9 3 0 2 3 3 2 4 ...
##  $ Righting.time..s.                      : int  1 1 1 1 3 1 3 1 1 1 ...
##  $ Distance.traveled.in.1.min..cm.        : int  573 496 721 775 534 890 365 1140 450 512 ...
##  $ Spontaneous.Acitivty...95.quantile..cm.: num  14.4 10.3 31.1 62.6 32.8 ...
##  $ exhaustion.score                       : num  4.33 4.96 3.89 4.42 4.45 ...
##  $ Stamina.score                          : num  -0.263 0.301 -0.316 -0.132 -0.135 ...
##  $ Recovery.score                         : num  0.1553 -0.2401 0.1488 -0.0193 0.2098 ...

We have now finished preparing the phenotype data (the response variables) for our RDA analysis.

Some points to consider:

  • Make sure that your response variables are independent. For instance, with this toad data set, I originally had measurements of Body Length, Arm Length, Leg Length etc. These measures would presumably be largely correlated with one another.
    • Solution: To deal with this, either omit redundant response data, or compute it in such a way that it represents a more independent aspect of the phenology (in this case I calculated limb/head lengths as a ratio of total body length).
  • The data should be normally distributed.
    • Solution: Transform variables that do not comply. The transformation type needed depends on the data, so make sure to do your research. A good place to start is the Handbook of Biological Statistics.

Step 2.1 Preparing the explanatory dataset (environmental data)

The first step here is to grab global climate data from WorldClim using the raster package.

library(raster)
climdata <- getData('worldclim',download=TRUE,var='bio',res=5)

This contains 19 different climate variables, measured globally. From this we want to select only the locations that are relevant for our study. Your metadata file should be just two columns: latitude and longitude for each individual (121), written in the format as below (and in the same order as your response data set).

setwd("C:/Users/z5188231/Desktop/Coding/Files/EnvironmentalComputing")
popcoords <- read.csv("metadata.csv",stringsAsFactors=TRUE,sep=",")
head(popcoords)
##       long      lat
## 1 131.3253 -12.5648
## 2 131.3253 -12.5648
## 3 131.3253 -12.5648
## 4 131.3253 -12.5648
## 5 131.3253 -12.5648
## 6 131.3253 -12.5648
str(popcoords)
## 'data.frame':    121 obs. of  2 variables:
##  $ long: num  131 131 131 131 131 ...
##  $ lat : num  -12.6 -12.6 -12.6 -12.6 -12.6 ...

How we have all the sample sites, we can pull out the climate data from global dataset for each individual coordinate, and label them in a human readable format (variable names obtained from the WorldClim website.

points <- SpatialPoints(popcoords, proj4string=climdata@crs)
values <- extract(climdata,points)
envdata <- cbind.data.frame(popcoords,values)
str(envdata)
## 'data.frame':    121 obs. of  21 variables:
##  $ long : num  131 131 131 131 131 ...
##  $ lat  : num  -12.6 -12.6 -12.6 -12.6 -12.6 ...
##  $ bio1 : num  273 273 273 273 273 273 273 273 273 273 ...
##  $ bio2 : num  111 111 111 111 111 111 111 111 111 111 ...
##  $ bio3 : num  59 59 59 59 59 59 59 59 59 59 ...
##  $ bio4 : num  1851 1851 1851 1851 1851 ...
##  $ bio5 : num  352 352 352 352 352 352 352 352 352 352 ...
##  $ bio6 : num  164 164 164 164 164 164 164 164 164 164 ...
##  $ bio7 : num  188 188 188 188 188 188 188 188 188 188 ...
##  $ bio8 : num  281 281 281 281 281 281 281 281 281 281 ...
##  $ bio9 : num  244 244 244 244 244 244 244 244 244 244 ...
##  $ bio10: num  292 292 292 292 292 292 292 292 292 292 ...
##  $ bio11: num  244 244 244 244 244 244 244 244 244 244 ...
##  $ bio12: num  1395 1395 1395 1395 1395 ...
##  $ bio13: num  334 334 334 334 334 334 334 334 334 334 ...
##  $ bio14: num  1 1 1 1 1 1 1 1 1 1 ...
##  $ bio15: num  103 103 103 103 103 103 103 103 103 103 ...
##  $ bio16: num  887 887 887 887 887 887 887 887 887 887 ...
##  $ bio17: num  4 4 4 4 4 4 4 4 4 4 ...
##  $ bio18: num  391 391 391 391 391 391 391 391 391 391 ...
##  $ bio19: num  4 4 4 4 4 4 4 4 4 4 ...
colnames(envdata) <- c("long","lat",
      "AnnualMeanTemp","MeanDiurnalRange",
      "Isothermality","TempSeasonality",
      "MaxTempofWarmestMonth","MinTempofColdestMonth",
      "TempAnnualRange","MeanTempofWettestQuarter",
      "MeanTempofDriestQuarter","MeanTempofWarmestQuarter",
      "MeanTempofColdestQuarter","AnnualPrecipitation",
      "PrecipitationofWettestMonth","PrecipitationofDriestMonth",
      "PrecipitationSeasonality","PrecipitationofWettestQuarter",
      "PrecipitationofDriestQuarter","PrecipitationofWarmestQuarter",
      "PrecipitationofColdestQuarter")
str(envdata)
## 'data.frame':    121 obs. of  21 variables:
##  $ long                         : num  131 131 131 131 131 ...
##  $ lat                          : num  -12.6 -12.6 -12.6 -12.6 -12.6 ...
##  $ AnnualMeanTemp               : num  273 273 273 273 273 273 273 273 273 273 ...
##  $ MeanDiurnalRange             : num  111 111 111 111 111 111 111 111 111 111 ...
##  $ Isothermality                : num  59 59 59 59 59 59 59 59 59 59 ...
##  $ TempSeasonality              : num  1851 1851 1851 1851 1851 ...
##  $ MaxTempofWarmestMonth        : num  352 352 352 352 352 352 352 352 352 352 ...
##  $ MinTempofColdestMonth        : num  164 164 164 164 164 164 164 164 164 164 ...
##  $ TempAnnualRange              : num  188 188 188 188 188 188 188 188 188 188 ...
##  $ MeanTempofWettestQuarter     : num  281 281 281 281 281 281 281 281 281 281 ...
##  $ MeanTempofDriestQuarter      : num  244 244 244 244 244 244 244 244 244 244 ...
##  $ MeanTempofWarmestQuarter     : num  292 292 292 292 292 292 292 292 292 292 ...
##  $ MeanTempofColdestQuarter     : num  244 244 244 244 244 244 244 244 244 244 ...
##  $ AnnualPrecipitation          : num  1395 1395 1395 1395 1395 ...
##  $ PrecipitationofWettestMonth  : num  334 334 334 334 334 334 334 334 334 334 ...
##  $ PrecipitationofDriestMonth   : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ PrecipitationSeasonality     : num  103 103 103 103 103 103 103 103 103 103 ...
##  $ PrecipitationofWettestQuarter: num  887 887 887 887 887 887 887 887 887 887 ...
##  $ PrecipitationofDriestQuarter : num  4 4 4 4 4 4 4 4 4 4 ...
##  $ PrecipitationofWarmestQuarter: num  391 391 391 391 391 391 391 391 391 391 ...
##  $ PrecipitationofColdestQuarter: num  4 4 4 4 4 4 4 4 4 4 ...

Step 2.2 Check which variables are correlated

We mentioned earlier on that we wanted to remove non-independent measures from our response data set. A similar process must be done for the explanatory dataset. Some of our environmental measurement are likely to be highly correlated across our sampling sites. There are several ways of doing this, which will also be informed by your experimental model/questions. If you are particularly interested in specific environmental measures, you may want to leave those in only, and run the RDA. You could also conceivably make the decision based on the ecology and climate of the ecosystems. If you are interested in a more exploratory approach (like we are doing now), then I would recommend using one of the below methods:

Note: Where possible it is probably best to use method two, however for studies where the sample environments are more similar method one is more appropriate.

Method one: PCA to identify correlated environmental predictors

For this method we use a PCA to determine which variables are colinear. We will run the analysis, then go through and remove any unnecessary environmental variables whose variance is represented elsewhere in another measure.

env.PCA <- princomp(envdata)
summary(env.PCA)
## Importance of components:
##                             Comp.1     Comp.2       Comp.3       Comp.4
## Standard deviation     1393.425566 468.663169 84.353531286 1.215425e+01
## Proportion of Variance    0.895337   0.101284  0.003281146 6.812006e-05
## Cumulative Proportion     0.895337   0.996621  0.999902097 9.999702e-01
##                              Comp.5       Comp.6       Comp.7       Comp.8
## Standard deviation     7.843926e+00 1.517241e+00 8.709465e-01 7.591008e-06
## Proportion of Variance 2.837174e-05 1.061520e-06 3.497856e-07 2.657161e-17
## Cumulative Proportion  9.999986e-01 9.999997e-01 1.000000e+00 1.000000e+00
##                              Comp.9      Comp.10      Comp.11      Comp.12
## Standard deviation     5.004443e-06 2.258427e-06 1.877824e-06 1.708714e-06
## Proportion of Variance 1.154863e-17 2.351966e-18 1.626031e-18 1.346350e-18
## Cumulative Proportion  1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00
##                             Comp.13      Comp.14      Comp.15 Comp.16
## Standard deviation     9.439264e-07 8.605142e-07 8.118830e-07       0
## Proportion of Variance 4.108613e-19 3.414562e-19 3.039526e-19       0
## Cumulative Proportion  1.000000e+00 1.000000e+00 1.000000e+00       1
##                        Comp.17 Comp.18 Comp.19 Comp.20 Comp.21
## Standard deviation           0       0       0       0       0
## Proportion of Variance       0       0       0       0       0
## Cumulative Proportion        1       1       1       1       1

It appears the first two PCA components capture nearly all the data variability. However, I will grab the variable with the highest loading from the first 4 principle components (so it can be compared to the below method two)

env.PCA$loadings[,1:4]
##                                      Comp.1        Comp.2       Comp.3
## long                           0.0048404979  4.761069e-03  0.048835380
## lat                           -0.0007300049 -3.850558e-03 -0.009170369
## AnnualMeanTemp                -0.0141177073 -6.303686e-03 -0.127762783
## MeanDiurnalRange              -0.0132345224 -9.097271e-04 -0.073499737
## Isothermality                 -0.0013768578 -3.828411e-03 -0.009191644
## TempSeasonality               -0.0448161066  9.820835e-01 -0.114544286
## MaxTempofWarmestMonth         -0.0223526615  1.505883e-05 -0.190967918
## MinTempofColdestMonth         -0.0028274595 -1.037636e-02 -0.094961542
## TempAnnualRange               -0.0195252020  1.039142e-02 -0.096006376
## MeanTempofWettestQuarter      -0.0106398206  7.489046e-03 -0.098423045
## MeanTempofDriestQuarter       -0.0059839507 -1.281836e-02 -0.117094478
## MeanTempofWarmestQuarter      -0.0136925588  4.429645e-03 -0.126836652
## MeanTempofColdestQuarter      -0.0126194600 -1.932741e-02 -0.123305899
## AnnualPrecipitation            0.8276645875  5.084426e-05 -0.361742741
## PrecipitationofWettestMonth    0.1272388027 -2.860341e-02  0.023180736
## PrecipitationofDriestMonth     0.0269402527  1.618841e-02 -0.023552139
## PrecipitationSeasonality      -0.0136866807 -4.736583e-04 -0.013179815
## PrecipitationofWettestQuarter  0.3797219678 -5.242919e-02  0.002339598
## PrecipitationofDriestQuarter   0.0818568962  5.225970e-02 -0.070568217
## PrecipitationofWarmestQuarter  0.3572417953  1.481986e-01  0.835899853
## PrecipitationofColdestQuarter  0.1247560261  7.789082e-02 -0.131792843
##                                     Comp.4
## long                           0.031557960
## lat                           -0.008248749
## AnnualMeanTemp                -0.076010988
## MeanDiurnalRange              -0.113298480
## Isothermality                 -0.011999364
## TempSeasonality               -0.089897185
## MaxTempofWarmestMonth         -0.137352232
## MinTempofColdestMonth          0.012207835
## TempAnnualRange               -0.149560066
## MeanTempofWettestQuarter      -0.040115110
## MeanTempofDriestQuarter       -0.038223218
## MeanTempofWarmestQuarter      -0.062890539
## MeanTempofColdestQuarter      -0.061909046
## AnnualPrecipitation            0.034152563
## PrecipitationofWettestMonth    0.196372533
## PrecipitationofDriestMonth    -0.009702084
## PrecipitationSeasonality      -0.047680531
## PrecipitationofWettestQuarter -0.479111735
## PrecipitationofDriestQuarter   0.037799171
## PrecipitationofWarmestQuarter  0.027677406
## PrecipitationofColdestQuarter  0.803751378

Looks like TempSeasonality, Annual Precipitation, PrecipitationofWarmestQuarter, and PrecipitationofWarmestQuarter are the most important.

PCApredictions <- envdata[c(6,14,18:19)]
str(PCApredictions)
## 'data.frame':    121 obs. of  4 variables:
##  $ TempSeasonality              : num  1851 1851 1851 1851 1851 ...
##  $ AnnualPrecipitation          : num  1395 1395 1395 1395 1395 ...
##  $ PrecipitationofWettestQuarter: num  887 887 887 887 887 887 887 887 887 887 ...
##  $ PrecipitationofDriestQuarter : num  4 4 4 4 4 4 4 4 4 4 ...

Method two: R squared to identify correlated environmental predictors

For more information on using this approach, use this RDA vignette. The r squared values generated by the below approach should guide your variable choice. For variable pairs that are highly colinear, remove one, preferencing the exclusion of variables that are colinear with multiple other measures (r squared value of > 0.7 is considered “high”). As noted above though, when the sampled environments are more similar, as is the case with this cane toad dataset, we will see more collinearity.

Note: given we are using so many environmental variables, some may be cut off on your plot. Either adjust the plot or progressively remove variables.

library(psych)
pairs.panels(envdata[,3:19], scale=T)

ScreenShot

You may end up with something like this (though note the strong relationship annual mean temp has too nearly all the environmental variables):

VIFpredictions <- envdata[c(3,6,8,14)]
str(VIFpredictions)
## 'data.frame':    121 obs. of  4 variables:
##  $ AnnualMeanTemp       : num  273 273 273 273 273 273 273 273 273 273 ...
##  $ TempSeasonality      : num  1851 1851 1851 1851 1851 ...
##  $ MinTempofColdestMonth: num  164 164 164 164 164 164 164 164 164 164 ...
##  $ AnnualPrecipitation  : num  1395 1395 1395 1395 1395 ...
pairs.panels(VIFpredictions, scale=T)

ScreenShot

Method three: Variance Inflation Factors to identify correlated environmental predictors

I would recommend using one of the above two methods to remove variables that are supper collinear, then use this last method to refine predictor choice. Simple use the vif.cca function to check variance inflation factors for each predictor, and sequentially remove the highest one until you have an appropriate number of predictors.

HOWEVER, this needs to be run after the RDA has been ordinated (notice how it calls ToadVars.rda). So jump back up here once you have finished off step 3.2.

vif.cca(ToadVars.rda) #this code will reappear below in step 3.2

We have now finished preparing the environmental data (the explanatory variables) for our RDA analysis. Depending on the method used, we either have the environmental dataset PCApredictions, or VIFpredictions.

Conducting the RDA

Step 3.1 Ordinating the data/do the RDA

library(vegan)
ToadVars.rda <- rda(ToadVars ~ AnnualMeanTemp + TempSeasonality + MinTempofColdestMonth + AnnualPrecipitation, data=VIFpredictions, scale=T)
ToadVars.rda
## Call: rda(formula = ToadVars ~ AnnualMeanTemp + TempSeasonality +
## MinTempofColdestMonth + AnnualPrecipitation, data =
## VIFpredictions, scale = T)
## 
##                Inertia Proportion Rank
## Total         13.00000    1.00000     
## Constrained    0.89205    0.06862    4
## Unconstrained 12.10795    0.93138   13
## Inertia is correlations 
## 
## Eigenvalues for constrained axes:
##   RDA1   RDA2   RDA3   RDA4 
## 0.5232 0.1982 0.1340 0.0366 
## 
## Eigenvalues for unconstrained axes:
##    PC1    PC2    PC3    PC4    PC5    PC6    PC7    PC8    PC9   PC10 
## 2.7133 1.9616 1.3353 1.1553 0.9783 0.9078 0.7673 0.7193 0.5474 0.3590 
##   PC11   PC12   PC13 
## 0.2879 0.2192 0.1561

Step 3.2 Check the RDA

We can now check how much variation is explained by out RDA.

RsquareAdj(ToadVars.rda)
## $r.squared
## [1] 0.06861938
## 
## $adj.r.squared
## [1] 0.03650281

3.6% is not large, but then consider that phenotype is the end result of a large number of factors (genetics, epigenetics etc…), of which environmental effects are only one! And our environmental data set does not even use all environment data.

screeplot(ToadVars.rda) #to determine how many RDA axes there are / are important if there are many

ScreenShot

signif.full <- anova.cca(ToadVars.rda, parallel=getOption("mc.cores"))
signif.full # test for model significance
## Permutation test for rda under reduced model
## Permutation: free
## Number of permutations: 999
## 
## Model: rda(formula = ToadVars ~ AnnualMeanTemp + TempSeasonality + MinTempofColdestMonth + AnnualPrecipitation, data = VIFpredictions, scale = T)
##           Df Variance      F Pr(>F)    
## Model      4   0.8921 2.1366  0.001 ***
## Residual 116  12.1079                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
vif.cca(ToadVars.rda) # HERE IT IS! Get variance inflation factors for each predictor (refer to step 2.2, method three) 
##        AnnualMeanTemp       TempSeasonality MinTempofColdestMonth 
##             46.660123              2.297345             17.553396 
##   AnnualPrecipitation 
##             19.625697

The last quality check repots back some high VIF values. This is not entirely unexpected as our VIFpredictions dataset had high collinearity for AnnualMeanTemp. If we were to redo this with PCApredictions we may have lower numbers, indicating it would be better to use the variables generated by the PCA.

Step 3.3 Visualise the RDA

We will proceed with graphing our RDA.

plot(ToadVars.rda, scaling=3)

ScreenShot

RDA may be interpreted similarly to PCA, with each ordination axis capturing some of the data variance. The blue arrows indicate the magnitude and direction of the environmental predictors. The arrangement of the circles (the toads), the red crosses (our phenotypic measurements) and the environmental predictors against the different axis informs us what is loaded on each of said axis.

Now we can format it to make it more intuitive.

location <- c((rep("NT",30)),(rep("QLD",57)),(rep("WA",34)))

VIFpredictions <- cbind(VIFpredictions,location) str(VIFpredictions)

## 'data.frame':    121 obs. of  5 variables:
##  $ AnnualMeanTemp       : num  273 273 273 273 273 273 273 273 273 273 ...
##  $ TempSeasonality      : num  1851 1851 1851 1851 1851 ...
##  $ MinTempofColdestMonth: num  164 164 164 164 164 164 164 164 164 164 ...
##  $ AnnualPrecipitation  : num  1395 1395 1395 1395 1395 ...
##  $ location             : Factor w/ 3 levels "NT","QLD","WA": 1 1 1 1 1 1 1 1 1 1 ...
levels(VIFpredictions$location) <- c("NT","QLD","WA")
eco <- VIFpredictions$location
bg <- c("#2ACBB3","#602ACB","#A2C61C")

plot(ToadVars.rda, type="n", scaling=3)
points(ToadVars.rda, display="species", pch=20, cex=0.7, col="gray32", scaling=3)           # the phenotypes
points(ToadVars.rda, display="sites", pch=21, cex=1.3, col="gray32", scaling=3, bg=bg[eco]) # the toads
text(ToadVars.rda, scaling=3, display="bp", col="#0868ac", cex=1)                           # the environmental predictors
legend("bottomright", legend=levels(eco), bty="n", col="gray32", pch=21, cex=1, pt.bg=bg)

ScreenShot

Trying to label the phenotypic variables

plot(ToadVars.rda, type="n", scaling=3, choices=c(1,3))
points(ToadVars.rda, display="species", pch=20, cex=0.7, col="gray32", scaling=3)           # the phenotypes
#text (ToadVars.rda, display="species", cex=0.65, pos=3,col="red")
points(ToadVars.rda, display="species", pch=21, cex=1.3, col="gray32", scaling=3) # the toads
points(ToadVars.rda, display="sites", pch=21, cex=1.3, col="gray32", scaling=3, bg=bg[eco]) # the toads
text(ToadVars.rda, scaling=3, display="bp", col="#0868ac", cex=1)                           # the environmental predictors
legend("bottomright", legend=levels(eco), bty="n", col="gray32", pch=21, cex=1, pt.bg=bg)

ScreenShot

load.rda <- scores(ToadVars.rda, choices=c(1:3), display="species")
outliers <- function(x,z){
  lims <- mean(x) + c(-1, 1) * z * sd(x)     # find loadings +/-z sd from mean loading     
  x[x < lims[1] | x > lims[2]]               # locus names in these tails
}

# check for normal distribution
hist(load.rda[,1], main="Loadings on RDA1")

ScreenShot

hist(load.rda[,2], main="Loadings on RDA2")

ScreenShot

hist(load.rda[,3], main="Loadings on RDA3")

ScreenShot

# get number of candidates > 1.5 SDs from mean loading
cand1 <- outliers(load.rda[,1],1.5) 
cand2 <- outliers(load.rda[,2],1.5) 
cand3 <- outliers(load.rda[,2],1.5)
cand1
## Leg.length..mm.  Recovery.score 
##      -0.7828865       0.3592628
cand2
## Distance.traveled.in.1.min..cm.                exhaustion.score 
##                       0.3529830                      -0.4115757
cand3
## Distance.traveled.in.1.min..cm.                exhaustion.score 
##                       0.3529830                      -0.4115757

About

Ecology and Multivariate Statistics Redundancy Analysis (RDA)

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages