Data-Based Models, How to Analyse Data and Which Test to Use

Data-Based Models

Statistical packages like R work by fitting models to data

They require you to specify an appropriate model for the samples and variables under investigation before they estimate the parameter values that best fit the data

Standard convention for presenting statistical models - response variable(s) = explanatory variable(s)

= sign is a statement of the hypothesised relationship between variables

Chosen statistic quantifies the relationship of response variable to explanatory variables

3 main types of data:

One variable, one sample - chi-squared, G-test, Kolmogorov-Smirnov
Two variables, one sample
- Categorical responses (contingency tables) - chi-squared, G-test for independence
- Continuous response and predictor - linear regression/correlation
One or more predictors, two or more samples - ANOVA or GLM

One Variable, One Sample

Look for goodness-of-fit frequencies (observed compared to expected)

Chi-squared or G-test of goodness of fit
For continuous data, use Kolmogorov-Smirnov

Assumptions:

Data are nominal (not continuous)
Frequencies are independent from each other
No cell has expected values <5
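
As a worked illustration of comparing observed with expected frequencies, here is a minimal Python sketch (standard library only, not R). The function name chi_squared_gof and the counts are made up for the example, and the 3:1 expected ratio is an assumption.

```python
def chi_squared_gof(observed, expected):
    """Chi-squared goodness-of-fit statistic: sum of (O - E)^2 / E."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected need the same number of classes")
    # Mirror the assumption on the card: no expected frequency below 5.
    if any(e < 5 for e in expected):
        raise ValueError("an expected frequency is below 5")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Hypothetical data: 100 individuals, expected in a 3:1 ratio.
observed = [84, 16]
expected = [75.0, 25.0]
statistic = chi_squared_gof(observed, expected)
print(round(statistic, 2))  # 81/75 + 81/25 = 4.32
```

The statistic is then compared with the chi-squared distribution for the appropriate degrees of freedom.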

Two Variables, One Sample - Categorical Responses

For data of this kind, look for a dependent relationship between variables

Contingency tables used to look for interaction between variables

Chi-squared or G-test
For cells with expected values <5, use Fisher's exact test

Model formula: colour:behaviour ~ response

Assumptions:

Categorical data
Frequencies independent
No cell with expected values <5 (otherwise use Fisher's exact test)
Correction for continuity applied (for 2 x 2 tables)
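
A contingency-table test can be sketched as follows: expected cell frequencies come from the row and column totals, and the statistic sums (O - E)^2 / E over all cells. This is a plain-Python illustration under stated assumptions: the function name and the 2 x 2 counts are hypothetical, and no correction for continuity is applied.

```python
def contingency_chi_squared(table):
    """Return (chi-squared statistic, degrees of freedom) for a table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if the two variables are independent.
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected
    df = (len(col_totals) - 1) * (len(row_totals) - 1)
    return statistic, df


# Hypothetical counts: rows = colour, columns = behaviour.
table = [[30, 10],
         [20, 40]]
chi2, df = contingency_chi_squared(table)
print(round(chi2, 2), df)
```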

Two Variables, One Sample - Continuous Response and Predictor

Plot response variable on y-axis and explanatory variable on x-axis

Linear regression should be used

If no clear functional relationship, use correlation to calculate r

Model formula: Response ~ Explanatory

Assumptions:

Random sampling
Independent errors
Homogeneity of variances
Normal distribution of errors
Linearity

If variance increases with the response, the assumptions of homogeneous variances and linearity are not met and the data must be transformed
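
To make the Response ~ Explanatory model concrete, the sketch below fits a least-squares line and computes the correlation coefficient r by hand in Python (standard library only). The function name fit_line and the data are invented for illustration; in practice a statistical package would do this fitting.

```python
from statistics import mean


def fit_line(x, y):
    """Least-squares slope and intercept, plus Pearson's r, for paired data."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    r = sxy / (sxx * syy) ** 0.5
    return slope, intercept, r


# Hypothetical data: explanatory variable x, response y.
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
slope, intercept, r = fit_line(x, y)
print(round(slope, 2), round(intercept, 2), round(r, 3))
```

Plotting the residuals from such a fit against x is one way to check the linearity and variance assumptions listed above.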

One-Way Classification of Two+ Samples - 1 Categorical Predictor

Look for a difference between sample means

With one categorical predictor:

t-test for two groups
ANOVA for more than two groups
Repeated measures ANOVA for repeated measures on subjects
Transform data that violate assumptions
Kruskal-Wallis for non-parametric ANOVA
Mann-Whitney for non-parametric t-test

Assumptions:

Random sampling
Independent errors
Homogeneity of variances
Normal distribution of errors

Model: Response ~ Explanatory
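
The comparison of sample means can be illustrated with a hand-computed one-way ANOVA. This is a minimal Python sketch, assuming made-up data and a hypothetical function name; it returns the F statistic with its test and error degrees of freedom rather than a p-value.

```python
from statistics import mean


def one_way_anova(groups):
    """Return (F, test d.f., error d.f.) for a list of samples."""
    a = len(groups)                      # number of sample means
    n = sum(len(g) for g in groups)      # number of observations
    grand_mean = mean([v for g in groups for v in g])
    # Variation among sample means, weighted by sample size.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Residual variation of observations around their own sample mean.
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    df_test, df_error = a - 1, n - a
    f = (ss_between / df_test) / (ss_within / df_error)
    return f, df_test, df_error


# Hypothetical data: three samples of three observations.
groups = [[4, 5, 6], [7, 8, 9], [10, 11, 12]]
f, df_test, df_error = one_way_anova(groups)
print(f, df_test, df_error)
```

With only two groups the same calculation gives F equal to the square of the equal-variance t statistic, which is why the t-test is the two-group case.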

Selecting and Fitting Models to Data

R offers alternative commands for ANOVA

aov suits most straightforward analyses with normally distributed residuals
glm = Generalized Linear Model - accommodates ANOVA on data with inherently non-normal distributions, e.g. proportions (binomial) or frequencies of rare events (Poisson)

One-Way Classification of Two+ Samples - 2 Continu

Look for differences between regression slopes

Use ANOVA together with regression analysis to test whether slopes differ

Model formula: Response ~ Explanatory 1 + Explanatory 2 + Explanatory 1:Explanatory 2

Assumptions:

Random sampling
Independent errors
Homogeneity of variances
Normal distribution of errors
Linearity

If the regression plot shows the two lines crossing over, there is an interaction between the variables

Two-way Classification of Samples

Look for two-way differences between means

ANOVA or GLM (in non-normal error structures) should be used

Model formula: Response ~ Explanatory 1 + Explanatory 2 + Explanatory 1:Explanatory 2

Assumptions:

Random sampling
Independent errors
Homogeneity of variances
Normal distribution of errors

If the data are unbalanced (samples contain different numbers of observations), use a GLM

Calculating Degrees of Freedom - Chi-squared

Method depends entirely on test statistic

d.f. = no. pieces of information available - no. required to calculate variation

Chi-squared test:

Theoretical distributions = n - 2 (usually)
- n = no. categories for explanatory variable
- 2 = no. pieces of information needed to calculate expected distribution (may differ from 2)
Contingency table = (c - 1) x (r - 1)
- c = no. columns
- r = no. rows
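
The two chi-squared rules above reduce to simple arithmetic, sketched here in Python. The function and parameter names are hypothetical; n_information_needed defaults to 2 only because the card gives that as the usual value.

```python
def df_theoretical(n_categories, n_information_needed=2):
    """d.f. for a chi-squared test against a theoretical distribution."""
    return n_categories - n_information_needed


def df_contingency(n_rows, n_columns):
    """d.f. for a contingency table: (c - 1) x (r - 1)."""
    return (n_columns - 1) * (n_rows - 1)


print(df_theoretical(6))      # 6 - 2 = 4
print(df_contingency(3, 4))   # (4 - 1) x (3 - 1) = 6
```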

Calculating Degrees of Freedom - ANOVA/Linear Regression

ANOVA:

Test = a - 1
- a = no. sample means
Error = n - a
- n = no. observations
- a = no. sample means

Linear regression:

Test = 1
- 2 (slope and intercept) - 1 (grand mean)
Error = n - 2
- n = sample size
- 2 = slope and intercept
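
A quick way to check these rules is that test and error d.f. always add up to n - 1, since one degree of freedom is used by the grand mean. The Python sketch below shows this for a made-up design; the function names are hypothetical.

```python
def anova_df(n_observations, n_sample_means):
    """Return (test d.f., error d.f.) for a one-way ANOVA."""
    return n_sample_means - 1, n_observations - n_sample_means


def regression_df(sample_size):
    """Return (test d.f., error d.f.) for a simple linear regression."""
    return 1, sample_size - 2


# Hypothetical design: 15 observations in 3 samples.
n = 15
anova_test, anova_error = anova_df(n, 3)   # (2, 12)
reg_test, reg_error = regression_df(n)     # (1, 13)

# In both cases the two parts sum to n - 1.
print(anova_test + anova_error, reg_test + reg_error)
```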

Experimental Theory

Define test hypothesis
Identify model components
- Response
- Explanatory factor and levels
- Sampling unit
- Population samples
Define model
Degrees of freedom
Collect data
Input to R
Run model and check assumptions

Meeting Model Assumptions

Always plot data first to check it meets model assumptions

Significance tells nothing about size or precision of effect

For all analyses:

Significance (p-value) - identifies evidence of pattern
Effect size (difference between sample means/regression slope) - gives magnitude
Error bars/coefficient of determination (r^2) - gives precision
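
To show that effect size and precision are separate from significance, this minimal Python sketch reports the difference between two sample means and the standard error of that difference. The data and function name are made up, and the standard error here uses the unpooled formula sqrt(s1^2/n1 + s2^2/n2), which is one common choice rather than the only one.

```python
from math import sqrt
from statistics import mean, variance


def effect_and_precision(sample_a, sample_b):
    """Return (difference between means, standard error of the difference)."""
    effect = mean(sample_b) - mean(sample_a)
    se = sqrt(variance(sample_a) / len(sample_a)
              + variance(sample_b) / len(sample_b))
    return effect, se


# Hypothetical samples.
sample_a = [10, 12, 14]
sample_b = [15, 17, 19]
effect, se = effect_and_precision(sample_a, sample_b)
print(effect, round(se, 3))
```

The standard error is what error bars on a plot of the two means would be built from.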

Shape of pattern depends on parameters

Theoretical mathematical models - used to work out how to transform data

Use biology of species to help understanding

Once collected, data suit only one model, even though R can run any model on the data

Each model produces a unique set of results pertinent to particular design
Only one model will represent experiment design - must know what it is before collecting data

Which Test to Use?

Seek difference between averages of 2+ samples

Parametric ANOVA
Parametric t-test for two samples
Non-parametric Kruskal-Wallis
Non-parametric Mann-Whitney U for two samples

Identify trends between two continuous variables in 1+ samples

Parametric regression
Polynomial regression on non-linear data
Parametric Pearson product-moment correlation when no functional (regression) relationship is sought
Non-parametric Spearman's rank for correlation

Identify a relation between frequencies in categorical classes of one sample

Chi-squared/G-test on frequencies
If any expected frequencies are <5, pool classes or use Fisher's exact test
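
The choices on this card can be written as a small lookup, sketched below in Python. The function name, argument names and goal labels are all hypothetical; the sketch only restates the card and is not a substitute for checking assumptions.

```python
def choose_test(goal, parametric=True, two_samples=False,
                functional_relationship=True, small_expected=False):
    """Suggest a test following the card's three goals."""
    if goal == "difference":          # difference between sample averages
        if parametric:
            return "t-test" if two_samples else "ANOVA"
        return "Mann-Whitney U" if two_samples else "Kruskal-Wallis"
    if goal == "trend":               # trend between two continuous variables
        if not parametric:
            return "Spearman's rank correlation"
        if functional_relationship:
            return "regression"
        return "Pearson product-moment correlation"
    if goal == "frequency":           # frequencies in categorical classes
        if small_expected:
            return "pool classes or Fisher's exact test"
        return "chi-squared or G-test"
    raise ValueError(f"unknown goal: {goal!r}")


print(choose_test("difference", parametric=False, two_samples=True))
print(choose_test("trend", functional_relationship=False))
```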
