Data-Based Models, How to Analyse Data and Which Test to Use

  • Created by: rosieevie
  • Created on: 11-01-18 15:27

Data-Based Models

Statistical packages like R work by fitting models to data

  • Require you to use an appropriate model for samples and variables under investigation before they estimate parameter values that best fit data

Standard convention for presenting statistical models - response variable(s) = explanatory variable(s)

  • = sign is a statement of the hypothesised relationship between variables

Chosen statistic quantifies the relationship of response variable to explanatory variables

3 main types of data:

  • One variable, one sample - chi-squared, G-test, Kolmogorov-Smirnov
  • Two variables, one sample
    • Categorical responses (contingency tables) - chi-squared, G-test for independence
    • Continuous response and predictor - linear regression/correlation
  • One or more predictors, two or more samples - ANOVA or GLM

One Variable, One Sample

Look for goodness-of-fit frequencies (observed compared to expected)

  • Chi-squared or G-test of goodness of fit
  • For continuous data, use Kolmogorov-Smirnov

Assumptions:

  • Data are nominal (not continuous)
  • Frequencies are independent from each other
  • No cell has expected values <5
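A goodness-of-fit comparison of observed and expected frequencies can be sketched by hand. This is a minimal illustration in plain Python rather than R; the function name and the counts are hypothetical, and the check on expected values simply mirrors the assumption listed above.

```python
def chi_squared_gof(observed, expected):
    """Return the chi-squared goodness-of-fit statistic: sum of (O - E)^2 / E."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    if any(e < 5 for e in expected):
        # Assumption from the notes: no cell has an expected value < 5
        raise ValueError("an expected frequency is below 5")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Hypothetical data: 100 individuals compared against a 3:1 expectation
observed = [70, 30]
expected = [75.0, 25.0]
gof_statistic = chi_squared_gof(observed, expected)
print(round(gof_statistic, 3))
```

The statistic would then be compared with the chi-squared distribution at the appropriate degrees of freedom.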

Two Variables, One Sample - Categorical Responses

For data of this kind, look for a dependent relationship between variables

Contingency tables used to look for interaction between variables

  • Chi-squared or G-test
  • For cells with expected values <5, use Fisher's exact test

Model formula: colour:behaviour ~ response

Assumptions:

  • Categorical data
  • Frequencies independent 
  • No cell with expected values <5 (if not Fisher's exact test)
  • Correction for continuity
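As a rough sketch of how a contingency table is tested for independence, the expected count in each cell is (row total x column total) / grand total, and the statistic sums (O - E)^2 / E over all cells. The example below is plain Python, not R; the function name and the 2 x 2 counts are hypothetical, and no continuity correction is applied.

```python
def chi_squared_independence(table):
    """Return (statistic, expected table, smallest expected value) for a contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    # Expected count under independence: row total * column total / grand total
    expected = [[r * c / grand_total for c in col_totals] for r in row_totals]
    statistic = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )
    smallest = min(min(row) for row in expected)
    return statistic, expected, smallest


# Hypothetical counts: rows = two colours, columns = two behaviours
table = [[20, 10],
         [10, 20]]
ct_statistic, ct_expected, smallest_expected = chi_squared_independence(table)
print(round(ct_statistic, 3), smallest_expected)
```

If the smallest expected value came out below 5, the notes above point to Fisher's exact test instead.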

Two Variables, One Sample - Continuous Response and Predictor

Plot response variable on y-axis and explanatory variable on x-axis

Linear regression should be used

If no clear functional relationship, use correlation to calculate r

Model formula: Response ~ Explanatory

Assumptions:

  • Random sampling
  • Independent errors
  • Homogeneity of variances
  • Normal distribution of errors
  • Linearity

If variance increases with the response, homogeneity of variances is violated and the data must be transformed
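To make the regression and correlation calculations concrete, here is a minimal least-squares sketch in plain Python (not R). The function name and the data are hypothetical; it returns the slope and intercept of the fitted line together with the correlation coefficient r.

```python
import math


def fit_line(x, y):
    """Least-squares fit of y on x; returns (slope, intercept, r)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)  # sum of squares of x
    syy = sum((yi - mean_y) ** 2 for yi in y)  # sum of squares of y
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / math.sqrt(sxx * syy)  # Pearson correlation coefficient
    return slope, intercept, r


# Hypothetical data: explanatory variable x, response variable y
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
slope, intercept, r = fit_line(x, y)
print(round(slope, 3), round(intercept, 3), round(r, 3))
```

Plotting the residuals from such a fit against x is one common way to check the linearity and variance assumptions.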

One-Way Classification of Two+ Samples - 1 Categorical Predictor

Look for a difference between sample means

With one categorical predictor:

  • t-test for two groups
  • ANOVA for more than two groups
  • Repeated measures ANOVA for repeated measures on subjects
  • Transform data that violate assumptions
  • Kruskal Wallis for non-parametric ANOVA
  • Mann-Whitney for non-parametric t-test

Assumptions:

  • Random sampling
  • Independent errors
  • Homogeneity of variances
  • Normal distribution of errors

Model: Response ~ Explanatory
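A one-way ANOVA compares variation between sample means with variation within samples. The sketch below is plain Python (not R's aov) with a hypothetical function name and made-up data; it computes the F ratio from the between-group and within-group sums of squares.

```python
def one_way_anova(groups):
    """Return (F, df_test, df_error) for a list of samples."""
    a = len(groups)                   # number of sample means
    n = sum(len(g) for g in groups)   # total number of observations
    grand_mean = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]
    # Variation of sample means around the grand mean
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Variation of observations around their own sample mean
    ss_within = sum((v - m) ** 2 for g, m in zip(groups, means) for v in g)
    df_test = a - 1
    df_error = n - a
    f_ratio = (ss_between / df_test) / (ss_within / df_error)
    return f_ratio, df_test, df_error


# Hypothetical data: three samples of three observations each
groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
f_ratio, df_test, df_error = one_way_anova(groups)
print(round(f_ratio, 3), df_test, df_error)
```

With only two groups this F ratio corresponds to the square of the equal-variance t statistic, which is why the t-test is listed as the two-group case.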

Selecting and Fitting Models to Data

R offers alternative commands for ANOVA

  • aov suits more straightforward analyses with normally distributed residuals
  • glm = generalised linear model - accommodates ANOVA on data with inherently non-normal distributions e.g. proportions (binomial) or frequencies of rare events (Poisson)

One-Way Classification of Two+ Samples - 2 Continu

Look for differences between regression slopes

Use ANOVA combined with regression analysis to test whether slopes differ

Model formula: Response ~ Explanatory 1 + Explanatory 2 + Explanatory 1:Explanatory 2

Assumptions:

  • Random sampling
  • Independent errors
  • Homogeneity of variances
  • Normal distribution of errors
  • Linearity

If regression plot shows two lines cross over = interaction between variables
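One informal way to see an interaction is to fit a separate regression line to each sample and compare the slopes. The following plain-Python sketch uses a hypothetical helper and made-up data; it is only a visual-style check, and the formal test is the interaction term in the model formula above.

```python
def slope_of(x, y):
    """Least-squares slope of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    return sxy / sxx


# Hypothetical data: the same x values measured in two samples
x = [1, 2, 3]
y_sample_a = [1, 2, 3]   # response rises with x
y_sample_b = [3, 2, 1]   # response falls with x

slope_a = slope_of(x, y_sample_a)
slope_b = slope_of(x, y_sample_b)
# Slopes of opposite sign mean the fitted lines cross, suggesting an interaction
lines_cross = slope_a * slope_b < 0
print(slope_a, slope_b, lines_cross)
```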

Two-way Classification of Samples

Look for two-way differences between means

ANOVA or GLM (in non-normal error structures) should be used

Model formula: Response ~ Explanatory 1 + Explanatory 2 + Explanatory 1:Explanatory 2

Assumptions:

  • Random sampling
  • Independent errors
  • Homogeneity of variances
  • Normal distribution of errors

If data are unbalanced (samples contain different numbers of observations) use a GLM

Calculating Degrees of Freedom - Chi-squared

Method depends entirely on test statistic

d.f. = no. pieces of information available - no. required to calculate variation

Chi-squared test:

  • Theoretical distributions = n - 2 (usually)
    • n = no. categories for explanatory variable
    • 2, or however many pieces of information are needed to calculate the expected distribution
  • Contingency table = (c - 1) x (r - 1)
    • c = no. columns
    • r = no. rows
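The two chi-squared rules above can be written as small helper functions. This is a sketch in plain Python with hypothetical function names; the default of 2 for the goodness-of-fit case follows the "usually" in the notes and should be changed when a different number of pieces of information is needed.

```python
def df_goodness_of_fit(n_categories, n_required=2):
    """d.f. for a chi-squared test against a theoretical distribution."""
    return n_categories - n_required


def df_contingency(n_columns, n_rows):
    """d.f. for a chi-squared test on a contingency table: (c - 1) x (r - 1)."""
    return (n_columns - 1) * (n_rows - 1)


# Hypothetical examples
print(df_goodness_of_fit(6))   # 6 categories, 2 pieces of information required
print(df_contingency(4, 3))    # table with 4 columns and 3 rows
```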

Calculating Degrees of Freedom - ANOVA/Linear Regression

ANOVA:

  • Test = a - 1
    • a = no. sample means
  • Error = n - a
    • n = no. observations
    • a = no. sample means

Linear regression:

  • Test = 1
    • 2 (slope and intercept) - 1 (grand mean)
  • Error = n - 2
    • n = sample size
    • 2 = slope and intercept
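The ANOVA and regression rules above translate directly into code. A minimal sketch in plain Python, with hypothetical function names, that returns the test and error degrees of freedom as a pair:

```python
def df_anova(n_observations, n_sample_means):
    """Return (test d.f., error d.f.) for a one-way ANOVA."""
    return n_sample_means - 1, n_observations - n_sample_means


def df_regression(sample_size):
    """Return (test d.f., error d.f.) for a simple linear regression."""
    # Test: 2 (slope and intercept) - 1 (grand mean) = 1
    return 2 - 1, sample_size - 2


# Hypothetical examples
print(df_anova(n_observations=30, n_sample_means=3))
print(df_regression(sample_size=20))
```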

Experimental Theory

  • Define test hypothesis
  • Identify model components
    • Response
    • Explanatory factor and levels
    • Sampling unit
    • Population samples
  • Define model
  • Degrees of freedom
  • Collect data
  • Input to R
  • Run model and check assumptions

Meeting Model Assumptions

Always plot data first to check it meets model assumptions

Significance tells nothing about size or precision of effect

For all analyses:

  • Significance (p-value) - identifies evidence of pattern
  • Effect size (difference between sample means/regression slope) - gives magnitude
  • Error bars/coefficient of determination (r²) - gives precision

Shape of pattern depends on parameters

Theoretical mathematical models - used to work out how to transform data

  • Use biology of species to help understanding

Once collected, data suit only one model, even though R will run any model on them

  • Each model produces a unique set of results pertinent to particular design
  • Only one model will represent experiment design - must know what it is before collecting data

Which Test to Use?

Seek difference between averages of 2+ samples

  • Parametric ANOVA
  • Parametric t-test for two samples
  • Non-parametric Kruskal-Wallis
  • Non-parametric Mann-Whitney U for two samples

Identify trends between two continuous variables in 1+ samples

  • Parametric regression
  • Polynomial regression on non-linear data
  • Parametric Pearson product-moment correlation when no functional (regression) relationship is being tested
  • Non-parametric Spearman's rank for correlation

Identify a relation between frequencies in categorical classes of one sample

  • Chi-square/G-test on frequencies
  • If any expected frequencies are <5, pool classes or use Fisher's exact test
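The choices on this card can be summarised as a simple lookup. The sketch below is plain Python with a hypothetical function name and argument names; it only returns the test names given in these notes and does not run any analysis.

```python
def suggest_test(goal, parametric=True, two_samples=False, small_expected=False):
    """Suggest a test name. goal is 'difference', 'trend', 'correlation' or 'frequencies'."""
    if goal == "difference":
        # Difference between averages of 2+ samples
        if parametric:
            return "t-test" if two_samples else "ANOVA"
        return "Mann-Whitney U" if two_samples else "Kruskal-Wallis"
    if goal == "trend":
        # Functional relationship between two continuous variables
        return "regression"
    if goal == "correlation":
        return "Pearson product-moment correlation" if parametric else "Spearman's rank"
    if goal == "frequencies":
        if small_expected:
            return "pool classes or Fisher's exact test"
        return "chi-squared or G-test"
    raise ValueError("unknown goal: " + goal)


print(suggest_test("difference", parametric=False))
print(suggest_test("frequencies", small_expected=True))
```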