Correlation and Transformation

  • Created by: rosieevie
  • Created on: 11-01-18 14:05

Correlation and Transformation

Sometimes we wish to seek a correlation without making predictions about how one variable is influenced by the other

  • Instead, we restrict ourselves to seeking an inter-dependency or association between 2 continuous variables

Regression and correlation do 2 different things

Regression makes predictions about the response y to factor x

Correlation measures the association/covariation of x and y when neither variable is identified as the response
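
A minimal R sketch of the difference, using hypothetical x and y vectors:

```r
# Hypothetical paired measurements of two continuous variables
x <- c(2.1, 3.4, 4.0, 5.2, 6.1, 7.3)
y <- c(1.8, 2.9, 3.5, 4.9, 5.4, 6.8)

summary(lm(y ~ x))   # regression: predicts the response y from the factor x
cor.test(x, y)       # correlation: association only; neither variable is the response
```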


Expanding Regression - Categorical Second Factor

You can expand regression from simple to multiple regression by introducing a second factor

Second factor may be categorical

Plot response variable against continuous factor

  • Calculate one regression line for each level of categorical factor

If regression lines are not horizontal = significant continuous factor

If regression lines do not coincide = significant categorical factor

If regression lines have different slopes = significant interaction effect

Similar interaction plots to two-way ANOVA:

  • x-axis represents a continuous variable
  • Lines joining sample means become regression lines for each level of categorical factor
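
A minimal R sketch of this kind of model (analysis of covariance style), assuming a hypothetical data frame dat with response y, continuous factor x and categorical factor group:

```r
# Hypothetical data: response y, continuous factor x, categorical factor group
dat <- data.frame(
  y     = c(2.0, 2.6, 3.1, 3.9, 4.8, 3.5, 4.4, 5.2, 6.1, 7.0),
  x     = rep(1:5, times = 2),
  group = factor(rep(c("A", "B"), each = 5))
)

# One regression line per level of the categorical factor;
# the x:group term tests whether the slopes differ (the interaction)
model <- lm(y ~ x * group, data = dat)
anova(model)   # rows for x (continuous), group (categorical) and x:group

# One fitted line per group, as in an interaction plot
plot(dat$x, dat$y, pch = as.numeric(dat$group))
for (g in levels(dat$group)) {
  abline(lm(y ~ x, data = dat[dat$group == g, ]))
}
```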

Expanding Regression - Continuous Second Factor

Need to illustrate data in a 3D graph

  • Response on y-axis
  • Two continuous factors on orthogonal horizontal axes
  • Best-fit model will be a plane through the data as opposed to lines through data

With these more complicated models, ANOVA should have a balanced design = same number of observations recorded at each combination of factor levels

  • Design becomes unbalanced by missing data or by using correlated explanatory factors - non-orthogonal

e.g. if variation in body height is modelled against right-leg length and against left-leg length

  • Second entered explanatory variable will appear to have no power to explain height
  • First entered explanatory variable will appear highly significant
  • Problem is that two variables are correlated with each other = unbalanced design
  • Variables are not orthogonal to each other
  • Little variation left over for 2nd variable due to variation explained by 1st entered factor
  • Better analysed with a one-factor regression on a single explanatory variable - leg length
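
A rough R sketch of the leg-length problem, using simulated (hypothetical) data:

```r
# Simulated illustration: right- and left-leg lengths are almost perfectly
# correlated, so the design is non-orthogonal
set.seed(1)
right_leg <- rnorm(30, mean = 90, sd = 5)
left_leg  <- right_leg + rnorm(30, sd = 0.5)           # nearly identical to right_leg
height    <- 100 + 0.9 * right_leg + rnorm(30, sd = 2)

# Sequential sums of squares: the first-entered variable soaks up almost all of
# the explainable variation, leaving little for the second
anova(lm(height ~ right_leg + left_leg))
anova(lm(height ~ left_leg + right_leg))   # order reversed: the verdicts swap

# Better: a one-factor regression on a single leg-length variable
summary(lm(height ~ right_leg))
```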

Incorrect Regression

When performing a regression, know which factor is explanatory and which is response

Response factor must be on y-axis

Slope and intercept will be different if you swap the x and y factors but the test statistics will remain the same

  • Rationale for correlation - it tells us the relationship between two variables regardless of which one affects which
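
A quick R sketch with hypothetical data, showing that swapping the axes changes the fitted line but not the statistics:

```r
# Hypothetical data
x <- c(1.2, 2.3, 3.1, 4.8, 5.5, 6.9, 8.0)
y <- c(2.0, 2.8, 4.1, 5.0, 6.2, 7.1, 8.4)

coef(lm(y ~ x))   # one slope and intercept
coef(lm(x ~ y))   # a different slope and intercept after the swap

summary(lm(y ~ x))$r.squared   # identical R-squared either way round...
summary(lm(x ~ y))$r.squared
cor(x, y)^2                    # ...and equal to the squared correlation
```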

Correlation

Correlation determines association, regression makes predictions

Same graph for correlation and regression - but don't put a regression line through it for correlation

  • Don't know which one of the variables is the true predictor

Two types of correlation coefficient:

  • Pearson product-moment correlation coefficient, r
    • r = ±√(explained variation / total variation)
    • Sign indicates the direction of covariation
  • Non-parametric Spearman correlation coefficient, rs
    • Calculated on ranked data
    • Less power

Magnitude and sign tell of the linearity and direction of the correlation - give the value of r with its sign

  • High positive number = positive correlation 
  • High negative number = negative correlation
  • Low numbers = no correlation
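
Both coefficients in R, on hypothetical paired observations:

```r
# Hypothetical paired observations
x <- c(3, 5, 7, 9, 12, 15, 18)
y <- c(2, 6, 5, 10, 11, 14, 20)

cor.test(x, y, method = "pearson")    # parametric r, with its sign and p-value
cor.test(x, y, method = "spearman")   # non-parametric rs, calculated on ranks
```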

Correlation 2

Non-parametric Spearman's rank is equivalent but done on ranked data

  • Less efficient than the parametric r - the ranks take the same values on the y-axis regardless of the actual measurements
  • Still assumes linearity, normality and homogeneity of residuals from ranks

Correlation coefficient assumes that the two variables have a linear relation to each other

  • Perfect curved relation = r < 1
  • Therefore, need to transform one/both axes in order to linearize the relation (see the sketch below)
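
A small R sketch of the curvature point, using a hypothetical exponential relation:

```r
# A perfect but curved (exponential) relation
x <- 1:20
y <- exp(0.3 * x)

cor(x, y)        # Pearson r < 1 despite a perfect relation, because it is curved
cor(x, log(y))   # log-transforming the y-axis linearizes the relation, so r = 1
```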

Assumptions and Transformations

Transformations help meet assumptions of statistical analyses 

  • Many types of transformations but none help with missing data or many 0s in data

Assumptions of Statistical Analyses:

  • Random sampling - all
    • Design consideration - transformation cannot help
    • Not met = data must be resampled
  • Independence - all
    • Design consideration - transformation cannot help
    • Not met = resample data or factor out with new explanatory factor
  • Homogeneity of variances - ANOVA, regression and correlation
    • Violated by observations that cannot take negative values (e.g. length), as these are likely to have variance that increases with the mean
    • Transforming response y can help
    • e.g. Log(y) as response (see the sketch below)
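
A rough R sketch of the log transformation helping variance homogeneity, using simulated (hypothetical) length data:

```r
# Simulated length data whose spread increases with the mean (lengths cannot be negative)
set.seed(2)
group <- factor(rep(c("low", "mid", "high"), each = 10))
mu    <- rep(c(5, 20, 80), each = 10)
len   <- rlnorm(30, meanlog = log(mu), sdlog = 0.3)

tapply(len, group, var)        # variance rises steeply with the mean on the raw scale
tapply(log(len), group, var)   # far more homogeneous after log-transforming the response

anova(lm(log(len) ~ group))    # ANOVA on the transformed response
```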

Assumptions and Transformations 2

  • Normality - ANOVA, regression and correlation
    • About normality of residuals
    • Problem for responses constrained between limits, e.g. proportions that can't go below 0 or above 1, and for counts
    • Transform response y
    • e.g. counts use √y, % use arcsine-root transformation
  • Linearity - regression and correlation
    • Problem for relationships with different dimensions (e.g. weight to length)
    • Comparing 3 dimensions to 1 dimension leads to a cubic relationship, not linear
    • Transforming y and/or x can help
    • e.g. Log(y), Log(x), Sin(x), 1/x

If assumptions 3-5 not met - don't abandon use of parametric stats

  • Command 'glm' runs a Generalized Linear Model - accommodates ANOVA on data with inherently non-normal distributions
  • e.g. proportions (binomial distribution), frequencies of rare events (Poisson distribution) and variance increasing with mean (see the sketch below)
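
A minimal sketch of R's glm command, using hypothetical count data:

```r
# Hypothetical counts of a rare event under two treatments
counts    <- c(0, 1, 2, 1, 3, 4, 6, 5, 7, 9)
treatment <- factor(rep(c("control", "enriched"), each = 5))

# Generalized linear model with Poisson errors, instead of transforming the counts
m <- glm(counts ~ treatment, family = poisson)
summary(m)
anova(m, test = "Chisq")   # analysis-of-deviance table
```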

Transformations

An alternative route to meeting assumptions is transformation

Less desirable than modelling the error structure with glm - transformation changes the nature of the test question

Understanding underlying biology suggests transformations 

  • Not cheating - planned in advance and same conversion applied to all observations

The idea is to reduce complexity by converting a non-linear relation to a linear one

Data requiring transformations:

  • Response exponential = natural logs
  • Response and predictor have different dimensions = logging both axes (see the sketch below)
  • Response saturates = take the inverse of one or both axes
  • Response cyclic = circular function e.g. sin, cos
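
A small R sketch of the different-dimensions case, using hypothetical length and weight values:

```r
# Hypothetical weight (3 dimensions) against length (1 dimension): a cubic,
# not linear, relation on the raw scale
length_x <- seq(1, 10, by = 0.5)
weight_y <- 0.02 * length_x^3

cor(length_x, weight_y)             # r < 1: the relation is curved
cor(log(length_x), log(weight_y))   # logging both axes linearizes it: r = 1

# Slope of the log-log regression recovers the dimensional exponent (~3)
coef(lm(log(weight_y) ~ log(length_x)))
```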

Transforming a Non-Linear Relation

EXAMPLE - population per capita growth rate originally shows a dramatic decline with population size

Plotting against ln(population size) draws the data out from close to the y-axis = linear relation
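
A rough R sketch of this example, using hypothetical population figures:

```r
# Hypothetical per-capita growth rates that decline steeply at small population sizes
pop_size    <- c(2, 5, 10, 20, 50, 100, 200, 500, 1000)
growth_rate <- 1.5 - 0.2 * log(pop_size)   # built so the log relation is exact

plot(pop_size, growth_rate)        # raw plot: points bunched close to the y-axis
plot(log(pop_size), growth_rate)   # plotting against ln(pop size) spreads them into a line

summary(lm(growth_rate ~ log(pop_size)))
```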


Transforming a Cyclic Relation

If data are cyclical in the changes of the response to a continuous variable, think of circular statistics

Use sin(x) to linearize it
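
A small R sketch with a hypothetical daily cycle:

```r
# Hypothetical response that cycles over the day, peaking at midday
hour     <- seq(0, 24, by = 2)
response <- 10 + 4 * sin(2 * pi * (hour - 6) / 24)

cor(hour, response)   # essentially zero: a straight line misses the cyclic pattern

# Transforming the x-axis with the matching sine term linearizes the relation
angle <- sin(2 * pi * (hour - 6) / 24)
cor(angle, response)          # now r = 1
summary(lm(response ~ angle))
```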


Multiple Zeros in Data

Always plot data first to see any oddities in it

Data may be skewed by multiple 0s

If the 0s don't say anything about the relation, then you could remove them from the data
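
A rough R sketch with hypothetical counts, plotting first and then dropping the uninformative zeros:

```r
# Hypothetical data with many zero counts that mask the underlying relation
dat <- data.frame(
  x     = 1:10,
  count = c(0, 0, 0, 0, 3, 5, 0, 8, 10, 13)
)

plot(dat$x, dat$count)   # always plot first to spot the pile-up of zeros

# If the zeros say nothing about the relation, analyse the non-zero subset
nonzero <- subset(dat, count > 0)
summary(lm(count ~ x, data = nonzero))
```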


Steps in Analysing Data

Multiple steps that precede the observation stage:

  • Planning experimental design
    • Response?
    • How many factors?
    • How many samples?
    • How many replicates?
    • What statistical analysis?
  • Collecting data
    • What collection schedule to follow?
    • How to control for extraneous variation?
      • If you can't control it, factor it in

Always plan data collection and stats before collecting data

