Correlation and Transformation
- Created by: rosieevie
- Created on: 11-01-18 14:05
Sometimes we wish to test for correlation without making predictions about how one variable is influenced by the other
- Instead we restrict ourselves to seeking an inter-dependency or association between 2 continuous variables
Regression and correlation do 2 different things
Regression makes predictions about response y to factor x
Correlation measures the association/covariation of x and y when neither variable is identified as the response
Expanding Regression - Categorical Second Factor
You can expand regression from simple to multiple regression by introducing a second factor
Second factor may be categorical
Plot the response variable against the continuous factor
- Calculate one regression line for each level of the categorical factor
If regression lines are not horizontal = significant continuous factor
If regression lines do not coincide = significant categorical factor
If regression lines have different slopes = significant interaction effect
Similar interaction plots to two-way ANOVA:
- x-axis represents a continuous variable
- Lines joining sample means become regression lines for each level of categorical factor
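The idea above can be sketched in a few lines of Python: fit one least-squares line per level of the categorical factor and compare the slopes. The data and level names ("A", "B") are invented for illustration.

```python
def fit_line(x, y):
    # (slope, intercept) of the least-squares regression of y on x
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / sxx
    return slope, my - slope * mx

# invented data: same continuous factor, two levels of a categorical factor
data = {
    "A": ([1, 2, 3, 4], [2.0, 4.1, 5.9, 8.0]),   # steeper line
    "B": ([1, 2, 3, 4], [1.0, 1.9, 3.1, 4.0]),   # shallower line
}
lines = {level: fit_line(x, y) for level, (x, y) in data.items()}
# Non-horizontal lines -> continuous factor significant; lines that do not
# coincide -> categorical factor significant; unequal slopes -> interaction
```

Here level A's slope (about 2) differs from level B's (about 1), which on a real analysis would suggest an interaction effect.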
Expanding Regression - Continuous Second Factor
Need to illustrate the data in a 3D graph
- Response on the y-axis
- Two continuous factors on orthogonal horizontal axes
- Best-fit model will be a plane through the data as opposed to a line through the data
With these more complicated models, ANOVA should have a balanced design = same number of observations recorded at each combination of factor levels
- Design becomes unbalanced by missing data or by using correlated explanatory factors - non-orthogonal
e.g. if variation in body height is modelled against right-leg length and against left-leg length
- Second entered explanatory variable will appear to have no power to explain height
- First entered explanatory variable will appear highly significant
- Problem is that the two variables are correlated with each other = unbalanced design
- Variables are not orthogonal to each other
- Little variation left over for the 2nd variable because of the variation explained by the 1st entered factor
- Better analysed with a one-factor regression on a single explanatory variable - leg length
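The leg-length problem can be checked before fitting anything: a minimal sketch, with invented measurements (cm), showing that left and right leg lengths are so strongly correlated that they are far from orthogonal as explanatory variables.

```python
import math

def pearson_r(x, y):
    # Pearson product-moment correlation coefficient
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# invented leg lengths: left and right are nearly identical in each subject
right = [80.1, 85.3, 90.2, 95.0, 99.8]
left = [80.0, 85.5, 90.0, 95.2, 99.9]

r = pearson_r(right, left)
# r is close to 1: whichever leg is entered first soaks up nearly all the
# explainable variation, leaving the second with apparently no power
```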
Incorrect Regression
When performing a regression, know which factor is explanatory and which is the response
Response factor must be on the y-axis
Slope and intercept will be different if you swap the y and x factors, but the statistics will remain the same
- Rationale for correlation - tells the relationship between two variables regardless of which one affects which
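This can be sketched directly: fit the regression both ways round and note that the slope and intercept change, while r² (the product of the two slopes) does not depend on which variable is treated as the response. The data are invented.

```python
def fit_line(x, y):
    # (slope, intercept) of the least-squares regression of y on x
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / sxx
    return slope, my - slope * mx

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [1.2, 1.9, 3.1, 3.8, 5.2]

slope_yx, int_yx = fit_line(x, y)   # y treated as the response
slope_xy, int_xy = fit_line(y, x)   # x treated as the response: different line
r_squared = slope_yx * slope_xy     # same r^2 whichever way round you fit
```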
Correlation
Correlation determines association, regression makes predictions
Same graph for correlation and regression - but don't put a regression line through it for correlation
- We don't know which of the variables is the true predictor
Two types of correlation coefficient:
- Pearson product-moment correlation coefficient, r
- r = √(SS(Explained) / SS(Total))
- Sign indicates the direction of covariation
- Non-parametric Spearman correlation coefficient, rs
- r calculated on ranked data
- Less power
Magnitude and sign tell of linearity and direction of correlation - give value of r with its sign
- High positive number = positive correlation
- High negative number = negative correlation
- Low numbers = no correlation
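Both coefficients can be computed by hand; a minimal pure-Python sketch with invented data (Spearman's rs is just Pearson's r on the ranks):

```python
import math

def pearson_r(x, y):
    # Pearson product-moment correlation: covariation of x and y scaled
    # by the spread of each variable, so -1 <= r <= 1
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def spearman_rs(x, y):
    # Spearman's rs = Pearson's r calculated on the ranked data
    def rank(v):
        s = sorted(v)
        return [s.index(a) + 1 for a in v]   # no ties in this toy data
    return pearson_r(rank(x), rank(y))

x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 8.1, 9.8]

r = pearson_r(x, y)      # close to +1: strong positive correlation
rs = spearman_rs(x, y)   # exactly 1.0: the ranks agree perfectly
```

Note rs = 1 even though r < 1: ranking discards the exact magnitudes, which is also why the Spearman version has less power.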
Correlation 2
Non-parametric Spearman's rank is equivalent but done on ranked data
- Less efficient than the parametric r - ranks take the same values regardless of the actual measurements
- Still assumes linearity, normality and homogeneity of residuals from the ranks
Correlation coefficient assumes that the two variables have a linear relation to each other
- Perfect curved relation = r < 1
- Therefore, need to transform one/both axes in order to linearize the relation
Assumptions and Transformations
Transformations help meet assumptions of statistical analyses
- Many types of transformations but none help with missing data or many 0s in data
Assumptions of Statistical Analyses:
- Random sampling - all
- Design consideration - transformation cannot help
- If not met, data must be resampled
- Independence - all
- Design consideration - transformation cannot help
- If not met = resample data or factor out with a new explanatory factor
- Homogeneity of variances - ANOVA, regression and correlation
- Violated by observations that cannot take negative values, e.g. length, as these are likely to have variance that increases with the mean
- Transforming response y can help
- e.g. Log(y) as response
Assumptions and Transformations 2
- Normality - ANOVA, regression and correlation
- About normality of residuals
- Problem for responses constrained between limits, e.g. proportions that can't go below 0 or above 1, and for counts
- Transform response y
- e.g. counts use √y, % use arcsine-root transformation
- Linearity - regression and correlation
- Problem for relationships with different dimensions (e.g. weight to length)
- Comparing 3 dimensions to 1 dimension leads to a cubic relationship, not linear
- Transforming y and/or x help
- e.g. Log(y), Log(x), Sin(x), 1/x
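Two of the response transforms mentioned above can be sketched directly (the count and percentage values are invented):

```python
import math

counts = [0, 1, 4, 9, 16]                   # invented count data
percents = [10.0, 25.0, 50.0, 75.0, 90.0]   # invented percentage data

# square-root transform for counts
sqrt_counts = [math.sqrt(c) for c in counts]

# arcsine-square-root transform for percentages
# (percentages converted to proportions first)
asin_root = [math.asin(math.sqrt(p / 100)) for p in percents]
```

The arcsine-root transform maps proportions from [0, 1] onto [0, π/2], stretching out the squashed values near the two limits.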
If assumptions 3-5 are not met - don't abandon use of parametric stats
- The command 'glm' (in R) fits a Generalized Linear Model - accommodates ANOVA on data with inherently non-normal distributions
- e.g. proportions (binomial distribution), frequencies of rare events (Poisson distribution) and variance increasing with the mean
Transformations
An alternative route to meeting the assumptions is transformation
Less desirable than modelling the error structure with glm - transformation changes the nature of the test question
Understanding underlying biology suggests transformations
- Not cheating - planned in advance and same conversion applied to all observations
The idea is to reduce complexity by converting a non-linear relation to a linear one
Data requiring transformations:
- Response exponential = natural logs
- Response and predictor have different dimensions = logging both axes
- Response saturates = inverse one or both axes
- Response cyclic = circular function e.g. sin, cos
Transforming a Non-Linear Relation
EXAMPLE - population per-capita growth rate originally shows a dramatic decline with population size
Plotting against ln(pop size) draws the data out from close to the y-axis = linear relation
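A minimal sketch of this kind of linearization, with invented data built so that growth rate is an exact linear function of ln(population size): correlating against raw size gives |r| well below 1, while correlating against ln(size) gives |r| = 1.

```python
import math

def pearson_r(x, y):
    # Pearson product-moment correlation coefficient
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# invented data: growth declines linearly with ln(population size)
pop = [10.0, 100.0, 1000.0, 10000.0]
growth = [2.0 - 0.3 * math.log(n) for n in pop]

r_raw = pearson_r(pop, growth)                         # curved: |r| < 1
r_log = pearson_r([math.log(n) for n in pop], growth)  # linear: r = -1
```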
Transforming a Cyclic Relation
If data is cyclical in changes of the response to a continuous variable, think of circular statistics
Use sine(x) to linearize it
Multiple Zeros in Data
Always plot the data first to see any oddities in it
Data may be skewed by multiple 0s
If the 0s don't say anything about the relation, then you could remove them from the data
Steps in Analysing Data
Multiple steps precede the observation stage:
- Planning experimental design
- Response?
- How many factors?
- How many samples?
- How many replicates?
- What statistical analysis?
- Collecting data
- What collection schedule to follow?
- How to control for extraneous variation?
- If you can't control it, factor it in
Always plan data collection and stats before collecting data