Correlation and Transformation
- Created by: rosieevie
- Created on: 11-01-18 14:05
Sometimes we wish to test for correlation without making predictions about how one variable is influenced by the other
- Instead we restrict ourselves to seeking an inter-dependency or association between 2 continuous variables
Regression and correlation do 2 different things
Regression makes predictions about response y to factor x
Correlation measures the association/covariation of x and y when neither variable is identified as the response
Expanding Regression - Categorical Second Factor
You can expand regression from simple to multiple regression by introducing a second factor
Second factor may be categorical
Plot the response variable against the continuous factor
- Calculate one regression line for each level of the categorical factor
If regression lines are not horizontal = significant continuous factor
If regression lines do not coincide = significant categorical factor
If regression lines have different slopes = significant interaction effect
Similar interaction plots to two-way ANOVA:
- x-axis represents a continuous variable
- Lines joining sample means become regression lines for each level of categorical factor
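The idea above can be sketched in a few lines of Python: fit one least-squares line per level of the categorical factor and compare the slopes. The data and level names ("A", "B") are invented for illustration.

```python
def fit_line(x, y):
    # (slope, intercept) of the least-squares regression of y on x
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / sxx
    return slope, my - slope * mx

# invented data: same continuous factor, two levels of a categorical factor
data = {
    "A": ([1, 2, 3, 4], [2.0, 4.1, 5.9, 8.0]),   # steeper line
    "B": ([1, 2, 3, 4], [1.0, 1.9, 3.1, 4.0]),   # shallower line
}
lines = {level: fit_line(x, y) for level, (x, y) in data.items()}
# Non-horizontal lines -> continuous factor significant; lines that do not
# coincide -> categorical factor significant; unequal slopes -> interaction
```

Here level A's slope (about 2) differs from level B's (about 1), which on a real analysis would suggest an interaction effect.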
Expanding Regression - Continuous Second Factor
Need to illustrate the data in a 3D graph
- Response on the y-axis
- Two continuous factors on orthogonal horizontal axes
- Best-fit model will be a plane through the data as opposed to a line through the data
With these more complicated models, ANOVA should have a balanced design = same number of observations recorded at each combination of factor levels
- Design becomes unbalanced by missing data or by using correlated explanatory factors - non-orthogonal
e.g. if variation in body height is modelled against right-leg length and against left-leg length
- Second entered explanatory variable will appear to have no power to explain height
- First entered explanatory variable will appear highly significant
- Problem is that the two variables are correlated with each other = unbalanced design
- Variables are not orthogonal to each other
- Little variation left over for the 2nd variable because of the variation explained by the 1st entered factor
- Better analysed with a one-factor regression on a single explanatory variable - leg length
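The leg-length problem can be checked before fitting anything: a minimal sketch, with invented measurements (cm), showing that left and right leg lengths are so strongly correlated that they are far from orthogonal as explanatory variables.

```python
import math

def pearson_r(x, y):
    # Pearson product-moment correlation coefficient
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# invented leg lengths: left and right are nearly identical in each subject
right = [80.1, 85.3, 90.2, 95.0, 99.8]
left = [80.0, 85.5, 90.0, 95.2, 99.9]

r = pearson_r(right, left)
# r is close to 1: whichever leg is entered first soaks up nearly all the
# explainable variation, leaving the second with apparently no power
```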
Incorrect Regression
When performing a regression, know which factor is explanatory and which is the response
Response factor must be on the y-axis
Slope and intercept will be different if you swap the y and x factors, but the statistics will remain the same
- Rationale for correlation - tells the relationship between two variables regardless of which one affects which
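This can be sketched directly: fit the regression both ways round and note that the slope and intercept change, while r² (the product of the two slopes) does not depend on which variable is treated as the response. The data are invented.

```python
def fit_line(x, y):
    # (slope, intercept) of the least-squares regression of y on x
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / sxx
    return slope, my - slope * mx

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [1.2, 1.9, 3.1, 3.8, 5.2]

slope_yx, int_yx = fit_line(x, y)   # y treated as the response
slope_xy, int_xy = fit_line(y, x)   # x treated as the response: different line
r_squared = slope_yx * slope_xy     # same r^2 whichever way round you fit
```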
Correlation
Correlation determines association, regression makes predictions
Same graph for correlation and regression - but don't put a regression line through it for correlation
- We don't know which of the variables is the true predictor
Two types of correlation coefficient:
- Pearson product-moment correlation coefficient, r
- r = √(SS(Explained) / SS(Total))
- Sign indicates the direction of covariation
- Non-parametric Spearman correlation coefficient, rs
- r calculated on ranked data
- Less power
Magnitude and sign tell of linearity and direction of correlation - give value of r with its sign
- High positive number = positive correlation
- High negative number = negative correlation
- Low numbers = no correlation
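Both coefficients can be computed by hand; a minimal pure-Python sketch with invented data (Spearman's rs is just Pearson's r on the ranks):

```python
import math

def pearson_r(x, y):
    # Pearson product-moment correlation: covariation of x and y scaled
    # by the spread of each variable, so -1 <= r <= 1
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def spearman_rs(x, y):
    # Spearman's rs = Pearson's r calculated on the ranked data
    def rank(v):
        s = sorted(v)
        return [s.index(a) + 1 for a in v]   # no ties in this toy data
    return pearson_r(rank(x), rank(y))

x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 8.1, 9.8]

r = pearson_r(x, y)      # close to +1: strong positive correlation
rs = spearman_rs(x, y)   # exactly 1.0: the ranks agree perfectly
```

Note rs = 1 even though r < 1: ranking discards the exact magnitudes, which is also why the Spearman version has less power.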
Correlation 2
Non-parametric Spearman's rank is equivalent but done on ranked data
- Less efficient than the parametric r - ranks take the same values regardless of the actual measurements
- Still assumes linearity, normality and homogeneity of residuals from the ranks
Correlation coefficient assumes that the two variables have a linear relation to each other
- Perfect curved relation = r < 1
- Therefore, need to transform one/both axes in order to linearize the relation
Assumptions and Transformations
Transformations help meet assumptions of statistical analyses
- Many types of transformations but none help with missing data or many 0s in data
Assumptions of Statistical Analyses:
- Random sampling - all
- Design consideration - transformation cannot help
- If not met, data must be resampled
- Independence - all
- Design consideration - transformation cannot help
- If not met = resample data or factor out with a new explanatory factor
- Homogeneity of variances - ANOVA, regression and correlation
- Violated by observations that cannot take negative values, e.g. length, as these are likely to have variance that increases with the mean
- Transforming response y can help
- e.g. Log(y) as response
Assumptions and Transformations 2
- Normality - ANOVA, regression and correlation
- About normality of residuals
- Problem for responses constrained between limits, e.g. proportions that can't go below 0 or above 1, and for counts
- Transform response y
- e.g. counts use √y, % use arcsine-root transformation
- Linearity - regression and correlation
- Problem for relationships with different dimensions (e.g. weight to length)
- Comparing 3 dimensions to 1 dimension leads to a cubic relationship, not linear
- Transforming y and/or x help
- e.g. Log(y), Log(x), Sin(x), 1/x
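Two of the response transforms mentioned above can be sketched directly (the count and percentage values are invented):

```python
import math

counts = [0, 1, 4, 9, 16]                   # invented count data
percents = [10.0, 25.0, 50.0, 75.0, 90.0]   # invented percentage data

# square-root transform for counts
sqrt_counts = [math.sqrt(c) for c in counts]

# arcsine-square-root transform for percentages
# (percentages converted to proportions first)
asin_root = [math.asin(math.sqrt(p / 100)) for p in percents]
```

The arcsine-root transform maps proportions from [0, 1] onto [0, π/2], stretching out the squashed values near the two limits.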
If assumptions 3-5 are not met - don't abandon use of parametric stats
- The command 'glm' (in R) fits a Generalized Linear Model - accommodates ANOVA on data with inherently non-normal distributions
- e.g. proportions (binomial distribution), frequencies of rare events (Poisson distribution) and variance increasing with the mean
Transformations
An alternative route to meeting the assumptions is transformation
Less desirable than modelling the error structure with glm - transformation changes the nature of the test question
Understanding underlying biology suggests transformations
- Not cheating - planned in advance and same conversion applied to all observations
The idea is to reduce complexity by converting a non-linear relation to a linear one
Data requiring transformations:
- Response exponential = natural logs
- Response and predictor have different dimensions = logging both axes
- Response saturates = inverse one or both axes
- Response cyclic = circular function e.g. sin, cos
Transforming a Non-Linear Relation
EXAMPLE - population per-capita growth rate originally shows a dramatic decline with population size
Plotting against ln(pop size) draws the data out from close to the y-axis = linear relation
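A minimal sketch of this kind of linearization, with invented data built so that growth rate is an exact linear function of ln(population size): correlating against raw size gives |r| well below 1, while correlating against ln(size) gives |r| = 1.

```python
import math

def pearson_r(x, y):
    # Pearson product-moment correlation coefficient
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# invented data: growth declines linearly with ln(population size)
pop = [10.0, 100.0, 1000.0, 10000.0]
growth = [2.0 - 0.3 * math.log(n) for n in pop]

r_raw = pearson_r(pop, growth)                         # curved: |r| < 1
r_log = pearson_r([math.log(n) for n in pop], growth)  # linear: r = -1
```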
Transforming a Cyclic Relation
If data is cyclical in changes of the response to a continuous variable, think of circular statistics
Use sine(x) to linearize it
Multiple Zeros in Data
Always plot the data first to see any oddities in it
Data may be skewed by multiple 0s
If the 0s don't say anything about the relation, then you could remove them from the data
Steps in Analysing Data
Multiple steps precede the observation stage:
- Planning experimental design
- Response?
- How many factors?
- How many samples?
- How many replicates?
- What statistical analysis?
- Collecting data
- What collection schedule to follow?
- How to control for extraneous variation?
- If you can't control it, factor it in
Always plan data collection and stats before collecting data