# Stage 5- Analyse and interpret the data

Geography Skills

## Analysing the data

Qualitative

• No use of statistics
• Statistics may not be needed when the pattern is obvious, e.g. on scatter graphs or when comparing graphs

Quantitative

• Use quantitative analysis to be objective, to know the exact level or direction of a relationship/trend, to know the exact pattern and to identify anomalies
• Allows the analysis to go further

## Types of quantitative data analysis

Descriptive

• Descriptive techniques- central tendency, e.g. mean, mode (most common), median (middle value), scatter graphs
• Deviation from the central tendency- range, interquartile range, standard deviation
• Frequencies (graph)- Kurtosis, skew

Normal distribution

• Mean, mode and median coincide- about 95% of values lie within 2 standard deviations of the mean

Types of data pattern

• Horizontal e.g. nearest neighbour- clustered, regular or scattered
• Vertical e.g. Lorenz curve- cumulative frequency
• Networks- beta index, alpha index, centrality
• Interactions e.g. gravity model

## Types of quantitative data analysis

Tests

• Statistical test for difference e.g. Mann Whitney, Chi-squared
• Statistical test for correlation or association e.g. Spearman's rank, Chi-squared
• In tests, remember that the null hypothesis and degrees of freedom influence the result, and that 95% is the minimum confidence level normally accepted (a significance level of 0.05). Critical values are found from tables of significance

Interpretation

• Look at results in context of original question
• Try to offer explanations for patterns/links/trends and any anomalies

## Central tendency- mean

Measures of central tendency represent data sets by a middle value around which other values cluster

Mean

• Meaning- add the quantities together (sum of values) and divide by the number of quantities
• Most widely used, features in other statistical calculations
• Limitations- Distorted by extreme values and involves calculation; the mean weights each value according to its magnitude; different distributions can give similar mean values
• The mean provides an accurate summary where data have a normal distribution and a narrow range of values but is often unrepresentative when distributions are skewed

## Central tendency- median and mode

Median

• Meaning- the central value when values are put in order
• The median gives equal weight to each value, making it a more representative measure than the mean for data sets that are skewed
• Limitations- gives no idea of other values
• Wildly different data sets can give similar median values
• Rarely used in further statistical calculation because it lacks the mathematical properties of the mean

Mode

• Meaning- The value that occurs most frequently
• Limitations- Gives no idea of other values
• For grouped data, the modal class depends entirely on the arbitrary choice of class interval
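
As a minimal sketch of how the three measures of central tendency behave on a skewed data set, the Python below uses the standard-library `statistics` module. The pebble lengths and variable names are hypothetical values chosen for illustration; they are not data from these notes.

```python
from statistics import mean, median, mode

# Hypothetical sample: pebble long-axis lengths in cm, with one extreme value (40)
pebbles = [4, 5, 5, 6, 7, 8, 40]

pebble_mean = mean(pebbles)      # sum of values divided by the number of values
pebble_median = median(pebbles)  # central value once the values are put in order
pebble_mode = mode(pebbles)      # value that occurs most frequently

print(f"mean={pebble_mean:.2f}, median={pebble_median}, mode={pebble_mode}")
```

The single extreme value pulls the mean well above the median, which illustrates why the median is often preferred for skewed data.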

## Dispersion

Range

• Meaning- difference between highest and lowest values
• Limitations- Distorted by extreme values, only uses two values in the data set

Interquartile range

• Meaning- The spread of the inner half of the data around the median
• Used alongside the median as a statement of dispersion
• The ordered data set is split into four equal parts by the quartiles: the upper quartile is the boundary of the highest 25% of values, the lower quartile is the boundary of the lowest 25% of values, and the difference between the upper and lower quartiles is the interquartile range
• Limitations- Ignores values above and below this
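
The sketch below shows one way the range and interquartile range might be computed in Python. The channel widths are hypothetical, and `method="inclusive"` is only one of several quartile conventions, so textbook answers worked by hand may differ slightly.

```python
from statistics import quantiles

# Hypothetical sample: river channel widths in metres
widths = [2, 4, 5, 7, 8, 10, 12, 15, 30]

# Range: difference between the highest and lowest values
data_range = max(widths) - min(widths)

# quantiles(..., n=4) returns the lower quartile, the median and the upper quartile.
# The "inclusive" method is an assumed convention, not one stated in the notes.
lower_q, middle, upper_q = quantiles(widths, n=4, method="inclusive")
interquartile_range = upper_q - lower_q

print(data_range, interquartile_range)
```

The extreme value of 30 inflates the range but leaves the interquartile range untouched, because the interquartile range only uses the inner half of the data.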

## Dispersion- Standard deviation

Standard deviation

• Meaning- shows spread of all values around the mean
• Incorporates all the values in a data set
• The standard deviation has a precise relationship with data sets which follow a normal frequency distribution
• To convert values into a unit of standard deviation we subtract the mean and then divide by the standard deviation
• Standard deviate = (value - mean) / standard deviation
• Limitations- Uses a formula and calculation; distorted by extreme values

Coefficient of variation

• The value of standard deviation is strongly influenced by the magnitude of the mean
• This is a problem when comparing dispersion in two different data sets with very different means
• Using the coefficient of variation overcomes this problem as it expresses the standard deviation as a percentage of the mean
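
A short sketch of the standard deviation, the standard deviate and the coefficient of variation follows. The rainfall figures are hypothetical, and the population form of the standard deviation (dividing by n) is an assumed convention; some courses divide by n - 1 instead.

```python
from statistics import mean, pstdev

# Hypothetical sample: annual rainfall totals (arbitrary units)
rainfall = [2, 4, 4, 4, 5, 5, 7, 9]

avg = mean(rainfall)
sd = pstdev(rainfall)  # population standard deviation: divides by n (assumption)

def standard_deviate(value):
    """Convert a raw value into units of standard deviation from the mean."""
    return (value - avg) / sd

# Coefficient of variation: standard deviation expressed as a percentage of the mean
coefficient_of_variation = sd / avg * 100

print(avg, sd, standard_deviate(9), coefficient_of_variation)
```

Because the coefficient of variation is a percentage, it can be compared between data sets whose means are very different.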

## Tests for differences

Student's t-test

• Compares the arithmetic means of two samples to determine the likelihood that any difference could have occurred by chance
• It is a parametric test meaning it should only be applied where samples are derived from populations that have a normal frequency distribution

Mann-Whitney U statistic

• Meaning- Compares the ranks (and hence the medians) of two data sets to see if they differ
• Makes no assumptions about the normality of the population from which sample data are drawn
• It can be applied to small data sets, data measured on an ordinal scale and to data sets containing unequal numbers of values
• Limitations- Uses a formula, calculation and significance table, can only be applied to two data sets

## Tests for difference

Chi-squared

• Meaning- Compares observed and expected frequencies, used to determine whether an observed frequency distribution differs significantly from the frequencies that might be expected if the distribution were random
• Limitations- Uses a formula, calculation and significance table; it is not always clear how the expected frequencies should be determined
• There are two versions of the chi-squared test: a one-sample version and a test for two or more sample distributions. Conditions apply:
• Data must be frequencies- the test is invalid for percentages or proportions
• There should not be many categories for which expected frequencies are small

## Calculating the U statistic

• 1. Rank the values from both samples together in order of size
• 2. Where the values are tied the mean ranking is used
• 3. Sum the rank values for each of the data sets separately and then calculate the U statistic using the equations. The smaller of the two values for U is used in statistical tables to estimate significance
• U is significant if it is less than or equal to the critical value listed in the tables
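
The steps above can be sketched in Python as follows. The notes do not reproduce the equations, so the commonly quoted form U1 = n1 × n2 + n1(n1 + 1)/2 − R1 (where R1 is the rank sum of sample 1) is assumed here; the function names and the sample values are hypothetical.

```python
def mean_ranks(values):
    """Map each value to its rank (1 = smallest); tied values share the mean rank."""
    positions = {}
    for position, value in enumerate(sorted(values), start=1):
        positions.setdefault(value, []).append(position)
    return {value: sum(p) / len(p) for value, p in positions.items()}

def mann_whitney_u(sample_a, sample_b):
    """Return the smaller of the two U values for two independent samples."""
    rank_of = mean_ranks(list(sample_a) + list(sample_b))  # rank both samples together
    n_a, n_b = len(sample_a), len(sample_b)
    rank_sum_a = sum(rank_of[v] for v in sample_a)
    rank_sum_b = sum(rank_of[v] for v in sample_b)
    u_a = n_a * n_b + n_a * (n_a + 1) / 2 - rank_sum_a
    u_b = n_a * n_b + n_b * (n_b + 1) / 2 - rank_sum_b
    return min(u_a, u_b)  # the smaller U is compared with the critical value

# Hypothetical samples: pebble sizes at two sites
print(mann_whitney_u([3, 5, 5], [5, 7, 9]))
```

A useful check on hand calculations is that the two U values always add up to n1 × n2.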

## Calculating the one-sample chi-squared

• 1. If we assume the distribution is random we can generate an expected distribution based on frequency
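• 2. The chi-squared formula is then used to calculate the test statistic
• 3. The significance of the chi-squared value is found in statistical tables. For the one-sample test the degrees of freedom are the number of categories minus one; multiplying the number of columns (k) minus one by the number of rows (r) minus one applies to tables with two or more rows.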
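
A minimal sketch of the one-sample calculation, assuming the standard chi-squared formula (the sum of (observed − expected)² / expected) and an expected distribution in which the total is shared equally between categories. For a single row of k categories the degrees of freedom are normally taken as k − 1. The function name and the counts are hypothetical.

```python
def chi_squared_one_sample(observed):
    """Return (chi-squared, degrees of freedom) against an even expected distribution."""
    expected = sum(observed) / len(observed)  # equal share per category (assumption)
    chi_squared = sum((o - expected) ** 2 / expected for o in observed)
    degrees_of_freedom = len(observed) - 1
    return chi_squared, degrees_of_freedom

# Hypothetical counts: settlements found on three rock types of equal area
print(chi_squared_one_sample([10, 20, 30]))
```

If the categories cover unequal areas, the expected frequencies would instead be shared in proportion to area, which this sketch does not attempt.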

## Calculating the two-sample chi-squared

• 1. Calculate the row totals, the column totals and the grand total of the data set
• 2. Calculate the expected frequency for each cell by multiplying its row total by its column total and dividing by the grand total
• 3. Substitute the observed and expected values for each cell in the chi-squared formula and sum the results
• The significance of the chi-squared value is checked in statistical tables
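
The three steps above can be sketched as below, assuming the standard chi-squared formula and a table given as a list of rows. The function name and the counts are hypothetical.

```python
def chi_squared_two_sample(table):
    """Return (chi-squared, degrees of freedom) for a table of observed frequencies."""
    row_totals = [sum(row) for row in table]
    column_totals = [sum(column) for column in zip(*table)]
    grand_total = sum(row_totals)

    chi_squared = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected frequency = row total x column total / grand total
            expected = row_totals[i] * column_totals[j] / grand_total
            chi_squared += (observed - expected) ** 2 / expected

    degrees_of_freedom = (len(table) - 1) * (len(table[0]) - 1)
    return chi_squared, degrees_of_freedom

# Hypothetical counts: two shop types (rows) in two zones of a town (columns)
print(chi_squared_two_sample([[10, 20], [20, 10]]))
```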

## Tests for association

Spearman's rank

• Meaning- measures strength of relationship between two sets of ranked data
• Limitation- only uses ranks of data and uses a significance table

Chi-squared

• Meaning- compares observed frequencies with the frequencies expected under a given hypothesis
• Limitations- Uses a formula, calculation and significance table; it is not always clear how the expected frequencies should be determined

## Trends and relationships

Trends and relationships are often shown on graphs, especially scatter graphs, which tell you:

• The direction of trend/relationship- positive, negative or neutral
• The strength of trend/relationship- strong, weak or non-existent
• The shape of the trend/relationship- linear, parabolic, exponential, unclear
• If there are values that are anomalies

## Spearman's rank correlation

• Non-parametric test so has the advantage of being distribution free
• 1. Rank the values of x and y from 1 (largest) to n (smallest), where n is the number of values (e.g. 12). If two values are equal, allocate to them the same average ranking
• 2. For each pair of values find the difference in rank between them (d) and square each difference (d²)
• 3. Sum the square of the differences
• 4. Complete the calculation of the Spearman correlation using the formula
• 5. The significance of the correlation coefficient is obtained from tables
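
The steps above can be sketched in Python. The notes do not reproduce the formula, so the widely used form rs = 1 − 6Σd² / (n³ − n) is assumed here; the function names and data are hypothetical.

```python
def rank_descending(values):
    """Rank values from 1 (largest); equal values receive the same average rank."""
    positions = {}
    for position, value in enumerate(sorted(values, reverse=True), start=1):
        positions.setdefault(value, []).append(position)
    return [sum(positions[v]) / len(positions[v]) for v in values]

def spearman_rank(x, y):
    """Spearman's rank correlation coefficient for paired values x and y."""
    n = len(x)
    ranks_x, ranks_y = rank_descending(x), rank_descending(y)
    sum_d_squared = sum((rx - ry) ** 2 for rx, ry in zip(ranks_x, ranks_y))
    return 1 - (6 * sum_d_squared) / (n ** 3 - n)

# Hypothetical paired data: distance from the town centre and pedestrian count
print(spearman_rank([1, 2, 3, 4, 5], [1, 3, 2, 5, 4]))
```

The result still has to be checked against a significance table for the number of pairs used.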

## Pearson's product moment correlation

• Its outcome is a coefficient of correlation that has exactly the same properties as the Spearman's rank correlation coefficient
• It is a parametric test so should only be used when sample data are drawn from a statistical population that has a normal distribution

## Testing relationship between data sets

Correlation

• Correlation measures the statistical association between two variables, x and y.
• Variable x is known as the independent variable and is assumed to be responsible for changes in the dependent variable y; correlation alone does not prove this
• Identifying independent and dependent variables is not always straightforward

Correlation coefficients

• Correlation coefficients measure the strength of a relationship or association between two variables
• They vary on a scale of +1 to -1, where +1 is a perfect positive correlation and -1 perfect negative or inverse correlation. A correlation coefficient close to 0 suggests little or no relationship

## Patterns

These are often shown on maps. This also covers morphology (shape). Patterns can be:

• Nucleated or clustered together
• Linear
• Cuneiform (wedge-shaped) or cruciform (cross-shaped)
• Regular
• Concentric
• Random or scattered/dispersed or amorphous

## Networks

In most forms of network analysis there are some key limitations:

• They treat all routes equally, regardless of their quality
• They focus on linkages rather than time or distance of journeys
• They look at planar (flat) networks- no flyovers etc
• They ignore who uses that route

## Network analysis

Any network can be broken down into its main elements. Different terms can be used for the same thing which can cause confusion

• Centres in the network are called nodes or vertices (V)
• The routes are called links or edges (E)
• Independent/unconnected parts are called sub-graphs (G)
• A completely linked set of nodes and routes is called a circuit

The simplest measure of a network is the beta index = E / V. A network with one complete circuit gives a score of 1. The maximum possible result is 3.

A more complex measure is the alpha index, which compares the actual number of circuits with the maximum possible within the network: alpha = (E - V + G) / (2V - 5)

Another measure is centrality which tells us how central or accessible a place is in the network.
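
The beta and alpha indices can be sketched as two small functions, using E for edges, V for vertices and G for sub-graphs as defined above. The function names, the default of one sub-graph and the example network are assumptions for illustration.

```python
def beta_index(edges, vertices):
    """Beta index: number of edges divided by number of vertices."""
    return edges / vertices

def alpha_index(edges, vertices, subgraphs=1):
    """Alpha index: actual circuits (E - V + G) over the maximum possible (2V - 5)."""
    return (edges - vertices + subgraphs) / (2 * vertices - 5)

# Hypothetical network: 6 towns joined by 7 roads in a single connected graph
print(beta_index(7, 6), alpha_index(7, 6))
```

The brackets matter: without them E - V + G / 2V - 5 would be evaluated in a different order and give a different answer.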


## Caution

When using statistical tests there are some key aspects you must remember:

• To state null hypothesis
• To state your alternative hypothesis (accepted if the null hypothesis is rejected)
• To calculate the degrees of freedom
• To know the level of significance you are prepared to accept

## Inferential statistics

Inferential statistics are used to infer population values from sample values. This leads us to the concept of statistical significance and the probability that the outcomes of an investigation based on sample data are due to chance

Standard error of the mean

• Used to assess the value of the population mean from sample data sets
• The logic is that if you took a large number of samples from a population, calculated the mean from each sample and then plotted them as a frequency curve, they would follow a normal distribution
• Enables us to estimate the limits of the population mean because its relationship to the sampling distribution is the same as standard deviation to the normal frequency distribution
• Standard error is inversely related to the square root of the sample size, so larger samples give smaller standard errors
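
A minimal sketch of the standard error of the mean, assuming the usual formula of sample standard deviation divided by the square root of the sample size. The limits of plus or minus 2 standard errors are an approximation to the 95% level, and the function names and data are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev

def standard_error(sample):
    """Standard error of the mean: sample standard deviation / sqrt(sample size)."""
    return stdev(sample) / sqrt(len(sample))

def approximate_95_limits(sample):
    """Roughly 95% limits for the population mean: sample mean +/- 2 standard errors."""
    centre, se = mean(sample), standard_error(sample)
    return centre - 2 * se, centre + 2 * se

# Hypothetical sample: pebble lengths in cm
sample = [2, 4, 4, 4, 5, 5, 7, 9]
print(standard_error(sample), approximate_95_limits(sample))
```

Because the sample size sits under a square root, a sample four times as large is needed to halve the standard error, all else being equal.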

Standard error of the percentage

• Often used when estimating the proportions of land-use types in an area from a sample

## Coefficient of determination

• The coefficient of determination is the product moment correlation coefficient squared, expressed as a percentage
• It measures the statistical variation in y 'explained' by x
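
As a sketch, the coefficient of determination can be obtained by squaring Pearson's product moment correlation coefficient and multiplying by 100. The standard definition of Pearson's r is assumed; the function names and data are hypothetical.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson's product moment correlation coefficient for paired values."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    co_deviation = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sum((a - mean_x) ** 2 for a in x)
    spread_y = sum((b - mean_y) ** 2 for b in y)
    return co_deviation / sqrt(spread_x * spread_y)

def coefficient_of_determination(x, y):
    """Percentage of the variation in y statistically 'explained' by x."""
    return pearson_r(x, y) ** 2 * 100

# Hypothetical paired data
print(coefficient_of_determination([1, 2, 3], [1, 3, 2]))
```

A correlation of 0.5 therefore "explains" only 25% of the variation in y, which is why moderate coefficients should be interpreted cautiously.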

## Simple linear regression

• Simple linear regression, involving two variables, x and y, is a technique for fitting a straight line to points on a scatter chart
• The regression line is known as 'least squares' because it minimises the sum of the squares of the deviations from the line, and is statistically the 'best fit'
• Regression allows us to predict a value of y from a known value of x
• A regression equation provides us with a precise model of the relationship between two variables and allows us to make comparisons with the same variables in other geographical locations
• Linear regression models are inappropriate where data trends are curvilinear
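
A minimal least-squares sketch for the line y = a + bx follows, assuming the standard formulas for the slope (b) and intercept (a). The function names and data are hypothetical.

```python
def least_squares_line(x, y):
    """Return (slope, intercept) of the least-squares line through paired values."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    slope = (sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
             / sum((a - mean_x) ** 2 for a in x))
    intercept = mean_y - slope * mean_x  # the line passes through (mean_x, mean_y)
    return slope, intercept

def predict(x_value, slope, intercept):
    """Predict y from a known value of x using the regression equation."""
    return intercept + slope * x_value

# Hypothetical data: distance downstream (km) and channel width (m)
slope, intercept = least_squares_line([0, 1, 2, 3], [1, 3, 5, 7])
print(slope, intercept, predict(4, slope, intercept))
```

Predictions are safest within the range of x values sampled; extending the line beyond them assumes the same straight-line trend continues.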

## Spatial statistics

Index of dissimilarity

• The index of dissimilarity is usually applied to the study of segregation among ethnic groups
• It measures the unevenness with which two groups are distributed within small spatial units such as wards or census tracts
• The index ranges from 0 to 1. The higher the score the more segregated the groups are
• An index of zero means that the proportion of group B's population in each census tract is exactly the same as the proportion of group W's population
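
The index can be sketched as half the sum of the absolute differences between each group's share of its own total population in every spatial unit. This is the commonly used form and is assumed here; the function name and the counts are hypothetical.

```python
def index_of_dissimilarity(group_b, group_w):
    """Index of dissimilarity (0 to 1) from per-tract counts of two groups."""
    total_b, total_w = sum(group_b), sum(group_w)
    return 0.5 * sum(
        abs(b / total_b - w / total_w)  # difference in each group's share of a tract
        for b, w in zip(group_b, group_w)
    )

# Hypothetical counts for three census tracts
print(index_of_dissimilarity([10, 20, 30], [20, 40, 60]))  # same proportions in every tract
print(index_of_dissimilarity([100, 0], [0, 100]))          # groups never share a tract
```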

## Nearest neighbour analysis and location quotients

• Technique measuring point patterns in space
• Gives precise descriptions to rural settlement patterns
• The nearest neighbour index ranges from 0, where all the points form a single cluster, to 2.15, which is a perfectly uniform pattern
• The technique is based on finding the average distance between points and their nearest neighbour. Taking each point in turn, the distance to the nearest neighbouring point is measured and the index is calculated using the formula
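
The notes do not reproduce the formula, so the sketch below assumes the commonly quoted form Rn = 2 × mean nearest-neighbour distance × √(n / A), where n is the number of points and A is the study area. The function name, coordinates and area are hypothetical.

```python
from math import dist, sqrt

def nearest_neighbour_index(points, area):
    """Nearest neighbour index for (x, y) points within a study area."""
    n = len(points)
    # Taking each point in turn, find the distance to its nearest neighbour
    nearest = [
        min(dist(p, q) for j, q in enumerate(points) if j != i)
        for i, p in enumerate(points)
    ]
    mean_distance = sum(nearest) / n
    return 2 * mean_distance * sqrt(n / area)

# Hypothetical settlements (km grid coordinates) in a 4 square km study area
print(nearest_neighbour_index([(0, 0), (0, 1), (1, 0), (1, 1)], area=4))
```

The result is sensitive to the study area chosen: enlarging the boundary around the same points lowers the index and makes the pattern look more clustered.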

Location quotients

• Most often used to measure the concentration of an economic activity in an area or region compared to the national average
• A location quotient of 1 shows that the activity is represented in exactly the same proportion as nationally
• Less than 1 suggests that the activity is less important locally than nationally
• More than 1 indicates that the activity is more important locally compared to the national average
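
A location quotient can be sketched as the local share of an activity divided by its national share. Employment is assumed as the measure here, and the function name and figures are hypothetical.

```python
def location_quotient(local_activity, local_total, national_activity, national_total):
    """Local share of an activity divided by its national share."""
    local_share = local_activity / local_total
    national_share = national_activity / national_total
    return local_share / national_share

# Hypothetical employment figures: 200 of 1,000 local jobs are in the activity,
# compared with 1,000 of 10,000 jobs nationally
print(location_quotient(200, 1000, 1000, 10000))
```

In this example the activity takes a 20% share locally against 10% nationally, so its local share is twice the national share.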