# Statistics

- Created by: CommanderWuffels
- Created on: 21-12-19 10:13

## Variables

A variable = something that can vary!

- Categorical (non-numerical) - e.g., breed of dog
- Continuous (numerical) - e.g. amount of dog saliva in ml
- Discrete (numerical) - e.g. no. of times dog salivated

## Samples and Populations

Definitions

- A **population** consists of all possible people or items who/which have a particular characteristic.
- A **sample** refers to a selection of individual people or items from a population.

**Population Parameters** - population mean, population standard deviation

**Sample Statistics** - sample mean, sample standard deviation

## Descriptive Statistics

- Data can be very complex and there can be lots of it.
- It's therefore useful to be able to summarise it.
- With this in mind we might ask ourselves some questions:
- What's a typical score in the data? Often (but not always) this is one from near the middle.
- How variable are my data? Gives an indication of how spread out the data are.
- What do my data look like? Do the data have a characteristic shape?
- Answering these would give us a good starting summary of our data.

Measures of central tendency:

- Mean - add up the sample scores and divide by the number of scores.
- Median - order (rank) the scores and find the one in the middle
- Mode - the most frequently occurring score in the sample.

Measures of spread/variability

- Range - difference between the minimum and maximum score. The range can be identical for distributions with very different shapes, so it is a crude measure.
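These measures (and the range) can be sketched with Python's standard `statistics` module; the scores below are hypothetical:

```python
import statistics

# Hypothetical sample: number of times each of 7 dogs salivated
scores = [2, 3, 3, 5, 7, 8, 14]

mean = statistics.mean(scores)          # add up and divide by N -> 6
median = statistics.median(scores)      # middle of the ranked scores -> 5
mode = statistics.mode(scores)          # most frequent score -> 3
data_range = max(scores) - min(scores)  # maximum minus minimum -> 12
```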

## Average Deviation

- **Step 1:** Calculate the mean.
- **Step 2:** Calculate the deviation of each score from the mean.
- **Step 3:** Calculate the average deviation.

Not very useful!

The problem is that the mean lies near the centre of the data, so roughly half of the deviations are negative and half are positive; when added together they cancel each other out.
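A quick check of this cancellation, using hypothetical scores:

```python
import statistics

scores = [2, 3, 3, 5, 7, 8, 14]
m = statistics.mean(scores)           # Step 1: mean = 6

deviations = [x - m for x in scores]  # Step 2: deviation of each score

# Step 3: the positive and negative deviations cancel exactly,
# so the "average deviation" is always 0
total = sum(deviations)
```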

## Averaged Squared Deviation

- **Step 1:** Calculate the mean.
- **Step 2:** Calculate the deviation of each score from the mean.
- **Step 3:** Square each deviation.
- **Step 4:** Calculate the average squared deviation.

## Sample Variance

- **Step 1:** Calculate the mean.
- **Step 2:** Calculate the deviation of each score from the mean.
- **Step 3:** Square each deviation.
- **Step 4:** Calculate the **modified** average squared deviation (divide by N - 1 rather than N; when N is big this makes little difference).

## Standard Deviation

- **Step 1:** Calculate the mean.
- **Step 2:** Calculate the deviation of each score from the mean.
- **Step 3:** Square each deviation.
- **Step 4:** Calculate the **modified** average squared deviation (the **variance**).
- **Step 5:** Take the square root.

- More "concentrated" data have a smaller standard deviation.
- More spread-out data have a larger standard deviation.
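A minimal sketch of these steps, again on hypothetical scores (the stdlib `statistics.variance` and `statistics.stdev` use the same N - 1 formulas):

```python
import statistics

scores = [2, 3, 3, 5, 7, 8, 14]                   # hypothetical data
m = statistics.mean(scores)                       # Step 1: mean = 6

squared_devs = [(x - m) ** 2 for x in scores]     # Steps 2-3: square deviations
variance = sum(squared_devs) / (len(scores) - 1)  # Step 4: divide by N - 1
sd = variance ** 0.5                              # Step 5: square root
```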

## Frequency Histograms

Good way to inspect data

- Are there any odd looking scores?
- What is the mode?
- How are the scores spread out?
- How is the data distributed?
- Could also look at the data in terms of the proportion instead of frequency.

## Box Plot

- Get hinge positions by adding 1 to the median position and dividing by two (i.e. the lower hinge is the nth score from the bottom, the upper hinge the nth score from the top).
- h-spread = upper hinge - lower hinge
- Inner fences: lower hinge - 1.5 × h-spread and upper hinge + 1.5 × h-spread.
- Adjacent values - lowest and highest values falling within the inner fence.
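The hinge and fence arithmetic above can be sketched for a small hypothetical sample (N = 9, so the hinge position works out to a whole number):

```python
# Hypothetical sorted sample of N = 9 scores
scores = sorted([3, 5, 6, 7, 8, 9, 12, 13, 40])
n = len(scores)

median_pos = (n + 1) / 2               # 5th ranked score
hinge_pos = int((median_pos + 1) / 2)  # 3rd score from either end

lower_hinge = scores[hinge_pos - 1]    # 3rd from the bottom -> 6
upper_hinge = scores[n - hinge_pos]    # 3rd from the top -> 12
h_spread = upper_hinge - lower_hinge   # 6

lower_fence = lower_hinge - 1.5 * h_spread  # -3.0
upper_fence = upper_hinge + 1.5 * h_spread  # 21.0

# Adjacent values: lowest and highest scores falling inside the inner fences
inside = [x for x in scores if lower_fence <= x <= upper_fence]
adjacent = (min(inside), max(inside))       # (3, 13); 40 lies beyond the fence
```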

## Scatter Plot

Each dot represents one participant, with their corresponding scores on the x-axis and the y-axis.

## Basic Probability

- A major goal for us in statistics is to be able to make inferences/draw conclusions from data.
- We can never be certain about such inferences/conclusions, so **probabilities** will be fundamentally important.
- Probability of some event occurring: P(event) = no. of possible outcomes consistent with the event / no. of possible outcomes.
- Conditional probability - the probability of an event given that something else is known.
- P(event | condition) = no. of possible outcomes consistent with the event / no. of possible outcomes given the condition.
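Both definitions can be checked by counting outcomes directly; a sketch using two fair dice as a hypothetical example:

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely outcomes of rolling two fair dice
outcomes = list(product(range(1, 7), repeat=2))

# P(event) = outcomes consistent with the event / all possible outcomes
sevens = [o for o in outcomes if sum(o) == 7]
p_seven = Fraction(len(sevens), len(outcomes))  # 6/36 = 1/6

# Conditional probability: P(sum is 7 | first die shows 3)
given = [o for o in outcomes if o[0] == 3]
p_cond = Fraction(len([o for o in given if sum(o) == 7]), len(given))  # 1/6
```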

## Sampling Error

- Sampling error is the error associated with examining a sample rather than a population.
- Occurs because in our sample we do not have all the members of the population.

- Sampling error depends on the size of the sample
- Bigger sample - big sampling errors less likely
- Smaller sample - big sampling errors more likely
- With a large sample size the sample mean is more likely to be close to the population mean.
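A quick simulation illustrates this; the population here is a hypothetical normal one with mean 100 and s.d. 15:

```python
import random
import statistics

random.seed(1)  # fixed seed so the run is reproducible

def sample_means(n, reps=2000, mu=100, sigma=15):
    """Means of `reps` samples of size n from a hypothetical N(mu, sigma) population."""
    return [statistics.mean(random.gauss(mu, sigma) for _ in range(n))
            for _ in range(reps)]

small = sample_means(5)   # small samples: means vary a lot
large = sample_means(50)  # large samples: means hug the population mean

spread_small = statistics.stdev(small)  # roughly 15 / sqrt(5)  ~ 6.7
spread_large = statistics.stdev(large)  # roughly 15 / sqrt(50) ~ 2.1
```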

## Distributions of Data

- The way in which data are distributed is very important.
- Histogram/frequency plot shows you this with the shape.

Normally Distributed Data

- Many naturally occurring variables are normal (we say they are "normally distributed")
- Data clusters around a central "most likely" value with increasingly fewer data points with distance from the centre.
- If we don't have much data then the normality is difficult to see in the histogram.

Non-Normally Distributed Data

- Of course, not all data is normal
- Skewed - peaks on one side instead of the middle - mean is distorted by tails.
- Bimodal - two modes - mean not representative.

## Linking Probability and Distributions

- Histogram tells us about the data's "probability distribution".
- If each bin had width 1 then the area of a bar would be 1 × bar height = f_bin (the probability of picking someone within that bin).
- The total area would then be the sum of all the f_bins (the probability of picking someone within the full range of scores).
- The area over a range would be the sum of the f_bins in that range, e.g. Sum(f1..f10) (the probability of picking someone within that range).
- If there was an infinite amount of data then the bars would fall perfectly on the theoretical distribution of the data in the population from which the samples were drawn.
- If we had theoretical population distribution we could ask similar sorts of questions.
- It turns out that the probability of selecting someone at random within a particular range of the IV is based on the area under the appropriate distribution curve.

## The Normal Distribution

Particularly important "theoretical" distribution shape.

Properties:

- "bell-shaped"
- Symmetric about the mean
- Tails get to 0 at +/- infinity.
- Completely specified by its mean and standard deviation, hence the notation N(pop. mean, pop. s.d.) is commonly used.
- The area under the curve is always equal to 1.
- Normal distribution is very close to 0 by the time it gets to 3 standard deviations away from the mean, but it never actually gets to 0.

## Area Under a Normal Distribution

Obviously the area (and hence the probability of selecting someone at random who falls in a particular range) will depend on the mean and standard deviation of the normal distribution.

- Smaller standard deviation - less likely to get extreme values.
- Larger standard deviation - more likely to get extreme values.

How to calculate the shaded area under the normal distribution?

There is a table that can be used, but there can't be one for every possible normal distribution - so you must convert the score to a z-score.

## Z-Scores

A score sampled from a normally distributed population can be transformed to a z-score.

To do this, subtract the population mean from the score and then divide by the population standard deviation:

z = (x-pop mean)/pop s.d.
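A minimal sketch of this formula (the score and population parameters are hypothetical):

```python
def z_score(x, pop_mean, pop_sd):
    """Transform a raw score into a z-score: z = (x - pop. mean) / pop. s.d."""
    return (x - pop_mean) / pop_sd

# Hypothetical example: a score of 130 from a population with mean 100, s.d. 15
z = z_score(130, 100, 15)  # 2.0 -> two standard deviations above the mean
```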

## Standard Normal Distribution

- All normally distributed data will follow the standard normal distribution when transformed to a z-score.
- If x ~ N(pop. mean, pop. s.d.) and z = (x - pop. mean)/pop. s.d., then z ~ N(0, 1).

Why would we do this?

- Because then we can use a single distribution which tells us everything we need to know about any normally distributed population.
- In particular we can use a single "look up table" to tell us about areas underneath the curve in different regions defined by the z-score.

Crucial facts:

- Area under SND curve between any two z-scores represents the probability of obtaining a score in that range.

## SND Rules

- You can convert z-scores back to the equivalent x-score by multiplying the z-score by the standard deviation and then adding on the mean.
- Total area under the standard normal distribution is 1.
- p(z < a) - probability that z-score less than a = area under curve below a.
- p(z > a) = p(z < -a)
- p(z > a) = 1 - p(z < a)
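These rules can be checked with the stdlib `statistics.NormalDist` (Python 3.8+), which plays the role of the look-up table:

```python
import math
from statistics import NormalDist

snd = NormalDist(0, 1)  # the standard normal distribution, N(0, 1)

a = 1.0
p_below = snd.cdf(a)    # p(z < a): area under the curve below a (~0.8413)

# p(z > a) = p(z < -a) = 1 - p(z < a), by symmetry
symmetric = math.isclose(1 - p_below, snd.cdf(-a))

# Converting a z-score back to an x-score: x = z * s.d. + mean
# (hypothetical N(100, 15) population)
x = a * 15 + 100        # 115.0
```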

## Sampling Distributions

- Repeatedly sample from a distribution and calculate a sample statistic (e.g. mean, s.d., median)
- Plot a histogram of the sample statistic and we get a sampling distribution.

## Sampling Distribution of the Mean

- Sampling distribution of the mean is a particularly important sampling distribution.
- Often we'll be interested in the mean of our sampling scores and we'll ask questions such as:

"Is it likely that I would have obtained a sample mean this large if it came from that population?"

"Are the means of these two groups very different?"

- To answer these kinds of questions we need to know about the distribution of sample means.
- Provided we know the parameters of the "parent population" from which data were sampled then we know about the parameters of the sampling distribution of the mean.
- The sampling distribution of the mean has the same mean as the population but a smaller standard deviation: pop. s.d. / √N (the **standard error**), which must be smaller than the standard deviation of the parent population.
- The SDM is a normal distribution.
- The SDM is a distribution of sample means, not of individual scores.
- Since SDM ~ N(pop. mean, pop. s.d./√N) we can convert a sample mean to a z-score by the usual method.
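A small worked example, assuming a hypothetical population with mean 100 and s.d. 15:

```python
import math

# Hypothetical: a sample of N = 25 scores from a population with
# mean 100 and standard deviation 15
pop_mean, pop_sd, n = 100, 15, 25

standard_error = pop_sd / math.sqrt(n)  # 15 / 5 = 3.0

# z-score of a sample mean of 106 under the SDM
sample_mean = 106
z = (sample_mean - pop_mean) / standard_error  # 2.0
```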

## Standard Error

The S.D. of the sampling distribution of the mean is usually just called the **standard error** (of the mean).

For a normally distributed population with known mean and standard deviation, the standard error is pop. s.d. / √N, where N is the size of the sample.

## The Central Limit Theorem

- Given a population mean μ and s.d. σ, the sampling distribution of the mean approaches a normal distribution with mean μ and s.d. σ/square root of N as N, the sample size, increases.
- Note that this is true regardless of the underlying distribution - so even if your population is not normal the distribution of means sampled from it will be.
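A simulation sketch of this claim, drawing samples from a deliberately non-normal (uniform) population:

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is reproducible

# Parent population: uniform on [0, 1) -- clearly not normal
# (population mean 0.5, population s.d. sqrt(1/12) ~ 0.289)
n = 30
means = [statistics.mean(random.random() for _ in range(n))
         for _ in range(5000)]

centre = statistics.mean(means)   # close to the population mean, 0.5
spread = statistics.stdev(means)  # close to 0.289 / sqrt(30) ~ 0.053
```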

## Confidence Intervals

- As noted, sample statistics and population parameters differ due to sampling error.
- Usually we'll have a sample statistic but really we want to make an inference about the population.
- Confidence intervals allow us to pin down the population parameter in question to a range of likely values.
- A 100% confidence interval would contain all possible population means.
- A 95% confidence interval narrows this down.
- The SDM is normal and like any other normal distribution we can convert it to the standard normal distribution by using z-scores.
- We know that for any normal distribution, 95% of scores are within 1.96 standard deviations of the mean.
- We can therefore express 95% confidence that the sample mean we obtained lies within 1.96 standard errors of the population mean, giving the interval:
- m ± 1.96 × σ/√N
- If we sampled from the population many times and worked out a 95% CI each time, then on about 95% of those occasions the interval would contain the population parameter.
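A worked example of the interval, with hypothetical numbers (known population s.d.):

```python
import math

# Hypothetical: sample mean 103 from N = 36 scores, known population s.d. 15
sample_mean, pop_sd, n = 103, 15, 36

se = pop_sd / math.sqrt(n)       # standard error = 2.5
lower = sample_mean - 1.96 * se  # 98.1
upper = sample_mean + 1.96 * se  # 107.9
```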

