# Statistics

- Created by: ancyaugustine321
- Created on: 25-11-15 16:34

## Types of Data

Data can be **quantitative** (numerical data) or **qualitative** (non-numerical data).

Quantitative data can be **discrete** (when it can only take certain values) or **continuous** (when it can take any value in a certain range).

## Grouping Data

When the spread of data is too big we often group the data in a **frequency table** using **class intervals**. Grouping data can result in loss of accuracy in both calculations & presentations.

**Bivariate data** are used when you measure 2 related things e.g. height & weight of children.

**Categorical data** is data sorted into categories (groups)

**Ranked data** have values that can be ranked (put into order)

## Collecting Data

**Primary data** - collected by person who is going to use it. Not yet been processed.

Advantages: known accuracy, know how obtained.

Disadvantages: time consuming, expensive.

**Secondary data** - collected by someone else.

Advantages: cheap & easy to obtain.

Disadvantages: might be out of date, have mistakes, don’t know how collected.

**Experiments –** one variable is controlled (**explanatory** or **independent** variable) while its effect on the other variable (**response** or **dependent** variable) is observed.

## Collecting Data 2

**Surveys** are useful if collecting personal data. Main survey methods are:

- Postal (advantage cheap; disadvantage poor response, limited data can be collected)
- Personal interviews (advantage good response; disadvantage expensive, interviewer can influence results)
- Telephone surveys (same as for personal interviews)
- Observations (advantage systematic; disadvantage results can depend on chance)

**Population & Sampling**

**Population** is the group you want to find out about (eg all girls in school; all cars in UK)

A **Census** is information about *every* member of the population.

## Collecting Data 3

A **sample** of data is collected from a part of the population in order to make conclusions about the whole population.

**Advantages**: practical, cheaper, quicker than doing a census.

**Disadvantages**: don’t have information about every member of the population. May not be representative of the population.

Always try to ensure sample is free from **bias.**

## Types of sampling

· **Simple random sampling** – every member of the population has an equal chance of being selected. Advantage: unbiased. Number the population then use a calculator or random number tables to generate random numbers. Pick people with those numbers.

· **Systematic sampling** – divide population size by sample size eg 2000 ÷ 50 = 40. Randomly choose a start number between 1 & 40, say 3. Your sample will be the 3rd person on the list then every 40th person after that until you have 50 people. Should produce an unbiased sample unless there is some sort of pattern in the data.

· **Stratified Sampling** – split population into groups (strata) & choose a number from each group *proportional* to the size of the group in the population. Useful if you have easy to define categories eg age, gender.
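The systematic and stratified procedures above can be sketched in Python. This is only an illustration: the function names, the numbered list of 2000 people and the year-group sizes are made up for the example.

```python
import random

def systematic_sample(population, sample_size, rng=random):
    """Take every k-th member after a random start, where k = N // n."""
    interval = len(population) // sample_size        # eg 2000 // 50 = 40
    start = rng.randint(1, interval)                 # random start between 1 and k
    # 1-based positions on the list: start, start + k, start + 2k, ...
    positions = range(start, start + interval * sample_size, interval)
    return [population[p - 1] for p in positions]

def stratified_allocation(strata_sizes, sample_size):
    """Number to choose from each stratum, proportional to its size."""
    total = sum(strata_sizes.values())
    return {name: round(sample_size * size / total)
            for name, size in strata_sizes.items()}

people = list(range(1, 2001))                        # hypothetical numbered list
sample = systematic_sample(people, 50)

# Hypothetical school of 1000 pupils
year_groups = {"Year 7": 400, "Year 8": 300, "Year 9": 200, "Year 10": 100}
allocation = stratified_allocation(year_groups, 50)
# allocation == {"Year 7": 20, "Year 8": 15, "Year 9": 10, "Year 10": 5}
```

When the proportions do not come out as whole numbers, the rounded amounts may not add up to the sample size exactly and need adjusting by hand.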

## Types of sampling 2

· **Cluster sampling** – choose a random sample of small groups (clusters) then use all members from each selected cluster. Disadvantage: Can get a biased sample eg people living in same postal district could have similar incomes or employment.

· **Quota sampling** – often used in market research. Divide population into groups based on age, gender etc. Interview a certain number of people from each group. Disadvantage: Can easily be biased as sample chosen depends on interviewer.

· **Convenience sampling** – sample chosen for convenience. Taken from a section of the population present at one particular place & time eg outside a shop. Disadvantage: easily biased as no attempt made to make sample representative

## Questionnaire

A questionnaire is a set of questions designed to obtain data from a population.

- Questions should be relevant
- Questions should be clear & easy to understand
- Have option boxes or an opinion scale
- Questions shouldn’t be leading
- Questions should be unambiguous

An **open** question has no suggested answers & gives people a chance to reply as they wish.

A **closed** question has a set of answers for the **respondent** to choose from.

**Opinion scales** – people are given a statement & asked to use a scale of say 1 to 5 to say how strongly they agree or disagree with the statement.

## Questionnaire 2

A **pilot survey** is a small scale replica of the actual survey or experiment that is to be carried out. Useful to test the design & methods of that survey. Also gives idea of response rate.

**Random response** – useful for sensitive questions eg:

Toss a coin. If it lands on heads tick the “yes” box. If it lands on tails answer the question “have you committed a crime in the past 12 months?”. If 100 people are surveyed you expect roughly 50 of them to get a “head” and tick “yes”. If 60 people ticked “yes”, then about 10 of the 50 who got a “tail” must have committed a crime in the last 12 months, so the estimated proportion of people committing a crime is 10/50 or 20%.
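The same reasoning can be written as a short calculation. This is a sketch that assumes a fair coin and that exactly half of those surveyed get heads; the function name is made up.

```python
def random_response_estimate(surveyed, yes_count):
    """Estimate the proportion who truthfully answer 'yes' to the sensitive question."""
    forced_yes = surveyed / 2            # expected heads, who tick "yes" regardless
    truthful = surveyed - forced_yes     # expected tails, who answer the real question
    return (yes_count - forced_yes) / truthful

estimate = random_response_estimate(100, 60)   # 0.2, ie 20%
```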

## EXPERIMENTAL DESIGN

**Before & after experiments** – investigate the same group of people before & after an event to see how they are affected by it.

**Capture – recapture methods** – eg catch 50 fish from a lake & mark them with a tag. Return them to the lake. A little later catch 200 fish. If 10 have a tag then estimate the number of fish (n) in the lake using proportions: 10/200 = 50/n. Solve this to find n. Note you are assuming the population of fish does not change (ie no births/deaths).

**Control group** – a group that is as similar as possible to the experimental group. Eg if testing a new medicine randomly give ½ the people the new medicine & ½ a placebo.

**Data logging** – mechanical or electronic way of collecting data at set intervals.

**Matched pairs** – use pairs of members of the population eg identical twins. Put one member from each pair into each of 2 groups. Treat each group differently & then compare the groups.
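For the capture – recapture example, solving 10/200 = 50/n gives n = 50 × 200 ÷ 10. A minimal sketch (the function and parameter names are made up):

```python
def capture_recapture_estimate(marked, second_catch, marked_in_second):
    """Solve marked_in_second / second_catch = marked / n for n."""
    return marked * second_catch / marked_in_second

n = capture_recapture_estimate(marked=50, second_catch=200, marked_in_second=10)  # 1000.0
```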

## Accuracy

Be aware of accuracy of measurements. If something (x) measured as 5 to nearest whole number then its lower and upper bounds are 4½ ≤ x < 5½.

## STATISTICAL DIAGRAMS

**Pictograms** – use symbols or pictures to represent items

**Bar charts** – bars may be vertical or horizontal.

**Multiple bar charts** – have more than one bar for each class.

**Compound bar charts** – have single bars split into separate sections for each category.

**Pie charts** – good way of displaying data when you want to show how something is shared or divided. When constructing pie charts for real data, you will often need to use rounded values for the size of the angles.

**Comparative pie charts** – can be used to compare 2 sets of data of different sizes. The *areas* of the 2 circles should be in the same ratio as the 2 frequencies.

## STATISTICAL DIAGRAMS 2

**Stem & leaf diagrams** – allow you to show the distribution of the data while retaining the detail of the data. Must have a *key* (eg: 4|7 = 47).

**Back to back stem & leaf diagrams** – allow comparison of 2 sets of data.

**Frequency polygons** – used to show the shape of a continuous frequency distribution. Frequencies are plotted against the *mid-points* of each class interval & joined with lines.

**Population pyramids** – consist of 2 back to back bar charts which allow you to compare aspects of a population, usually by gender.

**Histograms** – made up of a series of bars where the *area* of the bar represents the frequency of the class interval. In a histogram, frequency density = frequency / class **width**.
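As an illustration of the frequency density rule for histograms, here is a small sketch. The table is hypothetical, with unequal class widths.

```python
def frequency_densities(classes):
    """classes: list of (lower, upper, frequency); returns frequency / class width."""
    return [freq / (upper - lower) for lower, upper, freq in classes]

# Hypothetical grouped data: 0-10, 10-20 and 20-50
table = [(0, 10, 5), (10, 20, 12), (20, 50, 15)]
densities = frequency_densities(table)   # [0.5, 1.2, 0.5]
```

The last class has the largest frequency but, because it is 3 times as wide, its bar is no taller than the first.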

## Choropleth maps

Choropleth maps are used to classify regions. Different regions are shaded with an increasing depth of colour, or using hatched lines or dots.

## Misleading Graphs

Data presentations can be **misleading**. All data presentations should:

- Take care not to misuse length, area, volume (eg doubling the length of the side of a square actually multiplies the area by 4).
- Mark scales clearly and make sure they go up in equal steps
- Give units
- Clearly label & title

## Cumulative frequency and SUMMARISING DATA

Used to estimate median, LQ, UQ, percentiles, deciles etc.

Add an extra column to your frequency table with the cumulative frequencies in it.

Plot cumulative frequencies on the vertical axis against the *upper bound* of each class interval on the horizontal axis.

Points joined with a smooth cumulative frequency curve for continuous data but with straight lines (cumulative frequency step polygon) for discrete data.

**SUMMARISING DATA**

To summarise any set of data we use a measure of **location** (ie an average) and a measure of **spread** (ie how spread out the data are around that average).

## MEASURES OF LOCATION (averages)

The **mode** is the value that occurs the most often. A set of data can have 2 modes or no modes at all.

The **mean** is calculated by adding all the data values up and dividing by how many there are

The **median** is the middle value when they are written out in order. If there are 2 middle values then you take the average of them. If there are *n* values then the median is the $\frac{n+1}{2}$th value.

Need to know how mean, median affected by adding or removing a data value or by making adjustments to all data values *(see mymaths – GCSE statistics – averages – making adjustments & mymaths – GCSE statistics – averages – changing the data)*

**Weighted mean** is used to combine different sets of data when one is more important than another.

The **mean from a frequency distribution** is calculated by creating a new column (*fx*) from the first 2 columns (*x* and *f*) and summing it to help you work out the mean: mean = Σ*fx* ÷ Σ*f*.

## Grouped frequency means

If you have **grouped frequency data** (0 ≤ *x* < 5, 5≤ *x* < 10 etc) then the best you can do is estimate the mean using the **mid-point** (2.5, 7.5 etc) and carry out the same process as above.
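A sketch of the mid-point estimate, using hypothetical frequencies for the classes 0 ≤ x < 5, 5 ≤ x < 10 and 10 ≤ x < 15 (the function name and the frequencies are made up):

```python
def grouped_mean(classes):
    """Estimate the mean from (lower, upper, frequency) rows using class mid-points."""
    total_fx = sum((lower + upper) / 2 * freq for lower, upper, freq in classes)
    total_f = sum(freq for _, _, freq in classes)
    return total_fx / total_f

table = [(0, 5, 4), (5, 10, 10), (10, 15, 6)]
estimate = grouped_mean(table)   # (2.5*4 + 7.5*10 + 12.5*6) / 20 = 8.0
```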

The **modal class** will be the class with the highest frequency.

## Median

**Median** can be found using a cumulative frequency diagram or can be estimated from a table using *interpolation* as follows:

Find the interval in which the median lies (call the lower bound of this interval L, the frequency of this interval f and the width of this interval w).

Find the position of the median within this interval (say its position is p, ie how far the median’s position is past the cumulative frequency at the start of the interval).

The median is L + (p/f) × w
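The interpolation steps can be sketched as below. The sketch assumes the median position is taken as n/2 for grouped data, which is a common convention rather than something stated in these notes; the table is hypothetical.

```python
def interpolated_median(classes):
    """Estimate the median from (lower, upper, frequency) rows by linear interpolation."""
    total = sum(freq for _, _, freq in classes)
    if total == 0:
        raise ValueError("table has no data")
    target = total / 2                 # assumed median position for grouped data
    cumulative = 0
    for lower, upper, freq in classes:
        if cumulative + freq >= target:
            p = target - cumulative    # position of the median inside this interval
            return lower + (p / freq) * (upper - lower)   # L + (p/f) x w
        cumulative += freq

table = [(0, 5, 4), (5, 10, 10), (10, 15, 6)]
median = interpolated_median(table)    # L = 5, p = 6, f = 10, w = 5, so about 8
```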

## MEASURES OF SPREAD

Range is difference between highest and lowest values.

**Quartiles** divide the data into 4 equal groups. Find the median then find the median of the values below/above the median. (Or use: LQ is the $\frac{n+1}{4}$th value & UQ is the $\frac{3(n+1)}{4}$th value.)

· **Interquartile Range (IQR)** is the difference between the upper & lower quartiles

IQR = UQ – LQ. 50% of data lies between LQ and UQ.

· **Percentiles** divide a set of data into 100 equal parts.

· **Deciles** divide a set of data into 10 equal parts.

· **Variance** is a measure of how much the data vary from the mean (the mean of the squared deviations from the mean)

· **Standard deviation** is the positive square root of the variance (sd is never negative)

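Example: standard deviation of a list of values 3, 4, 6, 10, 12, 13

*n* = 6, $\sum x = 3 + 4 + 6 + 10 + 12 + 13 = 48$, $\sum x^2 = 3^2 + 4^2 + 6^2 + 10^2 + 12^2 + 13^2 = 474$, mean = 48 ÷ 6 = 8

$$\text{sd} = \sqrt{\frac{\sum x^2}{n} - \text{mean}^2} = \sqrt{\frac{474}{6} - 8^2} = \sqrt{79 - 64} = \sqrt{15} \approx 3.87$$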
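The example can be checked with a few lines of Python. This sketch uses the form with Σx² ÷ n minus the mean squared (dividing by n, not n – 1); the function name is made up.

```python
from math import sqrt

def standard_deviation(values):
    """Standard deviation as sqrt(sum of x squared / n - mean squared)."""
    n = len(values)
    mean = sum(values) / n
    variance = sum(x * x for x in values) / n - mean ** 2
    return sqrt(variance)

sd = standard_deviation([3, 4, 6, 10, 12, 13])   # sqrt(474/6 - 8**2) = sqrt(15), about 3.87
```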

## Standard deviation of frequency

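The **standard deviation from a frequency distribution** is calculated by creating 2 new columns (*fx* and *fx*^{2}) from the first 2 columns (*x* and *f*) and summing them to help you work out the standard deviation:

$$\text{sd} = \sqrt{\frac{\sum fx^2}{\sum f} - \left(\frac{\sum fx}{\sum f}\right)^2}$$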

If you have **grouped frequency data** (0 ≤ *x* < 5, 5≤ *x* < 10 etc) then the best you can do is estimate the standard deviation using the mid-point (2.5, 7.5 etc) and carry out the same process as above.

## Standardised scores and Box Plots

These can be used to *compare* 2 different sets of data. They are defined as:

$$\text{Standardised score} = \frac{\text{value} - \text{mean}}{\text{standard deviation}}$$

A standardised score of zero is the mean. Standardised scores >0 are above average and standardised scores <0 are below average.
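To see how standardised scores allow a comparison, here is a small sketch. The marks, means and standard deviations are hypothetical.

```python
def standardised_score(value, mean, sd):
    """(value - mean) / standard deviation."""
    return (value - mean) / sd

# Hypothetical marks: 70 in Maths (mean 60, sd 5) and 75 in English (mean 70, sd 10)
maths = standardised_score(70, 60, 5)      # 2.0
english = standardised_score(75, 70, 10)   # 0.5
```

Although the English mark is higher, the Maths mark is further above its average relative to the spread of marks.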

**Box plots**

Using a scale for your data, mark where the LQ, median, UQ, min & max lie. Join the LQ, median and UQ up to form a box. The whiskers extend to the min & max value. However if there are outliers (see definition below) the whiskers should extend to the highest and lowest values that are not outliers and the outliers should be marked with an ‘x’.

An **outlier** is any point that is more than 1.5 times the IQR below the lower quartile or more than 1.5 times the IQR above the upper quartile.
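The outlier rule can be sketched as follows. The data set is hypothetical; with 7 values its LQ and UQ are the 2nd and 6th values.

```python
def outlier_limits(lq, uq):
    """Values beyond these limits count as outliers."""
    iqr = uq - lq
    return lq - 1.5 * iqr, uq + 1.5 * iqr

def find_outliers(values, lq, uq):
    low, high = outlier_limits(lq, uq)
    return [v for v in values if v < low or v > high]

data = [2, 10, 12, 14, 16, 18, 35]          # LQ = 10, UQ = 18, so IQR = 8
limits = outlier_limits(10, 18)             # (-2.0, 30.0)
outliers = find_outliers(data, 10, 18)      # [35]
```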

## Skewness and Use of measures of location & spread

If median is right in middle of LQ & UQ then distribution is **symmetrical** &

mean = mode = median

If median closer to UQ than LQ then data **negatively skewed** and

mode > median > mean. Data bunched to right.

If median closer to LQ than UQ then data **positively skewed** and

mode < median < mean. Data bunched to left.

**Use of measures of location & spread**

Use mean with standard deviation & the median with range or IQR. Main advantage of mean is that it uses all data. Main advantage of median is that it is not affected by extreme values.

## SCATTER DIAGRAMS AND CORRELATION

If the points on a scatter diagram lie approximately on a straight line there is a linear relationship between 2 variables. Correlation is a measure of the *strength* of the linear relationship between the 2 variables.

**Positive correlation** - one variable increases as the other increases.

**Negative correlation** - one variable decreases as the other increases.

Association does not necessarily mean there is correlation (there may be a non-linear relationship).

When a change in 1 variable directly causes a change in another variable, there is a **causal relationship** between them. (Correlation does not necessarily mean there is a causal relationship.)

One way to calculate correlation is using Spearman’s rank correlation coefficient (SRCC). You will be given a table of data (see first 2 columns below) & you will need to rank the data (see 3rd & 4th columns below). If 2 data points are tied (eg the 2nd and 3rd are tied) then you rank them both by their average (ie in the example 2.5).

## SCATTER DIAGRAMS AND CORRELATION 2

Example: “%” is the % scored by pupils in a test; “IQ” is their IQ. Note there is data from 9 pupils:

Once the data are ranked, find “d”, the difference between the ranks, and d^{2}, the square of each difference. Then find Σd^{2} and use the formula on the formula sheet ie

$$\text{SRCC} = 1 - \frac{6\sum d^2}{n(n^2 - 1)} = 0.904 \text{ (3dp)}$$

Note that n is the number of students ie 9.

We see in the above example that the SRCC is close to +1 so we have strong positive correlation. If SRCC is close to -1 we have strong negative correlation. If SRCC is close to zero we have no linear correlation.
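The ranking and formula steps can be sketched in Python. The data here are hypothetical (5 pupils, not the 9 from the example), and tied values share the average of their ranks as described earlier.

```python
def average_ranks(values):
    """Rank from 1 (largest) upward; tied values share the average of their ranks."""
    ordered = sorted(values, reverse=True)
    ranks = {}
    for v in set(values):
        first = ordered.index(v) + 1          # first rank position taken by v
        count = ordered.count(v)
        ranks[v] = first + (count - 1) / 2    # average of the tied positions
    return [ranks[v] for v in values]

def spearman(xs, ys):
    """1 - 6 * sum(d^2) / (n * (n^2 - 1))."""
    n = len(xs)
    d_squared = sum((rx - ry) ** 2
                    for rx, ry in zip(average_ranks(xs), average_ranks(ys)))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))

# Hypothetical data for 5 pupils
test_pct = [80, 65, 65, 50, 40]       # ranks 1, 2.5, 2.5, 4, 5
iq = [120, 110, 115, 100, 105]        # ranks 1, 3, 2, 5, 4
r = spearman(test_pct, iq)            # sum of d^2 = 2.5, so 1 - 15/120 = 0.875
```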

A **line of best fit** is a straight line drawn so that the plotted points on a scatter diagram are approximately evenly scattered either side of the line. If possible it should be drawn through the mean point $(\bar{x}, \bar{y})$.

## Lines of best fit

**Interpolation** is when you find values *within* the range of the values you are given.

**Extrapolation** is when you find values *outside* the range of the values you are given.

The **equation of the line of best fit** is $y = ax + b$ where *a* is the gradient of the line and *b* is the intercept with the y-axis.

Find the gradient by finding 2 points on the line, say $(x_1, y_1)$ and $(x_2, y_2)$, and calculating:

$$a = \frac{y_2 - y_1}{x_2 - x_1}$$

ie difference in y coordinates divided by difference in x coordinates

*b* can be read off the graph or calculated from b = y – ax using any point (*x,y*) on the line.
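A short sketch of finding the gradient and intercept from 2 points on the line. The points (2, 7) and (6, 15) are hypothetical.

```python
def line_through(p1, p2):
    """Gradient a and intercept b of the line y = ax + b through 2 points."""
    (x1, y1), (x2, y2) = p1, p2
    a = (y2 - y1) / (x2 - x1)     # difference in y over difference in x
    b = y1 - a * x1               # b = y - ax using the first point
    return a, b

a, b = line_through((2, 7), (6, 15))   # a = 2.0, b = 3.0, so y = 2x + 3
```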

*(see mymaths – GCSE statistics – scatter graphs – revision – tabs 1-5)*

## Non-linear correlation and time series

Sometimes the data are related by a non-linear relationship. Equations of typical curves include:

$y = \frac{a}{x} + b$ (reciprocal curve), $y = ax^2 + b$ (quadratic curve), $y = a\sqrt{x} + b$ (square root curve)

**TIME SERIES**

**Time series** are plotted on a line graph and use data measured over a period of time to show **trends**. They are plotted with time on the x-axis.

A long-term trend is the way a graph appears to be going over a long period of time. There may be a *rising* trend, a *falling* trend or a *level* trend.

Some time series graphs show repeating patterns. This is called **seasonal variation** (think of ice cream sales through a year – they go up in summer)

## Time series 2

A **moving average** is a way of smoothing out these seasonal variations.

Eg: to calculate the 4 point moving average of a set of data, take the first 4 pieces of data and work out their mean. Then take the 2nd, 3rd, 4th & 5th pieces of data and work out their mean etc until you reach the last 4 pieces of data. Remember when plotting them to plot the moving average *midway* (on the x-axis) between the data values it represents ie between the 2nd and 3rd data values for the first moving average.

Note you can calculate other moving averages if required (eg 3-point, 5-point etc)
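A sketch of the moving average calculation, with the x-positions at which each average would be plotted. The quarterly sales figures are hypothetical.

```python
def moving_averages(values, points=4):
    """Mean of each run of `points` consecutive values."""
    return [sum(values[i:i + points]) / points
            for i in range(len(values) - points + 1)]

def plot_positions(count, points=4):
    """Midway position (counting data values from 1) for each moving average."""
    return [i + (points + 1) / 2 for i in range(count - points + 1)]

sales = [10, 20, 30, 40, 14, 24, 34, 44]          # hypothetical quarterly sales
averages = moving_averages(sales)                  # [25.0, 26.0, 27.0, 28.0, 29.0]
positions = plot_positions(len(sales))             # [2.5, 3.5, 4.5, 5.5, 6.5]
```

The raw figures jump around each quarter, but the moving averages rise steadily, which is the trend.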

A **trend line** is like a line of **best fit** through the moving averages and shows the overall trend of the data.

The **seasonal effect** is the difference between the actual value and the trend line value.

## Time series 3

To **predict a value in the future**, calculate the **average seasonal effect** which is the mean of all the seasonal effects for the same quarter every year. Then predict the value by reading the value from the trend line and adding the average seasonal effect.

The **equation of the trend line** is $y = ax + b$ where *a* is the gradient and *b* is the intercept with the y-axis. (*a* and *b* can be calculated in the same way as for scatter diagram lines of best fit)

## INDEX NUMBERS

**Index numbers** are used to compare a quantity, value or price of an item over time.

$$\text{Index number} = \frac{\text{current value}}{\text{value in base year}} \times 100$$

A **chain base index number** tells you the annual percentage change.

$$\text{Chain base index number} = \frac{\text{current value}}{\text{previous year's value}} \times 100$$
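The two kinds of index number can be compared with a short sketch. The prices are hypothetical, with the first year used as the base year.

```python
def index_number(current, base):
    """Current value as a percentage of the base value."""
    return 100 * current / base

def chain_base_index(values):
    """Each year's value as a percentage of the previous year's value."""
    return [index_number(curr, prev) for prev, curr in zip(values, values[1:])]

prices = [200, 210, 231]                                  # hypothetical prices in 3 years
simple = [index_number(p, prices[0]) for p in prices]     # [100.0, 105.0, 115.5]
chain = chain_base_index(prices)                          # [105.0, 110.0]
```

A chain base index of 110 means the price rose by 10% compared with the previous year.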

To find a **weighted index number** you need to calculate the index number for every element & then find the weighted average of all the elements

*(see mymaths – GCSE statistics – weighting – revision – tab 2)*

## INDEX NUMBERS 2

The **retail price index** is a weighted mean of the price relatives of goods & services. The weightings are chosen to show the spending habits of an average household.

## Probability

A probability of 0 means the event is impossible

A probability of 1 means the event is certain to happen.

**Outcomes** are the possible results of a trial or experiment

**Equally likely outcomes** are outcomes that have the same chance of happening.

$$\text{Probability of an event} = \frac{\text{number of successful outcomes}}{\text{total number of outcomes}}$$

## Probability 2

**Relative frequency** – if we don’t know the probability of something we can estimate it by repeating an experiment over & over again and calculating:

$$\text{Relative frequency} = \frac{\text{number of successful outcomes}}{\text{total number of trials}}$$

The more times you repeat the experiment, the nearer your estimate for the probability should be to the true value.

The **sample space** is a list of all possible outcomes.

A **Venn diagram** is a diagram representing a sample space.

A set of events is **exhaustive** if the set contains all possible outcomes.

Two events are **independent** if the outcome of 1 event does not affect the outcome of the other event. For 2 independent events A and B ; P(A and B) = P(A) x P(B)

## Tree diagrams

**Tree diagrams** can be used to calculate probabilities. *Multiply* along branches to calculate probability of each outcome. Then if more than one outcome is needed, *add* the probabilities.

Watch out – look to see if the question says “with” or “without” replacement as it will affect the probabilities for the second and later events.

## Conditional probability

**Conditional probability** is the probability of A *given* that B has already happened.

**Conditional probability from two way tables**

eg this shows what main courses & desserts 23 students chose:

We can see the probability someone chooses beef is 11/23. We can also find the conditional probability that someone has crumble *given* that they had beef. Look at the beef column and see the answer is 4/11.

**Conditional probability from Venn diagrams**

Venn diagrams are useful for working out probabilities. For example the following Venn diagram shows the number of 6th form pupils studying English and Maths. 7+4 pupils study English; 12+4 study Maths; 4 study both and 7 study neither.

## Conditional probability 2

We can also use this Venn diagram to calculate some conditional probabilities eg the probability that a pupil studies Maths *given* that they study English. Look at the English circle only and see that the answer is 4/11.
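The same answer comes from dividing P(English and Maths) by P(English). A sketch using the counts from the Venn diagram described above (the variable names are made up):

```python
from fractions import Fraction

english_only, both, maths_only, neither = 7, 4, 12, 7
total = english_only + both + maths_only + neither        # 30 pupils

p_english = Fraction(english_only + both, total)          # 11/30
p_both = Fraction(both, total)                            # 4/30
p_maths_given_english = p_both / p_english                # 4/11
```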

## PROBABILITY DISTRIBUTIONS

A **probability distribution** is a list of all possible outcomes together with their probabilities.

A **discrete uniform distribution** is where all the outcomes are equally likely. Eg a fair dice: P(1) = P(2) = P(3) = P(4) = P(5) = P(6) = 1/6

## Binomial distribution

The **binomial distribution** is used in an experiment where there are just 2 outcomes (success & failure). The experiment is repeated *n* times.

P(success) = p and P(failure) = q.

Obviously p + q = 1 so we can also say P(failure) = 1 – p.

If *n* binomial trials are conducted, the probability for each event will be the terms of the expansion of (p+q)^{n}

**Example**: the probability that a tulip bulb will flower is 0.3. If I plant 4 in my garden what is the probability that more than one will flower? You may use:

(p+q)^{4} = p^{4} + 4p^{3}q + 6p^{2}q^{2} + 4pq^{3} + q^{4}

We have p = 0.3 so q = 0.7 and “more than one” means 2, 3 or 4 so we need to use:

p^{4} + 4p^{3}q + 6p^{2}q^{2}

So we have (0.3^{4}) + (4 x 0.3^{3} x 0.7) + (6 x 0.3^{2} x 0.7^{2}) = 0.3483
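The tulip example can be checked by adding the binomial terms for 2, 3 and 4 flowers. This sketch uses `math.comb` for the coefficients 1, 4, 6, 4, 1; the function name is made up.

```python
from math import comb

def binomial_probability(k, n, p):
    """Probability of exactly k successes in n trials with P(success) = p."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

# More than one of 4 bulbs flowering, with p = 0.3
p_more_than_one = sum(binomial_probability(k, 4, 0.3) for k in (2, 3, 4))   # about 0.3483
```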

## Normal distribution

The **normal distribution** is a distribution often found in nature for continuous variables (eg height or weight). When plotted it looks like a bell shaped curve.

The mean (μ), mode & median are the same in a normal distribution and are on the line of symmetry in the middle. The standard deviation (σ) describes how spread out the data are.

About 95% of the data lie within 2 standard deviations of the mean ie between μ - 2σ and μ + 2σ.

About 99.8% (ie nearly all; the exact figure is closer to 99.7%) of the data lie within 3 standard deviations of the mean ie between μ - 3σ and μ + 3σ
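These proportions can be checked numerically. The sketch below uses the error function from Python's `math` module, which goes beyond GCSE but gives the exact normal proportions; the function name is made up.

```python
from math import erf, sqrt

def proportion_within(k):
    """Proportion of a normal distribution within k standard deviations of the mean."""
    return erf(k / sqrt(2))

within_2 = proportion_within(2)   # about 0.954
within_3 = proportion_within(3)   # about 0.997
```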

## Quality assurance

Quality assurance is checking that the items a process produces stay within acceptable limits.

A control chart is a time-series chart that is used for process control.

If the plotted values are within the warning limits the process is under control.

If a value is between the warning & action limits, another sample is taken.

If a value is outside the action limits, the process is stopped & the machine reset.

Warning limits are usually set at μ ± 2σ and action limits at μ ± 3σ
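The control chart rules can be sketched as a simple decision function. The target mean of 500 and standard deviation of 2 are hypothetical, and the treatment of values exactly on a limit is an assumption.

```python
def control_action(value, mu, sigma):
    """Decide what to do with a plotted value, using limits at mu ± 2σ and mu ± 3σ."""
    distance = abs(value - mu)
    if distance > 3 * sigma:
        return "stop process and reset machine"    # outside the action limits
    if distance > 2 * sigma:
        return "take another sample"               # between warning and action limits
    return "under control"                         # within the warning limits

# Hypothetical process: mu = 500, sigma = 2, so warning limits 496-504, action limits 494-506
results = [control_action(v, 500, 2) for v in (503, 505, 507)]
# ["under control", "take another sample", "stop process and reset machine"]
```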
