Data handling: Use variance and regression analysis to interpolate and extrapolate bivariate data

Unit 1: Calculate variance and standard deviation

Gill Scott

Unit outcomes

By the end of this unit you will be able to:

  • Calculate variance for ungrouped data manually.
  • Calculate standard deviation for ungrouped data manually.
  • Interpret variance and standard deviation.

What you should know

Before you start this unit, make sure you can:

Introduction

Measures of central tendency of data sets, the mean, median and mode, give a first impression of the characteristics of a data set. From the work that you have already done, you saw that although these measures can be useful, they can also be misleading. So, it is necessary to investigate how the data in any set is spread, scattered or dispersed in order to have a complete picture of the data set.

The range is a measure of dispersion, being the spread of data from smallest to largest. The interquartile range (IQR) is a better measure of dispersion than the range. It gives the spread of the middle 50% of the data set around the median. However, the mean is often a better measure of central tendency than the median is, and in this unit we will investigate how data is spread around the mean. The measures of dispersion around the mean are the variance and the standard deviation.

Variance

Suppose that two groups of nine learners wanted to see how long they could balance on a slackline, with each member recording how many seconds passed before they fell off.

The nine members of group A balanced for:

320 sec; 250 sec; 183 sec; 41 sec; 335 sec; 78 sec; 142 sec; 210 sec; 115 sec.

The nine members of group B balanced for:

185 sec; 188 sec; 183 sec; 191 sec; 185 sec; 179 sec; 192 sec; 184 sec; 187 sec.

The mean of each group was calculated:
Group A’s mean:
$\bar{x}_A = \dfrac{320 + 250 + 183 + 41 + 335 + 78 + 142 + 210 + 115}{9} = \dfrac{1\,674}{9} = 186 \text{ sec}$
Group B’s mean:
$\bar{x}_B = \dfrac{185 + 188 + 183 + 191 + 185 + 179 + 192 + 184 + 187}{9} = \dfrac{1\,674}{9} = 186 \text{ sec}$

The means of the two groups are the same but as you can see the recorded data values are very different. The mean does not provide enough information to make a useful comparison of the data sets.
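As a quick check of the arithmetic above, the two means can be computed in a few lines of Python. This is only an illustrative sketch: the variable names `group_a` and `group_b` are assumptions made for this example and are not part of the unit.

```python
# Balancing times in seconds, copied from the two lists above.
group_a = [320, 250, 183, 41, 335, 78, 142, 210, 115]
group_b = [185, 188, 183, 191, 185, 179, 192, 184, 187]

# Mean = sum of the values divided by the number of values.
mean_a = sum(group_a) / len(group_a)
mean_b = sum(group_b) / len(group_b)

# Range = largest value minus smallest value.
range_a = max(group_a) - min(group_a)
range_b = max(group_b) - min(group_b)

print("Means:", mean_a, mean_b)     # both groups have the same mean
print("Ranges:", range_a, range_b)  # but very different ranges
```

The identical means alongside very different ranges illustrate why the mean alone is not enough to compare the two groups.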

The deviation of each value from the mean for the groups was tabulated:

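Group A
Time $x$ (sec)    Deviation from the mean $x - \bar{x}_A$ (sec)
320               $320 - 186 = 134$
250               $250 - 186 = 64$
183               $183 - 186 = -3$
41                $41 - 186 = -145$
335               $335 - 186 = 149$
78                $78 - 186 = -108$
142               $142 - 186 = -44$
210               $210 - 186 = 24$
115               $115 - 186 = -71$

Group B
Time $x$ (sec)    Deviation from the mean $x - \bar{x}_B$ (sec)
185               $185 - 186 = -1$
188               $188 - 186 = 2$
183               $183 - 186 = -3$
191               $191 - 186 = 5$
185               $185 - 186 = -1$
179               $179 - 186 = -7$
192               $192 - 186 = 6$
184               $184 - 186 = -2$
187               $187 - 186 = 1$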

The table shows that although the means for both groups were the same, the times for group A are much more widely dispersed about the mean than the times for group B. We need to investigate the dispersions. Suppose we find the total deviations from the mean for each group:
Sum of Group A’s deviations: $\sum (x - \bar{x}_A) = 134 + 64 - 3 - 145 + 149 - 108 - 44 + 24 - 71 = 0$
Sum of Group B’s deviations: $\sum (x - \bar{x}_B) = -1 + 2 - 3 + 5 - 1 - 7 + 6 - 2 + 1 = 0$

In each group, the negative values cancel out the positive values giving the total of 0 (this will happen for any group; can you see why?). However, the extent of dispersion of data around the mean gives a good idea of how representative the mean is of the data set. Squaring the distance from the mean for each data element gives a positive value for each, and so enables us to look at total spread about the mean, although this value is squared. Thus, the next step is to square each of the deviations from the mean, and to calculate the sum of the squared values:

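Group A:
$\sum (x - \bar{x}_A)^2 = (134)^2 + (64)^2 + (-3)^2 + (-145)^2 + (149)^2 + (-108)^2 + (-44)^2 + (24)^2 + (-71)^2$
$= 17\,956 + 4\,096 + 9 + 21\,025 + 22\,201 + 11\,664 + 1\,936 + 576 + 5\,041 = 84\,504$

Group B:
$\sum (x - \bar{x}_B)^2 = (-1)^2 + (2)^2 + (-3)^2 + (5)^2 + (-1)^2 + (-7)^2 + (6)^2 + (-2)^2 + (1)^2$
$= 1 + 4 + 9 + 25 + 1 + 49 + 36 + 4 + 1 = 130$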
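The deviations and their squares can also be checked with a short Python sketch. The helper name `deviations` is hypothetical and was chosen only for this illustration.

```python
# Balancing times in seconds for the two groups.
group_a = [320, 250, 183, 41, 335, 78, 142, 210, 115]
group_b = [185, 188, 183, 191, 185, 179, 192, 184, 187]

def deviations(data):
    """Return the deviation of each value from the mean of the data."""
    mean = sum(data) / len(data)
    return [x - mean for x in data]

dev_a = deviations(group_a)
dev_b = deviations(group_b)

# The deviations themselves cancel out ...
print(sum(dev_a), sum(dev_b))

# ... but the squared deviations do not.
sum_sq_a = sum(d ** 2 for d in dev_a)
sum_sq_b = sum(d ** 2 for d in dev_b)
print(sum_sq_a, sum_sq_b)
```

Squaring removes the signs, so the large positive and negative deviations in group A add up instead of cancelling.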

The variance, $\sigma^2$, is defined as the average, or mean, of the squared deviations, so the sum for each group must be divided by the number of data elements:
Variance for group A $= \dfrac{\sum (x - \bar{x}_A)^2}{n} = \dfrac{84\,504}{9} = 9\,389.33$

Variance for group B $= \dfrac{\sum (x - \bar{x}_B)^2}{n} = \dfrac{130}{9} = 14.44$

The variance of a data set is the average of the squared deviations of each of the $n$ elements $x$ of the set from the mean $\bar{x}$ of the set:
Variance $= \sigma^2 = \dfrac{\sum (x - \bar{x})^2}{n}$
Notice that the units of variance are squared units.
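The definition translates almost directly into code. The sketch below is one possible way to write it, assuming a hypothetical function name `variance`. It is compared with `statistics.pvariance` from the Python standard library, which also divides by $n$. Note that `statistics.variance` divides by $n - 1$ instead and so gives a different answer from the formula used in this unit.

```python
import math
from statistics import pvariance

def variance(data):
    """Mean of the squared deviations from the mean (divides by n)."""
    n = len(data)
    mean = sum(data) / n
    return sum((x - mean) ** 2 for x in data) / n

group_a = [320, 250, 183, 41, 335, 78, 142, 210, 115]
group_b = [185, 188, 183, 191, 185, 179, 192, 184, 187]

var_a = variance(group_a)
var_b = variance(group_b)
print(f"{var_a:.2f} {var_b:.2f}")

# The standard library's population variance should agree.
print(math.isclose(var_a, pvariance(group_a)))
print(math.isclose(var_b, pvariance(group_b)))
```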

Standard deviation

From the definition above, you can see that the variance is a squared value, which makes it awkward to interpret because the data values themselves are not squared. The other measure of dispersion, the standard deviation, represented by the Greek letter $\sigma$ (lower case ‘sigma’), is the square root of the variance.

Continuing the example above:
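Standard deviation for group A $= \sqrt{\text{variance}} = \sqrt{\dfrac{\sum (x - \bar{x}_A)^2}{n}} = \sqrt{9\,389.33} = 96.90 \text{ sec}$
Standard deviation for group B $= \sqrt{\text{variance}} = \sqrt{\dfrac{\sum (x - \bar{x}_B)^2}{n}} = \sqrt{14.44} = 3.80 \text{ sec}$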

The much larger standard deviation for group A indicates that the data for group A is much more widely spread around the mean than the data for group B. This shows that the mean for group A is much less representative of its data elements than the mean for group B is. The more varied the data values are, the less reliable the mean is as a basis for prediction.

Standard deviation may serve as a measure of uncertainty or accuracy. It gives an idea of how much variation there is from the mean. The standard deviation is the square root of the average squared distance of the values in the data set from their mean. It is never negative, and is always measured in the same units as the data elements of the set.

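Standard deviation $\sigma$ of the $n$ elements $x$ of a data set with mean $\bar{x}$:
$\sigma = \sqrt{\dfrac{\sum (x - \bar{x})^2}{n}}$
Note that the standard deviation is never negative (it is zero only when all the data values are equal), and the units of the standard deviation are the same as the units of the data elements.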
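As a hedged illustration of the formula, the following Python sketch takes the square root of the variance. The function name `std_dev` is an assumption for this example. The standard library function `statistics.pstdev` computes the same quantity and is used here only as a cross-check.

```python
import math
from statistics import pstdev

def std_dev(data):
    """Square root of the mean squared deviation from the mean."""
    n = len(data)
    mean = sum(data) / n
    return math.sqrt(sum((x - mean) ** 2 for x in data) / n)

group_a = [320, 250, 183, 41, 335, 78, 142, 210, 115]
group_b = [185, 188, 183, 191, 185, 179, 192, 184, 187]

sd_a = std_dev(group_a)
sd_b = std_dev(group_b)
print(f"{sd_a:.2f} sec, {sd_b:.2f} sec")

# Cross-check against the standard library.
print(math.isclose(sd_a, pstdev(group_a)), math.isclose(sd_b, pstdev(group_b)))
```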

Note

For other explanations of variance and standard deviation, watch “Variance of a population” (Duration: 08.05) or read through “Describing Variability”.

Take note!

For a fairly normal distribution that is not too skewed by having some very large or very small values:

  • about 68% of the elements of the data set will lie within one standard deviation of the mean
  • about 95% of the elements of the data set will lie within two standard deviations of the mean.
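To see how a particular data set compares with these percentages, one can count the values that fall inside the interval from $\bar{x} - k\sigma$ to $\bar{x} + k\sigma$. The sketch below does this for the two slackline groups. The function name `share_within` and its parameter `k` are hypothetical names introduced for this example.

```python
import math

def share_within(data, k=1):
    """Fraction of values lying within k standard deviations of the mean."""
    n = len(data)
    mean = sum(data) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in data) / n)
    low, high = mean - k * sd, mean + k * sd
    inside = [x for x in data if low <= x <= high]
    return len(inside) / n

group_a = [320, 250, 183, 41, 335, 78, 142, 210, 115]
group_b = [185, 188, 183, 191, 185, 179, 192, 184, 187]

for name, data in (("A", group_a), ("B", group_b)):
    one = share_within(data, 1) * 100
    two = share_within(data, 2) * 100
    print(f"Group {name}: {one:.1f}% within 1 sd, {two:.1f}% within 2 sd")
```

With only nine values in each group, the shares should not be expected to match the percentages above closely; the guideline describes larger, roughly normal data sets.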

Example 1.1

Eight cupcakes from a batch were weighed and their masses recorded as follows:
23 g; 37 g; 25 g; 28 g; 33 g; 31 g; 29 g; 26 g.

  1. Find the range of the masses.
  2. Calculate the mean.
  3. Calculate the variance.
  4. Calculate the standard deviation.

Solutions

  1. Arrange the masses in order: 23 g; 25 g; 26 g; 28 g; 29 g; 31 g; 33 g; 37 g.
    Subtract the smallest mass from the largest: Range $= 37 - 23 = 14 \text{ g}$
  2. Divide the sum of all the masses by the number of cupcakes:
    Mean:
    $\bar{x} = \dfrac{\sum x}{n} = \dfrac{232}{8} = 29 \text{ g}$
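  3. To find the variance, find the deviation of each mass from the mean, and square that.

    Mass $x$ (g)    Deviation from the mean $x - \bar{x}$ (g)    (Deviation)$^2$ $(x - \bar{x})^2$ (g$^2$)
    23              $23 - 29 = -6$                               36
    37              $37 - 29 = 8$                                64
    25              $25 - 29 = -4$                               16
    28              $28 - 29 = -1$                               1
    33              $33 - 29 = 4$                                16
    31              $31 - 29 = 2$                                4
    29              $29 - 29 = 0$                                0
    26              $26 - 29 = -3$                               9

    Variance $= \dfrac{\sum (x - \bar{x})^2}{n} = \dfrac{36 + 64 + 16 + 1 + 16 + 4 + 0 + 9}{8} = \dfrac{146}{8} = 18.25 \text{ g}^2$

  4. Standard deviation is the square root of the variance:
    $\sigma = \sqrt{\dfrac{\sum (x - \bar{x})^2}{n}} = \sqrt{\dfrac{146}{8}} = \sqrt{18.25} = 4.27 \text{ g}$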

You will notice that tabulating the data and the calculations simplifies the application of the formulae.
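The same tabular approach can be sketched in Python, building one row per data value before applying the formulae. The variable names used here (`masses`, `rows`, and so on) are illustrative assumptions only.

```python
# Cupcake masses in grams from Example 1.1.
masses = [23, 37, 25, 28, 33, 31, 29, 26]

n = len(masses)
mean = sum(masses) / n

# One row per cupcake: (mass, deviation from the mean, squared deviation).
rows = [(x, x - mean, (x - mean) ** 2) for x in masses]

print(f"{'x':>4} {'x - mean':>10} {'(x - mean)^2':>14}")
for x, dev, dev_sq in rows:
    print(f"{x:>4} {dev:>10.0f} {dev_sq:>14.0f}")

# Variance is the mean of the last column; standard deviation is its square root.
variance = sum(dev_sq for _, _, dev_sq in rows) / n
std_dev = variance ** 0.5
print(f"Variance = {variance:.2f}, standard deviation = {std_dev:.2f} g")
```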

Activity 1.1: Working with temperatures

Time required: 12 minutes

What you need:

  • a pen and paper
  • a calculator

What to do:

The maximum daily temperatures in Johannesburg in the second week of April 2021 are recorded and tabulated below, alongside those of the second week of January of the same year.

April    Temperature $x$    Deviation from the mean $x - \bar{x}$    (Deviation)$^2$ $(x - \bar{x})^2$
11th     21
12th     26
13th     23
14th     19
15th     25
16th     26
17th     27

January  Temperature $x$    Deviation from the mean $x - \bar{x}$    (Deviation)$^2$ $(x - \bar{x})^2$
10th     25
11th     27
12th     27
13th     25
14th     24
15th     26
16th     28
  1. Work out:
    1. The mean temperature for the week in April (correct to one decimal place).
    2. The mean temperature for the week in January (correct to one decimal place).
  2. Copy and complete the table above for both months.
  3. Work out the variance for April and for January (correct to one decimal place).
  4. Work out the standard deviation for April and for January (correct to one decimal place).
  5. On what percentage of days in each of the months was the maximum temperature within one standard deviation of the mean?
  6. What do the two standard deviations and your calculations show about the spread of data around the respective means?

What did you find?

  1. Means:
    1. April mean $= \bar{x} = \dfrac{\sum x}{7} = \dfrac{167}{7} \approx 23.9$
    2. January mean $= \bar{x} = \dfrac{\sum x}{7} = \dfrac{182}{7} = 26$
  2. Table for April

    April    Temperature $x$    Deviation from the mean $x - \bar{x}$    (Deviation)$^2$ $(x - \bar{x})^2$
    11th     21                 $21 - 23.9 = -2.9$                       8.4
    12th     26                 $26 - 23.9 = 2.1$                        4.4
    13th     23                 $23 - 23.9 = -0.9$                       0.8
    14th     19                 $19 - 23.9 = -4.9$                       24.0
    15th     25                 $25 - 23.9 = 1.1$                        1.2
    16th     26                 $26 - 23.9 = 2.1$                        4.4
    17th     27                 $27 - 23.9 = 3.1$                        9.6

    Table for January

    January  Temperature $x$    Deviation from the mean $x - \bar{x}$    (Deviation)$^2$ $(x - \bar{x})^2$
    10th     25                 $25 - 26 = -1$                           1
    11th     27                 $27 - 26 = 1$                            1
    12th     27                 $27 - 26 = 1$                            1
    13th     25                 $25 - 26 = -1$                           1
    14th     24                 $24 - 26 = -2$                           4
    15th     26                 $26 - 26 = 0$                            0
    16th     28                 $28 - 26 = 2$                            4
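  3. April:
    Variance $= \dfrac{\sum (x - \bar{x})^2}{n} = \dfrac{8.4 + 4.4 + 0.8 + 24.0 + 1.2 + 4.4 + 9.6}{7} = \dfrac{52.8}{7} \approx 7.5$
    Notice that we leave out the units for variance: the ‘square’ of degrees is not helpful here.

    January:
    Variance $= \dfrac{\sum (x - \bar{x})^2}{n} = \dfrac{1 + 1 + 1 + 1 + 4 + 0 + 4}{7} = \dfrac{12}{7} \approx 1.7$
  4. Standard deviation for April:
    $\sigma = \sqrt{\dfrac{\sum (x - \bar{x})^2}{n}} = \sqrt{\dfrac{52.8}{7}} \approx \sqrt{7.5} \approx 2.74$

    Standard deviation for January:
    $\sigma = \sqrt{\dfrac{\sum (x - \bar{x})^2}{n}} = \sqrt{\dfrac{12}{7}} \approx \sqrt{1.7} \approx 1.3$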
  5. April:
    One standard deviation from the mean $= \bar{x} \pm \sigma = 23.9 \pm 2.74$
    So the interval is $[23.9 - 2.74;\ 23.9 + 2.74] = [21.16^\circ;\ 26.64^\circ]$.
    The maximum temperatures on 12, 13, 15 and 16 April fall within this interval.
    $\dfrac{4}{7} \times 100\% = 57.14\%$
    So the maximum temperature on 57.14% of the days of the week in April falls within one standard deviation of the mean.

    January:
    One standard deviation from the mean $= \bar{x} \pm \sigma = 26 \pm 1.3$
    So the interval is $[26 - 1.3;\ 26 + 1.3] = [24.7;\ 27.3]$.
    The maximum temperatures on 10, 11, 12, 13 and 15 January fall within this interval.
    $\dfrac{5}{7} \times 100\% = 71.43\%$
    So the maximum temperature on 71.43% of the days of the week in January falls within one standard deviation of the mean.
  6. The smaller standard deviation for January (1.3, compared with 2.74 for April) shows that the temperatures were more consistent, with fewer fluctuations, in the week in January than in the week in April.
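The worked answers above round the mean, each squared deviation and the variance to one decimal place along the way. The Python sketch below, with hypothetical function names `manual_variance` and `exact_variance`, mimics that step-by-step rounding and compares it with rounding only at the end. It is offered as an illustration of why hand and calculator answers may differ slightly, not as a correction of the activity.

```python
# Maximum daily temperatures from the activity tables.
april = [21, 26, 23, 19, 25, 26, 27]
january = [25, 27, 27, 25, 24, 26, 28]

def manual_variance(data, places=1):
    """Variance with rounding at every step, as in the hand calculation."""
    n = len(data)
    mean = round(sum(data) / n, places)
    squares = [round((x - mean) ** 2, places) for x in data]
    return round(sum(squares) / n, places)

def exact_variance(data):
    """Variance with no intermediate rounding."""
    n = len(data)
    mean = sum(data) / n
    return sum((x - mean) ** 2 for x in data) / n

for name, data in (("April", april), ("January", january)):
    stepwise = manual_variance(data)
    at_end = round(exact_variance(data), 1)
    print(f"{name}: rounded at each step = {stepwise}, rounded at the end = {at_end}")
```

For April the two approaches differ in the first decimal place of the variance, which is a reminder to round as late as possible when accuracy matters.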

Exercise 1.1

  1. World Health Organisation data for 2018 reported numbers of tuberculosis cases per 100 000 in the population for some countries in Southern and Eastern Africa as follows:
    Country Number per 100 000
    Angola 355
    Botswana 275
    Kenya 292
    Lesotho 659
    Malawi 153
    Mozambique 361
    Namibia 524
    South Africa 677
    Zimbabwe 210
    Uganda 200
    United Republic of Tanzania 253
    Zambia 346
    1. What is the range of tuberculosis incidence per 100 000 in the populations across these countries?
    2. What is the mean for the entire region?
    3. What is the standard deviation of numbers of reported tuberculosis cases per 100 000 for the entire region?
    4. What percentage of countries’ tuberculosis incidence falls within one standard deviation from the mean?
  2. World Health Organisation estimated data for 2016 country death rates due to road traffic injuries per 100 000 population are as follows:
    Country Number per 100 000
    Angola 23.6
    Botswana 23.8
    Kenya 27.8
    Lesotho 28.9
    Malawi 31
    Mozambique 30.1
    Namibia 30.4
    South Africa 25.9
    Zimbabwe 34.7
    Eswatini 26.9
    United Republic of Tanzania 29.2
    Zambia 20.9
    1. What is the range of road traffic death rates per 100 000 in the populations across these countries?
    2. What is the mean for the region?
    3. What is the standard deviation of numbers of deaths per 100 000 for the region?
    4. What percentage of countries’ road traffic death rates falls within one standard deviation from the mean?
  3. A manufacturer checks the width of a number of roller bearings from the production line in order to control quality. The following widths were measured, in micrometres (thousandths of a millimetre):
    15 015; 15 101; 15 089; 15 062; 15 111; 15 054; 15 028; 15 137; 15 009; 15 096

    1. Calculate the range.
    2. Calculate the mean.
    3. Calculate the standard deviation.
    4. What percentage of the measurements are within one standard deviation of the mean?

The full solutions are at the end of the unit.

Summary

In this unit you have learnt the following:

  • How to calculate the variance of a data set.
  • How to calculate the standard deviation of a data set.
  • How to interpret the results of calculations of standard deviation of a data set.
