Data handling: Use variance and regression analysis to interpolate and extrapolate bivariate data

Unit 2: Represent data using a scatter plot

Gill Scott

Unit outcomes

By the end of this unit you will be able to:

• Draw a scatter plot.
• Understand when it is appropriate to use a scatter plot.
• Draw an intuitive line of best fit.

What you should know

Before you start this unit, make sure you can:

Introduction

Very often the purpose of an investigation is to find a relationship between two variables, for example height and mass: a taller person is likely to weigh more than a shorter person. The aim is to find a mathematical expression of the relationship between these variables, because this helps to predict values for data elements that were not included in the set. In this unit, we will plot points to find the mathematical relationship between the two variables of each element of data in a set.

Each element of univariate data has only one variable.

Each element of bivariate data has two variables.

Drawing scatter plots

When each element of data in a dataset consists of two parts, for example height and mass, it is called bivariate data, to indicate that it consists of two variables. A first step in analysing bivariate data is to visualise it by plotting the data elements on an $\scriptsize x\text{-}y$ coordinate system, with each axis representing one of the variables. The resulting plot (the ‘scatter plot’) is then examined to see whether the points approximate the graph of one of the functions we have seen before.

Example 2.1

Eight children’s sweets consumption and sleeping habits were recorded in the table below. Draw a scatter plot of the data, plotting the independent variable on the x-axis and the dependent variable on the y-axis. Explain what you find.

| No. sweets per week | $\scriptsize 15$ | $\scriptsize 9$ | $\scriptsize 10$ | $\scriptsize 6$ | $\scriptsize 23$ | $\scriptsize 8$ | $\scriptsize 13$ | $\scriptsize 3$ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Average hours sleep per day | $\scriptsize 4.5$ | $\scriptsize 5$ | $\scriptsize 6$ | $\scriptsize 7$ | $\scriptsize 3$ | $\scriptsize 3$ | $\scriptsize 4$ | $\scriptsize 8.5$ |

Solution

The data can be plotted as follows:

Looking at the scatter plot above, it appears that the points approximate a straight line, rather than any curve, although they clearly do not fit exactly to one straight line. It is also clear that the points in general have a negative relationship (the line has a negative gradient), with a lower consumption of sweets linked to higher average hours of sleep per day.
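Where graph paper is not at hand, the same data can be given a rough first look in text form. The sketch below is only an illustration: the function name `text_scatter` and the half-hour row spacing are assumptions made for this example, and a hand-drawn or software plot will be far more precise.

```python
# Sketch only: a rough text scatter plot of the Example 2.1 data.
# The function name and grid layout are illustrative assumptions.
sweets = [15, 9, 10, 6, 23, 8, 13, 3]    # x: sweets per week
sleep = [4.5, 5, 6, 7, 3, 3, 4, 8.5]     # y: average hours of sleep per day


def text_scatter(xs, ys, y_step=0.5):
    """Return the lines of a text plot with one '*' per data point."""
    x_max = max(xs)
    rows = []
    y = max(ys)
    while y >= min(ys):
        # Columns stand for x-values 0..x_max; mark points that fall on this row.
        cells = [" "] * (x_max + 1)
        for px, py in zip(xs, ys):
            if py == y:
                cells[px] = "*"
        rows.append(f"{y:4.1f} |" + "".join(cells))
        y -= y_step
    rows.append("     +" + "-" * (x_max + 1))
    return rows


plot_lines = text_scatter(sweets, sleep)
print("\n".join(plot_lines))
```

Reading the printed rows from top to bottom, the stars drift from left to right, which matches the negative relationship described above.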

Take note!

When the function is a straight line, if increased values of the one variable correspond to increased values of the other variable, the relationship is said to be ‘positive’; this corresponds to the line having a positive slope. Similarly, if the line has a negative slope, and low values of one variable correspond to high values of the other variable, the relationship is said to be ‘negative’.

A straight line can be drawn approximating the points in the scatter plot in example 2.1, more or less as follows:

Once the line has been drawn, the normal processes can be used to find its equation.

In the above graph:

• $\scriptsize y\text{-intercept}=c=8$
• The straight line passes through the point $\scriptsize (20,3)$.

\scriptsize \begin{align*}y&=mx+c\\y&=mx+8\\3&=20m+8\\20m&=-5\\m&=-\displaystyle \frac{1}{4}\end{align*}

So the equation of the line is $\scriptsize y=-\displaystyle \frac{1}{4}x+8$.
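The same calculation can be sketched in a few lines of Python, assuming the y-intercept and the second point are read off the drawn line as above. The helper name `predict_sleep` is an assumption made for this example, and a different intuitive line would give slightly different values.

```python
from fractions import Fraction

# Values read off the drawn line in Example 2.1 (an intuitive fit, not a calculated one).
c = Fraction(8)                        # y-intercept
x1, y1 = Fraction(20), Fraction(3)     # a second point on the line

# Rearranging y1 = m*x1 + c gives m = (y1 - c) / x1.
m = (y1 - c) / x1


def predict_sleep(sweets_per_week):
    """Estimate hours of sleep from the fitted line y = m*x + c."""
    return m * sweets_per_week + c


print(m)                  # -1/4
print(predict_sleep(12))  # 5
```

The last line shows how the equation can be used for interpolation: a child eating 12 sweets per week, a value not in the table, would be estimated to sleep about 5 hours per day.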

Take note!

In drawing the line:

• Take care that it follows the general direction followed by the points.
• If one or two points clearly do not follow the general direction, they are probably ‘outliers’, and should be ignored.
• Of those points that do not lie on the line, there should be more or less the same number above the line as below it.
• Clusters of points above and below the line should not occur at the ends of the line.
• Taking the above guidelines into account, the more points that lie on the line, the better.
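One way to check the third guideline is to count the points on each side of a proposed line. The sketch below does this for the Example 2.1 data and the line $\scriptsize y=-\displaystyle \frac{1}{4}x+8$; the function name `count_sides` is an assumption made for this example.

```python
# Example 2.1 data and the intuitive line y = -x/4 + 8.
points = [(15, 4.5), (9, 5), (10, 6), (6, 7), (23, 3), (8, 3), (13, 4), (3, 8.5)]


def line(x):
    return -0.25 * x + 8


def count_sides(data, f):
    """Count how many points lie above, on and below the line y = f(x)."""
    above = sum(1 for x, y in data if y > f(x))
    below = sum(1 for x, y in data if y < f(x))
    on = len(data) - above - below
    return above, on, below


above, on, below = count_sides(points, line)
print(above, on, below)
```

For this line the count is 5 points above, none on the line and 3 below. A reader may judge that as roughly balanced, or as a hint that the line could be shifted slightly upwards.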

Activity 2.1: Draw a scatter plot, and intuitively fit a curve

Activity adapted from an example in Siyavula, Grade 12 Mathematics Chapter 9 p 385

Time required: 15 minutes

What you need:

• a pen or pencil
• paper on which to draw a graph

What to do:

The following points represent numbers of visitors to a website ($\scriptsize y$) on the $\scriptsize x$th day since the establishment of the website.
$\scriptsize (3,5);\text{ (4,7); (5,10); (6,30); (7,39); (8,51); (9,145); (10,200)}$

1. Plot the points on an x-y coordinate system.
2. Answer the following questions:
   1. What are the two variables being compared?
   2. What type of function best fits the data?
   3. Is the relationship between the two variables strong or weak?
   4. Is the relationship between the two variables positive or negative?
3. Using the answers above, describe the relationship between the two variables in one sentence.

What did you find?

1. Plot the points on an x-y coordinate system:
2. .
   1. The variables being compared are the number of daily visitors and the number of days since establishment of the website.
   2. The data fit an exponential function.
   3. The data points do not fit the curve very closely, so the relationship can be described as weak.
   4. As time increases, the number of visitors increases, so the relationship can be described as positive.
3. There is a weak, positive exponential relationship between the number of visitors to the website and the number of days since its establishment.
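An informal way to see why an exponential curve is suggested rather than a straight line is to compare day-to-day differences with day-to-day ratios. This is a rough check under simple assumptions, not a formal test: for a straight line the differences would be roughly constant, while for an exponential curve the ratios would be roughly constant.

```python
# Website-visitor data from Activity 2.1: (day, visitors).
visits = [(3, 5), (4, 7), (5, 10), (6, 30), (7, 39), (8, 51), (9, 145), (10, 200)]

ys = [y for _, y in visits]

# Day-to-day differences (roughly constant for a straight line).
differences = [b - a for a, b in zip(ys, ys[1:])]

# Day-to-day ratios (roughly constant for an exponential curve).
ratios = [round(b / a, 2) for a, b in zip(ys, ys[1:])]

print(differences)
print(ratios)
```

The differences grow from single digits to well over fifty, so a straight line is a poor description. The ratios are all greater than 1 but jump around, which is consistent with describing the exponential fit as positive but weak.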

Note

For more explanations of drawing scatter plots and lines of best fit, have a look at the following sites when you have access to the internet:

Exercise 2.1

1. Identify the function (linear, exponential or quadratic) which would best fit the data in each of the scatter plots below. Describe the relationship (positive or negative) where possible, and the strength of the fit:
1. .
2. .
3. .
4. .
5. .

2. Dr Dandara is a scientist trying to find a cure for a disease that has an $\scriptsize 80\%$ mortality rate. This means that $\scriptsize 80\%$ of people who get the disease will die. He knows of a plant which is used in traditional medicine to treat the disease. He extracts the active ingredient from the plant and tests different dosages (measured in milligrams) on different groups of patients. Examine the data below and complete the questions that follow.

| Dosage (mg) | $\scriptsize 0$ | $\scriptsize 25$ | $\scriptsize 50$ | $\scriptsize 75$ | $\scriptsize 100$ | $\scriptsize 125$ | $\scriptsize 150$ | $\scriptsize 175$ | $\scriptsize 200$ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Mortality rate (%) | $\scriptsize 80$ | $\scriptsize 73$ | $\scriptsize 63$ | $\scriptsize 49$ | $\scriptsize 42$ | $\scriptsize 32$ | $\scriptsize 25$ | $\scriptsize 11$ | $\scriptsize 5$ |

1. Draw a scatter plot of the data.
2. Which function would best fit the data? Draw in the line of best fit, and find its equation.
3. Describe the fit in terms of strength and direction.
3. The enrolment of learners for NC(V) programmes at TVET colleges was reported as follows:

| Year | $\scriptsize 2010$ | $\scriptsize 2011$ | $\scriptsize 2012$ | $\scriptsize 2013$ | $\scriptsize 2014$ | $\scriptsize 2015$ | $\scriptsize 2016$ |
| --- | --- | --- | --- | --- | --- | --- | --- |
| NC(V) enrolment (in thousands) | $\scriptsize 130$ | $\scriptsize 124$ | $\scriptsize 140$ | $\scriptsize 154$ | $\scriptsize 166$ | $\scriptsize 165$ | $\scriptsize 177$ |

1. Draw a scatter plot with the year on the horizontal axis and the enrolment on the vertical axis.
2. Draw a best fit line through the points and find its equation.
3. What is the meaning of the gradient of the line?
4. According to your model, what was the approximate enrolment in the year $\scriptsize 0$? Do you think this answer is realistic? Discuss.

4. Different climate conditions, such as temperature and rainfall, have significant effects on the yield of vegetable and other crops. Farmers need information of this type to work out the best time for planting. The following table matches average temperatures over a $\scriptsize 12\text{-month}$ period in different parts of a country against average crop yields recorded in tonnes per hectare.

| Average monthly temperature | $\scriptsize 8$ | $\scriptsize 10$ | $\scriptsize 13$ | $\scriptsize 15$ | $\scriptsize 18$ | $\scriptsize 20$ | $\scriptsize 21$ | $\scriptsize 19$ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Tonnes per hectare produced | $\scriptsize 10$ | $\scriptsize 16$ | $\scriptsize 22$ | $\scriptsize 24$ | $\scriptsize 23$ | $\scriptsize 21$ | $\scriptsize 18$ | $\scriptsize 20$ |

1. Draw a scatter plot of the data.
2. Which function would best fit the data? Draw in the line of best fit, and give the general form of its equation.
3. Describe the fit in terms of strength.

The full solutions are at the end of the unit.

Summary

In this unit you have learnt the following:

• How to draw a scatter plot.
• When it is appropriate to draw a scatter plot.
• About drawing lines of best fit intuitively.

Unit 2: Assessment

Suggested time to complete: 40 minutes

Keep your solutions to these questions for referring to again in unit 3.

Question 1 adapted from NC(V) Mathematics Level 4 examination, November 2017

1. A study was done to compare electricity usage of geysers that are inside or outside the house. The table below shows the electricity usage (in kilowatt hours) for equivalent water consumption for matched households that have geysers inside the house, and those with geysers outside the house. Nine houses of each type were considered in the study.

| Inside the house (kWh) | $\scriptsize 29$ | $\scriptsize 31$ | $\scriptsize 20$ | $\scriptsize 40$ | $\scriptsize 26$ | $\scriptsize 39$ | $\scriptsize 32$ | $\scriptsize 34$ | $\scriptsize 35$ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Outside the house (kWh) | $\scriptsize 19$ | $\scriptsize 23$ | $\scriptsize 13$ | $\scriptsize 32$ | $\scriptsize 17$ | $\scriptsize 28$ | $\scriptsize 25$ | $\scriptsize 24$ | $\scriptsize 28$ |

1. Draw a scatter plot of the data.
2. Draw a line of best fit.
3. Find the equation of the line.
4. Describe the relationship between the electricity usage when the geyser is inside the house and when it is outside.
2. Tobacco smoking is still one of the world’s largest health problems, although the prevalence of smoking is generally decreasing. The table below shows numbers of deaths (in thousands) in South Africa from smoking, in recent years.

| Year | $\scriptsize 2010$ | $\scriptsize 2011$ | $\scriptsize 2012$ | $\scriptsize 2013$ | $\scriptsize 2014$ | $\scriptsize 2015$ | $\scriptsize 2016$ | $\scriptsize 2017$ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Deaths $\scriptsize ('000)$ | $\scriptsize 38.0$ | $\scriptsize 35.8$ | $\scriptsize 34.1$ | $\scriptsize 32.6$ | $\scriptsize 31.8$ | $\scriptsize 31.5$ | $\scriptsize 31.3$ | $\scriptsize 29.9$ |

1. Draw a scatter plot of the data.
2. Draw the line of best fit.
3. Find the equation of the line.
4. Describe the relationship between the year and the number of deaths from smoking.
3. A college helps learners to complete their national diplomas by negotiating with employers in the region with the aim of placing the learners for work experience. Over recent years they have tracked their engagements with employers against numbers of learners placed in work experience as follows:

| No. employers engaged | $\scriptsize 15$ | $\scriptsize 45$ | $\scriptsize 65$ | $\scriptsize 35$ | $\scriptsize 38$ | $\scriptsize 25$ | $\scriptsize 40$ | $\scriptsize 30$ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| No. learners placed | $\scriptsize 40$ | $\scriptsize 90$ | $\scriptsize 128$ | $\scriptsize 90$ | $\scriptsize 95$ | $\scriptsize 60$ | $\scriptsize 140$ | $\scriptsize 75$ |

1. Draw a scatter plot of the data.
2. Draw the line of best fit.
3. Find the equation of the line.
4. Describe the relationship between the numbers of employers engaged and the numbers of learners placed.

Question 4 taken from NC(V) Mathematics Level 4 examination, November 2019

4. The data below shows the mathematics marks of $\scriptsize 10$ learners at a college for the internal examinations and the external examinations.

| Internal examinations $\scriptsize (x)$ | $\scriptsize 80$ | $\scriptsize 68$ | $\scriptsize 94$ | $\scriptsize 72$ | $\scriptsize 74$ | $\scriptsize 83$ | $\scriptsize 56$ | $\scriptsize 68$ | $\scriptsize 65$ | $\scriptsize 75$ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| External examinations $\scriptsize (y)$ | $\scriptsize 72$ | $\scriptsize 71$ | $\scriptsize 96$ | $\scriptsize 77$ | $\scriptsize 82$ | $\scriptsize 72$ | $\scriptsize 58$ | $\scriptsize 83$ | $\scriptsize 78$ | $\scriptsize 80$ |

1. Draw a scatter plot of the marks in the above table on an x-y plane, with each axis showing values from $\scriptsize 50$ to $\scriptsize 96$.
2. Draw the line of best fit.
3. Find the equation of the line.
4. Describe the relationship between the internal and external examination marks.

The full solutions are at the end of the unit.

Unit 2: Solutions

Exercise 2.1

1. .
1. Weak, negative linear relationship
3. Strong positive, exponential relationship
4. Negative exponential relationship
5. Weak positive linear relationship
2. .
1. .

The data points approximate a straight line.
2. .

The best-fit line passes through points $\scriptsize (0,80)$ and $\scriptsize (200,5)$. (You might have drawn a slightly different straight line, in which case your equation would be a bit different from that calculated below. Check that your line complies with the guidelines in the notes before activity 2.1.)
Using the method you learnt in level 3 subject outcome 3.2 unit 1, or any other method, the equation of this straight line can be found as follows.
\scriptsize \begin{align*}\displaystyle \frac{{y-{{y}_{1}}}}{{x-{{x}_{1}}}}&=\displaystyle \frac{{{{y}_{2}}-{{y}_{1}}}}{{{{x}_{2}}-{{x}_{1}}}}\\\displaystyle \frac{{y-80}}{{x-0}}&=\displaystyle \frac{{5-80}}{{200-0}}\\y-80&=-\displaystyle \frac{{3x}}{8}\\y&=-\displaystyle \frac{{3x}}{8}+80\end{align*}
3. There is a strong, negative linear relationship between the dosage and the mortality rate: as the dosage increases, the mortality rate decreases.
3. .
1. .
2. .

(You might have drawn a slightly different straight line, in which case your equation would be different from that calculated below. Check that your line complies with the guidelines in the notes before activity 2.1.)
The best-fit line passes through the points $\scriptsize (2014,160)$ and $\scriptsize (2016,180)$.
\scriptsize \begin{align*}\displaystyle \frac{{y-{{y}_{1}}}}{{x-{{x}_{1}}}}&=\displaystyle \frac{{{{y}_{2}}-{{y}_{1}}}}{{{{x}_{2}}-{{x}_{1}}}}\\\displaystyle \frac{{y-160}}{{x-2\text{ }014}}&=\displaystyle \frac{{180-160}}{{2\text{ }016-2\text{ }014}}\\y-160&=\displaystyle \frac{{20}}{2}\left( {x-2\text{ }014} \right)\\y&=10x-20\text{ }140+160\\y&=10x-19\text{ 980}\end{align*}
3. The gradient of the line is $\scriptsize 10$. Since enrolment is measured in thousands, this means that the enrolment is increasing by about $\scriptsize 10\text{ 000}$ learners every year.
4. The straight-line model gives $\scriptsize y=-19\text{ 980}$ in year $\scriptsize 0$, that is, a negative enrolment of $\scriptsize 19\text{ 980}$ thousand learners. This is not realistic: enrolment cannot be negative, and the model would need the colleges to have been enrolling NC(V) learners for more than $\scriptsize 2\text{ 000}$ years, while the curriculum was only introduced in $\scriptsize 2007$.
4. .

| Average monthly temperature | $\scriptsize 8$ | $\scriptsize 10$ | $\scriptsize 13$ | $\scriptsize 15$ | $\scriptsize 18$ | $\scriptsize 20$ | $\scriptsize 21$ | $\scriptsize 19$ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Tonnes per hectare produced | $\scriptsize 10$ | $\scriptsize 16$ | $\scriptsize 22$ | $\scriptsize 24$ | $\scriptsize 23$ | $\scriptsize 21$ | $\scriptsize 18$ | $\scriptsize 20$ |

1. .
2. The scatter plot seems to indicate a quadratic function.

$\scriptsize y=-a{{x}^{2}}+bx-c$
Here $\scriptsize a$, $\scriptsize b$ and $\scriptsize c$ are taken as positive numbers. The coefficient of $\scriptsize {{x}^{2}}$ and the constant are written as negative because the curve is concave down, and the y-intercept will clearly be negative if the curve is continued to that point.
3. There is a strong fit between the data and the curve.

Note: You are not required to find equations of best fit lines for scatter plots that do not satisfy or approximate straight lines, but you should be able to recognise the type of graph that best fits the data.
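The two-point method used for the straight lines above can also be sketched in Python as a check on the algebra. The function name `line_through` is an assumption made for this example, and the points are the ones read off the drawn line in question 3, so a differently drawn line would give different values.

```python
from fractions import Fraction


def line_through(p1, p2):
    """Return (m, c) for the line y = m*x + c through two points with integer coordinates."""
    (x1, y1), (x2, y2) = p1, p2
    if x1 == x2:
        raise ValueError("a vertical line has no equation of the form y = mx + c")
    m = Fraction(y2 - y1, x2 - x1)   # gradient from the two-point formula
    c = y1 - m * x1                  # substitute one point to find the intercept
    return m, c


# Points read off the enrolment line in question 3.
m, c = line_through((2014, 160), (2016, 180))
print(m, c)  # 10 -19980
```

Using exact fractions rather than decimals keeps gradients such as $\scriptsize \displaystyle \frac{{13}}{{15}}$ free of rounding, which makes the result easier to compare with a hand calculation.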


Unit 2: Assessment

1. .

| Inside the house (kWh) | $\scriptsize 29$ | $\scriptsize 31$ | $\scriptsize 20$ | $\scriptsize 40$ | $\scriptsize 26$ | $\scriptsize 39$ | $\scriptsize 32$ | $\scriptsize 34$ | $\scriptsize 35$ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Outside the house (kWh) | $\scriptsize 19$ | $\scriptsize 23$ | $\scriptsize 13$ | $\scriptsize 32$ | $\scriptsize 17$ | $\scriptsize 28$ | $\scriptsize 25$ | $\scriptsize 24$ | $\scriptsize 28$ |

1. .
2. .

(You might have drawn a slightly different straight line, in which case your equation would be different from that calculated in c. below. Check that your line complies with the guidelines in the notes before activity 2.1.)
3. The straight line passes through $\scriptsize (20,13)$ and $\scriptsize (35,26)$.
\scriptsize \begin{align*}\displaystyle \frac{{y-{{y}_{1}}}}{{x-{{x}_{1}}}}&=\displaystyle \frac{{{{y}_{2}}-{{y}_{1}}}}{{{{x}_{2}}-{{x}_{1}}}}\\\displaystyle \frac{{y-13}}{{x-20}}&=\displaystyle \frac{{26-13}}{{35-20}}\\y-13&=\displaystyle \frac{{13}}{{15}}(x-20)\\y&=\displaystyle \frac{{13x}}{{15}}-\displaystyle \frac{{52}}{3}+\displaystyle \frac{{39}}{3}\\y&=\displaystyle \frac{{13x}}{{15}}-\displaystyle \frac{{13}}{3}\end{align*}
4. There is a strong, positive relationship between electricity usage when the geyser is inside the house and when it is outside.
2. .

| Year | $\scriptsize 2010$ | $\scriptsize 2011$ | $\scriptsize 2012$ | $\scriptsize 2013$ | $\scriptsize 2014$ | $\scriptsize 2015$ | $\scriptsize 2016$ | $\scriptsize 2017$ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Deaths $\scriptsize ('000)$ | $\scriptsize 38.0$ | $\scriptsize 35.8$ | $\scriptsize 34.1$ | $\scriptsize 32.6$ | $\scriptsize 31.8$ | $\scriptsize 31.5$ | $\scriptsize 31.3$ | $\scriptsize 29.9$ |

1. .
2. .
3. The straight line passes more or less through the points $\scriptsize (2016,31)$ and $\scriptsize (2012,34)$.
(You might have drawn a slightly different straight line, in which case your equation would be a bit different from that calculated below. Check that your line complies with the guidelines in the notes before activity 2.1.)
The equation of the line is the following:
\scriptsize \begin{align*}\displaystyle \frac{{y-{{y}_{1}}}}{{x-{{x}_{1}}}}&=\displaystyle \frac{{{{y}_{2}}-{{y}_{1}}}}{{{{x}_{2}}-{{x}_{1}}}}\\\displaystyle \frac{{y-31}}{{x-2016}}&=\displaystyle \frac{{34-31}}{{2012-2016}}\\y-31&=-\displaystyle \frac{3}{4}(x-2016)\\y&=-\displaystyle \frac{{3x}}{4}+1512+31\\y&=-\displaystyle \frac{{3x}}{4}+1543\end{align*}
4. The relationship between the year and the numbers of deaths from smoking is a strong negative one.
3. .

| No. employers engaged | $\scriptsize 15$ | $\scriptsize 45$ | $\scriptsize 65$ | $\scriptsize 35$ | $\scriptsize 38$ | $\scriptsize 25$ | $\scriptsize 40$ | $\scriptsize 30$ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| No. learners placed | $\scriptsize 40$ | $\scriptsize 90$ | $\scriptsize 128$ | $\scriptsize 90$ | $\scriptsize 95$ | $\scriptsize 60$ | $\scriptsize 140$ | $\scriptsize 75$ |

1. .
2. .
3. The straight line passes more or less through $\scriptsize (20,60)$ and $\scriptsize (60,132)$. (You might have drawn a slightly different straight line, in which case your equation would be a bit different from that calculated below. Check that your line complies with the guidelines in the notes before Activity 2.1.)
The equation of the line is the following:
\scriptsize \begin{align*}\displaystyle \frac{{y-{{y}_{1}}}}{{x-{{x}_{1}}}}&=\displaystyle \frac{{{{y}_{2}}-{{y}_{1}}}}{{{{x}_{2}}-{{x}_{1}}}}\\\displaystyle \frac{{y-60}}{{x-20}}&=\displaystyle \frac{{132-60}}{{60-20}}\\y-60&=\displaystyle \frac{{72}}{{40}}(x-20)\\y&=1.8x-36+60\\y&=1.8x+24\end{align*}
4. There is a positive relationship between numbers of employers engaged and numbers of learners placed, but this is not a very strong relationship.
4. .
1. .
2. .
3. The straight line passes through points $\scriptsize (70,75)$ and $\scriptsize (90,87)$. (You might have drawn a slightly different straight line, in which case your equation would be a bit different from that calculated below. Check that your line complies with the guidelines in the notes before activity 2.1.)
The equation of the line is the following:
\scriptsize \begin{align*}\displaystyle \frac{{y-{{y}_{1}}}}{{x-{{x}_{1}}}}&=\displaystyle \frac{{{{y}_{2}}-{{y}_{1}}}}{{{{x}_{2}}-{{x}_{1}}}}\\\displaystyle \frac{{y-75}}{{x-70}}&=\displaystyle \frac{{87-75}}{{90-70}}\\y-75&=\displaystyle \frac{{12}}{{20}}(x-70)\\y&=\displaystyle \frac{{3x}}{5}-42+75\\y&=\displaystyle \frac{{3x}}{5}+33\end{align*}
4. There is a weak, positive relationship between the internal and external examination marks.
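A quick way to catch sign slips in equations like these is to substitute the two chosen points back into the final equation. The sketch below does this for the lines in questions 1 and 2; the function name `lies_on` is an assumption made for this example, and the check only confirms the algebra, not how well the line fits the data.

```python
from fractions import Fraction


def lies_on(m, c, point):
    """True if the point satisfies y = m*x + c exactly."""
    x, y = point
    return m * x + c == y


# Question 1: y = (13/15)x - 13/3 through (20, 13) and (35, 26).
q1 = (Fraction(13, 15), Fraction(-13, 3))
# Question 2: y = -(3/4)x + 1543 through (2016, 31) and (2012, 34).
q2 = (Fraction(-3, 4), Fraction(1543))

print(all(lies_on(*q1, p) for p in [(20, 13), (35, 26)]))      # True
print(all(lies_on(*q2, p) for p in [(2016, 31), (2012, 34)]))  # True
```

If either check printed `False`, the most likely cause would be an arithmetic slip when multiplying out the bracket or collecting the constants.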
