Data handling: Represent, analyse and interpret data using various techniques

# Unit 1: Use various techniques for data collection, representation and interpretation

Natashia Bearam-Edmunds

### Unit outcomes

By the end of this unit you will be able to:

• Identify situations or issues that can be dealt with through statistical methods.
• Make resolutions to maximise efficiency from given data which has been organised and graphically represented.

## What you should know

Before you start this unit, make sure you can:

• understand basic principles of statistics. You can revise the following statistics units from the previous levels:

## Introduction

Social media has a huge impact on the way people interact and the decisions they make. It can also highlight issues of public importance. A major environmental issue is the need to reduce plastic waste.

This has been widely publicised and the negative effect of plastic on the ocean is well documented. But what if we wanted to find out how social media users react to plastic pollution? How could we assess the reaction? To draw a conclusion about this question we need to collect information.

The calculation of statistics always starts with collecting information. Why do you think it is important to get information and analyse statistics about plastic waste?

If consumers are becoming more concerned about plastic waste this will influence decisions about the type of products they buy. Businesses need to pay attention to the growing environmental concerns so they can adapt and change their focus to avoid financial pitfalls. In fact, a survey has been done on this very topic. It concluded that people are talking more about the plastics problem on social media, and they are googling the topic more, too. Below is a snapshot of that survey.

From environmental research to sports, statistical calculations are used in almost every field. As you will no doubt encounter statistics somewhere, it is important to be able to analyse statistics and understand how they are computed.

To make an informed decision about a current problem, such as plastic pollution, you will need to research the problem and compute statistics. Statistics are calculated for research purposes in many fields, but not every question justifies the cost and effort of performing statistical research.

Think about situations around you that possibly need further research. So, how do we know if a problem warrants further study? The following will guide your decision to conduct statistical research.

• Can the issue be studied and what is the purpose of the investigation?
• Does it justify further research?
• Is it worth the time, money and effort that will go into the research?
• Is there money available to investigate the issue?
• Are there people with the right skills available to conduct the research?
• Is the size of the sample reasonable to investigate?
• Can you formulate the hypothesis?

A hypothesis is a statement that must be proved or tested through research (observation or experimentation). It is an educated guess and expresses the supposed relationship between two variables. Remember that a variable is something that changes and can have different values or conditions.

For example, if you suspect that time watching TV negatively influences exam results, then your hypothesis could be that the more time you spend watching TV the worse you perform at exams. The variables are time spent watching TV and exam performance.

A hypothesis test involves collecting data from a sample and evaluating the data. Then, a statistician makes a decision whether or not there is sufficient evidence, based on analysis of the data, to reject or accept the hypothesis.

### Note

Hypothesis testing is not examinable but it is the basis for most statistical calculations. For your own interest you can learn more about hypothesis testing by watching the video “Simple hypothesis testing” when you have internet access.

Simple hypothesis testing (Duration: 06.25)

When an issue needs further statistical research we must collect, record, organise and interpret the data using the methods discussed in detail in levels 2 and 3. We will revise those methods next.

## Data collection

These are the different types of data that we have worked with so far.

Qualitative data deals with descriptions that can be observed but not measured. For example, colours, size, tastes, and appearance.

Categorical data are qualitative. For example, the hair colour of people at a shopping mall.

Quantitative data deals with numerical data that can be counted or measured. For example, length, height, weight, time, cost, and number of people.

Quantitative data are divided into discrete and continuous.

Discrete data can only take separate values, usually whole numbers obtained by counting. For example, the number of people attending a maths course.

Continuous data can take any value within a range and are obtained by measuring. For example, the heights of learners in an NCV level 4 maths class.

Data sources are varied and include the internet, surveys, censuses and existing records. Often questionnaires, observations and interviews are used to collect data.

In statistics, we generally want to study a population. You can think of a population as a large collection of persons, things, or objects under study. To study the population, we select a sample. The idea of sampling is to select a portion (or subset) of the larger population and study that portion (the sample) to gain information about the population. Data are the result of sampling from a population.

Because it takes a lot of time and money to examine an entire population, sampling is a very practical technique. From the sample data, we can calculate a statistic. A statistic is a number that represents a property of the sample.

The statistic is an estimate of a population parameter. A parameter is a numerical characteristic of the whole population that can be estimated by a statistic.
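To make the link between a statistic and a parameter concrete, here is a minimal Python sketch. The population of 1 000 heights is invented for illustration (it is not data from this unit), and the names `population`, `sample`, `parameter` and `statistic` are our own.

```python
import random
import statistics

# Hypothetical population: heights (in cm) of 1 000 learners, invented for this sketch.
random.seed(1)  # fixed seed so the sketch gives the same result on every run
population = [random.randint(150, 190) for _ in range(1000)]

# Parameter: a numerical characteristic of the whole population.
parameter = statistics.mean(population)

# Sample: a random subset of 50 learners drawn from the population.
sample = random.sample(population, 50)

# Statistic: the same characteristic calculated from the sample only.
statistic = statistics.mean(sample)

print(f"Population mean (parameter): {parameter:.1f} cm")
print(f"Sample mean (statistic):     {statistic:.1f} cm")
```

In practice we usually cannot calculate the parameter because we cannot measure the whole population. The sketch calculates both only so that the two values can be compared.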

Data can be collected by sampling in many ways. The simplest way is direct observation.
For example, if you want to find out how many bicycles pass a busy intersection during rush hour traffic, you can stand close to the intersection and count the number of bicycles that pass by in that interval.

Statistics can be a powerful tool in research. Unfortunately, statistics can also have faults. Sample bias is one such fault. Bias is favouritism when collecting data, whether deliberate or accidental, that leads to lopsided, misleading results. Bias can occur in the way the sample is chosen and the way the data are collected.

For example, suppose we wanted to find out how many learners played sport at a college and chose only the male learners to be part of the survey. This would give misleading results because the sample is not representative of the entire college population, which includes females.

It is important to keep in mind that sampling bias refers to the method of sampling, not the sample itself.

### Avoiding bias when selecting a sample

The methods used to collect data must ensure that the data is reliable. This means that it is data that we can trust. Data cannot be trusted unless it has been collected in a way that makes sure that every member of the population under investigation has the same chance of being selected in the sample.

Sample bias occurs when the sample drawn from a population does not represent that population, for example because one group is over-represented or left out. The way to avoid sample bias is to take a random sample. A sample is random if every member of the population has an equal chance of being selected.

In addition to being random, the sample must be of an adequate size. In general, the bigger the random sample, the more reliable the results.
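The difference between a biased selection method and a random one can be sketched in Python. The college below is hypothetical: the numbers of male and female learners, and how many of them play sport, are invented purely for illustration.

```python
import random

# Hypothetical college, invented for this sketch: each learner is (gender, plays_sport).
population = (
    [("M", True)] * 420 + [("M", False)] * 180
    + [("F", True)] * 160 + [("F", False)] * 240
)

def proportion_playing(learners):
    """Return the fraction of the given learners who play sport."""
    return sum(plays for _, plays in learners) / len(learners)

true_proportion = proportion_playing(population)

# Biased method: only male learners can be selected.
males_only = [learner for learner in population if learner[0] == "M"]
biased_estimate = proportion_playing(males_only)

# Random method: every learner has an equal chance of being selected.
random.seed(7)
random_sample = random.sample(population, 200)
random_estimate = proportion_playing(random_sample)

print(f"Whole college: {true_proportion:.2f}")
print(f"Males only:    {biased_estimate:.2f}")
print(f"Random sample: {random_estimate:.2f}")
```

With these invented numbers the males-only method gives 0.70 while the true proportion is 0.58. A random sample of 200 learners will usually land much closer to the true value, although the exact estimate depends on which learners happen to be drawn.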

### Example 1.1

Identify the bias in the example below:

Sibusiso collected data from a sample of grade 12 boys at his school to find out how many learners play soccer.

Solution

Since the sample is not random, some individuals are more likely than others to be chosen. Always think very carefully about which individuals are being favoured and how that will influence the results. Sibusiso’s sample is restricted to grade 12 boys, so it is likely to misstate how many learners play soccer and skew the data. The sample must include girls, and learners from other grades, to be a true reflection of the learners at the school.

### Organising data

Data is often recorded electronically by using spreadsheets, computer software, scanners and online surveys. The data can then be sorted and organised by:

• grouping using frequency tables
• tallies on tally tables
• stem and leaf diagrams.

Once data are organised they can be summarised so that they can be better analysed.
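As a rough illustration of two of these methods, the Python sketch below builds a grouped frequency table and a stem-and-leaf diagram. The list of ages is hypothetical and was invented for this sketch.

```python
from collections import Counter, defaultdict

# Hypothetical ages of nine people, invented for this sketch.
ages = [23, 25, 31, 34, 34, 38, 42, 47, 51]

# Frequency table: group each age into a class of width 10 (20-29, 30-39, ...).
frequency = Counter((age // 10) * 10 for age in ages)
for lower in sorted(frequency):
    print(f"{lower}-{lower + 9}: {frequency[lower]}")

# Stem-and-leaf: the tens digit is the stem and the units digit is the leaf.
stems = defaultdict(list)
for age in sorted(ages):
    stems[age // 10].append(age % 10)
for stem in sorted(stems):
    print(stem, "|", " ".join(str(leaf) for leaf in stems[stem]))
```

Sorting the ages before splitting them ensures that the leaves next to each stem appear in increasing order.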

## Summarising data

In levels 2 and 3 we discussed single numerical values that give us information about the data: measures of central tendency and dispersion. The measures we have already learnt about are the:

• mean
• median
• mode
• range
• lower quartile
• upper quartile
• interquartile range
• semi-interquartile range
• variance and
• standard deviation.

You must be able to calculate the above measures for ungrouped and grouped data, where applicable.

The following formulae are used to calculate the estimated mean, median and mode for grouped data.

Mean:

\scriptsize \displaystyle \begin{align*}& \bar{x}=\displaystyle \frac{{\sum {{f}_{i}}{{x}_{i}}}}{n}\text{ }\\&{{f}_{i}}{{x}_{i}}\text{ is the class midpoint multiplied by the frequency}\\&n\text{ is the number of observations}\end{align*}

Median:

\scriptsize \displaystyle \begin{align*}&{{\text{M}}_{e}}=l+\displaystyle \frac{{\left( {\displaystyle \frac{n}{2}-F} \right)}}{f}\times c\\&l\text{ is the lower limit of the median class}\\&n\text{ }~\text{is the number of observations}\\&F\text{ is cumulative frequency of the class before the median class}\\&f\text{ is the frequency of the median class}\\&c\text{ is the class width}\end{align*}

Mode:

\scriptsize \begin{align*}&{{M}_{o}}=l+\displaystyle \frac{{{{f}_{m}}-{{f}_{{m-1}}}}}{{2{{f}_{m}}-{{f}_{{m-1}}}-{{f}_{{m+1}}}}}\times c\\&l\text{ is the lower limit of the modal class}\\&{{f}_{m}}\text{ is the frequency of the modal class}\\&{{f}_{{m-1}}}\text{ is the frequency of the class before the modal class}\\&{{f}_{{m+1}}}\text{ is the frequency of the class after the modal class}\\&c\text{ is the class width}\end{align*}

### Example 1.2

The frequency distribution shows the pulse rates of a group of women.

| Pulse rates of women | Frequency |
|---|---|
| $\scriptsize 60-69$ | $\scriptsize 12$ |
| $\scriptsize 70-79$ | $\scriptsize 14$ |
| $\scriptsize 80-89$ | $\scriptsize 11$ |
| $\scriptsize 90-99$ | $\scriptsize 1$ |
| $\scriptsize 100-109$ | $\scriptsize 1$ |
| $\scriptsize 110-119$ | $\scriptsize 0$ |
| $\scriptsize 120-129$ | $\scriptsize 1$ |

Use the table to answer the following:

1. Find the average pulse rate for the women.
2. If these pulse rates were observed in a sample of women admitted to a private hospital, is this a good indication of the average pulse rate of all patient admissions?
3. Find the median pulse rate.

Solutions

1. Find the class midpoints to apply the formula.

| Pulse rates of women | Class midpoint | Frequency |
|---|---|---|
| $\scriptsize 60-69$ | $\scriptsize 64.5$ | $\scriptsize 12$ |
| $\scriptsize 70-79$ | $\scriptsize 74.5$ | $\scriptsize 14$ |
| $\scriptsize 80-89$ | $\scriptsize 84.5$ | $\scriptsize 11$ |
| $\scriptsize 90-99$ | $\scriptsize 94.5$ | $\scriptsize 1$ |
| $\scriptsize 100-109$ | $\scriptsize 104.5$ | $\scriptsize 1$ |
| $\scriptsize 110-119$ | $\scriptsize 114.5$ | $\scriptsize 0$ |
| $\scriptsize 120-129$ | $\scriptsize 124.5$ | $\scriptsize 1$ |

\scriptsize \displaystyle \begin{align*}\bar{x}&=\displaystyle \frac{{\sum {{f}_{i}}{{x}_{i}}}}{n}\text{ }\\&=\displaystyle \frac{{64.5\times 12+74.5\times 14+84.5\times 11+94.5+104.5+124.5}}{{40}}\\&=\displaystyle \frac{{3\text{ }070}}{{40}}\\&=76.75\end{align*}
The average pulse rate for women is $\scriptsize \displaystyle 76.75$.

2. No, it is not a good representation of the entire population of patient admissions. Male pulse rates are excluded, and the sample size is very small, making this an unreliable sample.
3. The median class is $\scriptsize 70-79$.
\scriptsize \displaystyle \begin{align*}{{\text{M}}_{e}}&=l+\displaystyle \frac{{\left( {\displaystyle \frac{n}{2}-F} \right)}}{f}\times c\\&=70+\displaystyle \frac{{\left( {\displaystyle \frac{{40}}{2}-12} \right)}}{{14}}\times 10\\&=75.71\end{align*}
The median pulse rate is $\scriptsize \displaystyle 75.71$.
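The calculations in this example can be checked with a short Python sketch that applies the grouped-data formulae to the pulse-rate table. The variable names are our own. The modal estimate is not asked for in the example and is included only to show the mode formula at work on the same table.

```python
# Pulse-rate classes from Example 1.2 as (lower limit, upper limit, frequency).
classes = [(60, 69, 12), (70, 79, 14), (80, 89, 11), (90, 99, 1),
           (100, 109, 1), (110, 119, 0), (120, 129, 1)]
width = 10  # class width c, as used in the median calculation above

n = sum(f for _, _, f in classes)

# Estimated mean: sum of (class midpoint x frequency) divided by n.
mean = sum((lower + upper) / 2 * f for lower, upper, f in classes) / n

# Estimated median: find the first class whose cumulative frequency reaches n/2.
cumulative = 0
for lower, _, f in classes:
    if cumulative + f >= n / 2:
        median = lower + (n / 2 - cumulative) / f * width
        break
    cumulative += f

# Estimated mode: the modal class is the class with the highest frequency.
freqs = [f for _, _, f in classes]
m = freqs.index(max(freqs))
f_m = freqs[m]
f_before = freqs[m - 1] if m > 0 else 0
f_after = freqs[m + 1] if m < len(freqs) - 1 else 0
mode = classes[m][0] + (f_m - f_before) / (2 * f_m - f_before - f_after) * width

print(round(mean, 2), round(median, 2), round(mode, 2))  # 76.75 75.71 74.0
```

The mean and median agree with the values worked out by hand. The modal class is $\scriptsize 70-79$, which gives an estimated mode of $\scriptsize 74$.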

### Example 1.3

On a timed maths test, the lower quartile for time it took to finish the exam was at $\scriptsize \displaystyle 35$ minutes. Interpret the first quartile in the context of this situation.

Solution

This means that $\scriptsize \displaystyle 25\%$ of learners finished the exam in less than $\scriptsize \displaystyle 35$ minutes, or we can say $\scriptsize \displaystyle 75\%$ of learners finished the exam in more than $\scriptsize \displaystyle 35$ minutes.
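The same interpretation can be illustrated with a small Python sketch. The twelve finishing times below are hypothetical, invented for this sketch, so the lower quartile differs from the 35 minutes in the example. Python's `statistics.quantiles` uses its own default method for locating quartiles, which may differ slightly from the method you use by hand.

```python
import statistics

# Hypothetical finishing times in minutes for 12 learners, invented for this sketch.
times = [28, 30, 33, 35, 36, 38, 40, 41, 43, 45, 47, 50]

# quantiles(..., n=4) returns the three cut points Q1, Q2 (the median) and Q3.
q1, q2, q3 = statistics.quantiles(times, n=4)

# Share of learners who finished faster than the lower quartile.
share_below_q1 = sum(t < q1 for t in times) / len(times)

print(f"Lower quartile: {q1} minutes")
print(f"Share of learners below Q1: {share_below_q1:.0%}")
```

For this invented data the lower quartile is 33.5 minutes and three of the twelve learners, or 25%, finished faster than that.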

## Representing data

We have used different types of graphs to represent data. Graphs represent data well because they give a picture of the data that is easy to interpret.

Some graphs are better for displaying certain kinds of information than others. The type of graph depends mostly on the type of data that needs to be represented.

| Representation | Advantages |
|---|---|
| Stem-and-leaf diagram | Used to plot data and look at the distribution. All data values within a class are visible. |
| Box-and-whisker diagram | Used to organise data visually. Easy to see the five-number-summary. |
| Bar graph | Used for showing discrete quantitative data or data in categories. Bar graphs allow us to compare the quantities of different categories, for example, the exam results of different subjects. They are a really good way to show relative sizes. |
| Compound bar graph | Used to compare two or more characteristics for each category. For example, we could use a double-bar graph to compare the differences between male and female preferences for sport to watch. |
| Histogram | Used to represent continuous data that is grouped into equal class intervals, for example height, weight, etc. Histograms are useful to show the way the data is spread out. |
| Pie chart | Used to show a whole divided into parts. They show how the parts relate to each other and how the parts relate to a whole. They do not show the quantities involved. You can use pie charts to show the relative sizes of many things, such as what type of phone people prefer, etc. |
| Broken line graph | Used to show trends or changes in quantities over time, where the categories are related to each other or follow on from each other. For example the categories might be consecutive times, days, months, or years. |
| Ogive (cumulative frequency graph) | Used to determine how many data values lie above or below a particular value in a data set. Ogives are useful for determining the median, percentiles and five number summary of data. |
| Scatter plot | Used to graph data points that have two values associated with them. Data values have two independent measurements, for example, maths marks and science marks. |

You do not need to draw the statistical graphs again in level 4 but you are expected to interpret given graphs and answer questions based on the graphs.

### Example 1.4

The stem-and-leaf diagram shows Drew’s calculus test marks (in percentages) for the year.

| Stem | Leaf |
|---|---|
| $\scriptsize 3$ | $\scriptsize 5$ |
| $\scriptsize 4$ | $\scriptsize 3$ $\scriptsize 9$ |
| $\scriptsize 5$ | $\scriptsize 8$ $\scriptsize 9$ $\scriptsize 9$ |
| $\scriptsize 6$ | $\scriptsize 5$ $\scriptsize 7$ |
| $\scriptsize 7$ | $\scriptsize 2$ $\scriptsize 5$ $\scriptsize 5$ $\scriptsize 5$ |
| $\scriptsize 8$ | $\scriptsize 1$ $\scriptsize 4$ |

1. How many calculus tests did Drew write?
2. What is his highest mark?
3. What is the modal mark?
4. Calculate the mean mark to the nearest percent.

Solution

Remember: The stem-and-leaf diagram is a good choice when the data sets are small. To create the diagram, divide each observation of data into a stem and a leaf. The leaf is the final significant digit and the stem is made up of the remaining leading digits. For example, $\scriptsize 35$ has stem $\scriptsize 3$ and leaf $\scriptsize 5$. The decimal $\scriptsize 8.7$ has stem $\scriptsize 8$ and leaf $\scriptsize 7$. To draw the stem-and-leaf diagram, list the stems vertically from smallest to largest. Draw a vertical line to the right of the stems. Then write the leaves in increasing order next to their corresponding stem.

1. There are $\scriptsize 14$ marks listed, so Drew wrote $\scriptsize 14$ tests.
2. His highest mark is $\scriptsize 84\%$.
3. The modal mark is the one that occurs most often. His modal mark is $\scriptsize 75\%$ as it appears three times.
4. Mean mark:
\scriptsize \begin{align*}\bar{x}&=\displaystyle \frac{{\sum{x}}}{n}\\&=\displaystyle \frac{{35+43+49+58+59+59+65+67+72+3(75)+81+84}}{{14}}\\&=\displaystyle \frac{{897}}{{14}}\\&\approx 64\%\end{align*}
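The answers to this example can be checked with a short Python sketch that rebuilds Drew's marks from the stem-and-leaf diagram. The dictionary layout and variable names are our own choice.

```python
import statistics

# Stem-and-leaf diagram from Example 1.4: stem (tens digit) -> leaves (units digits).
stem_and_leaf = {3: [5], 4: [3, 9], 5: [8, 9, 9], 6: [5, 7], 7: [2, 5, 5, 5], 8: [1, 4]}

# Rebuild the individual marks: mark = 10 x stem + leaf.
marks = [10 * stem + leaf for stem, leaves in stem_and_leaf.items() for leaf in leaves]

number_of_tests = len(marks)
highest_mark = max(marks)
modal_mark = statistics.mode(marks)
mean_mark = round(statistics.mean(marks))  # rounded to the nearest percent

print(number_of_tests, highest_mark, modal_mark, mean_mark)  # 14 84 75 64
```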

### Example 1.5

The weights of a random sample of boys from a sports club were recorded. The cumulative frequency graph (ogive) below represents the recorded weights.

1. How many of the boys weighed between $\scriptsize \displaystyle 90$ and $\scriptsize \displaystyle 100$ kilograms?
2. Estimate the median weight of the boys.
3. If there were $\scriptsize \displaystyle 250$ boys in the club, estimate how many of them would weigh less than $\scriptsize \displaystyle ~80$ kilograms?
4. Which other graph(s) could have been used to represent the data?

Solutions

1. $\scriptsize 42-28=14$ weighed between $\scriptsize \displaystyle 90$ and $\scriptsize \displaystyle 100$ kilograms.
2. Approximately $\scriptsize \displaystyle 88\text{ kg}$.
3. $\scriptsize \displaystyle 15$ out of $\scriptsize \displaystyle 50$ boys weigh less than $\scriptsize \displaystyle ~80$ kilograms so $\scriptsize 75$ boys $\scriptsize \left( {\displaystyle \frac{{15}}{{50}}\times 250=75} \right)$ out of the total of $\scriptsize \displaystyle 250$ would weigh less than $\scriptsize \displaystyle ~80$ kilograms.
4. A histogram would be an appropriate way to represent the data.

### Example 1.6

A group of learners count the number of sweets they each have. This is a histogram describing the data they collected.

A cat jumps onto the table and all their notes land on the floor, mixed up! Help them find which of the following data sets matches the above histogram:

Data set A

 $\scriptsize 2$ $\scriptsize 1$ $\scriptsize 20$ $\scriptsize 10$ $\scriptsize 5$ $\scriptsize 3$ $\scriptsize 10$ $\scriptsize 2$ $\scriptsize 6$ $\scriptsize 1$ $\scriptsize 2$ $\scriptsize 2$ $\scriptsize 17$ $\scriptsize 3$ $\scriptsize 18$ $\scriptsize 3$ $\scriptsize 7$ $\scriptsize 10$ $\scriptsize 8$ $\scriptsize 18$

Data set B

 $\scriptsize 2$ $\scriptsize 9$ $\scriptsize 12$ $\scriptsize 10$ $\scriptsize 5$ $\scriptsize 9$ $\scriptsize 10$ $\scriptsize 13$ $\scriptsize 6$ $\scriptsize 5$ $\scriptsize 11$ $\scriptsize 10$ $\scriptsize 7$ $\scriptsize 2$

Data set C

 $\scriptsize 3$ $\scriptsize 12$ $\scriptsize 16$ $\scriptsize 10$ $\scriptsize 15$ $\scriptsize 17$ $\scriptsize 18$ $\scriptsize 2$ $\scriptsize 3$ $\scriptsize 7$ $\scriptsize 11$ $\scriptsize 12$ $\scriptsize 8$ $\scriptsize 2$ $\scriptsize 7$ $\scriptsize 17$ $\scriptsize 3$ $\scriptsize 11$ $\scriptsize 4$ $\scriptsize 4$

Solution

Count the number of values in each range of the drawn histogram and compare that to the given tables of data.

Data set A has nine values in the $\scriptsize 0-3$ range but the histogram has five values in that range so A does not match the histogram.

Data set B has two values in the $\scriptsize 0-3$ range so it is not the right match for the histogram.

Data set C has five values in the $\scriptsize 0-3$ range and the number of values in each of the other ranges matches too. Therefore, data set C matches the given histogram.
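The counting step can also be sketched in Python. Only the first range is checked here, because five is the only bar height stated in this solution. The sketch assumes that the $\scriptsize 0-3$ range includes the value $\scriptsize 3$. The helper `count_in_range` is our own name.

```python
# The three data sets from Example 1.6.
data_sets = {
    "A": [2, 1, 20, 10, 5, 3, 10, 2, 6, 1, 2, 2, 17, 3, 18, 3, 7, 10, 8, 18],
    "B": [2, 9, 12, 10, 5, 9, 10, 13, 6, 5, 11, 10, 7, 2],
    "C": [3, 12, 16, 10, 15, 17, 18, 2, 3, 7, 11, 12, 8, 2, 7, 17, 3, 11, 4, 4],
}

# Height of the histogram's first bar, as stated in the solution above.
first_bar_height = 5

def count_in_range(values, low, high):
    """Count the values v with low <= v <= high (assumed boundary convention)."""
    return sum(low <= v <= high for v in values)

# Keep only the data sets whose first-range count equals the first bar's height.
matches = [name for name, values in data_sets.items()
           if count_in_range(values, 0, 3) == first_bar_height]
print(matches)  # ['C']
```

A full check would repeat the comparison for every range of the histogram, not only the first.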

### Note

Types of statistical studies (Duration: 10.31)

### Exercise 1.1

1. The box-and-whisker diagrams (plots) show the maths test results in percentages for two tests that learners wrote.

    1. What is the highest mark in test A?
    2. What is the lowest mark in test B?
    3. What is the median mark in test B?
    4. Between what values do $\scriptsize 50\%$ of the marks lie in test A?
    5. What mark did $\scriptsize 25\%$ of learners get less than in test B?
    6. What mark did $\scriptsize 75\%$ of learners get more than in test A?
    7. Which other graph(s) could have been used to represent the data?
2. The cumulative frequency curve shows the percentage improvement in marks of a group of learners after they attended a maths camp.

    1. How many learners attended the maths camp?
    2. How many learners’ marks improved by $\scriptsize 24\%$ to $\scriptsize 40\%$?
    3. How many learners’ marks increased by $\scriptsize 64\%$ or more?
    4. Would a box-and-whisker diagram be an appropriate representation for the type of information we are looking for in this case?

The full solutions are at the end of the unit.

## Summary

In this unit you have learnt the following:

• How to test if issues warrant further scientific research.
• How to identify graphs that best represent a given data set.
• How to compare different data representations.

# Unit 1: Assessment

#### Suggested time to complete: 20 minutes

1. The following ogive shows the test results, in percentages, for a class.

    1. How many learners are in the class?
    2. How many learners got $\scriptsize 70\%$ or less?
    3. $\scriptsize 30\%$ of learners got less than what mark?
    4. If the pass mark is $\scriptsize 50\%$, how many learners passed?
    5. What other graph(s) could have been used to represent the data?
2. The box-and-whisker plot shows the ages of members at a sports club.

    1. How old is the youngest member?
    2. What is the median age?
    3. Between what ages do the middle $\scriptsize 50\%$ of data values lie?
    4. Below what age do $\scriptsize 100\%$ of data values lie?
    5. $\scriptsize 75\%$ of the club membership is older than what age?
    6. What other graph(s) could have been used to represent the data?

The full solutions are at the end of the unit.

# Unit 1: Solutions

### Exercise 1.1

1.
    1. The highest mark in test A is $\scriptsize 90\%$.
    2. The lowest mark in test B is $\scriptsize 40\%$.
    3. The median mark in test B is $\scriptsize 80\%$.
    4. In test A $\scriptsize 50\%$ of the marks lie between $\scriptsize 70\%$ and $\scriptsize 90\%$ (or between $\scriptsize 30\%$ and $\scriptsize 70\%$).
    5. $\scriptsize 25\%$ of learners got less than $\scriptsize 75\%$ in test B.
    6. $\scriptsize 75\%$ of learners got more than $\scriptsize 50\%$ in test A.
    7. Ogives or compound bar graphs could have been used to represent and compare the data.
2.
    1. $\scriptsize 24$
    2. $\scriptsize 15-10=5$
    3. $\scriptsize 4$
    4. No, a box-and-whisker diagram would not be an adequate representation in this case. The ogive is also known as the ‘less than’ graph and we can easily see what percentages/values are below or above a certain point.


### Unit 1: Assessment

1.
    1. $\scriptsize \displaystyle 50$
    2. $\scriptsize \displaystyle 35$
    3. $\scriptsize 50\%$
    4. $\scriptsize \displaystyle 35$ learners passed.
    5. A box-and-whisker diagram or bar graph could have been used.
2.
    1. The youngest member is $\scriptsize 25$ years old.
    2. $\scriptsize \displaystyle 35$ is the median age.
    3. The middle $\scriptsize 50\%$ of data values lie between $\scriptsize 30$ and $\scriptsize 45$.
    4. $\scriptsize 60$
    5. $\scriptsize 75\%$ of the club membership is older than $\scriptsize 30$.
    6. A bar graph or ogive could have been used.
