In our everyday lives, we are required to make decisions based upon our statistical intuitions. Often, these involve the comparison of two groups, such as luxury versus family cars and their suitability. Research has shown that the mean difference affects judgements where two sets of data are compared, but the variability of the data has only a minor influence, if any at all. However, prior research has tended to present raw data as simple lists of values. Here, we investigated whether displaying data visually, in the form of parallel dot plots, would lead viewers to incorporate variability information. In Experiment 1, we asked a large sample of people to compare two fictional groups (children who drank ‘Brain Juice’ versus water) in a one-shot design, where only a single comparison was made. Our results confirmed that only the mean difference between the groups predicted subsequent judgements of how much they differed, in line with previous work using lists of numbers. In Experiment 2, we asked each participant to make multiple comparisons, with both the mean difference and the pooled standard deviation varying across data sets they were shown. Here, we found that both sources of information were correctly incorporated when making responses. Taken together, we suggest that increasing the salience of variability information, through manipulating this factor across items seen, encourages viewers to consider this in their judgements. Such findings may have useful applications for best practices when teaching difficult concepts like sampling variation.

When deciding whether two groups are different on some measure, one of the most important concepts to understand is the mean or “average”. Indeed, many teachers have focussed on determining the best ways to convey this idea to students at an early age, both through calculation and visual impression (

For several years, studies have investigated what has been termed ‘informal inferential reasoning’, where judgements are made based on prior knowledge but not formal statistical procedures. Here, we consider whether people are sensitive to both the average and the spread of data when asked to make such judgements. Recent research provides some evidence that both the mean differences and the set variances correctly influence decisions about which of two groups is larger when the data are presented as lists of raw values (

So far, the raw data for the two groups have been presented as lists of numbers from which participants were expected to make summary judgements. In addition, several studies have investigated the potential effects of displaying summary statistics such as the mean, sample size, and standard deviation (e.g., using visual analogue scales;

Also of relevance to the current research is how people understand visual representations of data more generally. As mentioned, little is known about how people compare two sets of raw data presented visually. However, evidence suggests that even our interpretations of simple line graphs, depicting three variables, are often incomplete and incorrect, and such graphs require complex processes in order to comprehend (

To our knowledge, only one study has displayed raw data (that is, each value separately) visually.

In the current work, we investigate the possibility that viewers previously failed to incorporate information regarding variability because data were presented as lists of numbers, summary values, or in graphically inaccessible ways. We hypothesise that visually presenting the data using dot plots may provide this type of information in a readily accessible format. In two experiments, we investigate whether this presentation method will result in both the means and standard deviations influencing participants’ responses.

In this first experiment, we investigated whether the means and standard deviations of the simulated data would affect responses when tested in a one-shot design, where each participant made only a single comparison.

We recruited participants from three sources in order to incorporate a wide range of ages and education levels. The first group (

The three groups represented convenience samples that, when combined, would provide a spread of ages and education levels. Large amounts of variation along these dimensions were also present within groups, particularly within the first group, which comprised members of the public. As such, the uneven sizes of these nominal groups were not, in themselves, important. In total, 165 people (112 women; age

All participants provided verbal consent before taking part, and were given both a verbal and written debriefing after completion. The experiment’s design and procedure were approved by the university psychology department’s ethics committee (identification number 484) and conform to the Declaration of Helsinki.

We gave participants a pen-and-paper questionnaire describing a study in which a new product, ‘Brain Juice’, was being tested for its memory-boosting ability. Participants were informed that in this fictional study, one group of 20 children drank water before their memory test while a second group of 20 children drank Brain Juice. The two groups were reported as identical in all other ways. The children’s memory test scores were then presented in a graph on the questionnaire for the participants to examine (see

Example version of a graphical representation of the data shown to participants in Experiment 1 (Condition 4 in

We created seven versions of this pen-and-paper questionnaire (see

| Condition | n | Water M | Water SD | Brain Juice M | Brain Juice SD | Cohen’s d | Ratings M (SD) |
|---|---|---|---|---|---|---|---|
| 1 | 24 | 40 | 10 | 45 | 10 | 0.5 | 3.21 (2.02) |
| 2 | 24 | 40 | 20 | 50 | 20 | 0.5 | 3.67 (1.76) |
| 3 | 24 | 40 | 10 | 50 | 10 | 1.0 | 3.83 (1.55) |
| 4 | 24 | 40 | 10 | 60 | 10 | 2.0 | 5.29 (1.49) |
| 5 | 23 | 40 | 5 | 50 | 5 | 2.0 | 3.96 (1.85) |
| 6 | 23 | 40 | 10 | 80 | 10 | 4.0 | 6.17 (1.64) |
| 7 | 23 | 40 | 2.5 | 50 | 2.5 | 4.0 | 4.35 (1.72) |
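The effect sizes in the table above follow the standard definition of Cohen’s d for two groups sharing a common standard deviation. A minimal sketch (the function name is ours, not the authors’):

```python
def cohens_d(m1, m2, sd_pooled):
    """Standardised mean difference: (M2 - M1) / pooled SD."""
    return (m2 - m1) / sd_pooled

# Condition 1: water M = 40, Brain Juice M = 45, pooled SD = 10
cohens_d(40, 45, 10)   # -> 0.5
# Conditions 6 and 7 share d = 4.0 despite very different raw differences
cohens_d(40, 80, 10)   # -> 4.0
cohens_d(40, 50, 2.5)  # -> 4.0
```

Conditions 4 and 5, and 6 and 7, form such pairs: equal d but different mean differences and spreads, allowing the two cues to be separated.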

The children’s test scores were produced using customised MATLAB software. For each set of values, 20 normally distributed random numbers were generated and then standardised, resulting in a mean of zero and a standard deviation of one. These were then multiplied by the standard deviation specified for that condition (see
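The generation procedure can be sketched as follows. This is a Python approximation of the described MATLAB routine; the function name, and the use of the population formula when standardising, are our assumptions:

```python
import random

def simulate_scores(n=20, mean=40.0, sd=10.0, seed=None):
    """Draw n normal deviates, force them to mean 0 and SD 1 exactly,
    then rescale so the sample has exactly the requested mean and SD."""
    rng = random.Random(seed)
    z = [rng.gauss(0, 1) for _ in range(n)]
    m = sum(z) / n
    s = (sum((v - m) ** 2 for v in z) / n) ** 0.5  # population SD (assumption)
    z = [(v - m) / s for v in z]                   # now mean 0, SD 1 exactly
    return [mean + sd * v for v in z]

# e.g., a Brain Juice group for Condition 4: M = 60, SD = 10
juice = simulate_scores(20, 60.0, 10.0, seed=1)
```

Because the deviates are standardised before rescaling, every generated set matches its condition’s mean and standard deviation exactly rather than only in expectation.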

Each participant received one version of the questionnaire only, determined by the order in which they took part – the first person was given Condition 1, the second Condition 2, and so on, with the eighth starting at 1 again.

After participants had read the description of the fictional study and examined the graph that followed, the questionnaire asked, “If you were asked to give a rating, how large do you think the Brain Juice improvement was?” Participants circled their answer on a 10-point labelled rating scale from 0 (‘none’) to 9 (‘very large’). Next, participants read, “If a reporter asked you if the Brain Juice group did better than the children who drank water, what would you say?” Answers were given by circling either ‘yes’ or ‘no’. This second question was designed to investigate real-world outcomes, where participants are required to make a decision regarding, for example, the purchase of one particular car over another.

Finally, demographic information was collected: age, sex, and how much mathematics/statistics education participants had previously received. For this question, several options were provided (e.g., secondary school, undergraduate degree) for participants to select, or they could choose ‘other’ and provide an open response.

Throughout testing, participants were instructed not to confer, and were not shown other versions of the questionnaire.

Participant data and visual stimuli can be found in the

The data were analysed using multiple regression in order to determine which factors predicted participants’ judgements of the Brain Juice improvement. First, the amount of statistics education that participants had received was converted to an ordinal variable, ranging from 1 (primary school) to 5 (PhD). Next, several regression models were explored with participants’ ratings as the dependent variable, and condition variables (mean difference, pooled standard deviation, Cohen’s

First, we averaged ratings across participants for each condition (see

In Models 1-3 (see the table below), only the mean difference was a significant individual predictor of participants’ ratings. Although Model 4 shows an increase in explanatory power (adjusted R²) relative to Model 1 if the pooled standard deviation is included, this increase is not statistically significant, R²_change = .030, F_change(1, 4) = 1.53, p = .284.

| Model | Variable | Beta | t | p | Adjusted R² | F |
|---|---|---|---|---|---|---|
| 1 | Mean difference | 0.944 | 6.42 | .001 | .870 | 41.27 |
| 2 | Pooled standard deviation | -0.143 | -0.32 | .760 | -.176 | 0.10 |
| 3 | Cohen’s d | 0.742 | 2.48 | .056 | .461 | 6.14 |
| 4 | Mean difference | 0.950 | 6.79 | .002 | .883 | 23.58 |
|   | Pooled standard deviation | -0.173 | -1.24 | .284 | | |
| 5 | Mean difference | 0.829 | 3.10 | .054 | .858 | 13.08 |
|   | Pooled standard deviation | -0.056 | -0.22 | .844 | | |
|   | Cohen’s d | 0.186 | 0.55 | .620 | | |

Note. Adjusted R² and F values refer to each model as a whole.
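As a sanity check, the condition-level Model 1 can be reproduced from the table values alone: in a one-predictor regression, the standardised beta equals Pearson’s r. A plain-Python sketch (not the authors’ original analysis code):

```python
# Condition-level values taken from the tables above (Experiment 1)
mean_diff = [5, 10, 10, 20, 10, 40, 10]                 # Brain Juice M - water M
ratings   = [3.21, 3.67, 3.83, 5.29, 3.96, 6.17, 4.35]  # mean ratings per condition

def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

r = pearson_r(mean_diff, ratings)            # standardised beta for Model 1
r2_adj = 1 - (1 - r**2) * (7 - 1) / (7 - 2)  # adjusted R-squared, n = 7, k = 1
# round(r, 3) -> 0.944 and round(r2_adj, 3) -> 0.870, matching Model 1
```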

Next, we used multiple regression in order to model participants’ individual ratings. This approach allows us to consider the potential influence of individual differences (sex, age, education level).

In Models 1-4 (see

| Model | Variable | Beta | t | p | Adjusted R² | F |
|---|---|---|---|---|---|---|
| 1 | Mean difference | 0.464 | 6.69 | < .001 | .211 | 44.78 |
| 2 | Mean difference | 0.466 | 6.73 | < .001 | .213 | 23.21 |
|   | Pooled standard deviation | -0.085 | -1.22 | .223 | | |
| 3 | Cohen’s d | 0.365 | 5.01 | < .001 | .128 | 25.12 |
| 4 | Mean difference | 0.409 | 3.39 | < .001 | .210 | 15.53 |
|   | Pooled standard deviation | -0.029 | -0.25 | .806 | | |
|   | Cohen’s d | 0.089 | 0.59 | .558 | | |
| 5 | Mean difference | 0.590 | 2.33 | .021 | .198 | 6.74 |
|   | Education | 0.089 | 0.69 | .489 | | |
|   | Sex | -0.079 | -0.61 | .544 | | |
|   | Age | 0.022 | 0.18 | .860 | | |
|   | Mean difference × Education | -0.237 | -0.94 | .350 | | |
|   | Mean difference × Sex | 0.229 | 0.81 | .420 | | |
|   | Mean difference × Age | -0.145 | -0.66 | .508 | | |

Note. Adjusted R² and F values refer to each model as a whole.

While Cohen’s d was a significant predictor in isolation (Model 3), adding the pooled standard deviation to Model 1 provided no significant improvement in fit, R²_change = .007, F_change(1, 162) = 1.50, p = .223.

In Model 5, the demographic variables and their two-way interactions with the mean difference were included in the model, along with the mean difference itself. However, only the mean difference was a significant predictor of ratings. Indeed, if these additional variables and interactions are added to Model 1 in a second step, they provide no significant improvement over the original model, R²_change = .010, F_change(6, 156) = 0.35, p > .05.

We carried out an independent samples t test comparing the ratings of participants who responded ‘yes’ with those who responded ‘no’, and a binary logistic regression confirmed that ratings predicted these decisions, χ²(1) = 31.83, R² = 0.264. The odds ratio for the participants’ ratings (1.84) meant that, for a one-unit increase along the rating scale, we expect an 84% increase in the odds of a ‘yes’ response. These results suggest that, although separable, the switch from responding ‘no’ to ‘yes’ occurred over a relatively small interval on our scale.
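The odds-ratio interpretation above is simple arithmetic on the reported value of 1.84 (the variable names below are ours):

```python
import math

odds_ratio = 1.84                      # reported OR per one-unit rating increase
pct_increase = (odds_ratio - 1) * 100  # -> 84.0: an 84% rise in the odds of 'yes'

# The equivalent logistic slope is the log odds ratio
b = math.log(odds_ratio)

# Odds multiply across units: three rating points give 1.84**3, about 6.2x the odds
three_unit_factor = odds_ratio ** 3
```

Because odds compound multiplicatively with each rating point, a relatively small interval on the scale is enough to flip the predominant response from ‘no’ to ‘yes’.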

We also considered how well the three condition variables predicted participants’ decisions. As individual predictors, Cohen’s d provided the best fit (R² = 0.094) in comparison with the mean difference (0.047) and the pooled standard deviation (0.065). However, by considering combinations of these predictors, we find that a model including the mean difference and the pooled standard deviation provides the best fit, χ²(2) = 12.30, R² = 0.108. The addition of Cohen’s d provided no further improvement, χ²(1) = 0.12, p > .05.

In the second experiment, we investigated whether the means and standard deviations of the simulated data would affect responses when tested repeatedly, with each participant making multiple comparisons across data sets in which both factors varied.

Thirty-three students (31 women; age

All participants provided written informed consent, and were given both a verbal and written debriefing after completion. The experiment’s design and procedure were approved by the university psychology department’s ethics committee and conform to the Declaration of Helsinki.

The materials used here were similar to those presented in Experiment 1, with a few important differences. First, all text and graphs were presented on a computer using custom MATLAB software, and responses were collected using the keyboard. Second, a fully crossed design (mean differences: 5, 10, 20, 40; pooled standard deviation: 2.5, 5, 10, 20) with all possible value combinations was used to create 16 conditions. In this experiment, each participant was presented with all 16 graphs (in a randomised order), and was required to provide a rating for each of them.
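The fully crossed design can be expressed directly as a Cartesian product (a sketch; the authors’ MATLAB implementation will differ):

```python
import itertools
import random

mean_diffs = [5, 10, 20, 40]    # Brain Juice minus water group means
pooled_sds = [2.5, 5, 10, 20]   # shared standard deviation for both groups

# Every mean difference paired with every pooled SD gives the 16 conditions
conditions = list(itertools.product(mean_diffs, pooled_sds))

# Each participant rated all 16 graphs in their own random order
random.shuffle(conditions)
```

Crossing the two factors fully is what lets the regression analyses below separate their influences: across the 16 graphs, the mean difference and the pooled standard deviation are uncorrelated.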

The children’s test scores were generated as in Experiment 1 for each condition. However, new sets of test scores could be produced for each participant since this was a computer-based task. As such, every participant saw a graph where the mean difference was 10 and the pooled standard deviation was 20 (for example), but the raw data were newly generated for each instance. Using different graphs across participants, we were able to rule out any influence of particular distributions (e.g., the presence of outliers which might affect perceptions), in comparison with Experiment 1, where the same graph was always presented for a given condition.

Finally, only one response was required for each graph, and the question itself was reworded from Experiment 1. Here, participants were asked, “If you were asked how much Brain Juice improves memory, what would you say?” This rewording was used to give a more inferential tone, encouraging participants to predict general/future outcomes rather than simply describing the data presented. Participants entered their responses using a 10-point rating scale from 0 (‘not at all’) to 9 (‘very large’).

After reading the description of the fictional study onscreen and examining the graph depicted below, participants were required to respond to the rating-scale question using the keyboard. No time constraints were imposed. Once a response had been recorded, the raw data on the onscreen graph changed to reflect a new condition. Only the graph itself changed, while the text remained unaltered. Participants rated all 16 graphs in a random order.

Finally, demographic information was collected: age, sex, and how much statistics training/education participants had previously received (in months/years). In addition, participants were invited to provide written responses to two open-ended questions: “What were the main changes you noticed across the graphs that you rated?” and “How did the groups seem to differ from each other (when they did)?” Lastly, following

Participant data and visual stimuli can be found in the

As in Experiment 1, the data were analysed in order to determine which factors predicted participants’ judgements of the Brain Juice improvement. Several models were explored with participants’ ratings as the dependent variable and condition variables (mean difference, pooled standard deviation, Cohen’s

First, we averaged ratings across participants for each condition (data presented in the

In Models 1-3 (see the table below), the mean difference and Cohen’s d, but not the pooled standard deviation, were significant individual predictors of ratings. However, unlike in Experiment 1, adding the pooled standard deviation to Model 1 significantly increased its explanatory power, R²_change = .179, F_change(1, 13) = 104.44, p < .001, as did adding Cohen’s d, R²_change = .092, F_change(1, 13) = 10.94, p < .01.

| Model | Variable | Beta | t | p | Adjusted R² | F |
|---|---|---|---|---|---|---|
| 1 | Mean difference | 0.894 | 7.46 | < .001 | .785 | 55.66 |
| 2 | Pooled standard deviation | -0.423 | -1.75 | .103 | .120 | 3.05 |
| 3 | Cohen’s d | 0.799 | 4.98 | < .001 | .613 | 24.75 |
| 4 | Mean difference | 0.894 | 21.61 | < .001 | .974 | 285.68 |
|   | Pooled standard deviation | -0.423 | -10.22 | < .001 | | |
| 5 | Mean difference | 0.874 | 13.63 | < .001 | .973 | 178.41 |
|   | Pooled standard deviation | -0.406 | -6.91 | < .001 | | |
|   | Cohen’s d | 0.032 | 0.42 | .684 | | |

Note. Adjusted R² and F values refer to each model as a whole.

Next, we used generalised linear mixed models in order to investigate participants’ individual ratings, with each participant’s unique ID included in the model as a random term to account for data collected repeatedly from the same person. Condition variables (mean difference, pooled standard deviation, Cohen’s

In Models 1-3 (see

| Model | Variable | Beta | t | p | AICc | F |
|---|---|---|---|---|---|---|
| 1 | Mean difference | 0.690 | 25.39 | < .001 | 2177.44 | 644.64 |
| 2 | Pooled standard deviation | -0.326 | -8.46 | < .001 | 2521.68 | 71.63 |
| 3 | Cohen’s d | 0.617 | 20.21 | < .001 | 2289.91 | 408.35 |
| 4 | Mean difference | 0.690 | 30.14 | < .001 | 2014.37 | 555.87 |
|   | Pooled standard deviation | -0.326 | -14.26 | < .001 | | |
| 5 | Mean difference | 0.675 | 19.63 | < .001 | 2019.26 | 370.22 |
|   | Pooled standard deviation | -0.313 | -9.95 | < .001 | | |
|   | Cohen’s d | 0.024 | 0.60 | .548 | | |

Next, we consider the inclusion of participants’ individual differences as predictors. Using Model 4 (above) as our starting point, we find that the addition of how much statistics training each participant has had, along with this variable’s interactions with the two original predictors, produces no improvement in the model (AICc = 2044.34, all additional coefficient

Participants’ written responses to the two open-ended questions (“What were the main changes you noticed across the graphs that you rated?”; “How did the groups seem to differ from each other (when they did)?”) were not analysed formally. These questions were simply included in order to determine whether viewers noticed specifically that the spread of the data varied across the graphs. Although such observations do not guarantee that the information was correctly utilised in the responses they had previously given, they at least confirm that this manipulation was salient to participants.

From reading the written responses, it is clear that participants noticed this particular change. (Remember that only two factors varied across the 16 graphs: the mean difference and the pooled standard deviation.) There were numerous mentions of “closer together”, “how far apart”, “spread out”, “distribution”, “grouped together (not scattered)”, and so on. We conclude from this coarse evidence that participants were explicitly aware that the variability of the data sets changed across trials. Our regression analyses (presented above) confirm that such information was used when participants gave their responses.

The aim of this research was to determine how people compare sets of data when these are presented visually in a way that was hypothesised to make both the averages and the variance within each group salient. By making this information accessible, we predicted that participants would utilise within-group variance in their decisions. Previous research has shown that people are heavily influenced by the mean difference (between-group variability) but place less (if any) importance on within-group variability (

These two seemingly contrasting findings are likely the result of the two different experimental procedures used here. Experiment 1 suggests that, when faced with two sets of data, people do not naturally incorporate a consideration of variability when making decisions – their ratings are driven solely by the mean difference. Experiment 2, however, supports the idea that viewing 16 graphs in succession encourages the (correct) use of the pooled standard deviation. From their written responses, we know that participants noticed the changes in variability across trials. Viewers are, therefore, able to utilise information regarding the spread of the data in principle, but it may be important to draw their attention to this feature (here, through its manipulation over the course of the experiment). Unfortunately, we were unable to determine whether participants in Experiment 1, when presented with a single graph, considered variation and simply failed to incorporate it into subsequent decisions. This would be an important question for future research.

In Experiment 1, when forced to make a binary decision regarding the outcome of the Brain Juice intervention, we found that participants’ ratings were a strong predictor of their subsequent choices. However, our results also suggest that the mean difference

The wording of the question in Experiment 1 (“…how large do you think the Brain Juice improvement was?”) may have been interpreted as descriptive rather than inferential. Perhaps participants’ responses were limited to the specific samples rather than a more generalised statement about the Brain Juice product and its effects. Importantly, showing that participants fail to incorporate variability information even in this situation is informative. In the second experiment, we reworded the question to perhaps imply a consideration of future outcomes beyond the specific samples presented (“if you were asked how much Brain Juice improves memory…”). Although a subtle difference, this may have helped participants to think more generally about treatment effects in a broader context. Unfortunately, we are unable to quantify the effects of this change (if any) within the current data, but further investigation might consider the influence of this type of framing on judgements.

We predicted that displaying the raw data visually would help participants to use the spreads of the two groups in their judgements. While such information may be perceived more easily than when simple lists of numbers were presented in previous work (

In both experiments described here, we found no predictive effect of the level of statistical education that participants had previously received, or their knowledge of specific statistical terms, in line with previous research (

Perhaps another way to encourage participants to consider variability when making their judgements is to draw their attention to it explicitly. For two groups of participants, we might ask only one group to first judge whether the level of variability is the same or different in the two parallel dot plots. We predict that this group, who first considered the variability before making their difference judgements, would show a greater variability influence in their subsequent ratings. Again, this presumes that people would know what to do with this type of information, but Experiment 2 (presented here) suggests that this may well be the case.

Previous research has started to explore the effect of sample sizes on judgements (e.g.,

In conclusion, we extend previous research showing that participants fail to take into account the importance of within-group variability when making decisions about group differences. While information regarding the variability within the data does not seem to influence perceptions in isolation, we see that manipulating variability across situations may increase the salience of this factor, encouraging viewers to consider this additional source of information. Our results have important implications for statistical educational approaches, where visual displays may be a more suitable format for highlighting variability changes across items. In an applied context, these types of data sets might prove useful as a tool for conveying difficult concepts like variability to students.

The authors have no funding to report.

The authors have declared that no competing interests exist.

The authors have no support to report.