Like any other discipline within the psychological and behavioral sciences, researchers in numerical cognition rely on statistical inference – hypothesis testing and parameter estimation – to verify theoretical claims against observed data. In recent years, Bayesian inference has become increasingly popular as an alternative to classical null hypothesis significance testing (NHST). Bayesian inference operates under a fundamentally different framework from NHST; instead of hard “accept/reject” decisions about a null hypothesis, Bayesian inference works by quantifying evidence about competing hypotheses in light of observed data. Hypotheses that do a better job of predicting data will receive increased support, whereas hypotheses that do not predict the data well will receive decreased support.
Despite its many advantages over NHST and increased calls for its use, the Bayesian approach is still relatively underutilized in the social sciences (van de Schoot, Winter, Ryan, Zondervan-Zwijnenburg, & Depaoli, 2017). One of the possible reasons for this lack of adoption may stem from the fact that Bayesian statistics is rarely taught in applied statistics courses. As a result, some researchers may not feel confident in their ability to apply these methods to their own problems of interest. Also, standard statistical software packages did not include routines for Bayesian inference, and so performing Bayesian inference in the past often required sophisticated programming skills.
In this tutorial paper, we will introduce the reader to the basics of Bayesian inference through the lens of some classic, well-cited studies in numerical cognition. In three detailed examples illustrating -tests, linear regression, and ANOVA, we will walk the reader through performing and interpreting the results of Bayesian analyses using the software package JASP (JASP Team, 2020). JASP is a free, open-source statistical software package offering the user the option of both Bayesian and frequentist analyses, all housed within a familiar, easy-to-use graphical user interface. Below we give a brief introduction to Bayesian hypothesis testing, followed by the three tutorial examples.
Introduction to Bayesian Hypothesis Testing
In this section, we will review and elaborate on the concept of hypothesis testing. Even though the reader of this paper is probably familiar with the general concept of hypothesis testing, we believe that it is nonetheless instructive to go over the basics.
Hypothesis testing can be viewed as a process of validating general claims based on a small sample of the general population. To illustrate, suppose that we are policy makers confronted with a new study program which claims to increase the mathematical abilities of sixth graders in the United States. To ease the exposition we use a (fictitious) scale called the Scale for Advancing Mathematical Ability (SAMA) to measure mathematical ability. As policy makers we know that the population mean SAMA score is and that the standard deviation is for the population of sixth graders that are taught using the standard materials in the US.
The superiority claim about the new study program can be formalized as the working hypothesis . Here refers to the population mean of SAMA score of the same sixth graders, but taught using the new study program instead. Such a treatment population does not (yet) exist and cannot be measured, and is, therefore, unknown.
Nonetheless, we would like to infer the validity of the working hypothesis by running an experiment in which, say, sixth graders are taught with the new study program. Suppose that these particular participants scored an average SAMA score of . Clearly, this sample mean is larger than the status quo population mean of , but can we assume that this finding generalizes to the whole population? If the answer is “yes” and we declare the working hypothesis to be verified, then how confident would we be about such a decision? Below we recall how -values are typically used to make hard “yes” and “no” decisions, and we remind the reader why a -value does not quantify the (un)certainty of such decisions.
To deduce how hypotheses, i.e., claims about the population such as , affect the data, such as , we have to link the former to the latter with a distribution. For instance, let us suppose a normal distribution provides this link. The population parameters, the distribution that links them to the data, and the data together make up a (statistical) model, which allows for quantitative predictions. For instance, the claim that the treatment population is identical to the current population of sixth graders implies the null model comprised of (1) the null hypothesis , (2) the assumption , and (3) normally distributed SAMA scores. When all three assumptions hold true, we can mathematically deduce that the average SAMA score for a group of participants is expected to fall within the interval and with 95% chance.
Null Hypothesis Significance Testing
Reasoning from the general population by deriving statements about potential data outcomes under the assumption that the parameters are known forms the basis of null hypothesis significance testing. By definition, the -value is the chance of seeing the observed outcome and more extreme – but unobservable – outcomes under the null.
Under the null model , the more extreme data consist of potential outcomes of larger than the observed sample mean of , as well as those potential values of that are smaller than the unobserved possible sample mean of (i.e., a potential sample mean that is less than, but equally far from, the hypothesized mean of 50) . Under the null, and the normality assumption in particular, we can deduce that the chance of seeing a potential in or is 5%, thus, . Note that , which is the 95% chance of the non-extreme event of having fall within the interval , if the null model holds true. We use this example to emphasize that -values help us reason from the null model to (potential) data, but they do not allow us to do the reverse; that is, make quantitative statements from the actually observed data about the hypotheses.
A small -value has many potential causes, and for this case we mention four: (1) the null hypothesis is not true; (2) the true variance is actually higher than ; (3) the data are not normally distributed; or (4) the null model actually holds true, but we saw a rare event. Despite the various causes of a small , typically only (1) is singled out and mentioned as culprit. The typical conclusion is then to reject the null, and the working hypothesis is proposed as the winner. Note that this type of reasoning provides only an indirect argument for our working hypothesis, as the predictions of the working hypothesis itself were not considered at all. Only the null was used in the computation of . The fact that the null is not a good fit for the data does not automatically imply that the alternative fit the data well.
We conclude our brief review on -values with three observations: (i) -values only allow us to deduce predictions from the general population to potential data, but not the other way around; (ii) the comparison between and is unfair in the sense that -values only consider one side of the coin – namely, the fit of – and thus, support for is only ever indirect; and (iii) traditional practicei with -values can only lead to hard reject or non-reject decisions, which encourages “black and white” thinking and ignores the nuances of a scientific enquiry. On this issue, Rozeboom (1960) writes
The null-hypothesis significance test treats “acceptance” or “rejection” of a hypothesis as though these were decisions one makes. But a hypothesis is not something, like a piece of pie offered for dessert, which can be accepted or rejected by a voluntary physical action. Acceptance or rejection of a hypothesis is a cognitive process, a degree of believing or disbelieving which, if rational, is not a matter of choice but determined solely by how likely it is, given the evidence, that the hypothesis is true (pp. 422-423).
In a similar vein, Jeffreys (1939) offered the following critique of null-hypothesis significance testing:
Modern statisticians have developed extensive mathematical techniques, but for the most part have rejected the notion of the probability of a hypothesis, and thereby deprived themselves of any way of saying precisely what they mean when they decide between hypotheses (p. v).
Though there are many other problems with -values (see Wagenmakers, 2007), we contend that the key point is the neglect of any indication of support for . One way to remedy this is to consider Bayesian hypothesis testing as an alternative.
Bayesian Evidence as Relative Predictive Performance
To address these problems, we can use Bayesian hypothesis testing, which gives us the ability to quantify evidence by comparing the relative predictive performance of both and . Compared to null-hypothesis testing, Bayesian testing adds the missing piece , which indexes the predictive performance of . We then quantify the relative predictive performance by computing the Bayes factor (Kass & Raftery, 1995),
which indexes the relative adequacy of both hypotheses as predictors of our observed data. One immediate benefit of the Bayes factor is ease of interpretation. For example, if , this tells us that our data are 5 times more likely to be observed under than . Thus, we go beyond -values by simultaneously assessing the predictive adequacy of both hypotheses, not just . Of course, in order for to make predictions, we have to be specific about what is meant by “ ”. Because we are rarely certain about what “ ” means exactly, we place a prior distribution on the parameter . We will specify more about this later when we get to specific tests, but in general, the prior allows us to mathematically express our uncertainty about the values of under . For now, please note that this prior is a technical piece that is necessary for computing the Bayes factor.
Bayes Factor Interpretations
Typically, the Bayes factor is denoted as , where the 1 in the subscript refers to the alternative model, and the 0 to the null model. Bayes factors range from 0 to , and implies that the observed data are 8 times more likely under the alternative compared to the null model. For Bayes factors smaller than 1, we typically take the reciprocal to make them easier to interpret. For instance, is easier interpret as , which means that the observed data are 5 times more likely under the null compared to the alternative. Bayes factors are best viewed as a continuous measure of evidence. In general, a value of larger than 1 is perceived to provide evidence for the alternative against the null.
Unlike the -value, there are no universal standards for when a Bayes factor provides a sufficient amount of evidence. However, people have made recommendations in the past regarding sizes of Bayes factors, and these provide us with a helpful, but very rough guideline on its interpretation, when we do not know the context of the problem. In particular, we mention the recommendations of Jeffreys (1961), who proposed the following schemeii:
|1 – 3||anecdotal|
|3 – 10||moderate|
|10 – 30||strong|
|30 – 100||very strong|
In principle, the Bayes factor does not come with a specified decision rule. We believe that context made explicit using prior model probabilities by field experts are much more important than overly generalized “rules” such or , where is some arbitrarily chosen number. It is also worth noting that a Bayes factor is simply a quantity which measures relative evidence between two models. It is completely independent of our prior beliefs in the two hypotheses and . To incorporate the Bayes factor (relative predictive performance) with these beliefs (i.e., the prior model probabilities), we need Bayes’ rule.
Bayesian Belief Updating
Bayes’ rule mathematically expresses the most basic idea in science; namely, that our prior belief in a hypothesis – – is updated after observing data to a posterior belief in the following way:
Since our goal is to directly compare the predictive adequacy of two hypotheses and , we can employ Bayes’ rule twice to derive the following:
This gives us a lovely way to remember another fundamental fact of Bayesian hypothesis testing; that is, posterior odds = prior odds BF. Thus, not only does the Bayes factor assess the relative predictive performance of two hypotheses, it also indexes the factor by which our relative belief in the two hypotheses can be updated after observing data.
Now we come back to our earlier example of the SAMA test for mathematical ability and demonstrate these ideas in context. Working as policy makers, we might have seen many claims of improved study programs that did not work out well. For instance, if four out of five past claims of improved study programs led to null results, we might place 4-to-1 odds against the claim of improvement and thus set and . On the other hand, the designers of the new study program to increase SAMA scores might feel strongly about the benefits of their program, and so instead might choose and . If and are the only models under consideration we have and . The Bayes factor can then be used to update these prior model probabilities to posterior model probabilities, as follows:
For illustration, suppose that we found . Then the skeptic – who set before seeing the data – increases their belief in the superiority of the new study program to 56%, i.e., , and retains 44% probability for the null after data observation. On the other hand, the same results in the proponent of the new study program – who set before seeing the data – to report a posterior model probability of and instead. Note that the posterior model probabilities combine what we believed prior to data observation with the evidence of the data in terms of the Bayes factors. Both the skeptic and the proponent could, if they wish, declare the new study program to be superior to the status quo, but the skeptic would be much less confident than the proponent. The resulting posterior model probabilities can, thus, be used to complement a decision with a measure of uncertainty.
Bayesian Hypothesis Testing: How It Works
While the goal of this paper is to provide some concrete examples of Bayesian hypothesis testing in numerical cognition, before doing so we will discuss how Bayesian hypothesis tests work in conceptual terms. The technical details of each test are provided in a supplementary document, but the reader need only consult these details if interested.
As a general method, Bayesian hypothesis testing gives us several pragmatic benefits, each of which will be highlighted in our examples below. The first of these is the ability to quantify evidence on a continuous scale. A second benefit is the ability to differentiate between evidence of absence and absence of evidence. Recall that in traditional null hypothesis testing, failure to reject cannot be taken as evidence for a null effect. In this instance, it is not clear whether our data provide evidence for (i.e., evidence of absence) or our data are simply not evidential for either hypothesis (i.e., absence of evidence). The Bayes factor gives us the ability to separate these two ideas. For example, if is large, we take this to represent evidence for the null – that is, evidence for the absence of something. Another benefit is the ability for the researcher to monitor evidence as the data accumulate. Though problematic for inference based on -values, optional stopping does not affect the interpretation of Bayesian analyses (Hendriksen, de Heide, & Grünwald, 2018; Rouder, 2014). Thus, researchers may continuously monitor the “evidential flow” of their data and choose to stop whenever evidence is sufficiently compelling. Finally, Bayesian methods integrate testing with estimation, allowing the researcher to not only measure the evidence for one hypothesis over another, but to simultaneously obtain a probabilistic estimate of the effect size under the hypothesis that it is non-zero.
Given these benefits, what does a Bayesian hypothesis test require of the researcher? Like any hypothesis testing situation, the researcher must specify a null hypothesis and an alternative hypothesis . In a Bayesian context, though, there are some specific considerations that govern how this choice is made. Essentially, the researcher must specify a prior distribution of effect sizes that are expected under . That is, we not only specify that the effect size is nonzero under , but we also specify a prediction of how it is nonzero. We do this by picking a probability distribution for that quantifies our uncertainty about the size of the effect if it exists. Most importantly, this prior is needed to define the Bayes factor, as it allows to make predictions that can be evaluated against observed data. Furthermore, after data observation Bayes rule allows us to update the prior over to a posterior distribution, which provides us the likely sizes of the effect if it exists. For example, the Bayesian -test of Jeffreys (1961), popularized by Rouder, Speckman, Sun, Morey, and Iverson (2009), uses a Cauchy distribution with scale as a default choice for the prior distribution of effect sizes that are expected if there is a nonzero effect. This prior satisfies several desirable properties for an objective, default prior; see the Supplementary Materials for details. Given this definition of (i.e., as a distribution of expected effect sizes ), the null hypothesis is specified as a specific point within this distribution, namely .
Given this specification, we can use JASP to perform all necessary calculations “under the hood”. Note that the specification of the prior distribution of effect sizes under can be done differently, including choosing priors based on earlier experiments (Ly, Etz, Marsman, & Wagenmakers, 2019; Verhagen & Wagenmakers, 2014) or on general knowledge about a domain (i.e., informed priors; Gronau, Ly, & Wagenmakers, 2020). However, we recommend that researchers start their analysis with the default priors that are provided by JASP. These form a good reference point, and as one’s Bayesian expertise develops (or if one has strong prior knowledge) then additional analyses can be performed.
In the remainder of the paper, we describe a number of practical examples of using Bayesian inference in the context of numerical cognition. In the examples, each based on a well-cited paper in numerical cognition, we demonstrate a number of JASP’s features using synthetic data sets, each generated in R using specific model assumptions that closely match the patterns of observed data described in the original papers. The data sets, as well as the R script used to generate the data sets, are available for download from https://git.io/JeXui.
Example 1: A Bayesian t-Test
To demonstrate the Bayesian -test, we use a data set based on Verguts and De Moor (2005), who investigated whether two-digit numbers are represented holistically on a mental number line, or instead are processed by decomposing into separate digit components. In their experiment, Verguts and De Moor (2005) presented subjects with pairs of two-digit numbers who had the same decade (e.g., 53 versus 51) or decades that differed by one (e.g., 53 versus 61). Their reasoning for this manipulation was as follows: if subjects are using holistic processes to compare two-digit numbers, then they should exhibit a numerical distance effect (Moyer & Landauer, 1967) on same-decade as well as different-decade pairs. That is, response times should be faster when the numerical distance between the two numbers is large, and slower when the numerical distance is small. If, on the other hand, subjects are using decomposed processes, this numerical distance effect should disappear on the different-decade pairs. This is because in different-decade pairs, such as 53 versus 61, the comparison can be made solely on the basis of the decade digit alone. Thus, response times in this condition should be immune to manipulation of numerical distance.
In their original paper, Verguts and De Moor (2005) found support for decomposed processing on the basis of several single sample -tests. For subjects asked to choose the smaller of the digit pairs, they found (as expected) a statistically significant numerical distance effect for same-decade pairs, , . However, the distance effect was not significantly different from 0 for different-decade pairs, , . A similar pattern was found for subjects who were asked to choose the larger of each pair: there was a significant distance effect for same-decade pairs, , , but no significant distance effect for different-decade pairs, , .
Executing the t-test in JASP
We will now describe a Bayesian analysis inspired by the original methods of Verguts and De Moor (2005). After downloading the file ttest.csv from the Github repository (https://git.io/JeXui), you can open the file in JASP by clicking the menu button – “Open” – “Computer”, which will then open a file browser for you to locate the downloaded file on your computer. Once the file is opened, you will be able to see a data set with three variables (see Figure 1). The variable condition defines whether a participant was instructed to choose the smaller or the larger of each presented two-digit number pair. The variable distSame represents each participant’s slope after regressing response times against numerical distance (c.f., Lorch & Myers, 1990) for same-decade pairs. Note that a negative slope is indicative of the numerical distance effect, as response times decrease as a function of increasing numerical distance. Finally, distDifferent represents each subject’s regression slope for different-decade pairs.
Similar to Verguts and De Moor (2005), we are interested in whether the mean of the regression slopes (collapsed across participants by condition) depends on the type of decade pair being compared. Specifically, the decomposed processing account leads us to predict that for same-decade pairs, the numerical distance effect (indexed by mean slope) will be negative, whereas for different-decade pairs, the numerical distance effect will be 0. As a first step, we can easily obtain some descriptive statistics. In JASP, click on the “Descriptives” in the ribbon at the top, then click “Descriptive Statistics” in the dropdown menu. This will open a dialogue box showing the list of variables available in our data set. Highlight distSame and distDifferent, then click the right-facing arrow to move these into the “Variables” box. Immediately, some descriptive statistics will appear on the right side of the screen under “Results”. We can also construct a boxplot (with jittered datapoints included) for each of the variables, split by condition, by clicking on the “Plots” menu and selecting “Boxplots” and “Boxplot element” and “Jitter element” (see Figure 2). Immediately, we can see that the slopes are mostly negative for same-decade pairs (the top boxplot in Figure 2); this is indicative of a numerical distance effect for the same-decade pairs. However, the slopes are centered at 0 for the different-decade pairs (the bottom boxplot), indicating no numerical distance effect for these pairs. Thus, the data initially seem to support the decomposed processing narrative of Verguts and De Moor (2005).
We are now ready to perform a Bayesian analysis of the simulated Verguts and De Moor (2005) data. First, since we will separately analyze the “Choose smaller” and “Choose larger” groups, we will need to apply a filter to the data set; specifically, we will first consider only the “Choose smaller” group. There are several ways to do this in JASP, but the easiest is the “click filter.”iii First, click the header of the condition column. To remove all of the larger trials from the analysis and retain only the smaller trials, we simply “turn off” the larger value by clicking on the check mark next to it. The check mark will be replaced by a cross, and you will be able to see that the larger trials have been greyed out. The filter dialog can be closed by clicking on the X on the right side of the dialog box. Notice that a funnel icon will appear next to the condition column header to indicate that a filter is currently being applied. Any filters can be removed by re-opening the filter dialog and clicking the eraser icon.
To perform a Bayesian one-sample -test, click on the “T-Tests” button in the ribbon at the top, and select “Bayesian One Sample T-Test”. There are two -tests that we wish to perform, but for clarity, we will start with describing one in detail. We will first perform a -test for the same-decade trials. From the variable list, move distSame over to the “Variables” box. Immediately, JASP will output a Bayes factor of . However, let us look more closely. Note that under the “Alt. Hypothesis” list, the option “ Test value” is selected by default. Thus, this Bayes factor is for a two-sided alternative, . But because the numerical distance effect is expected to be negative, we instead consider the one-sided alternative , where the minus in the subscript reminds us that the alternative is one-sided negative. Thus, we choose “ Test value” from the radio buttons under “Alt. Hypothesis”. Immediately, we see that the Bayes factor increases to 9.027; this indicates the importance of properly specifying the correct alternative hypothesis. Finally, select also “Prior and posterior” under “Plots”. This setup and the resulting output appears in Figure 3. There are two components of this output that we now discuss in detail: the computed Bayes factor (top of the Results pane in Figure 3) and the plot of the prior and posterior for the effect size parameter (bottom of the Results pane in Figure 3).
As mentioned above, the Bayes factor is , which indicates that the observed data are approximately 9 times more likely under the alternative hypothesis than under the null hypothesis . According to the convention of Jeffreys (1961), this is considered “moderate” evidence for , which states that the true population effect (from which our participants’ regression slopes are drawn) is negative. A useful visual display of this Bayes factor is provided in the plot in the bottom of Figure 3. This probability wheel (or “pizza plot”) displays the posterior model probabilities for both and under the assumption that they were equally probable before data observation, and the text beside it on the left shows the posterior model odds of 9.027-to-1 in favor of over the null.iv
The bottom plot shows this in more detail. The dashed line depicts the aforementioned Cauchy prior (with a default scale/prior width of ) on the population effect size , but restricted to negative values in accordance to the alternative . The solid line depicts the posterior density for . Under the assumption that the effect cannot be positive, this density provides a measure of uncertainty about the size of the population effect after observing the data. We can see in Figure 3 that the more probable values of are now centered around . For the testing problem, one specific value worth exploring is . Notice that relative to the prior, the -coordinate at in the posterior is smaller than the -coordinate at . This reflects the fact that our posterior belief regarding has decreased after data observation. In fact, it has decreased by a factor of approximately 9, which is the Bayes factor we obtained. This relationship is no accident; For the -test the Bayes factor can be represented as the ratio of the prior and posterior on the population effect evaluated at the null hypothesis . This fact is known as the Savage-Dickey density ratio (Dickey, 1971; Wagenmakers, Lodewyckx, Kuriyal, & Grasman, 2010).
Figure 3 also provides information about the magnitude of in the form of a 95% credible interval of [-0.962, -0.110]. This means that if the population effect is assumed to be negative – thus, under the implicit assumption that the null is false – then with 95% probability, the true value of the population effect is between -0.962 and -0.110. Note that while it is tempting to use this estimate as a replacement for the inference that is afforded by the Bayes factor, this approach is simply not valid (Berger & Delampady, 1987; Jeffreys, 1939; Wagenmakers, Lee, Rouder, & Morey, in press). The credible interval presented here is conditional on , so it only makes sense if is the true model. Before interpreting the credible interval, we must first perform a test to index support for over . So, when reporting the credible interval, we should report the Bayes factor first, then we can report the credible interval. Do not be tempted to report the credible interval without the Bayes factor!
Another question to consider in our analysis is how sensitive the Bayes factor is to changes in the specification of the Cauchy prior on the population-level effect size . Clicking the box for “Bayes factor robustness check” will help us to answer this question. Figure 4 displays the result of this robustness check. As mentioned before, the default prior for is a Cauchy distribution centered around with scale (or “width”) . Figure 4 depicts how the Bayes factor changes with varying values of . The plot represents the Bayes factor as a continuous function of , and the legend at the top lists the Bayes factor at four values: the value of that produces the maximum Bayes factor, the original user-specified prior ( ), a “wide” prior ( ), and an “ultrawide” prior ( ). This plot should be interpreted holistically –across a wide range of values for the prior scale , the resulting Bayes factor ranges from 6.3 to 9, which implies that the result is quite robust.
Now we are ready to perform a Bayesian -test for the different-decade trials. Recall that these were the critical trials for Verguts and De Moor (2005), as a null numerical distance effect would lend support for a strictly decomposed model of two-digit number processing. We perform this test in an identical manner to the previous test, with one exception. After moving the variable distDifferent to the “Variables” list, we select from the “Bayes Factor” menu. Note that this is not strictly necessary to perform the analysis, but doing so will cast our Bayes factor in terms of support for our working hypothesis over . The result is shown in Figure 5. We obtain , indicating that our data are approximately 7.6 times more likely under than . This shows some evidence that the true population effect , from which our synthethic data were generated, is equal to 0. As we can see in Figure 5, our posterior belief that has increased relative to our prior belief that , specifically by a factor of 7.595.
It is again important to note that the 95% credible interval that is displayed in the figure is conditional on . In other words, it only makes sense if is the true model. However, in this case, we have moderate evidence that is the true model, so we can ignore this credible interval (Jeffreys, 1939). Conditioned on our best guess for the size of the effect would simply be . It is also worth mentioning that there might be some confusion present when to take this credible interval at face value. Some might argue that since the credible interval does not contain 0 (its right endpoint is -0.004), this is actually evidence for . However, this is simply not true. The apparent paradox lies in how the credible interval is computed; in JASP the 95% credible interval is computed as the central 95% of the mass of the posterior for . Since there is no posterior mass to the right of , this central portion of the posterior must necessarily be contained strictly to the left of , thus producing the right end point that is slightly less than . However, if we were to instead specify the credible interval as 95% of the posterior distribution with highest density, the resulting credible interval would contain . Currently, this functionality is not available in JASP, so great care must be taken in interpreting the credible interval for based on restricted priors.
What is also important to realize is that a test addresses a different question than an estimation. The Bayes factor focuses on the question “Does the effect exist?”, whereas the posterior focuses on the question “If it exists, then how large is it?”. This is why the Bayesian testing problem compares the model where the population effect is fixed at zero against one where it is free to vary. Once the data have convinced us that the population effect varies freely, only then would we be interested in the magnitude of the effect (Haaf, Ly, & Wagenmakers, 2019).
Additionally, one might want to see how evidence changes over time as data are collected. JASP gives us the ability to monitor this evidential flow with the “Sequential analysis” plot. The resulting plot shows the progression of the Bayes factor over time; see Figure 6. This plot shows how the Bayes factor changes sequentially as each data point is added, starting from (no preference for either model) at to the obtained Bayes factor of at . Notice that the plot provides both the Bayes factor (on a logarithmic scale) and a verbal label of evidence for each region, corresponding to the recommendations of Jeffreys (1961) that we referenced earlier. If the alternative is the true model, then will increase at an exponential rate, whereas if the null is true, then will increase, albeit more slowly. In general, it is not clear when to stop, but good practice would be to sample until the Bayes factor reaches 10, either in support of or .
We can follow the same steps described above on the participants who were assigned to the “choose larger” condition. The inference is largely the same. For same decade trials, we obtain , indicating that the data are approximately 7.5 times more likely under than under . This provides some evidence that the true population effect is negative, which is indicative of a negative numerical distance effect. For different decade trials, the inference is less clear; we obtain , indicating that the observed data are approximately 2 times more likely under than under .
Let us now recap what we have done here. Importantly, we have described how to conduct the Bayesian -test in JASP, and through the example, we have demonstrated many of the pragmatic advantages of Bayesian analyses in general, going beyond what is afforded by traditional null hypothesis testing. First, we have quantified evidence for the specific hypotheses that were derived from the decomposed processing account of Verguts and De Moor (2005). That is, we showed that when people were given same decade trials, our data were approximately 9 times more likely under the hypothesis of a negative slope (i.e., the numerical distance effect) than under the null hypothesis. On the other hand, when people were given different decade trials, our data were approximately 7.5 times more likely under the null hypothesis (i.e., evidence of absence). Further, for same decade trials, where the alternative hypothesis received more support, we obtained a probabilistic estimate for , the standardized effect size. Specifically, we reasoned that if is the true model, there is a 95% probability that the effect size is between -0.962 and -0.110. Finally, we demonstrated sequential analysis, which allows the researcher to monitor evidential flow as data come in. In all, we went beyond what Verguts and De Moor (2005) were able to do with -values alone, particularly regarding their conclusion of a null distance effect for different decade trials.
Though the discussion throughout this example is designed to be instructive, most of the narrative we have used is also good for the reporting of Bayesian analyses. Practitioners who desire more guidance on the reporting of such analyses are encouraged to consult van Doorn et al. (2019).
Example 2: A Bayesian Linear Regression
To demonstrate Bayesian linear regression, we use an example inspired by Holloway and Ansari (2009), who examined the relationship between basic numerical processing and higher mathematical skills in children. Specifically, they examined whether childrens’ mathematical achievement scores were related to their ability to compare symbolic and nonsymbolic number magnitudes. In this example, we will recreate one specific component of their analysis; namely, we will use Bayesian linear regression to examine the extent to which mathematical fluency scores are predicted by childrens’ standardized symbolic and nonsymbolic distance effects. To operationalize these effects, Holloway and Ansari (2009) subtracted the mean RT for large numerical distances from the mean RT for small numerical distances, and then standardized across individuals by dividing the resulting RT difference by the mean RT for large distances. In their original paper, Holloway and Ansari (2009) found that the childrens’ symbolic distance effect was a significant predictor of fluency, , , whereas the nonsymbolic distance effect did not predict fluency, , .
The questions we would like to address here are (1) “Which out of these two covariates ( and ) predict fluency score?”, and (2) “How large is the effect of each covariate ?”. Note that the first question regarding the which covariate(s) predict fluency score can be answered as one of the following combinations of distance effects: (a) neither symbolic nor nonsymbolic, (b) only symbolic, (c) only nonsymbolic, and (d) both symbolic and nonsymbolic. Equivalently, these answers correspond to specific hypotheses about the regression coefficients , respectively: (a) the coefficients for both symbolic distance and nonsymbolic distance are equal to 0; (b) only the coefficient for nonsymbolic distance is equal to zero; (c) only the coefficient for symbolic distance is equal to zero; and (d) neither coefficient is equal to zero. Cast in this manner, the first question is the testing problem, whereas the second question deals with the estimation problem.
The Bayesian linear regression procedure begins by considering the following four models:
Here, represents a model where neither symbolic nor nonsymbolic distance has any effect on fluency. For and , only symbolic (respectively, nonsymbolic) distance has an effect on fluency. Finally, for , both symbolic and nonsymbolic distance have effects on fluency. As with our earlier example, our goal is to assess the predictive adequacy of each model. To do this, we must place prior distributions on the parameters in the model (see the supplementary text for details). For this context, the relevant parameters are the standardized regression coefficients and , representing the “effects” of symbolic and nonsymbolic distance, respectively. In JASP, the default prior on these regression coefficients is a multivariate Cauchy distribution (see Rouder & Morey, 2012) – conceptually, this is a generalization of the prior that we placed on effect size for the Bayesian -test described earlier. This prior allows each model to make predictions (see the Supplementary Materials for details), which JASP then uses to assess how well each model predicts the observed data at hand. Below, we describe for the practitioner how to execute this procedure and interpret its rich output.
Executing the Bayesian Linear Regression in JASP
After downloading the file regression.csv from the Github repository (https://git.io/JeXui), open the file in JASP. This will reveal a data set with three variables: symbolicDE, representing the standardized distance effect measure for the symbolic number comparison task; nonsymbolicDE, representing the standardized distance effect measure for the nonsymbolic comparison task; and fluency, representing the mathematics fluency score.
First, we will construct scatterplots to visualize the relationship between fluency and the two proposed predictors. An easy way to do this uses the “Bayesian Correlation” option, which can be selected by clicking the “Regression” button. Once here, move all three variables into the analysis box, then open the “Plot Individual Pairs” menu. The two scatter plots will be revealed by moving the variables symbolicDE and fluency (in that order) as a pair into the box, and then doing the same with nonsymbolicDE and fluency. Figure 7 displays this menu as well as the resulting scatterplots. As we can see, fluency seems to decrease as the symbolic distance effect increases, but there seems to be little effect of nonsymbolic distance on fluency. We will now confirm these observations with Bayesian linear regression.
To perform the Bayesian linear regression, we click the “Regression” button and select “Bayesian Linear Regression”. We then move the fluency variable to the “Dependent Variable” box and we move both the symbolicDE and nonsymbolicDE variables to the “Covariates” box. We will leave the basic defaults in place except for two. First, select “Posterior summary” under “Output”. Also, we will select one more option that is found in the “Advanced Options” section, but will actually simplify our discussion for this tutorial. Under “Model Prior”, select “Uniform”. The output can be seen in Figure 8.
First, we will walk through the basic output given by JASP. Note that the basic functionality of JASP here is based on the R package BAS (Clyde, 2020; see also van den Bergh et al., 2020). As can be seen in the output table in Figure 8, JASP constructs the four models we described earlier (i.e., ). These models are each displayed in the rows of the output in Figure 8 in decreasing order of posterior model probability (i.e., the model with the highest posterior model probability is listed at the top). We can see immediately that the best model is , which posits that fluency is predicted only by the symbolic distance effect. But, to understand why it is the best model and how it compares to the other 3 models, we need to discuss what each of the columns in the output table represent.
The first column, , lists the prior probabilities of the 4 models. Above, we specified that the prior model probabilities should be uniform. Under this option, JASP gives the models equal prior model probability. Since there are 4 models, the prior probability of each model is . The second column, , gives the posterior model probabilities. Like the prior model probabilities, these must add to 1, but now there has been some shift from prior model probability to posterior model probability. Here, after observing data, we can see that and account for most of our posterior belief, with posterior probabilities .704 and .289, respectively. The remaining two models have very low posterior probability totaling .007.
Various Bayes factors are contained in the third and fourth columns. Each column gives a slightly different type of Bayes factor, however, and must be interpreted appropriately. The third column, , gives the factor by which the prior odds for a given model have been updated to produce the posterior odds for that model. For example, consider , the model where fluency is predicted only by the symbolic distance effect. The prior odds for , which is defined as the probability that is the true model divided by the probability that is not the true model, can be computed as . By similar calculation, the posterior odds are equal to . Thus, the updating factor from prior odds to posterior odds is calculated as . One of the useful properties of this column is that it tells the user quickly whether the odds for a particular model has increased or decreased. As we can see in Figure 8, the data has increased model odds for both and , but only just slightly for . In other words, we have increased our belief in the two models which contain symbolic distance as a predictor for fluency.
Finally, the fourth column, , lists the Bayes factor for a given model compared to the most probable model (note that this is the default setting in JASP, but if desired, it can be changed by the user to instead display the Bayes factor compared to the null model).v This is a very useful column for doing specific model comparisons. For example, we can use this column to directly compare the two most probable models, and . In the results table, we see that the Bayes factor for the compared to the best fitting model is 0.411. Taking reciprocals, this tells us that the Bayes factor for over is . In other words, the observed data are approximately 2.5 times more likely under the model where fluency is predicted only by the symbolic distance effect compared to the model where fluency is predicted by both symbolic and nonsymbolic distance effects. Another way of saying this is that we have positive evidence for excluding nonsymbolic distance effect as a predictor of mathematical fluency once we already accepted symbolic distance effect as a predictor.
Once we have positive evidence for , which posits only symbolicDE as a predictor of fluency, we can move to the problem of estimation. To this end, we focus on two outputs: (1) the “Posterior Summary” output of Figure 8; and (2) the “Marginal posterior distributions” plot (see Figure 9), which can be displayed by clicking on the “Plots” menu in JASP. Both outputs provide model averaged posterior summaries of the regression coefficients (the intercept), (the coefficient of symbolicDE), and (the coefficient of nonsymbolicDE). Before interpreting these outputs, however, there are some subtle details that we should discuss. As mentioned earlier in the -test example, any estimation that we present is conditional on some model. However, in this example, we have four models to consider. So, how to proceed? One approach is to employ Bayesian model averaging, which JASP performs by default. Basically, it works as follows. Rather than estimating parameters conditioned on one specific model, JASP calculates an average posterior distribution for the parameter. That is, we factor in both the uncertainty about the parameter within any one particular model and the uncertainty about the models themselves. We think this is a good approach to estimation, as it alleviates the need for the researcher to pick one model as “the winner”, and instead, provides estimates that incorporate both levels of uncertainty (within models and across models). For an accessible tutorial on Bayesian model averaging, see Hinne, Gronau, van den Bergh, and Wagenmakers (2019).
Given this background, let us first interpret the Posterior Summary table in Figure 8. First, the table provides estimates of mean and standard deviation for the three coefficients, as well as 95% credible intervals. Moreover, the table provides “inclusion probabilities”, which provide prior and posterior probabilities of including each covariate as a predictor of the linear model. We can take a look at these in more detail. First, let us focus on , the prior inclusion probability for each covariate. These are based on the uniform prior model probabilities that we chose in the “Advanced Options” section. Since the intercept was included in each of the 4 models, and each model had prior probability .25, the prior probability of including the intercept is equal to . On the other hand, only two of the models ( and ) included the covariate symbolicDE. Since each of these models had prior probability .25, the prior probability of including the predictor symbolicDE is equal to . A similar argument shows the same prior probability for the covariate nonsymbolicDE.
After observing data, we can calculate the posterior inclusion probabilities . This can be done easily by considering the posterior probability of each model in the “Model Comparison” table of Figure 8. The intercept is in each of the models, so summing the posterior probabilities for all four models produces . For symbolicDE, we sum the posterior probabilities for and , which gives . Finally, for nonsymbolicDE, we sum the posterior probabilities for and , giving .
Using these posterior inclusion probabilities, we can calculate an “inclusion Bayes factor”, which is defined as the factor by which the prior odds for including a specific predictor are increased after observing data. For example, the prior odds for including symbolicDE is equal to 0.5/(1-0.5) = 1, and the posterior odds is equal to . Thus, the odds have increased by a factor of 138.27/1 = 138.27. Similarly, the prior odds for including nonsymbolicDE as a predictor is equal to 1, whereas the posterior odds is equal to . We can interpret this Bayes factor less than one via the reciprocal by reporting that the odds of including have decreased after observing data by a factor of .
Finally, let us consider the marginal posterior distribution displayed in Figure 9. For simplicity of discussion, we only display the marginal posterior for the nonsymbolicDE covariate, though JASP provides a plot for each covariate being considered. The figure shows exactly what is meant by model averaging; from the discussion above, we know there is uncertainty about whether to include nonsymbolicDE as a predictor in the overall model. This uncertainty can be quantified as the posterior inclusion probability, which from the posterior summary table we know to be . Thus, it easily follows that the posterior probability of excluding nonsymbolicDE as a predictor is ; this probability is displayed in Figure 9 as a “spike” at 0. This spike together with with the posteriors of each model result in an unconditionalvi 95% credible interval, which states that the coefficient for nonsymbolicDE ranges from -0.764 to 29.736.
To summarize, the Bayesian linear regression procedure offers many advantages over and above what we get from traditional null hypothesis testing. Rather than simply making black-and-white decisions about whether symbolic and/or nonsymbolic distance is a significant predictor of mathematical fluency, we are able to engage in a much deeper analysis. With only a few clicks in JASP, we can easily perform model comparison, allowing us to compare the adequacy of four different models to predict the data at hand. Further, with Bayesian model averaging, we are able to simultaneously engage in testing and estimation, factoring our uncertainty about model choice into our estimation of the effects of symbolic and nonsymbolic distance on mathematical fluency. One should note that model averaging becomes quite beneficial when we have many predictors in our regression model, as the space of possible models on these predictors becomes large very quickly. The reader is encouraged to consult van den Bergh et al. (2020) for a good tutorial on multi-model Bayesian linear regression in JASP.
Example 3: A Factorial Analysis of Variance
In our final example, we demonstrate a Bayesian factorial analysis of variance (Rouder et al., 2012). To do this, we use a data set based on Campbell and Fugelsang (2001), who investigated the effects of problem size and numerical surface form on response times in an addition verification task. Not only did they find the usual effects of problem size (small problems are solved faster than large problems) and problem format (digit problems are solved faster than word problems), but they also observed an interaction between size and format. Specifically, the impact of problem format on RTs was greater for large problems than for small problems. This interaction was interpreted as evidence against the prevailing additive model of mental arithmetic, where the effects of problem format are localized to systems that encode numerical stimuli (Dehaene & Cohen, 1995). Instead, the interaction found by Campbell and Fugelsang (2001) suggested that surface format directly penetrates the downstream central calculation processes.
Theory of Bayesian Analysis of Variance
The theory of Bayesian ANOVA is analogous to that of Bayesian linear regression elaborated on above, because an ANOVA model can be expressed as a linear regression with dummy variables (Christensen, 1987; Field, Miles, & Field, 2012). These dummy variables are covariates that can be either “0” or “1”. For instance, the factor “size” with levels “small” and “large” and the factor “format” with levels “digit” and “word” are expressed by the dummy codes such as , which implies that , the th response of the th participant, corresponds to the mean RT of stimuli that were “small” in size and “word” of format.
Rouder et al. (2012) used the relationship between ANOVA and linear regression to develop the Bayesian ANOVA, which therefore also uses a multivariate Cauchy prior. Furthermore, this relationship allows us to express the problem at hand as a comparison between the models
Notice that in this case, the null model contains factors for size and format. This may seem odd at first, especially compared to the previous linear regression example that contained only the intercept in the null model. Here, we are setting size and format to be nuisance variables. That is, our previous knowledge of mathematical cognition leads us to expect effects of size and format, so we are not interested in testing those effects. Rather, the problem at hand is to test whether there is an interaction between size and format. The model specifications above let us do this as a model comparison between (the interaction model) and (the null, or additive, model). Thus, with this comparison we can fully focus on the interaction term. Below we show how in JASP all parameters except the interaction term can be made part of the null model.
Implementation in JASP
The simulated data is contained in the file anova.csv, which can be downloaded from (https://git.io/JeXui). Loading this file into JASP, we can see five variables: subject, which encodes the subject number (1-48); and four variables that each represent mean RTs in specific combinations of problem size and format: small.digit, representing small problems in digit format; small.word, representing small problems in word format; large.digit, representing large problems in digit format; and large.word, representing large problems in word format. Notice that these data are presented in wide format, with each row representing a unique subject and each column giving the multiple measurements observed from each subject.
To perform a Bayesian analysis of variance, we click the button for the “ANOVA” module and select “Bayesian Repeated Measures ANOVA” from the menu. This procedure requires us to specify the names and levels of the repeated measures factors; clicking on the terms “RM Factor 1”, “Level 1”, etc. in the “Repeated Measures Factors” box makes these names editable. In Figure 10, we have changed RM Factor 1 to “problems size”, Level 1 to “small”, and Level 2 to “large”. Additionally, we have changed RM Factor 2 to “format”, and its corresponding levels 1 and 2 were changed to “digit” and “word”, respectively. The names we give in the the “Repeated Measures Factors” box define repeated measures cells immediately reflecting the four unique combinations of problem size and format. We then drag our variables into these cells, in order of appearance in the “Repeated Measures Cells” box; that is, we first add small.digit, then small.word, followed by large.digit and large.word.
At this point, the output will look similar to the linear regression example we did previously. But before attempting to interpret the output, let us construct a plot of the condition means. In Figure 11, we have displayed such a plot, which was constructed by selecting the “Descriptives Plots” menu and moving problem size to the box labeled “Horizontal Axis” and format to the box labeled “Separate Lines”. Note that the format effect, indicated as the vertical distance between the two lines, is greater for large problems than for small problems. Such a pattern would be indicative of the interaction model, but to quantify the evidence we will perform a Bayesian model comparison.
Based on the repeated measures factors that are entered, JASP constructs a set of models to reflect the possible additive combinations and interactions between these factors. Since we entered two factors (size and format), JASP builds 5 competing models:
Note that JASP does not include interactions without also including the relevant main effects; thus, the model which includes the interaction between problem size and format (i.e., ) also includes both main effects of problem size and format. A model including an interaction without the corresponding main effects is routinely considered as implausible, as such a model would rely on picking the exact levels of experimental manipulation required to make the main effects cancel each other out (Rouder, Engelhardt, McCabe, & Morey, 2016). Further rationale for this choice is discussed in Rouder, Morey, Verhagen, Swagman, and Wagenmakers (2017).
Though we have 5 models to consider, our goal is to directly compare the predictive adequacy of the additive model against the interaction model . To make this easier, we can make the additive model into a “null” model of sorts, so that now there is only one comparison: the interaction model against the additive model. We do this in JASP by clicking on “Model” and selecting the “Add to null model” checkboxes for both problem size and format; see Figure 12.
At this point, the output will change considerably. Figure 13 shows that the Model Comparison table now only contains two rows; one for each of the models under consideration (additive model versus interaction model). Again, by default, JASP sorts these models in decreasing order of posterior model probability. The interpretation of each column in the Model Comparison table is exactly the same as in the linear regression example, but for completeness, we will provide a brief explanation of each column.
The first column, , lists the prior probabilities of the 2 models under consideration. By default, JASP assigns these models to have equal probability, a priori. Since there are 2 models, the prior probability of each model is . The second column, , gives the posterior model probabilities. After observing data, we can see that the interaction model has the largest posterior probability, .999. The additive model has probability .0006799 (expressed in scientific notation in Figure 13).
As in the regression example, the third column, , gives the factor by which the prior odds for a given model have been updated to produce the posterior odds for that model. After observing data, the prior odds (1:1) in favor of the interaction model have been multiplied by a factor of 1469.749. For the two-model comparison in this example, this Bayes factor is identical to the Bayes factor produced in the fourth column. Recall that this column lists the Bayes factor for a given model compared to the best fitting model, which in this case is the interaction model. Since we want to know the Bayes factor for the interaction model over the additive model, we can compute the reciprocal (we can also show this by selecting “Compare to null model” and selecting as the displayed Bayes factor). Note that this Bayes factor is almost identical to the value we saw for (except for rounding error); in situations where there are only two models under consideration, this will always be the case.
In summary, our observed data is quite evidential for an interaction model, where there is an interaction between problem size and format on mean response times in a mental arithmetic task. This supports the narrative of Campbell and Fugelsang (2001), who argued that factors which affect encoding processes (i.e., format) also penetrate downstream to affect central calculation processes (reflected by the problem size effect). Note that just like we did with linear regression, we can obtain 95% credible intervals for the intercept, problem size effect, format effect, and the interaction effect – simply click on “Single Model Inference” and select the “Estimates” option under “Tables”. Again, this illustrates the powerful utility of Bayesian methods, which allow us not only to engage in effective model comparison, but also allow us to adjust our interval estimates in light of the inherent uncertainty in model choice.
We note that there are many other options that a user might want to explore when computing a Bayesian analysis of variance in JASP. Also, the methods described here also work when the user is interested in more complex designs, including those with 3 or more factors. In these cases, the number of potential models to consider increases quickly. However, JASP displays these models in order of decreasing posterior probability, so the user can quickly ascertain the best fitting model from the potentially large list of candidate models (see also Rouder et al., 2017). The interested reader should consult van den Bergh et al. (2019) for another tutorial on conducting Bayesian ANOVA in JASP, as well as van Doorn et al. (2019) for guidelines on how to report the results of this analysis.
In this tutorial, we have presented the basics of Bayesian inference in the context of three well-cited studies from the numerical cognition literature. The examples provided in this tutorial represent a broad class of techniques (e.g., -tests, linear regression, and analysis of variance) that are common to many studies in numerical cognition and should be familiar to most researchers. We have focused our tutorial on describing the use of Bayes factors as a tool for evaluating the predictive adequacy of competing hypotheses. Using the free software package JASP, we have demonstrated how to compute Bayes factors for each example and explained some of the most basic options that are provided to the user. While the examples we have provided are far from exhaustive with respect to depth of analysis, we feel that they provide the reader with a good foundation and provide an encouraging start to the use of Bayes factors.
Further, we have demonstrated how Bayesian analyses afford the practitioner a different perspective on statistical inference. Importantly, this perspective has many pragmatic advantages over the current practice of reporting -values and confidence intervals. These advantages include: (1) the ability to quantify evidence on a continuous scale; (2) the ability to differentiate between evidence of absence and absence of evidence; (3) the ability to monitor evidence as the data accumulate, allowing researchers to choose to stop data collection whenever evidence is sufficiently compelling; (4) the ability to integrate testing with estimation, allowing the researcher to not only measure the evidence for one hypothesis over another, but to simultaneously perform estimation in light of model uncertainty. Moreover, analyses that incorporate all of these pragmatic advantages can be obtained in the free software package JASP with only a few mouse clicks. We hope the reader will take this tutorial as a starting place to applying Bayesian inference in the analysis of their own datasets.