When subjects are required to judge two stimuli that differ on a single contrastive polar continuum (e.g., 'big' vs. 'small'), they are faster to judge which of the two stimuli is higher on that continuum when the stimuli are high on that dimension, and faster to judge which of the two stimuli is lower on that continuum when the stimuli are low on that dimension. Furthermore, when subjects are required to judge whether a target stimulus is bigger or smaller than a standard stimulus, subjects are faster when the relative size of the standard and of the target coincides (see Dehaene, 1989). Dehaene (1989) defined the first paradigm (i.e., choose the bigger/smaller of two stimuli) as a selection paradigm, and the second paradigm (i.e., is the target bigger/smaller than a standard) as a classification paradigm. The result that characterises these two paradigms is referred to as the semantic congruity effect. The semantic congruity effect has been replicated in perceptual and symbolic judgements across different domains, including surface area (Moyer & Bayer, 1976), line length (Petrusic, Baranski, & Kennedy, 1998), brightness (Wallis & Audley, 1964), scalar adjectives of quality (Holyoak & Mah, 1982), the distance between two cities (Holyoak & Mah, 1982) and Arabic numerals (Banks, Fujii, & Kayra-Stuart, 1976; Holyoak, 1978).
Many theories have been proposed to account for the semantic congruity effect. These theories vary greatly in the level of description of the phenomenon, with some theories being able to account for semantic congruity effects only in the case in which comparative instructions are presented to the subject (selection paradigm), but not when subjects have to decide whether a target is bigger or smaller than a standard stimulus (classification paradigm). For a detailed and exhaustive review of the models proposed for the explanation of the semantic congruity effect, refer to Petrusic (1992) and Leth-Steensen and Marley (2000); here we present a brief description of some of the theories that have been proposed for the explanation of this phenomenon.
According to the expectancy effect (Banks & Flora, 1977; Marschark & Paivio, 1979), the direction of the comparison (e.g., is the target stimulus bigger than the standard?) prepares the subject for the range of stimuli that will be presented. This results in a facilitation when the comparison and the stimuli are congruent. However, the semantic congruity effect can still be observed even when the comparative is presented together with, or after, the stimuli (Holyoak & Mah, 1982), undermining a basic assumption of this model. Alternatively, the semantic coding model (Banks, Clark, & Lucy, 1975; Banks et al., 1976) explains the congruity effect by referring to linguistic codes; however, this model struggles with the finding that even non-human primates show a semantic congruity effect when comparing magnitudes (Cantlon & Brannon, 2005).
A further verbal theory, the frequency explanation (Ryalls, Winslow, & Smith, 1998), explains the semantic congruity effect by the fact that each comparative is associated with one unique dimension during learning (i.e., subjects learn to use 'bigger' for high magnitude stimuli, and 'smaller' for low magnitude stimuli); yet, this explanation struggles with the result that the semantic congruity effect is also found when subjects are taught new comparisons with novel comparatives (Chen, Lu, & Holyoak, 2014). A further class of models is that of reference point models (Chen et al., 2014; Dehaene, 1989; Holyoak, 1978; Holyoak & Mah, 1982; Marks, 1972), according to which subjects, when making a magnitude judgement, compare the numerical value of the stimulus with reference values stored in memory. Under this view, the subject is assumed to establish a reference point near one of the extreme values encountered in a given context, and this results in a facilitation when the stimulus to discriminate is nearer to the reference point. From this perspective, the use of reference points has been suggested to affect the strength of evidence accumulation (see Chen et al., 2014; Dehaene, 1989); this means, for example, that when the magnitude of the standard stimulus coincides with the magnitude of the target, the rate of evidence accumulation is higher than when the relative sizes of the two stimuli are not congruent. Other authors have explained the semantic congruity effect by adopting random walk models (Birnbaum & Jou, 1990; Link, 1990; Link & Heath, 1975; Poltrock, 1989); these studies explain the semantic congruity effect as arising from a starting point adjustment dictated by the instructions.
However, as argued in Leth-Steensen and Marley (2000), in tasks in which subjects are presented with symmetric differences (i.e., the same number of bigger and smaller comparisons are presented), it is not clear why subjects should adjust their starting point of evidence accumulation towards one of the two alternatives in selection paradigms. Finally, some evidence-accumulation models and instructional pathway interference accounts have been proposed (Leth-Steensen & Marley, 2000; Petrusic, 1992; Petrusic, Shaki, & Leth-Steensen, 2008), according to which the semantic congruity effect is due to a variation in the rate of evidence accumulation depending on the congruency or incongruency between the instructions and the relative size of the stimulus pair.
Here, we focus on a computational model of decision making, known as the Drift Diffusion Model (DDM; Ratcliff & McKoon, 2008). This computational model has been applied to a wide variety of tasks, paradigms and domains, including perceptual decision making, value-based decision making and also the description of the integration of sensory signals towards a motion-discrimination decision in monkeys (Gold & Shadlen, 2002; Ratcliff, 1978, 2002; Ratcliff & McKoon, 1988, 2008; Ratcliff & Rouder, 1998; Ratcliff, Thapar, Gomez, & McKoon, 2004; Ratcliff, Van Zandt, & McKoon, 1999; Shadlen & Newsome, 2001; Thapar, Ratcliff, & McKoon, 2003; Voss, Rothermund, & Voss, 2004).
In the DDM the decision maker integrates the difference in evidence supporting two alternatives until a positive or negative threshold is crossed, at which point a decision is made in favour of the corresponding alternative.
In its simplest formulation, defined as 'the reduced version', the DDM is the continuous case of a random walk process (Bogacz, Brown, Moehlis, Holmes, & Cohen, 2006) and it is described by the following equation:
$$dx = \mu\,dt + \theta\,dW, \qquad x(0) = 0 \tag{1}$$
where $dx$ is the increment in evidence in a small time window $dt$, $\mu$ denotes the mean increase in evidence per unit time, and $\theta\,dW$ denotes Gaussian white noise with mean zero and variance $\theta^2\,dt$.
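As an illustration only, Equation 1 can be simulated with a simple Euler-Maruyama discretisation. The sketch below is not taken from the original study: the function name, the symmetric decision thresholds at plus and minus `threshold`, the step size `dt`, the time limit and the parameter values are all assumptions chosen for the example.

```python
import math
import random


def simulate_reduced_ddm(mu, theta, threshold, dt=0.001, max_time=10.0, rng=None):
    """Simulate one trial of dx = mu*dt + theta*dW with x(0) = 0.

    Returns (choice, decision_time). choice is +1 or -1 for the upper or
    lower threshold, or 0 if neither threshold is reached before max_time.
    The symmetric thresholds are an assumption of this sketch.
    """
    rng = rng or random.Random()
    x, t = 0.0, 0.0
    noise_sd = theta * math.sqrt(dt)  # sd of theta*dW over one step of length dt
    while t < max_time:
        x += mu * dt + rng.gauss(0.0, noise_sd)
        t += dt
        if x >= threshold:
            return 1, t
        if x <= -threshold:
            return -1, t
    return 0, t


# Illustrative (made-up) parameter values.
rng = random.Random(1)
trials = [simulate_reduced_ddm(mu=1.0, theta=1.0, threshold=1.0, rng=rng)
          for _ in range(500)]
p_upper = sum(1 for choice, _ in trials if choice == 1) / len(trials)
mean_time = sum(t for _, t in trials) / len(trials)
print(f"P(upper) = {p_upper:.2f}, mean decision time = {mean_time:.2f} s")
```

With a positive mean increase in evidence most simulated trials end at the upper threshold; a larger `theta` makes crossings of the lower threshold more frequent.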
Interestingly, the DDM - in its reduced version - implements the Sequential Probability Ratio Test (SPRT; Wald, 1947; Wald & Wolfowitz, 1948), which is the procedure that gives the shortest decision time for a fixed error rate in a two-alternative forced-choice task (Bogacz et al., 2006). It is possible to demonstrate (Bogacz et al., 2006) that as discrete samples are taken more frequently and one approaches continuous-time sampling of a variable, the SPRT converges to Equation 1. In this way, the DDM is statistically optimal for stationary distributions of evidence in conditions in which the subject has to manage a speed-accuracy trade-off (Bogacz et al., 2006). Given this feature of the model, the DDM not only represents a descriptive model of decision making, but has also been proposed as a normative model (Basten, Biele, Heekeren, & Fiebach, 2010; Wang, 2013) towards which, under the influence of natural selection, the decision maker may be supposed to have evolved (but see Pirrone, Stafford, & Marshall, 2014).
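To make the link with the Sequential Probability Ratio Test concrete, the following minimal sketch accumulates the log-likelihood ratio between two hypotheses one sample at a time and stops when a threshold is crossed. It assumes, purely for illustration, two Gaussian hypotheses with equal variance and Wald's approximate thresholds; the function name, the hypotheses and the error rates `alpha` and `beta` are hypothetical and do not come from the study.

```python
import math


def sprt_gaussian(samples, mu0, mu1, sigma, alpha=0.05, beta=0.05):
    """Sequential test of H0: mean = mu0 against H1: mean = mu1.

    Returns (decision, n_samples_used). Thresholds follow Wald's
    approximations for target error rates alpha and beta.
    """
    upper = math.log((1 - beta) / alpha)   # accept H1 at or above this value
    lower = math.log(beta / (1 - alpha))   # accept H0 at or below this value
    llr = 0.0
    n = 0
    for n, x in enumerate(samples, start=1):
        # Log-likelihood ratio contributed by one Gaussian sample.
        llr += (mu1 - mu0) / sigma ** 2 * (x - (mu0 + mu1) / 2)
        if llr >= upper:
            return "H1", n
        if llr <= lower:
            return "H0", n
    return "undecided", n


# Deterministic toy input: every sample equals the H1 mean.
print(sprt_gaussian([1.0] * 20, mu0=0.0, mu1=1.0, sigma=1.0))
```

The running sum `llr` plays the role of the evidence variable x in Equation 1: each sample adds an increment whose mean depends on which hypothesis is true, and the two thresholds correspond to the two decision boundaries.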
A further reason for the popularity of the DDM is that, as shown by Bogacz et al. (2006), other prominent models of choice implement or approximate the DDM under specific parametrizations, with the exception of race models (Vickers, 1970) - models with one accumulator for each alternative, in which the accumulators gather evidence but do not inhibit each other.
Although there are numerous variants of the DDM, here we focus in particular on the version of the DDM as formalised in Ratcliff and McKoon (2008), a more refined and psychologically plausible version of the reduced DDM. This version of the model has seven parameters.
The first, denoted by a, is the boundary separation, which captures the distance between the two thresholds for a decision. When a is small the decision is faster but less accurate since, given noisy fluctuations in the accumulation of evidence, the process is more likely to end up at the wrong boundary; when a is large the decision is slower and more accurate. This parameter can therefore be interpreted as the trade-off between speed and accuracy for a decision. The second is the starting point of evidence accumulation, denoted by z. This parameter can be interpreted as the bias for either response; if z is not equidistant from the boundaries but nearer to one of the two, the subject will be 'biased' towards the choice corresponding to the nearer boundary; when the accumulation of evidence starts at a/2 the process is unbiased. In the case of a biased process, fast reaction times (RTs) towards the nearer boundary and slow RTs towards the opposite boundary are predicted, given that the distance from the decision boundary is small in one case and large in the other. The third is the inter-trial variability of z, defined as sz. The fourth is the drift rate, denoted by v, which represents the mean rate at which information is accumulated over time. This parameter can be interpreted as the quality of the stimulus and the amount of information carried by it for the perceiver. Experimental conditions for which the correct decision is 'easy' will have a higher drift rate compared to more difficult conditions. A further interpretation of this parameter is the sensitivity of a subject towards a stimulus. The accumulation of information varies according to the drift rate and to a fifth parameter, the inter-trial variability in drift rate, denoted by eta. This parameter can be interpreted as the variability in attention or motivation of the decision maker or, in the case of changing stimuli, as the variability in stimulus quality.
The last two parameters of the DDM refer to the non-decision time, since the decision maker has to encode the stimulus and execute the motor response when making a decision. The non-decision component of an RT is denoted by ter and its inter-trial variability is defined as st.
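The role of the seven parameters can be sketched by simulating a single trial. The code below is a hedged illustration rather than the fitting procedure used in this work: it assumes the conventions that are common for this class of model (boundaries at 0 and a, a normally distributed trial-specific drift with mean v and standard deviation eta, a starting point drawn uniformly from a range sz centred on z, and a non-decision time drawn uniformly from a range st centred on ter). The within-trial noise scale `s = 0.1`, the step size and the example parameter values are also assumptions.

```python
import math
import random


def simulate_ddm_trial(a, z, sz, v, eta, ter, st, s=0.1, dt=0.001,
                       max_time=5.0, rng=None):
    """Simulate one trial; returns (response, rt).

    response is 'upper', 'lower', or None if no boundary is reached within
    max_time. The distributional assumptions are described in the text above.
    """
    rng = rng or random.Random()
    drift = rng.gauss(v, eta)                         # trial-specific drift rate
    x = rng.uniform(z - sz / 2, z + sz / 2)           # trial-specific starting point
    non_decision = rng.uniform(ter - st / 2, ter + st / 2)
    noise_sd = s * math.sqrt(dt)
    t = 0.0
    while 0.0 < x < a and t < max_time:
        x += drift * dt + rng.gauss(0.0, noise_sd)
        t += dt
    if x >= a:
        response = "upper"
    elif x <= 0.0:
        response = "lower"
    else:
        response = None
    return response, t + non_decision


# Illustrative (made-up) parameter values with an unbiased start, z = a / 2.
rng = random.Random(0)
results = [simulate_ddm_trial(a=0.12, z=0.06, sz=0.02, v=0.25, eta=0.05,
                              ter=0.3, st=0.1, rng=rng) for _ in range(300)]
n_upper = sum(1 for r, _ in results if r == "upper")
n_lower = sum(1 for r, _ in results if r == "lower")
print(n_upper, n_lower)
```

Moving `z` closer to `a` in this sketch increases the proportion of 'upper' responses and shortens their RTs, which is the bias interpretation of the starting point described above.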
It is interesting to note that the DDM can account for the full range of correct and incorrect RTs and for the probability of correct and wrong answers. Additionally, the DDM offers several advantages in terms of the relation between model parameters, experimental design, and wider theoretical interpretation. The main parameters of the DDM have clear interpretations in terms of psychological processing (e.g., the speed-accuracy trade-off is reflected in the separation of the decision thresholds). Model fitting using the DDM tends to reveal single parameters changing their values to track changes across experimental conditions. Related to both of these points, the intuitive nature of some aspects of DDM function means that changes to experimental design can often produce clear predictions in terms of DDM parameter change.
Relatively few studies have applied the DDM to questions of numeracy judgement (Park & Starns, 2015; Ratcliff, 2006; Ratcliff, Thompson, & McKoon, 2015). However, these examples show the benefits of a DDM decomposition of data in this field. For example, through a series of four numerosity experiments, Ratcliff et al. (2015) found that accuracy is largely dependent upon drift rate while RTs are determined by threshold settings. The values of drift rate and boundary separation were correlated across tasks but, interestingly, these two parameters were not correlated across subjects. With four further experiments in which speed and accuracy instructions were manipulated, the authors replicated the accuracy-drift correlation, the RT-boundary correlation and the consistency across tasks; however, between-subjects differences were maintained even when the internal response criterion of subjects was manipulated. This result shows the benefit of a computational decomposition of data and lays the foundation for understanding the contrasting results regarding the presence or absence of a correlation between RTs and accuracy in numeracy judgements. A second important application of the DDM to numeracy judgements comes from Park and Starns (2015), who were interested in acuity measures of the approximate number system - the cognitive system that allows numerosity to be estimated non-linguistically. Traditionally, measures of acuity of the approximate number system only involve accuracy. However, using the DDM, the authors show that measures of acuity based only on accuracy cannot account for the speed-accuracy trade-off confounds that affect acuity measurements. This means that traditional measures of acuity are likely to be inaccurate, since they are contaminated by speed-accuracy trade-off confounds.
Furthermore, the authors found that drift rate is a better predictor of symbolic mathematical ability compared to previously proposed measures.
Directly comparing the semantic congruity effect theories described above is beyond the scope of this work, since some of them are not framed within the evidence accumulation framework. Here, we bring the semantic congruity effect within the same framework as many other decision phenomena; we use the Drift Diffusion Model (Ratcliff & McKoon, 2008) and show how it can account for the semantic congruity effect, by fitting it to behavioural data from a magnitude comparison experiment conducted with human subjects. Since the semantic congruity effect manifests in changes in decision time, the use of the DDM, which explicitly considers the time course of decision-making, is natural. In contrast, some of the heuristic proposals outlined above lack such a formal description of how decisions evolve over time, or, when they do specify how the decision evolves, they adopt ad-hoc models that only make predictions for the specific task and cannot be generalised to other tasks or domains. A unifying framework such as the DDM overcomes the limitations of task-specific models. Furthermore, with a DDM decomposition we can investigate which decision parameters account for the semantic congruity effect. Together with the explanations already proposed (i.e., drift rate or starting point), other parameters that have never been taken into account, such as non-decision time or boundary separation, could play a role in the semantic congruity effect. For example, the non-decision time, which has never been considered in the previous literature, could also contribute to a semantic congruity effect, given that the congruency or incongruency between the magnitude of the stimuli (or between the instructions and the relative sizes of the target and standard stimulus) could affect the motor response of the subjects.
Usually, in two-alternative forced choice tasks, parameters such as the starting point of evidence accumulation or the boundary separation are assumed to take time to change and are assumed to be set before the stimulus appears (Bogacz et al., 2006); here, however, we assume that the size of the standard, to which subjects pay attention first during the trial presentation, is apprehended quickly, and that it affects the decision process. Similar mechanisms that affect the early stages of a decision are described in the literature; for example, Provost and Heathcote (2015) provided a similar explanation for a mental rotation task, and in their computational investigation they found that participants adjusted their boundary separation on the basis of a property of the stimulus, rotation angle. Also, it should be noted that typically in the kind of tasks in which the DDM is used, subjects evaluate one single stimulus; in this case a change in decision parameters cannot be contingent on the outcome of the decision. However, in our case one feature of the stimulus, the size of the standard stimulus, to which subjects pay attention first, can affect the subsequent discrimination of the target.
In our experiment, participants had to decide whether a target stimulus was smaller or bigger than a standard array; hence ours is a classification paradigm. A stimulus example is reported in Figure 2. Our experiment presents some differences from semantic congruity tasks in which the direction of the comparison is explicitly given. However, with our task we elicit a semantic congruity effect similarly to what was done before by other authors (e.g., Dehaene, 1989; Link, 1990; Mewhort, Smith, & Kohly, 1996).
Four right-handed subjects (one male; mean age = 20.5 years, SD = 3.2) with normal or corrected-to-normal vision participated voluntarily in the experiment in exchange for course credits. Each participant was tested in four sixty-minute sessions on different days. The experiment was approved by the University of Sheffield, Department of Psychology Ethics Sub-Committee, and carried out in accordance with the University and British Psychological Society ethics guidelines; subjects gave their informed consent before taking part.
The experiments were programmed in Matlab, using the Psychophysics Toolbox (Brainard, 1997; Kleiner et al., 2007; Pelli, 1997). We used a modification of an established perceptual decision task (Gertner, Arend, & Henik, 2012; Piazza et al., 2010; Piazza, Fumarola, Chinello, & Melcher, 2011; Revkin, Piazza, Izard, Cohen, & Dehaene, 2008; Revkin, Piazza, Izard, Zamarian, et al., 2008) and a type of 'congruity' task similar to that used by Link (1990) and Dehaene (1989): as in those studies, subjects decided whether a target stimulus was bigger or smaller than a standard stimulus, although Link (1990) and Dehaene (1989) used two-digit numbers in their experiments. In our task, participants judged, without counting, whether a cluster of dots presented at the bottom of a laptop screen was 'smaller' or 'bigger' in numerosity than one presented at the top of the screen, and responded by button press.
On each trial, one array - the standard - contained a fixed numerosity (12 dots for one third of the trials, 24 dots for one third of the trials, 36 dots for the other third), and the other array - the target - contained a varying numerosity that was smaller or bigger than the fixed numerosity by one of seven possible ratios. The ratio defined the difficulty of the judgement, with ratios closer to 1 being harder. The seven ratios, in order of increasing difficulty, were 0.42, 0.50, 0.58, 0.66, 0.75, 0.83, 0.91. The absolute number of dots in each choice pair and a description of the conditions are shown in Table 1.
| Condition | N of dots (standard vs target) | Ratio | Magnitude of standard | Target (compared to standard) is |
|---|---|---|---|---|
| 1 | 12 vs 5 | 0.42 | small | smaller |
| 2 | 12 vs 6 | 0.50 | small | smaller |
| 3 | 12 vs 7 | 0.58 | small | smaller |
| 4 | 12 vs 8 | 0.66 | small | smaller |
| 5 | 12 vs 9 | 0.75 | small | smaller |
| 6 | 12 vs 10 | 0.83 | small | smaller |
| 7 | 12 vs 11 | 0.91 | small | smaller |
| 8 | 12 vs 19 | 0.42 | small | bigger |
| 9 | 12 vs 18 | 0.50 | small | bigger |
| 10 | 12 vs 17 | 0.58 | small | bigger |
| 11 | 12 vs 16 | 0.66 | small | bigger |
| 12 | 12 vs 15 | 0.75 | small | bigger |
| 13 | 12 vs 14 | 0.83 | small | bigger |
| 14 | 12 vs 13 | 0.91 | small | bigger |
| 15 | 24 vs 10 | 0.42 | medium | smaller |
| 16 | 24 vs 12 | 0.50 | medium | smaller |
| 17 | 24 vs 14 | 0.58 | medium | smaller |
| 18 | 24 vs 16 | 0.66 | medium | smaller |
| 19 | 24 vs 18 | 0.75 | medium | smaller |
| 20 | 24 vs 20 | 0.83 | medium | smaller |
| 21 | 24 vs 22 | 0.91 | medium | smaller |
| 22 | 24 vs 38 | 0.42 | medium | bigger |
| 23 | 24 vs 36 | 0.50 | medium | bigger |
| 24 | 24 vs 34 | 0.58 | medium | bigger |
| 25 | 24 vs 32 | 0.66 | medium | bigger |
| 26 | 24 vs 30 | 0.75 | medium | bigger |
| 27 | 24 vs 28 | 0.83 | medium | bigger |
| 28 | 24 vs 26 | 0.91 | medium | bigger |
| 29 | 36 vs 15 | 0.42 | big | smaller |
| 30 | 36 vs 18 | 0.50 | big | smaller |
| 31 | 36 vs 21 | 0.58 | big | smaller |
| 32 | 36 vs 24 | 0.66 | big | smaller |
| 33 | 36 vs 27 | 0.75 | big | smaller |
| 34 | 36 vs 30 | 0.83 | big | smaller |
| 35 | 36 vs 33 | 0.91 | big | smaller |
| 36 | 36 vs 57 | 0.42 | big | bigger |
| 37 | 36 vs 54 | 0.50 | big | bigger |
| 38 | 36 vs 51 | 0.58 | big | bigger |
| 39 | 36 vs 48 | 0.66 | big | bigger |
| 40 | 36 vs 45 | 0.75 | big | bigger |
| 41 | 36 vs 42 | 0.83 | big | bigger |
| 42 | 36 vs 39 | 0.91 | big | bigger |
There were 42 conditions in total: seven increasing ratios (i.e., increasing difficulty) for each of three levels of standard stimulus magnitude (small, medium and big) and for each type of correct response, 'smaller' or 'bigger' (i.e., the target stimulus was bigger than the standard on half of the trials and smaller on the other half). On each trial, subjects had to decide whether the target stimulus was smaller or bigger than the standard stimulus by pressing 'left' or 'right' on the keyboard. On the basis of the results of previous pilot studies, conditions were chosen so that for each standard stimulus accuracy levels would range from floor to ceiling.
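The structure of the design can be reproduced programmatically. The sketch below rebuilds the 42 conditions under two assumptions inferred from the pattern of Table 1 rather than stated in the text: the 'smaller' targets for the 24-dot and 36-dot standards are the 12-dot targets multiplied by 2 and 3, and each 'bigger' target lies as far above the standard as the corresponding 'smaller' target lies below it. The ratio labels are copied from Table 1 rather than recomputed, and all names are hypothetical.

```python
STANDARDS = {"small": 12, "medium": 24, "big": 36}
RATIO_LABELS = [0.42, 0.50, 0.58, 0.66, 0.75, 0.83, 0.91]  # labels as in Table 1
BASE_SMALLER_TARGETS = [5, 6, 7, 8, 9, 10, 11]  # 'smaller' targets for the 12-dot standard


def build_conditions():
    """Return the conditions as a list of dicts, numbered from 1."""
    conditions = []
    for magnitude, standard in STANDARDS.items():
        scale = standard // 12
        smaller = [scale * n for n in BASE_SMALLER_TARGETS]
        # Assumed symmetry: same absolute difference above the standard.
        bigger = [2 * standard - n for n in smaller]
        for direction, targets in (("smaller", smaller), ("bigger", bigger)):
            for ratio, target in zip(RATIO_LABELS, targets):
                conditions.append({
                    "condition": len(conditions) + 1,
                    "standard": standard,
                    "target": target,
                    "ratio": ratio,
                    "magnitude": magnitude,
                    "target_is": direction,
                })
    return conditions


conditions = build_conditions()
for row in conditions[:3]:
    print(row)
```

Listing the design in this form makes it easy to check that every ratio appears once for each combination of standard magnitude and correct response.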
To avoid participants relying upon continuous quantities associated with numerosity (i.e., dot size and envelope area), in this experiment the dot arrays were generated following the method and the MATLAB code provided by Gebuis and Reynvoet (2012). This method was used to produce four sets of images with all possible combinations of correlation (positive vs. negative) between the two features of the stimuli (envelope area, dot size) and dot number.
Throughout the experiment, subjects rested their head on a chin rest at a viewing distance of 57 cm from the screen of a 14-inch laptop monitor (Dell Latitude E5430) with a refresh rate of 60 Hz. Subjects were required to fixate a red cross at the centre of the screen. The two dot arrays were presented simultaneously on the screen at ± 4.25 degrees of visual angle from the fixation cross, and participants were asked to judge whether the cluster presented at the bottom of the screen was bigger or smaller than the one presented at the top by pressing 'left' or 'right' on a keyboard. Each dot was randomly assigned an item size ranging between 0.08 and 0.59 degrees of visual angle. If subjects answered in less than 300 ms or more than 3000 ms, the sentence 'Too fast!' or 'Too slow!' was displayed on the screen. After giving a response, subjects were presented with a fixation cross that varied in size over the course of 600 ms (i.e., small and then bigger, twice), as a warning signal to pay attention to the centre of the screen, after which a new trial was presented. Trials were presented in random order within each day of the experiment. In each session, participants performed 50 trials per condition after a training phase of 1 trial per condition that familiarized them with the task. Subjects participated in 4 different sessions on 4 different days (within a week from the first session), for a total of 200 trials per condition and 8400 trials for the whole experiment.
Figure 3 shows the psychometric functions averaged across subjects. This figure shows the imbalance in response probability due to the semantic congruity effect. The probability of answering bigger increases with the standard, although the same ratios across conditions are maintained; when the magnitude of the standard was small the probability of answering 'bigger' was lower compared to when the magnitude of the standard was big.
Figure 4 shows mean correct RTs as a function of the experimental condition when data are collapsed across participants and RTs lower than 0.3 s and bigger than 3 s are eliminated (about 0.5% of the data). The second column of plots of Figure 4 shows mean accuracy averaged across participants. The two plots on the top row show mean RTs and accuracy for conditions for which the standard stimulus was 'small' (i.e., it had 12 dots), the two plots on the middle row show mean RTs and accuracy for conditions for which the standard stimulus was 'medium' (i.e., 24 dots) and the two plots on the bottom row show mean RTs and accuracy for conditions for which the standard stimulus was 'big' (i.e., 36 dots).
Figure 3 and Figure 4 clearly show the presence of a semantic congruity effect, given that subjects, for conditions having the same ratio (e.g., 12 and 5 dots vs 12 and 19 dots), have different RTs and especially different accuracy depending on the congruency between size of the standard and of the target stimulus.
We entered correct RTs and accuracy levels in two separate mixed-effect regressions with ratio, magnitude of standard (abbreviated as 'magnitude') and correct response category (abbreviated as CRC) as predictors. In each regression, we included random effects for subject-specific constants and slopes. Regarding correct RTs, the regression showed that magnitude affected RTs, B = .228, 95% CI [.137, .318], t = 4.977, p < .001, with RTs increasing as magnitude increased. In particular, for each increase in magnitude, RTs increased by between .137 s and .318 s. This effect suggests that subjects were biased towards answering 'smaller'. CRC affected RTs, B = .449, 95% CI [.333, .564], t = 7.628, p < .001, with RTs being higher when the CRC was 'bigger' compared to when it was 'smaller'. This effect also suggests a bias towards answering 'smaller'. As expected, ratio affected RTs, B = .462, 95% CI [.223, .701], t = 3.794, p < .001, with RTs increasing as ratio (i.e., difficulty) increased. The interaction of magnitude and CRC affected RTs, B = -.155, 95% CI [-.227, -.082], t = -4.322, p < .001.
As shown in Figure 4, when magnitude increased and CRC was 'bigger', RTs decreased, while when magnitude increased and CRC was 'smaller', RTs increased. Ratio by CRC affected RTs, B = -.201, 95% CI [-.365, -.037], t = -2.406, p = .016. As expected and as shown in Figure 4, the effect of CRC was larger for difficult discriminations compared to easy discriminations. Regarding accuracy, the binary logistic mixed effect regression showed that magnitude affected accuracy, B = -2.903, 95% CI [-4.408, -1.397], Exp(B) = .055, t = -3.986, p = .001, with accuracy decreasing when magnitude increased. CRC affected accuracy, B = -5.665, 95% CI [-7.235, -4.096], Exp(B) = .003, t = -7.388, p < .001, with accuracy decreasing when the CRC was 'bigger', confirming that subjects were generally biased towards answering 'smaller'. Ratio affected accuracy, B = -8.534, 95% CI [-10.673, -6.395], Exp(B) < .001, t = -7.912, p < .001, with accuracy decreasing when difficulty increased. The interaction of magnitude and CRC affected accuracy, B = 1.670, 95% CI [.213, 3.126], Exp(B) = 5.31, t = 2.394, p = .027. As shown in Figure 3 and Figure 4, when magnitude increased and CRC was 'bigger', accuracy increased, while when magnitude increased and CRC was 'smaller', accuracy decreased. Ratio by CRC affected accuracy, B = 3.139, 95% CI [1.454, 4.824], Exp(B) = 23.079, t = 3.768, p = .001, confirming that the effect of CRC was larger for difficult discriminations compared to easy discriminations.
To fit the DDM to our data we used the Diffusion Model Analysis Toolbox (DMAT; Vandekerckhove & Tuerlinckx, 2007, 2008) for Matlab (version 2013b). Among the options available, we used as objective function a chi-square function. We decided to represent the RT distributions of responses in terms of six bins, defined by the boundaries of the conventional .1, .3, .5, .7 and .9 quantile bins dividing the correct and error RT distributions (Vandekerckhove & Tuerlinckx, 2007). In DMAT, the observed response frequencies are compared to the expected response frequencies and a chi-square statistic is minimised to find the best fitting parameters.
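The binning step can be illustrated with a short sketch. It is not DMAT code (DMAT is a Matlab toolbox): it only shows, for a single response in a single condition, how five RT quantiles delimit six bins and how observed and expected bin frequencies enter a chi-square statistic. The function names, the toy RTs and the predicted bin proportions are made up for the example; in an actual fit the expected frequencies would come from the model's predicted RT distributions, summed over both responses and all conditions.

```python
import bisect
import statistics


def quantile_edges(rts):
    """Return the .1, .3, .5, .7 and .9 quantiles of a list of RTs."""
    deciles = statistics.quantiles(rts, n=10, method="inclusive")
    return [deciles[i] for i in (0, 2, 4, 6, 8)]


def bin_counts(rts, edges):
    """Count the RTs falling in the six bins delimited by five edges."""
    counts = [0] * (len(edges) + 1)
    for rt in rts:
        counts[bisect.bisect_right(edges, rt)] += 1
    return counts


def chi_square(observed, expected):
    """Pearson chi-square between observed and expected frequencies."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Toy data: 100 evenly spaced RTs (in seconds).
observed_rts = [0.3 + 0.01 * i for i in range(100)]
edges = quantile_edges(observed_rts)
observed = bin_counts(observed_rts, edges)

# Hypothetical proportions that a candidate model assigns to the same bins.
predicted_props = [0.12, 0.20, 0.20, 0.20, 0.18, 0.10]
expected = [p * len(observed_rts) for p in predicted_props]
print(observed, round(chi_square(observed, expected), 3))
```

A fitting routine would adjust the model parameters so that the expected frequencies move towards the observed ones and the statistic is minimised.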
For each participant the drift could be (i) fixed across conditions, or (ii) free to vary across conditions; the boundary separation could be (i) fixed across conditions, or (ii) free to vary across conditions; the starting point could be (i) fixed across conditions, or (ii) free to vary across conditions; and finally the non-decision time could be (i) fixed across conditions, or (ii) free to vary across conditions. Across-trials variabilities in drift, non-decision time and starting point were kept constant across conditions in order to avoid over-fitting. It should be noted that we also fitted a series of models in which, when a parameter was free to vary, its across-trials variability parameter was also free to vary. However, in this case DMAT warned that the variability parameters were not identified by the data or that their standard error estimates were biased. In theory, we could have used bootstrapping for an estimation of the parameters and their standard errors, but given the number of models to be fitted and the number of iterations required for the bootstrapping, this would have been computationally intensive (i.e., the fitting would have taken days to complete). Furthermore, when across-trials variabilities were fixed across conditions, DMAT did not provide warnings for the best model, so we opted for this option.
All possible combinations of models were fitted to each individual, resulting in a total of 16 models per participant. To assess which model best satisfies the trade-off between simplicity and goodness of fit, we used a statistical criterion for model selection, the Bayesian Information Criterion (BIC; Raftery, 1995), calculated as −2 · loglikelihood(data|model) + k · logN, where k is the number of free parameters in the model and N the total number of observations. The BIC is a measure of goodness of fit to which a penalty for the introduction of parameters is added. The best model is the model with the lowest BIC value; as proposed in Kass and Raftery (1995), a difference of ten in BIC scores between two models is considered strong evidence in favour of the model with the lower BIC score. For all participants, the model in which only the drift rate was allowed to vary across conditions was selected as the best model by a wide margin: the difference in BIC scores between the best and the second-best model was always greater than 75.
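The model space and the selection criterion can be written down compactly. The sketch below enumerates the 2^4 = 16 combinations of fixed and free parameters and implements the BIC formula given above. The way free parameters are counted in `count_free_parameters` (one value per condition for a free parameter, one shared value for a fixed parameter, plus three variability parameters) is an assumption made for illustration and may differ from how DMAT counts them; all names are hypothetical.

```python
import itertools
import math

PARAMETERS = ("drift", "boundary_separation", "starting_point", "non_decision_time")


def enumerate_models():
    """Return every fixed/free combination; True means free across conditions."""
    return [dict(zip(PARAMETERS, free))
            for free in itertools.product((False, True), repeat=len(PARAMETERS))]


def count_free_parameters(model, n_conditions=42, n_variability=3):
    """Hypothetical parameter count for one model (see assumptions above)."""
    main = sum(n_conditions if is_free else 1 for is_free in model.values())
    return main + n_variability


def bic(log_likelihood, k, n_obs):
    """BIC = -2 * loglikelihood + k * log(N)."""
    return -2.0 * log_likelihood + k * math.log(n_obs)


models = enumerate_models()
drift_only = next(m for m in models if m["drift"] and sum(m.values()) == 1)
print(len(models), count_free_parameters(drift_only))
```

Given a log-likelihood for each fitted model, the preferred model is the one with the smallest `bic` value, and differences between models can be read against the threshold of ten mentioned above.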
As is clear from the drift rates recovered from the fitting for each participant (Figure 5), the drift rate was (i) a function of the ratio between the standard and the target stimulus (i.e., the higher the ratio, the lower the drift) and (ii) a function of the magnitude of the standard (i.e., when the magnitude of the standard increases, the drift shifts towards the boundary for the response 'bigger'). In Figure 5, positive drift values indicate that the process drifted towards the threshold for the response 'bigger', while negative drift values indicate that the process was directed towards the boundary for the response 'smaller'. Figure 5 also shows that in general participants were more biased towards answering 'smaller', given that the slope for this alternative is generally steeper than the slope for the opposite response; this is in line with the behavioural analyses showing a main effect of CRC on accuracy and RTs.
A linear regression on drift rates showed a main effect of CRC, B = .570, 95% CI [.408, .731], t = 6.961, p < .001, with drifts being higher when the CRC was 'smaller' compared to when it was 'bigger'. Magnitude affected drift, B = .065, 95% CI [.012, .118], t = 2.435, p = .016; as the magnitude increased, the slope of the drift rate increased, suggesting an additive influence of the magnitude of the standard on the drift rates. As expected, ratio affected drift rates, B = .319, 95% CI [.152, .486], t = 3.767, p < .001, with drift rates being higher when the comparison was simpler. The interaction of CRC and ratio was significant, B = -.534, 95% CI [-.770, -.298], t = -4.466, p < .001. In line with the behavioural results, this shows that the effect of CRC was stronger for difficult conditions.
Interestingly, in our plots the drift rates are fitted on a linear scale of the ratio of the two numerosities being compared (best linear fit reported in Figure 5). This is not in line with the results of Park and Starns (2015), in whose study drift rates followed the logarithm of the ratio of the two numbers being compared.
The remaining parameters and their standard errors estimated by DMAT, for each participant, are shown in Table 2.
| Participant ID | Parameter estimate and SE | a | ter | eta | z | sz | st |
Fits of the model to the data are shown as quantile probability plots (Figure 6). Quantile probability plots are a powerful way of showing goodness of fit: the x-axis shows the probability of a correct and of an incorrect response, for both the model and the data, while the y-axis shows the RT quantiles that divide the distributions of correct and incorrect responses, again for both the model and the data. Here, we show the conventional .1, .3, .5, .7 and .9 quantiles of the RT distributions for correct and error responses (Ratcliff & McKoon, 2008).
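The quantities plotted for each condition can be sketched as follows. This is an illustrative Python snippet, not the code used to produce Figure 6: the function names and the trial format are assumptions, and the quantile estimator (linear interpolation between order statistics) is one common choice among several.

```python
import math

QUANTILE_LEVELS = (0.1, 0.3, 0.5, 0.7, 0.9)

def quantile(values, p):
    """Quantile by linear interpolation between order statistics."""
    xs = sorted(values)
    pos = p * (len(xs) - 1)
    lo = int(math.floor(pos))
    hi = min(lo + 1, len(xs) - 1)
    frac = pos - lo
    return xs[lo] * (1.0 - frac) + xs[hi] * frac

def qp_points(trials):
    """Summarise one condition for a quantile probability plot.

    `trials` is a list of (is_correct, rt) pairs. The result gives, for
    correct and error responses, the x-coordinate (response probability)
    and the y-coordinates (RT quantiles), or None if there are no such trials.
    """
    correct = [rt for ok, rt in trials if ok]
    errors = [rt for ok, rt in trials if not ok]
    p_correct = len(correct) / len(trials)

    def summarise(rts):
        return [quantile(rts, p) for p in QUANTILE_LEVELS] if rts else None

    return {
        "correct": (p_correct, summarise(correct)),
        "error": (1.0 - p_correct, summarise(errors)),
    }
```

Each condition therefore contributes two columns of five points: one column at the probability of a correct response and one at the probability of an error.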
In Figure 6, we compare the predictions of the model based on the parameters averaged across individuals (represented by the lines) with the observed data pooled across individuals (represented by 'x'). Figure 6 has six plots: the two plots at the top show conditions in which the standard was small, the plots in the middle show conditions in which the standard was medium, and the plots at the bottom show conditions in which the standard was big. The plots on the left of Figure 6 show conditions in which the correct response category was 'smaller', while the plots on the right show conditions in which the correct response category was 'bigger'. Note that, as the behavioural analyses show, for conditions with a high ratio (i.e., high difficulty), the overall performance of subjects dropped below chance in some cases. As a consequence, for these conditions, the probability of a correct choice lies on the left of the graph and the probability of an incorrect choice lies on the right, mostly near chance level. In general, for conditions with highly discriminable stimuli (i.e., conditions with a low ratio), little weight should be given to the quantiles for error responses, since subjects made very few errors in these extreme conditions and the quantiles are therefore based on a very limited and potentially unreliable number of measurements.
The quantile probability plots show that the model obtained from our fitting captures the averaged data well, especially considering that the data are averaged across four experimental sessions, with clear repercussions on the within-subject variability, and considering the high number of conditions in this study.
As we described earlier, several theories have been proposed to explain the semantic congruity effect (Banks et al., 1975; Banks & Flora, 1977; Banks et al., 1976; Holyoak & Mah, 1982; Marschark & Paivio, 1979; Ryalls et al., 1998). Here, we have adopted a computational framework, the drift diffusion model (DDM), which is psychologically plausible, mathematically rigorous and has been shown to fit data in various psychological tasks (Ratcliff, 1978, 2002; Ratcliff & McKoon, 1988; Ratcliff & Rouder, 1998; Ratcliff et al., 2004, 1999; Thapar et al., 2003; Voss et al., 2004). Our results show that the DDM (Ratcliff, 1978; Ratcliff & McKoon, 2008; Ratcliff et al., 1999) can account for the data in an experiment in which we elicited a semantic congruity effect.
We found that the changes in decision time and accuracy associated with our manipulation can be best explained by a change in the drift rate. The drift rate is associated with the discriminability of the experimental condition, as is commonly assumed in the DDM, but it is also affected by the magnitude of the standard stimulus. This effect suggests that subjects first assessed the numerosity of the standard, and that the magnitude of the standard biased them towards one of the two response categories. In particular, when the standard was small subjects were biased towards answering 'smaller'; conversely, when the standard was big subjects were biased towards answering 'bigger'. Specifically, in this study subjects may have learnt to use the two extreme standard magnitudes as reference points for the values 'small' and 'big', since over the four experimental sessions the numerosity of the standard took only three possible values. This strategy would result in the pattern observed in the data, with subjects being faster and more accurate in judging which of the two stimuli was bigger/smaller when there was congruency between the magnitude of the standard and that of the target. Interestingly, for the medium-magnitude standard, which is equidistant from the small and the big standard magnitudes, the semantic congruity effect cancels out. The decision process in this study can be described as a two-stage DDM. First, subjects had to assess the size of the standard stimulus. Afterwards, subjects had to assess whether the target stimulus was bigger or smaller than the standard; in case of congruency between the size of the standard and the relative size of the target, the response was faster and more accurate (i.e., drift rates were higher).
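To illustrate how a change in drift rate alone can make responses both faster and more accurate, the following Python sketch simulates a basic diffusion process with two boundaries. It is not the fitting procedure used in this study. All parameter values are hypothetical, the across-trial variability parameters are omitted, and the additive term `magnitude_shift` is just one possible way of expressing the influence of the standard's magnitude on the drift.

```python
import math
import random

def simulate_trial(drift, rng, boundary=1.0, start=0.5, ter=0.3,
                   noise=1.0, dt=0.002):
    """Simulate one trial; returns (response, rt in seconds).

    Evidence starts at start * boundary and accumulates in small steps
    until it reaches 0 ('smaller') or `boundary` ('bigger').
    """
    x = start * boundary
    t = 0.0
    step_sd = noise * math.sqrt(dt)
    while 0.0 < x < boundary:
        x += drift * dt + rng.gauss(0.0, step_sd)
        t += dt
    response = "bigger" if x >= boundary else "smaller"
    return response, ter + t  # ter is the non-decision time

def simulate_condition(base_drift, magnitude_shift=0.0, n_trials=1000, seed=0):
    """Return (proportion of 'bigger' responses, mean RT) for one condition."""
    rng = random.Random(seed)
    drift = base_drift + magnitude_shift  # assumed additive influence
    n_bigger = 0
    total_rt = 0.0
    for _ in range(n_trials):
        response, rt = simulate_trial(drift, rng)
        n_bigger += response == "bigger"
        total_rt += rt
    return n_bigger / n_trials, total_rt / n_trials

# Target bigger than the standard (positive base drift), hypothetical values:
congruent = simulate_condition(1.5, magnitude_shift=1.5)     # big standard
incongruent = simulate_condition(1.5, magnitude_shift=-1.5)  # small standard
```

In this sketch the congruent condition has the larger net drift towards 'bigger', so it tends to produce more 'bigger' responses and shorter mean RTs than the incongruent condition, while boundary separation, starting point and non-decision time are held constant.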
The main result of this study is in line with reference point models (see Chen et al., 2014; Dehaene, 1989), and it is relevant for theories in which the congruency between the magnitude of the stimulus and the direction of the comparison affects the strength of the evidence signal (Leth-Steensen & Marley, 2000; Petrusic et al., 2008). However, a key point of the models proposed by Leth-Steensen and Marley (2000) and by Petrusic et al. (2008) is that the semantic congruity effect arises when there is congruency between the comparison instruction and the relative size of the stimuli, whereas in our case the semantic congruity effect is driven by the magnitude of the standard stimulus. Further theoretical work, in which such theories are framed within a DDM framework, and further experimental work, in which the direction of the comparison is explicitly given, are needed to test the explanations of Leth-Steensen and Marley (2000) and Petrusic et al. (2008), given that the experimental paradigm presented here and the conceptual explanation that we provide differ greatly from their conceptualisation of a similar phenomenon (i.e., the semantic congruity effect in the selection paradigm). For these theories, it has been proposed (Leth-Steensen, Petrusic, & Shaki, 2014) that the relative size of the stimulus pair 'primes' the corresponding congruent form of the instruction, resulting in facilitation in the case of congruency. Our result is, in principle, also in line with an account in which the relative size of the stimulus pair 'primes' the corresponding congruent response category, resulting in facilitation in the case of congruency. However, in our case, it is not clear why an assessment of the overall size of the stimulus pair would be necessary, since it is not explicitly required.
Nevertheless, it should be noted that an effect of the overall magnitude of the stimulus pair is in line with recent findings showing that, in two-alternative forced-choice tasks, participants integrate both absolute and relative evidence (Starns, Chen, & Staub, 2017). If this is the case, subjects may not be able to ignore the absolute size of the two stimuli, which could in principle also account for our results. However, the finding that semantic congruity effects arise even when the standard stimulus and the target stimulus are presented sequentially (Dehaene, 1989; Link, 1990) seems to undermine the role of the size of the stimulus pair in explaining the semantic congruity effect in classification paradigms. Given that subjects are presented with a target after the presentation of a standard, assessing the overall magnitude of the stimuli means assessing the magnitude of the target itself, and this is a clearly problematic assumption (Leth-Steensen & Marley, 2000): once the magnitude of the target has been assessed, the response can be executed without any need to bias the decision. This argument leads us to conclude that, even though semantic congruity effects in classification and selection paradigms may both be due to an increase in the rate at which evidence is accumulated, the decision processes that subjects face in the two tasks differ greatly; hence it is reasonable to expect that different conceptual explanations are needed for the two tasks.
Here, we invalidate theories that interpret the semantic congruity effect as a modification of the starting point of evidence accumulation (Birnbaum & Jou, 1990; Link, 1990; Link & Heath, 1975; Poltrock, 1989). Moreover, the other principal theories that have been proposed to explain the semantic congruity effect, namely the expectancy effect (Banks & Flora, 1977; Marschark & Paivio, 1979), the semantic coding model (Banks et al., 1975, 1976) and the frequency explanation (Ryalls et al., 1998), seem to be already falsified by the contrasting results presented in the introduction. In addition, the expectancy theory and the semantic coding model do not apply to our study, given that they depend on the direction of the comparative instruction, which is not used in the current task.
The choice of previous authors not to consider other decision mechanisms (e.g., boundary separation or non-decision time) in accounting for semantic congruity effects is questionable. Here, we show directly, through the model selection procedure, that these neglected mechanisms (variations in boundary separation or non-decision time) do not play a role in the semantic congruity effect.
Our application of the DDM further highlights its heuristic power, and shows that different phenomena that have previously been explained by descriptive and/or task-specific theories can be accounted for by sequential sampling models of evidence accumulation and decision making when the focus is shifted to the computational level of analysis. Our formal account of this phenomenon is parsimonious, as it uses a unifying model of choice rather than proposing an ad hoc model to explain the phenomenon, and rigorous, as we account for the full distributions of correct and error responses by taking into consideration all the cognitive processes that underlie a decision.