Humans are numerical thinkers. Adults often use efficient processes for generating and retrieving arithmetic sums (LeFevre, Bisanz, & Mrkonjic, 1988; Rivera, Reiss, Eckert, & Menon, 2005) that are so automatic that they can interfere with judgments on unrelated number tasks (LeFevre et al., 1988). For instance, after simultaneously viewing the cue digits 5 and 3 within a classic priming study paradigm, adults take longer to reject the target stimulus 8 as a potential match for either cue digit than to reject nearby nonsum targets, such as 4, 7, or 9 (Figure 1a). This “obligatory” arithmetic—the LeFevre interference effect—raises questions about whether adults will engage in automatic number processing when doing so is contextually inappropriate or even counterproductive. In this study, we consider how variation in semantic context interacts with the LeFevre effect, specifically whether the effect occurs only when semantic context indicates that arithmetic operations are appropriate.
Although LeFevre et al. (1988) did not consider effects of context on the interference effects they described, Bassok and colleagues did (e.g., Bassok, Pedigo, & Oskarsson, 2008). Drawing from their work on effects of context in word problems (Bassok, Chase, & Martin, 1998) and from research on modulation of automatized cognitive processes (e.g., Besner, Stolz, & Boutilier, 1997), Bassok et al. demonstrated that the LeFevre interference effect is modulated by the nouns presented with cue digits. They proposed that categorically aligned noun pairs (e.g., tulips, daisies) support addition because they are appropriate to combine; whereas noun pairs do not support addition when they are either categorically misaligned (e.g., beans, planes) or related functionally but not categorically (e.g., pages, books). Indeed, the adults in their study showed the LeFevre interference effect only when cue digits were paired with categorically aligned nouns. The effect was absent when nouns were categorically misaligned (Figure 1b).
This Bassok effect demonstrates that automatic arithmetic is modulated by semantic alignment. Here we modified Bassok’s Number Matching task to further specify modulation of the LeFevre interference.^{i} We specifically modified the misaligned condition, which in Bassok’s version included multiple types of misalignment. Some of their misaligned sets included only tangible, concrete nouns (e.g., hens, radios), whereas other sets included both concrete and abstract, intangible nouns (e.g., tractors, messages). We propose that combinations of concrete and intangible referents are especially inconducive to automatic arithmetic because they are less likely to generate a plausible rationale for addition, compared to combinations of concrete nouns (aligned or misaligned). To test this hypothesis, we included two misaligned conditions and a categorically aligned condition in our version of the task (Figure 1c).
We also explored contextual interference with automatic arithmetic at the level of full sentences, based on evidence that semantic misalignment affects accuracy and event-related potential (ERP) responses during word problem solving and verification (e.g., Bassok, Chase, & Martin, 1998; Fisher & Bassok, 2009; Guthormsen et al., 2016). We developed a Sentence Verification task using a priming structure similar to the one used in the Number Matching task, but also built on the verification paradigms of previous semantic misalignment experiments (Guthormsen et al., 2016). Our Sentence Verification task included cue sentences that either implicated multiplication (e.g., Jill carried 2 heavy 6-packs of root beer) or did not (e.g., Jeff used 2 pans to make 6 omelets), and participants judged whether a subsequent target statement was likely to be true, based on the preceding cue sentence. We tested whether the implication of multiplication in cue sentences modulated the priming effect of arithmetic products in target sentences, just as the Number Matching task tested for effects of categorical alignment on the priming effect of sums (Figure 1c). Together, these two tasks lay the groundwork for a better understanding of when adults do or do not compute numbers in context.
We also pursued two secondary aims concerning how semantic alignment effects generalize across settings and persons. First, we attempted to replicate Bassok et al.’s priming study results (2008) under conditions in which expectations for addition were less explicit, by replacing the plus sign (+) fixation point in their study with a black square. This modification provides stronger evidence that semantic context modulates obligatory arithmetic in the absence of computational symbols. Second, we explored individual differences in sensitivity to semantic misalignment and whether individuals’ susceptibility to these contextual effects is associated with mathematics or reading achievement level.
Method
Participants
Participants were 92 students (61 females) enrolled in undergraduate (n = 86) or graduate programs at the University of Minnesota, who identified English as their primary language. These volunteers were predominantly white (n = 64) or Asian (n = 19), and most self-reported as right-handed (n = 81). Excepting one 30-year-old, the participants were 18 to 24 years old (M = 21.2, SD = 1.9). Participants opted to receive research credit or ten dollars for participating. They were naïve to the purpose of the experiment, which was described as a decision-making study. At the conclusion of the study, participants reported how many quantitative courses they had completed in college (e.g., mathematics, physics, finance) on a four-point scale. Six (7%) reported having taken no such courses, whereas others reported taking one to three (49%), four or five (24%), or six or more courses (20%).
Measures
Number Matching Task
We designed this computerized task to test whether priming for addition facts varies with semantic context, modifying the version designed by Bassok et al. (2008). Participants simultaneously viewed two cue nouns for 900 ms; immediately thereafter two cue digits appeared directly above the cue nouns for an additional 135 ms. Following these cue stimuli, two targets appeared sequentially: a single noun followed by a single digit (noun-first order), or vice versa (digit-first order). Following each target presentation, participants had two seconds to indicate via a keyboard press (Yes/No) whether the target matched either of the two preceding cue nouns or digits (see Figure 2). Noun targets were included to prevent participants from focusing solely on the numbers.
Our task differed from the original version (Bassok et al., 2008) in three ways. First, whereas Bassok and colleagues used an asterisk (*) as the initial fixation point and a plus sign (+) between the cue digits, we used a black square (■) as both the fixation point and the symbol between cues. Second, we presented noun targets for longer durations than did Bassok et al. (900 vs. 480 ms), to reduce error rates. (In pilot studies, we confirmed that this modification reduced mean error rates from 27–30%, as observed by Bassok et al., 2008, to less than 12%.) Crucially, we presented the cue digits for 135 ms (as did Bassok et al.) in an attempt to replicate the effects observed in the original study. Third, we used noun sets designed to extend Bassok et al.’s findings. As detailed further below, these included categorically aligned nouns that were appropriate to combine (such as “ticks, fleas, moths”), and two distinct types of misaligned noun sets, both of which were less appropriate to combine (such as “beans, planes, crabs”). Moreover, to rule out potential alternative explanations for priming effects, all noun sets and digit sets were constrained based on strict criteria summarized below.
Noun sets
We included 176 noun sets and enforced strict control of the nouns’ surface features, as follows: Nouns were drawn from the 5,000 most common plural nouns in the Corpus of Contemporary American English (Davies, 2010), thus excluding nouns that are rarely pluralized (e.g., throats). This list was filtered to exclude homonyms of any of the top 10,000 non-nouns. We included only enumerable nouns that were subject to regular plurals ending in ‘s’ (e.g., we excluded irregular plurals such as “geese”). All nouns were 4–7 letters and 1–2 syllables long to limit the potential influence of noun length on participants’ reaction times. Within individual noun sets, cue nouns had the same number of syllables and differed in total length by no more than one letter.
No single noun appeared in more than one noun triplet. Synonyms (e.g., ships, boats) were excluded within sets since they may be more subject to combining than other nouns. Likewise, subset relationships between cue nouns were excluded (e.g., roads, lanes) to avoid prompting division instead of addition (Bassok, Chase, & Martin, 1998). Homonyms were also excluded. To diminish familiarity effects or long-distance priming (e.g., de Vaan, Schreuder, & Baayen, 2007), each triplet appeared only once throughout the task.
Categorical alignment
Our critical manipulation was three levels of categorical alignment based on the appropriateness of summing across noun referents within a set (as per Bassok et al., 2008) and whether nouns had concrete or intangible referents (our key variable of interest). The aligned concrete-concrete (ACC) sets comprised concrete nouns from the same higher-level category (e.g., “orchids, poppies, lilies”). These ACC triplets correspond with the “aligned categorically (AC)” triplets in Bassok et al. (2008), and are appropriate to sum (e.g., in this case, all are flowers). Less appropriate to sum were the misaligned concrete-concrete (MCC) triplets, composed of concrete nouns from different categories (e.g., “blouses, kiosks, lagoons”) that refer to tangible items but otherwise differ from each other. Our third set, misaligned concrete-intangible (MCI) triplets, included both concrete and intangible cue nouns (e.g., raisins, chances). The MCC and MCI sets can be collapsed into an overall misaligned category that is comparable to the “misaligned unrelated (MU)” triplets in Bassok et al. (2008). For all three types of noun triplets, in half of the trials the target noun matched one of the cues, and in the other half the target matched neither cue.
The Number Matching Task included 40 noun triplets for each set (ACC, MCC, and MCI) in the digit-first trials (i.e., the trials of interest in this study). To ensure that participants did not statistically learn that noun triplets were often misaligned, 48 of the 56 noun-first trials involved ACC triplets, so that overall, half of all triplets in the study were aligned (as was the case in Bassok et al., 2008). Example nouns appear in Table 1.
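The balance of aligned and misaligned triplets described above can be checked with a few lines of arithmetic. The sketch below is illustrative only: the dictionary names are hypothetical, and the count of 8 misaligned noun-first trials is inferred here as 56 minus 48 rather than stated in the text.

```python
# Trial counts taken from the description of the Number Matching Task.
DIGIT_FIRST = {"ACC": 40, "MCC": 40, "MCI": 40}
# Assumption: the 8 noun-first trials that are not ACC are all misaligned.
NOUN_FIRST = {"ACC": 48, "misaligned": 8}


def composition():
    """Return (total trials, aligned trials, aligned share)."""
    total = sum(DIGIT_FIRST.values()) + sum(NOUN_FIRST.values())
    aligned = DIGIT_FIRST["ACC"] + NOUN_FIRST["ACC"]
    return total, aligned, aligned / total


total, aligned, share = composition()
print(total, aligned, share)  # 176 88 0.5
```

Under these counts, 88 of the 176 triplets are aligned, which matches the stated goal of keeping half of all triplets aligned.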
Table 1
Noun Type  Nonmatching Cue 1  Nonmatching Cue 2  Nonmatching Target  Matching Cue 1  Matching Cue 2  Matching Target
ACC  pigs  cows  goats  whales  sharks  sharks
ACC  ticks  fleas  moths  doctors  lawyers  doctors
ACC  donuts  bagels  cookies  plates  bowls  bowls
ACC  bankers  actors  sailors  frogs  toads  toads
ACC  apples  lemons  mangoes  lamps  desks  lamps
MCC  webs  eels  cabs  magnets  wizards  magnets
MCC  homes  bones  cops  hotels  ladders  hotels
MCC  cards  boats  ports  prisons  oysters  oysters
MCC  brakes  swords  phones  statues  cigars  cigars
MCC  robots  towels  guitars  papers  hunters  hunters
MCI  tanks  myths  laws  wives  facts  wives
MCI  trucks  nights  clowns  pearls  tales  pearls
MCI  sins  dogs  chips  pastors  options  options
MCI  weeks  bombs  hairs  defects  turkeys  defects
MCI  tactics  acorns  lessons  tasks  eggs  eggs
Note. ACC = aligned concrete-concrete; MCC = misaligned concrete-concrete; MCI = misaligned concrete-intangible.
Digit sets
Digit sets used in the Number Matching task consisted of three digits between 1 and 9 (Table 2). We used a subset of the digit sets created by LeFevre and Kulak (1994), which were composed of two cue digits and the target digit. Ties (e.g., “7, 7”) were excluded from all cue digit pairs since these may prompt different response patterns for participants (LeFevre et al., 1988). Such restrictions limited the number of possible combinations, leaving 40 distinct digit sets in the experiment. Each digit set appeared in three of the digit-first trials and either once or twice in the control word-first trials.
Table 2
Nonmatching Cue  Nonmatching Sum Target  Nonmatching Neutral Target  Matching (Target Control) Cue  Matching (Target Control) Target  Matching (Cue Control) Cue  Matching (Cue Control) Target
2■3  5  8  7■5  5  2■3  2
3■2  5  7  5■8  5  3■2  3
2■5  7  9  3■7  7  2■5  5
5■2  7  9  9■7  7  5■2  2
6■2  8  5  5■8  8  6■2  2
5■3  8  6  9■8  8  5■3  5
4■3  7  9  7■9  7  4■3  4
3■5  8  6  8■4  8  3■5  3
6■3  9  7  9■1  9  6■3  3
5■4  9  7  9■6  9  5■4  4
Note. The symbol ■ was used as a focal point for participants during this task. Each triplet comprised one cue pair, followed by a target digit.
Nonmatching digit sets. For twenty of the 40 digit sets, the target digit did not match either of the two cue digits. In these nonmatching sets, the target was either a sum of the preceding cue digits or was neutral (i.e., the target matched neither of the cue digits nor their sum, product, quotient, or difference). Each of the ten pairs of cue digits was included in two nonmatching triplets, once with a sum target and once with a neutral target. The variable of interest was the difference in response latency (RT) between sum and neutral digit-first trials, a key outcome for assessing the presence of the LeFevre interference effect.
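The definition of a neutral target can be expressed as a simple predicate. The following is a minimal sketch, not the procedure used to build the stimuli: the function name is hypothetical, and it assumes that differences and quotients are excluded in either order of the cue digits. The rows are transcribed from the nonmatching columns of Table 2.

```python
from fractions import Fraction


def is_neutral(cue_a, cue_b, target):
    """True if target is neither cue digit nor the cues' sum, product,
    difference, or quotient (assumption: taken in either order)."""
    excluded = {
        Fraction(cue_a),
        Fraction(cue_b),
        Fraction(cue_a + cue_b),
        Fraction(cue_a * cue_b),
        Fraction(abs(cue_a - cue_b)),
        Fraction(cue_a, cue_b),
        Fraction(cue_b, cue_a),
    }
    return Fraction(target) not in excluded


# (cue 1, cue 2, neutral target), transcribed from Table 2.
NEUTRAL_SETS = [
    (2, 3, 8), (3, 2, 7), (2, 5, 9), (5, 2, 9), (6, 2, 5),
    (5, 3, 6), (4, 3, 9), (3, 5, 6), (6, 3, 7), (5, 4, 7),
]

all_neutral = all(is_neutral(*row) for row in NEUTRAL_SETS)
print(all_neutral)  # True
```

Exact fractions are used so that non-integer quotients (e.g., 3 divided by 2) are compared without floating-point rounding.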
We also controlled for the size and distance between each sum/neutral target and its associated cue digits, and whether target digits appeared on the left or the right, because these factors may affect response times in ways unrelated to the LeFevre interference effect. Sum and neutral targets had a similar distribution of the digits 5 to 9 (both: M = 7.3, SD = 1.4). The distance from the nearest cue digit to the target (the minimum split) was similar on average for both sets (M_{neutral} = 2.8, M_{sum} = 2.6), but the standard deviation for the neutral set (SD_{neutral} = 1.7) exceeded that of the sum set (SD_{sum} = 1.1), because sum splits had a unimodal distribution whereas neutral splits had a bimodal distribution. Similar patterns held for the maximum and average splits.
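The target and split summaries can be recomputed from the ten nonmatching rows of Table 2. This sketch assumes that a split is the absolute difference between the target and a cue digit, and that the minimum split uses the cue digit closer to the target, which is the reading that reproduces the reported means. Variable and function names are hypothetical.

```python
from statistics import mean, stdev

# (cue 1, cue 2, sum target, neutral target), transcribed from Table 2.
SETS = [
    (2, 3, 5, 8), (3, 2, 5, 7), (2, 5, 7, 9), (5, 2, 7, 9), (6, 2, 8, 5),
    (5, 3, 8, 6), (4, 3, 7, 9), (3, 5, 8, 6), (6, 3, 9, 7), (5, 4, 9, 7),
]


def min_split(cue_a, cue_b, target):
    """Distance from the target to the closer of the two cue digits."""
    return min(abs(target - cue_a), abs(target - cue_b))


sum_targets = [s for _, _, s, _ in SETS]
neutral_targets = [n for _, _, _, n in SETS]
sum_splits = [min_split(a, b, s) for a, b, s, _ in SETS]
neutral_splits = [min_split(a, b, n) for a, b, _, n in SETS]

# Target size: both sets have M = 7.3 and (sample) SD = 1.4.
print(round(mean(sum_targets), 1), round(stdev(sum_targets), 1))
print(round(mean(neutral_targets), 1), round(stdev(neutral_targets), 1))
# Mean minimum split: 2.6 for sum targets, 2.8 for neutral targets.
print(round(mean(sum_splits), 1), round(mean(neutral_splits), 1))
```

Listing `neutral_splits` shows the two clusters (values of 1 to 2 and of 4 to 5) behind the bimodal distribution mentioned above.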
Matching digit sets. Twenty of the 40 digit sets were matching sets, wherein the target digit matched one of the two cue digits (e.g., cue digits: 4, 6; target digit: 4). These throwaway control trials were included to verify that participants were on task and to ensure that participants would not expect nonmatching trials to be more likely. Moreover, to ensure that specific cue digits did not reveal whether a match was likely, ten cue-control triplets each had the same cue digits as one of the nonmatching sets, but also had a target digit that matched one of the cue digits.
Similarly, it was necessary to prevent the target digit from revealing whether it was likely to be a match. Since the cue digits in the nonmatching sets were constrained to sum to less than 10, these cue digits tended to be small (M = 3.7), allowing high targets to indicate a nonmatch by default. Therefore, target-control triplets each had the same target as one of the sum triplets, but appeared following a new pair of cue digits, one of which actually matched the target. As noted above, sum and neutral targets had a similar distribution of digits, so this set also had a similar distribution of digits to neutral targets.
Test administration
Participants completed 176 trials. On the 120 digit-first trials relevant to this study, the digit target appeared after the cue duration and before the noun target. Thus, most of the trials appeared in digit-first order to facilitate capturing the short-term time scale of the LeFevre interference effect. In the remaining 56 control trials, the target noun appeared first (word-first trials) to ensure that participants attended to the words; data from these word-first trials were not analyzed.
Participants were assigned to one of four fixed trial orders. To prevent participants from ignoring word cues (since noun-first trials were less numerous), practice trials and the first of four blocks of testing trials included an equal number of digit-first and noun-first trials. However, maintaining this balance throughout the entire task would require an infeasibly long task, so we decreased the ratio of noun-first to digit-first trials harmonically to one-half, one-third, and one-quarter in subsequent blocks for Orders A1 and A2. To test whether this sequence of blocks affects participant response patterns, for Orders B1 and B2 the ratio of subsequent blocks was instead one-quarter, one-half, and one-third. To allow for testing of block-specific order effects, Order A2 was generated from Order A1 by switching the first 60 digit-first trials with the last 60. Order B2 was generated from Order B1 in an analogous way.
Within each order, the sequence of trials was randomly generated with several constraints. Consecutive identical answers (match vs. nonmatch) did not exceed four trials, and no more than four consecutive trials included the same noun or digit-triplet type. No more than four noun-first trials occurred in a row so that these less numerous control trials were sufficiently spread out throughout the task.
Each trial occurred in a fixed order (Figure 2). First, the fixation box appeared. Participants pressed the space bar to initiate the trial. The two noun cues appeared for 900 ms inside boxes on either side of the fixation box. Then the two digit cues appeared above the nouns, for 135 ms. Once the cues disappeared, the target (digit or noun) appeared. Participants pressed one of two color-coded keys to indicate whether the target had previously appeared as a cue (v), or had not (n). If 2 seconds passed without a response, the trial was recorded as wrong. Accuracy feedback (`RIGHT` or `WRONG`) appeared in the center of the screen for 500 ms. Then, the second target (noun or digit) was presented and participants made the same type of judgment and received feedback.
Participants received verbal instructions to strive for both accuracy and speed. They completed a demonstration trial with an experimenter who provided instructions and feedback, then completed 10 practice trials. During the experiment, participants were alerted when one-third and two-thirds of the trials were completed. Consistent with procedures adopted by Bassok et al. (2008), participants were told that they would receive a memory test; this was done to encourage attention to the nouns. The results of the memory test did not relate to the purposes of the study and were not recorded or analyzed. The task took approximately 20 minutes. All participants completed this task.
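The run-length constraints on trial sequences can be expressed as a validity check on a candidate order. The sketch below is an illustration under stated assumptions, not the generation procedure used in the study: the dictionary keys (`answer`, `noun_type`, `digit_type`, `order`) are hypothetical, and the limit on noun and digit-triplet types is assumed to apply to consecutive trials.

```python
from itertools import groupby

MAX_RUN = 4  # longest run allowed by the constraints described above


def longest_run(values, only=None):
    """Length of the longest run of equal consecutive values.

    If `only` is given, consider runs of that single value alone."""
    runs = [
        len(list(group))
        for key, group in groupby(values)
        if only is None or key == only
    ]
    return max(runs, default=0)


def order_is_valid(trials):
    """Check one candidate trial order against the run-length limits.

    `trials` is a list of dicts with the hypothetical keys 'answer',
    'noun_type', 'digit_type', and 'order'."""
    answers = [t["answer"] for t in trials]
    noun_types = [t["noun_type"] for t in trials]
    digit_types = [t["digit_type"] for t in trials]
    orders = [t["order"] for t in trials]
    return (
        longest_run(answers) <= MAX_RUN
        and longest_run(noun_types) <= MAX_RUN
        and longest_run(digit_types) <= MAX_RUN
        and longest_run(orders, only="noun-first") <= MAX_RUN
    )
```

A generator could propose candidate orders and keep only those for which `order_is_valid` returns `True`. With 176 trials, a plain shuffle rarely satisfies every limit, so a sequential construction that checks the limits as each trial is placed would likely be more practical.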
Sentence Verification Task
We created the Sentence Verification Task (Figure 3) to assess differences in priming for number-pair products as a function of whether naturalistic linguistic contexts implicate multiplication, and to test whether automaticity of number combinations (fact retrieval or rapid computation) is reduced when multiplication is clearly not implicated. Stimuli consisted of 32 cue sentences, each followed by two prompt statements. For each trial, participants first saw a fixation box, read the cue sentence, and then saw and responded to both prompt statements sequentially, via key press (“Yes”/“No”), to indicate if the prompt was likely to be true based on the cue sentence. Responses to the first target prompt were analyzed; responses to the second, filler prompt were not analyzed. (See Table 3 for sample cue sentences and prompts.)
Cue sentences
Each cue was a declarative sentence containing two whole numbers. The semantic content of each cue sentence either implicated multiplication of the two numbers (e.g., Frank sent 4 texts to each of 10 friends), or did not implicate multiplication (e.g., You can smell those 4 pizzas 10 blocks away). Numbers in the cue sentences ranged from 2 to 9 and were associated with multiplication facts of moderate difficulty. We excluded identical number pairs and the numbers 1 and 0 since multiplication with these numbers is relatively easy, and excluded 7 or 8 unless either was paired with 2, since products of 7 and 8 are relatively challenging (e.g., Campbell & Graham, 1985). Lengths of cue sentences were constrained in terms of number of syllables (M = 10.5, SD = 1.7, range 8–14) and words (M = 8.6, SD = 1.1, range 7–10). Numbers always appeared in Arabic notation and never began, ended, or appeared consecutively within the sentence.
Table 3
Prompt Type  Digit Type  Cue Sentence  Target Prompt  Filler (Second) Prompt 

Implicative Contexts  
Reject  Neutral  “The 3 ships each transported 9 crates.”  The ships weighed 5 ounces.  The crates were made of silk. 
Reject  Product  “Evan mowed 10 lawns at 6 dollars each.”  Evan mowed 60 lawns.  Evan was 6 years old. 
Accept  Neutral  “Frank dealt 4 cards each to 10 poker players.”  The full deck had 52 cards.  The cards were on fire. 
Accept  Product  “Gwen bought 6 toys for each of her 4 kids.”  She purchased 24 items.  Gwen's kids were goats. 
Non-Implicative Contexts  
Reject  Neutral  “Jacky visited 3 orchards to pick 9 peaches.”  Each peach had a 4 lb pit.  Jacky had enough peaches for 2 pies. 
Reject  Product  “They reserved 6 tables at the 4 Seasons Cafe.”  They reserved the tables for 24 days.  The reservations were for a golf tee time. 
Accept  Neutral  “Don baked 3 batches of lemon squares in 9 pans.”  Each lemon square had 4 corners.  Don used sugar in his baking. 
Accept  Product  “Mike won 6 medals in 4 hours.”  There were 24 hours in each day.  The medals he won were invisible. 
Note. Implicative trials implied a multiplication operation on the two numbers in the cue sentence. Reject trials had target prompts that were keyed as false and Accept trials were keyed as true. Product trials had target prompts that contained the product of the two numbers in the cue sentence; Neutral trials did not. Responses to the second prompt were not analyzed.
Similar to the Number Matching task, the key contextual distinction between the cue sentences was whether the sentences implicated multiplication of the two numbers appearing in the sentence. In the 16 implicative trials, the cue sentence rendered the product of the two numbers meaningful and relevant. For instance, given the cue sentence, The 4 waiters each carried 5 trays, multiplying 4 by 5 yields a meaningful product. In the remaining 16 nonimplicative trials, multiplying the numbers in a cue sentence would not yield a meaningful product (e.g., Brad wished for 4 kids and 5 cars).
Target prompt statements
The target prompt statements always included one number that was either the product of the two numbers from the preceding cue sentence (for 16 product trials) or a different number (for 16 neutral trials), counterbalanced across implicative/nonimplicative contexts. Since our primary aim was to investigate whether modulation differed between multiplicative or nonmultiplicative semantic contexts (analogous to the Number Matching task), we expected products to facilitate accepting, or interfere with rejecting, the prompt, as an extension of the LeFevre interference effect. We made target prompts slightly shorter than the cue sentences, in terms of both syllables (M = 8.0, SD = 1.5, range 5–11) and words (M = 5.8, SD = 1.3, range 4–10). The number embedded in the target prompt was an Arabic numeral between 2 and 60, and it never appeared at the beginning or end of the sentence.
One half of target prompts were classified as accept, and the other half were classified as reject. For instance, following the cue sentence, “The 4 waiters each carried 5 trays,” the prompt, “The meal had 20 calories,” was designed to be rejected, as it does not follow from the cue sentence. (Indeed, all pilot participants rejected this statement.) Including both classifications (accept and reject) across conditions ensured that participants could not statistically infer that either response was more frequent and also allowed us to examine possible differences in priming between acceptance and rejection responses. Our classifications were validated during pilot testing, and only items with greater than 80% accuracy among pilot participants were retained.
We analyzed responses for the first prompt sentence only, because priming effects on the second prompt sentence may have been contaminated by the presence of the first prompt sentence. The second prompts included both accept (50%) and reject (50%) trials; the trials either did (10 of 32) or did not (22) contain a number, to prevent participants from anticipating numbers in all prompts or in only the first prompt.
Experimental trials
Eight sets of cue sentences and target prompt statements were generated in a 2 (Prompt Type: Accept or Reject) × 2 (Context for Products: Implicative or Nonimplicative) × 2 (Digit Type: Neutral or Product) design. There were four sentences per experimental condition, yielding a total of 32 trials.
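The crossing of the three two-level factors can be enumerated directly. This is a minimal sketch with hypothetical names (`FACTORS`, `design_cells`); the factor labels and the four sentences per cell come from the description above.

```python
from itertools import product

# Factor names are hypothetical; levels follow the design described above.
FACTORS = {
    "prompt_type": ("Accept", "Reject"),
    "context": ("Implicative", "Nonimplicative"),
    "digit_type": ("Neutral", "Product"),
}
SENTENCES_PER_CELL = 4


def design_cells():
    """Return every combination of factor levels as a dict."""
    return [dict(zip(FACTORS, levels)) for levels in product(*FACTORS.values())]


cells = design_cells()
n_trials = len(cells) * SENTENCES_PER_CELL
print(len(cells), n_trials)  # 8 32
```

Each of the eight cells receives four cue sentences, which yields the 32 experimental trials.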
Several features of the sentences were balanced across conditions in order to strengthen the validity of reaction time (RT) comparisons. Each experimental condition had exactly one trial in which the first prompt referred to the same unit of measurement as the cue. For instance, there were four trials wherein cues implicated multiplication but the product did not appear in the first target prompt (e.g., the cue sentence, “Frank dealt 4 cards each to 10 poker players” was followed by the target prompt “The full deck had 52 cards,” emphasis added). The three other trials did not share units of measurement across the cue and first prompt. In each set of four product trials, the cue sentences presented the digit pairs 3 and 9, 4 and 10, 8 and 2, and 9 and 4, respectively. A different set of digit pairs was used for neutral trials: 2 and 6, 4 and 5, 6 and 4, and 10 and 6. This ensured a balance of stimuli across implicative/nonimplicative and accept/reject trials.
Pilot testing
Cue sentences and prompt sentences were finalized through iterative piloting with 210 adults who completed prior versions of the task, either as volunteer study participants at our university (82 participants) or on Mechanical Turk (127 participants), an online marketplace for contract work where participants were paid for their responses. Based on pilot responses, we modified statements to maximize ease of judging the likelihood of being true. In the final pilot testing, two items were excluded for failing to reach our threshold of 80% accuracy, including an Accept/Implicative condition item (71%) and an Accept/Nonimplicative condition item (59%). These were omitted because unusually difficult items may introduce cognitive complexity and construct-irrelevant variance to the measures. All remaining items had accuracy rates of 85% or above.
Administration
Participants listened to instructions, completed a single demonstration practice trial that did not involve any numbers, and then received feedback. All participants saw the same stimuli in the same quasi-randomly generated order, adjusted to limit the number of consecutive trials with the same combination of condition and outcome (no more than two in a row) or the same keyed response (no more than three in a row for the first prompt). No feedback was given on trial responses. The task required about 5 minutes to complete. Two participants were excluded from analyses for failing to respond correctly to any trials in one or more conditions.
Achievement Measures
Math fluency
Participants completed a three-minute calculation fluency measure, the Math Fluency subtest of the Woodcock-Johnson III, during the testing session. This subtest is from a standardized, paper-and-pencil mathematics achievement measure. Participants were asked to solve as many problems as quickly as possible. Problems appeared in a test booklet, in order of increasing difficulty. The subtest has a median reliability of .92 with adult participants (Mather & Woodcock, 2001). Accuracy (number correct) and total time to complete the task were recorded. We calculated participants’ fluency rate (trials/minute) to create a comparable measure for all participants. One participant’s score was omitted due to experimenter error.
College entrance exam scores (ACT/SAT)
Participants were asked to provide their standardized college entrance examination test scores (ACT Math and ACT Reading). The ACT and SAT are widely used standardized college entrance exams. Each exam yields separate Mathematics and Reading scores. Both tests require basic to complex mathematics problem-solving skills or reading skills that tap meaning comprehension. Historically, scores for these exams have been highly correlated, with reported correlations of .92 for composite scores, .89 for Mathematics, and .83 for ACT Reading with SAT Verbal (now labeled Critical Reading; Dorans, Lyu, Pommerich, & Houston, 1997). Accordingly, we collapsed percentile score data across ACT and SAT Mathematics, and across ACT Reading and SAT Critical Reading scores. Of 73 participants who consented to our accessing their standardized test scores, 67 had taken the ACT. Therefore, the ACT scores were the focus of analysis, and 6 sets of SAT scores were transformed to align with the ACT metric using published national percentile norms. Participants took the ACT between 2006 and 2013, but ACT scale scores are constructed to be comparable across these years and can be analyzed directly (ACT, Inc., 2014).
Procedures
The study was approved by our institutional human subjects review board. All 92 participants completed the Number Matching, Sentence Verification, and Math Fluency tasks, in that order. (A matching task excluded from the present study was administered as the third of four tasks.) The entire testing session took approximately one hour. In addition, Math and Reading ACT and SAT scores were collected from 73 participants who consented for the University’s Office of Institutional Research to release these scores to the researchers.
Results
We carried out separate analyses for our two primary numerical tasks. We used repeated measures ANOVAs to test for hypothesized main effects and interactions involving noun alignment in the Number Matching task (nonmatching trials only) and implication of multiplication in the Sentence Verification task (responses to first prompt sentences only). For the Number Matching task, we first evaluated whether we replicated Bassok and colleagues’ (2008) findings, and then tested our hypotheses concerning further contextual influences of misalignment on classic priming effects. Finally, for both numerical tasks, we used linear mixed models to test for individual differences and the contributions of math fluency and ACT scores to speed of response. In all analyses, the outcome variable of interest was speed, measured by the inverse response time, consistent with prior recommendations for RT modeling (e.g., Baayen & Milin, 2010; Ratcliff, 1993). This transformation increased the normality of the dependent variable in our Number Matching Task (skew = 0.1, kurtosis = 0.1) compared to both the untransformed data (skew = 1.5; kurtosis = 3.1) and a logarithmic transformation (skew = 0.7; kurtosis = 0.5). In the presentation of results, estimated parameters are back-transformed to reaction times (ms per trial), where possible, to aid interpretation and support comparisons to prior results in the research literature. Generalized eta-squared estimates are reported for all ANOVAs due to their comparability as effect sizes across research designs (Olejnik & Algina, 2003). We present the R^{2}_{GLMM} defined by Nakagawa and Schielzeth (2013) and generalized by Johnson (2014) as a measure of overall model fit for linear mixed models. We present the marginal R^{2}_{GLMM} to evaluate changes in fixed effects and the conditional R^{2}_{GLMM} to evaluate changes in random effects. These measures are not necessarily comparable to the R^{2} used in linear regression and should be interpreted with caution.
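The speed transformation and the back-transformation to milliseconds described above can be sketched as follows. This is a minimal Python illustration, not the analysis code used in the study: the function names and the response times are hypothetical, and the skewness estimator is the simple moment-based (population) form, which may differ from the estimator behind the reported values.

```python
from statistics import mean


def rt_ms_to_speed(rt_ms):
    """Convert a response time in milliseconds to speed in trials per second."""
    return 1000.0 / rt_ms


def speed_to_rt_ms(speed):
    """Back-transform a speed in trials per second to milliseconds per trial."""
    return 1000.0 / speed


def moment_skewness(values):
    """Moment-based skewness: third central moment divided by the cubed SD."""
    m = mean(values)
    n = len(values)
    m2 = sum((v - m) ** 2 for v in values) / n
    m3 = sum((v - m) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Hypothetical right-skewed response times (ms), for illustration only.
rts = [520, 560, 600, 640, 700, 760, 850, 1000, 1300, 2100]
speeds = [rt_ms_to_speed(rt) for rt in rts]

skew_raw = moment_skewness(rts)
skew_speed = moment_skewness(speeds)

# An estimate obtained on the speed scale is reported in ms by inverting it.
mean_rt_from_speed = speed_to_rt_ms(mean(speeds))
```

Because the inverse is nonlinear, a mean estimated on the speed scale and then back-transformed will generally be smaller than the arithmetic mean of the raw response times, which is worth keeping in mind when comparing back-transformed estimates with raw means from other studies.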
Number Matching Task
ANOVAs
We first examined the degree to which our results replicate those of Bassok et al. (2008). We collapsed our two misaligned conditions to approximate the misaligned unrelated (MU) condition used by Bassok and colleagues, the latter of which included noun triplets similar to our MCC (e.g., “coats, biscuits, islands”) and MCI conditions (e.g., “tractors, messages, fairies”). We then carried out a 2 (Context for Sums: Aligned vs. Misaligned) × 2 (Digit Type: Sum vs. Neutral) repeated measures ANOVA on the inverse response time (trials/s), or speed, on all correctly answered trials (Table 4). Digit Type referred to whether target numbers appearing after cue digits were the sum of the preceding cue digits or were neutral (matching neither cue digit nor the digits’ sum, product, quotient, or difference).
Table 4
Effect  df(n)  df(d)  F  p  η^{2} 

Combined Misaligned Analysis  
Digit Type  1  91  24.83  < .001  .008 
Context for Sums  1  91  2.25  .138  .001 
Digit Type × Context for Sums  1  91  15.49  < .001  .005 
Expanded Misaligned Analysis  
Digit Type  1  91  14.26  < .001  .004 
Context for Sums  2  182  1.21  > .250  .001 
Digit Type × Context for Sums  2  182  14.79  < .001  .010 
Note. The Combined Misaligned Analysis had two levels of Context: Aligned and Misaligned. The Expanded Misaligned Analysis of the same data differentiated the Misaligned condition further by separating Misaligned Concrete-Concrete (MCC) and Misaligned Concrete-Intangible (MCI) conditions.
Our replication attempt was successful. We found a Context × Digit Type interaction (η^{2} = .005) similar in strength to that found by Bassok and colleagues (2008; η^{2} = .008). Nonmatching Aligned Neutral targets were rejected significantly faster than Aligned Sum targets, ΔM = 41 ms, t(91) = 5.41, d = .565, Holm-adjusted p < .001, but there was no difference in speed of rejection of nonmatching Neutral and Sum targets on Misaligned trials, ΔM = 5 ms, t(91) = 0.89, Holm-adjusted p > .250, d = .093. Means for the Neutral and Sum trials under misaligned conditions (both ≈ 760 ms) fell between those for the Aligned Sum (M = 788 ms) and Aligned Neutral conditions (M = 747 ms), similar to the results reported by Bassok and colleagues. Presumably due to longer presentation durations for noun cues in our study, our accuracy rates for each condition (Ms ≈ 90%) were substantially higher than those found by Bassok and colleagues (Ms ≈ 70%), but response speeds followed the same qualitative pattern (see Figure 4a).
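The effect sizes reported for these paired contrasts appear consistent with the common conversion d = t / √n for within-participant comparisons. That reading is an inference from the reported values, not a formula stated in the text, so the sketch below should be taken as an assumption; the function name is hypothetical.

```python
import math


def paired_d_from_t(t_value, df):
    """Approximate Cohen's d for a paired contrast as t / sqrt(n), with n = df + 1.

    Assumption: the standardizer is the SD of the paired differences.
    """
    n = df + 1
    return t_value / math.sqrt(n)


# t values taken from the text; both contrasts have df = 91 (n = 92).
d_aligned = paired_d_from_t(5.41, 91)      # Aligned Neutral vs. Aligned Sum
d_misaligned = paired_d_from_t(0.89, 91)   # Misaligned Neutral vs. Misaligned Sum
```

Small discrepancies in the third decimal place are expected, because the t values printed in the text are themselves rounded.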
The findings were only partially similar when we separated the two types of misalignment (Figure 4b and Table 4). A 3 (Context for Sums: ACC, MCC, or MCI) × 2 (Digit Type: Sum vs. Neutral) repeated measures ANOVA on correct trial speeds revealed a stronger Digit Type × Context interaction, which accounted for a greater proportion of the variance (η^{2} = .010) than in the collapsed analysis (η^{2} = .005). Moreover, the MCC trials displayed the classic LeFevre interference effect (LeFevre et al., 1988), wherein rejection of nonmatching Sum targets was slower than rejection of nonmatching Neutral targets, ΔM = 41 ms, t(91) = 3.22, Holm-adjusted p = .002, d = .335. This effect is substantial but smaller than that observed in Aligned trials, d = .565. The pattern of means for Sum trials is consistent with the hypothesis that increasing contextual support for summation leads to increasing interference (slower rejection) on Sum trials. A linear contrast testing this trend (MCI < MCC < ACC) was significant, t(182) = 4.81, Holm-adjusted p < .001, r_{contrast} = .336.
However, unlike the MCC condition, the MCI condition had a facilitative effect, with faster rejection for nonmatching Sum versus nonmatching Neutral targets, ΔM = 18 ms, t(91) = 2.41, Holm-adjusted p = .018, d = .244, although notably weaker than the interference effects observed on ACC and MCC trials. Moreover, speeds were slower for Neutral MCI trials compared to both Neutral MCC trials, ΔM = 24 ms, t(91) = 3.12, Holm-adjusted p = .002, d = .323, and Neutral ACC trials, ΔM = 24 ms, t(91) = 3.13, Holm-adjusted p = .002, d = .326, which did not differ significantly from one another, ΔM = 0 ms, t(91) = 0.03, Holm-adjusted p > .250, d = .004.
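The Holm adjustment applied to the post hoc comparisons above is a step-down correction that can be sketched in a few lines. The p values in this example are hypothetical and the function name is illustrative; the grouping of comparisons into families in the study is not specified here and is not assumed.

```python
def holm_adjust(p_values):
    """Return Holm step-down adjusted p values in the original order."""
    m = len(p_values)
    # Visit the p values from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The k-th smallest p value is multiplied by (m - k + 1), capped at 1.
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Enforce monotonicity: an adjusted p never falls below an earlier one.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


# Hypothetical unadjusted p values for a family of three comparisons.
raw_p = [0.001, 0.020, 0.030]
adjusted_p = holm_adjust(raw_p)
```

Compared with a plain Bonferroni correction, which multiplies every p value by the full family size, the Holm procedure is uniformly at least as powerful while still controlling the familywise error rate.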
We examined patterns of individual responses to rule out the potential influence of outliers on the observed facilitative effect of the MCI trials (under the Sum condition). Distributions and SDs were similar across conditions, and inspection of participant-level distributions of interference (Neutral – Sum) did not reveal outliers. Moreover, a binomial sign test revealed that a statistically significant number of participants (60 of 92) displayed LeFevre interference for ACC trials, p = .004. Separate sign tests revealed the same result for MCC trials (60 of 92 participants), p = .004, and a marginally nonsignificant facilitative effect for MCI trials (55 of 92 participants), p = .08. Thus, despite lower power, the results of nonparametric binomial tests converge with the ANOVA results reported earlier.
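A binomial sign test of this kind can be sketched with the standard library alone. The version below assumes an exact two-sided test against a chance rate of .5; whether the reported p values came from an exact or an approximate procedure is not stated, so the results may differ slightly from the values in the text. The function name is hypothetical.

```python
from math import comb


def sign_test_two_sided(successes, n):
    """Exact two-sided binomial sign test against a null proportion of 0.5."""
    # With p = 0.5 the distribution is symmetric, so double the larger tail.
    k = max(successes, n - successes)
    upper_tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * upper_tail)


# Counts taken from the text: participants showing the effect, out of 92.
p_interference = sign_test_two_sided(60, 92)  # ACC and MCC trials
p_facilitation = sign_test_two_sided(55, 92)  # MCI trials
```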
Linear Mixed Models
Linear mixed modeling can provide additional insight into individual differences and help bring more features of the design under statistical control (e.g., Baayen, Davidson, & Bates, 2008; Bryk & Raudenbush, 1992). However, as no significant evidence emerged regarding individual differences in contextual sensitivity or our covariates, we only briefly summarize the results here and in Table 5. Table 5 shows the series of models that included the 73 participants for whom we had complete data on all covariates. (Models without covariates that included the full sample did not differ appreciably from those for the restricted sample, and are thus not reported.) Model 1 estimated a 3 (Context for Sums: ACC, MCC, or MCI) × 2 (Digit Type: Sum vs. Neutral) linear mixed model. Including random intercepts for each participant dramatically improved model fit based on a likelihood ratio (LR) test, χ^{2}(1) = 1517, p < .001, Δ conditional R^{2}_{GLMM} = .360, as did including random intercepts for each item, χ^{2}(1) = 58.4, p < .001, Δ conditional R^{2}_{GLMM} = .021. As with the repeated measures ANOVA, there was a significant Context × Digit Type interaction, Kenward-Roger F(2, 54) = 3.49, p = .037, Δ marginal R^{2}_{GLMM} = .004. Model 2 controls for practice and/or fatigue effects by additionally including a fixed effect for the trial number. For each successive trial, participants performed about 0.0014 trials/s faster, Kenward-Roger t(151) = 15.89, p < .001.
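The likelihood ratio tests used above to compare nested models can be illustrated for the one-degree-of-freedom case. In this sketch the log-likelihoods are hypothetical values chosen only so that the statistic matches the 58.4 reported for item intercepts; the function names are illustrative, and the closed form for the upper tail applies to 1 df only.

```python
import math


def lr_statistic(loglik_reduced, loglik_full):
    """Likelihood ratio statistic: twice the gain in log-likelihood."""
    return 2.0 * (loglik_full - loglik_reduced)


def chi2_upper_tail_df1(x):
    """P(X > x) for a chi-square variable with 1 df, via the error function."""
    return math.erfc(math.sqrt(x / 2.0))


# Hypothetical log-likelihoods for models without and with item intercepts.
stat = lr_statistic(-350.0, -320.8)
p_value = chi2_upper_tail_df1(stat)
```

When the tested parameter is a variance component, the null value of zero lies on the boundary of the parameter space, so the naive chi-square reference distribution tends to be conservative. With statistics as large as those reported here, that refinement would not change the conclusion.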
Table 5
Effect                              Model 1             Model 2             Model 3             Model 4
                                    β          SE       β          SE       β          SE       β          SE
Trial number                                            0.001***   0.0001   0.001***   0.0001   0.001***   0.0001
Digit Type (Sum)                    –0.065**   0.026    –0.064***  0.020    –0.064***  0.020    –0.064***  0.020
Context (MCC)                       0.007      0.025    –0.002     0.020    –0.002     0.020    –0.002     0.020
Context (MCI)                       –0.045*    0.026    –0.036*    0.020    –0.036*    0.020    –0.036*    0.020
ACT Reading^{a}                                                                                 0.016***   0.005
Math Fluency rate                                                           0.004*     0.002    0.003      0.002
Digit Type (Sum) × Context (MCC)    0.011      0.036    0.021      0.028    0.021      0.028    0.021      0.028
Digit Type (Sum) × Context (MCI)    0.092**    0.036    0.068**    0.028    0.068**    0.028    0.068**    0.028
Constant                            1.300***   0.029    1.200***   0.029    1.000***   0.100    0.600***   0.160
Akaike Inf. Crit.                   879                 643                 642                 634
Bayesian Inf. Crit.                 935                 706                 711                 710
Marginal R^{2}_{GLMM}               .006                .050                .067                .109
Conditional R^{2}_{GLMM}            .385                .419                .419                .417
Note. All variables are uncentered. MCC: Misaligned Concrete-Concrete, MCI: Misaligned Concrete-Intangible. All models contained crossed random intercepts for subjects and items. Reference levels were Aligned Concrete-Concrete (ACC) for Context, and Neutral for Digit Type. All p values are based on t tests using the Kenward-Roger approximation for the degrees of freedom. Marginal R^{2}_{GLMM} estimates the variance accounted for by fixed effects while conditional R^{2}_{GLMM} estimates the variance accounted for by both fixed and random effects.
^{a}Includes 6 participants with missing ACT Scores imputed from SAT scores.
*p < .1. **p < .05. ***p < .01.
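The distinction between the marginal and conditional R^{2}_{GLMM} can be made concrete with the variance-partitioning form commonly given for random-intercept models. The sketch below is an illustration under that assumption; the variance components are hypothetical and are not estimates from the models in Table 5, and the function name is illustrative.

```python
def r2_glmm(var_fixed, var_random, var_residual):
    """Marginal and conditional R2 from variance components.

    var_fixed: variance of the fixed-effect predictions
    var_random: list of random-effect variances (e.g., participant, item)
    var_residual: residual variance
    """
    total = var_fixed + sum(var_random) + var_residual
    marginal = var_fixed / total                         # fixed effects only
    conditional = (var_fixed + sum(var_random)) / total  # fixed plus random
    return marginal, conditional


# Hypothetical components: small fixed-effect variance, large participant variance.
marginal, conditional = r2_glmm(0.01, [0.30, 0.02], 0.67)
```

A pattern like the one in this example, with a small marginal value and a much larger conditional value, indicates that most of the explained variance is attributable to differences between participants and items rather than to the experimental manipulations.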
Math fluency and ACT scores may also capture individual differences relevant to the Number Matching task. Math Fluency had a higher correlation with response speed (r = .19) than did ACT Math (r = .16), so Math Fluency was entered into the regression first in Model 3 (results were comparable regardless of order). Each additional problem correct per minute in the Fluency measure corresponded to a 0.006 trials/s increase in speed, which only approached significance, Kenward-Roger t(81) = 1.88, p = .063. If variation in contextual sensitivity is attributable to math fluency, we would expect to see significant interactions with condition variables; however, no interactions with Fluency approached significance.
Achievement measures were then entered into the model. ACT Math was not a significant predictor of speed, β = –0.003 trials/s, Kenward-Roger t(80) = –0.41, p > .250, but there was a significant positive main effect of ACT Reading, β = 0.016 trials/s, Kenward-Roger t(79) = 3.21, p = .002. With ACT Reading included in the model, Fluency was no longer a significant predictor, β = 0.003 trials/s, Kenward-Roger t(79) = 1.39, p = .168. Again, no interactions emerged as significant. There was no evidence of remaining unexplained individual variability in reaction times across experimental conditions, as tests of random slopes for Context, Digit Type, and their interactions were not significant in likelihood ratio tests, ps > .250. Therefore Model 4, with ACT Reading added as a predictor, was considered the final model.
In summary, we found striking evidence of interactions on the Number Matching task, including interference effects consistent with LeFevre et al. (1988) for Sum trials (in ACC and MCC conditions), modulation effects of context like those reported by Bassok et al. (2008), and an unanticipated facilitation effect on rejecting nonmatching cues on MCI trials. Linear mixed models did not reveal contributions of math achievement level, but did show minor contributions of ACT Reading, despite the fact that the contextual variation in the Number Matching task was relatively artificial and did not impose significant comprehension demands. In contrast, the Sentence Verification task described next involved more authentic linguistic contexts.
Sentence Verification Task
In this task, cue numbers were presented within complete sentences that either implicated or did not implicate multiplication, and the prompt statements that followed contained either the product of the cue numbers or a neutral number. Two participants were excluded from these analyses for failing to respond correctly to any trials in one or more conditions.
ANOVAs
We first carried out ANOVAs to test whether contextual alignment moderated evaluation of the veracity of cue sentences. This 2 (Prompt Type: Accept or Reject) × 2 (Context for Products: Implicative or Nonimplicative) × 2 (Digit Type: Neutral or Product) repeated measures ANOVA focused on participants’ mean response speed on correct trials only. We found a strong Prompt Type × Context × Digit Type interaction, F(1, 89) = 97.59, p < .001, η^{2} = .053, and significant main effects and two-way interactions, excepting the Context × Digit Type interaction (Table 6).
Table 6
Effect  F(1, 89)  p  η^{2} 

Full Analysis  
Context  42.06  < .001  .020 
Prompt Type  25.03  < .001  .031 
Digit Type  32.81  < .001  .017 
Context × Prompt Type  7.70  .007  .004 
Context × Digit Type  1.07  > .250  .001 
Prompt Type × Digit Type  35.41  < .001  .018 
Context × Prompt × Digit Type  97.59  < .001  .053 
Implicative Trials Analysis  
Prompt Type  5.98  .016  .011 
Digit Type  16.88  < .001  .021 
Prompt Type × Digit Type  7.01  .010  .008 
Non-Implicative Trials Analysis  
Prompt Type  41.13  < .001  .067 
Digit Type  12.47  .001  .013 
Prompt Type × Digit Type  183.34  < .001  .145 
Note. Full Analysis included both Implicative and Non-Implicative Trials.
To further understand this three-way interaction, we evaluated Implicative and Nonimplicative trials separately (Table 6). The 2 (Prompt Type: Accept or Reject) × 2 (Digit Type: Neutral or Product) repeated measures ANOVA on implicative trials showed evidence of a Prompt Type × Digit Type interaction (Figure 5a), F(1, 89) = 7.01, p = .010, η^{2} = .008. Unlike the interference effect seen in the Number Matching task, there was little evidence of interference with rejection of incorrect prompts for Implicative Product trials, ΔM = –61 ms, t(89) = –1.43, Holm-adjusted p = .156, d = –.151. Participants were, however, faster to accept correct prompts on Product versus Neutral trials, ΔM = 311 ms, t(89) = 4.00, Holm-adjusted p < .001, d = .421. This facilitation effect on accepting correct prompts on Implicative trials parallels the interference effect we observed for rejecting nonmatches in the Number Matching task.
The repeated measures ANOVA on trials with Non-Implicative contexts revealed clear evidence of a strong Prompt Type × Digit Type crossover interaction, F(1, 89) = 183.3, p < .001, η^{2} = .145 (Figure 5b). Participants were faster to accept correct prompts on Neutral trials than on Product trials, ΔM = 420 ms, t(89) = 5.91, Holm-adjusted p < .001, d = .623, and were faster to reject incorrect prompts on Product trials compared to Neutral trials, ΔM = 557 ms, t(89) = 13.5, Holm-adjusted p < .001, d = 1.422. For product prompts in the Non-Implicative condition, facilitation of correct rejection coupled with interference with correct acceptance is analogous to the effect of the MCI condition seen in the Number Matching task, where facilitation was observed for the rejection of nonmatching targets on sum trials, and interference was observed for the rejection of nonmatching targets on neutral trials. The effect sizes in the Non-Implicative condition for the Prompt Type × Digit Type interaction and associated post hoc tests were much larger than those in either Implicative trials or the Number Matching task.
Since exactly one trial in each condition had the same unit paired with one of the cue numbers and also the target number (matched trials), and the other three trials per condition had no units that matched in both the cue and target (unmatched trials), we further investigated whether these unit conditions appeared to change the results on nonimplicative trials. Using a 2 (Prompt Type: Accept or Reject) × 2 (Digit Type: Neutral or Product) × 2 (Unit: Matched or Unmatched) repeated measures ANOVA on the 48 participants who had data in all cells, we found the same Prompt Type × Digit Type interaction, F(1, 47) = 28.62, p < .001, η^{2} = .037, but there was no Prompt Type × Digit Type × Unit interaction, F(1, 47) < 0.001, p > .250, η^{2} < .001. This indicates that the interaction is not simply due to the units associated with the digits. Post hoc comparisons restricted to matched nonimplicative trials produced qualitative patterns and large effect sizes similar to those for all nonimplicative trials described above, with participants being faster to accept correct prompts on Neutral vs. Product trials, ΔM = 1766 ms, t(47) = 5.204, Holm-adjusted p < .001, d = .751, and faster to reject incorrect prompts on Product vs. Neutral trials, ΔM = 1412 ms, t(47) = 9.266, Holm-adjusted p < .001, d = 1.337.
Linear Mixed Models
We tested for associations between achievement level and Sentence Verification task performance via linear mixed models (Table 7). Kenward-Roger approximated degrees of freedom for the Sentence Verification models were sufficiently high (lowest = 249) for t to practically converge to the standard normal distribution, so we instead present z tests for coefficients. Model 1 included random intercepts for each participant along with the same fixed factors as the full repeated measures ANOVA. Because of the small number of items per condition, random effects for item were not included (estimates of other parameters were similar with or without these random effects). Fixed-effect results were similar to those from the repeated measures ANOVA. In contrast to the ANOVA findings, however, the Context × Digit Type interaction now emerged as significant, β = 0.126 trials/s, z = 5.11, p < .001.
Table 7
Effect                              Model 1             Model 2             Model 3             Model 4
                                    β          SE       β          SE       β          SE       β          SE
Digit Type                          0.029*     0.017    0.029*     0.017    0.029*     0.017    0.050      0.076
Prompt Type                         –0.056***  0.019    –0.055***  0.020    –0.055***  0.020    –0.056***  0.020
Context                             –0.088***  0.017    –0.088***  0.017    –0.011     0.042    –0.012     0.042
Digit Type × Prompt Type            0.059**    0.026    0.059**    0.025    0.058**    0.025    0.059**    0.025
Digit Type × Context                0.130***   0.025    0.130***   0.024    0.120***   0.024    0.130***   0.024
Prompt Type × Context               0.110***   0.026    0.100***   0.025    0.100***   0.025    0.100***   0.025
Digit Type × Prompt Type × Context  –0.300***  0.036    –0.300***  0.036    –0.300***  0.036    –0.300***  0.036
Fluency rate                                                                0.004***   0.001    0.004***   0.001
Context × Fluency rate                                                      –0.002**   0.001    –0.002**   0.001
ACT Math^{a}                                                                                    –0.004     0.004
ACT Reading^{a}                                                                                 0.013***   0.003
Digit Type × ACT Math^{a}                                                                       0.006**    0.002
Digit Type × ACT Reading^{a}                                                                    –0.006***  0.002
Constant                            0.530***   0.019    0.530***   0.019    0.340***   0.066    0.120      0.110
Akaike Inf. Crit.                   –474                –487                –493                –506
Bayesian Inf. Crit.                 –418                –403                –397                –387
Marginal R^{2}_{GLMM}               .059                .060                .081                .110
Conditional R^{2}_{GLMM}            .303                .327                .324                .328
Note. All variables are uncentered. Model 1 includes random intercepts nested within participants. Models 2, 3, and 4 include random intercepts and random slopes for Digit Type and Prompt Type, nested within participants. Estimated random effect parameters for Models 2, 3, and 4 are similar. Reference levels were Reject for Prompt Type, Nonimplicative for Context, and Neutral for Digit Type. All p values are based on t tests using the Kenward-Roger approximation for degrees of freedom. Marginal R^{2}_{GLMM} estimates variance accounted for by fixed effects; conditional R^{2}_{GLMM} estimates variance accounted for by fixed and random effects.
^{a}Includes 6 participants with missing ACT Scores imputed from SAT scores.
*p < .1. **p < .05. ***p < .01.
Random slopes for all main effects were then added to the model, significantly improving the fit according to a likelihood ratio test, χ^{2}(9) = 30.54, p < .001, Δ conditional R^{2}_{GLMM} = .028. However, the random slope for Context was highly correlated with the random slope for Digit Type, r = –0.93, indicating that the model may be overspecified. The random slope for Context exhibited the least variability and was not significantly different from zero, χ^{2}(4) = 7.36, p = .118, Δ conditional R^{2}_{GLMM} = .003. Model 2 therefore excluded this term (see Table 7). This suggests that the main effects of Prompt Type and Digit Type vary by participant, but the effect of Context does not.
In contrast to the Number Matching task, associations between math scores and contextual sensitivity were observed for the Sentence Verification task. Fluency was more strongly correlated with trial speed (r = .16) than was ACT Math (r = .09), so it was entered at the Model 3 stage (the reverse order led to similar conclusions). Math Fluency rate (correct answers/minute) was positively related to speed in the Sentence Verification task, with each additional problem correct per minute associated with a 0.004 trials/s increase in speed, z = 3.30, p = .001. Additionally, a significant Fluency × Context interaction emerged, with each additional correct response on math fluency yielding an increased speed differential of 0.0017 trials/s in nonimplicative trials versus implicative trials, z = 2.01, p = .044. Higher-order interactions of Fluency with the Sentence Verification conditions were not significant, ps > .250, indicating a lack of evidence that Fluency moderates individual differences in the types of contextual sensitivity displayed in the interactions between condition variables.
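The p values attached to the z statistics above follow from the standard normal distribution, which can be checked with a short sketch. It assumes two-sided tests, which is consistent with the reported values; the function name is illustrative.

```python
import math


def two_sided_p_from_z(z):
    """Two-sided p value for a standard normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2.0))


# z statistics taken from the text.
p_fluency = two_sided_p_from_z(3.30)      # Math Fluency main effect
p_interaction = two_sided_p_from_z(2.01)  # Fluency × Context interaction
```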
The addition of ACT Math and Reading to the model provided further explanatory power. The final model for the Sentence Verification task, Model 4 (see Table 7), revealed several effects of ACT Math and Reading. Adding these terms significantly improved fit over Model 3, Kenward-Roger F(4, 84.6) = 5.26, p < .001, Δ marginal R^{2}_{GLMM} = .028. Both ACT Reading and Math interacted with Digit Type, with effects that were roughly of equal magnitude but opposite direction. No higher-order three-way interactions of the achievement measures with the condition variables were found, ps > .250. The predictive contribution of Math Fluency was relatively independent from the ACT measures, with little change in regression weights between Models 3 and 4.
Figure 6(a) shows the interaction of ACT Math and Digit Type from Model 4 by displaying the predicted RT across the range of ACT Math scores present in our sample (16–36), with all other variables held constant at their means. For participants with lower ACT Math scores, RTs on Neutral and Product trials differed only slightly, but the highest scoring participants showed an advantage of about 300 ms for responses to Product trials versus Neutral trials. Conversely, Figure 6(b) shows that participants with lower ACT Reading scores showed a substantial advantage for Product trials over Neutral trials, but there was little difference for higher-scoring participants.
After adding all significant condition and individual difference terms to the model, we examined whether there remained any evidence of unexplained individual variability in the interactions between variables. There was no evidence of participant-specific differences in the slopes of two-way interactions for Model 4, χ^{2}(22) = 20.68, p > .250, Δ conditional R^{2}_{GLMM} = .016, and only marginal evidence of potential individual differences when Model 4 was compared with a model that included random slopes for all possible condition interactions, χ^{2}(30) = 41.01, p = .087, Δ conditional R^{2}_{GLMM} = .029.
Discussion
Cognitive science has demonstrated that automatized cognitive processes, including arithmetic, can be modulated by context (e.g., Spellman, Holyoak, & Morrison, 2001). This includes the effects of semantic misalignment on arithmetic in priming paradigms or word problems (Bassok et al., 2008; Fisher & Bassok, 2009), at least when arithmetic demands are fairly explicit (e.g., when a plus sign appears between digits). Our findings suggest that semantic misalignment is more complex than previously noted. Obligatory addition is affected by factors beyond categorical misalignment, and the direction of semantic modulation may change depending on the degree or type of misalignment. Whereas Bassok and colleagues found diminished priming with misaligned noun sets, facilitation effects from some misaligned noun pairs may have cancelled out interference effects of noun pairs that were only modestly misaligned with addition. We found analogous interference and facilitation effects for full sentences, depending on whether multiplication was implicated, and an unexpected facilitative effect when contexts were very semantically misaligned. These complex behavioral findings raise new questions about the role of semantic misalignment in contextualized numerical cognition and the integration of contextual and numerical processes more broadly. Automatized arithmetic may confer decision-making advantages even in apparently nonarithmetic contexts.
When Does Automatic Arithmetic Interfere With Correct Rejection?
Our Number Matching task showed that both the LeFevre interference (Figure 1a) and Bassok semantic alignment effects (Figure 1b) persist in the complete absence of computational notation (e.g., +). Our participants were slower to correctly reject nonmatching digits on sum versus neutral trials for the categorically aligned condition (the LeFevre effect), but not for categorically misaligned conditions (when collapsed), replicating earlier work (the Bassok effect, cf. Figure 3a, Figure 1b). Our effect size for the Digit Type × Context for Sums interaction (η^{2} = .005) was similar to Bassok et al.’s Experiment 1 (η^{2} = .008).
We hypothesized that semantic misalignment lies on a continuum, with more misaligned noun pairs suppressing obligatory arithmetic to a greater degree than less misaligned nouns (Figure 7, top). When the misalignment conditions were examined separately (Figure 7, bottom), sum trials from the Number Matching task provided partial support for this hypothesis. When misaligned noun sets combined concrete and intangible nouns (the MCI condition), however, participants were faster to reject nonmatches on sum trials even when compared to their performance on sum trials of the Misaligned Concrete-Concrete (MCC) condition. Our two misaligned conditions were clearly not equivalent; they differed in how they modulated obligatory arithmetic. Only when they were combined could we replicate Bassok et al.’s finding (2008).
When Does Obligatory Arithmetic Facilitate Correct Rejection?
Our experiment provides evidence that a wholly different effect may occur in specific misaligned conditions, counter to the Bassok effect. In the Misaligned Concrete-Concrete condition, slower rejection of nonmatching digits during sum versus neutral trials suggests that some arithmetic interference occurred during sum trials. However, the reverse occurred for MCI trials, with faster rejection of nonmatching digits during sum versus neutral trials. These differences across misaligned conditions may have emerged because we controlled for potential confounds that may underlie the lack of interference in Bassok et al.’s (2008) misaligned trials: We included only commonly enumerated nouns in our study and delineated two distinct misalignment conditions. Even if our participants simply viewed intangible nouns (e.g., myths, tactics) as less readily enumerable than concrete nouns, this would not explain the observed facilitation. This explanation is testable using an intangible-noun-only condition (Misaligned Intangible-Intangible), which we did not include in the present study.
Another possibility is that participants engage in a rapid, efficient, strategic rejection in the MCI condition. Results suggest that participants automatically added in all conditions, but perhaps obligatory arithmetic assists performance on select trials. We deliberately designed the MCI condition to be maximally unsupportive of addition, assuming that combining concrete and intangible nouns is less plausible or logical than combining even misaligned concrete nouns. (For example, combining goats and phones may be more plausible than combining goats and tactics.) But we did not anticipate that this extreme semantic misalignment may trigger an expectation that the sum must be an incorrect response to such an extent that misalignment facilitates immediate recognition (and thus rejection) of the (improbable) sum, rather than suppressing obligatory arithmetic. In contrast, neutral targets (which are not sums, and thus are not obvious incorrect matches) require direct comparisons and thus longer RTs. This is what we found.
Additional evidence for strategic use of semantic misalignment comes from recent ERP studies on a different type of sentence verification task (Guthormsen et al., 2016). Participants in that study saw sentences describing addition (e.g., “Twelve bats plus two caves equals fourteen”) and responded whether the statement was “acceptable” (“Yes”/ “No” but deliberately undefined). ERP responses to the onset of the second noun (e.g. “caves”) in the sentence were examined for the presence of a P600 effect associated with encountering semantic anomalies. The authors found that some participants rejected semantically misaligned but mathematically correct statements (such as the prior example), and that others accepted such statements. Notably, the former group of participants responded to the second noun with a P600 effect, whereas the latter group did not. Although we did not observe individual differences on our Number Matching task (but did for the Sentence Verification task), if our participants strongly recognized a semantic anomaly between concrete and intangible nouns, this may facilitate correctly rejecting nonmatching sums, whereas the semantic anomaly between categorically misaligned concrete nouns was insufficient for this judgment. Seeing a sum in such anomalous circumstances may be “uncanny” enough that participants can quickly make the “wise” choice to reject it as a nonmatch. Further ERP research could clarify this explanation by investigating the presence and strength of P600 effects for MCC and MCI nouns in the absence of explicit addition.
Analogous facilitation effects may also explain our Sentence Verification Task findings. In this task, participants read a sentence that either did or did not implicate multiplication, and then judged if a prompt that followed the sentence was likely to be true or false. When multiplication was not implicated in the initial sentence (i.e., on nonimplicative trials), participants were slower to accept true prompts if the prompt contained products of the digits appearing in the sentence (compared to neutral digits), and they were much faster to correctly reject incorrect statements containing products versus neutral digits (Figures 5 and 7). Like the sums in the MCI Number Matching condition, viewing a product inhibited correctly accepting and facilitated correctly rejecting a prompt sentence when the semantic context of the target sentences did not support arithmetic. This pattern was not repeated in the implicative condition; RTs did not differ when correctly rejecting false statements regardless of whether statements contained a product.
Our findings additionally point to the importance of context beyond the semantic alignment of the nouns that accompany numbers. On the Sentence Verification trials where multiplication was not implicated, the same interaction was observed even on trials where the same unit appeared both in the cue and the target sentences. This suggests that participants react to the broader context of the cue sentence and not only to the semantic alignment of the nouns associated with the cue numbers.
Individual Differences
Are these alignment effects subject to individual differences? We found subtle evidence in the Sentence Verification task only, and equally subtle associations with arithmetic fluency scores. Fluency is ostensibly a measure of speed when correctly answering problems in a highly implicative context (an explicit arithmetic task), so it is intriguing that participants with higher math fluency scores made especially efficient use of information in nonimplicative contexts. This suggests that fluency may be partially a matter of choosing operations accurately. Products within prompts interacted with math and reading achievement in opposite directions: Participants with higher Math or lower Reading ACT scores responded more quickly to neutral prompts versus product prompts; there were no differences in RTs between neutral and product prompts for participants with low Math or high Reading ACT scores. Arithmetic products may be more salient among persons with higher math achievement, which may interfere with integrating mathematically ambiguous contextual cues. Conversely, lower reading achievement may slow integration of these components simply due to more labored comprehension. This cost of math achievement echoes findings that high-numeracy participants sometimes are more negatively influenced by numerical framing when making decisions (Peters et al., 2006). Our study adds to this finding by showing that reading achievement may index aspects of comprehending sentences with numbers, and that arithmetic fluency may provide special advantages in contexts that do not elicit arithmetic processing.
The theoretical bases for individual differences on the Sentence Verification task range from well-documented individual differences underlying numerosity judgments (e.g., Halberda, Ly, Wilmer, Naiman, & Germine, 2012) to individual differences in cognitive control and its effect on conflict adaptation (Hsu & Novick, 2016). It is not clear why we did not replicate individual differences in LeFevre interference (LeFevre & Kulak, 1994) on our Number Matching task, especially given that individual differences emerge in ERP responses to semantic alignment (Guthormsen et al., 2016). Perhaps subtle task differences linked to the presence or absence of an arithmetic operator as the fixation point affect detection of individual differences; Price, Mazzocco, and Ansari (2013) showed individual differences in the automaticity with which young adults respond to arithmetic computations, not just triplet sets of numbers. It is also possible that detecting individual differences may require more difficult tasks and/or more sensitive measures (such as ERP responses). Our findings on the Sentence Verification task illustrate the need to better understand the individual differences in semantic misalignment and contextual sensitivity that have been theorized to be important for word problem solving and the development of mathematical cognition (Martin & Bassok, 2005; Mazzocco, Chan, & Sera, 2016). Finally, a more diverse sample may elicit individual differences in our Number Matching task, clarifying the connections between the Number Matching and Sentence Verification tasks and their differing levels of contextual richness.
Conclusion
How do these findings apply to everyday situations wherein numbers appear in diverse arithmetic and nonarithmetic contexts? Bassok et al. (2008) showed that people do not compute when it is illogical to do so. We show this, too, but we also show that sometimes people do compute when it is illogical, and that this outcome may have costs or benefits depending on whether an arithmetic operation is implicated and appropriate. The lack of any context implicating arithmetic may itself be information that some individuals seem to exploit.