What are the limits of our perception? This question has been a central focus of the scientific study of the mind since the investigation of thresholds began with psychophysics (Fechner, 1966/1860; Weber, 1978/1834). Although Signal Detection Theory (SDT) attempted to do away with the idea of thresholds entirely (e.g., Laming, 1973), the idea that at some point differences can be so small as to be imperceptible remains a prominent concept in modern psychological theory. Indeed, we imagine that all readers share with us the intuition that two clouds of dots can become so close in number that it is impossible for a human to judge which has more after just a brief glance (e.g., two orange trees, one with 200 oranges and the other with 201). Here, we investigate whether, contrary to these intuitions, we are in fact sensitive to very small differences, and whether the function governing perceptual discrimination performance is indeed smooth all the way down.

While the argument we will explore applies broadly to all magnitude discriminations, the literature on approximate number cognition is a particularly informative test case, as limits on our ability to judge number have been widely discussed. The ability to rapidly determine the numerically greater of two collections is attested across species and across human development (e.g., Feigenson et al., 2004). The core system thought to be responsible for representing non-symbolic numerical quantities is known as the Approximate Number System (ANS), and it may be linked to more advanced mathematical abilities across the lifespan (Halberda et al., 2008; Mazzocco et al., 2011; Park & Brannon, 2013; Starr et al., 2017; Wang et al., 2016, 2017; but see De Smedt et al., 2013; Norris & Castronovo, 2016; Qiu et al., 2021; Szűcs & Myers, 2017). ANS acuity is often measured using discrimination tasks, where two ensembles are presented, and the subject decides which contains more objects. With such comparisons, trial difficulty is determined by the ratio of the compared groups (Feigenson et al., 2004). Individuals differ in how precise their numerical estimates tend to be (Halberda et al., 2008; Halberda et al., 2012), and this has been accounted for by Weber’s law – a relationship between the ratio of the values to be discriminated and performance first described by Fechner in the earliest years of experimental psychology (Fechner, 1966/1860).

The idea that some differences are too small to perceive is prevalent throughout the literature on numerical cognition, as exemplified by the following quote: “The difference between eight and nine is not experienced at all, since eight and nine, like any higher successive numerical values, cannot be discriminated” (Carey, 2009, p. 295). The classification of small differences as imperceptible in numerical cognition is particularly noticeable in the developmental literature, where, due in part to practical limitations on experiment length, rather than modeling complete psychometric curves, experimenters often opt to test only a few ratios and determine which result in above-chance performance at their limited sample size. For instance, one of the earliest studies establishing numerical sensitivity in infants concluded that 6-month-olds require a 2:1 ratio in order to perceive a difference between large approximate collections (Xu & Spelke, 2000). In fact, the use of above versus at-chance performance as a distinguishing metric for performance is present widely across the number literature for a variety of populations and species (e.g., ratio 5:4 for mosquitofish, Agrillo et al., 2007; 1.7:1 for angelfish, Gómez-Laplaza & Gerlai, 2011; 5:4 for cotton-top tamarins, Hauser et al., 2003; 2:1 for newborn infants, Izard et al., 2009; 6:4 for red-backed salamanders, Uller et al., 2003). The implication of these results, either stated or unstated by the authors, is that ratios smaller than some critical value cannot be differentiated by that population under those conditions.

However, the conclusion that these populations are *insensitive* to smaller differences is inconsistent with modern models of psychophysics, such
as SDT. Instead, these models predict smoothly decreasing performance as a function
of the ratio of the comparison, with no drop to “chance performance” even on extremely
difficult ratios. There is a tension, then, between these two conventions – results claiming to reflect “limits” on number discrimination, and modeling approaches informed by SDT (e.g., Halberda et al., 2008; Piazza et al., 2010; Pica et al., 2004) – which coexist in the numerical cognition literature. Further, although success on such intuitively “impossible” trials is *theoretically* predicted, it has yet to be *empirically tested* with sample sizes large enough to detect a significant effect, at least in numerical cognition. Here we provide this empirical test.

This last point is of particular theoretical significance, as there is precedent for
revision of psychophysical theory in response to new data from edge cases: for example,
previous models of magnitude perception such as Weber’s law were eventually found
to be inadequate at extremes, such as very high and very low stimulus
intensities (McGill & Goldberg, 1968). And, in numerical discrimination tasks, it is conceivable that human participants
may employ behavioral strategies that lead to deviations from the predictions of SDT
(i.e., above-chance performance on even the most difficult comparisons). One such
strategy, which we here call the “Give Up” function, assumes that subjects cease to
respond on the basis of their representations when comparisons become especially difficult
– instead opting to respond randomly (for an example of results that appear consistent
with this strategy, see Halberda & Feigenson, 2008). Therefore, even with a thorough understanding and belief in the validity of modern
psychophysical laws, it is not a *given* that these laws will characterize behavior all the way to the limit (i.e., equality).

Here, we empirically test whether the psychophysical theories of magnitude comparison (using number as a case study) are indeed correct even in extreme cases, despite their inconsistency with the intuitive appeal of hard limits and the convention of using such limits to characterize the acuity of perceptual systems. That is, are we above chance at numerical discrimination, even for extremely difficult comparisons? Importantly, most of the comparisons that we presented were far more difficult than those typically employed in numerical discrimination experiments, with the most difficult ratio being 51:50 dots. Although many authors in numerical cognition may expect the contrary, either due to a belief in true perceptual limits or an intuition that subjects would give up on subjectively “impossible” comparisons, we hypothesized that, with enough trials, people would perform above chance on even the hardest ratios.

## Method

### Subjects

We ran batches until we had at least 100 subjects for each of four conditions. A total
of *N* = 412 people completed the experiment (*n* = 110 for stimulus set A with sequential presentation, *n* = 100 for stimulus set B with sequential presentation, *n* = 102 for stimulus set A with simultaneous presentation, and *n* = 100 for stimulus set B with simultaneous presentation; see Materials
for details of the stimulus sets). Subjects were undergraduate students enrolled in
a psychology course who received course credit for their online participation.

### Materials

On each trial, subjects saw two ensembles and indicated which of the two groups contained more dots. There was always a correct answer, so equal numbers of dots were never shown. Each stimulus image was displayed for 1000 ms. There were two conditions. Half of subjects saw the groups sequentially, at the center of the screen (all white dots against a gray background), with a 50 ms blank screen between ensemble presentations. The other half of subjects saw the groups simultaneously, with one group on the left (white dots) and one on the right (black dots). Following the stimuli, a response prompt (“Did the first/left or second/right image contain more dots?”) remained on the screen until a response was recorded on the subject’s keyboard.

Stimulus sets A and B were distinct sets of images generated using the same algorithm, which were both included solely for the purpose of ensuring generalizability of our results beyond one particular stimulus set. This means that there were essentially four “conditions” with approximately equal numbers of participants: stimulus set A with sequential presentation, stimulus set B with sequential presentation, stimulus set A with simultaneous presentation, and stimulus set B with simultaneous presentation.

Finally, during stimulus generation, we implemented non-numerical feature controls in an attempt to focus subjects on number. Surface area and convex hull ratios on each trial were approximately equated to that trial’s number ratio, with equal numbers of congruent and incongruent trials for each feature. However, we note that our broader argument – that subjects are capable of above-chance discrimination performance on extremely difficult ratios – would stand whether subjects in our task were relying on number, area, convex hull, or on any combination of features (i.e., the ratio of any feature on the 51:50 trials was equal to 51:50 – and far more difficult than the ratios typically used in number, area and convex hull experiments).

### Procedure

There were four blocks of trials, with participant-paced breaks in between each block.
Each block began with practice trials on easy ratios that scaffolded the participant
towards the difficult test ratio (Confidence Hysteresis; Odic et al., 2014). For every block, the practice trials consisted of 8 trials each of comparisons
of 30:20, then 40:30, then 50:40, such that there were 24 practice trials per block.
Following each round of practice trials, subjects responded to 56 trials of one of
the difficult ratios, with randomized block order: 21:20, 31:30, 41:40, or 51:50.
These ratios were selected because even the easiest (21:20) is much more difficult
than ratios typically employed in numerical comparison tasks (e.g., 2:1 – 10:9; Halberda & Feigenson, 2008), and the most difficult (51:50) is subjectively extremely difficult, yielding a
sense of a complete lack of confidence in pilot testing (see Figure 1). Trial order was randomized across participants within each block, and the order
in which stimuli were presented (left vs. right / 1st vs. 2nd) was counterbalanced, such that each option was the correct response on exactly half
of the trials in that condition.

##### Figure 1

Subjects were tested online on their own personal computers during COVID, so factors like total display size and luminance were not tightly controlled – but these should have only minor effects on performance, given the previously demonstrated reliability of internet-based psychological studies (e.g., Germine et al., 2010).

### Computational Models

We fit each subject’s response data with two models (see Figure 2). The first is a modification of a psychophysical model derived from SDT used extensively
in previous ANS research (e.g., Halberda et al., 2008; Piazza et al., 2010; Pica et al., 2004), which here we call the SDT Model. This model takes responses to trials with different
ratios and fits an internal Weber fraction (*w*) and guess rate (*g*). The Weber fraction corresponds to the rate of expansion of the standard deviation
of the ANS’s Gaussian representations and captures the amount of non-overlap between
the Gaussian representations of the compared numbers (e.g., Dehaene, 2003). A smaller *w* corresponds to more precise responding. The guess rate accounts for the amount that
subjects are blindly guessing on any trial regardless of ratio. An increased *g* results in a lower ceiling on performance and affects performance on other ratios
proportionately (i.e., it does not result in increased guessing specifically on difficult
ratios, as would be seen with giving up).

##### Figure 2

The probability (*p*) that a subject responds correctly (i.e., correctly identifies which of the two groups
has the larger number of dots) is a function of the ratio of the numerosity of the
compared ensembles,
$r=\frac{{n}_{larger}}{{n}_{smaller}}$, according to the following function:
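A standard formulation of this model – assuming linearly scaling Gaussian representations and a ratio-independent guess rate, with $\Phi$ the standard normal cumulative distribution function (notation is ours and may differ from the original) – is:

$$p(r) = (1-g)\,\Phi\!\left(\frac{r-1}{w\sqrt{1+r^{2}}}\right) + \frac{g}{2}$$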

As can be seen in the left panel of Figure 2, this curve is the top half of a sigmoid, and the inflection point of the sigmoid
would happen at *r* = 1 (when the two groups have equal numbers of dots). This means that the SDT model
predicts that performance will be above chance for all ratios until equality (*r* = 1), with sufficient sample size.
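This predicted accuracy function can be sketched in Python (a minimal implementation assuming the standard normal-CDF formulation; the authors’ exact parameterization may differ):

```python
import math

def sdt_accuracy(r, w, g):
    """Predicted P(correct) at ratio r = n_larger / n_smaller.

    Assumes Gaussian representations whose SDs scale linearly with the
    mean (Weber fraction w), plus a ratio-independent guess rate g that
    lowers the performance ceiling proportionately.
    """
    z = (r - 1.0) / (w * math.sqrt(1.0 + r ** 2))
    phi = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))  # standard normal CDF
    return (1.0 - g) * phi + g / 2.0
```

At *r* = 1 this returns exactly .5 (chance), and for every *r* > 1 it returns a value strictly above .5 – the smooth-all-the-way-down prediction.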

For the alternative model, which we are calling the Give Up Model, we devised a modification
of the SDT model in an attempt to capture the intuition that comparisons eventually
become so difficult as to be imperceptible – i.e., that they give rise to random guessing.
In our Give Up Model, if the ratio was more difficult than the Guess Boundary (i.e.,
*r* < 1 + *Guess Boundary*), the probability of a correct response was 50%. If the ratio was easier than the
Guess Boundary (i.e., *r* > 1 + *Guess Boundary*), the probability of a correct response was determined using the SDT Model for a
ratio of *r* = *r* – *Guess Boundary* (which allows the function to be continuous; see Figure 2, right). While the specific way in which participants may “blend” their responses
based on the signal (SDT) and their responses based on guesses (Give Up) near the
Guess Boundary has yet to be determined, this straightforward approach will allow
us to detect whether or not a Give Up function is necessary to account for human performance:
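Writing $b$ for the Guess Boundary and $p_{SDT}(r)$ for the SDT Model’s predicted accuracy at ratio $r$, one way to write this piecewise function (a sketch consistent with the description above; the published form may differ) is:

$$p(r) = \begin{cases} \dfrac{1}{2}, & r < 1 + b \\[4pt] p_{SDT}(r - b), & r \geq 1 + b \end{cases}$$

Because $p_{SDT}(1) = \frac{1}{2}$, the two branches meet at $r = 1 + b$, keeping the function continuous.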

## Results

### Exclusions

We excluded subjects based on average accuracy across the whole experiment. Overall,
subjects were correct on 62.0% of trials (*SD* = 4.4%). Subjects who performed more than three standard deviations below the mean
for accuracy (that is, below 48.7% across all trials) were excluded from the analysis.
This exclusion criterion resulted in the removal of 2 participants. As a result, 410
participants were included in analyses. Note that, because performance is expected
to vary by ratio, this criterion leaves plenty of room for observing chance performance
across the group on harder ratios.
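This exclusion rule can be sketched as follows (a minimal illustration; variable and function names are ours):

```python
import statistics

def accuracy_cutoff(overall_accuracies, n_sd=3):
    """Cutoff: mean minus n_sd standard deviations of subjects'
    overall accuracy across the whole experiment."""
    m = statistics.mean(overall_accuracies)
    s = statistics.stdev(overall_accuracies)
    return m - n_sd * s

def apply_exclusions(overall_accuracies, n_sd=3):
    """Keep only subjects at or above the cutoff."""
    cut = accuracy_cutoff(overall_accuracies, n_sd)
    return [a for a in overall_accuracies if a >= cut]
```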

### Performance Above Chance

Our main question of interest is whether subjects can perceive the difference between
groups at the most difficult numerical ratios. For each ratio, we calculated each
subject’s accuracy. Then we performed a series of planned one-sample *t*-tests comparing average performance with chance (50%), separately for each ratio.
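The planned comparison can be sketched with a stdlib-only *t* statistic (function names are ours; *p*-values would then come from the *t* distribution with *n* − 1 degrees of freedom, e.g., via a statistics package):

```python
import math
import statistics

def one_sample_t(accuracies, chance=0.5):
    """t = (mean - chance) / (SD / sqrt(n)) for H0: mean accuracy = chance."""
    n = len(accuracies)
    m = statistics.mean(accuracies)
    se = statistics.stdev(accuracies) / math.sqrt(n)
    return (m - chance) / se
```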

Subjects performed above chance for all ratios, all *p*s < .001 (see Figure 3). For the easiest comparison in the practice trials (30:20, ratio = 1.5), subjects
averaged 86.0% accuracy (*SD* = 10.8%). On the hardest comparison (51:50, ratio = 1.02), subjects averaged 51.3%
accuracy (*SD* = 6.4%), which was still significantly higher than chance, *t*(409) = 4.11, *p* < .001 (pink dot in Figure 3). Average performance increased smoothly as a function of ratio.

##### Figure 3

To evaluate whether performance varied by condition, we ran a 2-way (stimulus set
× presentation method) between-subjects ANOVA on overall accuracy. There was a marginal
difference in performance between subjects who saw stimulus set A (*M* = 62.3%, *SD* = 4.3%) and stimulus set B (*M* = 61.6%, *SD* = 4.6%), *F*(1,408) = 3.18, *p* = .075. There was a significant difference in performance by presentation method.
Subjects in the sequential presentation condition had higher accuracy (*M* = 63.3%, *SD* = 4.1%) than subjects in the simultaneous presentation condition (*M* = 60.5%, *SD* = 4.3%), *F*(1,408) = 44.47, *p* < .001. The interaction between stimulus set and presentation condition was not significant,
*p* = .309.

Because of the difference in performance between the presentation conditions, we evaluated
whether subjects were significantly above chance at each ratio separately for each
presentation condition. In the sequential presentation condition, subjects were significantly
above chance on every comparison including 51:50 (*M* = 52.0%, *SD* = 6.6%), *p*s < .001. Subjects in the simultaneous presentation condition were above chance on
every comparison, *p*s < .001, *except* for 51:50, which was not significantly different from chance (*M* = 50.5%, *SD* = 6.0%), *t*(201) = 1.28, *p* = .202. Note that the simultaneous presentation (1000 ms total) allowed exactly half
the viewing time of the sequential presentation (2000 ms total), and this,
rather than any lack of ability, may be the cause of the slightly lower percent correct
on these trials. More importantly, this deviation was not large enough in magnitude
for a Give Up model to be preferred even for responses in this condition (see below).

### Modeling Results

For each subject, we fit their responses with the two models using Maximum Likelihood Estimation, then evaluated which model provided a better fit to each subject’s data using the Bayesian Information Criterion (BIC; Neath & Cavanaugh, 2012). For 398 of the 410 subjects, the BIC value yielded by their SDT Model fit was lower than that yielded by the Give Up Model (see Figure 4). This pattern was similar regardless of stimulus presentation condition (sequential: 201/208 preferred SDT Model; simultaneous: 201/202 preferred SDT Model). Consistent with the better-than-chance performance we found on difficult ratios, the SDT model is a more parsimonious descriptor of the data for nearly all of our participants.
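The model-comparison step can be sketched as follows (a stdlib-only illustration; the accuracy function assumes the standard normal-CDF formulation described under Computational Models, and in practice *w* and *g* would be chosen by minimizing the negative log-likelihood with a numerical optimizer):

```python
import math

def sdt_accuracy(r, w, g):
    """Predicted P(correct) at ratio r (assumed normal-CDF formulation)."""
    z = (r - 1.0) / (w * math.sqrt(1.0 + r ** 2))
    phi = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return (1.0 - g) * phi + g / 2.0

def negative_log_likelihood(trials, w, g):
    """trials: list of (ratio, correct) pairs for one subject."""
    nll = 0.0
    for r, correct in trials:
        p = sdt_accuracy(r, w, g)
        nll -= math.log(p if correct else 1.0 - p)
    return nll

def bic(nll, k, n_trials):
    """BIC = k*ln(n) + 2*NLL; lower is better. k = 2 for the SDT Model
    (w, g) and k = 3 for the Give Up Model (w, g, Guess Boundary)."""
    return k * math.log(n_trials) + 2.0 * nll
```

With equal fit quality, the model with fewer parameters yields the lower BIC – which is why the SDT Model wins when the Guess Boundary parameter adds nothing.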

##### Figure 4

We also fit all 410 subjects’ data together to compare the two model
fits (see Figure 3 for the group-level model curves). When fitting the group data using the SDT Model,
we found the best fit with *w* = .157, which is consistent with other published values in the literature (e.g.,
Halberda & Feigenson, 2008), and *g* = .214. When we fit the group data with the Give Up model, we found *w* = .147, *g* = .231, and *Guess Boundary* = .004; this would indicate that subjects are expected to start guessing on a comparison
of 251:250 dots. Notably, the Give Up model fit results in a curve that almost completely
overlaps with the SDT model (see Figure 3). Thus, even when given the opportunity to implement a Guess Boundary at any ratio,
the model essentially opts not to. A comparison of the BIC values for the SDT Model
(BIC = 162,482) versus the Give Up model (BIC = 162,486.1) confirmed that the SDT
Model was the more parsimonious description of the data. The small difference in BIC
values between the SDT and Give Up models indicates that they obtain nearly identical
fits (as expected as the Guess Boundary parameter approaches 0), but the SDT model
is preferred because it has fewer parameters than the Give Up model.

## Discussion

The idea that some differences are too small to perceive has intuitive appeal. However, the data presented here suggest that humans are capable of far finer distinctions than this idea would imply. Although comparisons as large as an 8:7 ratio have been cited as the limit of approximate number perception for adults (e.g., Carey, 2009), we found that people are capable of making distinctions as fine as 50 vs. 51 dots at an above-chance rate.

Although this success may feel counterintuitive, it is in fact consistent with modern
models of psychophysics. For example, if representations of number are well-ordered,
then there will always be some region of activation for 51 that is greater in magnitude
than the activation for 50 (likewise for 101:100, etc.). If the number representations
are close (e.g., 51:50), just like any two similar signals in the mind, the region
of non-overlap, where 51 has a higher signal than 50, will be small, but it will be
represented. And it is this small region of non-overlap in representation that drives
the success we observed here – a small but significant improvement from chance. Nothing
changes as one progresses to more and more difficult numerical comparisons. The observer
is thinking the same thought (e.g., *who has more*), gathering the same evidence (e.g., *how many*) and making the same comparison. The only thing that *does* change is the observer’s epistemic limitation in making their decision (Halberda, 2016).
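This non-overlap argument can be made concrete with a quick computation, assuming Gaussian representations with standard deviations proportional to their means (scalar variability with Weber fraction *w*; function names are ours):

```python
import math

def p_larger_signal(n_big, n_small, w):
    """P that a sample from the n_big representation exceeds a sample from
    the n_small representation, for independent Gaussians with means n and
    standard deviations w * n."""
    mu = n_big - n_small
    sigma = math.hypot(w * n_big, w * n_small)  # SD of the difference
    return 0.5 * (1.0 + math.erf(mu / (sigma * math.sqrt(2.0))))
```

With the group-level *w* = .157 reported in the Results, this gives roughly .54 for 51 vs. 50 – small, but strictly above chance – and the same holds for 101 vs. 100, and so on.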

In Figure 3, performance at *each and every ratio* can be described by a single parameter – the Weber fraction (*w*) – which determines the amount of overlap between *any* set of numerosity representations. Recall that this argument applies whether subjects
were relying solely on a numerical representation, on area, convex hull or any combination
of magnitude representations.

Why do some previous studies report data consistent with “at chance” performance on difficult ratios (e.g., Hauser et al., 2003; Xu & Spelke, 2000)? As mentioned in the introduction, we hypothesize that one reason may be that these studies did not have sufficient data to find support for what would undoubtedly be small effects. Here, data from 410 subjects who completed 56 trials each at the most difficult ratios gave us ample evidence to distinguish 51.3% performance from chance.

What do the present results mean for our understanding of individual differences in
ANS precision? It has been typical in numerical cognition to use the point at which
performance transitions from above-chance to at-chance as a metric for distinguishing
between populations and species. But although such a description in terms of the minimum
ratio of discriminability is widespread across the numerical literature (see Mehlis et al., 2015), we argue that the data are better represented and explained by differences in *slope*, rather than differences in *limits*. Fortunately, psychophysical theories provide us with an easy way to quantify the slope
of a discrimination function.

In the SDT model, individual differences are explained via the Weber fraction parameter.
In such a model, while everyone would be near perfect with the easiest ratios (e.g.,
2:1), performance would only drop to *chance* upon reaching equality (i.e., a ratio of 1:1), regardless of differences in precision.
Some individuals may be more precise and show more rapid improvement with ratio (e.g.,
the darker blue curve in Figure 5, right) while others may lag behind – yet remain above chance (e.g., the lighter
blue curve in Figure 5, right). This behavior looks markedly different from one defined by *limits* to what differences one can perceive (e.g., the lighter versus darker green curves
in Figure 5, left). Although it may be tempting to simply categorize magnitude discrimination
behavior into success or failure, the truth is in fact much more continuous.