Conceptual Replication and Extension of the Relation Between the Number Line Estimation Task and Mathematical Competence Across Seven Studies

A recent meta-analysis demonstrated the overall correlation between the number line estimation (NLE) task and children’s mathematical competence was r = .44 (positively recoded), and this relation increased with age. The goal of the current study was to conceptually replicate and extend these results by further synthesizing this correlation utilizing studies not present in the meta-analysis. Across seven studies, 954 participants, ranging from 3 to 11 years old (Age M = 6.02 years, SD = 1.57), the overall estimation-competence correlations were similar to those of the meta-analysis and ranged from r = −.40 to −.35. The current conceptual replication demonstrated that the meta-analysis captured a stable overall relation between performance on the NLE task and mathematical competence. However, the current study failed to replicate the same moderation of age group presented in the meta-analysis. Furthermore, the current study extended results by assessing the stability and predictive validity of the NLE task while controlling for covariates. Results suggested that the NLE task demonstrated poor stability and predictive validity in the seven samples present in this study. Thus, although concurrent relations replicated, the differential age moderation, lack of stability, and lack of predictive validity in these studies require a more nuanced approach to understanding the utility of the NLE task. Future research should focus on understanding the connection between children’s developmental progression and NLE measurement before further investigating the predictive and diagnostic importance of the task for broader mathematical competence.


The Number Line Estimation Task
The number line estimation (NLE) task is a tool that has been widely used to assess children's numerical magnitude abilities and mathematical cognition (e.g., Fazio, Bailey, Thompson, & Siegler, 2014; Fuchs, Geary, Fuchs, Compton, & Hamlett, 2014; LeFevre et al., 2013; Lyons, Price, Vaessen, Blomert, & Ansari, 2014). Generally, this task presents children with an empty line where only the starting point and ending point are labeled (e.g., 0 on the far left of the line and 10 on the far right), and children are asked to mark where a given number goes on the number line. Multiple studies across a variety of disciplines have used this task as a measure of early mathematical understanding. Many of these studies vary, however, in the presentation of the task depending on characteristics of the children, such as child age (e.g., using nonsymbolic or symbolic as the number type, or using 1-10 or 0-100 as end points) (Berteletti, Lucangeli, Piazza, Dehaene, & Zorzi, 2010; Booth & Siegler, 2006; Siegler & Booth, 2004; Thompson & Siegler, 2010). Regardless of presentation, moderate associations between the performance on the NLE task and a wide range of mathematical competence measures have been found across multiple studies (Siegler, 2016). Thus, this association is considered a robust finding and has important theoretical and practical implications in the study of mathematical cognition development and achievement.
Evidence for the association between performance on the NLE task and mathematical competence was strengthened by a recent meta-analysis (Schneider et al., 2018). Schneider and colleagues (2018) reverse coded the effect sizes present in the literature such that children who were more precise at placing numbers on the number line also showed higher scores on mathematical competence measures. The authors found that across 263 effect sizes, the correlation between the NLE task and broad mathematical competence was r = .44 [95% CI: 0.406, 0.480]. This relation increased with age such that the relation was r = .30 for children younger than 6 years of age, r = .44 for children 6-9 years of age, and r = .50 for children older than 9 years of age. Furthermore, the overall relation remained stable across task variants (e.g., number type, presentation medium, and number range) and mathematical measures, suggesting the NLE task is a robust correlate of mathematical competence.

The Present Study
The current study adds to the literature by leveraging data from seven independent and diverse studies not included in the previous meta-analysis (Schneider et al., 2018), to further examine and conceptually replicate the relation between the NLE task and mathematical competence. Specifically, this study aims to 1) replicate the effect sizes from the Schneider et al. (2018) meta-analysis, 2) replicate the age moderation, and lack of other moderations in the estimation-competence relation present in Schneider et al. (2018), and 3) extend their results by examining the stability and predictive validity of the NLE task while controlling for covariates.
Based on Schneider et al. (2018), our hypotheses are threefold: first, consistent with Schneider et al. (2018), we expect the effect size for the relation between the NLE task and mathematical competence to be significantly greater than zero, and to be moderate in size according to Cohen (1992). Second, also consistent with Schneider et al. (2018), we expect this relation to increase with age. Specifically, we expect weak to moderate (r = .10 to .30) relations between number line performance and math achievement for children under 6 years of age (r = .30; Schneider et al., 2018). However, in children above 6 years of age, we expect a moderate relation (r = .30 to .50) between number line performance and math achievement (r = .44; Schneider et al., 2018). Finally, number type, presentation medium, and number range are not expected to be significant moderators of the relation between NLE and mathematical competence. In addition to testing these replication hypotheses, we also performed exploratory analyses that extend the Schneider et al. (2018) study to investigate the stability and predictive validity of the NLE task while controlling for covariates. Thus, as these are exploratory, we do not have explicit hypotheses.

Method
Six studies for this conceptual replication were conducted in Midwestern cities in the United States of America, and one study was conducted in Chile. Studies were included in the analyses if they had administered an NLE task at least once. There were no data exclusion or outlier protocols. All study samples were collected between 2012 and 2020, whereas Schneider et al. (2018) included studies between 2006 and 2018. Overall, the total sample (number of children who completed the NLE task at least once) across all seven studies consisted of N = 954 children ranging from 3 to 11 years of age (M age = 6.02 years, SD = 1.57).
For full descriptive information, see Table 1. Across all samples, 60% were younger than 6 years of age, 35% were between 6-9 years of age, and 5% were older than 9 years of age. Only 13% of the NLE tasks were nonsymbolic, whereas the rest (87%) were symbolic. Unlike Schneider et al. (2018), no number line tasks used in the current study utilized fractions as a number type. The number ranges used for the task were 10 (13%), 20 (60%), 100 (10%), or 1,000 (17%). All samples used the number to position approach with a bounded number line, and 62% of the sample completed the paper and pencil version, whereas 38% completed the task on a tablet.
Furthermore, all studies included at least one measure of mathematical competence beyond the NLE task. These measures included standardized mathematical achievement tests (Woodcock-Johnson Applied Problems subtest [WJAP; Woodcock, McGrew, & Mather, 2001], 69% of total sample; Preschool Early Numeracy Screener-Brief Version [PENS-B], 23% of total sample) or nonsymbolic number sense (Panamath; 45% of total sample). Forty-six percent of the sample were included in a longitudinal design and thus also completed a second administration of the NLE task.

Study 1 Participants
This study (Pathways) is an extension of a larger longitudinal project exploring the effects of schooling on executive functions. Data were collected on children attending four local elementary schools in Midwestern U.S. cities. The total sample includes three waves of children who were followed from kindergarten to second grade. Thus, there are three cohorts of children in the total sample. Data from this dataset were previously published (Ahmed, Grammer, & Morrison, 2021; Ellis et al., 2021); however, this publication is the only manuscript from this dataset to date that has examined research questions using the NLE task.
Schools included in this sample served children with a range of socioeconomic backgrounds based on school percentages of free and reduced-price lunch. Children were 4 to 6 years old (N = 277, M age = 5.71 years, SD = 0.40, 53% male) and in kindergarten at the beginning of testing. Participants were primarily English speakers and had no known developmental disorders. Each child was assessed on a battery of executive function and achievement measures by a trained examiner. Around one year later, participants were assessed on the same measures (n = 175).

Number Line Estimation Task
All children completed a modified version of the NLE task. Children were presented with a piece of paper and pencil, and a line labeled 0 and 20 at the left and right ends, respectively. Their task was to draw a hash mark on the line indicating their best guess at where a given random integer between 0-20 fell on the line. The experimenter held one of three flipbooks that presented the child with the number they were to place on the line. Each flipbook contained the same randomly chosen 10 integers (1, 4, 6, 8, 9, 10, 13, 16, 17, and 19) in different orders. A new, unmarked line was used in each trial so that only one hash mark was placed on each line. Before testing, participants completed three practice trials on a shorter line labeled 0-10 to ensure an understanding of the instructions. The experimenter did not provide any feedback about children's placement of the hash marks during the practice trials. Participants then completed 10 test trials on the 0-20 line without feedback. To assess children's numerical magnitude accuracy, their percentage of absolute error (PAE) scores were calculated following the procedure described by Siegler and Booth (2004).
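The PAE score described above can be sketched as follows. This is a minimal illustration, assuming the common formulation PAE = |estimate − target| / number range × 100, averaged across trials; the function name and the example estimates are hypothetical and are not taken from the study data.

```python
def percent_absolute_error(trials, line_min=0.0, line_max=20.0):
    """Mean percent absolute error (PAE) across (target, estimate) trials.

    Assumed formulation: 100 * |estimate - target| / (line_max - line_min),
    averaged over trials. Lower values indicate more accurate placement.
    """
    if not trials:
        raise ValueError("at least one trial is required")
    scale = line_max - line_min
    errors = [100.0 * abs(estimate - target) / scale
              for target, estimate in trials]
    return sum(errors) / len(errors)


# Hypothetical child on the 0-20 line: (target number, marked position).
example_trials = [(4, 6.0), (10, 9.0), (16, 13.0)]
example_pae = percent_absolute_error(example_trials)
print(f"PAE = {example_pae:.1f}%")
```

Because lower PAE means greater accuracy, a negative correlation between PAE and a mathematics score indicates that more accurate estimators tend to score higher.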

Mathematical Competence
Children completed the standardized English version of the Applied Problems subtest from the Woodcock-Johnson III Tests of Achievement (Woodcock et al., 2001). The Woodcock-Johnson III Tests of Achievement are standardized administrative tasks designed to provide information about a child's abilities compared to the national average. The Applied Problems subtest is a task in which children are presented with a set of questions to assess overall, broad mathematics abilities. Testing on this assessment was complete after six consecutive errors. This task is brief and can determine a wide range of mathematics abilities. It is also widely used in many nationally representative databases. There are a total of 60 possible items in the Applied Problems subtest and an individual's score is the sum of their correct responses.
Children also completed a computerized nonsymbolic numerical discrimination task, titled Panamath (Halberda, Mazzocco, & Feigenson, 2008). During each trial, participants saw an array of yellow dots on one side of a computer screen and an array of blue dots on the other side; the dots were not displayed long enough to enable counting. Children indicated by button-press whether they thought there were more blue or yellow dots. No feedback was given on this task. The number of dots displayed varied, and the duration of the array presentation was adjusted for the participant's age. A blank screen appeared following the dot presentation, which persisted until the child pushed a button to indicate their response. Participants completed 16 trials (task duration was approximately one minute), and overall performance on the measure was calculated via the percentage of trials answered correctly.

Study 2 Participants
This sample (School Instruction) is part of a larger study exploring the role of classroom instruction on early math skills. Data from this study have not been previously published. Kindergarten children (N = 92, M age = 5.55 years, SD = 0.34, 54% male) were recruited from four local elementary schools across 14 kindergarten classrooms in the greater southeast Michigan area. These four schools serve children with a range of socioeconomic backgrounds based on school percentages of free and reduced-price lunch. Participants received small gifts each time they participated, and parents and teachers received monetary compensation. Each child was assessed on a battery of math achievement measures by a trained examiner.

Number Line Estimation Task
Children completed the same version of the NLE task described in Study 1. Children's PAE scores were calculated based on the 10 test trials.

Mathematical Competence
Children completed the standardized mathematical achievement test (Woodcock-Johnson III Applied Problems subtest; Woodcock et al., 2001). The method for administering this test is described in Study 1.

Study 3 Participants
This sample (LENA) is part of a larger study exploring the role of the home environment on early numeracy skills. Data from the broader study were previously published in Susperreguy and Davis-Kean (2016); however, the NLE task was not included in the published article. These children were recruited from Midwestern U.S. cities. Children were 3-5 years old (N = 35, M age = 5.65 years, SD = 0.44, 69% male) and in preschool at the time of testing. Each child was assessed on a battery of math achievement measures by a trained examiner.

Number Line Estimation Task
Children were presented with a number line on paper that ranged from zero to twenty. Children were required to select the appropriate position of a number on a number line between 0 and 20. First, children were shown where both 0 and 20 go in a horizontal line with "0" below the left end and "20" below the right end on a sheet of paper. Then, all numbers from 1 to 19 were presented, one at a time, in a random order, and they were asked to estimate the position of each number on the line, one number per number line. To assess the accuracy of the child's estimates, each child's PAE was calculated.

Mathematical Competence
Children completed the standardized mathematical achievement test (Woodcock-Johnson III Applied Problems subtest; Woodcock et al., 2001). The method for administering this test is described in Study 1.

Study 4 Participants
This sample (Sharing Task) was part of a larger project examining the relation between social division and math achievement in children. Data from this study have not been previously published. Children's ages ranged from 3 to 14 years old when they participated in the study (N = 62, M age = 6.92 years, SD = 2.09, 46% male). Children were recruited from a local museum and library in a Midwestern U.S. city. Each child was briefly assessed on a battery of math achievement measures by a trained examiner. Of interest in this study, children completed a version of the NLE task and Panamath (Halberda et al., 2008). Because children tested in local libraries and museums were free to end the testing at any time, some children completed only a portion of the tasks.

Number Line Estimation Task
Children completed the same version of the NLE task described in Study 1. Children's PAE scores were calculated based on the 10 test trials.

Mathematical Competence
Children completed the nonsymbolic numerical discrimination task (Panamath; Halberda et al., 2008). The method for administering this test is described in Study 1.

Study 5 Participants
This study (Storybook) recruited participants from a larger longitudinal project exploring a storybook intervention on children's mathematical skills. Data from this study have not been previously published. The data included in the current study are the pretest data collected prior to random assignment to the intervention. Children were recruited from local preschools in Midwestern U.S. cities. Children were 3 to 5 years old (N = 101, M age = 4.31 years, SD = 0.63, 49% male) and in preschool at testing. All children had no known developmental disorders and were English speaking. Children were assessed on a battery of achievement measures by trained research assistants.

Number Line Estimation Task
Children completed an iPad version of the NLE task ranging from 0-10 (https://hume.ca/ix/estimationline.html). The task presented children with a bounded number line and a number and they were asked to drag a hash mark to mark where they believed that number accurately belonged on the line. Children were asked to perform three practice trials (drag the line to 1, 4, and 9), and then they were presented with nine integers presented in a random order (1,2,3,4,5,6,7,8,9). Children's scores were calculated as the PAE across all non-practice trials.

Mathematical Competence
Children completed the PENS-B (Purpura, Reid, Eiland, & Baroody, 2015). The PENS-B is a 25-item early numeracy skill assessment that examines key mathematical domains identified in early preschool and kindergarten children. Test items assess children's counting skills, numerical relations, arithmetic operations, and numeral knowledge.

Study 6 Participants
This sample (Chile) is part of a more extensive cross-sequential study looking at the early predictors of math skills in Chilean children. Data from the broader study were previously published in del Río et al. (2020); however, the NLE task was not included in the published article. The children (N = 263, M age = 7.88 years, SD = 0.93, 53% male, all Spanish speaking) were recruited across grade 1, grade 2, and grade 3 from five schools in Santiago, Chile, targeting both low and high socioeconomic status families. In the first wave of the study, children were assessed on a battery of early cognitive, linguistic, and numerical skills and math achievement measures by a trained examiner. Around one year later, participants were assessed on the same measures (n = 159).

Mathematical Competence
Children completed the Spanish version of the Woodcock-Johnson III Applied Problems subtest (Batería III Woodcock-Muñoz; Muñoz-Sandoval, Woodcock, McGrew, & Mather, 2005). This task was administered similarly to that described in Study 1.

Study 7 Participants
This sample (ANS) is part of a larger study on children's mathematical, executive function (EF), and literacy development during preschool. Data from the broader study were previously published in Purpura and Logan (2015) and Purpura and Simms (2018); however, the NLE task was not included in these published articles. Preschool children (N = 124, M age = 4.18 years, SD = 0.58, 54% male) were recruited from 12 different preschools in Midwestern U.S. cities. Families from a broad range of socioeconomic status were recruited (36% < 4-year college degree, 21.6% 4-year college degree, 42.4% > 4-year college degree). Each child was assessed on a battery of math achievement measures by a trained examiner in the fall of preschool. During the spring of the same year, participants were assessed on the same measures (n = 114).

Number Line Estimation Task
Children completed a modified paper-and-pencil version of the NLE task designed to be a non-verbal number line task where, rather than being presented with numbers, children were presented with sets of dots (1-10). The task also included modifications according to Reid et al. (2015). For example, the set to be represented was presented on a flashcard instead of above the middle of the number line, the number line had a marker at 0, 1, and 10, and the example of a rabbit hopping was provided to ensure that children understood the task. Therefore, this number line task included three benchmarks as a bounded condition.

Mathematical Competence
Children completed the nonsymbolic numerical discrimination task (Panamath) and the standardized numeracy skill assessment (PENS-B). The methods for administering these tests are described in Study 4 and Study 5, respectively.

Analytic Approach
Data analyses were run using the psych package (version 2.0.7) in RStudio (version 1.1.456). Syntax and a simulated dataset are available at https://osf.io/qswav/. Using the seven different studies, we examined the overall correlation between children's performance on a given version of the NLE task and their mathematical achievement.
Further, we tested if different aspects of the NLE and mathematical competence measures moderated any correlations using a z test (Soper, 2020). These possible moderators included age group (below 6, 6-9, above 9), number range (0-10, 0-20, 0-100, or 0-1,000), presentation medium (computer or paper), and number type (symbolic or nonsymbolic). Although Schneider et al. (2018) assessed multiple moderators in the relation between performance on the NLE task and mathematical competence, the samples in this manuscript did not include all of the moderators included in their analyses. For example, Schneider et al. (2018) used number line type (bounded or unbounded), task type (position to number, or number to position), and index of NLE proficiency (PAE, estimate deviation, or linear R²). However, all samples in this manuscript used bounded number lines, the number-to-position task type, and PAE as an index of NLE proficiency. Thus, these moderators were excluded from analyses.
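A z test for the difference between two correlations from independent groups can be sketched with the Fisher r-to-z transformation. This is an illustrative sketch only: it assumes independent samples and a two-tailed test, the function name is hypothetical, and the group sizes below are placeholders, not the sample sizes from this study. The two correlations are the computer and paper values reported for the Applied Problems subtest.

```python
import math
from statistics import NormalDist


def compare_independent_correlations(r1, n1, r2, n2):
    """Fisher r-to-z test for two correlations from independent samples.

    Returns (z, two-tailed p). Assumes n1 > 3 and n2 > 3.
    """
    z1 = math.atanh(r1)  # Fisher transformation of each correlation
    z2 = math.atanh(r2)
    se = math.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))
    z = (z1 - z2) / se
    p = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    return z, p


# Correlations from the text (computer vs. paper); the n values are
# hypothetical placeholders chosen only to make the example run.
z_stat, p_value = compare_independent_correlations(0.06, 200, -0.40, 300)
print(f"z = {z_stat:.2f}, p = {p_value:.4f}")
```

A large absolute z (small p) indicates that the correlation differs between the two levels of the moderator.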
Finally, exploratory multiple linear regressions were conducted to extend the results from Schneider et al. (2018) by examining the stability and predictive validity of the NLE task. These regressions tested whether the NLE task was stable across time, stable while controlling for other mathematical competence measures, and whether the NLE task predicted other mathematical measures while controlling for initial mathematical scores. Schneider et al. (2018) examined the predictive validity of the NLE task for later competence; however, the authors were unable to control for other key variables (e.g., participant age) given the meta-analytic data. Therefore, we extend these results by focusing on the stability and predictive validity of the task while controlling for covariates.
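The logic of a stability regression with covariates can be illustrated with a small ordinary least squares sketch. Everything here is an assumption made for illustration: the data are synthetic and noise-free, the variable names are hypothetical, and the solver is a plain normal-equations routine rather than the R routines used for the reported models.

```python
def fit_ols(rows, outcome):
    """Ordinary least squares via the normal equations (X'X) b = X'y.

    rows: list of predictor rows, each starting with 1.0 for the intercept.
    Solved by Gauss-Jordan elimination with partial pivoting.
    """
    k = len(rows[0])
    # Augmented matrix [X'X | X'y]
    a = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
         + [sum(r[i] * y for r, y in zip(rows, outcome))]
         for i in range(k)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        p = a[col][col]
        a[col] = [v / p for v in a[col]]
        for r in range(k):
            if r != col:
                f = a[r][col]
                a[r] = [vr - f * vc for vr, vc in zip(a[r], a[col])]
    return [a[i][k] for i in range(k)]


# Synthetic toy data (NOT from the study): intercept, Time 1 PAE,
# age in years, sex coded 0/1.
rows = [
    [1.0, 30.0, 5.0, 0.0],
    [1.0, 18.0, 5.5, 1.0],
    [1.0, 24.0, 6.0, 0.0],
    [1.0, 27.0, 5.2, 1.0],
    [1.0, 15.0, 6.5, 0.0],
    [1.0, 21.0, 6.1, 1.0],
]
# Time 2 PAE generated from known coefficients so the fit can be checked.
time2_pae = [40.0 + 0.1 * r[1] - 4.0 * r[2] + 1.0 * r[3] for r in rows]
coefs = fit_ols(rows, time2_pae)
print([round(c, 3) for c in coefs])
```

In this framing, the coefficient on Time 1 PAE is the stability estimate after age and sex are held constant; a coefficient near zero would indicate that earlier NLE performance adds little beyond the covariates.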

Sensitivity Power Analyses
The seven studies used in this manuscript are existing datasets; thus, a sensitivity power analysis was used to calculate the range of minimally detectable effect sizes (MDES) given the sample sizes across the proposed correlations (Cribbie, Beribisky, & Alter, 2019; Giner-Sorolla et al., 2019). Across all proposed moderators, the largest sample size was n = 634 and the smallest sample size was n = 71. G*Power was used to run sensitivity power analyses across this range of sample sizes (Faul, Erdfelder, Lang, & Buchner, 2007). For a bivariate correlation with 634 participants, α = 0.05, and power (1−β) = 0.80, the sensitivity power analysis suggested that the MDES was 0.11 (Faul et al., 2007). For the smallest sample size, a bivariate correlation with 71 participants, α = 0.05, and power (1−β) = 0.80, the sensitivity power analysis suggested that the MDES was 0.29 (Faul et al., 2007). Thus, our sensitivity power analyses suggest that our dataset is powered to detect effect sizes as low as 0.11 for our largest sample and as low as 0.29 for our smallest sample.
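The idea behind a sensitivity analysis for a correlation can be approximated with the Fisher z transformation. This sketch is an assumption-laden stand-in, not the routine G*Power implements: G*Power uses an exact method, and the tail setting used for the reported values is not stated here, so results from this approximation may differ somewhat from the reported MDES values. The function name and defaults are hypothetical.

```python
import math
from statistics import NormalDist


def mdes_correlation(n, alpha=0.05, power=0.80, two_tailed=True):
    """Approximate minimally detectable correlation for sample size n.

    Uses the Fisher z approximation:
        z_r = (z_crit + z_power) / sqrt(n - 3),  r = tanh(z_r)
    """
    norm = NormalDist()
    tail_alpha = alpha / 2.0 if two_tailed else alpha
    z_crit = norm.inv_cdf(1.0 - tail_alpha)
    z_power = norm.inv_cdf(power)
    return math.tanh((z_crit + z_power) / math.sqrt(n - 3))


# Largest and smallest sample sizes named in the text.
for n in (634, 71):
    print(n, round(mdes_correlation(n), 3))
```

The smaller the sample, the larger the correlation must be before it can be reliably detected, which is why the moderator subgroups with few children can only detect comparatively large effects.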

Overall Estimation-Competence Relation
The overall effect sizes and effect sizes by moderator variables are listed in Table 2. Although Schneider et al. (2018) recoded all effect sizes such that a positive sign indicated higher scores on the NLE task were associated with higher mathematical competence, we chose to report our effect sizes true to the literature such that lower scores on the NLE task (PAE) were associated with higher mathematical competence. The overall correlation between the NLE task and mathematical competence ranged from r = -.40 to -.35 (95% CIs ranged from −0.51 to −0.26). The 95% confidence intervals did not include zero across all mathematical competence measures, suggesting the relation was statistically significant. Thus, these results support the first replication hypothesis that the NLE task is significantly associated with mathematical competence.
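One common way to obtain a 95% confidence interval for a correlation is the Fisher z method, sketched below. This is an illustration under assumptions: the text does not state how the reported intervals were computed, the function name is hypothetical, and the sample size used in the example is a placeholder rather than a value from Table 2.

```python
import math
from statistics import NormalDist


def correlation_ci(r, n, level=0.95):
    """Confidence interval for a Pearson correlation via Fisher's z.

    Transforms r to z, builds a normal-theory interval with standard
    error 1 / sqrt(n - 3), and back-transforms with tanh.
    """
    z = math.atanh(r)
    se = 1.0 / math.sqrt(n - 3)
    crit = NormalDist().inv_cdf(1.0 - (1.0 - level) / 2.0)
    return math.tanh(z - crit * se), math.tanh(z + crit * se)


# r = -.40 appears in the text; n = 300 is a hypothetical placeholder.
lower, upper = correlation_ci(-0.40, 300)
print(f"95% CI: [{lower:.2f}, {upper:.2f}]")
```

An interval that lies entirely below zero, as in this example, corresponds to a statistically significant negative correlation at the chosen level.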

Measure of Mathematical Competence
The tests of moderation for the measures of mathematical competence were not found to be statistically significant for the NLE relation (see Table 3 for all z statistics). The overall relation between NLE and mathematical competence measure ranged from r = -.40 to -.35.

Age
Due to our small sample size for children in the above 9 age group (n = 35), we were unable to test conceptual replication results for this group from the Schneider et al. (2018) meta-analysis. The test of moderation for the correlation between the NLE task and mathematical competence was found to be statistically significant for the participants' age group on the Applied Problems subtest measure (z[595] = −3.86, p < .001; see Table 3). However, this moderation by age group was not as hypothesized based on the Schneider et al. (2018) meta-analysis. The correlations were highest among children younger than 6 years of age, and lower for children aged 6-9 years. Interestingly, the test of moderation for the estimation-competence relation was not found to be statistically significant by age group for the other competence measure (Panamath; z[393] = −0.74, p = .46). In sum, the estimation-competence relation demonstrated an age moderation for the Applied Problems subtest measure, but not the Panamath measure.

Number Type
The tests of moderation for the type of numbers presented to children were also not found to be statistically significant for the correlation between performance on the NLE task and mathematical competence (see Table 3). In line with our third hypothesis, relations among mathematical competence and performance on the NLE task were similar across both symbolic and nonsymbolic number types.

Presentation Medium
The presentation medium moderation test was found to be statistically significant for the estimation-competence relation for children's performance on the Woodcock-Johnson III Applied Problems subtest (z[515] = 5.47, p < .001). However, presentation medium did not significantly moderate the estimation-competence relation for the PENS-B measure. The relation between computer or paper performance on the NLE task and the Woodcock-Johnson III Applied Problems subtest differed in effect size r = .06 and -.40, respectively. However, the estimation-PENS-B relation was similar across computer and paper presentation mediums, r(PENS-B) = -.260 and -.222, respectively.

Number Range
The correlation between performance on the NLE task and the Applied Problems subtest was found to be significantly moderated by the number range that was presented (z[537] = 2.13, p = .033). However, all other comparisons were not significant. Children who completed the 0-10 number line and the 0-20 number line demonstrated similar effect sizes for the estimation-competence relation, r(Panamath) = -.262 and -.267, respectively. Further, the effect sizes of the relation between performance on the NLE task and the Woodcock-Johnson III Applied Problems subtest were also similar across 0-20, 0-100, and 0-1,000 number ranges (r = -.40, -.51, and -.55 respectively).

Extension
To further assess the stability and validity of the NLE task, we also included extension analyses beyond the replication of the study by Schneider and colleagues (2018). Multiple linear regressions were used to examine whether the NLE task was stable over time while controlling for other mathematical measures and whether the NLE task predicted other mathematical measures while controlling for previous time points.

Number Line Estimation Stability
Results from regression analyses assessing the stability of the NLE task are presented in Table 4. Both age and sex were included as covariates. Although the correlation between NLE performance at Time 1 and Time 2 was small to moderate (r = .24), when using a regression and controlling for age and sex, children's NLE performance at Time 1 was not a statistically significant predictor of their NLE performance at Time 2 (Model 1 β(SE) = 0.06 (0.04), p = .187). However, when analyzed as separate studies (Study 1, 6, and 7 in  Table A1 in the Appendix for more details. When controlling for previous mathematical competence, performance on the NLE task at Time 1

Number Line Estimation Predictive Validity
Results from the multiple linear regressions are presented in Table 5 by mathematical competence measure. In all regressions, children's age and sex were included as covariates.

Discussion
The goal of the current study was to conceptually replicate correlational results from a meta-analysis examining the relation between NLE and mathematical competence (Schneider et al., 2018). Results using seven diverse and independent studies demonstrated mixed results. The moderate correlation between the NLE task and mathematical competence replicated. Further, consistent with Schneider et al. (2018), the correlation between the two constructs was moderated by age group, but not in the same direction. However, inconsistent with Schneider et al. (2018), presentation medium and number range also moderated this correlation. Unique to this study was a set of analyses that extended the results of Schneider et al. (2018) by examining the stability and predictive validity of the NLE task while controlling for participant age and sex. Interestingly, these results found that the NLE task was not stable across time, nor was it predictive of later mathematical competence.

Replication
The new findings from the seven independent studies replicated the Schneider et al. (2018) effect sizes and supported the first hypothesis that performance on the NLE task is associated with mathematical competence. In the original meta-analysis, the overall estimation-competence effect size was r = .44 (r = -.44 before recoding). Across the seven studies included in this replication, the strength of the association ranged from r = -.40 to -.35, and similar to Schneider et al. (2018), the strength of the association remained stable across mathematical competence measures. Estimation-competence stability across the mathematical competence measures could have several explanations. First, the mathematical competence measures are highly correlated. Although one of the mathematical competence measures was Panamath, which is thought to measure a more innate, non-symbolic numerical processing (Halberda et al., 2008), and the Applied Problems and PENS-B measures assess more symbolic, non-innate mathematical skills, they are all mathematical competence measures that are part of the same mathematical construct. Relatedly, the consistency in the NLE-competence relations across mathematical measures could point to the interdependence of these developmental pathways among two foundational processes (non-symbolic and symbolic). Lau and colleagues (2021) found that earlier symbolic number ability was consistently the strongest predictor of approximate number ability. Therefore, although separate constructs, these two processes may work together to better refine individual mathematical skills. Thus, consistency across this relation, regardless of competence measure, may reflect refinement among these processes in the development of early numerical skills.
Results also supported part of our second hypothesis, such that age group moderated the estimation-competence association. The age moderation replicated results from Schneider et al. (2018); however, due to our small sample size of children above 9 years of age, we were unable to assess this third age group that was present in the original meta-analysis. Interestingly, the two age groups we were able to conceptually replicate demonstrated inconsistencies with Schneider et al. (2018), such that children's age group did not moderate the relation in the same way. Our results suggested that the estimation-competence relation was stronger for children in the below 6 years of age group than for children in the 6 to 9 years of age group for both math competence measures available. Schneider et al. (2018), however, found an increasing estimation-competence relation as age group increased. One possible explanation for this could be that children in the below 6 and 6 to 9 age groups did not receive drastically different number ranges in the current replication studies. Instead, these two age groupings received similar number ranges (e.g., 0 to 20) that may be simpler for the 6 to 9-year-olds than those for children below 6. Thus, the number range was better suited for the lower age range and skill level present in the current sample.
Inconsistent with our third hypothesis and with findings from Schneider et al. (2018), our results also revealed that presentation medium and number range moderated the estimation-competence relation, but only for one mathematical competence measure, the Applied Problems subtest. For the Panamath measure, the estimation-competence effect size was closer to what we hypothesized across presentation mediums and number ranges. This divergent pattern of findings leads us to believe that, perhaps in these samples, the moderating variables were not mutually exclusive and were dependent upon the task variations presented to children. Specifically, the presentation medium and number range samples for the Applied Problems subtest were distinct for both Study 1 and Study 6, such that in Study 6, children were presented with more challenging number line ranges (0-100, 0-1,000) on a computer. In contrast, in Study 1, children were presented with a more straightforward number line range for their age group (0-20) with paper and pencil. Thus, these results may be a function of our samples' restrictions rather than of the moderators themselves.
Although complicated, these replication results highlight the need for future work to focus on variation in the NLE task. In their meta-analysis, Schneider et al. (2018) hypothesized that NLE tasks that require fraction estimation strategies would allow for more fine-grained assessments of mathematical knowledge because fraction estimation strategies tend to be more complex than whole number estimation strategies (Rinne, Ye, & Jordan, 2017; Schneider & Siegler, 2010). Their results supported this hypothesis, as the estimation-competence correlation was higher for fractions than it was for whole numbers (Schneider et al., 2018). Our replication results, however, demonstrate a different effect regarding the range of whole numbers presented to children during the NLE task. At some age, the number range 0 to 20 becomes too easy for children, and the estimation-competence relation decreases. Therefore, how NLE measurement should progress developmentally to match particular age ranges remains unclear. As the number line task currently exists, it is very difficult to disentangle the effects of age, number range, and other factors from the larger "numerical magnitude" skill that this assessment should measure. It could be that the way in which the NLE task is scored (percent absolute error, PAE) is not the most accurate way of measuring numerical magnitude across all developmental age groups, as this approach does not consider different strategy use, knowledge of numbers, and so on (Xu, Burr, Douglas, Susperreguy, & LeFevre, 2021). Taken together, the inconsistency between our replication and the original meta-analysis suggests it is imperative that our field works to fit the assessment (i.e., number range) to the participant (i.e., age), and further examines the best way in which to score performance on the NLE task for young children.
In sum, our replication efforts have emphasized the importance of utilizing a developmentally appropriate measure to capture numerical magnitude skills.
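To make the scoring issue concrete, the following is a minimal sketch of PAE as it is commonly defined in the NLE literature: the absolute distance between the child's mark and the target, divided by the length of the number line, expressed as a percentage. The function names and the example trials are illustrative assumptions, not materials or data from the studies reported here.

```python
def percent_absolute_error(estimate, target, range_min, range_max):
    """PAE for one trial: 100 * |estimate - target| / (range_max - range_min).

    Lower values mean more accurate estimates, which is why raw
    estimation-competence correlations are negative before recoding.
    """
    scale = range_max - range_min
    if scale <= 0:
        raise ValueError("range_max must be greater than range_min")
    return 100.0 * abs(estimate - target) / scale


def mean_pae(trials, range_min, range_max):
    """Average PAE over a list of (estimate, target) pairs for one child."""
    if not trials:
        raise ValueError("at least one trial is required")
    errors = [
        percent_absolute_error(est, tgt, range_min, range_max)
        for est, tgt in trials
    ]
    return sum(errors) / len(errors)


# Hypothetical trials: the same 3-unit miss on two different number lines.
pae_0_20 = percent_absolute_error(estimate=12, target=15, range_min=0, range_max=20)
pae_0_100 = percent_absolute_error(estimate=12, target=15, range_min=0, range_max=100)
print(pae_0_20, pae_0_100)
```

Because the error is scaled by the length of the line, the same 3-unit miss is a larger PAE on a 0-20 line than on a 0-100 line. The score also records nothing about the strategy a child used to place the mark, which is the limitation noted above.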

Extension
In the current replication effort, we also extended the Schneider et al. (2018) meta-analysis by conducting exploratory analyses to assess the stability and predictive validity of the NLE task. Schneider et al. (2018) included results on the predictive validity of the NLE task for later competence and found that it was a moderate predictor (see Table 1 in Schneider et al.). Across all measures of mathematical competence in this study, children's performance on the NLE task at an earlier time point did not predict their performance on the same task at a later time point when controlling for age and sex, or mathematical competence. However, when the studies were analyzed separately, this finding was inconsistent, such that the NLE task predicted itself in two of three individual studies. Interestingly, only one mathematical competence measure, the Applied Problems subtest, predicted children's estimation performance. Therefore, there was inconsistency not only in the stability of the NLE task, but also across the mathematical competence measures that predicted the task. These findings speak to the inconsistency of NLE results in the field, and further support our earlier argument that the numerical magnitude skill cannot be disentangled from other potential moderating variables when using this task.
Consistently, however, our results suggested that the NLE task did not predict any of our three mathematical competence measures while controlling for prior achievement. In each case, mathematical competence demonstrated strong stability across all competence measures, even while controlling for children's age at testing, sex, and performance on the number line task. These findings add to our previous results to suggest that the NLE task did not demonstrate strong predictive validity for other mathematical competence measures. Taken together, the inconsistency and lack of evidence in stability, the lack of evidence in predictive validity, and the changing relation with mathematical competence across ages call into question what, specifically, the NLE task measures.
Although both the stability and predictive validity of the NLE task were exploratory, our results are consistent with other recent studies that have assessed the reliability of the task. For example, multiple studies have shown low internal reliability for various versions of the NLE task (Hawes, Nosworthy, Archibald, & Ansari, 2019;Inglis & Gilmore, 2014;Kolkman, Kroesbergen, & Leseman, 2013). One study examined the stability of the NLE task in a sample with similar age groups and number line ranges as the current replication study (O'Connor, Morsanyi, & McCormack, 2019). Analyses revealed that children's performance on the NLE task was not correlated with later performance, demonstrating that the skills measured by this task may be unstable. Further results suggested the way in which children solve questions on the NLE task may qualitatively change over time (O'Connor et al., 2019). Thus, children's performance on the NLE task may reflect various other skills that develop over time, such as familiarity with numbers (Xu et al., 2021), or strategy use (Xu & LeFevre, 2016), not a single underlying numerical magnitude ability.

Conclusion
In sum, the current study successfully replicated the overall findings from the Schneider et al. (2018) meta-analysis. The strength of the estimation-competence association replicated, as did the finding that the age group of the child was important for the strength of this relation, although the relation did not increase with age. Furthermore, consistent with Schneider et al. (2018), the association remained stable across mathematical competence measures and number type. Inconsistent with Schneider et al. (2018), the correlation did not remain stable across presentation medium and number range. Exploratory analyses revealed that the NLE task did not demonstrate strong stability or predictive validity. Thus, our results generally replicated the correlation between the NLE task and mathematical competence found in Schneider et al. (2018). However, the current study also highlighted the instability and lack of unique variance provided by the NLE task across seven independent studies. Future research should first focus on understanding the connection between children's developmental progression and NLE measurement before further investigating the predictive and diagnostic importance of the task for broader mathematical competence.

Journal of Numerical Cognition (JNC) is an official journal of the Mathematical Cognition and Learning Society (MCLS).